machine-learning,reinforcement-learning,sarsa
It's unfortunate that they've reused the variables s and a in two different scopes here, but yes, you adjust all e(s,a) values:

    for every state s in your state space
        for every action a in your action space
            update Q(s,a)
            update e(s,a)

Note what's happening here. e(s,a) is getting...
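As a minimal sketch of that double loop (assuming tabular Q and e arrays and hypothetical hyperparameters alpha, gamma, and lam, none of which come from the question), one SARSA(λ) step might look like this:

```python
import numpy as np

def sarsa_lambda_step(Q, e, s, a, r, s_next, a_next,
                      alpha=0.1, gamma=0.99, lam=0.9):
    """One SARSA(lambda) step: sweep every (state, action) pair.

    Q and e are (n_states, n_actions) arrays; (s, a) is the pair just
    taken, r the reward observed, (s_next, a_next) the pair chosen next.
    """
    delta = r + gamma * Q[s_next, a_next] - Q[s, a]  # TD error
    e[s, a] += 1.0                                   # accumulating trace

    # The double loop from the pseudocode: every state, every action.
    for si in range(Q.shape[0]):
        for ai in range(Q.shape[1]):
            Q[si, ai] += alpha * delta * e[si, ai]   # update Q(s,a)
            e[si, ai] *= gamma * lam                 # decay e(s,a)
    return Q, e

# Example call with a toy 5-state, 2-action table:
Q = np.zeros((5, 2))
e = np.zeros((5, 2))
Q, e = sarsa_lambda_step(Q, e, s=0, a=1, r=1.0, s_next=2, a_next=0)
```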
Why do we subtract Q(a,s)? r + γ·Q(a', s1) (where γ is the discount factor) is the value we observed on this run through for getting to state s by taking action a. In theory, this is the value that Q(a,s) should be set to. However, we won't always take the same action after getting to...
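To make the role of the subtraction concrete, here is a hedged sketch (tabular Q, with made-up learning rate alpha and discount gamma; none of these names come from the question): the difference between the target and the current estimate is the error we nudge Q by.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """Move Q[s, a] a fraction alpha of the way toward this run's target.

    target = r + gamma * Q[s_next, a_next] is what Q[s, a] "should" be
    according to this single experience; subtracting the current Q[s, a]
    gives the error, so repeated updates average over stochastic outcomes
    instead of overwriting the estimate with one noisy sample.
    """
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```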
machine-learning,reinforcement-learning,sarsa
Summary: your current approach is correct, except that you shouldn't restrict your output values to be between 0 and 1. This page has a great explanation, which I will summarize here. It doesn't specifically discuss SARSA, but I think everything it says should translate. The values in the results vector...
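As an illustrative sketch only (the question's network isn't shown; this assumes a small PyTorch model mapping a state vector to one Q-value per action, with hypothetical layer sizes), the point is simply that the output layer stays linear rather than being squashed into [0, 1]:

```python
import torch.nn as nn

# Hypothetical dimensions; the question's actual state/action sizes aren't given.
state_dim, n_actions = 4, 2

q_network = nn.Sequential(
    nn.Linear(state_dim, 32),
    nn.ReLU(),
    nn.Linear(32, n_actions),  # linear output: Q-values can be any real number
    # no nn.Sigmoid() here -- a [0, 1] squash would cap the Q-values the net can express
)
```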
machine-learning,reinforcement-learning,sarsa
I agree with you 100%. Failing to reset the e-matrix at the start of every episode has exactly the problems that you describe. As far as I can tell, this is an error in the pseudocode. The reference that you cite is very popular, so the error has been propagated...
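For what it's worth, a sketch of where the reset belongs (tabular case, with hypothetical sizes; the per-step SARSA(λ) update itself is elided):

```python
import numpy as np

n_states, n_actions, n_episodes = 10, 4, 500
Q = np.zeros((n_states, n_actions))

for episode in range(n_episodes):
    # Reset the eligibility traces at the start of EVERY episode so that
    # credit assigned in one episode cannot leak into the next one.
    e = np.zeros((n_states, n_actions))
    ...  # run the episode, applying the usual SARSA(lambda) updates step by step
```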