reinforcement-learning,q-learning

Q1: It will converge to a single mapping, unless more than one mapping is optimal. Q2: Q-Learning has an exploration parameter that determines how often it takes random, potentially sub-optimal moves. Rewards will fluctuate as long as this parameter is non-zero. Q3: Reward graphs, as in the link you provided....
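
To make the Q2 point concrete, here is a minimal sketch of the kind of exploration rule being described, assuming it is the common epsilon-greedy scheme. The function name `epsilon_greedy` and the parameter name `epsilon` are illustrative and do not come from the question.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index given the Q-values of a single state.

    q_values: list of floats, one entry per action (hypothetical layout).
    epsilon:  probability of taking a random, possibly sub-optimal action.
    """
    if rng.random() < epsilon:
        # Explore: uniformly random action.
        return rng.randrange(len(q_values))
    # Exploit: greedy action (ties go to the lowest index).
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

With `epsilon = 0` the choice is always greedy, so the reward curve can settle. With any `epsilon > 0`, some moves stay random and the per-episode reward keeps fluctuating.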

machine-learning,reinforcement-learning,sarsa

I agree with you 100%. Failing to reset the e-matrix at the start of every episode has exactly the problems that you describe. As far as I can tell, this is an error in the pseudocode. The reference that you cite is very popular, so the error has been propagated...
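
As a rough illustration of where the reset belongs, here is a sketch of tabular SARSA(lambda) with accumulating traces. The 1-D chain environment, the function name `run_sarsa_lambda`, and all parameter values are assumptions made up for this example and are not taken from the cited pseudocode.

```python
import numpy as np

def run_sarsa_lambda(n_states=5, n_actions=2, episodes=50,
                     alpha=0.1, gamma=0.9, lam=0.8, epsilon=0.1, seed=0):
    """Tabular SARSA(lambda) on a hypothetical chain of states.

    Action 1 moves right, action 0 moves left (bounded at state 0).
    Reaching the last state ends the episode with reward 1.
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))

    def policy(s):
        # Epsilon-greedy with random tie-breaking among the best actions.
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        best = np.flatnonzero(Q[s] == Q[s].max())
        return int(rng.choice(best))

    for _ in range(episodes):
        # Reset the e-matrix at the start of every episode, so that
        # traces from the previous episode cannot leak into this one.
        e = np.zeros_like(Q)
        s = 0
        a = policy(s)
        done = False
        while not done:
            s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
            done = s2 == n_states - 1
            r = 1.0 if done else 0.0
            a2 = policy(s2)
            target = r if done else r + gamma * Q[s2, a2]
            delta = target - Q[s, a]
            e[s, a] += 1.0           # accumulating trace for the visited pair
            Q += alpha * delta * e   # update every state-action pair
            e *= gamma * lam         # decay every trace
            s, a = s2, a2
    return Q
```

Because the traces start at zero in each episode, `Q[3, 1]` (moving right from the state next to the goal) is only ever changed by the final step of an episode. Without the reset, leftover traces would also shift it during unrelated steps of later episodes.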

python,machine-learning,reinforcement-learning,function-approximation

You need normalization in each trial. This will keep the weights in a bounded range (e.g. [0,1]). The way you are adding to the weights each time just makes them grow, and they would be useless after the first trial. I would do something like this: self.weights[i] += (self.alpha * theta...

python-2.7,if-statement,random,reinforcement-learning

After the suggestions to use numpy, I did a bit of research and came up with this solution for the first part of the soft-max implementation:
    prob_t = [0,0,0] #initialise
    for a in range(nActions):
        prob_t[a] = np.exp(Q[state][a]/temperature) #calculate numerators
    #numpy matrix element-wise division for denominator (sum of numerators)
    prob_t = np.true_divide(prob_t,sum(prob_t))...
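
For comparison, the same soft-max probabilities can be computed without the explicit loop. This is only a sketch: the helper names `softmax_probs` and `softmax_action` are made up here, and subtracting the maximum before exponentiating is an added safeguard against overflow at low temperatures, not something from the original solution.

```python
import numpy as np

def softmax_probs(q_row, temperature):
    """Boltzmann (soft-max) action probabilities for one state's Q-values."""
    q = np.asarray(q_row, dtype=float) / temperature
    q -= q.max()               # shift for numerical stability; probabilities unchanged
    numerators = np.exp(q)
    return numerators / numerators.sum()

def softmax_action(q_row, temperature, rng):
    """Sample an action index according to the soft-max probabilities."""
    p = softmax_probs(q_row, temperature)
    return int(rng.choice(len(p), p=p))
```

A possible call, assuming `Q[state]` is a sequence of per-action values as in the snippet above, would be `softmax_action(Q[state], temperature, np.random.default_rng())`.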

c++,algorithm,dynamic-programming,reinforcement-learning

Let's reconsider the Bellman optimality equation for your task. I consider this as a systematic approach to such problems (whereas I often don't understand DP one-liners). My reference is the book of Sutton and Barto. The state in which your system is can be described by a triple of integer...

machine-learning,reinforcement-learning,sarsa

Summary: your current approach is correct, except that you shouldn't restrict your output values to be between 0 and 1. This page has a great explanation, which I will summarize here. It doesn't specifically discuss SARSA, but I think everything it says should translate. The values in the results vector...

nlp,nltk,named-entity-recognition,reinforcement-learning

The plain vanilla NER chunker provided in nltk internally uses a maximum entropy chunker trained on the ACE corpus. Hence it cannot identify dates or times unless you train it with your own classifier and data (which is quite a meticulous job). You could refer to this link for performing...

Yes, do a search on GitHub, and you will get a whole bunch of results: GitHub: WILLIAMS+REINFORCE
The most popular ones use this code (in Python):
    __author__ = 'Thomas Rueckstiess, [email protected]'
    from pybrain.rl.learners.directsearch.policygradient import PolicyGradientLearner
    from scipy import mean, ravel, array
    class Reinforce(PolicyGradientLearner):
        """ Reinforce is a gradient estimator technique...

machine-learning,reinforcement-learning,sarsa

It's unfortunate that they've reused the variables s and a in two different scopes here, but yes, you adjust all e(s,a) values, e.g.,
    for every state s in your state space
        for every action a in your action space
            update Q(s,a)
            update e(s,a)
Note what's happening here. e(s,a) is getting...
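
With array-backed tables, the double loop over every state and action collapses into two whole-array operations. The sketch below assumes `Q` and `e` are NumPy arrays of shape (states, actions) and that accumulating traces are used. The function name `trace_update` and its parameters are illustrative only.

```python
import numpy as np

def trace_update(Q, e, s, a, delta, alpha, gamma, lam):
    """One SARSA(lambda) step applied to every (state, action) pair.

    s, a:  the pair that was just visited (the outer-scope s and a).
    delta: the TD error computed for that pair.
    Q and e are modified in place.
    """
    e[s, a] += 1.0            # bump the trace of the pair just visited
    Q += alpha * delta * e    # "update Q(s,a)" for all s, a at once
    e *= gamma * lam          # "update e(s,a)" for all s, a at once
```

Only pairs with a non-zero trace actually change, so pairs that were never visited in the current episode are left untouched.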

artificial-intelligence,neural-network,backpropagation,reinforcement-learning,temporal-difference

The backward and forward views can be confusing, but when you are dealing with something simple like a game-playing program, things are actually pretty simple in practice. I'm not looking at the reference you're using, so let me just provide a general overview. Suppose I have a function approximator like...

machine-learning,reinforcement-learning,q-learning

RL problems don't need a final state per se. What they need is reward states. So, as long as you have some rewards, you are good to go, I think. I don't have a lot of experience with RL problems like this one. As a commenter suggests, this sounds like...

machine-learning,artificial-intelligence,tic-tac-toe,reinforcement-learning,q-learning

You have a Q value for each state-action pair. You update one Q value after every action you perform. More precisely, if applying action a1 from state s1 gets you into state s2 and brings you some reward r, then you update Q(s1, a1) as follows: Q(s1, a1) = Q(s1,...
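
The formula above is cut off, so the following is a sketch of the standard textbook Q-learning update, which may differ in detail from the one the answer goes on to give. The learning rate `alpha`, the discount `gamma`, the dictionary layout of `Q`, and the function name `q_update` are all assumptions for illustration.

```python
from collections import defaultdict

def q_update(Q, s1, a1, r, s2, actions_s2, alpha=0.1, gamma=0.9):
    """Update the single entry Q[(s1, a1)] after observing (s1, a1, r, s2).

    Q:          mapping from (state, action) to value, defaulting to 0.0.
    actions_s2: actions available in s2 (empty if s2 is terminal).
    """
    # Value of the best action in the next state; 0.0 for a terminal state.
    best_next = max((Q[(s2, a)] for a in actions_s2), default=0.0)
    Q[(s1, a1)] += alpha * (r + gamma * best_next - Q[(s1, a1)])
    return Q[(s1, a1)]

# Hypothetical table; states and actions can be any hashable values,
# e.g. a board string and a cell index for tic-tac-toe.
Q = defaultdict(float)
```

Only one entry changes per move, which matches the description above: one Q value per state-action pair, updated after every action.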