Reinforcement Learning is about learning a mapping from states to a probability distribution over actions. This is called the policy.
Policy = π(s,a) = probability of taking action a when in state s
S = set of all states (assume finite)
s_t = state at time t
A(s_t) = set of all possible actions given the agent is in state s_t ∈ S
a_t = action at time t
r_t ∈ R (the reals) = reward at time t
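As a quick illustration (not part of the original notes), a tabular policy for a small finite MDP can be stored as a mapping from each state to a probability distribution over its actions. The states, actions, and probabilities below are made up.

# A tabular policy pi(s, a) for a toy MDP.
# The states ("s0", "s1") and actions ("left", "right") are made-up examples.
policy = {
    "s0": {"left": 0.5, "right": 0.5},   # uniform over A(s0)
    "s1": {"left": 0.9, "right": 0.1},
}

def pi(s, a):
    """Probability of taking action a in state s under this policy."""
    return policy[s].get(a, 0.0)

# Each row is a distribution over A(s), so it must sum to 1.
assert all(abs(sum(dist.values()) - 1.0) < 1e-9 for dist in policy.values())
print(pi("s0", "left"))  # 0.5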
At each timestep t = 1, 2, 3, ..., the agent observes the state s_t, selects an action a_t, and then receives a reward r_{t+1}.
The return, ret_t, is the total reward received starting at time t+1:
ret_t = r_{t+1} + r_{t+2} + r_{t+3} + ... + r_f
where r_f is the reward at the final time step (the final time step can be infinite),
and the discounted return is
ret_t = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + ...
where 0 <= γ <= 1 is called the discount factor.
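To make the discounted return concrete, here is a small Python sketch (not from the original notes) that sums a finite list of rewards with discounting; the reward values are arbitrary.

def discounted_return(rewards, gamma):
    """Compute ret_t = r_{t+1} + gamma*r_{t+2} + gamma^2*r_{t+3} + ...
    rewards[k] is read as r_{t+1+k}; gamma is the discount factor."""
    ret = 0.0
    discount = 1.0
    for r in rewards:
        ret += discount * r
        discount *= gamma
    return ret

# Made-up rewards r_{t+1}=1, r_{t+2}=0, r_{t+3}=2 with gamma = 0.9:
print(discounted_return([1.0, 0.0, 2.0], 0.9))  # 1 + 0.9*0 + 0.81*2 = 2.62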
We assume that the number of states and actions is finite. We then define the state transition probabilities to be:
P^a_{ss'} = Pr( s_{t+1} = s' | s_t = s, a_t = a )
This is just the probability of transitioning from state s to state s' when action a has been taken.
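For a finite MDP these probabilities can be stored as a simple lookup table. The sketch below (with made-up states, actions, and numbers) indexes them as P[s][a][s'].

# P[s][a][s'] = probability of landing in s' after taking action a in state s.
# States, actions, and probabilities are invented for illustration.
P = {
    "s0": {"go": {"s0": 0.2, "s1": 0.8}},
    "s1": {"go": {"s0": 1.0}},
}

# Each conditional distribution P[s][a] must sum to 1 over next states s'.
for s, actions in P.items():
    for a, dist in actions.items():
        assert abs(sum(dist.values()) - 1.0) < 1e-9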
Expected Rewards
Similarly, the expected reward received when moving from state s to state s' under action a is:
R^a_{ss'} = E[ r_{t+1} | s_t = s, a_t = a, s_{t+1} = s' ]
The value function for policy π is
V^π(s) = E_π[ ret_t | s_t = s ] = E_π[ r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + ... | s_t = s ]
The action-value function for policy π is
Q^π(s,a) = E_π[ ret_t | s_t = s, a_t = a ]
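The notes above only define V^π. One standard way to actually compute it for a small MDP with known P and R is iterative policy evaluation, which repeatedly replaces V(s) by the expected one-step reward plus the discounted value of the next state. The algorithm itself is not described in these notes, and the MDP, policy, and numbers below are made up.

GAMMA = 0.9  # discount factor (made-up value)

# Made-up dynamics P[s][a][s'], expected rewards R[s][a][s'], and policy pi[s][a].
P = {"s0": {"go": {"s0": 0.2, "s1": 0.8}}, "s1": {"go": {"s0": 1.0}}}
R = {"s0": {"go": {"s0": 0.0, "s1": 1.0}}, "s1": {"go": {"s0": 5.0}}}
pi = {"s0": {"go": 1.0}, "s1": {"go": 1.0}}   # only one action per state here

V = {s: 0.0 for s in P}     # initial guess for V^pi
for _ in range(1000):       # sweep until (approximately) converged
    V = {
        s: sum(
            pi[s][a] * sum(P[s][a][s2] * (R[s][a][s2] + GAMMA * V[s2])
                           for s2 in P[s][a])
            for a in P[s]
        )
        for s in P
    }
print(V)  # approximate V^pi(s) for each state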
Goal: Find the policy that gives the greatest return over the long run. We say a policy π is better than or equal to a policy π' if V^π(s) >= V^π'(s) for all s. There is always at least one such policy. Such a policy is called an optimal policy and is denoted by π*. Its corresponding value function is called V*:
V*(s) = V^π*(s) = max_π V^π(s), for all s
and the optimal action-value function
Q*(s,a) = Q^π*(s,a) = max_π Q^π(s,a), for all s, a
The Bellman optimality equation is then
V*(s) = max_{a ∈ A(s)} Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V*(s') ]
This equation has a unique solution. It is a system of |S| equations in |S| unknowns. If P and R are known then, in principle, it can be solved using some method for solving systems of nonlinear equations. Once V* is known, an optimal policy follows by acting greedily with respect to V*: in each state, choose an action that achieves the maximum on the right-hand side of the Bellman optimality equation.
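As a concrete (if tiny) illustration of solving the Bellman optimality equation numerically, the sketch below uses value iteration: repeatedly apply the right-hand side of the equation as an update until V stops changing, then read off a greedy policy. Value iteration is not described in these notes, and the MDP here is made up.

GAMMA = 0.9  # discount factor (made-up value)

# Made-up MDP: transition probabilities P[s][a][s'] and expected rewards R[s][a][s'].
P = {
    "s0": {"left": {"s0": 1.0}, "right": {"s1": 1.0}},
    "s1": {"left": {"s0": 1.0}, "right": {"s1": 1.0}},
}
R = {
    "s0": {"left": {"s0": 0.0}, "right": {"s1": 1.0}},
    "s1": {"left": {"s0": 0.0}, "right": {"s1": 2.0}},
}

def backup(V, s, a):
    """One-step lookahead: sum over s' of P^a_{ss'} * (R^a_{ss'} + gamma * V(s'))."""
    return sum(P[s][a][s2] * (R[s][a][s2] + GAMMA * V[s2]) for s2 in P[s][a])

V = {s: 0.0 for s in P}
for _ in range(1000):
    new_V = {s: max(backup(V, s, a) for a in P[s]) for s in P}
    if max(abs(new_V[s] - V[s]) for s in P) < 1e-10:
        break
    V = new_V

# Greedy (optimal) policy: in each state pick an action achieving the maximum.
pi_star = {s: max(P[s], key=lambda a: backup(V, s, a)) for s in P}
print(V)        # approximately V*(s) for each state
print(pi_star)  # e.g. {'s0': 'right', 's1': 'right'} for these made-up numbers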