Reinforcement Learning
 - distinct from supervised & unsupervised learning
 - try to figure out a good "policy" where a
    - policy is a mapping from state (eg state of sensors & recent
      history, or location) to action, and
    - "good" means leading to maximum FUTURE reward, not just
      immediate reward
 - note that actions influence not just immediate reward but also
   future state
 - example: playing solitaire.

Two main algorithms: Policy-Value Iteration and Q-Learning


 s in S : state of system

 a in A : action (really A(s), ie available actions are a function of
    state, but we suppress this in the notation)

 T(s,a) is the state we end up in starting in state s and taking action a.

 pi : S -> A,
  is a policy, maps state to action.  Might be stochastic.

 r(t) : immediate reward obtained at time t.
 r(s,a) : immediate reward obtained by taking action a in state s.

We can define a "discount factor" by which we discount future reward
as compared to immediate reward.  Call this alpha, with 0<=alpha<1.

Define
 R(t) = sum_{i >= 0} alpha^i r(t+i)
as the "amortized reward" at time t.  Typically we are interested in
finding a policy that maximises the "expected amortized reward".

Note that
 R(t) = r(t) + alpha R(t+1)
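As a quick numerical check (a sketch with a made-up reward stream; the values and the helper name are illustrative only), the recursion agrees with the direct sum:

```python
# Amortized reward: R(t) = sum_i alpha^i r(t+i),
# computed via the recursion R(t) = r(t) + alpha R(t+1).
alpha = 0.9
rewards = [1.0, 0.0, 2.0, 5.0]   # hypothetical r(t), r(t+1), ...

def amortized(rs):
    R = 0.0
    for r in reversed(rs):       # work backwards: R(t) = r(t) + alpha R(t+1)
        R = r + alpha * R
    return R

# the direct sum gives the same number as the recursion
direct = sum(alpha**i * r for i, r in enumerate(rewards))
assert abs(amortized(rewards) - direct) < 1e-12
```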

ALGORITHM: Policy-Value iteration

  - define a "value function" V(s) which is the expected amortized
    reward following the current policy pi starting at state s.
  - freeze pi and estimate V (by following pi and measuring the
    result): if in state s we take action a = pi(s), get reward r, and
    end up in state s', the update is
     V(s) <- r + alpha V(s')
  - freeze V and make a new policy:
     pi(s) = argmax_a  r(s,a) + alpha V( T(s,a) )
  - iterate until convergence

 This works: it converges to the optimal policy, called pi*.

(Notation: we write "var <- val" to mean "update var to be closer to
 val".  Can move val completely there if everything is deterministic, as
 in a chance-free game like pegs-on-a-board.  Or can update it some
 fraction of the way, if there are stochastic factors.  Must then
 repeat until things settle down.)

  - need to know functions r(s,a) and T(s,a).
  - need to follow "current" policy pi in order to figure out V.  This
    means we cannot do "what if" exploration.
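The iteration above can be sketched on a toy problem.  Everything in this example is made up for illustration: a deterministic chain of states 0..3, actions -1/+1, reward 1 for stepping into the absorbing state 3:

```python
# Policy/value iteration sketch on a made-up deterministic chain.
alpha = 0.9
S = [0, 1, 2, 3]
A = [-1, +1]

def T(s, a):                 # next state: clamp to [0, 3]; 3 absorbs
    return 3 if s == 3 else max(0, min(3, s + a))

def r(s, a):                 # immediate reward: 1 for stepping into 3
    return 1.0 if s != 3 and T(s, a) == 3 else 0.0

pi = {s: -1 for s in S}      # start from a deliberately bad policy
V = {s: 0.0 for s in S}

for _ in range(20):          # iterate until convergence
    # freeze pi, estimate V:  V(s) <- r + alpha V(s')
    for _ in range(50):
        for s in S:
            V[s] = r(s, pi[s]) + alpha * V[T(s, pi[s])]
    # freeze V, improve pi:  pi(s) = argmax_a r(s,a) + alpha V(T(s,a))
    pi = {s: max(A, key=lambda a: r(s, a) + alpha * V[T(s, a)]) for s in S}

# the converged policy heads right, toward the rewarding state
```

Since this toy world is deterministic and r, T are known, the full "move var completely to val" update applies; with stochastic dynamics the inner loop would instead nudge V a fraction of the way each time.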

 ALGORITHM: Q-Learning

  idea: solve the problems of P/V-iteration by instead defining

    Q(s,a) = expected amortized reward if we start in state s, take
    action a, and thereafter follow the optimal policy pi*.

  This has a few big advantages:

   - don't need to keep "current policy" around, can always choose
     best action from state s as argmax_a Q(s,a), or explore by
     choosing an action at random.

   - can update Q regardless of policy being followed.  If in state s
     and take action a, ending up in state s', and receiving immediate
     reward r, can update:

      Q(s,a) <- r + alpha max_a' Q(s',a')
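This update can be sketched on the same kind of made-up deterministic chain as before (states 0..3, state 3 absorbing and worth reward 1 to enter; all names and parameters here are illustrative).  Note the behaviour policy is purely random, yet the greedy policy read off Q still comes out optimal:

```python
import random

# Q-learning sketch on a made-up deterministic chain.
alpha = 0.9                  # discount factor, as above
lr = 0.5                     # fraction of the way to move on each "<-" update
A = [-1, +1]

def T(s, a):                 # next state: clamp to [0, 3]; 3 absorbs
    return 3 if s == 3 else max(0, min(3, s + a))

def r(s, a):                 # immediate reward: 1 for stepping into 3
    return 1.0 if s != 3 and T(s, a) == 3 else 0.0

Q = {(s, a): 0.0 for s in range(4) for a in A}

random.seed(0)
s = 0
for _ in range(2000):
    a = random.choice(A)     # any behaviour policy works: the update is
    s2, rew = T(s, a), r(s, a)   # independent of how actions were chosen
    # Q(s,a) <- r + alpha max_a' Q(s',a')
    target = rew + alpha * max(Q[(s2, a2)] for a2 in A)
    Q[(s, a)] += lr * (target - Q[(s, a)])
    s = 0 if s2 == 3 else s2 # restart after hitting the absorbing state

greedy = {s: max(A, key=lambda a: Q[(s, a)]) for s in range(3)}
# greedy heads right in every state, matching pi*
```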