Reinforcement Learning
- distinct from supervised & unsupervised learning
- the goal is to figure out a good "policy", where a
  - policy is a mapping from state (e.g. state of sensors & recent history, or location) to action, and
  - "good" means leading to maximum FUTURE reward, not just immediate reward
- note that actions influence not just immediate reward but also future state
- example: playing solitaire.

Two main algorithms: Policy-Value Iteration and Q-Learning.

Definitions:
  s in S : state of the system
  a in A : action (really A(s), i.e. the available actions are a function of state, but we suppress this in the notation)
  T(s,a) : the state we end up in if we start in state s and take action a
  pi : S -> A : a policy, mapping state to action. Might be stochastic.
  r(t)   : immediate reward obtained at time t,
  or r(s,a) : immediate reward obtained by taking action a in state s

We can define a "discount factor" by which we discount future reward as compared to immediate reward. Call this alpha, with 0 <= alpha < 1.

Define R(t) = sum_{i >= 0} alpha^i r(t+i) as the "amortized reward" at time t.

Typically we are interested in finding a policy that maximises the "expected amortized reward".

Note that R(t) = r(t) + alpha R(t+1). (A small numeric check of this recurrence appears at the end of these notes.)

ALGORITHM: Policy-Value Iteration

Idea:
- define a "value function" V(s): the expected amortized reward obtained by following the current policy pi starting from state s
- freeze pi and estimate V (by following pi and measuring the result): if in state s we take action a, get reward r, and end up in state s', the update is
    V(s) <- r + alpha V(s')
- freeze V and make a new policy:
    pi(s) = argmax_a [ r(s,a) + alpha V(T(s,a)) ]
- iterate until convergence (a toy sketch of this loop appears at the end of these notes)

This works: it converges to the optimal policy, called pi*.

(Notation: we write "var <- val" to mean "update var to be closer to val". We can move var all the way to val if everything is deterministic, as in a chance-free game like pegs-on-a-board; or we can move it only some fraction of the way if there are stochastic factors, and must then repeat until things settle down.)

Problems:
- we need to know the functions r(s,a) and T(s,a)
- we need to follow the "current" policy pi in order to estimate V, so we cannot do "what if" exploration

ALGORITHM: Q-Learning

Idea: solve the problems of Policy-Value Iteration by instead defining a function
  Q(s,a) = the expected amortized reward if we start in state s, take action a, and thereafter follow the optimal policy pi*.

This has a few big advantages:
- we don't need to keep a "current policy" around: from state s we can always choose the best action as argmax_a Q(s,a), or choose randomly, or whatever
- we can update Q regardless of the policy being followed: if in state s we take action a, end up in state s', and receive immediate reward r, we can update
    Q(s,a) <- r + alpha max_a' Q(s',a')
  (a toy sketch of this update appears at the end of these notes)
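
Code sketches (toy Python; the specific states, rewards, and constants below are made-up illustrations, not part of the definitions above).

First, a tiny numeric check of the recurrence R(t) = r(t) + alpha R(t+1), using a made-up finite reward sequence (rewards assumed zero after the sequence ends):

ALPHA = 0.9
rewards = [1.0, 0.0, 2.0, 1.0]   # hypothetical immediate rewards r(t), r(t+1), ...

# Direct definition: R(t) = sum_{i >= 0} alpha^i r(t+i)
R_direct = sum(ALPHA**i * ri for i, ri in enumerate(rewards))

# Recurrence R(t) = r(t) + alpha R(t+1), evaluated backwards from the end
R = 0.0
for ri in reversed(rewards):
    R = ri + ALPHA * R

print(R_direct, R)   # the two values agree (approx. 3.349)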
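
Next, a minimal sketch of the Policy-Value Iteration loop, assuming a made-up two-state deterministic MDP with T(s,a) and r(s,a) given explicitly as dictionaries (toy data, just to make the two update steps concrete):

ALPHA = 0.9  # discount factor

# Made-up two-state deterministic MDP
states  = ["start", "goal"]
actions = ["left", "right"]
T = {("start", "left"): "start", ("start", "right"): "goal",
     ("goal",  "left"): "start", ("goal",  "right"): "goal"}
r = {("start", "left"): 0.0, ("start", "right"): 1.0,
     ("goal",  "left"): 0.0, ("goal",  "right"): 1.0}

pi = {s: "left" for s in states}   # arbitrary initial policy
V  = {s: 0.0 for s in states}      # initial value estimates

for _ in range(50):                # outer loop: iterate until convergence
    # Freeze pi and estimate V:  V(s) <- r + alpha V(s')
    for _ in range(100):
        for s in states:
            a = pi[s]
            V[s] = r[(s, a)] + ALPHA * V[T[(s, a)]]
    # Freeze V and improve pi:  pi(s) = argmax_a [ r(s,a) + alpha V(T(s,a)) ]
    for s in states:
        pi[s] = max(actions, key=lambda a: r[(s, a)] + ALPHA * V[T[(s, a)]])

print(pi)   # the optimal policy pi* for this toy problem: always go "right"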
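
Finally, a minimal Q-Learning sketch on the same kind of toy environment. The point is that the update only touches sampled (s, a, r, s') tuples, so any exploration policy will do; a learning rate moves Q only part of the way toward the target, per the "<-" notation note above:

import random

ALPHA = 0.9          # discount factor
LEARNING_RATE = 0.1  # how far we move Q toward the target on each update

# Same made-up toy environment; the update below only sees sampled
# (s, a, reward, s') tuples, never T and r in closed form.
states  = ["start", "goal"]
actions = ["left", "right"]
T = {("start", "left"): "start", ("start", "right"): "goal",
     ("goal",  "left"): "start", ("goal",  "right"): "goal"}
r = {("start", "left"): 0.0, ("start", "right"): 1.0,
     ("goal",  "left"): 0.0, ("goal",  "right"): 1.0}

Q = {(s, a): 0.0 for s in states for a in actions}

s = "start"
for _ in range(5000):
    a = random.choice(actions)              # any exploration policy works
    s_next, reward = T[(s, a)], r[(s, a)]   # one environment step (simulated)
    # Q(s,a) <- r + alpha max_a' Q(s',a')
    target = reward + ALPHA * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += LEARNING_RATE * (target - Q[(s, a)])
    s = s_next

greedy = {st: max(actions, key=lambda a: Q[(st, a)]) for st in states}
print(greedy)   # best action from each state, read off as argmax_a Q(s,a)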