Policy Iteration
In this post I’ll be covering policy iteration. As a brief reminder, here is the Bellman equation that value iteration optimizes: \[ U(s) = R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\,U(s') \] In policy iteration, the goal is not to compute the perfectly optimal utility of each state, as value iteration does. If one state is clearly better than another, the precise difference in utility isn’t that important; after all, what we care about most is the agent making the correct decisions. This idea leads to policy iteration, which is broken into two steps: (1) policy evaluation and (2) policy improvement. Before going into each of these ideas, I want to give you the pseudo-code for the algorithm. ...
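To make the two steps concrete before the walkthrough, here is a minimal sketch of how they might fit together in Python. It assumes a tabular MDP with known dynamics, where `P` is a transition array of shape `(n_states, n_actions, n_states)` and `R` is a state-only reward vector, matching the \(R(s)\) form above; the function and variable names are mine for illustration, not from the post.

```python
import numpy as np

def policy_evaluation(policy, P, R, gamma, theta=1e-8):
    """Compute U^pi for a fixed policy by sweeping the Bellman
    expectation update until the largest change is below theta."""
    U = np.zeros(len(R))
    while True:
        delta = 0.0
        for s in range(len(R)):
            # U(s) = R(s) + gamma * sum_{s'} P(s' | s, pi(s)) U(s')
            u = R[s] + gamma * P[s, policy[s]] @ U
            delta = max(delta, abs(u - U[s]))
            U[s] = u
        if delta < theta:
            return U

def policy_improvement(U, P, gamma):
    """Return the greedy policy w.r.t. U: for each state, pick the
    action maximizing the expected utility of the next state.
    (R(s) is action-independent here, so it drops out of the argmax.)"""
    return np.argmax(P @ U, axis=1)  # (n_states, n_actions) -> best action per state

def policy_iteration(P, R, gamma=0.9):
    """Alternate evaluation and improvement until the policy is stable."""
    n_states, n_actions, _ = P.shape
    policy = np.zeros(n_states, dtype=int)
    while True:
        U = policy_evaluation(policy, P, R, gamma)
        new_policy = policy_improvement(U, P, gamma)
        if np.array_equal(new_policy, policy):
            return policy, U  # policy unchanged, so it is optimal
        policy = new_policy
```

Note that the loop terminates when improvement leaves the policy unchanged, which is exactly the point made above: we only need utilities accurate enough to rank actions correctly, not their exact optimal values.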