Temporal Difference Learning
This post is on temporal difference learning (TDL). TDL does not rely on a known model of the Markov Decision Process (MDP). Instead, it traditionally uses a table where each state has a row and a column, so that the table represents the utility of all possible transitions, and TDL uses observations to learn which transitions are valid.

\[ \begin{align} U^{\pi}(s) \leftarrow U^{\pi}(s) + \alpha \left( R(s) + \gamma U^{\pi}(s') - U^{\pi}(s) \right) \end{align} \]

This is very similar to the Bellman equation that we’ve seen in past posts, but we aren’t summing over neighboring states, because temporal difference learning does not assume a model. Instead, an agent plays through the game and returns the sequence of states visited along the path. The utility of a state s is then nudged toward the reward at s plus the discounted utility of the next state in the path: we add the difference \(R(s) + \gamma U^{\pi}(s') - U^{\pi}(s)\), scaled by a learning rate \(\alpha\), which is typically a small number between 0 and 1. ...
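To make the update rule concrete, here is a minimal sketch of tabular TD(0) in Python. The `episode` structure of `(state, reward, next_state)` transitions and the constants `ALPHA` and `GAMMA` are illustrative assumptions, not code from this post:

```python
from collections import defaultdict

ALPHA = 0.1  # learning rate: a small number between 0 and 1
GAMMA = 0.9  # discount factor

# Utility table: one entry per state, defaulting to 0 for unseen states.
U = defaultdict(float)

def td_update(s, reward, s_next):
    """One TD(0) step: U(s) <- U(s) + alpha * (R(s) + gamma * U(s') - U(s))."""
    U[s] += ALPHA * (reward + GAMMA * U[s_next] - U[s])

def learn_from_episode(episode):
    """Apply the update along the path an agent took through the game.

    `episode` is a hypothetical list of (state, reward, next_state)
    transitions, in the order they were visited.
    """
    for s, reward, s_next in episode:
        td_update(s, reward, s_next)

# Toy example: a three-state path with a reward only at the end.
learn_from_episode([("A", 0.0, "B"), ("B", 0.0, "C"), ("C", 1.0, "C")])
print(dict(U))
```

Because the update only ever looks one step ahead along the observed path, repeated episodes gradually propagate reward information backward through the table without the agent ever consulting a transition model.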