Previewing Files with Tkinter

Introduction This is related to an earlier post on panning and zooming a node graph visualized with Tkinter. In this post, the goal is to implement a level preview: when the user hovers over a node, the associated file is shown in a rectangle that appears, and when they stop hovering over the node, the preview goes away. Additionally, there is a section on writing an update_node function that simplified the codebase. ...
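A minimal sketch of the hover-preview idea, assuming a plain Tkinter Canvas; the node tag and the show_preview/hide_preview names are my own illustration, not code from the post:

```python
import tkinter as tk

# Illustrative sketch (not the post's code): show a text preview when the
# pointer enters a canvas item and remove it when the pointer leaves.
root = tk.Tk()
canvas = tk.Canvas(root, width=400, height=300)
canvas.pack()

node = canvas.create_rectangle(50, 50, 120, 100, fill="lightblue", tags=("node",))
preview_items = []

def show_preview(event):
    # In the real tool this would read the level file tied to the node;
    # here we just display placeholder text in a bordered rectangle.
    x, y = event.x + 15, event.y + 15
    box = canvas.create_rectangle(x, y, x + 170, y + 40, fill="white", outline="black")
    label = canvas.create_text(x + 85, y + 20, text="level segment preview")
    preview_items.extend([box, label])

def hide_preview(event):
    for item in preview_items:
        canvas.delete(item)
    preview_items.clear()

canvas.tag_bind("node", "<Enter>", show_preview)
canvas.tag_bind("node", "<Leave>", hide_preview)
root.mainloop()
```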

November 25, 2024 · 3 min

Panning and Zooming for a Node Graph Editor with Tkinter

Motivation As part of the work for my dissertation, I am building a tool that allows users to edit a Markov Decision Process (MDP). This relates to my paper, Level Assembly as a Markov Decision Process. This work is different because I am representing the MDP as a graph. Every node in the graph is a state, and a state is a level segment. Every edge represents an action. Levels are formed by connecting states together. I also have some metadata in the edges, which are links between level segments. ...
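As a hedged illustration of that structure (states as level segments, edges as actions with link metadata), here is one possible data model; the Node and Edge names and fields are assumptions, not the tool's actual code:

```python
from dataclasses import dataclass, field

# Illustrative sketch (not the tool's data model): each state is a level
# segment, and each edge carries the link metadata between segments.
@dataclass
class Edge:
    target: "Node"
    link: list[str] = field(default_factory=list)  # metadata joining two segments

@dataclass
class Node:
    segment: list[str]                              # rows of the level segment
    edges: dict[str, Edge] = field(default_factory=dict)  # action name -> edge
```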

November 24, 2024 · 8 min

Updating My Site to Hugo

It’s been over two years since I last made a post, and a lot has happened. One of the most minor of these is the change to my website: I’m now using Hugo to build the whole thing. Before Hugo, I was using a home-brewed tool I built in Python. It worked with HTML, but there were re-writable portions marked by {{ FILE_LOCATION }}. The tool would go to that file location and replace the line with whatever it found in that file. I also had a blog post system with a meta file holding information like the title and date of the post, among other features. However, it was a pain to work with. ...
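A rough sketch of how such a replacement tool might work, assuming each {{ path/to/file }} placeholder sits on its own line; this is an illustration, not the original tool:

```python
import re
from pathlib import Path

# Illustrative sketch (not the original tool): replace any line containing a
# {{ path/to/file }} placeholder with the contents of that file.
PLACEHOLDER = re.compile(r"\{\{\s*(.+?)\s*\}\}")

def render(template_path: str) -> str:
    rendered_lines = []
    for line in Path(template_path).read_text().splitlines():
        match = PLACEHOLDER.search(line)
        if match:
            # Swap the whole line for the referenced file's contents.
            rendered_lines.append(Path(match.group(1)).read_text().rstrip("\n"))
        else:
            rendered_lines.append(line)
    return "\n".join(rendered_lines)
```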

November 23, 2024 · 3 min

Q-Learning and SARSA

In this post I’m going to cover two active reinforcement learning methods: q-learning and SARSA. Neither method depends on a full MDP model; that is, we are not guaranteed access to the transition function P. Instead, we learn which actions can be taken in a state during playthroughs. This is represented by a q-table: a table indexed by state and action that yields a q-value, the expected utility of taking that action in that state. The expected utility of a state is then the value of the best action available in that state. See the equation below. ...
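As a hedged illustration of the q-table idea (not code from the post), here is a tabular sketch of both updates; Q, alpha, and gamma are assumed names:

```python
from collections import defaultdict

# Illustrative q-table sketch (not the post's code): Q[(state, action)] holds
# the estimated utility of taking `action` in `state`.
Q = defaultdict(float)
alpha, gamma = 0.1, 0.9

def state_utility(state, actions):
    # The expected utility of a state is the value of its best action.
    return max(Q[(state, a)] for a in actions)

def q_learning_update(state, action, reward, next_state, next_actions):
    # Q-learning bootstraps from the best next action (off-policy).
    target = reward + gamma * state_utility(next_state, next_actions)
    Q[(state, action)] += alpha * (target - Q[(state, action)])

def sarsa_update(state, action, reward, next_state, next_action):
    # SARSA bootstraps from the action actually taken next (on-policy).
    target = reward + gamma * Q[(next_state, next_action)]
    Q[(state, action)] += alpha * (target - Q[(state, action)])
```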

February 22, 2022 · 4 min

Temporal Difference Learning

This post is on temporal difference learning (TDL). TDL does not rely on a Markov Decision Process (MDP). Instead, it traditionally uses a table with a row and column for each state, so that the table represents the utility of all possible transitions, and it uses observations to learn which transitions are valid. \[ \begin{align} U^{\pi}(s) \leftarrow U^{\pi}(s) + \alpha (R(s) + \gamma U^{\pi}(s') - U^{\pi}(s)) \end{align} \]This is very similar to the Bellman equation we’ve seen in past posts, but we aren’t summing over neighbors because temporal difference learning does not assume a model. Instead, an agent plays through the game and returns the sequence of states visited along the path. So, the utility of a state s is its current value plus the reward at s, plus the discounted utility of the next state in the path, minus its current utility, all scaled by a learning rate alpha, which is typically a small number between 0 and 1. ...
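A minimal sketch of that update applied along one observed episode; the U, alpha, and gamma names and the (state, reward) episode format are assumptions, not the post's code:

```python
# Illustrative TD(0) sketch (not the post's code): update utilities from one
# observed episode, a list of (state, reward) pairs in visit order.
def td_update(U, episode, alpha=0.1, gamma=0.9):
    for (s, r), (s_next, _) in zip(episode, episode[1:]):
        # Move U[s] toward the one-step bootstrapped target.
        U[s] = U.get(s, 0.0) + alpha * (r + gamma * U.get(s_next, 0.0) - U.get(s, 0.0))
    return U
```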

February 1, 2022 · 3 min

Policy Iteration

In this post I’ll be covering policy iteration. As a brief reminder, here is the Bellman equation that value iteration optimizes: \[ \begin{align} U(s) = R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s'|s, a)U(s') \end{align} \]In policy iteration, the goal is not to find the exact optimal utility of every state, as value iteration does. If one state is clearly better than another, then the precise difference isn’t that important. After all, we care most about the agent making the correct decisions. This idea leads to policy iteration, which is broken into two steps: (1) policy evaluation and (2) policy improvement. Before going into each of these, I want to give you the pseudo-code for the algorithm. ...
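A hedged sketch of the two steps, not the post's own pseudo-code; it assumes P[s][a] is a list of (probability, next_state) pairs, R[s] is the reward, and actions(s) lists the actions available in state s:

```python
# Illustrative policy iteration sketch (not the post's pseudo-code).
def policy_iteration(states, actions, P, R, gamma=0.9, eval_sweeps=20):
    policy = {s: actions(s)[0] for s in states}
    U = {s: 0.0 for s in states}
    while True:
        # (1) Policy evaluation: estimate U under the current policy.
        for _ in range(eval_sweeps):
            for s in states:
                U[s] = R[s] + gamma * sum(p * U[s2] for p, s2 in P[s][policy[s]])
        # (2) Policy improvement: act greedily with respect to U.
        stable = True
        for s in states:
            best = max(actions(s), key=lambda a: sum(p * U[s2] for p, s2 in P[s][a]))
            if best != policy[s]:
                policy[s], stable = best, False
        if stable:
            return policy, U
```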

January 31, 2022 · 4 min

Value Iteration

In this post I’ll be covering value iteration. To best understand why it works, I want to first cover the Bellman equation: \[ \begin{align} U(s) = R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s'|s, a)U(s') \end{align} \]We’ve already covered R(s), P(s’|s, a), and U(s’) in a previous post, so I won’t waste your time on those. Gamma is a discount constant that is greater than 0 and less than or equal to 1. What this equation says is that the utility of state s is its reward plus the discounted expected utility of its successors under the best available action, weighted by the transition probabilities. Value iteration is the application of this formula over multiple iterations until convergence. ...
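A minimal sketch of that iteration, using the same assumed P[s][a], R[s], and actions(s) conventions as the policy iteration sketch above; again an illustration rather than the post's implementation:

```python
# Illustrative value iteration sketch (not the post's code): apply the Bellman
# update to every state repeatedly until the largest change falls below tol.
def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-6):
    U = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(sum(p * U[s2] for p, s2 in P[s][a]) for a in actions(s))
            new_u = R[s] + gamma * best
            delta = max(delta, abs(new_u - U[s]))
            U[s] = new_u
        if delta < tol:
            return U
```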

January 26, 2022 · 2 min

Revised Direct Utility Estimation For Better MDP

In the previous post, I provided an implementation that was technically correct but lacking in two ways: (1) optimization and (2) formalism. The optimization was weak because I was calling game.next_states() to compute the possible next states of a state on the fly; precomputing all valid transitions makes the code much more efficient. This also ties into formalism: I had an MDP, but I never directly defined the set of actions or transitions. So, let’s do that. ...
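A small sketch of that precomputation, assuming a game object with the next_states() function named above; the dictionary layout is my own illustration:

```python
# Illustrative sketch (not the post's code): cache every state's successors
# once, instead of calling game.next_states(state) repeatedly during learning.
def precompute_transitions(game, states):
    transitions = {}
    for state in states:
        # Each entry defines the transitions available from `state`
        # as the list of reachable successor states.
        transitions[state] = list(game.next_states(state))
    return transitions
```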

January 25, 2022 · 2 min

Direct Utility Estimation

In this post, I’m going to take you through one approach to solving the grid world environment with reinforcement learning, specifically direct utility estimation (DUE). (If you are unfamiliar with the grid world environment, fret not, because that is the section directly below.) Before DUE can be covered, I’ll give a brief overview of Markov decision processes and why they are not great for problems with large state spaces. I’ll then show how DUE works and how it can be implemented in Python. Following that, I’ll show that DUE can solve MDPs, but it is slow enough that DUE is best seen as an introduction to more powerful approaches. ...
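As a hedged sketch of the core idea (not the post's implementation), DUE can be read as averaging the returns observed after each state across many episodes; the (state, reward) episode format and gamma are assumptions:

```python
from collections import defaultdict

# Illustrative DUE sketch (not the post's code): estimate each state's utility
# as the average return observed after visiting it across many episodes.
def direct_utility_estimation(episodes, gamma=0.9):
    totals, counts = defaultdict(float), defaultdict(int)
    for episode in episodes:                 # episode = [(state, reward), ...]
        G = 0.0
        for state, reward in reversed(episode):
            G = reward + gamma * G           # return from this state onward
            totals[state] += G
            counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}
```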

January 20, 2022 · 6 min

Generative Art: Conway's Game of Life

With the lightning algorithm out of the way, let’s look at Conway’s Game of Life. The idea is pretty cool: we create an n×m grid and randomly assign cells as alive or dead. We then apply a set of rules for some number of iterations and get a result that is very hard to predict. The rules are pretty simple: any live cell with fewer than two live neighbors dies, as if by under-population; any live cell with two or three live neighbors lives on to the next generation; any live cell with more than three live neighbors dies, as if by overpopulation; and any dead cell with exactly three live neighbors becomes a live cell, as if by reproduction. It turns out that despite the relative fame of Conway’s Game of Life, implementing it is very simple. I think there are three points of note. First, you check not only the four cardinal directions but also the corners, for a total of 8 neighbors. Second, Conway’s Game of Life was designed for an infinite grid, but that is neither feasible nor useful in our case of generating images. In this implementation I’ve made the grid the size of the image; a better way would be to make the grid larger than the image to approximate the infinite grid. Third, the grid is not modified in place (i.e. a new grid is built each iteration, and the old grid is used to get neighbor counts). ...
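A minimal sketch of one generation step under those three points (a finite grid the size of the image, all 8 neighbors, and a fresh grid each iteration); the names are illustrative, not taken from the post:

```python
# Illustrative Game of Life step (not the post's code): `grid` is a list of
# lists of 0/1, and a brand-new grid is produced each generation.
def step(grid):
    rows, cols = len(grid), len(grid[0])
    new_grid = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Count all 8 neighbors, staying inside the finite grid.
            neighbors = sum(
                grid[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
                if (rr, cc) != (r, c)
            )
            if grid[r][c] == 1:
                new_grid[r][c] = 1 if neighbors in (2, 3) else 0
            else:
                new_grid[r][c] = 1 if neighbors == 3 else 0
    return new_grid
```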

January 13, 2022 · 4 min