Reinforcement Learning Lane Change Algorithm (Part 1)

Jacob Ouyang
Published in Good Audience · Oct 16, 2018

Introduction:

In the last article, I talked about the background for my project and the overall structure these articles will take. In this article, I'll explain how the agent I created for Phase 2 works, as well as give some insight into the underlying algorithms.

Phase 2 Background:

From the first phase, I adopted an environment with limited randomness. Every lane had a set velocity, but the gaps between cars now varied: each car was placed a variable distance, anywhere between 6 and 20 units, from another car.

As a result, every time the environment restarts, the cars appear in different positions, but every car in each lane behaves the same. This way, the agent has to learn a simple strategy instead of memorizing a set order of moves in a deterministic environment. Before we move on to my agent, we should first go over what deep reinforcement learning actually is and the specific algorithm that I chose to implement for this phase.

Deep Reinforcement Learning:

Reinforcement learning is a method of machine learning that focuses on training an agent (the AI player) to do a task, such as playing a game, by letting it perform actions and observe the consequences of those actions. The reinforcement learning agent starts off with little to no knowledge of the environment it exists in and, over the course of playing many rounds, updates its own decision-making algorithm (called a policy) to play better and better moves.

Unlike other methods of machine learning such as supervised learning, reinforcement learning generates its own dataset, including the labels, through techniques such as self play, and its algorithms try to maximize some notion of cumulative future reward. Self play proceeds much like playing out a chess game. In each play step, the agent is situated in a certain state of the environment, captured as an input frame; acting according to its current policy, it moves to the next frame. If the agent dies in the new frame, that frame is judged negatively; conversely, if the agent does something good in the new frame, that frame is rewarded positively. If no consequence can be judged, the new frame is assigned a neutral score. If the agent dies, the play sequence ends and the agent is given a new game (such as a chess board resetting) to start a new sequence; otherwise, the agent continues with another step. After many rounds, self play generates enough frames under the current policy, and a learning algorithm kicks in to learn an improved policy. This process iterates between self playing and learning.
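To make this concrete, here is a minimal sketch of such a self-play loop (the `env` and `policy` objects are hypothetical placeholders, not my actual Phase 2 code):

```python
import numpy as np

def play_episode(env, policy, max_frames=1000):
    """Collect (frame, action, reward) tuples by following the current policy."""
    frames, actions, rewards = [], [], []
    frame = env.reset()                       # new game, e.g. the board/track resets
    for _ in range(max_frames):
        action_probs = policy.predict(frame)  # current policy scores each action
        action = np.random.choice(len(action_probs), p=action_probs)
        next_frame, reward, done = env.step(action)
        frames.append(frame)
        actions.append(action)
        rewards.append(reward)                # negative if the agent "dies", positive if it
                                              # does something good, zero otherwise
        if done:                              # the play sequence ends; a new game starts next call
            break
        frame = next_frame
    return frames, actions, rewards
```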

Deep reinforcement learning specifically refers to reinforcement learning that uses neural networks to model its prediction functions. Because of how prominent neural networks are in AI right now, people have started using a neural network as the "brain," or interpreter, for the agent as it performs actions. The neural network takes an input such as the current game frame and outputs an action or a score that determines how the agent reacts in a certain situation.

Now that we’ve established the basic idea of reinforcement learning, we can go through the algorithm that I chose to use for the next two phases, called policy gradient.

Policy Gradient:

Policy gradient is a reinforcement learning algorithm that tries to learn the policy function P(A|S) directly. The policy function returns a probability distribution over actions A given a state S. As an example, we can consider a game of Pong. Given a certain frame (a state of the game), the agent has two options: go up or go down. In a policy gradient algorithm, the agent computes the probabilities of going up or down, depending on which one the algorithm deems more favorable.
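For the Pong example, the core computation is just turning the network's output into a probability distribution over the two actions and sampling from it. A rough sketch (`policy_logits_fn` is a hypothetical stand-in for the network's forward pass):

```python
import numpy as np

def pong_action(frame, policy_logits_fn):
    """Return P(up), P(down) for this frame and sample one of the two actions."""
    logits = policy_logits_fn(frame)          # hypothetical network forward pass, 2 values
    exp = np.exp(logits - np.max(logits))
    probs = exp / exp.sum()                   # e.g. [0.7, 0.3] -> "up" is deemed more favorable
    action = np.random.choice(["up", "down"], p=probs)
    return action, probs
```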

The agent learns by playing the game over and over again. At the end of an episode (i.e., when one game ends, whatever the outcome), a discounted reward function (more details on this later) is applied to all frames in that episode, assigning a reward to each frame based on the result of the game. Then, periodically, the agent updates the policy function (our neural network) using the new episodes as its dataset.
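Putting the pieces together, the outer training loop looks roughly like the sketch below; it reuses the hypothetical `play_episode` helper above and a `discount_rewards` helper like the one sketched in the reward section later, and `policy.update` is likewise a placeholder:

```python
def train(env, policy, n_episodes=10000, update_every=2, gamma=0.99):
    """Alternate between self play and periodic policy updates."""
    batch_frames, batch_actions, batch_rewards = [], [], []
    for episode in range(n_episodes):
        frames, actions, rewards = play_episode(env, policy)
        rewards = discount_rewards(rewards, gamma)   # spread the end-of-game reward backwards
        batch_frames += frames
        batch_actions += actions
        batch_rewards += list(rewards)
        if (episode + 1) % update_every == 0:        # periodic update of the policy network
            # (my agent also caps the batch at 350 frames, dropping the lowest-|reward| ones)
            policy.update(batch_frames, batch_actions, batch_rewards)
            batch_frames, batch_actions, batch_rewards = [], [], []
```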

Now that we’ve gone through the basic idea for policy gradient, I’ll be going over my implementation of it to solve the next two phases of my project.

How It Works:

Preprocessing

In order to build an input that contains information about both the position and the velocity of the cars around the player, I needed to overcome the limits of a single frame, which only carries position information and no velocity data. Feeding in multiple frames would take much more processing power and time to train. As a result, I took the difference of two consecutive images, producing a differential frame that gives the reinforcement learning agent both components needed to make an informed decision.

The player car is invisible in that differential frame, since its position relative to the picture frame is constant, but the small white shapes show how fast the cars around it are moving relative to the player car, giving an accurate sense of both position and velocity. Despite seeming incomprehensible, the differential frame actually worked better than the other input styles we tried, which ranged from two consecutive frames (which dramatically slowed down training) to a single-frame input.

Furthermore, I subsampled the image from 200x300 down to 50x100 by taking every 4th pixel vertically and every 3rd pixel horizontally. This made training faster and reduced the CPU memory footprint. Since the input cars were just rectangular hitboxes, it didn't matter that the image lost some detail: the overall gist of the image was still there.
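A minimal sketch of this preprocessing step, assuming 200x300 grayscale frames as described (the function name and details are illustrative, not my exact code):

```python
import numpy as np

def preprocess(prev_frame, curr_frame):
    """Build the differential frame and subsample it to 50x100."""
    diff = curr_frame.astype(np.float32) - prev_frame.astype(np.float32)  # position + velocity info
    diff = diff[::4, ::3]   # every 4th pixel vertically, every 3rd horizontally: 200x300 -> 50x100
    return diff.ravel()     # flatten to a 1-D array for the fully connected network
```

On a 200x300 frame this cuts the input from 60,000 pixels down to 5,000, which is what keeps the fully connected network described below manageable.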

Reward Function

This is where things get tricky. In conventional reinforcement learning problems such as chess or Atari games, the end of a game is clearly categorized: you either win or lose, so you can apply a +1/-1 reward that cleanly captures the game's result. We can then apply the Bellman equation to reward the rest of the frames and establish discounted rewards.

However, with the goal of going "as fast as possible," a +1/-1 reward system is not enough to incentivize the car to find the most optimal route. In fact, given a +1/-1 system, the car would always settle into a local optimum, choosing to go straight forever with no incentive to change lanes, since this yields a 0% crash rate.

In order to counteract this situation, two reward functions are needed, which we will refer to as smallreward and bigreward.

For the smallreward, a change in velocity gives the frame a reward of 2 × (change in velocity). This way, the agent learns much more quickly that if it either changes into a faster lane or accelerates without crashing, the frame is considered at least "somewhat good."

The bigreward, on the other hand, rewards the entire episode based on the outcome of the game and is applied to the last frame of the episode. Since I wanted to punish or reward the agent based on the distance travelled (if it crashes) and its average speed, I came up with the functions:

if x < 0: x = x · 3^((170−β)/170) (for when the agent crashes)

if x > 0: x = 2 · (177−β)/30 (for when the agent succeeds)

Here β represents the number of frames in the episode, and 177 is the maximum number of frames possible in an episode.
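In code, the two reward functions might look like the sketch below. The constants come straight from the formulas above; treating x as the raw end-of-game reward (e.g. ±1) is an assumption made for illustration:

```python
def small_reward(delta_v):
    """Per-frame reward: twice the change in velocity on that frame."""
    return 2.0 * delta_v

def big_reward(x, beta, max_frames=177):
    """End-of-episode reward applied to the last frame.
    x is assumed to be the raw end-of-game reward (e.g. +1 or -1);
    beta is the number of frames in the episode."""
    if x < 0:                                   # the agent crashed: punish early crashes harder
        return x * 3 ** ((170 - beta) / 170)
    if x > 0:                                   # the agent succeeded: reward faster episodes more
        return 2 * (max_frames - beta) / 30
    return 0.0
```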

The Bellman equation R = r_0 + Ɣ·r_1 + Ɣ²·r_2 + Ɣ³·r_3 + … would then be applied to the bigreward with a gamma of 0.99 to reward the entire episode. This way, each frame in the sequence is rewarded through the recursion x_{n−1} = r_{n−1} + Ɣ·x_n, establishing a discounted rewards system.
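A minimal sketch of that backward pass, assuming the bigreward has already been placed on the last frame's entry and the smallreward values on the earlier frames:

```python
import numpy as np

def discount_rewards(rewards, gamma=0.99):
    """Propagate rewards backwards: x_{n-1} = r_{n-1} + gamma * x_n."""
    discounted = np.zeros(len(rewards), dtype=np.float32)
    running = 0.0
    for i in reversed(range(len(rewards))):
        running = rewards[i] + gamma * running   # each frame inherits a discounted share
        discounted[i] = running                  # of everything that came after it
    return discounted
```

With a gamma of 0.99, a bigreward on the last frame still carries noticeable weight a hundred frames earlier (0.99^100 ≈ 0.37), so the whole episode feels the outcome of the game.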

Epsilon Greedy

One of the first problems I had with this new environment was the agent's lack of exploration during self play. Even with the modified reward function above, the agent would either constantly collide with other cars or settle into the local optimum of going in a straight line. As a result, exploration was needed to keep the agent from converging immediately and to push it toward an optimal policy.

In my case, I chose epsilon-greedy as my exploration method. In this method, a decaying variable epsilon ranges from 1 to 0 and decreases as the agent learns. Each time the agent chooses a move during a game, it either follows the policy or randomly selects an action, based on epsilon. For example, if epsilon were 0.3 at the time, the agent would have a 0.3 probability of randomly selecting an action and a 0.7 probability of choosing the action the policy says is best. By doing this, you essentially force the agent to explore the space by occasionally ignoring the policy. As a result, the agent is exposed to new moves regardless of whether it believes it has converged on an optimal policy. The point of training isn't to find the "easiest solution" but the "best solution," and this technique helps the agent explore different paths.

For my agent, I used an epsilon of ε = 2^(−n/1000), where n is the episode number. This way, every 1000 episodes, when the model is actually stored, the agent is half as random as it was before.
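Combining the decaying epsilon with the selection rule gives something like this sketch (illustrative only; the 4-action count matches the network described below):

```python
import numpy as np

N_ACTIONS = 4  # one entry per action, matching the network's output layer

def epsilon_for(episode):
    """Epsilon halves every 1000 episodes."""
    return 2 ** (-episode / 1000)

def select_action(policy_probs, episode):
    """Follow the policy with probability 1 - epsilon, otherwise act randomly."""
    if np.random.rand() < epsilon_for(episode):
        return np.random.randint(N_ACTIONS)    # explore: ignore the policy entirely
    return int(np.argmax(policy_probs))        # exploit: take the action the policy favors
```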

Neural Network

The agent uses the policy gradient method to update a neural network modelling the policy P(A|S), which returns a probability distribution over actions given a state. I chose a 2-layer fully connected neural network because the state of the game was relatively simple, and I didn't think adding layers would make that big of a difference in performance. In fact, in trials with additional fully connected layers or convolutional layers, the network performed worse, either because of longer training times or because it converged to suboptimal policies. The input is just the 200x300 differential frame (subsampled to 50x100) reshaped into a 1-D array; between the input and the output layer of 4 nodes, one for each action, there is a hidden layer with 200 nodes. The weights of the hidden layer are initialized from a normal distribution with a standard deviation of 0.1. Every two game episodes, up to 350 frames are compiled (to stay within the available memory limit), with the frames with the lowest absolute-value rewards removed as excess.
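A rough sketch of such a network's forward pass in NumPy is below; the layer sizes and the std-0.1 initialization match the description above, while the ReLU activation and softmax output are illustrative choices rather than my exact implementation:

```python
import numpy as np

INPUT_DIM = 50 * 100   # flattened 50x100 differential frame
HIDDEN = 200           # hidden layer size
N_ACTIONS = 4          # output layer: one node per action

# Weights drawn from a normal distribution with standard deviation 0.1
W1 = np.random.normal(0.0, 0.1, (INPUT_DIM, HIDDEN))
W2 = np.random.normal(0.0, 0.1, (HIDDEN, N_ACTIONS))

def forward(x):
    """Map a flattened differential frame to action probabilities P(A|S)."""
    hidden = np.maximum(0.0, x @ W1)           # fully connected layer with ReLU (assumed)
    logits = hidden @ W2                       # fully connected output layer
    exp = np.exp(logits - np.max(logits))
    return exp / exp.sum()                     # softmax over the 4 actions
```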

Conclusion

This algorithm was my first adventure into reinforcement learning and taught me a lot about training agents and how to structure reward functions in a way that helps the agent learn. Furthermore, I also learned some basic data preprocessing, which helped speed up the training process. In the next post, I'll use a more complicated reinforcement learning algorithm (DDQN) to solve an environment with significantly more randomness. This is the code for Phase 2 (the code's really unpolished, and I'll come back to it when I have more time). Thanks for reading!!

Part 0

Part 2
