By Li Xihan, Google Developers Expert
This article is excerpted from “Simple and Brutal TensorFlow 2”; reply “Manual” to get the full collection.

Deep reinforcement learning in TensorFlow should have been introduced long ago, and yes, it is finally here!
This article will introduce the process of implementing the Q-learning algorithm using TensorFlow in the OpenAI gym environment to play the CartPole game.
Deep Reinforcement Learning (DRL)
Reinforcement learning (RL) studies how to act based on the environment so as to maximize the expected reward. Combining reinforcement learning with deep learning gives Deep Reinforcement Learning (DRL), which is even more powerful; the well-known AlphaGo of recent years is a typical application of deep reinforcement learning.
- Reinforcement Learning: https://zh.wikipedia.org/wiki/%E5%BC%BA%E5%8C%96%E5%AD%A6%E4%B9%A0
Note: For the basics of reinforcement learning, refer to the introduction in the appendix of the handbook: https://tf.wiki/zh/appendix/rl.html
Here, we use deep reinforcement learning to play the CartPole game. CartPole is a classic problem in control theory: a pole is attached to a cart by a pivot, and the pole's center of gravity lies above the pivot, so the system is unstable. Under gravity the pole easily falls over, and we need to move the cart left and right along a horizontal track to keep the pole balanced upright.
CartPole Game
We use the CartPole game environment from the Gym environment library launched by OpenAI, which can be installed with pip install gym; for specific installation steps and tutorials, refer to the official documentation and here. Interacting with Gym is similar to playing a turn-based game: we first obtain the initial state of the game (such as the initial angle of the pole and the position of the cart); then, in each turn t, we choose one of the currently available actions and have Gym execute it (for example, push the cart left or push it right; only one of the two can be chosen in each turn). After executing the action, Gym returns the next state and the reward obtained in the current turn (for example, after we choose to push the cart left and execute it, the cart's position moves further left and the pole's angle tilts further to the right; Gym returns the new angle and position to us, and if the pole has not fallen in this turn, it also returns a small positive reward). This process keeps iterating until the game ends (for example, the pole falls). In Python, the basic way of calling Gym is as follows:
import gym
env = gym.make('CartPole-v1') # Instantiate a game environment, parameter is the game name
state = env.reset() # Initialize the environment and get the initial state
while True:
    env.render() # Render the current frame, draw to the screen
    action = model.predict(state) # Assume we have a trained model that can predict the action based on the current state
    next_state, reward, done, info = env.step(action) # Let the environment execute the action, obtaining the next state after the action, the reward of the action, whether the game is over, and additional information
    if done: # Exit the loop if the game is over
        break
- Gym environment library launched by OpenAI: https://gym.openai.com/
- Official documentation: https://gym.openai.com/docs/
- Here: https://morvanzhou.github.io/tutorials/machine-learning/reinforcement-learning/4-4-gym/
Thus, our task is to train a model that can predict a good action given the current state. Roughly speaking, a good action should maximize the total reward obtained over the entire game, which is also the goal of reinforcement learning. In the CartPole game, our goal is to take appropriate actions to keep the pole from falling, that is, to make the game last for as many turns as possible. Since we receive a small positive reward in every turn in which the pole stays up, the more turns we last, the higher the accumulated reward; maximizing the total reward during the game is therefore consistent with our ultimate goal.
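To make this concrete, the goal of reinforcement learning is usually formalized as maximizing the expected (discounted) sum of rewards over an episode; this is the standard formulation, stated here only to connect the prose with the discount factor gamma that appears in the code below:

$$\max_{\pi}\ \mathbb{E}_{\pi}\left[\sum_{t=0}^{T} \gamma^{t}\, r_{t}\right]$$

Here $r_t$ is the reward received in turn $t$ and $\gamma \in [0, 1]$ is the discount factor; with $\gamma = 1$, as in the code below, the objective is simply the total reward of the episode.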
The following code demonstrates how to use the Deep Q-Learning method in deep reinforcement learning [Mnih2013] to train the model. First, we import TensorFlow, Gym, and some common libraries, and define some model hyperparameters:
import tensorflow as tf
import numpy as np
import gym
import random
from collections import deque
num_episodes = 500 # Total number of episodes for game training
num_exploration_episodes = 100 # Number of episodes for the exploration process
max_len_episode = 1000 # Maximum number of turns for each episode
batch_size = 32 # Batch size
learning_rate = 1e-3 # Learning rate
gamma = 1. # Discount factor
initial_epsilon = 1. # Exploration rate at the start of exploration
final_epsilon = 0.01 # Exploration rate at the end of exploration
Then, we use tf.keras.Model to build a Q-network to fit the Q function in Q-learning. Here we use a simple fully connected neural network. The network takes the current state as input and outputs the estimated Q-value of each action (2-dimensional for CartPole, i.e., pushing the cart left or right).
class QNetwork(tf.keras.Model):
    def __init__(self):
        super().__init__()
        self.dense1 = tf.keras.layers.Dense(units=24, activation=tf.nn.relu)
        self.dense2 = tf.keras.layers.Dense(units=24, activation=tf.nn.relu)
        self.dense3 = tf.keras.layers.Dense(units=2)

    def call(self, inputs):
        x = self.dense1(inputs)
        x = self.dense2(x)
        x = self.dense3(x)
        return x

    def predict(self, inputs):
        q_values = self(inputs)
        return tf.argmax(q_values, axis=-1)
- [Mnih2013]: http://arxiv.org/abs/1312.5602
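Before turning to the main program, it is worth writing down the update rule that the training loop below implements. In Deep Q-Learning, the network's estimate $Q(s, a)$ for the executed action is pushed towards a target value $y$ computed from the observed reward and the next state, where the $(1 - \text{done})$ factor removes the bootstrap term when the episode has ended:

$$y = r + \gamma\,(1 - \text{done})\,\max_{a'} Q(s', a'), \qquad L = \big(y - Q(s, a)\big)^{2}$$

This is exactly what the main program below computes: $y$ via tf.reduce_max over the Q-values of the next state, and the loss as the mean squared error between $y$ and the Q-value of the chosen action.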
Finally, we implement the Q Learning algorithm in the main program.
if __name__ == '__main__':
    env = gym.make('CartPole-v1')        # Instantiate a game environment, parameter is the game name
    model = QNetwork()
    optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)
    replay_buffer = deque(maxlen=10000)  # Use a deque as the experience replay pool for Q-learning
    epsilon = initial_epsilon
    for episode_id in range(num_episodes):
        state = env.reset()              # Initialize the environment and get the initial state
        epsilon = max(                   # Calculate the current exploration rate
            initial_epsilon * (num_exploration_episodes - episode_id) / num_exploration_episodes,
            final_epsilon)
        for t in range(max_len_episode):
            env.render()                 # Render the current frame, draw to the screen
            if random.random() < epsilon:           # epsilon-greedy exploration strategy: choose a random action with probability epsilon
                action = env.action_space.sample()  # Choose a random action (exploration)
            else:
                action = model.predict(np.expand_dims(state, axis=0)).numpy()  # Choose the action that maximizes the Q-value calculated by the model
                action = action[0]
            # Let the environment execute the action, obtaining the next state after the action, the reward of the action, whether the game is over, and additional information
            next_state, reward, done, info = env.step(action)
            # If the game is over, give a large negative reward
            reward = -10. if done else reward
            # Place the tuple (state, action, reward, next_state), plus a done label indicating whether the game ended, into the experience replay pool
            replay_buffer.append((state, action, reward, next_state, 1 if done else 0))
            # Update the current state
            state = next_state
            if done:                     # Exit the loop if the game is over and proceed to the next episode
                print("episode %d, epsilon %f, score %d" % (episode_id, epsilon, t))
                break
            if len(replay_buffer) >= batch_size:
                # Randomly take a batch of tuples from the experience replay pool and convert them into NumPy arrays
                batch_state, batch_action, batch_reward, batch_next_state, batch_done = zip(
                    *random.sample(replay_buffer, batch_size))
                batch_state, batch_reward, batch_next_state, batch_done = \
                    [np.array(a, dtype=np.float32) for a in [batch_state, batch_reward, batch_next_state, batch_done]]
                batch_action = np.array(batch_action, dtype=np.int32)
                q_value = model(batch_next_state)
                y = batch_reward + (gamma * tf.reduce_max(q_value, axis=1)) * (1 - batch_done)  # Calculate the target value y
                with tf.GradientTape() as tape:
                    loss = tf.keras.losses.mean_squared_error(  # Minimize the distance between y and the predicted Q-value
                        y_true=y,
                        y_pred=tf.reduce_sum(model(batch_state) * tf.one_hot(batch_action, depth=2), axis=1)
                    )
                grads = tape.gradient(loss, model.variables)
                optimizer.apply_gradients(grads_and_vars=zip(grads, model.variables))  # Calculate gradients and update parameters
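After training, we can run the model greedily (with no epsilon-greedy exploration) to see how well it plays. The following is a minimal sketch, assuming the model trained above is still in scope; the variable names here are illustrative and not part of the original program:

# Evaluate the trained Q-network by always taking the greedy action
env = gym.make('CartPole-v1')
state = env.reset()
total_reward = 0.
while True:
    env.render()
    action = model.predict(np.expand_dims(state, axis=0)).numpy()[0]  # greedy action from the Q-network
    state, reward, done, info = env.step(action)
    total_reward += reward
    if done:
        print("total reward: %f" % total_reward)
        break
env.close()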
For a more complex game such as Breakout (Breakout-v0 in Gym), the state returned in each turn is a 210 * 160 * 3 RGB image representing the current screen. To design an appropriate state representation for the Breakout task, we make the following analysis:
- The color information of the bricks is not very important; converting the image to grayscale does not affect play, so we can remove the color information from the state (i.e., convert the image to a grayscale representation);
- The movement information of the ball is very important; given only a single frame, without knowing the direction in which the ball is moving, even a human would find it difficult to judge which way the paddle should move. Therefore, the state must include information about the direction of the ball's movement. A simple way is to stack the current frame with the previous few frames to obtain a 210 * 160 * X (where X is the number of stacked frames) state representation;
- The resolution of each frame does not need to be particularly high; it only needs to roughly show the positions of the bricks, the ball, and the paddle for decision making, so the width and height of each frame can be appropriately compressed (a preprocessing sketch along these lines is given below).
- Breakout-v0: https://gym.openai.com/envs/Breakout-v0/
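Based on this analysis, the preprocessing could look like the following minimal sketch. It only illustrates the three points above (grayscale conversion, downsampling, and frame stacking); the 84 * 84 target resolution and the stack of 4 frames are common choices from the DQN literature, not requirements of this article:

import numpy as np
import tensorflow as tf

def preprocess_frame(frame):
    # frame: a 210 * 160 * 3 RGB screen image returned by Gym
    frame = tf.image.rgb_to_grayscale(frame)   # remove the color information
    frame = tf.image.resize(frame, (84, 84))   # compress the width and height
    return tf.squeeze(frame) / 255.            # scale pixel values to [0, 1]

def stack_frames(recent_frames):
    # recent_frames: a list of the last few preprocessed frames (for example, 4 of them)
    # Stacking along the last axis gives an 84 * 84 * X state that encodes the ball's motion
    return tf.stack(recent_frames, axis=-1)

# Hypothetical usage: state = stack_frames([preprocess_frame(f) for f in last_4_frames])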
Considering that we need to extract features from image information, it is more appropriate to use a CNN to fit the Q-network. By replacing the QNetwork above with a CNN and adjusting the state representation accordingly, the same method can be used to play some simple video games.
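As an illustration only, a CNN version of the Q-network might look like the sketch below. The layer sizes follow common DQN-style choices rather than anything prescribed in this article, the input is assumed to be a batch of stacked frames as described above, and num_actions is a hypothetical parameter for the number of actions of the game being played:

class CNNQNetwork(tf.keras.Model):
    def __init__(self, num_actions):
        super().__init__()
        self.conv1 = tf.keras.layers.Conv2D(filters=32, kernel_size=8, strides=4, activation=tf.nn.relu)
        self.conv2 = tf.keras.layers.Conv2D(filters=64, kernel_size=4, strides=2, activation=tf.nn.relu)
        self.conv3 = tf.keras.layers.Conv2D(filters=64, kernel_size=3, strides=1, activation=tf.nn.relu)
        self.flatten = tf.keras.layers.Flatten()
        self.dense1 = tf.keras.layers.Dense(units=512, activation=tf.nn.relu)
        self.dense2 = tf.keras.layers.Dense(units=num_actions)  # one Q-value per action

    def call(self, inputs):
        x = self.conv1(inputs)
        x = self.conv2(x)
        x = self.conv3(x)
        x = self.flatten(x)
        x = self.dense1(x)
        return self.dense2(x)

    def predict(self, inputs):
        q_values = self(inputs)
        return tf.argmax(q_values, axis=-1)

The training loop above can then be reused by replacing QNetwork() with CNNQNetwork(num_actions=...) and changing depth=2 in tf.one_hot to the actual number of actions.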
Table of Contents for “Simple and Brutal TensorFlow 2”
- TensorFlow 2 Installation Guide
- TensorFlow 2 Basics: Tensors, Automatic Differentiation, and Optimizers
- TensorFlow 2 Models: Establishing Model Classes
- TensorFlow 2 Models: Multi-layer Perceptron
- TensorFlow 2 Models: Convolutional Neural Networks
- TensorFlow 2 Models: Recurrent Neural Networks
- TensorFlow 2 Models: Deep Reinforcement Learning (This Article)
- TensorFlow 2 Models: Keras Training Process and Custom Components
- TensorFlow 2 Common Modules 1: Checkpoint
- TensorFlow 2 Common Modules 2: TensorBoard
- TensorFlow 2 Common Modules 3: tf.data
- TensorFlow 2 Common Modules 3: tf.data Pipeline Acceleration
- TensorFlow 2 Common Modules 4: TFRecord
- TensorFlow 2 Common Modules 5: @tf.function
- TensorFlow 2 Common Modules 6: tf.TensorArray
- TensorFlow 2 Common Modules 7: tf.config
- TensorFlow 2 Deployment: Model Export
- TensorFlow 2 Deployment: TensorFlow Serving
- TensorFlow 2 Distributed Training
- TensorFlow 2 Datasets Data Loading
- Graph Execution Mode in TensorFlow 2
- tf.GradientTape Explained
- TensorFlow Performance Optimization
