By Li Xihan, Google Developers Expert
This article is excerpted from “Simple and Brutal TensorFlow 2”; reply “Manual” to get the full collection.

Deep reinforcement learning in TensorFlow should have been introduced long ago, and yes, it is finally here!
This article will introduce the process of implementing the Q-learning algorithm using TensorFlow in the OpenAI gym environment to play the CartPole game.
Deep Reinforcement Learning (DRL)
Reinforcement learning (RL) studies how to act based on the environment so as to maximize the expected reward. Combining reinforcement learning with deep learning gives Deep Reinforcement Learning (DRL), which is even more powerful; the well-known AlphaGo of recent years is a typical application of deep reinforcement learning.
- Reinforcement Learning: https://zh.wikipedia.org/wiki/%E5%BC%BA%E5%8C%96%E5%AD%A6%E4%B9%A0
Note: For the basics of reinforcement learning, refer to the introduction in the appendix of the handbook: https://tf.wiki/zh/appendix/rl.html
Here, we use deep reinforcement learning to play the CartPole game. CartPole is a classic problem in control theory: a pole is attached to a cart by a pivot, and the pole's center of gravity lies above the pivot, so the system is unstable. Under gravity the pole easily falls over, and we need to move the cart left and right along a horizontal track to keep the pole balanced upright.
CartPole Game
We use the CartPole game environment from the Gym environment library launched by OpenAI, which can be installed with pip install gym; for specific installation steps and tutorials, refer to the official documentation and here. Interacting with Gym is similar to playing a turn-based game: we first obtain the initial state of the game (such as the initial angle of the pole and the position of the cart); then, in each turn t, we choose one of the currently available actions and have Gym execute it (for example, push the cart left or push it right; only one of the two can be chosen in each turn). After executing the action, Gym returns the next state and the reward obtained in the current turn (for example, after we choose to push the cart left and execute it, the cart's position moves further left and the pole's angle tilts further to the right; Gym returns the new angle and position to us, and if the pole has not fallen in this turn, it also returns a small positive reward). This process keeps iterating until the game ends (for example, the pole falls). In Python, the basic way of calling Gym is as follows:
import gym
env = gym.make('CartPole-v1') # Instantiate a game environment, parameter is the game name
state = env.reset() # Initialize the environment and get the initial state
while True:
    env.render() # Render the current frame, draw to the screen
    action = model.predict(state) # Assume we have a trained model that can predict the action based on the current state
    next_state, reward, done, info = env.step(action) # Let the environment execute the action, obtaining the next state after the action, the reward of the action, whether the game is over, and additional information
    if done: # Exit the loop if the game is over
        break
- Gym environment library launched by OpenAI: https://gym.openai.com/
- Official documentation: https://gym.openai.com/docs/
- Here: https://morvanzhou.github.io/tutorials/machine-learning/reinforcement-learning/4-4-gym/
Thus, our task is to train a model that can predict a good action given the current state. Roughly speaking, a good action should maximize the total reward obtained over the entire game, which is also the goal of reinforcement learning. In the CartPole game, our goal is to take appropriate actions to keep the pole from falling, that is, to make the game last for as many turns as possible. Since we receive a small positive reward in every turn in which the pole stays up, the more turns we last, the higher the accumulated reward; maximizing the total reward during the game is therefore consistent with our ultimate goal.
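To make this concrete, the goal of reinforcement learning is usually formalized as maximizing the expected (discounted) sum of rewards over an episode; this is the standard formulation, stated here only to connect the prose with the discount factor gamma that appears in the code below:

$$\max_{\pi}\ \mathbb{E}_{\pi}\left[\sum_{t=0}^{T} \gamma^{t}\, r_{t}\right]$$

Here $r_t$ is the reward received in turn $t$ and $\gamma \in [0, 1]$ is the discount factor; with $\gamma = 1$, as in the code below, the objective is simply the total reward of the episode.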
The following code demonstrates how to use the Deep Q-Learning method in deep reinforcement learning [Mnih2013] to train the model. First, we import TensorFlow, Gym, and some common libraries, and define some model hyperparameters:
import tensorflow as tf
import numpy as np
import gym
import random
from collections import deque
num_episodes = 500 # Total number of episodes for game training
num_exploration_episodes = 100 # Number of episodes for the exploration process
max_len_episode = 1000 # Maximum number of turns for each episode
batch_size = 32 # Batch size
learning_rate = 1e-3 # Learning rate
gamma = 1. # Discount factor
initial_epsilon = 1. # Exploration rate at the start of exploration
final_epsilon = 0.01 # Exploration rate at the end of exploration
Then, we use tf.keras.Model to build a Q-network to fit the Q function in Q-learning. Here we use a simple fully connected neural network. The network takes the current state as input and outputs the estimated Q-value of each action (2-dimensional for CartPole, i.e., pushing the cart left or right).
class QNetwork(tf.keras.Model):
    def __init__(self):
        super().__init__()
        self.dense1 = tf.keras.layers.Dense(units=24, activation=tf.nn.relu)
        self.dense2 = tf.keras.layers.Dense(units=24, activation=tf.nn.relu)
        self.dense3 = tf.keras.layers.Dense(units=2)

    def call(self, inputs):
        x = self.dense1(inputs)
        x = self.dense2(x)
        x = self.dense3(x)
        return x

    def predict(self, inputs):
        q_values = self(inputs)
        return tf.argmax(q_values, axis=-1)
- [Mnih2013]: http://arxiv.org/abs/1312.5602
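Before turning to the main program, it is worth writing down the update rule that the training loop below implements. In Deep Q-Learning, the network's estimate $Q(s, a)$ for the executed action is pushed towards a target value $y$ computed from the observed reward and the next state, where the $(1 - \text{done})$ factor removes the bootstrap term when the episode has ended:

$$y = r + \gamma\,(1 - \text{done})\,\max_{a'} Q(s', a'), \qquad L = \big(y - Q(s, a)\big)^{2}$$

This is exactly what the main program below computes: $y$ via tf.reduce_max over the Q-values of the next state, and the loss as the mean squared error between $y$ and the Q-value of the chosen action.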
Finally, we implement the Q Learning algorithm in the main program.
if __name__ == '__main__':
    env = gym.make('CartPole-v1')        # Instantiate a game environment, parameter is the game name
    model = QNetwork()
    optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)
    replay_buffer = deque(maxlen=10000)  # Use a deque as the experience replay pool for Q-learning
    epsilon = initial_epsilon
    for episode_id in range(num_episodes):
        state = env.reset()              # Initialize the environment and get the initial state
        epsilon = max(                   # Calculate the current exploration rate
            initial_epsilon * (num_exploration_episodes - episode_id) / num_exploration_episodes,
            final_epsilon)
        for t in range(max_len_episode):
            env.render()                 # Render the current frame, draw to the screen
            if random.random() < epsilon:           # epsilon-greedy exploration strategy: choose a random action with probability epsilon
                action = env.action_space.sample()  # Choose a random action (exploration)
            else:
                action = model.predict(np.expand_dims(state, axis=0)).numpy()  # Choose the action that maximizes the Q-value calculated by the model
                action = action[0]
            # Let the environment execute the action, obtaining the next state after the action, the reward of the action, whether the game is over, and additional information
            next_state, reward, done, info = env.step(action)
            # If the game is over, give a large negative reward
            reward = -10. if done else reward
            # Place the tuple (state, action, reward, next_state), plus a done label indicating whether the game ended, into the experience replay pool
            replay_buffer.append((state, action, reward, next_state, 1 if done else 0))
            # Update the current state
            state = next_state
            if done:                     # Exit the loop if the game is over and proceed to the next episode
                print("episode %d, epsilon %f, score %d" % (episode_id, epsilon, t))
                break
            if len(replay_buffer) >= batch_size:
                # Randomly take a batch of tuples from the experience replay pool and convert them into NumPy arrays
                batch_state, batch_action, batch_reward, batch_next_state, batch_done = zip(
                    *random.sample(replay_buffer, batch_size))
                batch_state, batch_reward, batch_next_state, batch_done = \
                    [np.array(a, dtype=np.float32) for a in [batch_state, batch_reward, batch_next_state, batch_done]]
                batch_action = np.array(batch_action, dtype=np.int32)
                q_value = model(batch_next_state)
                y = batch_reward + (gamma * tf.reduce_max(q_value, axis=1)) * (1 - batch_done)  # Calculate the target value y
                with tf.GradientTape() as tape:
                    loss = tf.keras.losses.mean_squared_error(  # Minimize the distance between y and the predicted Q-value
                        y_true=y,
                        y_pred=tf.reduce_sum(model(batch_state) * tf.one_hot(batch_action, depth=2), axis=1)
                    )
                grads = tape.gradient(loss, model.variables)
                optimizer.apply_gradients(grads_and_vars=zip(grads, model.variables))  # Calculate gradients and update parameters
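After training, we can run the model greedily (with no epsilon-greedy exploration) to see how well it plays. The following is a minimal sketch, assuming the model trained above is still in scope; the variable names here are illustrative and not part of the original program:

# Evaluate the trained Q-network by always taking the greedy action
env = gym.make('CartPole-v1')
state = env.reset()
total_reward = 0.
while True:
    env.render()
    action = model.predict(np.expand_dims(state, axis=0)).numpy()[0]  # greedy action from the Q-network
    state, reward, done, info = env.step(action)
    total_reward += reward
    if done:
        print("total reward: %f" % total_reward)
        break
env.close()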
For a more complex game such as Breakout (Breakout-v0 in Gym), the state returned in each turn is a 210 * 160 * 3 RGB image representing the current screen. To design an appropriate state representation for the Breakout task, we make the following analysis:
- The color information of the bricks is not very important; converting the image to grayscale does not affect play, so we can remove the color information from the state (i.e., convert the image to a grayscale representation);
- The movement information of the ball is very important; given only a single frame, without knowing the direction in which the ball is moving, even a human would find it difficult to judge which way the paddle should move. Therefore, the state must include information about the direction of the ball's movement. A simple way is to stack the current frame with the previous few frames to obtain a 210 * 160 * X (where X is the number of stacked frames) state representation;
- The resolution of each frame does not need to be particularly high; it only needs to roughly show the positions of the bricks, the ball, and the paddle for decision making, so the width and height of each frame can be appropriately compressed (a preprocessing sketch along these lines is given below).
- Breakout-v0: https://gym.openai.com/envs/Breakout-v0/
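Based on this analysis, the preprocessing could look like the following minimal sketch. It only illustrates the three points above (grayscale conversion, downsampling, and frame stacking); the 84 * 84 target resolution and the stack of 4 frames are common choices from the DQN literature, not requirements of this article:

import numpy as np
import tensorflow as tf

def preprocess_frame(frame):
    # frame: a 210 * 160 * 3 RGB screen image returned by Gym
    frame = tf.image.rgb_to_grayscale(frame)   # remove the color information
    frame = tf.image.resize(frame, (84, 84))   # compress the width and height
    return tf.squeeze(frame) / 255.            # scale pixel values to [0, 1]

def stack_frames(recent_frames):
    # recent_frames: a list of the last few preprocessed frames (for example, 4 of them)
    # Stacking along the last axis gives an 84 * 84 * X state that encodes the ball's motion
    return tf.stack(recent_frames, axis=-1)

# Hypothetical usage: state = stack_frames([preprocess_frame(f) for f in last_4_frames])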
Considering that we need to extract features from image information, it is more appropriate to use a CNN to fit the Q-network. By replacing the QNetwork above with a CNN and adjusting the state representation accordingly, the same method can be used to play some simple video games.
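As an illustration only, a CNN version of the Q-network might look like the sketch below. The layer sizes follow common DQN-style choices rather than anything prescribed in this article, the input is assumed to be a batch of stacked frames as described above, and num_actions is a hypothetical parameter for the number of actions of the game being played:

class CNNQNetwork(tf.keras.Model):
    def __init__(self, num_actions):
        super().__init__()
        self.conv1 = tf.keras.layers.Conv2D(filters=32, kernel_size=8, strides=4, activation=tf.nn.relu)
        self.conv2 = tf.keras.layers.Conv2D(filters=64, kernel_size=4, strides=2, activation=tf.nn.relu)
        self.conv3 = tf.keras.layers.Conv2D(filters=64, kernel_size=3, strides=1, activation=tf.nn.relu)
        self.flatten = tf.keras.layers.Flatten()
        self.dense1 = tf.keras.layers.Dense(units=512, activation=tf.nn.relu)
        self.dense2 = tf.keras.layers.Dense(units=num_actions)  # one Q-value per action

    def call(self, inputs):
        x = self.conv1(inputs)
        x = self.conv2(x)
        x = self.conv3(x)
        x = self.flatten(x)
        x = self.dense1(x)
        return self.dense2(x)

    def predict(self, inputs):
        q_values = self(inputs)
        return tf.argmax(q_values, axis=-1)

The training loop above can then be reused by replacing QNetwork() with CNNQNetwork(num_actions=...) and changing depth=2 in tf.one_hot to the actual number of actions.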
Table of Contents for “Simple and Brutal TensorFlow 2”
- TensorFlow 2 Installation Guide
- TensorFlow 2 Basics: Tensors, Automatic Differentiation, and Optimizers
- TensorFlow 2 Models: Establishing Model Classes
- TensorFlow 2 Models: Multi-layer Perceptron
- TensorFlow 2 Models: Convolutional Neural Networks
- TensorFlow 2 Models: Recurrent Neural Networks
- TensorFlow 2 Models: Deep Reinforcement Learning (This Article)
- TensorFlow 2 Models: Keras Training Process and Custom Components
- TensorFlow 2 Common Modules 1: Checkpoint
- TensorFlow 2 Common Modules 2: TensorBoard
- TensorFlow 2 Common Modules 3: tf.data
- TensorFlow 2 Common Modules 3: tf.data Pipeline Acceleration
- TensorFlow 2 Common Modules 4: TFRecord
- TensorFlow 2 Common Modules 5: @tf.function
- TensorFlow 2 Common Modules 6: tf.TensorArray
- TensorFlow 2 Common Modules 7: tf.config
- TensorFlow 2 Deployment: Model Export
- TensorFlow 2 Deployment: TensorFlow Serving
- TensorFlow 2 Distributed Training
- TensorFlow 2 Datasets Data Loading
- Graph Execution Mode in TensorFlow 2
- tf.GradientTape Explained
- TensorFlow Performance Optimization
