Implementing RNN and LSTM with Pure NumPy

Machine Heart Report

Contributor: Siyuan

With the popularity of frameworks like TensorFlow and PyTorch, building a neural network often comes down to a few API calls. As a result, most developers have grown unfamiliar with the underlying mechanisms, especially with how to implement a neural network in pure NumPy. Machine Heart previously introduced how to implement a simple convolutional neural network with NumPy; today we look at how to implement LSTM and other recurrent neural networks with NumPy.

Generally, implementing deep networks in pure NumPy faces two main challenges. First, in forward propagation, convolutional and recurrent networks are not as straightforward to implement as fully connected ones, and for computational efficiency the practical code often deviates from the textbook formulas. Second, once forward propagation is implemented, we also have to write backpropagation by hand, which requires a solid grasp of matrix calculus and the chain rule.

Although NumPy cannot exploit the parallel computing power of GPUs, implementing networks with it gives a clear picture of the underlying numerical computation, which may be why courses such as CS231n ask students to implement deep networks by hand in NumPy.

Project address: https://github.com/krocki/dnc

In this project, the author implements the DNC, RNN, and LSTM in NumPy, with the RNN code based on A. Karpathy's earlier work. In addition, the author wrote a gradient check to verify the correctness of the implementation. Doesn't it feel as though the very term "gradient checking" has faded from view since deep learning frameworks became popular?
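
To recall what that means: gradient checking compares the analytical gradients produced by backpropagation against numerical estimates obtained by finite differences. The following is only a minimal sketch of the idea, not the repo's actual implementation; the function f and the parameter array w are placeholders:

import numpy as np

def numerical_gradient(f, w, eps=1e-5):
    """Estimate dloss/dw by central differences; f() returns the scalar loss for the current w."""
    grad = np.zeros_like(w)
    it = np.nditer(w, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        idx = it.multi_index
        old = w[idx]
        w[idx] = old + eps; loss_plus = f()
        w[idx] = old - eps; loss_minus = f()
        w[idx] = old                                    # restore the original value
        grad[idx] = (loss_plus - loss_minus) / (2 * eps)
        it.iternext()
    return grad

def relative_error(analytic, numeric):
    """Elementwise relative error; values below roughly 1e-4 are usually considered a pass."""
    return np.abs(analytic - numeric) / np.maximum(np.abs(analytic) + np.abs(numeric), 1e-10)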

Specifically, the project is an implementation of the differentiable neural computer (DNC) described in DeepMind's 2016 Nature paper "Hybrid computing using a neural network with dynamic external memory", with character-level prediction as the example task. The repo also includes implementations of an RNN (rnn-numpy.py) and an LSTM (lstm-numpy.py); some external data (ptb, wiki) need to be downloaded separately.

The forward pass of the LSTM is shown below. The repo's code targets Python 2.7; changing xrange to range makes it run under Python 3 as well ¯\_(ツ)_/¯:

loss = 0

# forward pass
for t in range(len(inputs)):

    # encode in 1-of-k representation
    xs[t] = np.zeros((M, B))
    for b in range(0, B): xs[t][:, b][inputs[t][b]] = 1

    # gates, linear part
    gs[t] = np.dot(Wxh, xs[t]) + np.dot(Whh, hs[t-1]) + bh

    # gates, nonlinear part
    # i, o, f gates
    gs[t][0:3*HN, :] = sigmoid(gs[t][0:3*HN, :])
    # c gate
    gs[t][3*HN:4*HN, :] = np.tanh(gs[t][3*HN:4*HN, :])

    # mem(t) = c gate * i gate + f gate * mem(t-1)
    cs[t] = gs[t][3*HN:4*HN, :] * gs[t][0:HN, :] + gs[t][2*HN:3*HN, :] * cs[t-1]
    # mem cell - nonlinearity
    cs[t] = np.tanh(cs[t])
    # new hidden state
    hs[t] = gs[t][HN:2*HN, :] * cs[t]
    # unnormalized log probabilities for next chars
    ys[t] = np.dot(Why, hs[t]) + by

    # softmax with numerical stability: subtract the per-column max before exponentiating
    mx = np.max(ys[t], axis=0)
    ys[t] -= mx
    # probabilities for next chars
    ps[t] = np.exp(ys[t]) / np.sum(np.exp(ys[t]), axis=0)

    for b in range(0, B):
        # softmax (cross-entropy loss)
        if ps[t][targets[t, b], b] > 0: loss += -np.log(ps[t][targets[t, b], b])

As the code shows, the outer loop over t walks through the time steps. At each step we first compute the gate activations: the pre-activations of the input, output, and forget gates, together with the candidate memory content, come out of a single matrix multiplication and are then sliced apart. This looks different from the separate formulas we see in theory, but it is mathematically equivalent, as the sketch below illustrates. Following the LSTM cell equations, we then compute the new memory content cs[t], the hidden-state output hs[t], and the unnormalized predictions ys[t], turn the latter into probabilities ps[t], and finally add the cross-entropy of the prediction to the total loss.
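
To see why the fused computation is equivalent to the textbook formulas, note that stacking the per-gate weight matrices lets one matrix multiplication produce all four pre-activations at once, which are then sliced apart in the order used by the code above (i, o, f, c). A small sketch under hypothetical names (Wi, Wo, Wf, Wc do not appear in the repo, which stores only the fused Wxh and Whh):

import numpy as np

HN, M, B = 8, 5, 2                        # hidden size, input size, batch size (made up for the example)
rng = np.random.default_rng(0)

# four separate gate weight matrices, as usually written in textbooks
Wi, Wo, Wf, Wc = (rng.standard_normal((HN, M)) for _ in range(4))
x = rng.standard_normal((M, B))

# fused form: stack them once, multiply once, then slice the result apart
Wxh = np.vstack([Wi, Wo, Wf, Wc])         # shape (4*HN, M)
g = np.dot(Wxh, x)                        # all pre-activations in a single matmul
i_pre, o_pre, f_pre, c_pre = (g[k*HN:(k+1)*HN, :] for k in range(4))

# identical to computing each gate's pre-activation separately
assert np.allclose(i_pre, np.dot(Wi, x))
assert np.allclose(f_pre, np.dot(Wf, x))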

Beyond the forward propagation above, the more impressive part is the backpropagation of the RNN and LSTM, namely backpropagation through time (BPTT); readers are encouraged to study the code and test it for themselves.
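
To give a flavour of what BPTT looks like, here is a minimal sketch of the backward pass for a plain (non-gated) RNN without batching, following the structure of Karpathy's min-char-rnn rather than this repo's LSTM code. It assumes the forward pass stored the one-hot inputs xs[t], tanh hidden states hs[t] (with hs[-1] holding the initial hidden state), and softmax probabilities ps[t], and that targets[t] is the index of the correct next character:

# accumulate gradients backwards through the time steps
dWxh, dWhh, dWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
dbh, dby = np.zeros_like(bh), np.zeros_like(by)
dhnext = np.zeros_like(hs[0])

for t in reversed(range(len(inputs))):
    dy = np.copy(ps[t])
    dy[targets[t]] -= 1                    # gradient of cross-entropy w.r.t. the softmax input
    dWhy += np.dot(dy, hs[t].T)
    dby += dy
    dh = np.dot(Why.T, dy) + dhnext        # gradient into h: from the output and from step t+1
    dhraw = (1 - hs[t] * hs[t]) * dh       # backprop through the tanh nonlinearity
    dbh += dhraw
    dWxh += np.dot(dhraw, xs[t].T)
    dWhh += np.dot(dhraw, hs[t-1].T)
    dhnext = np.dot(Whh.T, dhraw)          # carry the gradient to the previous time step

# clip to mitigate exploding gradients
for dparam in (dWxh, dWhh, dWhy, dbh, dby):
    np.clip(dparam, -5, 5, out=dparam)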

Usage of the Project

Besides reading the source code, we can test the models directly from the command line. First, run the debug script, which exercises the key structures and performs gradient checks:

python dnc-debug.py

Full versions of the following are also provided:

python rnn-numpy.py
python lstm-numpy.py
python dnc-numpy.py

The project's main features: the numerical computation relies solely on NumPy, batch processing is supported, the RNN can be extended into an LSTM, and gradient checking is included.

The project implements an LSTM controller, a 2D memory array, and content-addressable read/write. One known problem is that the softmax over key similarities can crash (division by zero); if this occurs, the run needs to be restarted. Areas that still need to be completed or improved include dynamic memory allocation and deallocation, and faster models that can be saved and reloaded. A common guard against the division-by-zero crash is sketched below.
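
The crash presumably happens when a key or memory row with zero norm enters the cosine-similarity computation used for content-based addressing. The following sketch shows one common guard, adding a small epsilon to the denominator; this is only an assumption about how it could be fixed, not the repo's actual code:

import numpy as np

def content_weights(key, memory, beta, eps=1e-6):
    """Content-based addressing sketch: softmax over cosine similarity.
    key: (W,) lookup key; memory: (N, W) with N slots of width W; beta: key-strength scalar."""
    # eps in the denominator prevents 0/0 when a key or memory row is all zeros
    sim = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + eps)
    z = beta * sim
    z -= z.max()                           # numerically stable softmax
    w = np.exp(z)
    return w / w.sum()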

When sampling output, the script also reports statistics such as elapsed time, iteration count, BPC (bits per character of prediction error; lower is better), and processing speed (char/s).

0: 4163.009 s, iter 104800, 1.2808 BPC, 1488.38 char/s
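
BPC is simply the average cross-entropy loss expressed in bits instead of nats, so dividing the per-character loss in nats by ln 2 gives bits per character. A tiny illustration with made-up numbers (not taken from the repo):

import numpy as np

# hypothetical values: total cross-entropy loss in nats over a batch of characters
total_loss_nats = 1150.0
num_chars = 1280                           # e.g. sequence length * batch size

bpc = total_loss_nats / (num_chars * np.log(2))
print('%.4f BPC' % bpc)                    # lower is better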

The numerical gradient check for backpropagation is shown below. The relative errors in the rightmost column should be less than 1e-4, and the bracketed ranges in the middle show the numerical (n) and analytical (a) gradients, which should agree closely.

GRAD CHECK

Wxh: n = [-1.828500e-02, 5.292866e-03] min 3.005175e-09, max 3.505012e-07
 a = [-1.828500e-02, 5.292865e-03] mean 5.158434e-08 # 10/4
Whh: n = [-3.614049e-01, 6.580141e-01] min 1.549311e-10, max 4.349188e-08
 a = [-3.614049e-01, 6.580141e-01] mean 9.340821e-09 # 10/10
Why: n = [-9.868277e-02, 7.518284e-02] min 2.378911e-09, max 1.901067e-05
 a = [-9.868276e-02, 7.518284e-02] mean 1.978080e-06 # 10/10
Whr: n = [-3.652128e-02, 1.372321e-01] min 5.520914e-09, max 6.750276e-07
 a = [-3.652128e-02, 1.372321e-01] mean 1.299713e-07 # 10/10
Whv: n = [-1.065475e+00, 4.634808e-01] min 6.701966e-11, max 1.462031e-08
 a = [-1.065475e+00, 4.634808e-01] mean 4.161271e-09 # 10/10
Whw: n = [-1.677826e-01, 1.803906e-01] min 5.559963e-10, max 1.096433e-07
 a = [-1.677826e-01, 1.803906e-01] mean 2.434751e-08 # 10/10
Whe: n = [-2.791997e-02, 1.487244e-02] min 3.806438e-08, max 8.633199e-06
 a = [-2.791997e-02, 1.487244e-02] mean 1.085696e-06 # 10/10
Wrh: n = [-7.319636e-02, 9.466716e-02] min 4.183225e-09, max 1.369062e-07
 a = [-7.319636e-02, 9.466716e-02] mean 3.677372e-08 # 10/10
Wry: n = [-1.191088e-01, 5.271329e-01] min 1.168224e-09, max 1.568242e-04
 a = [-1.191088e-01, 5.271329e-01] mean 2.827306e-05 # 10/10
bh: n = [-1.363950e+00, 9.144058e-01] min 2.473756e-10, max 5.217119e-08
 a = [-1.363950e+00, 9.144058e-01] mean 7.066159e-09 # 10/10
by: n = [-5.594528e-02, 5.814085e-01] min 1.604237e-09, max 1.017124e-05
 a = [-5.594528e-02, 5.814085e-01] mean 1.026833e-06 # 10/10

Reported by Machine Heart. Please contact this official account for reprint authorization.
