Author on Zhihu | master苏
Link | https://zhuanlan.zhihu.com/p/139617364
This article introduces the visualization of the structure of LSTM models.
Traditional BP Networks and CNN Networks
BP networks and CNN networks have no time dimension, and they are not much harder to reason about than traditional machine learning algorithms. When a CNN processes a three-channel color image, it can likewise be understood as stacking multiple layers: the image's three-dimensional matrix can be viewed as spatial slices, and when writing code one simply stacks layers by following the diagram. The following images show a typical BP network and a CNN network.

CNN Network
The hidden layers, convolutional layers, pooling layers, and fully connected layers in the diagram all physically exist, stacked one on top of another, so the structure is easy to understand spatially, and writing the code is largely a matter of following the diagram. For example, in Keras it looks like this:
# Example code, for illustration only
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense

model = Sequential()
model.add(Conv2D(32, (3, 3), activation='relu'))  # Add a convolutional layer
model.add(MaxPooling2D(pool_size=(2, 2)))  # Add a pooling layer
model.add(Dropout(0.25))  # Add a dropout layer
model.add(Conv2D(32, (3, 3), activation='relu'))  # Add a convolutional layer
model.add(MaxPooling2D(pool_size=(2, 2)))  # Add a pooling layer
model.add(Dropout(0.25))  # Add a dropout layer
....  # Add other convolution operations
model.add(Flatten())  # Flatten the 3D feature maps into a 2D array
model.add(Dense(256, activation='relu'))  # Add an ordinary fully connected layer
model.add(Dropout(0.5))
model.add(Dense(10, activation='softmax'))
....  # Train the network
LSTM Networks
When we search for LSTM structures online, the most common image we see is the one below:
RNN Network
This is the classic structure diagram of an RNN (recurrent neural network). LSTM only improves the hidden-layer node A; the overall structure is unchanged, so this article discusses the visualization of this structure as well.
The A node in the middle is the hidden layer, and the left diagram shows an LSTM network with a single hidden layer. An LSTM recurrent neural network is "recurrent" along the time axis: unrolling the left diagram along the time axis yields the diagram on the right.
Looking only at the left diagram, many students think LSTM has a single input and a single output, with only one hidden neuron in the network; looking only at the right diagram, they think LSTM has many inputs and outputs and many hidden neurons, with the number of A nodes equal to the number of hidden-layer nodes.
Neither reading is correct; the confusion comes from applying the spatial-structure intuition of traditional networks. In the right diagram, Xt is the input sequence and the subscript t is the time axis. The number of A nodes is therefore the length of the time axis: each A is the state Ht of the same neuron at a different time step, not a separate hidden-layer neuron.
We know that during training an LSTM uses the information from the previous time step together with the current input.
For example: on the first day I got sick (initial state H0) and took medicine (training the network with input X1); on the second day I was better but not fully recovered (H1), took medicine again (X2), and my condition improved further (H2); this repeats until I recover. The input Xt is taking medicine, the time axis T is the number of days of medication, and the hidden state is my condition at each point. Throughout, I am still the same person, just in different states.
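To make the "same neuron, different states" idea concrete, here is a minimal sketch (all sizes are illustrative) that unrolls a single LSTM cell by hand with PyTorch's nn.LSTMCell, carrying the state (h, c) from one time step to the next:

import torch
from torch import nn

cell = nn.LSTMCell(input_size=3, hidden_size=5)  # one cell, reused at every time step
x = torch.randn(7, 4, 3)                         # 7 time steps, batch of 4 samples, 3 features
h = torch.zeros(4, 5)                            # initial hidden state H0
c = torch.zeros(4, 5)                            # initial cell state C0

for t in range(x.shape[0]):    # walk along the time axis
    h, c = cell(x[t], (h, c))  # the same cell updates its state with each new input Xt

The number of loop iterations is the length of the time axis; the cell itself is a single set of weights, not one cell per time step.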
In reality, the LSTM network looks like this:

LSTM Network Structure
The above diagram shows an LSTM network with two hidden layers. Viewed at time T=1 it is an ordinary BP network, and at T=2 it is again an ordinary BP network; but when unrolled along the time axis, the hidden-layer information H and C obtained at T=1 is passed on to the next time step T=2. The five arrows pointing to the right in the diagram indicate this transmission of hidden-layer state along the time axis.
Note that in the diagram H is the hidden state and C is the cell state (the memory regulated by the forget gate); their dimensions are explained later.
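As a sketch of this two-hidden-layer case (the sizes are illustrative), PyTorch's nn.LSTM with num_layers=2 returns one (H, C) pair per hidden layer at the final time step:

import torch
from torch import nn

net = nn.LSTM(input_size=3, hidden_size=5, num_layers=2)  # two stacked hidden layers
x = torch.randn(7, 4, 3)                                  # (seq_len=7, batch=4, input_size=3)
output, (h_n, c_n) = net(x)

print(output.shape)  # torch.Size([7, 4, 5])  top layer's hidden output at every time step
print(h_n.shape)     # torch.Size([2, 4, 5])  final hidden state H, one slice per layer
print(c_n.shape)     # torch.Size([2, 4, 5])  final cell state C, one slice per layer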
Input Structure of LSTM
To understand the structure of LSTM better, we also need to look at the format of its input data. By analogy with a three-channel image, adding a time axis to multi-sample, multi-feature data gives the data cube shown in the diagram below:
Three-dimensional Data Cube
The matrix on the right is the input format we commonly use with models such as XGBoost, LightGBM, and decision trees: a matrix of shape (N*F). The left side is the same data with a time axis added, i.e. the data cube sliced along the time axis. Its dimensions are (N*T*F): the first dimension is the number of samples, the second is time, and the third is the number of features.
This kind of data cube is common in many scenarios. In weather-forecast data, for example, the samples can be understood as cities, the time axis as dates, and the features as weather-related quantities such as rainfall, wind speed, and PM2.5. In NLP, a sentence is embedded into a matrix, with word order serving as the time axis T; the embeddings of multiple sentences then form a three-dimensional matrix.
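As a small sketch of this (N, T, F) layout (the numbers are made up for illustration), a weather-style data cube is just a three-dimensional tensor:

import torch

N, T, F = 10, 30, 5          # e.g. 10 cities, 30 days, 5 weather features (illustrative numbers)
data = torch.randn(N, T, F)  # data cube: (samples, time, features)

print(data.shape)            # torch.Size([10, 30, 5])
print(data[:, 0, :].shape)   # one time slice: every sample's features on day 0 -> torch.Size([10, 5])

An LSTM can consume this layout directly when batch_first=True; by default, PyTorch's LSTM expects the time axis first, i.e. (T, N, F), as described below.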
LSTM in PyTorch
Taking the LSTM defined in PyTorch as an example, the model class is:

class torch.nn.LSTM(*args, **kwargs)

Parameters:
input_size: feature dimension of x
hidden_size: feature dimension of the hidden layer
num_layers: number of LSTM hidden layers, default 1
bias: if False, then b_ih = 0 and b_hh = 0; default True
batch_first: if True, the input/output data format is (batch, seq, feature)
dropout: dropout is applied to the output of every layer except the last, default 0
bidirectional: if True, the LSTM is bidirectional; default False
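As a minimal construction sketch that maps onto the parameter list above (the concrete sizes, 10 input features and 20 hidden units, are placeholders rather than values from the article):

import torch
from torch import nn

# 10 input features, 20 hidden units, 2 stacked layers, unidirectional, batch-first input
lstm = nn.LSTM(input_size=10, hidden_size=20, num_layers=2,
               bias=True, batch_first=True, dropout=0.0, bidirectional=False)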



The input consists of the sequence and the initial states (h0, c0), with the following shapes:

input(seq_len, batch, input_size)
h0(num_layers * num_directions, batch, hidden_size)
c0(num_layers * num_directions, batch, hidden_size)

where:
seq_len: sequence length; in NLP it is the sentence length, usually padded to a uniform length with pad_sequence
batch: the number of samples fed to the network at once; in NLP, how many sentences are fed in at one time
input_size: feature dimension, consistent with the input_size defined in the network structure
num_layers: number of hidden layers
num_directions: 1 for a unidirectional recurrent network, 2 for a bidirectional one
hidden_size: number of hidden-layer neurons

If batch_first=True, the input shape becomes input(batch, seq_len, input_size). Note that batch_first only changes the layout of the input and output tensors; h0 and c0 keep the shape (num_layers * num_directions, batch, hidden_size).

The forward pass returns:

output, (ht, ct) = net(input)

output: the hidden-layer output of the last layer at every time step
ht: the hidden state of each layer at the final time step
ct: the cell state of each layer at the final time step

with shapes:

output(seq_len, batch, hidden_size * num_directions)
ht(num_layers * num_directions, batch, hidden_size)
ct(num_layers * num_directions, batch, hidden_size)

With batch_first=True, output becomes (batch, seq_len, hidden_size * num_directions), while ht and ct are unchanged.
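Putting the input and output formats together, here is a minimal shape check with the default batch_first=False (all sizes are placeholders):

import torch
from torch import nn

seq_len, batch, input_size, hidden_size, num_layers = 5, 3, 10, 20, 2
net = nn.LSTM(input_size, hidden_size, num_layers)  # unidirectional, so num_directions = 1

x = torch.randn(seq_len, batch, input_size)       # input(seq_len, batch, input_size)
h0 = torch.zeros(num_layers, batch, hidden_size)  # h0(num_layers * num_directions, batch, hidden_size)
c0 = torch.zeros(num_layers, batch, hidden_size)  # c0(num_layers * num_directions, batch, hidden_size)

output, (ht, ct) = net(x, (h0, c0))
print(output.shape)  # torch.Size([5, 3, 20]) -> (seq_len, batch, hidden_size * num_directions)
print(ht.shape)      # torch.Size([2, 3, 20]) -> (num_layers * num_directions, batch, hidden_size)
print(ct.shape)      # torch.Size([2, 3, 20]) -> (num_layers * num_directions, batch, hidden_size)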


Implementing the above structure in PyTorch:
import torch
from torch import nn

class RegLSTM(nn.Module):
    def __init__(self, input_size, hidden_size, hidden_num_layers):
        super(RegLSTM, self).__init__()
        # Define the LSTM
        self.rnn = nn.LSTM(input_size, hidden_size, hidden_num_layers)
        # Define the regression layer: its input dimension equals the LSTM output, its output dimension is 1
        self.reg = nn.Sequential(
            nn.Linear(hidden_size, 1)
        )

    def forward(self, x):
        x, (ht, ct) = self.rnn(x)
        seq_len, batch_size, hidden_size = x.shape
        x = x.view(-1, hidden_size)          # merge the time and batch dimensions for the linear layer
        x = self.reg(x)
        x = x.view(seq_len, batch_size, -1)  # restore the (seq_len, batch, 1) shape
        return x
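A quick usage sketch of the RegLSTM class above, continuing from the same imports (the concrete sizes are illustrative):

# 7 input features, 64 hidden units, 2 hidden layers -- illustrative values only
net = RegLSTM(input_size=7, hidden_size=64, hidden_num_layers=2)
x = torch.randn(30, 16, 7)  # (seq_len, batch, input_size)
y = net(x)
print(y.shape)              # torch.Size([30, 16, 1]): one regression output per time step and sample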