Originally from Python AI Frontiers
This article systematically surveys common deep learning models for time-series forecasting, introducing each model and the principles behind its algorithm.
Deep learning methods use neural network models for advanced pattern recognition and automatic feature extraction, and have achieved significant results in data mining in recent years. Common models include not only the basic DNN but also RNN, LSTM, GRU, CNN, Attention-based architectures, and hybrid ("Mix") models.
Compared to complex feature engineering in machine learning, deep learning models require only data preprocessing, network structure design, and hyperparameter tuning to output prediction results. Deep learning algorithms can automatically learn patterns and trends in time-series data, demonstrating excellent expressiveness for complex nonlinear patterns. When applying these models, one must consider data stationarity and periodicity, select appropriate models and parameters, conduct training and testing, and perform tuning and validation.
2 Overview of Deep Learning Algorithms
2.1 RNN Class
In RNN, the input at each moment and the state from previous moments are carefully mapped and merged into a hidden state, which, under the joint action of the current input and prior state, accurately predicts the output at the next moment. A notable feature of RNN is its powerful capability to handle variable-length sequence data, making it adept at processing time-series data and providing unique advantages for time-series forecasting. Moreover, to further enhance the model’s expressiveness and memory capacity, RNN can cleverly incorporate advanced gating mechanisms such as LSTM, GRU, and SRU, thus constructing more powerful and flexible neural network models.
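To make the recurrence concrete, here is a minimal numpy sketch of a single vanilla RNN step unrolled over time; all names and dimensions are illustrative rather than taken from any particular library:

import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    # The new hidden state mixes the current input with the previous state.
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

rng = np.random.default_rng(0)
T, d_in, d_h = 12, 3, 8          # sequence length, input dim, hidden dim
X = rng.normal(size=(T, d_in))   # a toy multivariate time series
W_xh = rng.normal(scale=0.1, size=(d_in, d_h))
W_hh = rng.normal(scale=0.1, size=(d_h, d_h))
b_h = np.zeros(d_h)

h = np.zeros(d_h)
for t in range(T):               # unfold the network over the time dimension
    h = rnn_step(X[t], h, W_xh, W_hh, b_h)
# h now summarizes the whole history and could feed a linear output layer.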
2.1.1 RNN (1990)
Paper: Finding Structure in Time
RNN (Recurrent Neural Network) is a powerful deep learning model widely used in time-series forecasting tasks. By unfolding the neural network over the time dimension, it effectively transmits historical information to the future, thereby addressing the inherent temporal dependencies and dynamic changes in time-series data. When constructing RNN models, LSTM and GRU models are favored for their ability to handle long sequences and accurately capture temporal dependencies in time-series data through memory cells and gating mechanisms.
2.1.2 LSTM (1997)
Paper: Long Short-Term Memory
LSTM (Long Short-Term Memory) is a commonly used recurrent neural network model frequently employed for time-series forecasting. Compared to the basic RNN model, LSTM has stronger memory and long-term dependency capabilities, allowing it to better handle temporal dependencies and dynamic changes in time-series data. In constructing the LSTM model, the design and parameter tuning of LSTM units are crucial. The design of LSTM units can influence the model’s memory capacity and long-term dependency capabilities, while parameter tuning affects the model’s prediction accuracy and robustness.
LSTMmodel = RNNModel(
    model="LSTM",
    hidden_dim=60,
    dropout=0,
    batch_size=100,
    n_epochs=200,
    optimizer_kwargs={"lr": 1e-3},
    model_name="Air_RNN",
    log_tensorboard=True,
    random_state=42,
    training_length=20,
    input_chunk_length=60,
    force_reset=True,
    save_checkpoints=True,
)
2.1.3 GRU (2014)
Paper: Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation
GRU (Gated Recurrent Unit) is a commonly used recurrent neural network model, structurally similar to LSTM, specifically designed to capture deep information in time-series data. Compared to LSTM, GRU maintains the ability to handle temporal dependencies and dynamic changes while having a more streamlined parameter count and faster computation speed. Its core lies in the clever design of the GRU unit and the precise tuning of its parameters. The design of the GRU unit not only affects the model’s memory capacity but also has profound implications for its ability to capture long-term dependencies; precise tuning of parameters directly relates to the model’s prediction accuracy and robustness.
GRUmodel = RNNModel(
    model="GRU",
    hidden_dim=60,
    dropout=0,
    batch_size=100,
    n_epochs=200,
    optimizer_kwargs={"lr": 1e-3},
    model_name="Air_RNN",
    log_tensorboard=True,
    random_state=42,
    training_length=20,
    input_chunk_length=60,
    force_reset=True,
    save_checkpoints=True,
)
2.1.4 SRU (2018)
Paper: Simple Recurrent Units for Highly Parallelizable Recurrence
SRU (Simple Recurrent Unit) is an innovative recurrent neural network model designed based on efficient matrix computation for processing time-series data. Compared to traditional LSTM and GRU models, SRU significantly reduces the parameter count while maintaining efficient handling of temporal dependencies and dynamic changes, thus enhancing computational speed. The performance of the SRU model relies on the cleverness of its unit design and the precision of parameter tuning. Well-designed SRU units can strengthen the model’s memory capacity and long-term dependency capturing ability, while precise tuning of parameters plays a crucial role in the model’s prediction accuracy and robustness.
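The reason SRU parallelizes so well is that all of its matrix multiplications depend only on the inputs and can therefore be computed for every time step at once; only cheap elementwise operations remain sequential. The following numpy sketch is a simplified rendering of the recurrence in Lei et al. (variable names are ours, and the input and hidden dimensions are kept equal so the highway connection needs no projection):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
T, d = 16, 8
X = rng.normal(size=(T, d))
W, W_f, W_r = (rng.normal(scale=0.1, size=(d, d)) for _ in range(3))
v_f, v_r, b_f, b_r = (np.zeros(d) for _ in range(4))

# The expensive matrix products are batched across all time steps up front.
U, U_f, U_r = X @ W, X @ W_f, X @ W_r

c = np.zeros(d)
H = np.empty_like(X)
for t in range(T):                         # only elementwise ops are sequential
    f = sigmoid(U_f[t] + v_f * c + b_f)    # forget gate
    r = sigmoid(U_r[t] + v_r * c + b_r)    # reset gate
    c = f * c + (1.0 - f) * U[t]           # light recurrence on the cell state
    H[t] = r * c + (1.0 - r) * X[t]        # highway connection to the raw input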
2.2 CNN Class
CNN, with its unique structure of convolutional and pooling layers, can automatically extract key features from time-series data, achieving efficient and accurate time-series forecasting. In practical applications, we need to convert one-dimensional time-series data into two-dimensional matrix form and use CNN’s convolution and pooling operations for feature extraction and compression. Finally, predictions are made through fully connected layers. Compared to traditional time-series forecasting methods, CNN has gradually become a leader in the field of time-series forecasting due to its powerful feature learning ability, efficient computational efficiency, and excellent prediction accuracy.
2.2.1 WaveNet (2016)
Paper: WaveNet: A Generative Model for Raw Audio
WaveNet, proposed by the DeepMind team in 2016, is a groundbreaking neural network model whose core idea is to use convolutional neural networks to simulate the waveform characteristics of audio signals. Through the integration of residual connections and gated convolution operations, WaveNet significantly enhances the model’s representational ability. Besides excelling in the field of speech generation, WaveNet is also suitable for time-series forecasting tasks. In practical applications, we can treat time-series as one-dimensional vectors and input them into the WaveNet model to achieve accurate predictions of future time steps.
When constructing the WaveNet model, the design of convolutional layers and parameter tuning is crucial. Well-designed convolutional layers can enhance the model’s expressiveness and generalization ability, while precise tuning of parameters directly relates to the model’s prediction accuracy and stability.
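The following PyTorch sketch shows the two ingredients named above, a gated activation and a residual connection around a dilated causal convolution; it is a minimal illustration of the mechanism, not DeepMind's implementation:

import torch
import torch.nn as nn

class CausalDilatedBlock(nn.Module):
    """One WaveNet-style block: gated dilated causal conv plus a residual path."""
    def __init__(self, channels, dilation):
        super().__init__()
        self.pad = (2 - 1) * dilation   # left-pad so no future information leaks in
        self.filt = nn.Conv1d(channels, channels, kernel_size=2, dilation=dilation)
        self.gate = nn.Conv1d(channels, channels, kernel_size=2, dilation=dilation)
        self.res = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x):               # x: (batch, channels, time)
        y = nn.functional.pad(x, (self.pad, 0))
        z = torch.tanh(self.filt(y)) * torch.sigmoid(self.gate(y))  # gated activation
        return x + self.res(z)          # residual connection

x = torch.randn(4, 16, 128)             # a toy batch: 16-channel embedding of a series
block = CausalDilatedBlock(channels=16, dilation=4)
print(block(x).shape)                   # torch.Size([4, 16, 128])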
2.2.2 TCN (2018)
Paper: An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling
TCN, as a temporal convolutional network, achieves efficient processing of time-series data by introducing causal convolution and residual connections. It can not only capture long-term dependencies in sequences but also possesses excellent parallel computing capabilities, significantly improving the efficiency and accuracy of time-series forecasting. In practical applications, TCN demonstrates strong potential and broad application prospects, bringing new breakthroughs and development directions to the field of time-series forecasting.
TCN (Temporal Convolutional Network) is an innovative time-series forecasting algorithm built on convolutional neural networks, aiming to address the gradient vanishing and excessive computational complexity issues often encountered by traditional RNNs when processing long sequences. Compared to traditional RNNs and other sequence models, TCN, leveraging its convolutional neural network characteristics, can more efficiently capture long-term dependencies while exhibiting excellent parallel computing capabilities.
The TCN model consists of carefully designed convolutional layers and residual connections. Each convolutional layer is responsible for extracting deep features from sequence data and passing them to the next layer, thus achieving gradual abstraction and feature extraction of the data. The model also cleverly incorporates residual connection techniques similar to ResNet, effectively mitigating gradient vanishing and model degradation issues. Additionally, the application of dilated convolution further broadens the receptive field of the convolutional kernels, enhancing the model’s robustness and accuracy.
The prediction process of this model is orderly, with specific steps as follows:
- Input Layer: Receives the time-series input, laying the foundation for subsequent processing.
- Convolutional Layer: Uses one-dimensional convolution to extract and abstract features from the input data. Each convolutional layer contains multiple kernels capable of capturing time-series patterns at different scales.
- Residual Connection: Drawing on the design principles of ResNet, it combines the outputs of convolutional layers with their inputs, alleviating gradient vanishing and model degradation and enhancing the model's stability.
- Repeated Stacking: By stacking multiple convolutional layers and residual connections, the model progressively extracts abstract features from the time-series data.
- Pooling Layer: A global average pooling layer deep in the model averages all feature vectors into a fixed-length feature vector.
- Output Layer: Fully connected layers convert the output of the pooling layer into prediction values for the time series.
The advantages of the TCN model are significant:
- It can effectively handle long-sequence data and parallelizes well.
- Residual connections and dilated convolutions help avoid gradient vanishing and overfitting.
- Compared to traditional RNN models, the TCN model excels in both computational efficiency and prediction accuracy.
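As a minimal illustration of the dilated causal convolutions described above, the sketch below stacks layers whose dilation doubles at each level, so the receptive field grows exponentially with depth (residual connections are omitted here for brevity; the full TCN adds them around each level):

import torch
import torch.nn as nn

def tcn_stack(channels, levels, kernel_size=3):
    """Dilated causal conv layers; dilation doubles per level."""
    layers = []
    for i in range(levels):
        d = 2 ** i
        layers += [
            nn.ConstantPad1d(((kernel_size - 1) * d, 0), 0.0),  # causal left-padding
            nn.Conv1d(channels, channels, kernel_size, dilation=d),
            nn.ReLU(),
        ]
    return nn.Sequential(*layers)

net = tcn_stack(channels=8, levels=4)
x = torch.randn(2, 8, 100)
print(net(x).shape)        # torch.Size([2, 8, 100])
# receptive field: 1 + (3 - 1) * (1 + 2 + 4 + 8) = 31 past steps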
2.2.3 DeepTCN (2019)
Paper: Probabilistic Forecasting with Temporal Convolutional Neural Network
Code: deepTCN
DeepTCN (Deep Temporal Convolutional Networks) is a deep learning-based time-series forecasting model that deepens and expands upon the traditional TCN model. This model utilizes a set of carefully designed 1D convolutional layers and max pooling layers for deep processing of time-series data, and through stacking layers, it extracts different features. In DeepTCN, each convolutional layer is equipped with multiple 1D convolutional kernels and activation functions, while incorporating residual connections and batch normalization techniques to accelerate the training process of the model.
The training process of the DeepTCN model is rigorous and efficient, specifically including the following steps:
- Data Preprocessing: Standardizing and normalizing the raw time-series data to eliminate the impact of differing feature scales on model training.
- Model Construction: Using a deep learning framework (such as TensorFlow or PyTorch) to build a DeepTCN model containing multiple 1D convolutional layers and max pooling layers.
- Model Training: Training the DeepTCN model on the training set, measuring prediction performance with a loss function (such as MSE or RMSE). During training, an optimization algorithm (such as SGD or Adam) adjusts the model parameters, while techniques such as batch normalization and dropout enhance the model's generalization ability.
- Model Evaluation: Evaluating the trained model on the test set and computing performance metrics such as Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE).
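As its paper title indicates, DeepTCN produces probabilistic rather than point forecasts. One way this is commonly realized, including in the paper's non-parametric variant, is quantile regression with the pinball loss, sketched here in numpy for illustration:

import numpy as np

def pinball_loss(y_true, y_pred, q):
    """Quantile (pinball) loss: one output head is trained per quantile q."""
    diff = y_true - y_pred
    return np.mean(np.maximum(q * diff, (q - 1) * diff))

y_true = np.array([10.0, 12.0, 9.0])
y_p50 = np.array([9.5, 12.5, 9.0])
print(pinball_loss(y_true, y_p50, q=0.5))   # at q = 0.5 this is half the MAE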
2.3 Attention Class
The Attention mechanism, as a core method for extracting important features from sequence input data, also plays a crucial role in time-series forecasting. The Attention mechanism can precisely focus on key parts of time-series data, providing the model with more valuable information, thereby significantly improving prediction accuracy. When using Attention for time-series forecasting, we need to leverage its ability to adaptively adjust the weights of different parts of the input data, allowing the model to concentrate more on core information while diminishing the interference from irrelevant information. The Attention mechanism is applicable not only to sequential models like RNN but also performs excellently in non-sequential models like CNN, making it a focal point of research in the current time-series forecasting field.
2.3.1 Transformer (2017)
Paper: Attention Is All You Need
The Transformer is a neural network model that has excelled in the field of natural language processing (NLP), fundamentally characterized as a sequence-to-sequence (seq2seq) mapping model. The Transformer treats each position in the sequence as an independent vector and utilizes multi-head self-attention mechanisms and feedforward neural networks to deeply explore the long-range dependencies contained within sequences, ensuring that the model can flexibly handle sequences of varying lengths.
In time-series forecasting tasks, the Transformer model can cleverly convert the time steps of the input sequence into positional information, thereby expressing the features of each time step in vector form. With the encoder-decoder framework, the Transformer model can efficiently complete prediction tasks. Specifically, we take the previous N time steps of the prediction target as the input to the encoder, while the subsequent M time steps of the prediction target serve as the input to the decoder, thereby utilizing the encoder-decoder framework for precise forecasting. Both the encoder and decoder are composed of multiple stacked Transformer modules, each of which consists of multi-head self-attention layers and feedforward neural network layers.
During the training phase, we can choose classic loss functions such as Mean Squared Error (MSE) or Mean Absolute Error (MAE) to measure the model’s prediction performance, while continuously adjusting model parameters using optimization algorithms like Stochastic Gradient Descent (SGD) or Adam to optimize performance. Additionally, to enhance training efficiency and model performance, we can employ advanced techniques such as learning rate adjustment and gradient clipping.
model = TransformerModel(
    input_chunk_length=30,
    output_chunk_length=15,
    batch_size=32,
    n_epochs=200,
    model_name="air_transformer",
    nr_epochs_val_period=10,
    d_model=16,
    nhead=8,
    num_encoder_layers=2,
    num_decoder_layers=2,
    dim_feedforward=128,
    dropout=0.1,
    optimizer_kwargs={"lr": 1e-2},
    activation="relu",
    random_state=42,
    save_checkpoints=True,
    force_reset=True,
)
2.3.2 TFT (2019)
Paper: Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting
TFT (Temporal Fusion Transformers) is a Transformer model that integrates time fusion, providing a powerful tool for interpretable multi-scale time series forecasting. By introducing a time fusion mechanism, TFT effectively enhances the model’s prediction accuracy in complex time series data. This model can address single time scale forecasting needs and also handle multi-scale time series forecasting problems, further expanding the model’s application scope. Moreover, the strong interpretability of the TFT model makes the prediction results more persuasive, providing robust support for decision-making.
TFT was proposed in 2019 by researchers at Google Cloud AI together with academic collaborators. Its core idea lies in the integration of temporal feature embeddings and static covariate embeddings, enabling the Transformer to capture periodic and trend features in time series data more accurately while also taking external influencing factors (such as temperature or holidays) into account in its predictions.
The TFT method consists of two main phases: training and forecasting. In the training phase, the method utilizes rich training data to refine the Transformer model and effectively enhance the model’s robustness and training efficiency through strategies such as random masking and adaptive learning rate adjustment. In the forecasting phase, the trained model can accurately foresee future trends in time series data.
Compared to traditional time series forecasting methods, the TFT method has unique advantages:
- It can flexibly handle time series data of different scales, as the Transformer model excels at capturing both global and local features of time series.
- It can jointly consider the time series and external influencing factors, thereby improving prediction accuracy.
- It does not require manual feature extraction; the prediction model is learned directly through end-to-end training.
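Keeping with the Darts-style snippets used elsewhere in this article, a TFT model can be configured roughly as follows; the hyperparameter values are illustrative, and add_relative_index=True is one way to let the model run without supplying explicit future covariates:

from darts.models import TFTModel

model = TFTModel(
    input_chunk_length=30,
    output_chunk_length=15,
    hidden_size=32,
    lstm_layers=1,
    num_attention_heads=4,
    dropout=0.1,
    batch_size=32,
    n_epochs=100,
    add_relative_index=True,
    random_state=42,
)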
2.3.3 HT (2019)
HT (Hierarchical Transformer), another notable model in the time series forecasting field, was proposed by a research team from The Chinese University of Hong Kong in 2019. This model adopts a hierarchical structure aimed at processing time series data with multiple time scales. Through an adaptive attention mechanism, the HT model can accurately capture features at different time scales, significantly enhancing prediction performance and generalization ability.
The HT model consists of two core components: a multi-scale attention module and a prediction module. In the multi-scale attention module, the HT model employs an adaptive multi-head attention mechanism to effectively fuse features from different time scales, forming a unified feature representation. In the prediction module, the features are finely predicted through fully connected layers, yielding the final prediction results.
The excellence of the HT model lies in its ability to adaptively process multi-time scale time series data and its capacity to accurately capture features through the adaptive multi-head attention mechanism. This enables it to exhibit exceptional prediction performance and good generalization ability in time series forecasting tasks, while also providing good interpretability, making it suitable for various time series forecasting scenarios.
2.3.4 LogTrans (2019)
Paper: Enhancing the Locality and Breaking the Memory Bottleneck of Transformer on Time Series Forecasting
LogTrans is a method aimed at enhancing locality and breaking the memory bottleneck of the Transformer in time series forecasting, offering new ideas and directions for research and application in the field. It combines a convolutional self-attention mechanism with the LogSparse Transformer. The convolutional self-attention mechanism incorporates local context into attention by generating queries and keys with causal convolutions, strengthening the model's ability to capture local features in time series data. The LogSparse Transformer, an efficient variant of the Transformer, restricts each position to attending only to positions at exponentially spaced offsets, which reduces the memory cost of modeling long time series and makes the model more efficient on large-scale data. Together, these two components address the two major issues the Transformer faces in time series forecasting: locality-insensitive, position-agnostic attention and the memory bottleneck.
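A small sketch of the LogSparse pattern, our own illustration rather than the authors' code: each query position keeps only O(log T) keys, namely itself and positions at exponentially spaced offsets into the past:

import numpy as np

def logsparse_mask(T):
    """Each step attends to itself and to steps 1, 2, 4, 8, ... back,
    giving O(log T) keys per query instead of O(T)."""
    mask = np.zeros((T, T), dtype=bool)
    for t in range(T):
        mask[t, t] = True
        step = 1
        while t - step >= 0:
            mask[t, t - step] = True
            step *= 2
    return mask

print(logsparse_mask(8).astype(int))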
2.3.5 DeepTTF (2020)
DeepTTF (Deep Temporal Transformational Factorization) is an advanced time series forecasting algorithm proposed by researchers at UCLA, based on deep learning and matrix factorization techniques. This method cleverly decomposes time series into multiple independent time segments and utilizes matrix factorization techniques to deeply analyze the potential relationships within each time segment, significantly enhancing the model’s prediction accuracy and interpretability.
The core architecture of the DeepTTF model includes three key links: time segmentation, matrix factorization, and predictor. In the time segmentation phase, the model divides complex time series into multiple manageable subsequences for detailed processing. In the matrix factorization phase, DeepTTF employs advanced matrix factorization techniques to effectively reveal the complex interactions between time and features within the time series. Finally, the predictor utilizes a multi-layer perceptron to make precise predictions on the decomposed subsequences and generates the final prediction results through clever combination strategies.
The uniqueness of the DeepTTF model lies in its powerful local pattern capturing ability and global trend analysis capability, allowing it to maintain high prediction accuracy even when facing complex and dynamic time series data. Additionally, this model supports a time-segmented cross-validation strategy, further enhancing the model’s robustness and generalization ability, making it perform excellently in various time series forecasting tasks.
2.3.6 PTST (2020)
PTST (Probabilistic Time Series Transformer) is an innovative time series forecasting algorithm proposed by Google Brain in 2020, based on the Transformer model and combined with probabilistic graphical models, aimed at improving the accuracy and reliability of time series forecasting. PTST successfully captures uncertainty and noise in time series data by introducing probabilistic graphical models, enhancing the model’s performance when dealing with uncertain time series data.
The core architecture of the PTST model consists of two modules: the sequence model and the probabilistic model. The sequence model employs an advanced Transformer structure, achieving deep encoding and decoding of time series data through multi-layer self-attention mechanisms. The probabilistic model innovatively introduces Variational Autoencoders (VAE) and Kalman Filters (KF), working together to handle noise and optimize smoothing in time series data.
In the training process, PTST employs the Maximum A Posteriori (MAP) estimation method to maximize the probability of the prediction results. In the forecasting phase, PTST utilizes Monte Carlo sampling methods to draw samples from the posterior distribution, generating a set of probability distributions to provide rich information support for decision-makers. Furthermore, PTST incorporates loss functions such as Mean Squared Error and Negative Log Likelihood (NLL) to comprehensively assess the model’s prediction performance.
2.3.7 Reformer (2020)
Reformer is an efficient Transformer model proposed in 2020 that significantly reduces the computational and storage resource consumption of the model while maintaining the superior performance of the Transformer. Reformer achieves efficient processing of long sequences by introducing innovative techniques such as Locality-Sensitive Hashing (LSH) attention and reversible layers, making the Transformer model more lightweight and efficient when handling large-scale time series data.
LSH attention is one of Reformer's core techniques: it reduces the time and memory complexity of attention from O(n²) to O(n log n), significantly improving efficiency on long sequences. In addition, reversible residual layers let activations be recomputed during the backward pass rather than stored, reducing memory consumption during training and further easing resource pressure.
These techniques make Reformer an efficient, high-performance option for tasks such as time series forecasting over large-scale data. Used like other Transformer variants, it conditions on known historical time steps and can produce forecasts through sampling, autoregression, or multi-step decoding, while LSH attention and reversible layers keep computation and memory manageable as sequences grow. The Reformer model thus brings a new dimension and solution strategy to time series forecasting tasks.
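The heart of LSH attention is bucketing: similar query/key vectors should hash to the same bucket, so attention can then be restricted to positions within a bucket. The numpy sketch below illustrates the angular hashing scheme with random projections (illustrative only, not the Reformer source):

import numpy as np

def lsh_buckets(x, n_buckets=8, seed=0):
    """Angular LSH: a random projection assigns similar vectors to the same
    bucket; Reformer then computes attention only within each bucket."""
    rng = np.random.default_rng(seed)
    R = rng.normal(size=(x.shape[-1], n_buckets // 2))
    proj = x @ R                                  # (T, n_buckets/2)
    h = np.concatenate([proj, -proj], axis=-1)    # (T, n_buckets)
    return np.argmax(h, axis=-1)                  # bucket id per position

x = np.random.default_rng(1).normal(size=(16, 32))  # 16 positions, dim 32
print(lsh_buckets(x))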
2.3.8 Informer (2020)
Paper: Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting
Code: https://github.com/zhouhaoyi/Informer2020
Informer, a time series forecasting method based on the Transformer model, was proposed in 2020 by Haoyi Zhou and colleagues at Beihang University. It is not a simple improvement of the traditional Transformer model, but rather integrates several innovative structures and mechanisms designed specifically for long-sequence forecasting.
The core ideas of Informer are:
- ProbSparse self-attention: a sparsity measure identifies the small set of "active" queries that dominate attention, and only those queries attend in full, reducing the time and memory complexity of self-attention from O(L²) to O(L log L).
- Self-attention distilling: convolution and pooling operations between encoder layers halve the sequence length at each level, privileging dominant features and allowing much longer inputs under the same memory budget.
- Generative-style decoder: the entire forecast horizon is produced in a single forward pass rather than step by step, avoiding the error accumulation of autoregressive decoding on long horizons.
During the training phase, Informer supports various loss functions to guide model learning and utilizes the Adam optimization algorithm to precisely adjust model parameters. In the forecasting phase, Informer employs a sliding window technique to achieve precise predictions for future time points. After validation and comparison across multiple time series forecasting datasets, Informer has demonstrated exceptional performance in prediction accuracy, training speed, and computational efficiency.
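A sketch of the ProbSparse idea: each query gets a sparsity score, the maximum minus the mean of its attention logits, and only the top-u queries attend in full. This simplified numpy version computes all logits exactly, whereas the paper approximates the scores by sampling keys:

import numpy as np

def probsparse_topu(Q, K, u):
    """Return the indices of the u most 'active' queries."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])        # (n_q, n_k) attention logits
    m = scores.max(axis=1) - scores.mean(axis=1)   # sparsity measure per query
    return np.argsort(m)[-u:]

rng = np.random.default_rng(0)
Q, K = rng.normal(size=(96, 64)), rng.normal(size=(96, 64))
print(probsparse_topu(Q, K, u=10))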
2.3.9 TAT (2021)
TAT (Temporal Attention Transformer) is a time series forecasting algorithm proposed by the Intelligent Science Laboratory at Peking University, which innovatively expands upon the traditional Transformer model. Its core lies in the introduction of a temporal attention mechanism to more accurately capture dynamic changes in time series.
The TAT model structurally inherits the classic design of the Transformer, including multiple Encoder and Decoder layers. Each Encoder layer integrates multi-head self-attention mechanisms with feedforward networks, effectively extracting key information from the input sequence. The Decoder layer, based on the self-attention mechanism, adds attention to the Encoder’s output and generates the final prediction results through the feedforward network. Notably, the TAT model innovatively incorporates a temporal attention mechanism within the multi-head attention mechanism. This mechanism enables the model to incorporate time step information as an additional feature, allowing it to more sensitively capture dynamic changes in time series. This design not only enhances the model’s ability to model complex time series but also lays a solid foundation for its excellent performance in time series forecasting tasks.
Moreover, the TAT model adopts incremental training techniques to improve training efficiency and prediction performance. The introduction of this technique allows the model to achieve faster and more accurate convergence within limited time and computational resources, thus meeting the practical application needs for time series forecasting.
2.3.10 NHT (2021)
Paper: Nested Hierarchical Transformer: Towards Accurate, Data-Efficient, and Interpretable Visual Understanding
NHT, as a cutting-edge time series forecasting method, has garnered widespread attention in recent years. Its uniqueness lies in combining deep learning with traditional time series analysis methods, providing a new solution approach for forecasting tasks. The NHT model can effectively capture complex dynamic changes in time series while possessing powerful feature extraction and pattern recognition capabilities, and with continued research it is expected to play an increasingly important role in time series forecasting.
NHT (Nested Hierarchical Transformer) is a deep learning algorithm specifically applied to time series forecasting. This algorithm cleverly integrates a nested hierarchical transformer structure, combining multi-layer nested self-attention mechanisms with time importance evaluation mechanisms to achieve precise insights and predictions for time series data. The NHT model upgrades the traditional self-attention mechanism by introducing more hierarchical structures, along with a dynamic regulation mechanism, namely the time importance evaluation mechanism, effectively controlling the weight distribution of different layers during the prediction process, thereby significantly enhancing prediction performance.
2.3.11 Autoformer (2021)
Paper: Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting
Code: https://github.com/thuml/Autoformer
Autoformer is an innovative time series forecasting model built on the Transformer structure. Compared to traditional models like RNN and LSTM, it has the following significant advantages:
- Auto-Correlation mechanism: in place of standard dot-product self-attention, Autoformer discovers period-based dependencies by measuring the autocorrelation of the series and aggregating similar sub-series, capturing global and local relationships while avoiding the gradient problems that long-sequence training can cause.
- Built-in series decomposition: decomposition blocks embedded in the architecture progressively separate trend and seasonal components during forecasting, which stabilizes predictions over long horizons.
- Transformer structure: Autoformer retains the parallel computing capabilities of the Transformer, giving high training efficiency on large-scale data.
The model comprises two core components, an encoder and a decoder. The encoder stacks Auto-Correlation layers and feedforward layers to extract seasonal patterns from the input sequence; the decoder progressively refines the trend and seasonal parts and converts them into the prediction sequence. Overall, Autoformer is an efficient and accurate forecasting model suited to complex, long-horizon time series forecasting tasks.
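The Auto-Correlation mechanism rests on the Wiener-Khinchin theorem: autocorrelation at every lag can be computed at once via the FFT. The numpy sketch below shows the lag-selection step on a synthetic daily-periodic series; the real block additionally aggregates the time-delayed sub-series it finds:

import numpy as np

def top_autocorrelation_lags(x, k=3):
    """Circular autocorrelation via FFT; return the k most similar lags."""
    n = len(x)
    f = np.fft.rfft(x - x.mean())
    acf = np.fft.irfft(f * np.conj(f), n=n)[: n // 2]
    return np.argsort(acf[1:])[::-1][:k] + 1       # skip lag 0

t = np.arange(200)
x = np.sin(2 * np.pi * t / 24) + 0.1 * np.random.default_rng(0).normal(size=200)
print(top_autocorrelation_lags(x))                 # dominant lags cluster near 24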
2.3.12 Pyraformer (2022)
Paper: Pyraformer: Low-Complexity Pyramidal Attention for Long-Range Time Series Modeling and Forecasting
Code: https://github.com/ant-research/Pyraformer
Pyraformer is a novel time series forecasting model that utilizes a unique low-complexity pyramid attention mechanism, providing an efficient and accurate solution for long-range time series modeling and forecasting. This model constructs a pyramid attention structure to finely capture and process information at different time scales, effectively addressing the challenges of long sequence forecasting. The outstanding performance of Pyraformer has made it a research hotspot and frontier direction in various time series forecasting tasks.
Pyraformer, proposed by researchers at Ant Group, is a Transformer model based on pyramidal attention, aimed at bridging the gap between capturing long-range dependencies and achieving low space-time complexity. Specifically, Pyraformer constructs a pyramidal graph and propagates information along it via attention. Edges in this graph fall into two groups: inter-scale connections and intra-scale connections. Inter-scale connections build multi-resolution representations of the original sequence: nodes at the finest scale correspond to individual time points in the original series (e.g., hourly observations), while nodes at coarser scales represent lower-resolution features (e.g., daily, weekly, and monthly patterns). These coarse-scale nodes are introduced through a coarse-scale construction module. Intra-scale edges, in turn, connect adjacent nodes within a scale, capturing temporal correlations at each resolution. By capturing such behavior at coarser resolutions, Pyraformer shortens signal traversal paths, yielding a concise and effective representation of long-range dependencies between distant positions. Moreover, because intra-scale connections are sparse and local, the model covers different ranges of temporal dependencies at different scales while significantly reducing computational cost.
2.3.13 FEDformer (2022)
Paper: FEDformer: Frequency Enhanced Decomposed Transformer for Long-term Series Forecasting
Code: https://github.com/MAZiqing/FEDformer
FEDformer (Frequency Enhanced Decomposed Transformer) is an innovative Transformer architecture designed for long-term series forecasting. Rather than attending purely in the time domain, it combines seasonal-trend decomposition with frequency-enhanced blocks: the series is decomposed into trend and seasonal components, and attention is computed on a small, randomly selected subset of Fourier (or wavelet) modes. Because only a fixed number of frequency modes is retained, the model achieves complexity linear in the sequence length while still capturing the global profile of the series, making it both accurate and efficient over long horizons.
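A toy sketch of the frequency-enhancement idea: transform the series to the frequency domain, keep only a small random subset of Fourier modes, and transform back. Because the number of kept modes is fixed, cost grows linearly with sequence length (illustrative numpy, not the paper's code):

import numpy as np

def frequency_enhanced_filter(x, n_modes=8, seed=0):
    """Keep a random subset of Fourier modes and zero out the rest."""
    rng = np.random.default_rng(seed)
    f = np.fft.rfft(x)
    keep = rng.choice(len(f), size=min(n_modes, len(f)), replace=False)
    mask = np.zeros(len(f), dtype=bool)
    mask[keep] = True
    return np.fft.irfft(np.where(mask, f, 0), n=len(x))

x = np.random.default_rng(1).normal(size=96)
print(frequency_enhanced_filter(x).shape)          # (96,)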
2.3.14 Crossformer (2023)
Paper: Crossformer: Transformer Utilizing Cross-Dimension Dependency for Multivariate Time Series Forecasting
Code: https://github.com/Thinklab-SJTU/Crossformer
Crossformer introduces a hierarchical Encoder-Decoder architecture integrating three innovative elements: Dimension-Segment-Wise (DSW) embedding, Two-Stage Attention (TSA) layers, and Linear Projection. This design enables Crossformer to explicitly exploit cross-dimension dependencies, bringing substantial performance improvements to multivariate time series forecasting tasks.
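A sketch of the first step of DSW embedding: each variable's series is cut into equal-length segments, and each (variable, segment) pair then becomes one token for the subsequent two-stage attention (shapes are illustrative):

import torch

series = torch.randn(8, 96, 7)       # (batch, time steps, variables)
seg_len = 12
# unfold the time axis into non-overlapping segments of length seg_len
segments = series.unfold(dimension=1, size=seg_len, step=seg_len)
print(segments.shape)                # torch.Size([8, 8, 7, 12])
# 8 segments x 7 variables = 56 tokens per sample, each a 12-step patch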
2.4 Mix Class
By integrating ETS, autoregressive models, RNN, CNN, and Attention algorithms, one can fully leverage their respective advantages, significantly enhancing the precision and stability of time series forecasting. This clever fusion strategy is often referred to in the industry as a “mixed model.”
In mixed models, RNN is uniquely capable of automatically capturing the intricate long-term dependencies in time series data; CNN excels at mining local and spatial features in time series data due to its superior feature extraction capability; while the Attention mechanism precisely focuses on key parts of time series data with its flexible adaptability. By organically combining these algorithms, we can construct more robust and precise time series forecasting models.
In practical applications, we can flexibly choose suitable algorithm fusion methods based on different time series forecasting scenarios and meticulously tune and optimize the models to ensure they achieve optimal performance.
2.4.1 Encoder-Decoder CNN (2017)
Paper: “Deep Learning for Precipitation Nowcasting: A Benchmark and A New Model”
Encoder-Decoder CNN is an advanced model designed for time series forecasting tasks that cleverly combines encoders and decoders with convolutional neural networks. In this model, the encoder is responsible for deeply mining the intrinsic features of time series data, while the decoder generates future time series data based on these features.
The specific process of the Encoder-Decoder CNN model for time series forecasting is as follows:
- First, historical time series data is used as input, and convolutional layers extract its features.
- Next, the feature sequence from the convolutional layers is passed to the encoder, which gradually reduces the feature dimensions through pooling operations while preserving the encoder's state vector.
- Then, the encoder's state vector is fed into the decoder, which gradually reconstructs future time series data through deconvolution and upsampling operations.
- Finally, necessary post-processing (such as mean removal or normalization) is applied to the decoder's output to obtain the final prediction results.
Notably, during the training process of the Encoder-Decoder CNN model, appropriate loss functions (such as mean squared error or cross-entropy) should be selected, and hyperparameters adjusted according to actual needs. Additionally, to enhance the model’s generalization ability, techniques such as cross-validation should be employed for model evaluation and selection.
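The steps above can be made concrete with a toy PyTorch encoder-decoder; the layer sizes and the 30-step window are arbitrary choices for illustration:

import torch
import torch.nn as nn

encoder = nn.Sequential(
    nn.Conv1d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool1d(2),                     # pooling compresses 30 steps to 15
    nn.Conv1d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
)
decoder = nn.Sequential(
    nn.ConvTranspose1d(32, 16, kernel_size=2, stride=2),  # upsample 15 back to 30
    nn.ReLU(),
    nn.Conv1d(16, 1, kernel_size=3, padding=1),           # back to one channel
)

x = torch.randn(8, 1, 30)                # (batch, channels, history length)
y_hat = decoder(encoder(x))
print(y_hat.shape)                       # torch.Size([8, 1, 30])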
2.4.2 LSTNet (2018)
Paper: “Modeling Long- and Short-Term Temporal Patterns with Deep Neural Networks”
LSTNet is a deep learning model designed specifically for time series forecasting, short for Long- and Short-term Time-series Networks. The model combines a one-dimensional convolutional neural network (1D-CNN) with recurrent layers, allowing it to handle long-term and short-term information in a time series simultaneously and to capture its seasonal and periodic patterns. LSTNet was proposed in 2018 by Guokun Lai and colleagues at Carnegie Mellon University.
The core idea of the LSTNet model lies in using CNN to extract features from time series data, which are then input to LSTM for in-depth sequence modeling. Furthermore, LSTNet introduces an adaptive weight learning mechanism to flexibly adjust the weights of long-term and short-term time series information in predictions. The model’s input is a time series matrix of shape (T, d), where T represents the number of time steps, and d represents the feature dimension at each time step. The model’s output is a prediction vector of length H, where H represents the number of time steps to be predicted. During training, LSTNet employs Mean Squared Error (MSE) as the loss function and optimizes through backpropagation algorithms.
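A stripped-down PyTorch sketch of this CNN-to-RNN pipeline, including the linear autoregressive "highway" that LSTNet adds alongside the neural part; dimensions are illustrative, and the full model also has recurrent-skip connections that are omitted here:

import torch
import torch.nn as nn

class LSTNetLite(nn.Module):
    """Conv1d feature extractor -> GRU -> linear head, plus an AR highway."""
    def __init__(self, d, hidden=32, kernel=6, ar_window=7, horizon=1):
        super().__init__()
        self.conv = nn.Conv1d(d, hidden, kernel_size=kernel)
        self.gru = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, d * horizon)
        self.ar = nn.Linear(ar_window, horizon)
        self.d, self.horizon, self.ar_window = d, horizon, ar_window

    def forward(self, x):                   # x: (batch, T, d)
        c = torch.relu(self.conv(x.transpose(1, 2))).transpose(1, 2)
        _, h = self.gru(c)                  # h: (1, batch, hidden)
        nonlinear = self.head(h[-1]).view(-1, self.horizon, self.d)
        ar = self.ar(x[:, -self.ar_window:, :].transpose(1, 2)).transpose(1, 2)
        return nonlinear + ar               # neural part + short-term linear part

model = LSTNetLite(d=4)
print(model(torch.randn(8, 48, 4)).shape)   # torch.Size([8, 1, 4])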
2.4.3 TDAN (2018)
Paper: “TDAN: Temporal Difference Attention Network for Precipitation Nowcasting”
TDAN (Temporal Difference Attention Network) is a model designed for precipitation nowcasting that leverages temporal differencing together with attention: it extracts temporal difference information from the series and uses attention to focus on the key time periods and regions, achieving precise predictions of precipitation conditions. More generally, TDAN integrates convolutional neural networks with attention mechanisms to capture temporal features in time series; compared to a plain convolutional network, it makes more efficient use of the temporal information in the data, which significantly improves prediction accuracy.
Specifically, the time series forecasting process of the TDAN algorithm can be summarized in the following steps:
- First, historical time series data is used as input, and convolutional layers extract the core features of the series.
- Next, the feature sequence from the convolutional layers is passed to the attention mechanism, which computes a weighted feature vector based on the weights of the historical data most relevant to the current prediction.
- Finally, the weighted feature vector is fed to fully connected layers to produce the prediction.
Notably, during the training process of the TDAN algorithm, appropriate loss functions (such as mean squared error) should be selected, and hyperparameters adjusted according to actual needs. Additionally, to improve the model’s generalization ability, advanced techniques such as cross-validation should be applied for model evaluation and selection.
One major advantage of the TDAN algorithm is its ability to adaptively focus on the parts of historical data that are highly relevant to the current prediction, significantly enhancing the accuracy of time series forecasting. Meanwhile, TDAN effectively addresses issues such as missing values and outliers in time series data, demonstrating exceptional robustness.
2.4.4 DeepAR (2019)
Paper: “DeepAR: Probabilistic Forecasting with Autoregressive Recurrent Networks”
DeepAR is an innovative autoregressive recurrent neural network that cleverly combines recurrent neural networks (RNN) with autoregressive (AR) methods, focusing on predicting scalar (one-dimensional) time series. In many practical scenarios, we often face a series of similar time series with representative units. DeepAR can effectively integrate multiple similar time series data, such as sales data for different flavors of instant noodles, capturing the interrelationships within different time series through deep recurrent neural networks. This multi-target or multi-object setting helps improve overall prediction accuracy.
DeepAR can generate multi-step prediction results for selectable time spans, and the predictions for each time point are probabilistic. By default, it outputs three values: P10, P50, and P90. Here, P10 indicates a 10% likelihood that the actual value will be less than the predicted value P10. By providing probabilistic predictions, we can either combine the three values to give a deterministic prediction or use the predictions from P10 to P90 to formulate more flexible decision-making strategies.
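Since DeepAR's predictions are sample paths drawn from the learned distribution, quantiles such as P10/P50/P90 are simply empirical percentiles of those samples at each horizon step. A numpy illustration, with a synthetic sample array standing in for model output:

import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=100.0, scale=15.0, size=(500, 12))  # 500 paths, 12 steps

p10, p50, p90 = np.percentile(samples, [10, 50, 90], axis=0)
print(p10[0], p50[0], p90[0])   # e.g. plan safety stock on P90, report P50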
2.4.5 N-BEATS (2020)
Paper: “N-BEATS: Neural Basis Expansion Analysis for Interpretable Time Series Forecasting”
Code: https://github.com/amitesh863/nbeats_forecast
N-BEATS (Neural Basis Expansion Analysis for Interpretable Time Series Forecasting) is a time series forecasting model developed by Boris Oreshkin and colleagues at Element AI. N-BEATS represents the time series through learned basis expansions, significantly enhancing the model's interpretability while maintaining high prediction accuracy. Architecturally, it stacks fully connected blocks in a doubly residual fashion, each block emitting a backcast and a forecast, which lets the model handle multi-scale patterns and long-term dependencies effectively.
Using the following code example, we can easily create and configure an N-BEATS model:
model = NBEATSModel(
    input_chunk_length=30,
    output_chunk_length=15,
    n_epochs=100,
    num_stacks=30,
    num_blocks=1,
    num_layers=4,
    dropout=0.0,
    activation="ReLU",
)
With the above configuration, we can adjust the model’s parameters according to specific task requirements to achieve optimal prediction results.
2.4.6 TCN-LSTM (2021)
Paper: “Anomaly Detection in Time Series Data Using LSTM and TCN”
TCN-LSTM, as an advanced model that integrates Temporal Convolutional Network (TCN) and Long Short-Term Memory (LSTM), demonstrates exceptional performance in time series forecasting tasks. In this collaborative model, the TCN layer and LSTM layer each play their role, jointly capturing the long-term and short-term features of the time series. Specifically, the TCN layer expands the receptive field by stacking multiple convolutional layers while leveraging residual connections to avoid gradient vanishing issues. The LSTM layer, with its unique memory cells and gating mechanisms, effectively captures the hidden long-term dependencies in the time series.
In the time series forecasting process, the TCN-LSTM model follows this logic: first, the TCN layer processes historical time series data to accurately extract key features in the short term; then, the processed feature sequence is input to the LSTM layer, allowing it to delve into the long-term dependencies within the time series; finally, the feature vector output from the LSTM layer is sent to fully connected layers, deriving the final prediction results through a series of calculations and integrations.
Notably, to ensure optimal prediction results for the TCN-LSTM model, appropriate loss functions (such as mean squared error) should be selected during training, and model hyperparameters adjusted according to actual situations. Additionally, advanced techniques such as cross-validation should be employed for model evaluation and selection to ensure excellent performance in practical applications.
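A compact PyTorch sketch of the pipeline just described: dilated causal convolutions first, then an LSTM, then a linear head; the architecture choices are illustrative:

import torch
import torch.nn as nn

class TCNLSTM(nn.Module):
    """Dilated causal convs for short-term patterns, LSTM for long-term ones."""
    def __init__(self, d=1, hidden=32, horizon=15):
        super().__init__()
        self.tcn = nn.Sequential(
            nn.ConstantPad1d((2, 0), 0.0), nn.Conv1d(d, hidden, 3, dilation=1), nn.ReLU(),
            nn.ConstantPad1d((4, 0), 0.0), nn.Conv1d(hidden, hidden, 3, dilation=2), nn.ReLU(),
        )
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, horizon)

    def forward(self, x):                   # x: (batch, T, d)
        z = self.tcn(x.transpose(1, 2)).transpose(1, 2)
        _, (h, _) = self.lstm(z)            # final hidden state summarizes the series
        return self.head(h[-1])             # (batch, horizon)

print(TCNLSTM()(torch.randn(4, 60, 1)).shape)   # torch.Size([4, 15])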
2.4.7 NeuralProphet (2021)
Paper: “NeuralProphet: Explainable Forecasting at Scale”
NeuralProphet is a time series forecasting framework that builds on Facebook's Prophet, integrating neural network components into the Prophet decomposition framework to accurately predict time series with complex nonlinear trends and seasonal features.
The core idea of NeuralProphet is to ingeniously combine the time series nonlinear feature learning capabilities of deep neural networks with the Prophet decomposition model to extract and predict complex time series variation patterns. This framework provides various neural network structure and optimization algorithm options, allowing users to flexibly choose and adjust according to specific application scenarios.
The notable features of NeuralProphet include:
- Flexibility: NeuralProphet can easily handle time series data with complex trends and seasonality, allowing users to flexibly configure neural network structures and optimization algorithms based on actual needs.
- Accuracy: Thanks to the powerful nonlinear modeling capabilities of neural networks, NeuralProphet significantly enhances the accuracy of time series forecasting, providing users with more reliable prediction results.
- Interpretability: NeuralProphet offers rich visualization tools to help users intuitively understand the prediction results and their influencing factors, thereby better guiding practical applications.
- User-Friendliness: NeuralProphet integrates seamlessly with Python, providing rich APIs and example code, enabling users to get started easily and quickly apply it to practical scenarios.
In practical applications, NeuralProphet has demonstrated broad application value across various fields such as finance, transportation, and electricity. It can not only accurately predict future trend changes but also provide robust support for decision-making, assisting users in better addressing various complex time series forecasting challenges.
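A minimal usage example with the neuralprophet package on synthetic daily data; the library expects a dataframe with columns named ds and y:

import numpy as np
import pandas as pd
from neuralprophet import NeuralProphet

df = pd.DataFrame({
    "ds": pd.date_range("2023-01-01", periods=200, freq="D"),
    "y": np.sin(np.arange(200) * 2 * np.pi / 7) + 0.05 * np.arange(200),
})

m = NeuralProphet()                        # defaults: trend + seasonality terms
metrics = m.fit(df, freq="D")              # returns per-epoch training metrics
future = m.make_future_dataframe(df, periods=30)
forecast = m.predict(future)
print(forecast[["ds", "yhat1"]].tail())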
2.4.8 N-HiTS (2022)
Paper: “N-HiTS: Neural Hierarchical Interpolation for Time Series Forecasting”
N-HiTS (Neural Hierarchical Interpolation for Time Series) is a deep learning forecasting model developed by researchers at Carnegie Mellon University and Nixtla for long-horizon time series. The model leverages deep learning to accurately forecast future trends in time series data such as product sales, traffic, and stock prices.
The N-HiTS model adopts a hierarchical structure: each stack samples the input at a different rate, so different stacks specialize in different time granularities, and each stack reconstructs its forecast through hierarchical interpolation. Combining the stacks' outputs lets the model capture the correlations and trend changes across these scales.
Additionally, N-HiTS introduces an adaptive learning algorithm that dynamically adjusts the structure and parameters of the prediction model based on the actual data characteristics, maximizing prediction accuracy. The application of this innovative technology enables N-HiTS to excel in addressing complex and dynamic time series forecasting tasks.
In summary, the N-HiTS model, with its advanced hierarchical structure design, neural network models, and adaptive learning algorithms, provides an efficient and accurate solution for multi-layer time series forecasting tasks. In the future, as big data and artificial intelligence technologies continue to evolve, N-HiTS is expected to showcase its powerful application potential in more fields.
model = NHiTSModel(
    input_chunk_length=30,
    output_chunk_length=15,
    n_epochs=100,
    num_stacks=3,
    num_blocks=1,
    num_layers=2,
    dropout=0.1,
    activation="ReLU",
)
2.4.9 D-Linear (2022)
Paper: Are Transformers Effective for Time Series Forecasting?
Code: https://github.com/cure-lab/LTSF-Linear
D-Linear (Decomposition-Linear) is a deliberately simple linear forecasting model introduced in the LTSF-Linear work by Ailing Zeng and colleagues (the paper cited above). It first decomposes the input window with a moving average into a trend component and a seasonal (remainder) component, then applies one linear layer to each component and sums the two outputs to form the forecast. Despite containing no attention or recurrence, this design is highly interpretable and proved competitive with, and often superior to, far more complex Transformer-based forecasters on the paper's long-term forecasting benchmarks. The same work introduces N-Linear (Normalization-Linear), which subtracts the last observed value from the input, applies a single linear layer, and adds the value back, a simple normalization that helps under distribution shift.
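A univariate PyTorch sketch of the D-Linear idea as just described; the kernel size and window lengths are illustrative:

import torch
import torch.nn as nn

class DLinear(nn.Module):
    """Moving-average decomposition into trend + seasonal parts,
    one linear layer per part, summed to form the forecast."""
    def __init__(self, input_len, output_len, kernel=25):
        super().__init__()
        self.avg = nn.AvgPool1d(kernel, stride=1, padding=kernel // 2,
                                count_include_pad=False)
        self.lin_trend = nn.Linear(input_len, output_len)
        self.lin_seasonal = nn.Linear(input_len, output_len)

    def forward(self, x):                   # x: (batch, input_len)
        trend = self.avg(x.unsqueeze(1)).squeeze(1)   # smooth trend component
        seasonal = x - trend                          # remainder component
        return self.lin_trend(trend) + self.lin_seasonal(seasonal)

model = DLinear(input_len=96, output_len=24)
print(model(torch.randn(8, 96)).shape)      # torch.Size([8, 24])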