From AlexNet to BERT: A Simple Review of Key Ideas in Deep Learning

Source | Big Data Digest

Translation | Ao🌰viYa, Meat Bun, Andy

This article by Denny Britz reviews the deep learning ideas that have stood the test of time. Recommended for newcomers, it lists almost all the key ideas since 2012 that countless later papers build on. They are as follows:

AlexNet and Dropout: AlexNet directly opened the deep learning era, laying the foundation for the basic structure of CNN models in CV. Dropout, needless to say, has become a basic configuration.
Deep Reinforcement Learning with Atari: A groundbreaking work in deep reinforcement learning, DQN opened a new path, leading to attempts in various games.
Seq2Seq + Attention: Its impact in the NLP field is undeniable. For a period, it was even said that any NLP task could be solved by Seq2Seq + Attention, and this actually laid the foundation for the subsequent pure Attention Transformer.
Adam Optimizer: No need to say more, a favorite for training models.
Generative Adversarial Networks (GANs): These became extremely popular from 2014 onward, with many variants developed until things quieted down somewhat after StyleGAN appeared last year. The controversial Deepfake is one product of this line of work, and it has recently been used to create fake material.
Residual Networks: Like Dropout and Adam, they have become a basic configuration, making very deep models trainable.
Transformers: A pure Attention model, it directly replaced LSTM in NLP and has gradually achieved good results in other fields, also laying the foundation for the subsequent BERT pre-training model.
BERT and Fine-tuning NLP Models: With the highly scalable Transformer, a large amount of data, and a simple self-supervised training objective, it is possible to obtain a very powerful pre-trained model that sweeps various tasks. The most recent example is GPT-3, which has produced a stream of impressive online demos since its API was released.

The author will review some ideas in the deep learning field that have withstood the test of time and are widely used. Of course, it cannot cover everything. Even so, the deep learning technologies introduced below already encompass the basic knowledge needed to understand modern deep learning research. If you are a newcomer to this field, then great, this will be a very good starting point for you.

Deep learning is a rapidly changing field, and the sheer volume of research papers and ideas can feel overwhelming. Even experienced researchers sometimes find it hard to tell real breakthroughs from company PR. As the saying goes, “time is the only test of truth”, so the author reviews the studies that have stood that test: work that has been reused again and again across research and applications, with results plain to see.

If you expect to be ready to start right after reading this article, you are being too optimistic. The best approach is to understand and reproduce the classic papers listed below; doing so gives you a solid foundation and helps you follow the latest research and carry out your own projects. Reading the papers in chronological order, as arranged below, also helps you see where current techniques come from and why they were invented. In short, the author summarizes the smallest set of studies that covers most of the basic knowledge needed to understand modern deep learning research.

A distinctive feature of deep learning is that its application fields, including machine vision, natural language processing, speech, and reinforcement learning, share most of their techniques. For example, someone who has used deep learning for computer vision can quickly get results in NLP research. Even when specific network architectures differ, the concepts, methods, and code carry over. This article introduces studies from several fields, but first a few caveats are in order:

This article does not provide in-depth explanations or code examples for the studies mentioned below, as lengthy and complex papers are often difficult to summarize into a short paragraph. Instead, the author will only briefly outline these technologies and their historical background, providing links to their papers and implementations. If you want to learn something, it is best to reproduce the experiments in the papers from scratch using PyTorch without using existing codebases or high-level libraries.

Because of the limits of the author’s own knowledge and familiarity with each field, this list is not comprehensive, and many noteworthy subfields are not mentioned. However, the mainstream fields most people would recognize, including machine vision, natural language processing, speech, and reinforcement learning, are covered.

Moreover, the author only discusses research with official or semi-official open-source implementations available. Studies that require substantial engineering effort and are not easily reproducible, such as DeepMind’s AlphaGo or OpenAI’s Dota 2 AI, are not discussed.

Some selections of research may seem somewhat arbitrary. There will always be some similar technologies released around the same time, and the purpose of this article is not to conduct a comprehensive review but to introduce various studies in different fields to newcomers. For example, GAN may have hundreds of variants, but regardless of which one you want to study, the basic concept of GAN is indispensable.

2012: Using AlexNet and Dropout to Process the ImageNet Dataset

Related Papers:

ImageNet Classification with Deep Convolutional Neural Networks [1]:

https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks

Improving neural networks by preventing co-adaptation of feature detectors[2]:

https://arxiv.org/abs/1207.0580

One weird trick for parallelizing convolutional neural networks [14]:

https://arxiv.org/abs/1404.5997

Implementation Code:

PyTorch Version:

https://pytorch.org/hub/pytorch_vision_alexnet/

TensorFlow Version:

https://github.com/tensorflow/models/blob/master/research/slim/nets/alexnet.py

Illustration Source: [1]

It is generally believed that AlexNet set off the recent wave of deep learning and artificial intelligence research. AlexNet is a deep convolutional network based on LeNet, proposed earlier by Yann LeCun. What set it apart is that, by combining GPU computing power with better algorithms, it far surpassed previous methods for classifying the ImageNet dataset. It also proved that neural networks really do work! AlexNet is also one of the first algorithms to use Dropout[2], which has since become a key component for improving the generalization ability of various deep learning models.
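To make the Dropout idea concrete, here is a minimal sketch in plain Python. It uses the "inverted dropout" formulation common in modern libraries (survivors are rescaled during training), which is an assumption on our part; the original paper [2] instead rescales weights at test time. The function name `dropout` and its parameters are illustrative only.

```python
import random

def dropout(activations, p_drop=0.5, training=True, rng=random):
    """Inverted dropout: zero each unit with probability p_drop while training."""
    if not training or p_drop == 0.0:
        # At evaluation time the activations pass through unchanged.
        return list(activations)
    keep = 1.0 - p_drop
    # Scale the surviving units by 1/keep so the expected activation is unchanged.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)          # fixed seed so the sketch is repeatable
hidden = [1.0] * 10             # hypothetical hidden-layer activations
dropped = dropout(hidden, p_drop=0.5, rng=rng)
print(dropped)
```

Because a different random subset of units is silenced on every training step, no unit can rely on a specific partner being present, which is the co-adaptation the paper title refers to.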

The AlexNet architecture is a stack of modules made up of convolutional layers, ReLU nonlinearities, and max pooling, a pattern now accepted as the standard network structure for machine vision. With libraries like PyTorch now so capable, and compared with some of the latest architectures, AlexNet is very simple to implement and takes only a few lines of code. Note that many implementations of AlexNet use a variant that incorporates a technique described in the paper One weird trick for parallelizing convolutional neural networks[14].
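The convolution, ReLU, and max pooling module can be sketched without any library. The example below is a deliberately simplified one-dimensional version with a hand-picked kernel (AlexNet itself uses learned two-dimensional filters over images); all names and values here are illustrative assumptions.

```python
def conv1d(signal, kernel):
    """Slide the kernel over the signal (no padding, stride 1)."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def relu(xs):
    """Keep positive responses, clamp negative ones to zero."""
    return [max(0.0, x) for x in xs]

def max_pool1d(xs, size=2):
    """Keep the strongest response in each non-overlapping window."""
    return [max(xs[i:i + size]) for i in range(0, len(xs) - size + 1, size)]

signal = [1.0, 2.0, 0.0, -1.0, 3.0, 1.0]   # toy input
kernel = [1.0, -1.0]                        # responds to drops between neighbours
features = max_pool1d(relu(conv1d(signal, kernel)))
print(features)
```

Stacking several such modules, followed by fully connected layers, gives the overall shape of the architecture described above.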

2013: Using Deep Reinforcement Learning to Play Atari Games

Related Papers:

Playing Atari with Deep Reinforcement Learning [7]:

https://arxiv.org/abs/1312.5602

Implementation Code:

PyTorch Version:

https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html

TensorFlow Version:

https://www.tensorflow.org/agents/tutorials/1_dqn_tutorial

Illustration Source:

https://deepmind.com/research/publications/human-level-control-through-deep-reinforcement-learning

Based on recent developments in image recognition and GPUs, DeepMind successfully trained a neural network capable of playing Atari games based on raw pixel input. Moreover, the same neural network learned to play seven different games without setting any game rules, demonstrating the generality of the method.

YouTube Video:

https://www.youtube.com/watch?v=V1eYniJ0Rnk

In reinforcement learning, unlike supervised learning (such as image classification), the agent must learn to maximize the total reward over a period of time steps (e.g., a game), rather than just predicting labels. Since its agents can interact directly with the environment, and each action affects the next action, the training data is not independently and identically distributed. This also makes training many reinforcement learning models quite unstable, but this issue can be addressed using techniques like experience replay.
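As a rough illustration of experience replay, the sketch below stores transitions in a fixed-size buffer and samples them at random, which breaks up the correlation between consecutive time steps. It also shows the one-step Q-learning target used to train the network. The class and function names are hypothetical, and the network itself is omitted.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity, seed=0):
        self.buffer = deque(maxlen=capacity)   # oldest transitions fall out first
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling decorrelates the training batch.
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

def q_target(reward, next_q_values, done, gamma=0.99):
    """One-step target: reward plus discounted best value of the next state."""
    return reward if done else reward + gamma * max(next_q_values)

buf = ReplayBuffer(capacity=3)
for t in range(5):                              # toy episode of five steps
    buf.push(t, 0, 1.0, t + 1, t == 4)
batch = buf.sample(2)
print(len(buf), batch)
```

In a full agent, each sampled transition would be turned into a target with `q_target` and the network would be trained to regress toward it.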

Although it contains no obvious algorithmic innovation, this research cleverly combines existing techniques, such as training convolutional neural networks on GPUs and experience replay, with some data processing tricks to achieve impressive results that few expected. This boosted confidence in extending deep reinforcement learning to more complex tasks, such as Go, Dota 2, and StarCraft II.

Since this paper, Atari games have become a standard benchmark for reinforcement learning research. The initial method surpassed human performance in only seven games. Over the next few years these ideas were steadily extended, beating humans in more and more games. “Montezuma’s Revenge” was considered one of the hardest because it requires long-term planning, and only recently did the technology exceed human-level performance in all 57 games.

2014: Encoder-Decoder Networks with Attention Mechanism (Seq2Seq + Attention Model)

Related Papers:

Sequence to Sequence Learning with Neural Networks [4]:

https://arxiv.org/abs/1409.3215

Neural Machine Translation by Jointly Learning to Align and Translate [3]:

https://arxiv.org/abs/1409.0473

Implementation Code:

PyTorch Version:

https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html

TensorFlow Version:

https://www.tensorflow.org/addons/tutorials/networks_seq2seq_nmt

Illustration Source: Open Source Seq2Seq Framework in TensorFlow:

https://ai.googleblog.com/2017/04/introducing-tf-seq2seq-open-source.html

Many of the most impressive results in deep learning have come from vision tasks driven by convolutional neural networks. The NLP field had seen some success with LSTMs and encoder-decoder architectures in language modeling and translation, but it was not until the attention mechanism arrived that the field produced truly remarkable results.

When processing language, each token (a character, a word, or something in between) is fed into a recurrent network (such as an LSTM), which keeps a memory of previously processed inputs. In other words, a sentence is treated like a time series in which each token is one time step. These recurrent models easily “forget” earlier inputs when processing long sequences, making long-distance dependencies hard to handle. Because gradients must propagate through many time steps, they can explode or vanish, which makes recurrent models difficult to optimize with gradient descent.
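A back-of-the-envelope way to see why gradients vanish or explode: if backpropagation multiplies the gradient by roughly the same factor at every time step, the result shrinks or grows exponentially with sequence length. The sketch below assumes a single scalar factor per step, which is a strong simplification of what happens in a real recurrent network.

```python
def gradient_scale(per_step_factor, num_steps):
    """Magnitude left after multiplying by the same factor at every time step."""
    return abs(per_step_factor) ** num_steps

for factor in (0.9, 1.0, 1.1):
    scale = gradient_scale(factor, 100)
    print(f"factor={factor}: after 100 steps the gradient is scaled by {scale:.3e}")
```

A factor slightly below one leaves almost nothing after 100 steps, while a factor slightly above one blows up by several orders of magnitude.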

The attention mechanism helps alleviate this issue by giving the network a way to adaptively “recall” earlier time steps through direct connections. These connections let the network decide which inputs matter when generating a specific output. To illustrate with translation: when generating an output word, the attention mechanism usually selects one or more specific input words as references for that output.
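The core computation of attention can be sketched in a few lines: score every encoder state against the current decoder state, normalize the scores with a softmax, and take the weighted average. For simplicity this sketch scores with a plain dot product, whereas the paper [3] uses a small learned network for scoring; the vectors below are made-up toy values.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into weights that are positive and sum to 1."""
    m = max(scores)                              # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, encoder_states):
    """Return attention weights and the weighted average of encoder states."""
    scores = [sum(q * h for q, h in zip(query, state)) for state in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(w * state[d] for w, state in zip(weights, encoder_states))
               for d in range(dim)]
    return weights, context

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # one vector per input token
decoder_state = [0.0, 5.0]                                # current output step
weights, context = attend(decoder_state, encoder_states)
print(weights, context)
```

Here the first input token receives almost no weight because it is unrelated to the decoder state, while the other two share the attention equally.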

2014 – Adam Optimizer

Related Papers:

Adam: A Method for Stochastic Optimization [12]:

https://arxiv.org/abs/1412.6980

Implementation Code:

Python Version:

https://d2l.ai/chapter_optimization/adam.html

PyTorch Version:

https://pytorch.org/docs/master/_modules/torch/optim/adam.html

TensorFlow Version:

https://github.com/tensorflow/tensorflow/blob/v2.2.0/tensorflow/python/keras/optimizer_v2/adam.py#L32-L281

Figure axes: y-axis, probability of optimal solution; x-axis, budget for hyperparameter optimization (number of model training runs).

Source: http://arxiv.org/abs/1910.11758

Neural networks are generally trained by minimizing a loss function with an optimizer, which adjusts the network parameters to learn the specified objective. Most optimizers are refinements of stochastic gradient descent (SGD) (https://ruder.io/optimizing-gradient-descent/). Note, however, that many optimizers have tunable parameters of their own, such as the learning rate. Finding the right settings for a specific problem can therefore not only reduce training time but also reach better local minima of the loss function, which often means better results for the model.

Previously, well-funded research labs typically ran costly hyperparameter searches to devise a learning rate schedule for SGD. Such tuning could beat the previous best results, but it often meant spending a lot of money on the optimizer. These details are generally not mentioned in papers, so researchers without a comparable tuning budget were often stuck with poorer results and had no recourse.

Adam brought good news to these researchers: it adapts the learning rate automatically using estimates of the first and second moments of the gradients. Experiments have also shown it to be very reliable and not very sensitive to hyperparameter choices. In other words, Adam can be used out of the box without the extensive tuning other optimizers need. Although a well-tuned SGD may still yield better results, Adam has made research easier, because when problems arise you know they are unlikely to come from hyperparameter tuning.
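The update rule can be written out for a single scalar parameter. The sketch below follows the moment estimates and bias correction described in the Adam paper [12], with the paper's suggested values for beta1, beta2, and eps; the learning rate of 0.1 and the toy objective f(x) = (x - 3)^2 are our own choices for illustration.

```python
import math

def adam_minimize(grad_fn, x0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Minimize a scalar function given its gradient, using the Adam update."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g          # first moment: running mean of gradients
        v = beta2 * v + (1 - beta2) * g * g      # second moment: running mean of squares
        m_hat = m / (1 - beta1 ** t)             # bias correction for the zero init
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x

# Toy objective f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x_min = adam_minimize(lambda x: 2.0 * (x - 3.0), x0=0.0)
print(x_min)
```

Dividing by the square root of the second moment makes the step size roughly independent of the gradient's scale, which is one reason the default settings transfer well across problems.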

2014/2015 – Generative Adversarial Networks (GAN)

Related Papers:

Generative Adversarial Networks [6]:

https://arxiv.org/abs/1406.2661

Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks [17]:

https://arxiv.org/abs/1511.06434

Implementation Code:

PyTorch Version:

https://pytorch.org/tutorials/beginner/dcgan_faces_tutorial.html

TensorFlow Version:

https://www.tensorflow.org/tutorials/generative/dcgan

Figure 2: Visualization of samples from the model. The rightmost column shows the nearest training example to the neighboring sample, demonstrating that the model has not simply memorized the training set. The samples are random draws, not carefully chosen ones. Unlike most other visualizations of deep generative models, these images show actual samples from the model distribution rather than conditional means given samples of latent units. Moreover, these samples are uncorrelated, because the sampling process does not rely on Markov chain mixing. a) MNIST b) TFD c) CIFAR-10 (fully connected model) d) CIFAR-10 (convolutional discriminator and deconvolutional generator)

Source: https://developers.google.com/machine-learning/gan/gan_structure

Generative models (such as Variational Autoencoders, VAE) aim to generate realistic data samples, such as non-existent human faces. Here, the model must model the entire data distribution (many pixels!), rather than just classifying cats or dogs like discriminative models, making such models difficult to train. Generative Adversarial Networks (GANs) are such models.

The basic idea of GAN is to train two networks simultaneously, a generator and a discriminator. The generator’s goal is to produce samples that can deceive the discriminator, while the discriminator is trained to distinguish between real images and generated images. As training progresses, the discriminator becomes better at identifying fake images, while the generator becomes better at deceiving the discriminator, producing more realistic samples, which is where the “adversarial” aspect comes from. Initially, GANs produced blurry, low-resolution images and were quite unstable to train. However, with technological advancements, variants and improvements like DCGAN[17], Wasserstein GAN[25], CycleGAN[26], and StyleGAN(v2)[27] can produce higher resolution, realistic images and videos.
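The two competing objectives can be written as a pair of loss functions over the discriminator's outputs, where `d_real` and `d_fake` are the probabilities it assigns to "real" for a real image and a generated image. This is only a sketch of the losses for one pair of samples (no networks or training loop), and the generator loss shown is the non-saturating variant suggested in the GAN paper [6].

```python
import math

def discriminator_loss(d_real, d_fake):
    """Low when real images score near 1 and generated images score near 0."""
    return -(math.log(d_real) + math.log(1.0 - d_fake))

def generator_loss(d_fake):
    """Low when the discriminator is fooled into scoring generated images near 1."""
    return -math.log(d_fake)

# A confident, correct discriminator versus one that is only guessing.
print(discriminator_loss(0.9, 0.1), discriminator_loss(0.5, 0.5))
# A generator that fools the discriminator versus one that does not.
print(generator_loss(0.9), generator_loss(0.1))
```

Training alternates between lowering the first loss with respect to the discriminator's parameters and lowering the second with respect to the generator's, which is the adversarial game described above.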

2015 – Residual Networks (ResNet)

Related Papers:

Deep Residual Learning for Image Recognition [13]:

https://arxiv.org/abs/1512.03385

Implementation Code:

PyTorch Version:

https://github.com/pytorch/vision/blob/master/torchvision/models/resnet.py

TensorFlow Version:

https://github.com/tensorflow/tensorflow/blob/v2.2.0/tensorflow/python/keras/applications/resnet.py

Building on AlexNet, researchers invented better-performing convolutional architectures such as VGGNet[28] and Inception[29]. ResNet is the most important breakthrough in this series of advances. To this day, ResNet variants serve as baseline model architectures for a wide range of tasks and as building blocks for more complex architectures.

What makes ResNet special is not only that it won the ILSVRC 2015 classification challenge but also its depth compared with other network architectures. The deepest network described in the paper has 1000 layers; although it performs slightly worse on benchmark tasks than the 101-layer and 152-layer versions, it still performs very well. Training such deep networks is very challenging because of the vanishing gradient problem, which sequence models share. Previously, few researchers believed that training networks this deep could give such stable results.

ResNet uses shortcut connections to aid gradient propagation. One way to understand it is that ResNet only needs to learn the “difference” from one layer to another, which is simpler than learning a complete transformation. Additionally, the residual connections in ResNet can be seen as a special case of Highway Networks[30], which were inspired by the gating mechanisms in LSTMs.
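The shortcut connection amounts to computing y = x + F(x). The sketch below uses a single fully connected layer with ReLU as a stand-in for F, which is an assumption for brevity; real ResNet blocks use convolutions and batch normalization. It shows that when F outputs zero, the block simply passes its input through.

```python
def linear(x, weights, bias):
    """Plain fully connected layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(xs):
    return [max(0.0, v) for v in xs]

def residual_block(x, weights, bias):
    """y = x + F(x): the layer only has to learn the change to apply to x."""
    fx = relu(linear(x, weights, bias))
    return [xi + fi for xi, fi in zip(x, fx)]

x = [1.0, -2.0]
zero_w = [[0.0, 0.0], [0.0, 0.0]]    # a layer that has learned nothing yet
zero_b = [0.0, 0.0]
out = residual_block(x, zero_w, zero_b)
print(out)
```

Because the identity path is always available, stacking many such blocks does not cut off the signal or the gradient, even if some blocks contribute little.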

2017 – Transformers

Related Papers:

Attention is All You Need [5]:

https://arxiv.org/abs/1706.03762

Implementation Code:

PyTorch Version:

https://pytorch.org/tutorials/beginner/transformer_tutorial.html

TensorFlow Version:

https://www.tensorflow.org/tutorials/text/transformer

HuggingFace Transformers Library:

https://github.com/huggingface/transformers

Figure 1: Transformer – Model Architecture

Source: https://arxiv.org/abs/1706.03762

The Seq2Seq + Attention model (introduced earlier) performs well, but its recurrent nature requires sequential computation. It is therefore hard to parallelize: it processes one step at a time, and each step depends on the previous one. This also makes it hard to apply to long sequences, and even with attention it still struggles to model complex long-distance dependencies, because most of the work is still done in the recurrent layers.

Transformers address these issues directly by discarding the recurrent part and replacing it with multiple feed-forward self-attention layers, which process all inputs in parallel and create relatively short paths (easier to optimize with gradient descent) between inputs and outputs. This makes training very fast, easy to scale, and able to handle more data. To incorporate input position information (which is implicit in recurrent models), Transformers also use positional encoding. For more on how Transformers work, this illustrated blog post is recommended:

http://jalammar.github.io/illustrated-transformer/
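Two of the ingredients mentioned above can be sketched in plain Python: the sinusoidal positional encoding from the Transformer paper [5], and scaled dot-product self-attention. To stay short, the attention sketch uses the inputs directly as queries, keys, and values, omitting the learned projections and multiple heads of the real model; that simplification is ours.

```python
import math

def positional_encoding(num_positions, d_model):
    """Sinusoidal encoding: sine on even dimensions, cosine on odd ones."""
    pe = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Every position attends to every position; no step depends on a previous one."""
    d = len(x[0])
    out = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, x)) for j in range(d)])
    return out

pe = positional_encoding(num_positions=3, d_model=4)
embeddings = [[0.1, 0.2, 0.3, 0.4]] * 3          # the same toy token three times
inputs = [[e + p for e, p in zip(emb, pos)] for emb, pos in zip(embeddings, pe)]
outputs = self_attention(inputs)
print(outputs)
```

Without the positional encoding, the three identical tokens would be indistinguishable to the attention layer; adding it gives each position a distinct input. The loop over positions in `self_attention` has no dependency between iterations, which is what makes the computation easy to parallelize.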

To say that Transformers performed better than almost everyone expected would be an understatement. In the following years they not only performed better but also largely displaced RNNs, becoming the standard architecture for most NLP and other sequence tasks, and they have even been applied in machine vision.

2018 – BERT and Fine-tuned NLP Models

Related Papers:

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding [9]:

https://arxiv.org/abs/1810.04805

Implementation Code:

HuggingFace Implementation for Fine-tuning BERT:

https://huggingface.co/transformers/training.html

Pre-training means training a model on one task and then using the learned parameters as the initialization for learning a related task. This is quite intuitive: a model that has learned to classify images of cats and dogs should have learned something general about images and furry animals. When it is fine-tuned to classify foxes, it can be expected to do better than a model trained from scratch. Similarly, a model that has learned to predict the next word in a sentence should have learned general properties of human language. Its parameters are therefore a good initialization for related tasks (such as translation or sentiment analysis).

Pre-training and fine-tuning have succeeded in both computer vision and NLP. In computer vision they have long been standard practice, but making them work well in NLP proved more challenging, and most of the best results still came from fully supervised models. With methods like ELMo[34] and ULMFiT[35], NLP researchers finally began to make pre-training work (word vectors could be counted as an earlier form), and the adoption of Transformers in particular led to a series of methods such as GPT and BERT.

BERT is a relatively recent achievement in pre-training, and many believe it opened a new era in NLP research. Unlike most pre-trained models, which are trained to predict the next word, it predicts words that have been masked (deliberately removed) from a sentence, as well as whether two sentences are adjacent. Note that these tasks require no labeled data: the model can be trained on any text, and in large quantities! The pre-trained model thus learns general properties of language and can then be fine-tuned for supervised tasks such as question answering or sentiment prediction. BERT performed exceptionally well across tasks and topped the leaderboards on release. Companies like HuggingFace rode this wave by making it easy to download and fine-tune BERT models for NLP tasks. BERT has since been built upon by newer models such as XLNet[31], RoBERTa[32], and ALBERT[33], and by now practically everyone in the field knows it.
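The masked-word objective needs no human labels because the training pairs are built from the text itself. The sketch below shows one simplified way to do this: each token is replaced by a mask symbol with some probability, and the original token becomes the label. The BERT paper [9] masks about 15% of tokens and sometimes substitutes a random or unchanged token instead of the mask; that refinement is left out here, and the function name is hypothetical.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Build (inputs, labels) for masked-word prediction from raw tokens."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(mask_token)
            labels.append(tok)       # the model must recover the original token
        else:
            inputs.append(tok)
            labels.append(None)      # position ignored when computing the loss
    return inputs, labels

tokens = "the cat sat on the mat".split()
# A higher mask rate than usual, so that this short sentence gets some masks.
inputs, labels = mask_tokens(tokens, mask_prob=0.5)
print(inputs)
print(labels)
```

Any amount of raw text can be turned into training data this way, which is why the approach scales to very large corpora.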

2019/2020 and Beyond – BIG Language Models, Self-supervised Learning?

Throughout the history of deep learning, the most obvious trend may be what Sutton called the bitter lesson: algorithms that can exploit better parallelism (more data) and more model parameters consistently beat so-called “smarter techniques”. This trend appears to continue into 2020 with OpenAI’s GPT-3, a massive language model with 175 billion parameters which, despite its simple training objective and architecture, shows unexpectedly good generalization (as seen in many striking demos).

Contrastive self-supervised learning methods such as SimCLR (https://arxiv.org/abs/2002.05709), which make better use of unlabeled data, follow the same trend. As models grow larger and training gets faster, techniques that can effectively use the vast amounts of unlabeled data available online to learn transferable general knowledge are becoming increasingly valuable.

Related Reports:

https://dennybritz.com/blog/deep-learning-most-important-ideas/

References

[1] ImageNet Classification with Deep Convolutional Neural Networks

Alex Krizhevsky, Ilya Sutskever, Geoffrey E Hinton (2012)

Advances in Neural Information Processing Systems 25

[http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf] [Semantic Scholar] [Google Scholar]

[2] Improving neural networks by preventing co-adaptation of feature detectors

Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, Ruslan R. Salakhutdinov (2012)

arXiv:1207.0580 [cs]

[http://arxiv.org/abs/1207.0580] [Semantic Scholar] [Google Scholar]

[3] Neural Machine Translation by Jointly Learning to Align and Translate

Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio (2016)

arXiv:1409.0473 [cs, stat]

[http://arxiv.org/abs/1409.0473] [Semantic Scholar] [Google Scholar]

[4] Sequence to Sequence Learning with Neural Networks

Ilya Sutskever, Oriol Vinyals, Quoc V. Le (2014)

arXiv:1409.3215 [cs]

[http://arxiv.org/abs/1409.3215] [Semantic Scholar] [Google Scholar]

[5] Attention Is All You Need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin (2017)

arXiv:1706.03762 [cs]

[http://arxiv.org/abs/1706.03762] [Semantic Scholar] [Google Scholar]

[6] Generative Adversarial Networks

Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio (2014)

arXiv:1406.2661 [cs, stat]

[http://arxiv.org/abs/1406.2661] [Semantic Scholar] [Google Scholar]

[7] Playing Atari with Deep Reinforcement Learning

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, Martin Riedmiller (2013)

arXiv:1312.5602 [cs]

[http://arxiv.org/abs/1312.5602] [Semantic Scholar] [Google Scholar]

[8] The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks

Jonathan Frankle, Michael Carbin (2019)

arXiv:1803.03635 [cs]

[http://arxiv.org/abs/1803.03635] [Semantic Scholar] [Google Scholar]

[9] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova (2019)

arXiv:1810.04805 [cs]

[http://arxiv.org/abs/1810.04805] [Semantic Scholar] [Google Scholar]

[10] Language Models are Unsupervised Multitask Learners

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever

[Semantic Scholar] [Google Scholar]

[11] Language Models are Few-Shot Learners

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei (2020)

arXiv:2005.14165 [cs]

[http://arxiv.org/abs/2005.14165] [Semantic Scholar] [Google Scholar]

[12] Adam: A Method for Stochastic Optimization

Diederik P. Kingma, Jimmy Ba (2017)

arXiv:1412.6980 [cs]

[http://arxiv.org/abs/1412.6980] [Semantic Scholar] [Google Scholar]

[13] Deep Residual Learning for Image Recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun (2015)

arXiv:1512.03385 [cs]

[http://arxiv.org/abs/1512.03385] [Semantic Scholar] [Google Scholar]

[14] One weird trick for parallelizing convolutional neural networks

Alex Krizhevsky (2014)

arXiv:1404.5997 [cs]

[http://arxiv.org/abs/1404.5997] [Semantic Scholar] [Google Scholar]

[15] Self-Improving Reactive Agents Based on Reinforcement Learning, Planning and Teaching

Long-Ji Lin (1992)

Machine Learning

[https://doi.org/10.1007/BF00992699] [Semantic Scholar] [Google Scholar]

[16] Long Short-Term Memory

Sepp Hochreiter, Jürgen Schmidhuber (1997)

Neural Computation

[https://doi.org/10.1162/neco.1997.9.8.1735] [Semantic Scholar] [Google Scholar]

[17] Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks

Alec Radford, Luke Metz, Soumith Chintala (2016)

arXiv:1511.06434 [cs]

[http://arxiv.org/abs/1511.06434] [Semantic Scholar] [Google Scholar]

[18] Mastering the game of Go with deep neural networks and tree search

David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu (2016)

Nature

[https://www.nature.com/articles/nature16961] [Semantic Scholar] [Google Scholar]

[19] Convolutional Sequence to Sequence Learning

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, Yann N. Dauphin (2017)

arXiv:1705.03122 [cs]

[http://arxiv.org/abs/1705.03122] [Semantic Scholar] [Google Scholar]

[20] WaveNet: A Generative Model for Raw Audio

Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, Koray Kavukcuoglu (2016)

arXiv:1609.03499 [cs]

[http://arxiv.org/abs/1609.03499] [Semantic Scholar] [Google Scholar]

[21] The Arcade Learning Environment: An Evaluation Platform for General Agents

Marc G. Bellemare, Yavar Naddaf, Joel Veness, Michael Bowling (2013)

Journal of Artificial Intelligence Research

[http://arxiv.org/abs/1207.4708] [Semantic Scholar] [Google Scholar]

[22] First return then explore

Adrien Ecoffet, Joost Huizinga, Joel Lehman, Kenneth O. Stanley, Jeff Clune (2020)

arXiv:2004.12919 [cs]

[http://arxiv.org/abs/2004.12919] [Semantic Scholar] [Google Scholar]

[23] Agent57: Outperforming the Atari Human Benchmark

Adrià Puigdomènech Badia, Bilal Piot, Steven Kapturowski, Pablo Sprechmann, Alex Vitvitskyi, Daniel Guo, Charles Blundell (2020)

arXiv:2003.13350 [cs, stat]

[http://arxiv.org/abs/2003.13350] [Semantic Scholar] [Google Scholar]

[24] Optimizer Benchmarking Needs to Account for Hyperparameter Tuning

Prabhu Teja Sivaprasad, Florian Mai, Thijs Vogels, Martin Jaggi, François Fleuret (2020)

arXiv:1910.11758 [cs, stat]

[http://arxiv.org/abs/1910.11758] [Semantic Scholar] [Google Scholar]

[25] Wasserstein GAN

Martin Arjovsky, Soumith Chintala, Léon Bottou (2017)

arXiv:1701.07875 [cs, stat]

[http://arxiv.org/abs/1701.07875] [Semantic Scholar] [Google Scholar]

[26] Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks

Jun-Yan Zhu, Taesung Park, Phillip Isola, Alexei A. Efros (2018)

arXiv:1703.10593 [cs]

[http://arxiv.org/abs/1703.10593] [Semantic Scholar] [Google Scholar]

[27] Analyzing and Improving the Image Quality of StyleGAN

Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, Timo Aila (2020)

arXiv:1912.04958 [cs, eess, stat]

[http://arxiv.org/abs/1912.04958] [Semantic Scholar] [Google Scholar]

[28] Very Deep Convolutional Networks for Large-Scale Image Recognition

Karen Simonyan, Andrew Zisserman (2015)

arXiv:1409.1556 [cs]

[http://arxiv.org/abs/1409.1556] [Semantic Scholar] [Google Scholar]

[29] Going Deeper with Convolutions

Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich (2014)

arXiv:1409.4842 [cs]

[http://arxiv.org/abs/1409.4842] [Semantic Scholar] [Google Scholar]

[30] Training Very Deep Networks

Rupesh K Srivastava, Klaus Greff, Jürgen Schmidhuber (2015)

Advances in Neural Information Processing Systems 28

[http://papers.nips.cc/paper/5850-training-very-deep-networks.pdf] [Semantic Scholar] [Google Scholar]

[31] XLNet: Generalized Autoregressive Pretraining for Language Understanding

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le (2020)

arXiv:1906.08237 [cs]

[http://arxiv.org/abs/1906.08237] [Semantic Scholar] [Google Scholar]

[32] RoBERTa: A Robustly Optimized BERT Pretraining Approach

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov (2019)

arXiv:1907.11692 [cs]

[http://arxiv.org/abs/1907.11692] [Semantic Scholar] [Google Scholar]

[33] ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut (2020)

arXiv:1909.11942 [cs]

[http://arxiv.org/abs/1909.11942] [Semantic Scholar] [Google Scholar]

[34] Deep contextualized word representations

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, Luke Zettlemoyer (2018)

arXiv:1802.05365 [cs]

[http://arxiv.org/abs/1802.05365] [Semantic Scholar] [Google Scholar]

[35] Universal Language Model Fine-tuning for Text Classification

Jeremy Howard, Sebastian Ruder (2018)

arXiv:1801.06146 [cs, stat]

[http://arxiv.org/abs/1801.06146] [Semantic Scholar] [Google Scholar]


