Hugging Face recently published a very popular book, "NLP with Transformers", and we will be releasing a series of practical tutorials on transformers based on it, so let's get hands-on!
Original text: https://www.oreilly.com/library/view/natural-language-processing/9781098103231/ch01.html
Fair warning: this is a fairly close translation, so expect some translated flavor; enjoy it while it's fresh.
Hello Transformers
In 2017, researchers at Google published a paper proposing a novel neural network architecture for sequence modeling. This architecture, known as the Transformer, outperformed recurrent neural networks (RNNs) in machine translation tasks, both in translation quality and training cost.
At the same time, an effective transfer learning method called ULMFiT demonstrated that training long short-term memory (LSTM) networks on a very large and diverse corpus could produce state-of-the-art text classifiers with very little labeled data.
These advances catalyzed the two most famous Transformers of today: the Generative Pre-trained Transformer (GPT) and Bidirectional Encoder Representations from Transformers (BERT). By combining the Transformer architecture with unsupervised learning, these models removed the need to train task-specific architectures from scratch and broke almost every NLP benchmark by a significant margin. Since the release of GPT and BERT, a plethora of Transformer models have emerged. Figure 1-1 shows a timeline of the most prominent ones.
[Figure 1-1: a timeline of the most prominent Transformer models]
We are getting ahead of ourselves, though. To understand what is novel about Transformers, we first need to explain:
- the encoder-decoder framework
- attention mechanisms
- transfer learning
In this chapter, we will introduce the core concepts that underpin the widespread presence of Transformers, visit some tasks they excel at, and finally look at the tools and libraries of the Hugging Face ecosystem.
Let’s first explore the encoder-decoder framework and the architectures that existed before the rise of Transformers.
Encoder-Decoder Framework
Before the Transformer models, recurrent architectures such as LSTMs were the state of the art in NLP. These architectures contain a feedback loop in the network connections that allows information to propagate from one step to the next, making them ideal for modeling sequential data such as text. As shown on the left side of Figure 1-2, an RNN receives some input (which could be a word or a character), processes it through the network, and outputs a vector called the hidden state. At the same time, the model feeds some information back to itself through the feedback loop, which it can then use in the next step. This becomes clearer if we "unroll" the loop, as shown on the right side of Figure 1-2: the RNN passes its state from each step to the next operation in the sequence. This allows an RNN to keep track of information from previous steps and use it for its output predictions.

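To make the feedback loop concrete, here is a minimal sketch in PyTorch (not code from the book; the dimensions are made up) of an RNN cell unrolled over a sequence, with the hidden state carried from one step to the next:
import torch
import torch.nn as nn

# Minimal sketch: unrolling an RNN cell over a toy sequence.
# The hidden state h is fed back into the cell at every step,
# which is what lets the network carry information forward.
rnn_cell = nn.RNNCell(input_size=8, hidden_size=16)

sequence = torch.randn(5, 1, 8)   # 5 time steps, batch of 1, 8 features each
h = torch.zeros(1, 16)            # initial hidden state

for x_t in sequence:              # the "unrolled" feedback loop
    h = rnn_cell(x_t, h)          # new state depends on the input and the old state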
These architectures have been (and will continue to be) widely used in NLP tasks, speech processing, and time series. You can find a brilliant exposition of their capabilities in Andrej Karpathy’s blog post “The Unreasonable Effectiveness of Recurrent Neural Networks”.
One area where RNNs play a crucial role is in the development of machine translation systems, which aim to map sequences of words in one language to sequences in another language. This type of task is typically solved using an encoder-decoder or sequence-to-sequence structure, which is well-suited for cases where both the input and output are sequences of arbitrary length. The encoder’s job is to encode the information of the input sequence into a numerical representation, usually referred to as the final hidden state. This state is then passed to the decoder, which generates the output sequence.
In general, the encoder and decoder components can be any neural network architecture capable of modeling sequences. This is illustrated in Figure 1-3 with a pair of RNNs, where the English sentence “Transformers are great!” is encoded into a hidden state vector and then decoded to produce the German translation “Transformer sind grossartig!” Input words are fed into the encoder one by one, and output words are generated from top to bottom.

For all its elegant simplicity, one weakness of this structure is that the final hidden state of the encoder creates an information bottleneck: it must represent the meaning of the entire input sequence, because this is all the decoder has access to when generating the output. This is especially challenging for long sequences, where information from the early parts of the sequence may be lost in the process of compressing everything into a single, fixed representation.
Fortunately, there is a way to escape this bottleneck by allowing the decoder to access all of the hidden states of the encoder. The general mechanism for this is called attention, which is a key component of many modern neural network architectures. Understanding how attention was developed for RNNs will enable us to understand one of the main components of the Transformer architecture well. Let’s delve deeper into this.
Attention Mechanism
The main idea behind attention is that the encoder does not produce a single hidden state for the input sequence, but instead outputs a hidden state at each step, allowing the decoder to access it. However, using all states simultaneously would present a massive input to the decoder, so some mechanism is needed to determine the priority of which states to use. This is where attention comes in. It allows the decoder to assign different weights, or “attention”, to the states of the encoder at each decoding time step. This process is illustrated in Figure 1-4, where the role of attention is to predict the third token in the output sequence.

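As a rough illustration of the idea (a simplified sketch, not the exact formulation of any particular paper), the decoder can score every encoder hidden state against its current state, turn the scores into weights with a softmax, and build a context vector from the weighted states:
import torch
import torch.nn.functional as F

# Illustrative sketch of attention over encoder states; dimensions are toy values.
encoder_states = torch.randn(6, 16)       # one hidden state per input token
decoder_state = torch.randn(16)           # decoder state at the current time step

scores = encoder_states @ decoder_state   # one relevance score per input token
weights = F.softmax(scores, dim=0)        # attention weights, summing to 1
context = weights @ encoder_states        # weighted combination of encoder states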
By focusing on which input tokens are most relevant at each time step, these attention-based models can learn nontrivial alignments between the words in a generated translation and the words in the source sentence. For example, Figure 1-5 visualizes the attention weights of an English-to-French translation model, where each pixel represents a weight. The figure shows how the decoder correctly aligns the words "zone" and "Area", which appear in different orders in the two languages.

Figure 1-5. Alignment between words in English and the generated French translation produced by an RNN encoder-decoder (courtesy of Dzmitry Bahdanau).
While attention produces better translations, using recurrent models for both the encoder and the decoder still has a major drawback: the computation is inherently sequential and cannot be parallelized across the input sequence.
With the advent of Transformers, a new modeling paradigm was introduced: they rely entirely on a special form of attention called self-attention and do away with recurrence altogether. We will delve deeper into self-attention in Chapter 3, but the basic idea is to allow attention to operate on all states within the same layer of the neural network. This is shown in Figure 1-6, where both the encoder and decoder have their own self-attention mechanisms, whose outputs are fed into feed-forward neural networks (FFNs). This architecture trains much faster than recurrent models, paving the way for many recent breakthroughs in NLP.

Figure 1-6. The encoder-decoder structure of the original Transformer
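We will get to the details in Chapter 3, but as a preview, here is a heavily simplified sketch of scaled dot-product self-attention, where every position attends to every other position within the same layer (real implementations add learned query/key/value projections and multiple heads):
import math
import torch
import torch.nn.functional as F

def self_attention(x):
    """Simplified self-attention over a (seq_len, dim) tensor:
    no learned projections, no masking, a single head."""
    scores = x @ x.T / math.sqrt(x.size(-1))   # similarity between all pairs of positions
    weights = F.softmax(scores, dim=-1)        # each row of weights sums to 1
    return weights @ x                         # every output mixes all input states

out = self_attention(torch.randn(4, 16))       # 4 tokens with 16-dimensional states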
In the original Transformer paper, the translation model was trained from scratch on large sentence-pair corpora in various languages. However, in many practical applications of NLP, we cannot obtain a large amount of labeled text data to train our models. To kickstart the Transformer revolution, the final piece was still missing: transfer learning.
Transfer Learning in NLP
Nowadays, in computer vision it is common practice to use transfer learning: train a convolutional neural network such as ResNet on one task, then adapt or fine-tune it on a new task. This allows the network to make use of the knowledge learned from the original task. Architecturally, this involves splitting the model into a body and a head, where the head is a task-specific network. During training, the weights of the body learn broad features of the source domain, and these weights are used to initialize a new model for the new task. Compared with traditional supervised learning, this approach typically produces high-quality models that can be trained much more efficiently on a variety of downstream tasks, with far less labeled data. A comparison of the two approaches is shown in Figure 1-7.

In computer vision, models are first trained on large-scale datasets like ImageNet, which contains millions of images. This process is called pre-training, and its main purpose is to help the model understand the basic features of images, such as edges or colors. Then, these pre-trained models can be fine-tuned on downstream tasks, such as classifying flower species with relatively few labeled instances (usually a few hundred per class). Fine-tuned models typically achieve higher accuracy than supervised models trained from scratch on the same amount of labeled data.
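To make the body/head split concrete, here is a hedged sketch (not the book's code) using a torchvision ResNet, assuming a recent torchvision release: the pre-trained body is frozen and a fresh head is attached for the downstream task.
import torch.nn as nn
from torchvision import models

# Sketch of the body/head split: reuse a pre-trained body and attach a new
# task-specific head, e.g. for a 10-class flower classification task.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

for param in model.parameters():                  # freeze the pre-trained body
    param.requires_grad = False

model.fc = nn.Linear(model.fc.in_features, 10)    # new head, trained from scratch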
While transfer learning has become a standard method in computer vision, for many years it was unclear what the analogous pre-training process was for NLP. As a result, NLP applications often required large amounts of labeled data to achieve high performance. Even then, this performance could not compare to the results achieved in the visual domain.
In 2017 and 2018, several research groups proposed new methods that ultimately made transfer learning work in NLP. It began with insights from researchers at OpenAI, who achieved strong performance on sentiment classification tasks by using features extracted from unsupervised pre-training. This was followed by ULMFiT, which introduced a general framework that made pre-trained LSTM models applicable to various tasks.
As shown in Figure 1-8, ULMFiT consists of three main steps:
- Pre-training: The initial training objective is quite simple: predict the next word based on the preceding words. This task is called language modeling. The elegance of this approach lies in its independence from labeled data, allowing one to leverage vast amounts of text from sources like Wikipedia.
- Domain adaptation: Once the language model has been pre-trained on a large corpus, the next step is to adapt it to a domain-specific corpus (e.g., from Wikipedia to IMDb movie reviews). This phase still uses language modeling, but now the model must predict the next word in the target corpus.
- Fine-tuning: In this step, the language model is fine-tuned with a classification layer for the target task (e.g., classifying the sentiment of movie reviews).

By introducing a viable pre-training and transfer learning framework in NLP, ULMFiT provided the missing piece to launch the Transformers. In 2018, two Transformer models that combined self-attention with transfer learning were released.
- GPT uses only the decoder part of the Transformer architecture, along with the same language modeling approach as ULMFiT. GPT was pre-trained on the BookCorpus, which consists of 7,000 unpublished books across various genres, including adventure, fantasy, and romance.
- BERT uses the encoder part of the Transformer architecture, along with a special form of language modeling called masked language modeling. The goal of masked language modeling is to predict randomly masked words in the text. For example, given a sentence like "I looked at my [MASK] and found that [MASK] was late," the model needs to predict the most likely candidates for the masked words represented by [MASK]. BERT was pre-trained on the BookCorpus and English Wikipedia. (A quick sketch of masked-word prediction follows below.)
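As a quick sketch of what masked language modeling looks like in practice (using bert-base-uncased so that the [MASK] token from the example works as written; the predictions you see may differ):
from transformers import pipeline

# Sketch: masked-word prediction with a BERT-style fill-mask pipeline.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
preds = fill_mask("I looked at my [MASK] and found that I was late.")
for p in preds:
    print(p["token_str"], round(p["score"], 3))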
GPT and BERT set new technical standards for various NLP benchmarks and ushered in the era of Transformers.
However, because different research labs released their models in incompatible frameworks (PyTorch or TensorFlow), it was not easy for NLP practitioners to port these models into their applications. With the release of Transformers, a unified API spanning over 50 architectures was gradually established. This library catalyzed an explosive growth in Transformer research and quickly permeated the NLP community, making it easy today to integrate these models into many real-life applications. Let’s take a look!
Hugging Face Transformers: Bridging the Gap
Applying a new machine learning architecture to a new task can be a complex job, typically involving the following steps:
- Implement the model architecture in code, usually based on PyTorch or TensorFlow.
- Load pre-trained weights from a server (if available).
- Preprocess the inputs, pass them to the model, and apply some task-specific post-processing.
- Implement data loaders and define loss functions and optimizers to train the model.
Each of these steps requires custom logic tailored for each model and task. Traditionally (but not always!), when a research group publishes a new paper, they also release the code along with the model weights. However, this code is rarely standardized and often requires days of engineering to adapt to new use cases.
This is where Hugging Face Transformers comes to the rescue for NLP practitioners! It provides a standardized interface to a wide range of Transformer models, along with code and tools to adapt these models to new use cases. The library currently supports three major deep learning frameworks (PyTorch, TensorFlow, and JAX) and allows you to switch easily between them. In addition, it offers task-specific heads, so you can easily fine-tune transformers on downstream tasks like text classification, named entity recognition, and question answering. This reduces the time a practitioner needs to train and test a handful of models from a week to an afternoon!
You will see this in the next section, where we show that with just a few lines of code, Transformers can be applied to solve some of the most common NLP applications you might encounter in the wild.
Transformers Application Examples
Every NLP task starts with a piece of text, like the following (fabricated) customer feedback about an online order.
text = """Dear Amazon, last week I ordered an Optimus Prime action figure from your online store in Germany. Unfortunately, when I opened the package, I discovered to my horror that I had been sent an action figure of Megatron instead! As a lifelong enemy of the Decepticons, I hope you can understand my dilemma. To resolve the issue, I demand an exchange of Megatron for the Optimus Prime figure I ordered. Enclosed are copies of my records concerning this purchase. I expect to hear from you soon. Sincerely, Bumblebee."""
Depending on your application, the text you are dealing with might be a legal contract, a product description, or something entirely different. In the case of customer feedback, you might want to know whether the feedback is positive or negative. This task is called sentiment analysis, which is part of the broader topic of text classification that we will explore in Chapter 2. Now, let’s take a look at what is needed to extract sentiment from our text using Transformers.
Text Classification
As we will see in the following chapters, Transformers have a hierarchical API that allows you to interact with the library at different levels of abstraction. In this chapter, we will start with a pipeline that abstracts all the steps needed to convert raw text into a set of predictions from a fine-tuned model.
In Transformers, we instantiate a pipeline by calling the pipeline() function and providing the name of the task we are interested in.
from transformers import pipeline
classifier = pipeline("text-classification")
The first time you run this code, you will see some progress bars as the pipeline automatically downloads the model weights from the Hugging Face Hub. When you instantiate the pipeline a second time, the library will notice that you have already downloaded the weights and will use the cached version instead. By default, the text classification pipeline uses a model designed for sentiment analysis, but it also supports multi-class and multi-label classification.
Now that we have our pipeline, let’s generate some predictions! Each pipeline accepts a string of text (or a list of strings) as input and returns a list of predictions. Each prediction is a Python dictionary, so we can display them nicely as a DataFrame using Pandas.
import pandas as pd
outputs = classifier(text)
pd.DataFrame(outputs)

In this case, the model is very confident that the text has a negative sentiment, which makes sense given that we are dealing with a complaint from an angry customer. Note that for the sentiment analysis task, the pipeline only returns one of the POSITIVE or NEGATIVE labels, since the other can be inferred by computing 1 - score.
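If you want the probabilities for both labels rather than only the top one, the pipeline can also return all scores; the exact keyword has changed across library versions (recent releases use top_k=None, older ones used return_all_scores=True), so treat this as a sketch:
# Sketch: ask the classifier for scores on every label, not just the top one.
# Recent transformers versions use top_k=None; older ones used return_all_scores=True.
all_scores = classifier(text, top_k=None)
print(all_scores)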
Now let’s look at another common task: identifying named entities in the text.
Named Entity Recognition (NER)
Predicting the sentiment of customer feedback is a good first step, but you often want to know whether the feedback is about a specific product or service. In NLP, real-world objects like products, locations, and people are referred to as named entities, and extracting them from text is called named entity recognition (NER). We can apply NER by loading the corresponding pipeline and inputting our customer review.
ner_tagger = pipeline("ner", aggregation_strategy="simple")
outputs = ner_tagger(text)
pd.DataFrame(outputs)

You can see that the pipeline detected all the entities and assigned a category to each one, such as ORG (organization), LOC (location), or PER (person). Here, we used the aggregation_strategy parameter to group words into entities based on the model’s predictions. For example, the entity “Optimus Prime” consists of two words but is assigned to a single category, MISC (miscellaneous). The scores tell us how confident the model is about the entities it recognized. We can see that it is least confident about “Decepticons” and the first occurrence of “Megatron”, which it failed to group as a single entity.
Note:
Did you see those strange hash symbols (#) in the word column of the previous table? These are produced by the model’s tokenizer, which splits words into atomic units called tokens. You will learn all about tokenization in Chapter 2.
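To see where these pieces come from, you can call a tokenizer directly. Here is a small sketch with bert-base-uncased (an illustrative choice, not necessarily the model behind this NER pipeline); Chapter 2 covers this properly.
from transformers import AutoTokenizer

# Sketch: WordPiece tokenization splits rare words into subword pieces,
# and continuation pieces are prefixed with ##.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("Bumblebee ordered an Optimus Prime action figure"))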
Extracting all named entities from text is good, but sometimes we want to ask more targeted questions. This is where we can use question answering.
Question Answering Systems
In question answering, we provide the model with a passage of text called the context, along with a question whose answer we would like to extract. The model then returns the span of text corresponding to the answer. Let's see what we get when we ask a specific question about the customer feedback.
reader = pipeline("question-answering")
question = "What does the customer want?"
outputs = reader(question=question, context=text)
pd.DataFrame([outputs])

We can see that along with the answer, the pipeline also returns the start and end integers, which correspond to the character indices of the answer span found (just like NER tags). There are several types of question answering, which we will explore in Chapter 7. The type of question answering shown above is called extractive question answering because the answer is directly extracted from the text.
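Because start and end are character offsets into the context, the answer can also be recovered by slicing the original string directly, as in this quick sketch:
# Sketch: the start/end fields index into the context string.
start, end = outputs["start"], outputs["end"]
print(text[start:end])   # prints the same span as outputs["answer"]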
With this approach, you can quickly read and extract relevant information from customer feedback. But what if you receive a mountain of lengthy complaints and don’t have time to read through them all? Let’s see how summarization models can help!
Summarization
The goal of text summarization is to take a long piece of text as input and generate a short version containing all the relevant facts. This is a much more complex task than before, as it requires the model to produce coherent text. In a pattern that should be familiar by now, we can instantiate a summarization pipeline as follows:
summarizer = pipeline("summarization")
outputs = summarizer(text, max_length=45, clean_up_tokenization_spaces=True)
print(outputs[0]['summary_text'])
Bumblebee ordered an Optimus Prime action figure from your online store in Germany. Unfortunately, when I opened the package, I discovered to my horror that I had been sent an action figure of Megatron instead.
This summary isn’t too bad! While some content from the original is replicated, the model is able to grasp the essence of the issue and correctly identify that “Bumblebee” (mentioned at the end) is the author of the complaint. In this example, you can also see that we passed some keyword arguments to the pipeline, like max_length and clean_up_tokenization_spaces, which allow us to adjust the output at runtime.
But what happens when the feedback you receive is described in a language you don’t understand? You could use Google Translate, or you could use your own Transformers to translate for you!
Machine Translation
Like summarization, translation is a task where the output consists of generated text. Let’s use a translation pipeline to translate an English text into German.
translator = pipeline("translation_en_to_de", model="Helsinki-NLP/opus-mt-en-de")
outputs = translator(text, clean_up_tokenization_spaces=True, min_length=100)
print(outputs[0]['translation_text'])
Sehr geehrter Amazon, letzte Woche habe ich eine Optimus Prime Action Figur aus Ihrem Online-Shop in Deutschland bestellt. Leider, als ich das Paket öffnete, entdeckte ich zu meinem Entsetzen, dass ich stattdessen eine Action Figur von Megatron geschickt worden war! Als lebenslanger Feind der Decepticons, Ich hoffe, Sie können mein Dilemma verstehen. Um das Problem zu lösen, Ich fordere einen Austausch von Megatron für die Optimus Prime Figur habe ich bestellt. Anbei sind Kopien meiner Aufzeichnungen über diesen Kauf. Ich erwarte, bald von Ihnen zu hören. Aufrichtig, Bumblebee.
Again, the model produced a very good translation, correctly using the formal pronouns in German, such as “Ihrem” and “Sie”. Here we also showed how to override the default model in the pipeline, selecting the best model for your application—you can find thousands of models for various language pairs on the Hugging Face Hub. Before we take a step back to look at the entire Hugging Face ecosystem, let’s check out one last application.
Text Generation
Suppose you want to provide faster responses to customer feedback by accessing an autocomplete feature. With a text generation model, you could do so as follows:
generator = pipeline("text-generation")
response = "Dear Bumblebee, I am sorry to hear that your order was mixed up."
prompt = text + "\n\nCustomer service response:\n" + response
outputs = generator(prompt, max_length=200)
print(outputs[0]['generated_text'])
Dear Amazon, last week I ordered an Optimus Prime action figure from your online store in Germany. Unfortunately, when I opened the package, I discovered to my horror that I had been sent an action figure of Megatron instead! As a lifelong enemy of the Decepticons, I hope you can understand my dilemma. To resolve the issue, I demand an exchange of Megatron for the Optimus Prime figure I ordered. Enclosed are copies of my records concerning this purchase. I expect to hear from you soon. Sincerely, Bumblebee. Customer service response: Dear Bumblebee, I am sorry to hear that your order was mixed up. The order was completely mislabeled, which is very common in our online store, but I can appreciate it because it was my understanding from this site and our customer service of the previous day that your order was not made correct in our mind and that we are in a process of resolving this matter. We can assure you that your order
Okay, maybe we wouldn’t want to use this completion to appease Bumblebee, but you get the general idea.
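Text generation samples from the model, so each run will produce a different continuation; while experimenting it can be handy to fix the random seed, for example with the set_seed helper (a small sketch, not something the output above depends on):
from transformers import set_seed

# Sketch: fix the random seed so repeated generation runs are comparable.
set_seed(42)
outputs = generator(prompt, max_length=200)
print(outputs[0]["generated_text"])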
Now that you’ve seen some cool applications of Transformers models, you might wonder where training happens. All the models we used in this chapter are publicly available and have been fine-tuned for the tasks at hand. However, generally speaking, you would want to fine-tune the models on your own data, and in the following chapters, you will learn how to do that.
However, training a model is just a small part of any NLP project—being able to effectively handle data, share results with colleagues, and make your work reproducible are also key components. Fortunately, Transformers are surrounded by a large ecosystem of useful tools that support most modern machine learning workflows. Let’s take a look.
The Hugging Face Ecosystem
What started with Hugging Face Transformers has rapidly grown into a whole ecosystem of libraries and tools to accelerate your NLP and machine learning projects. As shown in Figure 1-9, the Hugging Face ecosystem consists of two main parts: a family of libraries and the Hub. The libraries provide the code, while the Hub provides pre-trained model weights, dataset scripts, evaluation metrics, and more. In this section we will take a brief look at the various components, skipping Transformers since we have already discussed it and will see much more of it throughout the book.

Hugging Face Model Hub
As mentioned earlier, transfer learning is one of the key factors driving the success of Transformers because it makes it possible to reuse pre-trained models to complete new tasks. Therefore, being able to quickly load pre-trained models and use them for experimentation is crucial.
The Hugging Face Hub hosts over 20,000 models available for free. As shown in Figure 1-10, there are filters for tasks, frameworks, and datasets to help you browse the Hub and quickly find promising candidates. As we saw in the pipeline, loading a promising model into your code actually requires just one line of code. This makes experimenting with a wide range of models simple and allows you to focus on specific parts of your project.

In addition to model weights, the model hub also hosts dataset scripts and metrics scripts that allow you to reproduce published results or leverage additional data for your applications.
The model hub also provides model and dataset cards to document the contents of models and datasets and help you make informed decisions about whether they are suitable for you. One of the coolest features of the Hub is that you can try any model directly through interactive components for various specific tasks, as shown in Figure 1-11.

Let’s continue our journey with Tokenizers.
Note
Both PyTorch and TensorFlow also provide hubs of their own, which are worth checking if a specific model or dataset is not available on the Hugging Face Hub.
Hugging Face Tokenizers
Behind every pipeline instance we have seen in this chapter is a tokenization step that splits the raw text into smaller pieces called tokens. We will see the details of how this works in Chapter 2, but for now it is enough to know that tokens can be words, parts of words, or even single characters such as punctuation. Transformer models are trained on numerical representations of these tokens, so getting this step right is quite important for the whole NLP project.
Tokenizers provides many tokenization strategies and is extremely fast at tokenizing text thanks to its Rust backend (Rust is a compiled systems language, comparable in speed to C/C++). It also takes care of all the pre- and post-processing steps, such as normalizing the inputs and converting the model outputs into the required format. With Tokenizers, we can load a tokenizer in the same way we load pre-trained model weights with Transformers.
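A brief sketch of what this looks like (the checkpoint name is just an example); the tokenizer turns raw text into the numerical inputs a model expects:
from transformers import AutoTokenizer

# Sketch: a fast (Rust-backed) tokenizer produces token ids and an attention mask.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
print(tokenizer.is_fast)                 # True when the Rust implementation is used
encoded = tokenizer("Transformers are great!")
print(encoded["input_ids"])
print(encoded["attention_mask"])
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))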
We need datasets and metrics to train and evaluate models, so let's take a look at Hugging Face Datasets, which is in charge of that.
Hugging Face Datasets
Loading, processing, and storing datasets can be a tedious process, especially when datasets become too large to fit into your laptop’s RAM. Additionally, you often need to implement various scripts to download data and convert it into standard formats.
Hugging Face Datasets simplifies this process by providing a standard interface to find thousands of datasets available on the Hub. It also offers smart caching (so you don’t have to preprocess again every time you run your code) and avoids RAM limitations by utilizing a special mechanism called memory mapping, which stores the contents of files in virtual memory, allowing multiple processes to modify a file more efficiently. The library also interoperates with popular frameworks like Pandas and NumPy, so you don’t have to leave the comfort of your favorite data processing tools.
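A short sketch of what this looks like in practice, using the imdb dataset from the Hub as an example (any other Hub dataset name would work the same way):
from datasets import load_dataset

# Sketch: download a dataset from the Hub; it is cached and memory-mapped,
# so re-running is fast and the data does not need to fit into RAM.
imdb = load_dataset("imdb")
print(imdb)                              # splits, column names, number of rows
print(imdb["train"][0]["text"][:200])    # peek at the first training example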
However, having a good dataset and a strong model is worthless if you cannot reliably measure performance. Unfortunately, classic NLP metrics have many different implementations that may vary slightly, leading to misleading results. By providing scripts for many metrics, Datasets helps make experiments more reproducible and results more trustworthy.
With the Transformers, Tokenizers, and Datasets libraries, we have everything we need to train our own Transformer models. However, as we will see in Chapter 10, in some cases, we may need fine-grained control over the training loop. This is where the last library in the ecosystem, Accelerate, comes into play.
Hugging Face Accelerate
If you’ve ever had to write your own training scripts in PyTorch, then you may have felt some headaches when trying to port code running on your laptop to run on your organization’s cluster. Accelerate adds an abstraction layer to your normal training loop, responsible for handling all the custom logic required for training infrastructure. This effectively speeds up your workflow by simplifying changes to the infrastructure when necessary.
This summarizes the core components of the Hugging Face open-source ecosystem. But before we end this chapter, let’s look at some common challenges encountered when trying to deploy Transformers models in the real world.
Main Challenges of Using Transformers
- Language: English is the primary target language of NLP research, and while models exist for several other languages, it is hard to find pre-trained models for small or low-resource languages. In Chapter 4, we will explore multilingual Transformers and their ability to perform zero-shot cross-lingual transfer.
- Data availability: Although transfer learning dramatically reduces the amount of labeled training data our models need, it is still a lot compared with how much a human needs to perform the same task. Chapter 9 focuses on addressing situations with little to no labeled data.
- Long texts: Self-attention works very well on paragraph-length texts, but it becomes very expensive when we move to longer texts such as whole documents. Chapter 11 discusses ways to mitigate this.
- Opacity: Like other deep learning models, Transformers are largely opaque. It is difficult or impossible to explain "why" a model made a certain prediction. This is an especially hard challenge when these models are deployed to make critical decisions. We will explore some methods for probing the errors of Transformer models in Chapters 2 and 4.
- Bias: Transformer models are predominantly pre-trained on text data from the internet, which imprints all of the biases present in that data onto the models. Ensuring these models are neither racist, sexist, nor worse is a challenging task. We will discuss some of these issues in more detail in Chapter 10.
Daunting as these challenges are, many of them can be overcome, and beyond the specific chapters just mentioned we will touch on these topics in almost every chapter to come.
Summary
In the upcoming chapters, you will learn how to adapt Transformers to a wide range of use cases, such as building a text classifier, or a lightweight model for production, or even training a language model from scratch. We will take a hands-on approach, which means that every concept involved will have corresponding code you can run on Google Colab or your own GPU machine.
Now that we have grasped the fundamental concepts behind Transformers, it’s time to get hands-on with our first application: text classification. That is the topic of the next chapter!
