Reprinted from the public account: Harbin Institute of Technology SCIR
Authors: Harbin Institute of Technology SCIR Di Donglin Liu Yuanxing Zhu Qingfu Hu Jingwen
Introduction
With the development of artificial intelligence, more and more deep learning frameworks have emerged, such as PyTorch, TensorFlow, Keras, MXNet, Theano, and PaddlePaddle. These foundational frameworks provide the basic toolkit needed to build a model. For NLP tasks, however, we still have to write a lot of tedious code ourselves, including data preprocessing and the training loop. People therefore usually build their models on top of NLP-specific deep learning frameworks such as OpenNMT, ParlAI, and AllenNLP, which make it quick to implement training and prediction for basic NLP tasks. But when we need to modify those basic tasks, the encapsulation of the framework code often gets in the way. This article therefore focuses on how to use these frameworks to implement custom models, to help readers quickly understand how they are used.
We will first introduce Tensor2Tensor, a widely used framework in the NLP/CV field, which provides common basic models used in NLP/CV. Next, we will introduce AllenNLP, which is developed on the PyTorch platform and provides a unified development architecture for NLP models. Then we will discuss OpenNMT and ParlAI, frameworks commonly used in two important NLP subfields: neural machine translation and dialogue systems. Through the introduction of these four frameworks, we hope to help readers understand how NLP frameworks are used across different development platforms and fields.
Framework Name | Application Field | Development Platform |
---|---|---|
Tensor2Tensor | NLP/CV | TensorFlow |
AllenNLP | NLP | PyTorch |
OpenNMT | NLP – Machine Translation | PyTorch/TensorFlow |
ParlAI | NLP – Dialogue | PyTorch |
1. Tensor2Tensor
Tensor2Tensor[1] is a comprehensive library based on TensorFlow, which includes some basic models for CV and NLP, such as LSTM, CNN, etc., as well as some more advanced models like various GANs and Transformers. It supports various tasks in NLP comprehensively and is very easy to get started with.
This repository is still under active development; at the time of writing it had 3897 commits, 66 releases, and 178 contributors. For the 2017 paper “Attention Is All You Need”, this repository provided the official implementation of the Transformer model, which was later ported to other platforms.
Note: Details may change from version to version as the library is updated.
1. Install CUDA 9.0 (must be 9.0, cannot be 9.2)
2. Install TensorFlow (currently 1.12)
3. Install Tensor2Tensor (refer to the official website for installation)
1. Data Preprocessing
This step involves writing some preprocessing code according to your task, such as string formatting, generating feature vectors, etc.
2. Write Custom Problem:
- Write the custom problem code, and make sure to add the decorator @registry.register_problem before the custom class definition.
- The class name of the custom problem must be in camel case, and the .py file name must be the corresponding underscore-style (snake-case) form of the class name.
- The class must inherit from a Problem parent class. T2T already provides Problem classes for generating data; find the parent class that matches your type of problem (you can run t2t-datagen to see the list of problems).
- You must import the custom problem file in the __init__.py file.
3. Use t2t-datagen to convert your preprocessed data into T2T-formatted datasets (note the paths)
- Run t2t-datagen --help or t2t-datagen --helpfull to see the options. For example:
cd scripts && t2t-datagen --t2t_usr_dir=./ --data_dir=../train_data --tmp_dir=../tmp_data --problem=my_problem
- If the output format of the custom problem code is incorrect, this command will report an error.
4. Use t2t-trainer to train with the formatted dataset
- Run t2t-trainer --help or t2t-trainer --helpfull to see the options. For example:
cd scripts && t2t-trainer --t2t_usr_dir=./ --problem=my_problem --data_dir=../train_data --model=transformer --hparams_set=transformer_base --output_dir=../output --train_steps=20 --eval_steps=100
5. Use t2t-decoder to predict on the test set (note the paths)
- To decode with a specific checkpoint, edit the step number at the end of the first line of the checkpoint file: model_checkpoint_path: "model.ckpt-xxxx". For example:
cd scripts && t2t-decoder --t2t_usr_dir=./ --problem=my_problem --data_dir=../train_data --model=transformer --hparams_set=transformer_base --output_dir=../output --decode_hparams="beam_size=5,alpha=0.6" --decode_from_file=../decode_in/test_in.txt --decode_to_file=../decode_out/test_out.txt
6. Use t2t-exporter to export the trained model
7. Analyze the results
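Step 5 describes editing the first line of the checkpoint file by hand. The same edit can be sketched in Python as follows. This is a minimal sketch that assumes the checkpoint file in the output directory begins with a line of the form model_checkpoint_path: "model.ckpt-N", as described above; the helper name point_checkpoint_at_step and the demo directory are hypothetical.

```python
import os
import re
import tempfile


def point_checkpoint_at_step(output_dir, step):
    """Rewrite the first line of <output_dir>/checkpoint to select a step.

    Hypothetical helper: it only edits the text file and does not check
    that the corresponding model.ckpt-<step> files actually exist.
    """
    path = os.path.join(output_dir, "checkpoint")
    with open(path) as f:
        lines = f.read().splitlines()
    # Replace only the trailing step number of the first line.
    lines[0] = re.sub(r'-\d+"$', '-%d"' % step, lines[0])
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")
    return lines[0]


# Demonstration on a throwaway directory with a fake checkpoint file.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "checkpoint"), "w") as f:
    f.write('model_checkpoint_path: "model.ckpt-2000"\n')
    f.write('all_model_checkpoint_paths: "model.ckpt-1000"\n')
    f.write('all_model_checkpoint_paths: "model.ckpt-2000"\n')

first_line = point_checkpoint_at_step(demo_dir, 1000)
print(first_line)
```

Only the first line is changed; the all_model_checkpoint_paths entries are left as they were.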
# coding=utf-8
import pickle

from tensor2tensor.utils import registry
from tensor2tensor.data_generators import problem, text_problems


@registry.register_problem
class AttentionGruFeature(text_problems.Text2ClassProblem):
    ROOT_DATA_PATH = '../data_manager/'
    PROBLEM_NAME = 'attention_gru_feature'

    @property
    def is_generate_per_split(self):
        # generate_samples is called once for each split listed below.
        return True

    @property
    def dataset_splits(self):
        return [{
            "split": problem.DatasetSplit.TRAIN,
            "shards": 5,
        }, {
            "split": problem.DatasetSplit.EVAL,
            "shards": 1,
        }]

    @property
    def approx_vocab_size(self):
        return 2 ** 10  # ~1k vocab suffices for this small dataset.

    @property
    def num_classes(self):
        return 2

    @property
    def vocab_filename(self):
        return self.PROBLEM_NAME + ".vocab.%d" % self.approx_vocab_size

    def generate_samples(self, data_dir, tmp_dir, dataset_split):
        # The samples come from a fixed pickle file, so these arguments are unused.
        del data_dir
        del tmp_dir
        del dataset_split
        with open('{}self_antecedent_generate_sentences.pkl'.format(self.ROOT_DATA_PATH), 'rb') as f:
            _sentences = pickle.load(f)
        for _sent in _sentences:
            yield {
                "inputs": _sent.input_vec_attention_feature,
                "label": _sent.antecedent_label
            }
# PROBLEM_NAME='attention_gru_feature'
# DATA_DIR='../train_data_atte_feature'
# OUTPUT_DIR='../output_atte_feature'
# t2t-datagen --t2t_usr_dir=. --data_dir=$DATA_DIR --tmp_dir=../tmp_data --problem=$PROBLEM_NAME
# t2t-trainer --t2t_usr_dir=. --data_dir=$DATA_DIR --problem=$PROBLEM_NAME --model=transformer --hparams_set=transformer_base --output_dir=$OUTPUT_DIR
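The naming rule from step 2 is why the class AttentionGruFeature above is selected with the problem name attention_gru_feature. The sketch below illustrates that camel-case to underscore-style mapping. Note that camel_to_snake is a hypothetical helper written for this article, not T2T's own conversion code, and it assumes simple class names without runs of consecutive capital letters.

```python
import re


def camel_to_snake(class_name):
    """Convert a CamelCase class name to snake_case (illustrative only)."""
    # Insert an underscore before every capital letter except the first,
    # then lowercase the whole string.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", class_name).lower()


print(camel_to_snake("AttentionGruFeature"))  # attention_gru_feature
print(camel_to_snake("MyProblem"))            # my_problem
```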
T2T is an easy-to-learn framework from Google, built with the help of a large community of contributors. It wraps lower-level TensorFlow and covers most CV and NLP tasks, and many mainstream, mature models are already implemented. Many tasks can be completed simply by inheriting from or implementing the predefined interfaces in the framework. It is very friendly to beginners, and the documentation is kept up to date. By reading the documentation (and the error messages) carefully, you can learn to use the framework, which makes it convenient for reproducing many established models.
2. AllenNLP
AllenNLP is an NLP research library based on PyTorch that provides developers with state-of-the-art models for a range of language tasks. The official website offers a good introductory tutorial[2] that lets beginners learn how to use AllenNLP in about 30 minutes.
Since AllenNLP already implements much of the tedious preprocessing and training machinery for us, we usually only need to write a DatasetReader, a Model, and, for special training procedures, a Trainer.
Here is example code for a DatasetReader.
from typing import Dict, Iterator

from allennlp.data import Instance
from allennlp.data.fields import TextField
from allennlp.data.dataset_readers import DatasetReader
from allennlp.data.token_indexers import TokenIndexer, SingleIdTokenIndexer
from allennlp.data.tokenizers import WordTokenizer, Tokenizer


@DatasetReader.register('custom')
class CustomReader(DatasetReader):
    def __init__(self, tokenizer: Tokenizer = None, token_indexers: Dict[str, TokenIndexer] = None) -> None:
        super().__init__(lazy=False)
        self.tokenizer = tokenizer or WordTokenizer()
        self.word_indexers = token_indexers or {"word": SingleIdTokenIndexer('word')}

    def text_to_instance(self, _input: str) -> Instance:
        fields = {}
        tokenized_input = self.tokenizer.tokenize(_input)
        fields['input'] = TextField(tokenized_input, self.word_indexers)
        return Instance(fields)

    def _read(self, file_path: str) -> Iterator[Instance]:
        with open(file_path) as f:
            for line in f:
                yield self.text_to_instance(line)
First, you need to customize the _read function to define how the dataset is read, returning each constructed Instance through a yield statement. Then, use the text_to_instance function to convert the text into an Instance. In the text_to_instance function, you need to tokenize the input text and then construct the Field.
self.tokenizer is used to split the text into tokens, with options for word-level or character-level tokenization. self.word_indexers is used to index tokens and convert them to tensors. There are many types of TokenIndexer, and before implementing your model you can check the official documentation for types that may suit your needs. If you need to build multiple vocabularies, such as source and target language vocabularies, you need to define another indexer dictionary alongside self.word_indexers here. Different indexers are distinguished in the vocabulary by the namespace passed to the SingleIdTokenIndexer constructor, which is the 'word' argument in the __init__ method of the reader.
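To illustrate why the namespace matters when several vocabularies coexist, here is a minimal plain-Python sketch. NamespacedVocab is a hypothetical stand-in written for this example and is not AllenNLP's Vocabulary class; it only shows that the same token can receive independent ids in a "source" and a "target" namespace.

```python
from collections import defaultdict


class NamespacedVocab:
    """Toy vocabulary keeping one token-to-id table per namespace."""

    def __init__(self):
        self._tables = defaultdict(dict)

    def add(self, token, namespace):
        table = self._tables[namespace]
        # Assign the next free id the first time a token is seen.
        if token not in table:
            table[token] = len(table)
        return table[token]

    def index(self, tokens, namespace):
        return [self._tables[namespace][t] for t in tokens]


vocab = NamespacedVocab()
for tok in ["the", "cat", "sat"]:
    vocab.add(tok, "source")
for tok in ["le", "chat", "the"]:
    vocab.add(tok, "target")

print(vocab.index(["the", "sat"], "source"))  # [0, 2]
print(vocab.index(["the"], "target"))         # [2]
```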
Writing a Model is similar to implementing a model in PyTorch, but note that:
@Model.register('') registers the model under a name so that it can be selected from the Jsonnet configuration file (if you have multiple models, you can switch between them by editing the configuration value, without changing the code).
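The registration mechanism can be pictured as a name-to-class table that the configuration file indexes into. The sketch below is a simplified plain-Python illustration of that idea. Registry, the two model classes, and the config dictionary are all hypothetical, and AllenNLP's real registration machinery does considerably more.

```python
class Registry:
    """Minimal name -> class registry (illustrative, not AllenNLP code)."""

    _models = {}

    @classmethod
    def register(cls, name):
        def decorator(model_cls):
            cls._models[name] = model_cls
            return model_cls
        return decorator

    @classmethod
    def build(cls, config):
        # Copy so the caller's config is left untouched.
        params = dict(config)
        model_cls = cls._models[params.pop("type")]
        return model_cls(**params)


@Registry.register("lstm_tagger")
class LstmTagger:
    def __init__(self, hidden_size):
        self.hidden_size = hidden_size


@Registry.register("cnn_tagger")
class CnnTagger:
    def __init__(self, num_filters):
        self.num_filters = num_filters


# Switching models only requires editing the "type" entry of the config.
config = {"type": "cnn_tagger", "num_filters": 64}
model = Registry.build(config)
print(type(model).__name__)  # CnnTagger
```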
Since AllenNLP encapsulates the Trainer, we need to implement or select evaluation metrics inside the model so that they are computed automatically during training. To do this, first define the metric in the __init__ method. Existing metrics are listed in the official documentation[3]; if none of them fits, you need to write your own.
self.acc = CategoricalAccuracy()
Then, call the metric in the forward method to update it:
self.acc(output, labels)
Finally, return the corresponding metrics as a dict from the get_metrics method:
def get_metrics(self, reset: bool = False) -> Dict[str, float]:
    return {"acc": self.acc.get_metric(reset)}
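The three steps above (create the metric, update it in forward, report it in get_metrics) follow an accumulate-then-report pattern. The sketch below mimics that pattern in plain Python. RunningAccuracy and ToyModel are hypothetical stand-ins for illustration and are not AllenNLP's CategoricalAccuracy or Model classes.

```python
class RunningAccuracy:
    """Toy metric: accumulates counts on call, reports on get_metric."""

    def __init__(self):
        self.correct = 0
        self.total = 0

    def __call__(self, predictions, labels):
        for pred, label in zip(predictions, labels):
            self.correct += int(pred == label)
            self.total += 1

    def get_metric(self, reset=False):
        accuracy = self.correct / self.total if self.total else 0.0
        if reset:
            self.correct = 0
            self.total = 0
        return accuracy


class ToyModel:
    def __init__(self):
        self.acc = RunningAccuracy()

    def forward(self, inputs, labels):
        # Stand-in "prediction": label 1 for positive inputs, else 0.
        output = [1 if x > 0 else 0 for x in inputs]
        self.acc(output, labels)
        return {"output": output}

    def get_metrics(self, reset=False):
        return {"acc": self.acc.get_metric(reset)}


model = ToyModel()
model.forward([0.5, -1.0], [1, 1])  # one of two correct
model.forward([2.0, -3.0], [1, 0])  # two of two correct
metrics = model.get_metrics(reset=True)
print(metrics)  # {'acc': 0.75}
```

Passing reset=True clears the counters, which is how a trainer would typically start each epoch from a clean state.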
Generally, you can call the AllenNLP Trainer directly to start training automatically. However, for special training procedures such as GAN training[4], the stock Trainer is not enough; you need to unroll the Trainer and write each iteration step yourself, for which you can refer to the trainer written in [4].
For example code to learn AllenNLP from, you can refer to [5]. Since AllenNLP is based on PyTorch, its coding style is basically consistent with PyTorch, so if you know how to use PyTorch, there should be no obstacle to getting started with AllenNLP. The code comments are quite comprehensive and the module encapsulation is flexible, so AllenNLP code is as easy to modify as pure PyTorch. Of course, this flexibility also means that complex functionality has to be implemented by someone, and AllenNLP does not yet ship many ready-made implementations, so you may need to write most of them yourself. AllenNLP also depends on many Python libraries and is still being actively updated.
3. OpenNMT
OpenNMT[6] is an open-source neural machine translation project that adopts the commonly used encoder-decoder structure, and can also be used for tasks such as text summarization and response generation. Currently, this project has developed versions for both PyTorch and TensorFlow, allowing users to choose as needed. This article takes the PyTorch version[7] as an example for introduction.
As a typical machine translation framework, OpenNMT’s data mainly consists of source and target parts, corresponding to the source language input and the target language translation in machine translation. OpenNMT uses the Field data structure from TorchText to represent each part. During user customization, if you need to add additional data beyond source and target, you can refer to the construction method of the source field or target field, such as constructing a custom user_data:
fields["user_data"] = torchtext.data.Field(
    init_token=BOS_WORD, eos_token=EOS_WORD,
    pad_token=PAD_WORD,
    include_lengths=True)
Here init_token, eos_token, and pad_token are the user-defined start-of-sequence, end-of-sequence, and padding tokens, respectively. When include_lengths is True, the field returns both the processed data and the length of each example.
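As a rough illustration of what these options do to a batch, the sketch below wraps each token sequence in start and end tokens, pads every sequence to the longest one, and returns the lengths as well. It is a plain-Python approximation written for this article, not TorchText code; the function name process_batch and the token strings are assumptions, and in this sketch the lengths count the added start and end tokens.

```python
BOS_WORD = "<s>"      # assumed token strings for this sketch
EOS_WORD = "</s>"
PAD_WORD = "<blank>"


def process_batch(batch, include_lengths=True):
    """Add BOS/EOS, pad to the longest example, optionally return lengths."""
    wrapped = [[BOS_WORD] + tokens + [EOS_WORD] for tokens in batch]
    lengths = [len(tokens) for tokens in wrapped]
    max_len = max(lengths)
    padded = [tokens + [PAD_WORD] * (max_len - len(tokens)) for tokens in wrapped]
    if include_lengths:
        return padded, lengths
    return padded


padded, lengths = process_batch([["hello", "world"], ["hi"]])
print(padded[1])  # ['<s>', 'hi', '</s>', '<blank>']
print(lengths)    # [4, 3]
```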
OpenNMT implements an attention-based encoder-decoder model. The framework defines interfaces for the encoder and the decoder, and various encoder and decoder structures, such as CNN and RNN encoders, are implemented under these interfaces for users to combine as needed. Users who need a custom module can design it against the same interfaces, which ensures that it can be combined with the other modules in OpenNMT. The encoder and decoder interfaces are as follows:
import torch.nn as nn


class EncoderBase(nn.Module):
    def forward(self, input, lengths=None, hidden=None):
        raise NotImplementedError


class RNNDecoderBase(nn.Module):
    def forward(self, input, context, state, context_lengths=None):
        raise NotImplementedError
Training in OpenNMT is controlled by the Trainer class in Trainer.py, which offers limited customization and implements only the basic sequence-to-sequence training process. Complex training procedures such as multi-task or adversarial training require significant modifications to this class.
OpenNMT provides different implementations based on the two major frameworks, PyTorch and TensorFlow, which can meet the needs of most users. The encapsulation of the basic framework results in a loss of some flexibility, but for text generation tasks under the encoder-decoder structure, it can save time on details like data formatting and interface definitions, allowing users to focus more on their custom modules to quickly build the required models.
4. ParlAI
ParlAI is a platform developed by Facebook for dialogue research, used for sharing, training, and evaluating dialogue models across many dialogue tasks[8]. It can be used to train and test dialogue models, supports multi-task training over many datasets, and integrates Amazon Mechanical Turk for data collection and human evaluation.
The basic concepts in ParlAI are:
- The world defines the environment in which agents interact with each other. A world must implement a parley method. Each call to parley performs one round of interaction, usually involving one action from each agent.
- An agent can be a person, a simple bot that repeats anything it hears, a perfectly tuned neural network, a dataset being read, or anything else that may send messages or interact with its environment. Agents have two main methods they need to define:
def observe(self, observation): # Updates internal state with observation
def act(self): # Generates action based on internal state
- Observations are the objects returned by an agent's act function; they are so named because they are passed as input to the observe function of other agents. This is the primary way messages are passed between agents and the environment in ParlAI. Observations typically take the form of Python dictionaries containing different types of information.
- A teacher is a special type of agent. Teachers implement the act and observe functions like all agents, but they also track metrics, returned through the report function, such as the number of questions they have asked or the number of correct answers they have received.
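The interplay of these concepts can be sketched in a few lines of plain Python. The classes below (EchoAgent, ToyTeacher, ToyWorld) are hypothetical simplifications written for this article and do not use ParlAI's actual base classes. They only show how one parley call passes observations between a teacher and another agent, and how the teacher tracks the metrics it reports.

```python
class EchoAgent:
    """Agent that repeats the text of whatever it last observed."""

    def __init__(self):
        self.last = None

    def observe(self, observation):
        self.last = observation

    def act(self):
        return {"text": self.last["text"]}


class ToyTeacher:
    """Agent that asks questions from a dataset and scores the replies."""

    def __init__(self, examples):
        self.examples = examples   # list of (question, answer) pairs
        self.index = 0
        self.asked = 0
        self.correct = 0

    def act(self):
        question, _ = self.examples[self.index]
        self.asked += 1
        return {"text": question}

    def observe(self, observation):
        _, answer = self.examples[self.index]
        self.correct += int(observation["text"] == answer)
        self.index += 1

    def report(self):
        return {"asked": self.asked, "correct": self.correct}


class ToyWorld:
    """Environment in which a teacher and one agent take turns."""

    def __init__(self, teacher, agent):
        self.teacher = teacher
        self.agent = agent

    def parley(self):
        # One interaction: each agent acts once and observes the other.
        self.agent.observe(self.teacher.act())
        self.teacher.observe(self.agent.act())


teacher = ToyTeacher([("ping", "ping"), ("hello", "world")])
world = ToyWorld(teacher, EchoAgent())
for _ in range(2):
    world.parley()
print(teacher.report())  # {'asked': 2, 'correct': 1}
```

The echo agent gets the first example right only because its answer happens to equal the question, which is enough to show the metrics changing between rounds.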
The code in ParlAI contains several main folders[9]:
- core contains the main code of the framework;
- agents contains agents that can interact with different tasks;
- examples contains basic examples of various loops;
- tasks contains code for different tasks;
- mturk contains the code for setting up Mechanical Turk and sample MTurk tasks.
ParlAI internally encapsulates many dialogue tasks (such as ConvAI2) and evaluations (such as F1 scores and hits@1, etc.). Using existing data, code, and models in ParlAI for training and evaluation can quickly implement many baseline models in dialogue. However, due to the strong encapsulation of the code, it is not recommended to use it to build your model from scratch. If you want to build your model based on it, you can refer to the detailed tutorials on the official website[10].
Here is a simple example of training and evaluating a dialogue model using the internal data, code, and models (Train a Transformer on Twitter):
1. Print some examples from the dataset
python examples/display_data.py -t twitter
# display first examples from twitter dataset
2. Train the model
python examples/train_model.py -t twitter -mf /tmp/tr_twitter -m transformer/ranker -bs 10 -vtim 3600 -cands batch -ecands batch --data-parallel True
# train transformer ranker
3. Evaluate a pretrained model from the model zoo
python examples/eval_model.py -t twitter -m legacy:seq2seq:0 -mf models:twitter/seq2seq/twitter_seq2seq_model
# Evaluate seq2seq model trained on twitter from our model zoo
4. Output some predictions from the model
python examples/display_model.py -t twitter -mf /tmp/tr_twitter -ecands batch
# display predictions for model saved at specific file on twitter
ParlAI has its own set of abstractions, such as world, agent, and teacher. The code is heavily encapsulated and the codebase is large, so finding an intermediate result means tracing through layers of function calls, which makes modification difficult. ParlAI encapsulates many existing baseline models, allowing dialogue researchers to quickly implement baselines. ParlAI is still being updated, and the structure of the code may vary slightly between versions, but its core usage remains largely the same.
5. Conclusion
This article introduces methods for building custom models using four common frameworks. Tensor2Tensor is comprehensive but only supports TensorFlow. The biggest advantage of AllenNLP is that it simplifies the processes of data preprocessing, training, and prediction. The code is also flexible to modify, but some tools have not yet been implemented by the official team and need to be written by oneself. For traditional encoder-decoder structured text generation tasks, using OpenNMT can save a lot of time. However, if the model structure is more novel, building a model using OpenNMT can still be quite challenging. ParlAI internally encapsulates many dialogue tasks, making it easy for users to quickly reproduce relevant baseline models. However, due to its strong code encapsulation and unique patterns, building a model from scratch using ParlAI presents certain challenges. Each framework has its own advantages and disadvantages, so users should choose based on their own situations and usage methods. However, it is not recommended to try every framework, as mastering each one requires a certain time investment.
References
[1] https://github.com/tensorflow/tensor2tensor
[2] https://allennlp.org/tutorials
[3] https://allenai.github.io/allennlp-docs/api/allennlp.training.metrics.html
[4] http://www.realworldnlpbook.com/blog/training-a-shakespeare-reciting-monkey-using-rl-and-seqgan.html
[5] https://github.com/mhagiwara/realworldnlp
[6] http://opennmt.net/
[7] https://github.com/OpenNMT/OpenNMT-py
[8] http://parl.ai.s3-website.us-east-2.amazonaws.com/docs/tutorial_quick.html
[9] https://www.infoq.cn/article/2017/05/ParlAI-Facebook-AI
[10] http://parl.ai.s3-website.us-east-2.amazonaws.com/docs/tutorial_basic.html