1 Introduction
When building models on a text corpus, one unavoidable step is constructing a vocabulary. A common workflow is: ① tokenize the raw corpus; ② count the tokenized results with Counter and remove words that appear fewer times than a given threshold; ③ from the counts, build a dictionary and a list that convert words to indices and indices back to words.
Although this does not take much code, we can also construct a vocabulary directly from the text in a more convenient way.
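To make steps ① to ③ concrete, here is a minimal sketch of the manual workflow; the sample sentence, the frequency threshold, and the variable names word2idx / idx2word are only for illustration:
from collections import Counter

# ① tokenize the raw corpus (one pre-tokenized sample, taken from the English example below)
corpus = ["Two young , White males are outside near many bushes ."]
counter = Counter()
for line in corpus:
    counter.update(line.split())   # ② count the tokens

min_freq = 1                       # drop words that appear fewer than min_freq times
words = ['<unk>', '<pad>'] + [w for w, c in counter.items() if c >= min_freq]
word2idx = {w: i for i, w in enumerate(words)}   # ③ word -> index
idx2word = words                                 #    index -> word
print(word2idx['males'], idx2word[word2idx['males']])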
2 Constructing the Vocabulary
Before formally constructing the vocabulary, we need to define a tokenize method, i.e., how to split the original text. For example, whether to tokenize by word or by character in Chinese.
2.1 Defining the Tokenizer
For an English-like corpus, the text can be split directly on spaces, but commas and periods should also be separated from the words they follow. This part of the code can be implemented as follows:
def my_tokenizer(s):
    """
    Returns the result after tokenization
    """
    s = s.replace(',', " ,").replace(".", " .")
    return s.split()
As we can see, it is quite simple. For example, for the following text:
Two young, White males are outside near many bushes.
The result after tokenization is:
['Two', 'young', ',', 'White', 'males', 'are', 'outside', 'near', 'many', 'bushes', '.']
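As a quick sanity check, this output can be reproduced by calling the tokenizer defined above:
print(my_tokenizer("Two young, White males are outside near many bushes."))
# ['Two', 'young', ',', 'White', 'males', 'are', 'outside', 'near', 'many', 'bushes', '.']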
Of course, if it is a Chinese corpus, it can be implemented as follows:
import jieba

def tokenizer(s, word=False):
    if word:
        # split by character
        r = [w for w in s]
    else:
        # split by word with jieba
        s = jieba.cut(s, cut_all=False)
        r = " ".join(s).split()
    return r
Here, word=True indicates splitting by character. For example, for the following text:
问君能有几多愁,恰似一江春水向东流。
The result after tokenization is:
# Split by word
['问君', '能', '有', '几多', '愁', ',', '恰似', '一江春水向东流', '。']
# Split by character
['问', '君', '能', '有', '几', '多', '愁', ',', '恰', '似', '一', '江', '春', '水', '向', '东', '流', '。']
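The two results above can be reproduced by calling the tokenizer in both modes (the exact word-level segmentation may vary slightly with the jieba version and dictionary in use):
s = "问君能有几多愁,恰似一江春水向东流。"
print(tokenizer(s, word=False))  # split by word with jieba
print(tokenizer(s, word=True))   # split by character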
2.2 Building the Vocabulary
Having covered how tokenization is implemented, we can now formally construct a vocabulary with the Vocab class in torchtext.vocab, as shown in the following code:
from collections import Counter
from torchtext.vocab import Vocab

def build_vocab(tokenizer, filepath, word, min_freq, specials=None):
    if specials is None:
        specials = ['<unk>', '<pad>', '<bos>', '<eos>']
    counter = Counter()
    with open(filepath, encoding='utf8') as f:
        for string_ in f:
            counter.update(tokenizer(string_.strip(), word))
    return Vocab(counter, min_freq=min_freq, specials=specials)
In the code above, the specials argument specifies the special tokens; the loop traverses each sample in the file (one per line), tokenizes it, and updates the counts; the final line returns the constructed vocabulary.
For example, when constructing a vocabulary based on the following 3 lines of samples:
问君能有几多愁,恰似一江春水向东流。
年年岁岁花相似,岁岁年年人不同。
人面不知何处去,桃花依旧笑春风。
The result will be as follows:
if __name__ == '__main__':
    filepath = './data.txt'
    vocab = build_vocab(tokenizer, filepath, word=True, min_freq=1,
                        specials=['<unk>', '<pad>', '<bos>', '<eos>'])
    print(vocab.freqs)      # dictionary of each word's frequency in the corpus;
    # Counter({'年': 4, '岁': 4, ',': 3, '。': 3, '似': 2, '春': 2, '花': 2,...})
    print(vocab.itos)       # list of every word in the vocabulary;
    # ['<unk>', '<pad>', '<bos>', '<eos>', '岁', '年', '。', ',', '不', '人',...]
    print(vocab.itos[2])    # look up a word in the vocabulary by index;
    # <bos>
    print(vocab.stoi)       # dictionary mapping each word to its index;
    # {'<unk>': 0, '<pad>': 1, '<bos>': 2, '<eos>': 3, '岁': 4, '年': 5,...}
    print(vocab.stoi['岁'])  # look up the index of a word
    print(vocab['岁'])       # same as above
    # 4
    print(len(vocab))       # size of the vocabulary
    # 39
As the example shows, once the vocabulary object is constructed, its attributes make it easy to look up frequencies, words, and indices.
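Building on this, a tokenized sentence can be mapped to a list of indices with the vocabulary. The small sketch below assumes the vocab object built above; out-of-vocabulary tokens are expected to fall back to the index of '<unk>':
tokens = tokenizer("问君能有几多愁,恰似一江春水向东流。", word=True)
ids = [vocab[token] for token in tokens]
print(ids)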
Having covered how the vocabulary is constructed, let's also briefly look at how to merge two dictionaries, i.e., the usage of counter.update.
3 Merging Dictionaries and Lists
3.1 Merging Dictionaries
In some scenarios, we need to merge two (or more) dictionaries. For example, we need to merge the following two dictionaries:
dict1 = {"a": 2, "b": 3, "c": 5}
dict2 = {"a": 1, "c": 3, "d": 8}
And the result after merging would be:
{'c': 8, 'd': 8, 'a': 3, 'b': 3}
So how should we go about it? Since two dictionaries cannot be added directly, we first convert each dictionary into a Counter and then add the Counters. The specific code is as follows:
from collections import Counter

dict1 = {"a": 2, "b": 3, "c": 5}
dict2 = {"a": 1, "c": 3, "d": 8}
result = Counter()
for item in [dict1, dict2]:
    result += Counter(item)
print(result)  # Counter({'c': 8, 'd': 8, 'a': 3, 'b': 3})
Of course, when only two dictionaries need to be added, it can be done in one line:
result = Counter(dict1) + Counter(dict2)
3.2 Merging Lists
In some scenarios, we need to merge two (or more) lists to obtain a dictionary containing the frequency of each element. For example, we need to merge the following two lists:
a = ["天", "之", "道", "损", "有", "余", "而", "补", "不", "足"]
b = ["人", "之", "道", "损", "不", "足", "而", "补", "有", "余"]
To merge into:
Counter({'之': 2, '道': 2, '损': 2, '有': 2, '余': 2, '而': 2, '补': 2, '不': 2, '足': 2, '天': 1, '人': 1})
This can be achieved with the following code:
from collections import Counter

counter = Counter()
for item in [a, b]:
    counter.update(item)
print(counter)
Beyond constructing a vocabulary, this approach is also handy for finding or counting duplicate elements in lists.
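For instance, here is a small sketch of using Counter to pick out the elements that appear more than once in a list; the sample list and variable names are only for illustration:
from collections import Counter

items = ["天", "之", "道", "之", "道"]
counter = Counter(items)
duplicates = [e for e, c in counter.items() if c > 1]
print(duplicates)  # ['之', '道']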
4 Conclusion
In this article, I first introduced how to tokenize text; then how to construct a vocabulary with the Vocab class in torchtext.vocab; and finally, briefly, how to merge dictionaries and lists with Counter. In the next article, I will introduce how to quickly build the corresponding text dataset based on the constructed vocabulary.