How BERT Tokenizes Text


Source | Zhihu

Link | https://zhuanlan.zhihu.com/p/132361501

Author | Alan Lee

Editor | Machine Learning Algorithms and Natural Language Processing Public Account

This article is reposted with the author's authorization; further reposting is prohibited.

This article was first published on my personal blog on 2019/10/16 and cannot be reproduced without permission.

BERT stands for Bidirectional Encoder Representations from Transformers, a language representation model released by Google in 2018. It attracted wide attention and imitation as soon as it was released, and many of you have probably heard of it or studied it to some extent. This article focuses on BERT’s tokenization; details of the model implementation will be discussed later.

tokenization.py in the BERT source code handles preprocessing and tokenization. It mainly contains two tokenizers: BasicTokenizer and WordpieceTokenizer. There is also a FullTokenizer that combines the two: it first uses BasicTokenizer to obtain a relatively coarse token list, then applies WordpieceTokenizer to each of those tokens to get the final tokenization result.

To visually demonstrate the effect of each processing step, I will use the following example that is modified from the Keras Wikipedia introduction:

example = "Keras是ONEIROS(Open-ended Neuro-Electronic Intelligent Robot Operating System,开放式神经电子智能机器人操作系统)项目研究工作的部分产物[3],主要作者和维护者是Google工程师François Chollet。\r\n"

To summarize the handling of Chinese up front: BERT takes a “character-based” approach, i.e., every Chinese character is split out as its own token.
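Before going through the two tokenizers step by step, here is a minimal end-to-end usage sketch. It assumes tokenization.py from the google-research/bert repo is importable and that vocab.txt is the vocabulary file shipped with a pretrained Chinese or multilingual checkpoint:

import tokenization  # tokenization.py from the google-research/bert repo

# vocab.txt is assumed to come with a pretrained Chinese/multilingual checkpoint
tokenizer = tokenization.FullTokenizer(vocab_file="vocab.txt", do_lower_case=True)
tokens = tokenizer.tokenize(example)
ids = tokenizer.convert_tokens_to_ids(tokens)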

BasicTokenizer

BasicTokenizer (hereinafter referred to as BT) is a preliminary tokenizer. For a string to be tokenized, the process roughly consists of converting to unicode -> removing various strange characters -> processing Chinese -> whitespace tokenization -> removing redundant characters and punctuation tokenization -> whitespace tokenization again, ending the process.

This is the general process, and there are many details that I will explain next.

Convert to Unicode

The conversion to unicode corresponds to the convert_to_unicode(text) function, which is straightforward: it converts the input into a unicode string. Under Python 3, if the input is already of type str, it is returned unchanged; if it is of type bytes, the function decodes it with text.decode("utf-8", "ignore"). If you are still using Python 2, please refer to Sunsetting Python 2 support and Python 2.7 Countdown; it is time to drop it.
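Under Python 3 the function boils down to roughly the following (a sketch; the original also handles Python 2 via six):

def convert_to_unicode(text):
  """Return text as a Python 3 str, decoding utf-8 bytes if necessary."""
  if isinstance(text, str):
    return text
  elif isinstance(text, bytes):
    return text.decode("utf-8", "ignore")
  else:
    raise ValueError("Unsupported string type: %s" % (type(text),))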

After this step, example remains the same:

>>> example = convert_to_unicode(example)
>>> example
'Keras是ONEIROS(Open-ended Neuro-Electronic Intelligent Robot Operating System,开放式神经电子智能机器人操作系统)项目研究工作的部分产物[3],主要作者和维护者是Google工程师François Chollet。\r\n'

Remove Various Strange Characters

Removing various strange characters corresponds to the _clean_text(text) method of the BT class. Based on Unicode code points, it removes invalid characters and normalizes redundant whitespace; specifically (a code sketch follows the list below):

In Python, you can use ord(c) to get the code point of character c, and chr(i) to get the Unicode character with code point i, where $0 \leq i \leq \text{0x10FFFF}$, i.e., decimal $[0, 1114111]$.

  • Code point 0 (\x00), i.e., the null character, an invisible control character that often appears as a string terminator. Note that it is not a space; the code point of the space is 32.

  • Code point 0xFFFD (decimal 65533), i.e., the REPLACEMENT CHARACTER, usually used to stand in for unknown, unrecognized, or unrepresentable characters.

  • Control characters other than \t, \r, and \n, i.e., characters whose Unicode category is Cc or Cf. You can check a character’s category with unicodedata.category(c); the code uses _is_control(char) to decide whether char is a control character.

  • All whitespace characters are converted to a single space, including the standard space, \t, \r, \n, and characters whose Unicode category is Zs. The code uses _is_whitespace(char) to decide whether char is a whitespace character.
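Putting the rules above together, a sketch of _clean_text and its two helper predicates looks roughly like this (written as module-level functions for brevity):

import unicodedata

def _is_control(char):
  """Cc/Cf control characters; \t, \r, \n are treated as whitespace instead."""
  if char in ("\t", "\n", "\r"):
    return False
  return unicodedata.category(char) in ("Cc", "Cf")

def _is_whitespace(char):
  """Space, \t, \r, \n, or any character in Unicode category Zs."""
  if char in (" ", "\t", "\n", "\r"):
    return True
  return unicodedata.category(char) == "Zs"

def _clean_text(text):
  """Drop null, replacement, and control characters; map all whitespace to a space."""
  output = []
  for char in text:
    cp = ord(char)
    if cp == 0 or cp == 0xFFFD or _is_control(char):
      continue
    output.append(" " if _is_whitespace(char) else char)
  return "".join(output)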

After this step, example has \r\n replaced with two spaces:

>>> example = _clean_text(example)
>>> example
'Keras是ONEIROS(Open-ended Neuro-Electronic Intelligent Robot Operating System,开放式神经电子智能机器人操作系统)项目研究工作的部分产物[3],主要作者和维护者是Google工程师François Chollet。  '

Process Chinese Characters

Processing Chinese characters corresponds to the _tokenize_chinese_chars(text) method of the BT class. For characters in text, it first checks if they are “Chinese characters” (see the explanation in the quote block below), and if so, adds a space before and after them; otherwise, it outputs them as they are. Now the question arises: how do we determine if a character is “Chinese”?

The method _is_chinese_char(cp) decides whether cp is the code point of a “Chinese character”, judging purely by the code point. There are 81520 such characters in total, and the detailed code point ranges are as follows (all are closed intervals):

  • [0x4E00, 0x9FFF]: Decimal [19968, 40959]

  • [0x3400, 0x4DBF]: Decimal [13312, 19903]

  • [0x20000, 0x2A6DF]: Decimal [131072, 173791]

  • [0x2A700, 0x2B73F]: Decimal [173824, 177983]

  • [0x2B740, 0x2B81F]: Decimal [177984, 178207]

  • [0x2B820, 0x2CEAF]: Decimal [178208, 183983]

  • [0xF900, 0xFAFF]: Decimal [63744, 64255]

  • [0x2F800, 0x2FA1F]: Decimal [194560, 195103]

Actually, I think this range can be further simplified because some intervals are adjacent, such as the following three intervals:

  • [0x2A700, 0x2B73F]: Decimal [173824, 177983]

  • [0x2B740, 0x2B81F]: Decimal [177984, 178207]

  • [0x2B820, 0x2CEAF]: Decimal [178208, 183983]

These can be simplified into one:

  • [0x2A700, 0x2CEAF]: Decimal [173824, 183983]

The original 8 intervals are simplified to 6. As for why it was originally written as 8, I don’t know.

Regarding “Chinese characters”: according to the comment in the code, a “Chinese character” is anything in the CJK Unicode block, which includes modern Chinese, some Japanese, some Korean, and Vietnamese characters. However, the CJK Unified Ideographs block proper only corresponds to the first interval above ([0x4E00, 0x9FFF]), so the code actually covers far more characters than that block alone. This point seems somewhat questionable. I will quote the source code’s comment on this:

def _is_chinese_char(self, cp):
    """Checks whether CP is the codepoint of a CJK character."""
    # This defines a "chinese character" as anything in the CJK Unicode block:
    #   https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
    #
    # Note that the CJK Unicode block is NOT all Japanese and Korean characters,
    # despite its name. The modern Korean Hangul alphabet is a different block,
    # as is Japanese Hiragana and Katakana. Those alphabets are used to write
    # space-separated words, so they are not treated specially and handled
    # like the all of the other languages.

    pass
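For reference, the check can be reconstructed from the six merged intervals above, together with the surrounding space-insertion logic of _tokenize_chinese_chars. The sketch below is illustrative rather than a verbatim copy of the source (which spells out all eight intervals):

# Illustrative sketch using the six merged intervals derived above.
_CJK_RANGES = [
    (0x4E00, 0x9FFF),
    (0x3400, 0x4DBF),
    (0x20000, 0x2A6DF),
    (0x2A700, 0x2CEAF),  # three adjacent intervals merged into one
    (0xF900, 0xFAFF),
    (0x2F800, 0x2FA1F),
]

def is_chinese_char(cp):
  """True if code point cp falls in any of the CJK ranges above."""
  return any(lo <= cp <= hi for lo, hi in _CJK_RANGES)

def tokenize_chinese_chars(text):
  """Add a space before and after every "Chinese character"; leave others as-is."""
  output = []
  for char in text:
    if is_chinese_char(ord(char)):
      output.append(" " + char + " ")
    else:
      output.append(char)
  return "".join(output)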

After this step, the Chinese is split into characters, separated by spaces, while English letters and numbers remain unchanged:

>>> example = _tokenize_chinese_chars(example)
>>> example
'Keras 是 ONEIROS(Open-ended Neuro-Electronic Intelligent Robot Operating System, 开  放  式  神  经  电  子  智  能  机  器  人  操  作  系  统 ) 项  目  研  究  工  作  的  部  分  产  物 [3], 主  要  作  者  和  维  护  者  是 Google 工  程  师 François Chollet。'

Whitespace Tokenization

Whitespace tokenization corresponds to the whitespace_tokenize(text) function. First, it performs strip() on text to remove extra whitespace characters from both ends. If the remaining string is empty, it directly returns an empty list; otherwise, it performs split() to obtain the initial tokenization result orig_tokens.
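The function itself is tiny; a sketch that mirrors the description above:

def whitespace_tokenize(text):
  """Strip the text, then split it on runs of whitespace."""
  text = text.strip()
  if not text:
    return []
  return text.split()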

After this step, example transforms into a list:

>>> example = whitespace_tokenize(example)
>>> example
['Keras', '是', 'ONEIROS(Open-ended', 'Neuro-Electronic', 'Intelligent', 'Robot', 'Operating', 'System,', '开', '放', '式', '神', '经', '电', '子', '智', '能', '机', '器', '人', '操', '作', '系', '统', ')', '项', '目', '研', '究', '工', '作', '的', '部', '分', '产', '物', '[3],', '主', '要', '作', '者', '和', '维', '护', '者', '是', 'Google', '工', '程', '师', 'François', 'Chollet。']

Remove Redundant Characters and Punctuation Tokenization

The next step is to further process the orig_tokens tokenization result, as shown in the code below:

for token in orig_tokens:
  if self.do_lower_case:
    token = token.lower()
    token = self._run_strip_accents(token)
  split_tokens.extend(self._run_split_on_punc(token))

The logic is not complicated. Here, I mainly want to explain _run_strip_accents and _run_split_on_punc.

The _run_strip_accents(text) method is used to remove accents, i.e., diacritical marks. So what are diacritics? They are marks added to letters to change their pronunciation or to distinguish similarly spelled words, for example the two dots above the pinyin letter “ü”, or the accents on letters like “á” and “à”. A list of common accented characters can be found in Common accented characters (see the references).

The _run_strip_accents(text) method removes these accents, for example, François Chollet becomes Francois Chollet, résumé becomes resume, and á becomes a. The method code is not long, as shown below:

def _run_strip_accents(self, text):
    """Strips accents from a piece of text."""
    text = unicodedata.normalize("NFD", text)
    output = []
    for char in text:
      cat = unicodedata.category(char)
      if cat == "Mn":
        continue
      output.append(char)
    return "".join(output)

Using list comprehension, the code can be further simplified:

def _run_strip_accents(self, text):
    """Strips accents from a piece of text."""
    text = unicodedata.normalize("NFD", text)
    output = [char for char in text if unicodedata.category(char) != 'Mn']
    return "".join(output)

The core of this code is the unicodedata.normalize and unicodedata.category functions. The former returns the canonical decomposition of the input string text (Unicode defines several normalization forms; the code uses NFD, i.e., canonical decomposition), while the latter returns the Unicode category of the input character char. Below, I will illustrate what these two functions do.

Suppose we want to process āóǔè, which contains diacritics. Each of these accented characters can be decomposed into two characters: for example, ā (code point 0x101) consists of a (code point 0x61) followed by the combining macron above it (code point 0x304). unicodedata.normalize splits them apart:

>>> import unicodedata  # unicodedata is a built-in library
>>> s = 'āóǔè'
>>> s_norm = unicodedata.normalize('NFD', s)
>>> s_norm, len(s_norm)

('āóǔè', 8)  # looks the same as before, but the length has changed

unicodedata.category is used to return the category of each character:

>>> ' '.join(unicodedata.category(c) for c in s_norm)
'Ll Mn Ll Mn Ll Mn Ll Mn'

The Ll category (https://www.compart.com/en/unicode/category/Ll) stands for Lowercase Letter, while the Mn category (https://www.compart.com/en/unicode/category/Mn) stands for Nonspacing Mark, which is where diacritical marks live. Thus, we can remove diacritics simply by filtering on the category:

>>> ''.join(c for c in s_norm if unicodedata.category(c) != 'Mn')
'aoue'

The _run_split_on_punc(text) method performs punctuation tokenization: it is applied to each token produced by the earlier whitespace tokenization and splits it on punctuation marks.

Before discussing this method, let’s first talk about the function that checks whether a character is a punctuation mark: _is_punctuation(char). The code for this function is short, and I will provide it below:

def _is_punctuation(char):
  """Checks whether `chars` is a punctuation character."""
  cp = ord(char)
  if ((cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or
      (cp >= 91 and cp <= 96) or (cp >= 123 and cp <= 126)):
    return True
  cat = unicodedata.category(char)
  if cat.startswith("P"):
    return True
  return False

Typically, we would store all punctuation marks in something like a dictionary file, whereas _is_punctuation decides based on code points, which is more flexible and avoids maintaining an extra dictionary file. A character is considered punctuation in two situations: it is a non-alphanumeric printable ASCII character (code points 33–126 excluding letters and digits), or its Unicode category starts with P. The first situation covers exactly 32 characters, listed below:

!"#$%&'
()*+,-./:;<=>?@["]^_`{|}~

The overall process of _run_split_on_punc is as follows:

  1. Set start_new_word=True and output=[], where output is the final output.

  2. For each character in text, if the character is punctuation, append [char] to output and set start_new_word=True.

  3. If it is not punctuation and start_new_word=True, this marks the start of a new segment: append an empty list to output, set start_new_word=False, and append the current character to that last empty list: output[-1].append(char).

Now, the output is a nested list, where each list is a segment separated by punctuation. Finally, we can join each list to flatten output.
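A minimal sketch of this procedure, reusing the _is_punctuation function shown above (illustrative, written as a module-level function):

def run_split_on_punc(text):
  """Split text so that each punctuation character becomes its own token."""
  output = []                # a list of character lists, one per segment
  start_new_word = True
  for char in text:
    if _is_punctuation(char):
      output.append([char])  # punctuation always forms its own segment
      start_new_word = True
    else:
      if start_new_word:
        output.append([])    # open a new segment
      start_new_word = False
      output[-1].append(char)
  return ["".join(x) for x in output]

For example, run_split_on_punc('chollet。') returns ['chollet', '。'], and run_split_on_punc('[3],') returns ['[', '3', ']', ','].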

After this step, words and punctuation that were not previously separated (e.g., ONEIROS(Open-ended)) and diacritics that had not yet been removed (e.g., the ç in François) are taken care of:

>>> example
['keras', '是', 'oneiros', '(', 'open', '-', 'ended', 'neuro', '-', 'electronic', 'intelligent', 'robot', 'operating', 'system', ',', '开', '放', '式', '神', '经', '电', '子', '智', '能', '机', '器', '人', '操', '作', '系', '统', ')', '项', '目', '研', '究', '工', '作', '的', '部', '分', '产', '物', '[', '3', ']', ',', '主', '要', '作', '者', '和', '维', '护', '者', '是', 'google', '工', '程', '师', 'francois', 'chollet', '。']

Whitespace Tokenization Again

This corresponds to the following code:

output_tokens = whitespace_tokenize(" ".join(split_tokens))

It is quite simple: first, it joins the previous processing result with standard spaces, then performs whitespace tokenization again. (But WHY?)

After this step, the result is the same as the previous step:

>>> example
['keras', '是', 'oneiros', '(', 'open', '-', 'ended', 'neuro', '-', 'electronic', 'intelligent', 'robot', 'operating', 'system', ',', '开', '放', '式', '神', '经', '电', '子', '智', '能', '机', '器', '人', '操', '作', '系', '统', ')', '项', '目', '研', '究', '工', '作', '的', '部', '分', '产', '物', '[', '3', ']', ',', '主', '要', '作', '者', '和', '维', '护', '者', '是', 'google', '工', '程', '师', 'francois', 'chollet', '。']

This is the final output of BT.

WordpieceTokenizer

WordpieceTokenizer (hereinafter referred to as WPT) performs a further segmentation on the output of BT, producing subwords (non-initial subwords are prefixed with ##). This is where the vocabulary comes in. The class has only two methods: an initialization method __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=200) and a tokenization method tokenize(self, text).

For Chinese, whether or not to use WPT makes no difference, as Chinese has already been tokenized into individual characters after BasicTokenizer, making it impossible to further segment into “subwords”.

__init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=200): vocab is the vocabulary, a collections.OrderedDict() loaded by load_vocab(vocab_file), whose keys are tokens and whose values are their indices, ordered as in vocab_file. Note that the vocabulary already contains all the possible subword units (including the ##-prefixed ones). unk_token is the token for unknown words, defaulting to [UNK]. max_input_chars_per_word is the maximum length of a single word; any word longer than this is mapped directly to unk_token.
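As a rough illustration of how the vocabulary is loaded, a simplified sketch using plain file I/O (the repo itself goes through TensorFlow's file utilities):

import collections

def load_vocab(vocab_file):
  """Load a vocabulary file (one token per line) into an OrderedDict: token -> index."""
  vocab = collections.OrderedDict()
  with open(vocab_file, "r", encoding="utf-8") as reader:
    for index, line in enumerate(reader):
      vocab[line.strip()] = index
  return vocab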

tokenize(self, text): This is the main tokenization method, which generally splits a word into multiple subwords in a left-to-right order, with each subword being as long as possible. According to the source code, this method is referred to as the greedy longest-match-first algorithm.

Initially, it converts text to unicode and performs whitespace tokenization, then iterates through each word. To clearly and intuitively understand the iteration process, I have specially created a GIF to explain it, using unaffable as an example:

[GIF: step-by-step greedy longest-match-first segmentation of “unaffable”]

Note:

  • The blue background indicates the current substring, corresponding to cur_substr in the code.

  • When traversing from the first position, there is no need to add ## before the current substring; otherwise, it is required.

Outline of the process (the GIF above should already make it clear; a code sketch of the loop follows the list):

  1. Starting from the first position, since we want the longest match, the ending position decrements from the far right, so the first subword tried is the whole word unaffable, which is not in the vocabulary.

  2. Move the ending position left by one to get the subword unaffabl, which is also not in the vocabulary.

  3. Repeat this until un, which is in the vocabulary, is reached; it is added to output_tokens, ending the traversal that started at the first position.

  4. Skip un and start a new traversal from a, again decrementing the ending position from the far right, but this time adding ## in front, giving ##affable, which is not in the vocabulary.

  5. Move the ending position left by one to get the subword ##affabl, which is also not in the vocabulary.

  6. Repeat this until ##aff, which is in the vocabulary, is reached; it is added to output_tokens, ending this round of traversal.

  7. Skip aff and start a new traversal from a (the first character of able), again decrementing the ending position from the far right. ##able is in the vocabulary, so it is added to output_tokens.

  8. No characters are left after able, and the whole traversal ends.
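Putting the steps together, a minimal sketch of this greedy longest-match-first loop for a single word might look like the following (vocab is assumed to be a set-like collection of subwords; the real tokenize method additionally handles max_input_chars_per_word and loops over all whitespace-separated words):

def wordpiece_tokenize_word(word, vocab, unk_token="[UNK]"):
  """Greedy longest-match-first split of a single word into subwords."""
  chars = list(word)
  sub_tokens = []
  start = 0
  while start < len(chars):
    end = len(chars)
    cur_substr = None
    while start < end:
      substr = "".join(chars[start:end])
      if start > 0:
        substr = "##" + substr  # non-initial pieces get the ## prefix
      if substr in vocab:
        cur_substr = substr
        break
      end -= 1                  # shrink from the right: longest match first
    if cur_substr is None:
      return [unk_token]        # nothing matched, the whole word becomes [UNK]
    sub_tokens.append(cur_substr)
    start = end                 # continue right after the matched piece
  return sub_tokens

With a vocabulary containing un, ##aff, and ##able, wordpiece_tokenize_word("unaffable", {"un", "##aff", "##able"}) returns ['un', '##aff', '##able'], matching the walkthrough above.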

By inputting the results of BT into WPT, the final tokenization result for example is:

['keras', '是', 'one', '##iros', '(', 'open', '-', 'ended', 'neu', '##ro', '-', 'electronic', 'intelligent', 'robot', 'operating', 'system', ',', '开', '放', '式', '神', '经', '电', '子', '智', '能', '机', '器', '人', '操', '作', '系', '统', ')', '项', '目', '研', '究', '工', '作', '的', '部', '分', '产', '物', '[', '3', ']', ',', '主', '要', '作', '者', '和', '维', '护', '者', '是', 'google', '工', '程', '师', 'franco', '##is', 'cho', '##llet', '。']

Thus, the tokenization part of BERT concludes.

Reference

  • [1810.04805] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

  • bert/tokenization.py at master · google-research/bert

  • How to replace accented characters in python? – Stack Overflow

  • What is the best way to remove accents in a Python unicode string? – Stack Overflow

  • Accents & Accented Characters – Fonts.com | Fonts.com

  • Common accented characters | Butterick’s Practical Typography
