From | cnblogs
Address | https://www.cnblogs.com/huangyc/p/10223075.html
Author | hyc339408769
Complete machine learning implementation code on GitHub. Feel free to reprint, but please indicate the source: https://www.cnblogs.com/huangyc/p/10223075.html. Welcome to communicate: [email protected]
0. Directory
- 1. Introduction
- 2. WordPiece Principle
- 3. BPE Algorithm
- 4. Learning Materials
- 5. Conclusion
1. Introduction
The hottest paper of 2018 was Google's BERT. Today, however, we will not introduce the BERT model itself, but rather a small module inside BERT called WordPiece.
2. WordPiece Principle
Currently, most high-performing NLP models, such as OpenAI GPT and Google's BERT, include a WordPiece step during data preprocessing. As the name suggests, WordPiece breaks a word into smaller pieces.
One main implementation of WordPiece is called BPE (Byte-Pair Encoding).
The BPE process can be understood as breaking a word down further, making our vocabulary more concise and clearer in meaning.
For example, the three words "loved", "loving", and "loves" all share the core meaning "love". However, if we treat them as separate vocabulary entries, the model sees them as unrelated words. English has many words that differ only in their suffixes, which inflates the vocabulary, slows down training, and weakens the result.
The BPE algorithm can split these three words into pieces such as "lov", "ed", "ing", and "es", separating the stem (the meaning) from the suffix (the tense or inflection), which significantly reduces the vocabulary size.
3. BPE Algorithm
The general BPE training procedure: first split every word into individual characters, then count the frequency of adjacent symbol pairs across the vocabulary, and at each iteration merge the most frequent pair into a new symbol, repeating until the desired number of merges is reached.
Let’s simulate the BPE algorithm.
Our original vocabulary is as follows:
{'l o w e r </w>': 2, 'n e w e s t </w>': 6, 'w i d e s t </w>': 3, 'l o w </w>': 5}
Here each key is a word split into characters, with the end-of-word marker </w> appended, and each value is the frequency of that word.
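For concreteness, a table of this shape could be built from raw, whitespace-tokenized text with a small helper like the one below. This is only an illustrative sketch; build_vocab is a name made up for this example, not part of any particular library.

```python
import collections

def build_vocab(corpus):
    """Turn whitespace-tokenized text into the {'c h a r s </w>': freq} table used below."""
    counts = collections.Counter(corpus.split())
    return {' '.join(word) + ' </w>': freq for word, freq in counts.items()}

# e.g. build_vocab('low lower low') -> {'l o w </w>': 2, 'l o w e r </w>': 1}
```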
Next, we repeatedly find the most frequent adjacent symbol pair in the entire vocabulary and merge it, step by step.
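The Python sketch below implements this loop. It closely follows the reference snippet published with the "Neural Machine Translation of Rare Words with Subword Units" paper listed in the learning materials, with the toy vocabulary above plugged in; the number of merges (10 here) is a free parameter. Running it prints the merge trace that follows.

```python
import re
import collections

def get_stats(vocab):
    """Count the frequency of every adjacent symbol pair in the vocabulary."""
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_vocab(pair, v_in):
    """Replace every occurrence of the chosen pair with a single merged symbol."""
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in v_in.items()}

# The toy vocabulary from above: characters separated by spaces, </w> marks word end.
vocab = {'l o w e r </w>': 2, 'n e w e s t </w>': 6,
         'w i d e s t </w>': 3, 'l o w </w>': 5}

print('Original vocabulary', vocab)
num_merges = 10  # a free parameter; real systems learn tens of thousands of merges
for _ in range(num_merges):
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)  # ties are broken arbitrarily
    vocab = merge_vocab(best, vocab)
    print('Most frequent sequence', best, pairs[best])
    print('Merged vocabulary', vocab)
```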
Original vocabulary {'l o w e r </w>': 2, 'n e w e s t </w>': 6, 'w i d e s t </w>': 3, 'l o w </w>': 5}
Most frequent sequence ('s', 't') 9
Merged vocabulary {'n e w e st </w>': 6, 'l o w e r </w>': 2, 'w i d e st </w>': 3, 'l o w </w>': 5}
Most frequent sequence ('e', 'st') 9
Merged vocabulary {'l o w e r </w>': 2, 'l o w </w>': 5, 'w i d est </w>': 3, 'n e w est </w>': 6}
Most frequent sequence ('est', '</w>') 9
Merged vocabulary {'w i d est</w>': 3, 'l o w e r </w>': 2, 'n e w est</w>': 6, 'l o w </w>': 5}
Most frequent sequence ('l', 'o') 7
Merged vocabulary {'w i d est</w>': 3, 'lo w e r </w>': 2, 'n e w est</w>': 6, 'lo w </w>': 5}
Most frequent sequence ('lo', 'w') 7
Merged vocabulary {'w i d est</w>': 3, 'low e r </w>': 2, 'n e w est</w>': 6, 'low </w>': 5}
Most frequent sequence ('n', 'e') 6
Merged vocabulary {'w i d est</w>': 3, 'low e r </w>': 2, 'ne w est</w>': 6, 'low </w>': 5}
Most frequent sequence ('w', 'est</w>') 6
Merged vocabulary {'w i d est</w>': 3, 'low e r </w>': 2, 'ne west</w>': 6, 'low </w>': 5}
Most frequent sequence ('ne', 'west</w>') 6
Merged vocabulary {'w i d est</w>': 3, 'low e r </w>': 2, 'newest</w>': 6, 'low </w>': 5}
Most frequent sequence ('low', '</w>') 5
Merged vocabulary {'w i d est</w>': 3, 'low e r </w>': 2, 'newest</w>': 6, 'low</w>': 5}
Most frequent sequence ('i', 'd') 3
Merged vocabulary {'w id est</w>': 3, 'newest</w>': 6, 'low</w>': 5, 'low e r </w>': 2}
Through BPE we obtain a more compact vocabulary. It may contain some pieces that are not actual words, but these subwords still carry meaning, speed up NLP training, and better separate the semantics of different words.
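As an illustration of how such a vocabulary is used at preprocessing time, the sketch below greedily applies the learned merge operations to segment a word that never appeared in the training data. apply_merges is a hypothetical helper written for this example; it is not the exact procedure used by BERT or subword-nmt, but it shows the idea.

```python
def apply_merges(word, merges):
    """Greedily apply learned BPE merges, in the order they were learned, to one word."""
    symbols = list(word) + ['</w>']
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]   # merge the adjacent pair into one symbol
            else:
                i += 1
    return symbols

# The merge operations learned in the trace above, in the order they were learned.
merges = [('s', 't'), ('e', 'st'), ('est', '</w>'), ('l', 'o'), ('lo', 'w'),
          ('n', 'e'), ('w', 'est</w>'), ('ne', 'west</w>'), ('low', '</w>'), ('i', 'd')]

print(apply_merges('lowest', merges))   # ['low', 'est</w>'] -- unseen word, known pieces
```

Even though "lowest" never occurred in the toy corpus, it is covered by the subwords "low" and "est</w>" that BPE learned from related words.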
4. Learning Materials
Here are some resources about WordPiece and BPE for students to refer to.
- https://github.com/tensorflow/models
- https://github.com/rsennrich/subword-nmt
- Subword in tensor2tensor
- BPE in seq2seq
- Neural Machine Translation of Rare Words with Subword Units
- BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages
- How to use BPEmb
5. Conclusion
Since WordPiece or BPE is so effective, can we use it everywhere? In fact, it is not well suited to Chinese. First, Chinese text is not separated by spaces the way English and other European languages are; it is written as a continuous stream of characters. Second, a Chinese character is already the smallest unit and cannot be split further. The usual approaches for Chinese are word segmentation and character-level processing. In theory, word segmentation is better, because it is finer-grained and separates meanings more clearly; character-level processing is simpler and more efficient and needs only a small vocabulary of roughly 3,000 commonly used characters.