Latest Achievements
Subword Tokenization
Overview of Subword Tokenization Methods
This article surveys subword tokenization methods used in neural natural language processing. It first explains the out-of-vocabulary (OOV) problem that a closed vocabulary causes in neural NLP models, then introduces three common solutions: Byte-Pair Encoding (BPE), WordPiece, and Unigram. Subword tokenization usually requires prior word segmentation, which is highly language-dependent; SentencePiece offers a language-independent alternative that tokenizes raw input sentences directly, with no prior word segmentation. Subword tokenization can also produce unreasonable splits and leave subword representations insufficiently trained, so the article next introduces subword regularization and BPE-Dropout, which address these issues. Character-based subword tokenization still runs into OOV problems in multilingual settings, especially with large character sets such as Chinese, Japanese, and Korean. The article presents an effective solution to this problem, Byte-Level BPE (BBPE) built on UTF-8 bytes, along with the SentencePiece-based solution derived from BBPE. Finally, it covers VOLT, a general vocabulary optimization technique proposed in an ACL 2021 best paper.
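As a rough illustration of the BPE idea mentioned above, the following minimal Python sketch learns merge rules from a toy word-frequency table by repeatedly merging the most frequent adjacent symbol pair. The names `learn_bpe` and `byte_symbols` and the toy corpus are assumptions made for this example; they are not taken from the article or from any library. `byte_symbols` shows the starting point of a byte-level variant, where the base symbols are UTF-8 bytes rather than characters.

```python
from collections import Counter


def learn_bpe(word_freqs, num_merges):
    """Learn BPE merge rules from a {word: frequency} dict (toy sketch)."""
    # Each word starts as a tuple of single-character symbols.
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the best pair with one merged symbol.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges


def byte_symbols(text):
    """Base symbols for a byte-level variant: one symbol per UTF-8 byte."""
    return [bytes([b]) for b in text.encode("utf-8")]


if __name__ == "__main__":
    # Hypothetical toy corpus, chosen so that no two pairs tie in frequency.
    print(learn_bpe({"hug": 10, "pug": 5, "hugs": 5}, 2))
    print(byte_symbols("中"))
```

With base symbols drawn from the 256 possible byte values, any UTF-8 string can be represented without an unknown-character symbol, which is the motivation the abstract gives for the byte-level approach in multilingual settings.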
This article is research output from Huawei Noah’s Ark Lab, written by Professor Liu Qun and published in the Communications of the Chinese Association for Artificial Intelligence, Issue 3, 2022.
ChinaXiv
ChinaXiv is a China-based preprint platform for academic exchange, operated according to international standards. It gives the Chinese scientific community a place to share preliminary research results as preprints, so that work can be improved through exchange and research results can be put to productive use more effectively.
We thank researchers and the wider public for their attention and support, and we welcome submissions.