FoolNLTK — A Simple and Easy-to-Use Chinese NLP Toolkit

Its author describes FoolNLTK as "possibly not the fastest open-source Chinese word segmenter, but very likely the most accurate one."

This open-source toolkit is trained on a BiLSTM model and provides word segmentation, part-of-speech tagging, and entity recognition. It also supports user-defined dictionaries, training custom models, and batch processing of text.

1. Preparation

Before starting, make sure Python and pip are installed on your computer. If not, you can refer to this article for installation: The Most Detailed Python Installation Guide.

If your purpose for using Python is data analysis, you can directly install Anaconda: Python Data Analysis and Mining Helper — Anaconda, which comes with Python and pip pre-installed.

Additionally, I recommend using the VSCode editor, which has many advantages: The Best Companion for Python Programming — Detailed Guide to VSCode.

Please choose one of the following ways to enter the command that installs the dependency:

1. On Windows, open Cmd (Start - Run - CMD).
2. On macOS, open Terminal (command + space, then type Terminal).
3. If you use the VSCode editor or PyCharm, you can use the Terminal at the bottom of the interface directly.

pip install foolnltk

2. Usage Instructions

2.1 Word Segmentation Function

You can segment text with the fool.cut function:

import fool

text = "一个傻子在北京"
print(fool.cut(text))
# ['一个', '傻子', '在', '北京']

You can also segment a file from the command line:

python -m fool [filename]
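
If you prefer to stay in Python, fool.cut also accepts a list of strings (as the user-dictionary example below shows), so a file can be segmented in a few lines. A minimal sketch, assuming a UTF-8 text file named input.txt (a hypothetical filename):

import fool

# Read the file and drop empty lines (input.txt is a hypothetical example file)
with open("input.txt", encoding="utf-8") as f:
    lines = [line.strip() for line in f if line.strip()]

# Passing a list to fool.cut returns one token list per input string
for tokens in fool.cut(lines):
    print(" ".join(tokens))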

2.2 User-Defined Dictionary

The dictionary format is as follows: one word per line, followed by its weight. The higher a word's weight and the longer the word, the more likely it is to appear in the segmentation result. Weight values should be greater than 1:

难受香菇 10
什么鬼 10
分词工具 10
北京 10
北京天安门 10

Load the dictionary:

import fool
path = 'user_dict.txt'  # path to your dictionary file (example filename)
fool.load_userdict(path)
text = ["我在北京天安门看你难受香菇", "我在北京晒太阳你在非洲看雪"]
print(fool.cut(text))
# [['我', '在', '北京', '天安门', '看', '你', '难受', '香菇'],
#  ['我', '在', '北京', '晒太阳', '你', '在', '非洲', '看', '雪']]

Delete the dictionary:

fool.delete_userdict()
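
Putting these calls together, here is a minimal round-trip sketch (user_dict.txt is a hypothetical filename) that writes a dictionary, loads it, and then removes it again:

import fool

# Write a small user dictionary: one word per line, followed by its weight
with open("user_dict.txt", "w", encoding="utf-8") as f:
    f.write("难受香菇 10\n北京天安门 10\n")

fool.load_userdict("user_dict.txt")
print(fool.cut(["我在北京天安门看你难受香菇"]))

# After deleting the user dictionary, segmentation falls back to the default model
fool.delete_userdict()
print(fool.cut(["我在北京天安门看你难受香菇"]))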

2.3 Part-of-Speech Tagging

Part-of-speech tagging only requires the pos_cut function. The first dimension of the returned array corresponds to each input string; the second dimension contains the segmented words paired with their parts of speech.

import fool

text = ["一个傻子在北京"]
print(fool.pos_cut(text))
#[[('一个', 'm'), ('傻子', 'n'), ('在', 'p'), ('北京', 'ns')]]
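
Because the result is a plain nested list, post-processing is straightforward. For example, a small sketch that keeps only the words tagged 'n' (noun) in each sentence:

import fool

text = ["一个傻子在北京"]
for sentence in fool.pos_cut(text):
    # Keep only the words whose tag is 'n'
    nouns = [word for word, tag in sentence if tag == "n"]
    print(nouns)  # ['傻子']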

2.4 Entity Recognition

Each element in the entity recognition result contains the start and end offsets of the entity, its category, and the entity text.

import fool 

text = ["一个傻子在北京","你好啊"]
words, ners = fool.analysis(text)
print(ners)
#[[(5, 8, 'location', '北京')]]
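
Each tuple can be unpacked directly into the start offset, end offset, entity category, and entity text. A minimal sketch based on the output above:

import fool

text = ["一个傻子在北京", "你好啊"]
words, ners = fool.analysis(text)

# Each inner list holds the entity tuples found in one input string
for entities in ners:
    for start, end, etype, name in entities:
        print(start, end, etype, name)  # 5 8 location 北京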

3. Customize Your Own Model

You can customize your own model in a Python 3 environment on Linux.

git clone https://github.com/rockyzhengwu/FoolNLTK.git
cd FoolNLTK/train

1. Training. Put the training data in data_dir, in the same format as datasets/demo. Download the pretrained BERT model and link it under pretrainmodel.

python ./train_bert_ner.py --data_dir=data/bid_train_data \
  --bert_config_file=./pretrainmodel/bert_config.json \
  --init_checkpoint=./pretrainmodel/bert_model.ckpt \
  --vocab_file=./pretrainmodel/vocab.txt \
  --output_dir=./output/all_bid_result_dir/ --do_train

2. Exporting the Model. Pass --do_export to export the trained model in pb format for deployment:

python ./train_bert_ner.py --data_dir=data/bid_train_data \
  --bert_config_file=./pretrainmodel/bert_config.json \
  --init_checkpoint=./pretrainmodel/bert_model.ckpt \
  --vocab_file=./pretrainmodel/vocab.txt \
  --output_dir=./output/all_bid_result_dir/ --do_predict --do_export

3. Prediction. In bert_predict.py, set the following three parameters, then load the trained model for prediction:

VOCAB_FILE = './pretrainmodel/vocab.txt'
LABEL_FILE = './output/label2id.pkl'
EXPORT_PATH = './export_models/1581318324'

If you are interested in building your own model and have questions, you can find detailed documentation here: https://github.com/rockyzhengwu/FoolNLTK/blob/master/train/README.md
