With the popularity of pre-trained models like BERT, ERNIE, and XLNet, it can seem outdated to solve NLP problems without a pre-trained model. However, that impression is clearly wrong.
As we all know, both training and inference with pre-trained models consume a great deal of computing power and rely heavily on GPU resources. Yet many NLP problems can be solved adequately with just dictionaries + rules, so forcing a heavy model onto them is like using a cannon to shoot a mosquito: very inefficient.
Therefore, we have carefully selected 45 practical open-source tools and dictionaries from a rather crazy GitHub repo. They let you reduce your dependence on models and computing power when building NLP systems, can assist in model training, and let you focus on small and beautiful code instead.
Repo address:
https://github.com/fighting41love/funNLP
Note: This is a very crazy repo that includes over 300 projects, but their quality is quite mixed, so remember to compare similar projects side by side.
Come and get a feel for it m(_ _)m
Who knows how I managed to read through all 300 repos (╯°□°)╯︵ ┻━┻
1. textfilter: Chinese and English sensitive word filtering
Repo: observerss/textfilter
>>> f = DFAFilter()
>>> f.add("sexy")
>>> f.filter("hello sexy baby")
hello **** baby
The sensitive words cover political, vulgar, and other topical vocabulary. The filter works mainly by dictionary lookup (against the keyword file in the project), and the contents of that file are rather explicit.
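The dictionary-lookup idea can be sketched with a small trie, which behaves like a simple keyword automaton. This is a minimal illustration and not the repo's actual code: the class name TrieFilter and its internals are hypothetical, and only the add/filter usage mirrors the snippet above.

```python
_END = object()  # sentinel key marking "a keyword ends at this node"


class TrieFilter:
    """Toy dictionary-based filter (hypothetical, not the textfilter implementation)."""

    def __init__(self):
        self.root = {}  # nested dicts: character -> child node

    def add(self, word):
        node = self.root
        for ch in word.lower():
            node = node.setdefault(ch, {})
        node[_END] = True

    def filter(self, text, mask="*"):
        out = []
        i = 0
        while i < len(text):
            node = self.root
            match_len = 0
            j = i
            # Walk the trie as far as the text allows and remember
            # the longest keyword that ends along the way.
            while j < len(text) and text[j].lower() in node:
                node = node[text[j].lower()]
                j += 1
                if _END in node:
                    match_len = j - i
            if match_len:
                out.append(mask * match_len)  # mask the whole keyword
                i += match_len
            else:
                out.append(text[i])
                i += 1
        return "".join(out)


f = TrieFilter()
f.add("sexy")
print(f.filter("hello sexy baby"))  # hello **** baby
```

Each character of the text is looked up in nested dictionaries, so the cost depends on the text length and keyword length rather than on the size of the word list.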
2. langid: language detection for 97 languages
Repo: saffsd/langid.py
pip install langid
>>> import langid
>>> langid.classify("This is a test")
('en', -54.41310358047485)
3. langdetect: another language detection tool
Address: https://code.google.com/archive/p/language-detection
pip install langdetect
from langdetect import detect
from langdetect import detect_langs
s1 = "本篇博客主要介绍两款语言探测工具,用于区分文本到底是什么语言,"
s2 = 'We are pleased to introduce today a new technology'
s3 = s1 + s2  # assumed mixed-language sample; s3 was not defined in the original snippet
print(detect(s1))
print(detect(s2))
print(detect_langs(s3))  # detect_langs() returns every detected language with its probability
Note: The returned language codes mainly follow the ISO 639-1 standard (see the ISO 639-1 entry on Baidu Encyclopedia).
Compared with langid above, langdetect is less accurate but faster.
4. phone: Querying the origin of Chinese mobile numbers:
Repo: ls0f/phone
Integrated into the Python package cocoNLP
from phone import Phone
p = Phone()
p.find(18100065143)
# returns {'phone': '18100065143', 'province': 'Shanghai', 'city': 'Shanghai', 'zip_code': '200000', 'area_code': '021', 'phone_type': 'Telecom'}
Supported number segments: 13*, 15*, 18*, 14[5,7], 17[0,6,7,8]
Record count: 360569 (updated: April 2017)
The author also provides the data file phone.dat so that non-Python users can load the data.
5. phone: International mobile and phone origin query:
Repo: AfterShip/phone
npm install phone
import phone from 'phone';
phone('+852 6569-8900'); // return ['+85265698900', 'HKG']
phone('(817) 569-8900'); // return ['+18175698900', 'USA']
6. ngender: Determine gender based on names:
Repo: observerss/ngender
Based on naive Bayes probability calculation.
pip install ngender
>>> import ngender
>>> ngender.guess('赵本山')
('male', 0.9836229687547046)
>>> ngender.guess('宋丹丹')
('female', 0.9759486128949907)
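To make the "naive Bayes probability calculation" concrete, here is a minimal character-level sketch. It is not ngender's code or data: the counts, totals, and the function name guess_gender are all hypothetical, and a real model would be estimated from a large name corpus.

```python
import math

# Hypothetical toy counts: how often each given-name character occurs
# in male and female names.
CHAR_COUNTS = {
    "伟": {"male": 90, "female": 10},
    "强": {"male": 85, "female": 15},
    "丽": {"male": 5, "female": 95},
    "芳": {"male": 8, "female": 92},
}
# Hypothetical total number of given-name characters observed per class.
TOTALS = {"male": 1000, "female": 1000}


def guess_gender(given_name):
    """Return (label, probability) under a multinomial naive Bayes model."""
    vocab_size = len(CHAR_COUNTS)
    log_scores = {}
    for label, total in TOTALS.items():
        # Log prior of the class.
        score = math.log(total / sum(TOTALS.values()))
        for ch in given_name:
            count = CHAR_COUNTS.get(ch, {}).get(label, 0)
            # Laplace (add-one) smoothing so unseen characters do not zero out the product.
            score += math.log((count + 1) / (total + vocab_size))
        log_scores[label] = score
    # Turn log scores into normalized probabilities.
    peak = max(log_scores.values())
    probs = {label: math.exp(s - peak) for label, s in log_scores.items()}
    norm = sum(probs.values())
    best = max(probs, key=probs.get)
    return best, probs[best] / norm


print(guess_gender("伟强"))
print(guess_gender("丽芳"))
```

The characters are treated as independent given the gender, which is the "naive" assumption; the returned probability is the normalized posterior for the winning class.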
7. Regular expression for extracting emails
Integrated into the Python package cocoNLP
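import re
# ^...$ validates one candidate string; the non-capturing (?:...) makes findall return the whole address, not just the group
email_pattern = r'^[*#\u4e00-\u9fa5 a-zA-Z0-9_.-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*\.[a-zA-Z0-9]{2,6}$'
emails = re.findall(email_pattern, text, flags=0)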
8. Regular expression for extracting phone numbers
Integrated into the Python package cocoNLP
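import re
# ^...$ validates one candidate string; the non-capturing (?:...) makes findall return the number instead of group tuples
cellphone_pattern = r'^(?:13[0-9]|14[0-9]|15[0-9]|17[0-9]|18[0-9])\d{8}$'
phoneNumbers = re.findall(cellphone_pattern, text, flags=0)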
9. Regular expression for extracting ID card numbers
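import re
# ^...$ validates one candidate string; non-capturing groups make findall return the full 18-character ID
IDCards_pattern = r'^[1-9]\d{5}[12]\d{3}(?:0[1-9]|1[012])(?:0[1-9]|[12][0-9]|3[01])\d{3}[0-9xX]$'
IDs = re.findall(IDCards_pattern, text, flags=0)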
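Patterns anchored with ^ and $ only match when the whole string is a single email, phone number, or ID. To pull such items out of running text, one option is to drop the anchors and guard the edges with lookarounds. The sketch below is an assumption-laden variant, not cocoNLP's implementation: the names EMAIL_RE, PHONE_RE, and ID_RE are hypothetical, the email local part is narrowed to ASCII characters, and the sample values are made up.

```python
import re

# Unanchored variants; non-capturing groups so findall returns full matches.
EMAIL_RE = re.compile(
    r'[a-zA-Z0-9_.-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*\.[a-zA-Z0-9]{2,6}'
)
# (?<!\d) and (?!\d) stop the match from starting or ending inside a longer digit run.
PHONE_RE = re.compile(
    r'(?<!\d)(?:13[0-9]|14[0-9]|15[0-9]|17[0-9]|18[0-9])\d{8}(?!\d)'
)
ID_RE = re.compile(
    r'(?<!\d)[1-9]\d{5}[12]\d{3}(?:0[1-9]|1[012])'
    r'(?:0[1-9]|[12][0-9]|3[01])\d{3}[0-9xX](?![0-9xX])'
)

# Made-up sample text for illustration only.
text = "Contact: test_user@example.com, mobile 13812345678, ID 11010519491231002X."

print(EMAIL_RE.findall(text))  # ['test_user@example.com']
print(PHONE_RE.findall(text))  # ['13812345678']
print(ID_RE.findall(text))     # ['11010519491231002X']
```

The lookarounds matter here: without them the phone pattern could match an 11-digit slice inside a longer number such as an ID.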
10. Name corpus:
Repo: wainshine/Chinese-Names-Corpus
Name extraction functionality has been added to the Python package cocoNLP.
Covers Chinese names (modern and ancient), Japanese names, Chinese surnames and given names, forms of address (Aunt, Little Aunt, etc.), English-to-Chinese names (Li John), and an idiom dictionary.
(Can be used for Chinese word segmentation and name recognition)
11. Chinese abbreviation library:
Repo: zhangyics/Chinese-abbreviation-dataset
全国人大: 全国/n 人民/n 代表大会/n
中国: 中华人民共和国/ns
女网赛: 女子/n 网球/n 比赛/vn
12. Chinese character decomposition dictionary:
Repo: kfcd/chaizi
汉字    拆法 (一)    拆法 (二)    拆法 (三)
拆      手 斥        扌 斥        才 斥
13. Vocabulary sentiment values:
Repo: rainarch/SentiBridge
山泉水 充沛 0.400704566541 0.370067395878
视野 宽广 0.305762728932 0.325320747491
大峡谷 惊险 0.312137906517 0.378594957281
14. Chinese vocabulary, stop words, sensitive words
Repo: dongxiexidian/Chinese
This package’s sensitive word library is more detailed:
Counter-revolutionary word library, sensitive word library statistics, violent terror word library, livelihood word library, pornography word library.
15. Chinese characters to Pinyin:
Repo: mozillazg/python-pinyin
This comes in handy for text error correction.
16. Simplified and traditional Chinese conversion:
Repo: skydark/nstools
17. Engine that simulates Chinese pronunciation using English
Repo: tinyfool/ChineseWithEnglish
say wo i ni
# Say: I love you
Equivalent to simulating Chinese pronunciation using English phonetics.
18. Synonym library, antonym library, negation library:
Repo: guotong1988/chinese_dictionary
19. Chinese character data
Repo: skishore/makemeahanzi
- Stroke order for simplified/traditional Chinese characters
- Vector strokes
20. Splitting and extracting words from unspaced English strings:
Repo: keredson/wordninja
>>> import wordninja
>>> wordninja.split('derekanderson')
['derek', 'anderson']
>>> wordninja.split('imateapot')
['im', 'a', 'teapot']
21. Regular expression for IP addresses:
(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)\.(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)\.(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)\.(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)
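The pattern above has no anchors, so when it is used for searching it can also match inside a longer string (for example the tail of "999.1.1.1"). A minimal sketch of using it to validate a whole string, assuming Python's re.fullmatch; the helper name is_ipv4 and the OCTET constant are hypothetical:

```python
import re

# One octet, taken from the pattern above (made non-capturing).
OCTET = r'(?:25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)'
# re.ASCII restricts \d to 0-9 instead of all Unicode digits.
IPV4_RE = re.compile(rf'{OCTET}(?:\.{OCTET}){{3}}', re.ASCII)


def is_ipv4(candidate):
    """Return True if the whole string is a dotted-quad IPv4 address."""
    return IPV4_RE.fullmatch(candidate) is not None


print(is_ipv4("192.168.0.1"))  # True
print(is_ipv4("256.1.1.1"))    # False
print(is_ipv4("1.2.3"))        # False
```

Note that the octet pattern accepts zero-padded forms such as 010 or 001, so "010.001.000.255" is considered valid by this check.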
22. Regular expression for Tencent QQ numbers:
[1-9]([0-9]{5,11})
23. Regular expression for domestic landline numbers:
[0-9-()（）]{7,18}
24. Regular expression for usernames:
[A-Za-z0-9_\-\u4e00-\u9fa5]+
25. g2pC: Context-aware grapheme-to-phoneme (pronunciation annotation) module for Chinese
Repo: Kyubyong/g2pC
26. Time extraction:
Integrated into the Python package cocoNLP
In a test run at 9:44 on June 7, 2016, the results were as follows:
Hi, all. Meeting next Monday at 3 PM.
>>> 2016-06-13 15:00:00-false
Meeting next Monday
>>> 2016-06-13 00:00:00-true
Meeting the Monday after next
Java version: https://github.com/shinyke/Time-NLP
Python version: https://github.com/zhanzecheng/Time_NLP
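The "next Monday" results above can be reproduced with plain date arithmetic. This is only a sketch of the resolution step, not how Time-NLP works internally: the function name resolve_next_monday is hypothetical, and reading the trailing true/false as "no clock time was given" is an interpretation of the output shown above.

```python
from datetime import datetime, timedelta


def resolve_next_monday(base, hour=None):
    """Resolve 'next Monday' as the Monday of the week after base's week.

    Returns (resolved datetime, all_day flag); all_day is True when no
    clock time was supplied.
    """
    days_ahead = 7 - base.weekday()  # Monday is weekday 0
    target = base + timedelta(days=days_ahead)
    target = target.replace(hour=hour or 0, minute=0, second=0, microsecond=0)
    return target, hour is None


base = datetime(2016, 6, 7, 9, 44)  # the test time quoted above (a Tuesday)

when, all_day = resolve_next_monday(base, hour=15)
print(when, all_day)  # 2016-06-13 15:00:00 False

when, all_day = resolve_next_monday(base)
print(when, all_day)  # 2016-06-13 00:00:00 True
```

The hard part of a real time extractor is recognizing the expression in free text; once the expression is parsed, resolving it against a base time is a small calculation like this one.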
27. Quick conversion between “Chinese numbers” and “Arabic numbers”
Repo: HaveTwoBrush/cn2an
- Conversion between Chinese and Arabic numbers
- Support for mixed Chinese and Arabic numbers is under development
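For a sense of what such a conversion involves, here is a minimal sketch that turns a well-formed Chinese integer into an int. It is not cn2an's code or API: the function name cn_to_int and the tables are hypothetical, and decimals, negative numbers, and malformed input are out of scope.

```python
DIGITS = {"零": 0, "一": 1, "二": 2, "两": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
SMALL_UNITS = {"十": 10, "百": 100, "千": 1000}


def cn_to_int(text):
    """Convert a well-formed Chinese integer such as 一百二十三 to 123."""
    total = 0    # value of all completed 万 / 亿 sections
    section = 0  # value accumulated inside the current section (below 10000)
    digit = 0    # most recent digit, waiting for its unit
    for ch in text:
        if ch in DIGITS:
            digit = DIGITS[ch]
        elif ch in SMALL_UNITS:
            # A bare 十 means 10, so a missing digit counts as 1.
            section += (digit or 1) * SMALL_UNITS[ch]
            digit = 0
        elif ch == "万":
            total += (section + digit) * 10**4
            section, digit = 0, 0
        elif ch == "亿":
            # 亿 scales everything accumulated so far.
            total = (total + section + digit) * 10**8
            section, digit = 0, 0
        else:
            raise ValueError(f"unsupported character: {ch!r}")
    return total + section + digit


print(cn_to_int("一百二十三"))            # 123
print(cn_to_int("十二万三千四百五十六"))  # 123456
print(cn_to_int("一亿二千万"))            # 120000000
```

A production library has to handle many more forms (financial numerals, decimals, mixed Chinese and Arabic digits), which is why a dedicated package is worth using.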
28. Comprehensive list of company names
Repo: wainshine/Company-Names-Corpus
29. Ancient poetry corpus
Repo: panhaiqi/AncientPoetry
For a more comprehensive ancient poetry corpus: https://github.com/chinese-poetry/chinese-poetry
30. THUOCL: vocabulary library compiled by Tsinghua University (THU)
Address: http://thuocl.thunlp.org/
A copy is collected in the data folder of the funNLP repo.
IT vocabulary, finance vocabulary, idiom vocabulary, place name vocabulary, historical figure vocabulary, poetry vocabulary, medical vocabulary, food vocabulary, legal vocabulary, automotive vocabulary, animal vocabulary
31. PDF table data extraction tool
Repo: camelot-dev/camelot
32. Regular expressions for matching mainland China mobile numbers (three major operators + virtual operators, etc.)
Repo: VincentSit/ChinaMobilePhoneNumberRegex
33. Username blacklist:
Repo: marteinn/The-Big-Username-Blacklist
Includes a list of usernames that should be disallowed, such as:
administrator
administration
autoconfig
autodiscover
broadcasthost
domain
editor
guest
host
hostmaster
info
keybase.txt
localdomain
localhost
master
mail
mail0
34. Microsoft multilingual number/unit/date-time recognition package:
Repo: Microsoft/Recognizers-Text
35. chinese-xinhua: Chinese Xinhua dictionary database and API, including common idioms, phrases, and Chinese characters
Repo: pwxcoo/chinese-xinhua
36. Automatic document graph generation
Repo: liuhuanyong/TextGrapher
- TextGrapher: a text content grapher based on key information extraction with NLP methods. Given a document, it extracts the key information, structures it, and organizes it into a graphical representation of the article's semantic content.
37. Number naming library for 186 languages
Repo: google/UniNum
38. Simplified and traditional Chinese conversion
Repo: berniey/hanziconv
39. Chinese character feature extractor (featurizer), extracting features of Chinese characters (pronunciation features, character features) for deep learning
Repo: howl-anderson/hanzi_char_featurizer
40. Chinese abbreviation dataset
Repo: zhangyics/Chinese-abbreviation-dataset
41. Wudao Dictionary: a command-line version of Youdao Dictionary that supports Chinese-English lookup in both directions and online queries
Repo: ChestnutHeng/Wudao-dict
42. The best tool for converting Chinese numerals to Arabic numbers
Repo: Wall-ee/chinese2digits
43. LineFlow: Efficient data loader for NLP across all deep learning frameworks
Repo: tofunlp/lineflow
44. Parsing natural language number strings into integers and floats
Repo: jaidevd/numerizer
45. A comprehensive list of English profanity words
Repo: zacanger/profane-words