45 Useful Niche NLP Open Source Dictionaries and Tools

Introduction

With the popularity of pre-trained models like BERT, ERNIE, and XLNet, solving NLP problems without a pre-trained model can seem outdated. That impression is clearly wrong.

As we all know, both training and inference with pre-trained models consume a large amount of computing power and rely heavily on GPU resources. Yet many NLP problems can be solved adequately with just dictionaries + rules, so forcing a heavy model onto them is like using a cannon to shoot a mosquito: very inefficient.

Therefore, we have carefully selected 45 practical open-source tools and dictionaries from a rather crazy GitHub repo. They let you reduce your dependence on models and computing power when building NLP systems or assisting model training, and focus instead on small, elegant code.

Repo address:

https://github.com/fighting41love/funNLP

Note: This is a very crazy repo that includes over 300 projects, but their quality is uneven, so remember to compare similar projects against each other before choosing one.

Come, feel it out m(_ _)m

Who knows how I managed to read through all 300 repos (╯°□°）╯︵ ┻━┻

1. textfilter: Chinese and English sensitive word filtering

Repo: observerss/textfilter

 >>> f = DFAFilter()
 >>> f.add("sexy")
 >>> f.filter("hello sexy baby")
 hello **** baby

Sensitive words include political, vulgar, and other topical vocabulary. The principle is mainly based on dictionary lookups (in the project’s keyword file), and the content is not very authentic.
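The dictionary-lookup idea can be sketched with a small character trie, which is roughly what a DFA-style filter walks through. This is only an illustration and not textfilter's actual code: the class name `TrieFilter`, its internals, and the `*` mask are assumptions made for this example.

```python
class TrieFilter:
    """Hypothetical trie-based keyword filter (illustration only)."""

    END = object()  # marker key: a keyword ends at this node

    def __init__(self):
        self.root = {}

    def add(self, keyword):
        # Insert the keyword character by character, case-insensitively.
        node = self.root
        for ch in keyword.lower():
            node = node.setdefault(ch, {})
        node[self.END] = True

    def filter(self, text, mask="*"):
        out = []
        i = 0
        while i < len(text):
            node = self.root
            longest = 0  # length of the longest keyword starting at i
            j = i
            while j < len(text) and text[j].lower() in node:
                node = node[text[j].lower()]
                j += 1
                if self.END in node:
                    longest = j - i
            if longest:
                out.append(mask * longest)  # replace the keyword with mask characters
                i += longest
            else:
                out.append(text[i])
                i += 1
        return "".join(out)


f = TrieFilter()
f.add("sexy")
print(f.filter("hello sexy baby"))  # hello **** baby
```

Each character of the input is followed through the trie at most once per start position, so lookup cost depends on keyword length rather than on the size of the word list.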

2. langid: Language detection for 97 languages

Repo: saffsd/langid.py

pip install langid

>>> import langid
>>> langid.classify("This is a test")
('en', -54.41310358047485)

3. langdetect: Another language detection tool

Address: https://code.google.com/archive/p/language-detection

pip install langdetect

from langdetect import detect
from langdetect import detect_langs

s1 = "本篇博客主要介绍两款语言探测工具，用于区分文本到底是什么语言，"
s2 = 'We are pleased to introduce today a new technology'
print(detect(s1))
print(detect(s2))
s3 = s1 + s2    # mixed Chinese and English sample
print(detect_langs(s3))    # detect_langs() returns every detected language with its probability

Note: The returned language codes follow the ISO 639-1 language coding standard.

Compared with langid above, accuracy is lower, but speed is higher.

4. phone: Look up the home region of Chinese mobile numbers:

Repo: ls0f/phone

Integrated into the Python package cocoNLP

from phone import Phone
p = Phone()
p.find(18100065143)  # returns {'phone': '18100065143', 'province': 'Shanghai', 'city': 'Shanghai', 'zip_code': '200000', 'area_code': '021', 'phone_type': 'Telecom'}

Supported number segments: 13*, 15*, 18*, 14[5,7], 17[0,6,7,8]

Record count: 360569 (updated: April 2017)

The author also provides the data file phone.dat so that non-Python users can load the data themselves.

5. phone: Look up the country of international mobile and phone numbers:

Repo: AfterShip/phone

npm install phone

import phone from 'phone';
phone('+852 6569-8900'); // return ['+85265698900', 'HKG']
phone('(817) 569-8900'); // return ['+18175698900', 'USA']

6. ngender: Determine gender based on names:

Repo: observerss/ngender

Based on naive Bayes probability calculation.

pip install ngender

>>> import ngender
>>> ngender.guess('赵本山')
('male', 0.9836229687547046)
>>> ngender.guess('宋丹丹')
('female', 0.9759486128949907)
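As a rough picture of what "naive Bayes over names" means, the sketch below scores each gender as a prior multiplied by per-character likelihoods and then normalizes. The counts are made-up toy numbers, and `toy_guess` and its data tables are hypothetical names for this example; this is not ngender's data or code.

```python
# Toy per-character counts (hypothetical numbers, NOT ngender's data).
CHAR_COUNTS = {
    "male": {"山": 30, "本": 20, "丹": 2, "强": 40},
    "female": {"山": 3, "本": 5, "丹": 35, "丽": 40},
}
PRIOR = {"male": 0.5, "female": 0.5}
VOCAB = set().union(*CHAR_COUNTS.values())  # all characters seen for either gender


def toy_guess(given_name):
    """Return (label, probability) for the characters of a given name."""
    scores = {}
    for gender, counts in CHAR_COUNTS.items():
        total = sum(counts.values())
        p = PRIOR[gender]
        for ch in given_name:
            # Add-one smoothing so unseen characters do not zero out the product.
            p *= (counts.get(ch, 0) + 1) / (total + len(VOCAB))
        scores[gender] = p
    best = max(scores, key=scores.get)
    return best, scores[best] / sum(scores.values())


print(toy_guess("本山"))
print(toy_guess("丹丹"))
```

The "naive" part is the assumption that each character contributes independently, which is what allows the per-character probabilities to be multiplied.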

7. Regular expression for extracting emails

Integrated into the Python package cocoNLP

import re
# Anchored with ^ and $: matches only when text is exactly one email address.
email_pattern = r'^[*#\u4e00-\u9fa5 a-zA-Z0-9_.-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*\.[a-zA-Z0-9]{2,6}$'
emails = re.findall(email_pattern, text, flags=0)
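Because the pattern above is anchored with ^ and $, it is better suited to validating a single string than to pulling addresses out of running text. For extraction, an unanchored pattern is one option. The sketch below uses a simplified pattern of its own (an assumption for this example, not cocoNLP's), and `extract_emails` is a hypothetical helper name.

```python
import re

# Simplified, unanchored pattern (an assumption for this sketch, not cocoNLP's own).
# The repeated group is non-capturing so findall returns whole matches.
EMAIL_RE = re.compile(
    r'[A-Za-z0-9_.+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,6}'
)


def extract_emails(text):
    """Return every substring of text that looks like an email address."""
    return EMAIL_RE.findall(text)


sample = "Contact alice@example.com or bob.lee@mail.example.org today."
print(extract_emails(sample))
```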

8. Regular expression for extracting phone numbers

Integrated into the Python package cocoNLP

import re
# Anchored with ^ and $: matches only when text is exactly one 11-digit number.
cellphone_pattern = r'^(?:13[0-9]|14[0-9]|15[0-9]|17[0-9]|18[0-9])\d{8}$'
phoneNumbers = re.findall(cellphone_pattern, text, flags=0)
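To find numbers inside longer text rather than validate one string, the same prefixes can be used without anchors. The sketch below is one possible variant, with `extract_cellphones` as a hypothetical helper name; the lookarounds are an addition for this example so that an 11-digit run inside a longer number (an order ID, say) is not reported.

```python
import re

# Same prefixes as the pattern above (13x, 14x, 15x, 17x, 18x), but unanchored.
# (?<!\d) and (?!\d) reject matches embedded in a longer run of digits.
CELLPHONE_RE = re.compile(r'(?<!\d)1[34578]\d{9}(?!\d)')


def extract_cellphones(text):
    """Return all 11-digit mobile numbers found in text."""
    return CELLPHONE_RE.findall(text)


sample = "Call 13812345678 or 15900001111; order no. 9913812345678001"
print(extract_cellphones(sample))
```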

9. Regular expression for extracting ID card numbers

import re
IDCards_pattern = r'^[1-9]\d{5}[12]\d{3}(?:0[1-9]|1[012])(?:0[1-9]|[12][0-9]|3[01])\d{3}[0-9xX]$'
IDs = re.findall(IDCards_pattern, text, flags=0)
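The regular expression above checks only the shape of the number (region code, birth date, sequence, final character). An 18-character resident ID also carries a check character computed with the ISO 7064 MOD 11-2 scheme, which a rule-based pipeline can verify as a second step. The sketch below assumes that scheme; `has_valid_check_char` is a hypothetical helper name, and the sample value is a commonly used illustrative number, not a real person's ID.

```python
# Weights for the first 17 digits and the check characters indexed by (sum mod 11).
WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
CHECK_CHARS = "10X98765432"


def has_valid_check_char(id_number):
    """Return True if the 18th character matches the MOD 11-2 check value."""
    body = id_number[:17]
    if len(id_number) != 18 or any(c not in "0123456789" for c in body):
        return False
    total = sum(int(d) * w for d, w in zip(body, WEIGHTS))
    return CHECK_CHARS[total % 11] == id_number[17].upper()


print(has_valid_check_char("11010519491231002X"))
```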

10. Name corpus:

Repo: wainshine/Chinese-Names-Corpus

Name extraction functionality has been added to the Python package cocoNLP.

Chinese (modern, ancient) names, Japanese names, Chinese surnames and given names, titles (Aunt, Little Aunt, etc.), English->Chinese names (Li John), idiom dictionary

(Can be used for Chinese word segmentation, name recognition)

11. Chinese abbreviation library:

Repo: zhangyics/Chinese-abbreviation-dataset

全国人大: 全国/n 人民/n 代表大会/n
中国: 中华人民共和国/ns
女网赛: 女子/n 网球/n 比赛/vn

12. Chinese character decomposition dictionary:

Repo: kfcd/chaizi

汉字    拆法 (一)    拆法 (二)    拆法 (三)
拆    手 斥    扌 斥    才 斥

13. Vocabulary sentiment values:

Repo: rainarch/SentiBridge

山泉水    充沛    0.400704566541    0.370067395878
视野    宽广    0.305762728932    0.325320747491
大峡谷    惊险    0.312137906517    0.378594957281

14. Chinese vocabulary, stop words, sensitive words

Repo: dongxiexidian/Chinese

This package’s sensitive word library is more detailed:

Counter-revolutionary word library, sensitive word library statistics, violent terror word library, livelihood word library, pornography word library.

15. Chinese characters to Pinyin:

Repo: mozillazg/python-pinyin

Text correction pipelines typically need this.

16. Simplified and traditional Chinese conversion:

Repo: skydark/nstools

17. Engine that simulates Chinese pronunciation with English sounds

Repo: tinyfool/ChineseWithEnglish

say wo i ni
# Say: I love you

In other words, it approximates Chinese pronunciation using English phonetics.

18. Synonym library, antonym library, negation library:

Repo: guotong1988/chinese_dictionary

19. Chinese character data

Repo: skishore/makemeahanzi

Stroke order for simplified/traditional Chinese characters
Vector strokes

20. Splitting and extracting words from unspaced English strings:

Repo: keredson/wordninja

>>> import wordninja
>>> wordninja.split('derekanderson')
['derek', 'anderson']
>>> wordninja.split('imateapot')
['im', 'a', 'teapot']
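One common way to approach this kind of splitting is dynamic programming over word costs, where frequent words are cheap and the cheapest full segmentation wins. The sketch below illustrates that idea with a tiny made-up vocabulary; the frequencies, `split_words`, and the other names are hypothetical, and this is not wordninja's own code or word list.

```python
import math

# Toy vocabulary with made-up frequencies (hypothetical values).
FREQ = {"im": 50, "i": 200, "a": 500, "tea": 40, "pot": 30, "teapot": 20, "mate": 10}
TOTAL = sum(FREQ.values())
COST = {w: -math.log(c / TOTAL) for w, c in FREQ.items()}  # rarer word = higher cost
MAX_LEN = max(map(len, FREQ))


def split_words(s):
    """Return the lowest-cost segmentation of s, or None if none exists."""
    # best[i] = (cost of the best segmentation of s[:i], start of its last word)
    best = [(0.0, 0)] + [(math.inf, 0)] * len(s)
    for end in range(1, len(s) + 1):
        for start in range(max(0, end - MAX_LEN), end):
            word = s[start:end]
            if word in COST:
                cost = best[start][0] + COST[word]
                if cost < best[end][0]:
                    best[end] = (cost, start)
    if math.isinf(best[-1][0]):
        return None
    # Walk the back-pointers from the end to recover the words.
    words, end = [], len(s)
    while end > 0:
        start = best[end][1]
        words.append(s[start:end])
        end = start
    return words[::-1]


print(split_words("imateapot"))
```

With these toy costs, "im a teapot" beats alternatives such as "i mate a pot" because it uses fewer and more frequent words.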

21. Regular expression for IP addresses:

(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)\.(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)\.(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)\.(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)

22. Regular expression for Tencent QQ numbers:

[1-9]([0-9]{5,11})

23. Regular expression for domestic landline numbers:

[0-9-()（）]{7,18}

24. Regular expression for usernames:

[A-Za-z0-9_\-\u4e00-\u9fa5]+
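The patterns in items 21 to 24 are listed without anchors, so when they are used to validate a whole string it helps to require a full match. A minimal sketch using Python's `re.fullmatch` follows; the `PATTERNS` table and `is_valid` helper are hypothetical names for this example, and the patterns themselves are copied from the items above.

```python
import re

# One octet of the IP pattern from item 21; the full pattern joins four with dots.
OCTET = r'(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)'

PATTERNS = {
    "ip": r'\.'.join([OCTET] * 4),
    "qq": r'[1-9]([0-9]{5,11})',
    "landline": r'[0-9-()（）]{7,18}',
    "username": r'[A-Za-z0-9_\-\u4e00-\u9fa5]+',
}


def is_valid(kind, value):
    """Return True only if the entire value matches the pattern for kind."""
    return re.fullmatch(PATTERNS[kind], value) is not None


print(is_valid("ip", "192.168.1.1"), is_valid("ip", "256.1.1.1"))
print(is_valid("qq", "12345678"), is_valid("username", "张三_01"))
```

Note that the landline pattern only restricts the character set and length, so it accepts many strings that are not real phone numbers.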

25. g2pC: Context-aware automatic pronunciation (pinyin) annotation module for Chinese

Repo: Kyubyong/g2pC

26. Time extraction:

Integrated into the Python package cocoNLP

In a test run at 9:44 on June 7, 2016, the results were as follows:

Hi, all. Meeting next Monday at 3 PM.
>>> 2016-06-13 15:00:00-false

Meeting next Monday
>>> 2016-06-13 00:00:00-true

Meeting the Monday after next

Java version: https://github.com/shinyke/Time-NLP

Python version: https://github.com/zhanzecheng/Time_NLP

27. Quick conversion between “Chinese numbers” and “Arabic numbers”

Repo: HaveTwoBrush/cn2an

Conversion between Chinese and Arabic numbers
Support for strings that mix Chinese and Arabic numerals is under development
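To show the kind of rule such a converter applies, here is a minimal sketch for plain integers: digits wait for their unit, units below 万 accumulate into a section, and 万 / 亿 scale what has been read so far. It is an illustration under those assumptions only; `chinese_to_int` is a hypothetical name, this is not cn2an's implementation, and it ignores decimals, negative numbers, and colloquial forms such as 一百一.

```python
DIGITS = {"零": 0, "一": 1, "二": 2, "两": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
SMALL_UNITS = {"十": 10, "百": 100, "千": 1000}


def chinese_to_int(text):
    """Convert a Chinese integer numeral such as 十二万三千四百五十六 to an int."""
    total = 0    # value already scaled by 万 or 亿
    section = 0  # value of the current group below 10000
    digit = 0    # most recent digit, waiting for its unit
    for ch in text:
        if ch in DIGITS:
            digit = DIGITS[ch]
        elif ch in SMALL_UNITS:
            # A bare unit such as 十 means 1 x 10.
            section += (digit or 1) * SMALL_UNITS[ch]
            digit = 0
        elif ch == "万":
            total += (section + digit) * 10**4
            section = digit = 0
        elif ch == "亿":
            # 亿 scales everything read so far, e.g. 一万亿 = 10**12.
            total = (total + section + digit) * 10**8
            section = digit = 0
        else:
            raise ValueError(f"unsupported character: {ch!r}")
    return total + section + digit


print(chinese_to_int("十二万三千四百五十六"))  # 123456
```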

28. Comprehensive list of company names

Repo: wainshine/Company-Names-Corpus

29. Ancient poetry corpus

Repo: panhaiqi/AncientPoetry

For a more comprehensive ancient poetry corpus: https://github.com/chinese-poetry/chinese-poetry

30. THU organized vocabulary library

Repo: http://thuocl.thunlp.org/

Also collected in the data folder of the funNLP repo.

IT vocabulary, finance vocabulary, idiom vocabulary, place name vocabulary, historical figure vocabulary, poetry vocabulary, medical vocabulary, food vocabulary, legal vocabulary, automotive vocabulary, animal vocabulary

31. PDF table data extraction tool

Repo: camelot-dev/camelot

32. Regular expressions for domestic mobile numbers (three major carriers + virtual operators, etc.)

Repo: VincentSit/ChinaMobilePhoneNumberRegex

33. Username blacklist:

Repo: marteinn/The-Big-Username-Blacklist

Includes a list of usernames that should be disallowed, such as:

administrator
administration
autoconfig
autodiscover
broadcasthost
domain
editor
guest
host
hostmaster
info
keybase.txt
localdomain
localhost
master
mail
mail0
mail

34. Microsoft multilingual number/unit/date-time recognition package:

Repo: Microsoft/Recognizers-Text

35. chinese-xinhua Chinese Xinhua dictionary database and API, including common idioms, phrases, and Chinese characters

Repo: pwxcoo/chinese-xinhua

36. Automatic document graph generation

Repo: liuhuanyong/TextGrapher

TextGrapher – Text Content Grapher based on key information extraction by NLP method. Input a document, extract key information, structure it, and finally organize it into a graphical representation of the semantic information of the article.

37. Number naming library for 186 languages

Repo: google/UniNum

38. Simplified and traditional Chinese conversion

Repo: berniey/hanziconv

39. Chinese character feature extractor (featurizer), extracting features of Chinese characters (pronunciation features, character features) for deep learning

Repo: howl-anderson/hanzi_char_featurizer

40. Chinese abbreviation dataset

Repo: zhangyics/Chinese-abbreviation-dataset

41. Wudao Dictionary – Command-line version of Youdao Dictionary, supporting two-way Chinese-English lookup and online queries

Repo: ChestnutHeng/Wudao-dict

42. chinese2digits: The best tool for converting Chinese numerals to Arabic numbers

Repo: Wall-ee/chinese2digits

43. LineFlow: Efficient data loader for NLP across all deep learning frameworks

Repo: tofunlp/lineflow

44. Parsing natural language number strings into integers and floats

Repo: jaidevd/numerizer

45. A comprehensive list of English profanity words

Repo: zacanger/profane-words

In addition, the repo includes many datasets, but their quality is uneven, so we have skipped them here. Anyone who needs them can explore the repo.
