Big Data Digest
Compiled by: Tang Zhuzhi, Wu Shuang, Qian Tianpei
Natural Language Processing (NLP) is part art and part science, aimed at extracting information from text data and turning it into a form that computer algorithms can work with. From automatic translation and text classification to sentiment analysis, NLP has become an essential skill for every data scientist.
In this article, you will learn about the 10 most common NLP tasks, along with relevant resources and code.
Why write this article?
I have been working on NLP problems for some time. Along the way, I had to comb through a great deal of material, keeping up with the latest developments in the field through research papers, blogs, and competitions on similar NLP problems, and handling the various situations that come up in NLP work.
Therefore, I decided to consolidate these resources into a one-stop collection that provides the latest relevant resources for common NLP tasks. Below is the list of tasks covered in this article along with related resources. Let’s get started.
Table of Contents:
1. Stemming
2. Lemmatization
3. Word Vectorization
4. Part-of-Speech Tagging
5. Named Entity Disambiguation
6. Named Entity Recognition
7. Sentiment Analysis
8. Semantic Text Similarity
9. Language Identification
10. Text Summarization
1. Stemming
What is stemming? Stemming is the process of removing inflections or derivations from words to convert them to their root or base form. The goal of stemming is to reduce related words to the same stem, even if the stem is not a dictionary entry. For example, in English:
1. The stem of both beautiful and beautifully is beauti
2. The stems of good, better, and best are good, better, and best respectively.
Related Paper: Original text of Martin Porter’s Porter Stemming Algorithm
Related Algorithm: The Porter2 stemming algorithm can be used in Python (https://tartarus.org/martin/PorterStemmer/def.txt)
Program Implementation: Here is how to use it in the Python stemming library (https://bitbucket.org/mchaput/stemming/src/5c242aa592a6d4f0e9a0b2e1afdca4fd757b8e8a/stemming/porter2.py?at=default&fileviewer=file-view-default)
Code for stemming using the Porter2 algorithm:
#!pip install stemming
from stemming.porter2 import stem
# Print the stem of the word "casually"
print(stem("casually"))
2. Lemmatization
What is lemmatization? Lemmatization is the process of converting words to their root or dictionary form. Unlike stemming, lemmatization takes the part of speech into account, i.e., the meaning of the word in the sentence and in neighboring sentences. For example, in English:
1. beautiful and beautifully are lemmatized to beautiful and beautifully respectively.
2. good, better, and best are lemmatized to good, good, and good respectively.
Related Paper 1: This paper discusses different methods of lemmatization in detail. A must-read to understand how traditional lemmatization works. (http://www.ijrat.org/downloads/icatest2015/ICATEST-2015127.pdf)
Related Paper 2: This excellent paper discusses the challenges faced when lemmatizing morphologically rich languages using deep learning. (https://academic.oup.com/dsh/article-abstract/doi/10.1093/llc/fqw034/2669790/Lemmatization-for-variation-rich-languages-using)
Dataset: Here is the link to the Treebank-3 dataset, which you can use to create your own lemmatization tool. (https://catalog.ldc.upenn.edu/ldc99t42)
Program Implementation: Below is the code for English lemmatization using spacy.
#!pip install spacy
#!python -m spacy download en
import spacy
nlp = spacy.load('en')
doc = "good better best"
# Print each token alongside its lemma
for token in nlp(doc):
    print(token, token.lemma_)
3. Word Vectorization
What is word vectorization? Word vectorization refers to representing words or phrases from natural language as vectors of real numbers. This technique is very useful because computers cannot process natural language directly; word vectorization captures the underlying relationships in language in a numeric form that algorithms can work with. Through word vectorization, a word or phrase can be represented by a fixed-dimensional vector, for example a vector of length 100.
For example, the word “Man” might be represented by a five-dimensional vector such as [0.2, -0.1, 0.6, 0.3, -0.4] (illustrative values only), where each number represents the magnitude of the word along a particular dimension.
Related Blog: This article explains word vectorization in detail. (https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/)
Related Paper: This paper explains the details of word vectorization. A must-read for a deep understanding of word vectorization. (https://www.analyticsvidhya.com/blog/2017/10/essential-nlp-guide-data-scientists-top-10-nlp-tasks/)
Related Tool: This is a browser-based word vector visualization tool. (https://ronxin.github.io/wevi/)
Pre-trained Word Vectors: Here is a list of Facebook’s pre-trained word vectors, which include 294 languages. (https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md)
You can download Google’s news pre-trained word vectors here. (https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit)
Program Implementation: The code below loads the downloaded Google News vectors with gensim.
#!pip install gensim
from gensim.models.keyedvectors import KeyedVectors
# Load the pretrained 300-dimensional Google News vectors downloaded above
word_vectors = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
# Look up the vector for a single word
word_vectors['human']
Program Implementation: This code can be used to train your own word vectors with gensim.
import gensim
# A toy corpus: a list of tokenized sentences
sentences = [['first', 'sentence'], ['second', 'sentence']]
# Note: in gensim 4.x the size parameter was renamed to vector_size
model = gensim.models.Word2Vec(sentences, min_count=1, size=300, workers=4)
4. Part-of-Speech Tagging
What is part-of-speech tagging? In simple terms, part-of-speech tagging is the process of labeling the words in a sentence as nouns, verbs, adjectives, adverbs, and so on. For example, for the sentence “Ashok killed the snake with a stick”, a part-of-speech tagger would identify:
Ashok proper noun
killed verb
the determiner
snake noun
with preposition
a determiner
stick noun
. punctuation
Paper 1: Choi’s paper “Dynamic Feature Induction: The Last Gist to the State-of-the-Art” introduces a new method called dynamic feature induction. This is currently the most advanced method for part-of-speech tagging. (https://aclweb.org/anthology/N16-1031.pdf)
Paper 2: This article introduces a method for unsupervised part-of-speech tagging using hidden Markov models. (https://transacl.org/ojs/index.php/tacl/article/viewFile/837/192)
Program Implementation: This code can perform part-of-speech tagging using spacy.
#!pip install spacy
#!python -m spacy download en
import spacy
nlp = spacy.load('en')
sentence = "Ashok killed the snake with a stick"
# Print each token alongside its part-of-speech tag
for token in nlp(sentence):
    print(token, token.pos_)
5. Named Entity Disambiguation
What is named entity disambiguation? Named entity disambiguation is the process of determining which real-world entity a mention in a sentence refers to. For example, for the sentence “Apple earned a revenue of 200 Billion USD in 2016”, named entity disambiguation would infer that Apple in the sentence refers to Apple Inc. and not to the fruit. Named entity disambiguation generally requires an entity knowledge base against which the mentions in the sentence can be linked.
Paper 1: Huang’s paper applies a deep semantic association model based on deep neural networks and knowledge bases, achieving leading results in named entity disambiguation. (https://arxiv.org/pdf/1504.07678.pdf)
Paper 2: Ganea and Hofmann’s paper uses a local neural attention model and word vectorization without manually set features. (https://arxiv.org/pdf/1704.04920.pdf)
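Program Implementation: There is no single standard library call for disambiguation, but below is a toy sketch of the knowledge-base idea described above. The mini knowledge base and the keyword-overlap scoring are purely illustrative, not a production approach.
# A toy knowledge base: for each ambiguous mention, candidate entities with context keywords.
KB = {
    "Apple": {
        "Apple Inc.": {"revenue", "iphone", "company", "usd", "billion"},
        "apple (fruit)": {"eat", "tree", "juice", "fruit"},
    }
}

def disambiguate(mention, sentence):
    words = set(sentence.lower().split())
    candidates = KB.get(mention, {})
    # Pick the candidate whose context keywords overlap most with the sentence.
    return max(candidates, key=lambda c: len(candidates[c] & words), default=None)

print(disambiguate("Apple", "Apple earned a revenue of 200 Billion USD in 2016"))
# -> Apple Inc.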
6. Named Entity Recognition
Named Entity Recognition (NER) is the task of identifying entities with specific meanings in a sentence and categorizing them into types such as person names, organization names, dates, locations, and times. For example, for the sentence “Ram of Apple Inc. traveled to Sydney on 5th October 2017”, an NER system would return:
Ram
of
Apple ORG
Inc. ORG
traveled
to
Sydney GPE
on
5th DATE
October DATE
2017 DATE
Here, ORG represents organization names, and GPE represents geographic locations.
However, when NER is applied to a domain different from the one it was trained on, even the most advanced NER often performs poorly.
Paper: This excellent paper uses a bidirectional LSTM (Long Short-Term Memory) network combining supervised and unsupervised learning methods to achieve state-of-the-art results in named entity recognition in four languages. (https://arxiv.org/pdf/1603.01360.pdf)
Program Implementation: Below is how to perform named entity recognition using spacy.
import spacy
nlp = spacy.load('en')
sentence = "Ram of Apple Inc. traveled to Sydney on 5th October 2017"
# Print each token alongside its entity type (empty if it is not part of an entity)
for token in nlp(sentence):
    print(token, token.ent_type_)
7. Sentiment Analysis
What is sentiment analysis? Sentiment analysis is a broad class of subjective analysis that uses natural language processing techniques to identify the sentiment of customer reviews, the positive or negative emotion expressed in a statement, and the mood conveyed in speech or written text. For example:
“I don’t like chocolate ice cream”—is a negative evaluation of that ice cream.
“I don’t hate chocolate ice cream”—can be considered a neutral evaluation.
Methods for sentiment analysis range from counting positive and negative words in a sentence to approaches based on word embeddings and LSTMs; a minimal lexicon-based sketch follows the resource list below.
Blog 1: This article focuses on sentiment analysis of movie tweets (https://www.analyticsvidhya.com/blog/2016/02/step-step-guide-building-sentiment-analysis-model-graphlab/)
Blog 2: This article focuses on sentiment analysis of tweets during the Chennai floods in India. (https://www.analyticsvidhya.com/blog/2017/01/sentiment-analysis-of-twitter-posts-on-chennai-floods-using-python/)
Paper 1: This article uses a supervised learning method based on Naive Bayes to classify IMDB reviews. (https://arxiv.org/pdf/1305.6143.pdf)
Paper 2: This article uses an unsupervised learning method based on LDA to identify sentiments and opinions in user-generated reviews. This article performs outstandingly in addressing the issue of insufficient annotated reviews. (http://www.cs.cmu.edu/~yohanj/research/papers/WSDM11.pdf)
Repository: This is a great repository containing relevant research papers and various language sentiment analysis implementations. (https://github.com/xiamx/awesome-sentiment-analysis)
Dataset 1: Multi-domain sentiment dataset version 2.0 (http://www.cs.jhu.edu/~mdredze/datasets/sentiment/)
Dataset 2: Twitter sentiment analysis dataset (http://www.sananalytics.com/lab/twitter-sentiment/)
Competition: A very good competition where you can check how your model performs on the sentiment analysis task of Rotten Tomatoes movie reviews. (https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews)
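Program Implementation: As a quick illustration of the lexicon-based end of that spectrum, below is a minimal sketch using NLTK’s VADER analyzer (a tool chosen here for brevity, not one of the resources above). VADER handles simple negation, so the two ice cream sentences score differently.
#!pip install nltk
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
# polarity_scores returns neg/neu/pos proportions and a compound score in [-1, 1]
for text in ["I don't like chocolate ice cream", "I don't hate chocolate ice cream"]:
    print(text, sia.polarity_scores(text))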
8. Semantic Text Similarity
What is semantic text similarity analysis? Semantic text similarity analysis is the process of measuring how close two pieces of text are in meaning. Note that similarity is different from relatedness.
For example:
A car and a bus are similar, while a car and fuel are related.
Paper 1: This article details different methods of measuring text similarity. A must-read article to understand all current methods in one place. (https://pdfs.semanticscholar.org/5b5c/a878c534aee3882a038ef9e82f46e102131b.pdf)
Paper 2: This article introduces using CNN neural networks to compare two short texts. (http://casa.disi.unitn.it/~moschitt/since2013/2015_SIGIR_Severyn_LearningRankShort.pdf)
Paper 3: This article uses Tree-LSTMs to achieve state-of-the-art results in semantic relevance and semantic classification of texts. (https://nlp.stanford.edu/pubs/tai-socher-manning-acl2015.pdf)
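Program Implementation: Below is a minimal sketch using spacy’s built-in similarity, consistent with the spacy snippets above. spacy’s similarity() computes the cosine similarity of averaged word vectors, so bear in mind that word vectors tend to capture relatedness as much as similarity.
#!pip install spacy
#!python -m spacy download en
import spacy
nlp = spacy.load('en')
car, bus, fuel = nlp("car"), nlp("bus"), nlp("fuel")
print(car.similarity(bus))   # similar concepts score high
print(car.similarity(fuel))  # related concepts may also score high with word vectors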
9. Language Identification
What is language identification? Language identification refers to the task of distinguishing texts in different languages. It uses the statistical and grammatical properties of languages to perform this task. Language identification can also be considered a special case of text classification.
Blog: This blog post from the fastText team introduces a new tool that can identify 170 languages while using only 1MB of memory. (https://fasttext.cc/blog/2017/10/02/blog-post.html)
Paper 1: This article discusses seven language identification methods across 285 languages. (http://www.ep.liu.se/ecp/131/021/ecp17131021.pdf)
Paper 2: This article describes how to achieve state-of-the-art results in automatic language identification using deep neural networks. (https://repositorio.uam.es/bitstream/handle/10486/666848/automatic_lopez-moreno_ICASSP_2014_ps.pdf?sequence=1)
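Program Implementation: The fastText tool from the blog above is one option; as a quick illustration, below is a minimal sketch using the langdetect package (an additional tool not mentioned in the resources above).
#!pip install langdetect
from langdetect import detect
# detect() returns a language code such as 'en' or 'de'
print(detect("War doesn't show who's right, just who's left."))  # 'en'
print(detect("Ein, zwei, drei, vier"))  # 'de'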
10. Text Summarization
What is text summarization? Text summarization is the process of shortening a text by identifying its key points and using those points to create a summary. The purpose of text summarization is to minimize the text while maintaining its meaning.
Paper 1: This article describes an abstractive sentence summarization method based on a neural attention model. (https://arxiv.org/pdf/1509.00685.pdf)
Paper 2: This article describes the latest results achieved in text summarization using sequence-to-sequence RNNs. (https://arxiv.org/pdf/1602.06023.pdf)
Repository: This repository from the Google Brain team has code for sequence-to-sequence models customized for text summarization. The model is trained on the Gigaword dataset. (https://github.com/tensorflow/models/tree/master/research/textsum)
Application: Reddit’s autotldr bot uses text summarization to summarize various comments from articles to posts. This feature is very popular among Reddit users. (https://www.reddit.com/r/autotldr/comments/31b9fm/faq_autotldr_bot/)
Program Implementation: Below is how to quickly implement extractive text summarization using the gensim package (note that the gensim.summarization module was removed in gensim 4.0, so this requires gensim < 4.0).
from gensim.summarization import summarize
sentence = "Automatic summarization is the process of shortening a text document with software, in order to create a summary with the major points of the original document. Technologies that can make a coherent summary take into account variables such as length, writing style and syntax. Automatic data summarization is part of machine learning and data mining. The main idea of summarization is to find a subset of data which contains the information of the entire set. Such techniques are widely used in industry today. Search engines are an example; others include summarization of documents, image collections and videos. Document summarization tries to create a representative summary or abstract of the entire document, by finding the most informative sentences, while in image summarization the system finds the most representative and important (i.e. salient) images. For surveillance videos, one might want to extract the important events from the uneventful context. There are two general approaches to automatic summarization: extraction and abstraction. Extractive methods work by selecting a subset of existing words, phrases, or sentences in the original text to form the summary. In contrast, abstractive methods build an internal semantic representation and then use natural language generation techniques to create a summary that is closer to what a human might express. Such a summary might include verbal innovations. Research to date has focused primarily on extractive methods, which are appropriate for image collection summarization and video summarization."
# Print an extractive summary of the passage above
print(summarize(sentence))
Conclusion
Above is an introductory overview and compilation of the most common NLP tasks. If you have more quality resources, feel free to share in the comments!
Original link: https://www.analyticsvidhya.com/blog/2017/10/essential-nlp-guide-data-scientists-top-10-nlp-tasks/