Understanding Natural Language Processing: Key Tasks Explained

This article is reproduced from the public account: AI Beginner’s Guide


The editor says:

Author Le Yuquan, a master’s student at Hunan University, focuses on machine learning and natural language processing, and has published several articles in conferences/journals such as IJCAI and TASLP. This article is a summary of the author’s learning process. Friends with similar interests are welcome to follow the author’s public account “AI Beginner’s Guide” for communication and learning together.

Going forward, we will roll out the "Understanding in One Article" series; please stay tuned.


This article summarizes the author's understanding based on their learning and relevant research, briefly introducing some of the core technologies and tasks of natural language processing (NLP). NLP technologies fall into basic technologies and applied technologies; more topics will be elaborated in this series as time allows. Due to the author's limited experience, there may inevitably be errors or omissions, and readers are welcome to point them out.

Development

It is generally believed that the famous "Turing Test" proposed by Alan Turing in 1950 marks the beginning of the idea of natural language processing. From the 1950s to the 1970s, natural language processing mainly relied on rule-based methods. Rule-based methods cannot cover all utterances and place high demands on their developers; during this period, natural language processing remained in its rationalist stage.

After the 1970s, with the rapid development of the internet, the growing richness of corpora, and improvements in hardware, natural language processing shifted from rationalism to empiricism, and statistical methods gradually replaced rule-based ones.

From 2008 to the present, following breakthroughs of deep learning in fields such as image recognition and speech recognition, researchers have gradually introduced deep learning into natural language processing. From early word vectors to word2vec in 2013, the combination of deep learning and natural language processing reached a climax, achieving notable success in machine translation, question answering, reading comprehension, and other areas. More recently, models such as ELMo and BERT may be opening the next chapter.

Definition

Natural language refers to the languages people use in daily communication, such as Chinese and English. Natural languages evolved through convention as human society developed, which distinguishes them from artificial languages such as programming languages, and they are an essential tool for human learning and life.

Processing covers understanding, transformation, and generation. Natural language processing refers to using computers to process the form, sound, and meaning of natural language information, that is, to operate on and process characters (or letters, in English), words, sentences, paragraphs, and texts as input and output. Enabling information exchange between humans and machines is a central concern of artificial intelligence, computer science, and linguistics; for this reason, natural language processing is often called the jewel in the crown of artificial intelligence.

It can be said that natural language processing aims for computers to understand natural language. The mechanism of natural language processing involves two processes: natural language understanding and natural language generation. Natural language understanding refers to the computer’s ability to understand the meaning of natural language text, while natural language generation refers to the ability to express a given intention in natural language text. The understanding and analysis of natural language is a hierarchical process, with many linguists dividing this process into five levels, which better reflect the composition of the language itself. The five levels are phonetic analysis, lexical analysis, syntactic analysis, semantic analysis, and pragmatic analysis.

Phonetic analysis involves distinguishing individual phonemes from the speech stream according to phoneme rules and identifying syllables and their corresponding morphemes or words based on phoneme morphology rules.

Lexical analysis involves identifying the morphemes that make up words in order to obtain linguistic information.

Syntactic analysis analyzes the structure of sentences and phrases, aiming to identify the interrelations of words, phrases, and their roles in sentences.

Semantic analysis refers to using various machine learning methods to learn and understand the semantic content represented by a segment of text. Semantic analysis is a very broad concept.

Pragmatic analysis examines the influence of context and the external environment on the interpretation of language.

Basic Technologies

Basic technologies include lexical analysis, syntactic analysis, semantic analysis, etc.

Lexical Analysis

Lexical analysis includes Chinese word segmentation (tokenization) and part-of-speech (POS) tagging.

Chinese word segmentation: Unlike English, where words are naturally delimited by spaces, Chinese text has no explicit word boundaries, so the first step in processing Chinese is to segment the input character string into individual words; this step is known as segmentation.

Part-of-speech tagging: The purpose of part-of-speech tagging is to assign each word a category known as a part-of-speech tag, such as noun, verb, etc.
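As a concrete illustration of dictionary-based segmentation, here is a minimal forward maximum-matching (FMM) sketch, one classic approach to Chinese word segmentation. The tiny dictionary below is invented for the example, not a real lexical resource.

```python
# Toy forward maximum-matching (FMM) segmenter: at each position, greedily
# take the longest dictionary word; fall back to a single character.
DICTIONARY = {"自然", "语言", "处理", "自然语言", "自然语言处理", "很", "有趣"}
MAX_WORD_LEN = max(len(w) for w in DICTIONARY)

def fmm_segment(sentence: str) -> list[str]:
    tokens, i = [], 0
    while i < len(sentence):
        # try the longest candidate first, shrinking until a match (or one char)
        for length in range(min(MAX_WORD_LEN, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + length]
            if candidate in DICTIONARY or length == 1:
                tokens.append(candidate)
                i += length
                break
    return tokens

print(fmm_segment("自然语言处理很有趣"))  # ['自然语言处理', '很', '有趣']
```

Real segmenters resolve ambiguity with statistical or neural models rather than pure greedy matching, but the dictionary-lookup idea above is where the task historically started.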

Syntactic Parsing

Syntactic parsing is the process of analyzing the input text sentence to obtain the syntactic structure of the sentence. The most common syntactic parsing tasks include the following:

Phrase-structure syntactic parsing: This task, also known as constituent syntactic parsing, aims to identify the phrase structure within a sentence and the hierarchical syntactic relationships between phrases.

Dependency syntactic parsing: This task aims to identify the interdependencies between words in a sentence.

Deep grammar syntactic parsing: This involves using deep grammars, such as Lexicalized Tree Adjoining Grammar (LTAG), Lexical Functional Grammar (LFG), Combinatory Categorial Grammar (CCG), etc., to perform deep syntactic and semantic analysis of sentences.
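To make the dependency view concrete, a dependency parse can be stored simply as a head index per word (0 marking the root). The hand-written parse of "The cat sat on the mat" below is illustrative; real parsers (transition-based or graph-based) predict these heads automatically.

```python
# A toy dependency parse: each word points to its 1-based head index
# (0 = sentence root), with a relation label per arc.
words = ["The", "cat", "sat", "on", "the", "mat"]
heads = [2, 3, 0, 3, 6, 4]  # e.g. "The" attaches to "cat", "cat" to "sat"
labels = ["det", "nsubj", "root", "prep", "det", "pobj"]

def dependents(head_idx: int) -> list[str]:
    """Words whose head is the word at 1-based position head_idx."""
    return [w for w, h in zip(words, heads) if h == head_idx]

print(dependents(3))  # words attached to "sat": ['cat', 'on']
```

This head-index representation is the standard interchange format for dependency treebanks (e.g. the CoNLL-style column formats).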

Semantic Analysis

The ultimate goal of semantic analysis is to understand the true meaning expressed by a sentence. However, the question of what representation form should be used for semantics has long troubled researchers, and there is still no unified answer to this question. Semantic role labeling is currently a relatively mature shallow semantic analysis technique.

In summary, natural language processing systems typically adopt a cascaded approach in which segmentation, part-of-speech tagging, syntactic analysis, and semantic analysis are trained separately. At run time, given an input sentence, each module is applied in sequence to obtain all levels of analysis.

In recent years, researchers have proposed many effective joint models that learn and decode multiple tasks together, such as joint segmentation and POS tagging; joint POS tagging and parsing; joint segmentation, POS tagging, and parsing; and joint syntactic and semantic analysis, achieving good results.

Applied Technologies

On the other hand, there are applied technologies in natural language processing, which often rely on basic technologies, including Text Clustering, Text Classification, Text Summarization, Sentiment Analysis, Question Answering (QA), Machine Translation (MT), Information Extraction, Information Recommendation, Information Retrieval (IR), etc.

Since each task involves many aspects, I will briefly summarize these tasks here and provide detailed summaries of various technologies later as my learning deepens.

Text Classification: The text classification task is to automatically assign predefined category labels based on the content or topic of a given document. This includes single-label classification and multi-label text classification.
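A minimal sketch of single-label text classification is a multinomial Naive Bayes model over a bag of words. The four training sentences and two categories below are invented purely for illustration.

```python
import math
from collections import Counter, defaultdict

# Toy training data: (text, label) pairs, two predefined categories.
train = [
    ("the match ended in a draw", "sports"),
    ("the team won the final game", "sports"),
    ("stocks fell as markets opened", "finance"),
    ("the bank raised interest rates", "finance"),
]

class_words = defaultdict(list)
for text, label in train:
    class_words[label].extend(text.split())

vocab = {w for words in class_words.values() for w in words}

def predict(text: str) -> str:
    scores = {}
    for label, words in class_words.items():
        counts, total = Counter(words), len(words)
        score = math.log(0.5)  # uniform prior: two documents per class
        for w in text.split():
            # add-one (Laplace) smoothing over the shared vocabulary
            score += math.log((counts[w] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(predict("stocks and interest rates fell"))  # 'finance'
```

Production classifiers replace the hand-counted probabilities with learned weights (logistic regression, neural encoders), but the assign-the-highest-scoring-label structure is the same.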

Text Clustering: This task involves partitioning a collection of documents into several subsets based on the content or topic similarity between documents, where documents within each subset are highly similar, while the similarity between subsets is low.

Text Summarization: The text summarization task refers to compressing and refining the original text to provide users with a concise description.

Sentiment Analysis: The sentiment analysis task refers to using computers to analyze and mine the opinions, sentiments, attitudes, emotions, etc., in text data.
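The simplest form of sentiment analysis is lexicon-based: count polarity words and flip their sign after a negator. The tiny lexicon and negator list below are toy examples, not a real sentiment resource.

```python
# Toy lexicon-based sentiment scorer with single-token negation handling.
POSITIVE = {"good", "great", "excellent", "love"}
NEGATIVE = {"bad", "terrible", "awful", "hate"}
NEGATORS = {"not", "never", "no"}

def sentiment(text: str) -> str:
    score, negate = 0, False
    for token in text.lower().split():
        if token in NEGATORS:
            negate = True      # flip the polarity of the next token
            continue
        polarity = (token in POSITIVE) - (token in NEGATIVE)
        score += -polarity if negate else polarity
        negate = False
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("not good at all"))  # 'negative'
```

Modern sentiment systems learn these cues from data instead of hard-coding them, which lets them handle sarcasm, domain shift, and longer-range negation that a fixed lexicon misses.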

Question Answering: Automatic question answering refers to the task of using computers to automatically answer user questions to meet their knowledge needs.

Machine Translation: Machine translation refers to the automatic translation from one natural language to another using computers. The language being translated is called the source language, and the language being translated to is called the target language.

Information Extraction: Information extraction refers to the process of extracting specified types of information (such as entities, attributes, relationships, events, product records, etc.) from unstructured/semi-structured texts (such as web pages, news, academic papers, microblogs, etc.), and converting unstructured text into structured information through techniques such as information merging, redundancy elimination, and conflict resolution.

Information Recommendation: Information recommendation is the process of identifying information that meets user interests from the continuously incoming large-scale information based on user habits, preferences, or interests.

Information Retrieval: Information retrieval refers to the process and technology of organizing information in a certain way and satisfying user information needs through information searches.
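As a concrete sketch of the retrieval idea, here is a minimal TF-IDF plus cosine-similarity ranker over a three-document toy collection; the documents and query are invented for illustration.

```python
import math
from collections import Counter

# Toy document collection.
docs = [
    "machine translation converts one language to another",
    "information retrieval finds relevant documents",
    "neural networks power modern machine translation",
]
tokenized = [d.split() for d in docs]
N = len(docs)
# document frequency: in how many documents each word appears
df = Counter(w for toks in tokenized for w in set(toks))

def tfidf(tokens: list[str]) -> dict[str, float]:
    tf = Counter(t for t in tokens if t in df)  # ignore out-of-vocabulary words
    return {w: (c / len(tokens)) * math.log(N / df[w]) for w, c in tf.items()}

def cosine(a: dict, b: dict) -> float:
    dot = sum(a[w] * b.get(w, 0.0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

doc_vecs = [tfidf(t) for t in tokenized]

def search(query: str) -> str:
    """Return the document most similar to the query."""
    qv = tfidf(query.split())
    best = max(range(N), key=lambda i: cosine(qv, doc_vecs[i]))
    return docs[best]

print(search("relevant documents retrieval"))
```

Real search engines add inverted indexes, better weighting schemes (e.g. BM25), and learned rankers on top, but vector-space scoring of this kind remains the conceptual core.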

