
This article is a summary based on the author's understanding from studying and researching related materials; it briefly introduces some of the technologies and tasks in natural language processing (NLP). NLP technologies fall into foundational technologies and applied technologies. The author will continue to improve this series in the future. Given the author's limited level, mistakes and omissions are inevitable, and readers are welcome to point them out.
Development
It is generally believed that the famous "Turing Test" proposed by Turing in 1950 marks the beginning of the idea of natural language processing. From the 1950s to the 1970s, natural language processing primarily relied on rule-based methods. Rule-based methods cannot cover every possible sentence and place high demands on their developers; at this stage, natural language processing was still in its rationalist period.
After the 1970s, with the rapid development of the internet, the growing richness of corpora, and improvements in hardware, natural language processing shifted from rationalism to empiricism, and statistical methods gradually replaced rule-based ones.
From 2008 to the present, driven by breakthroughs that deep learning achieved in fields such as image recognition and speech recognition, researchers have gradually introduced deep learning into natural language processing, from early word vectors to word2vec in 2013, pushing the combination of deep learning and natural language processing to a climax and achieving notable successes in machine translation, question answering, reading comprehension, and other tasks. Recently, models such as ELMo and BERT may be opening the next chapter.
Definition
Natural language refers to languages such as Chinese and English that people use in daily life; it evolved naturally with the development of human society rather than being deliberately constructed, and it is an important tool for human learning and life. In other words, natural language is a convention of human society, as distinguished from artificial languages such as programming languages.
Processing encompasses understanding, transformation, and generation. Natural language processing refers to using computers to process the form, sound, and meaning of natural language, i.e., to input, output, recognize, analyze, understand, and generate characters (or, in English, letters), words, sentences, paragraphs, and whole texts. Enabling information exchange between humans and machines is an important issue of common concern to artificial intelligence, computer science, and linguistics, which is why natural language processing is hailed as the pearl of artificial intelligence.
It can be said that the goal of natural language processing is for computers to understand natural language. Its mechanisms involve two processes: natural language understanding and natural language generation. Natural language understanding means that a computer can grasp the meaning of natural language text, while natural language generation means expressing a given intention as natural language text. Understanding and analyzing natural language is a hierarchical process, and many linguists divide it into five levels that reflect the composition of language itself: phonetic analysis, lexical analysis, syntactic analysis, semantic analysis, and pragmatic analysis.
Phonetic analysis distinguishes independent phonemes from the speech stream according to phonemic rules and, based on phoneme-morpheme rules, identifies syllables and their corresponding morphemes or words.
Lexical analysis identifies the various morphemes that make up words in order to obtain linguistic information.
Syntactic analysis analyzes the structure of sentences and phrases, aiming to identify the relationships between words, phrases, etc., and their roles in the sentence.
Semantic analysis refers to using various machine learning methods to learn and understand the semantic content represented by a segment of text. Semantic analysis is a very broad concept.
Pragmatic analysis studies the influence of the external environment on language users.
Foundational Technologies
Foundational technologies include lexical analysis, syntactic analysis, semantic analysis, etc.
Lexical Analysis
Lexical analysis includes Chinese word segmentation (also called tokenization) and part-of-speech tagging.
Chinese word segmentation: Chinese is written without explicit boundaries between words, so the first task in processing Chinese text is to segment the input character string into individual words; this step is known as word segmentation.
Part-of-speech tagging: The purpose of part-of-speech tagging is to assign each word a category, known as a part-of-speech tag, such as noun or verb. A minimal sketch of both steps follows.
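For illustration only (this example is mine, not the article's), here is a minimal sketch of Chinese word segmentation and part-of-speech tagging using the open-source jieba library, assuming it is installed (pip install jieba):

```python
import jieba
import jieba.posseg as pseg

sentence = "自然语言处理是人工智能的明珠"

# Word segmentation: split the character string into a list of words.
print(jieba.lcut(sentence))

# Part-of-speech tagging: attach a POS tag to each segmented word.
for word, flag in pseg.cut(sentence):
    print(word, flag)
```

Under the hood, jieba builds candidate segmentations from a prefix dictionary and uses an HMM to handle out-of-vocabulary words; other segmenters make different trade-offs.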
Syntactic Parsing
Syntactic parsing is the process of analyzing an input sentence to obtain its syntactic structure. The most common syntactic parsing tasks include the following:
Phrase structure syntactic parsing: This task, also known as constituent syntactic parsing, aims to identify the phrase structure in a sentence and the hierarchical syntactic relationships between phrases.
Dependency syntactic parsing: This task aims to identify the dependency relationships between the words of a sentence (see the sketch after this list).
Deep grammar syntactic parsing: This involves using deep grammar, such as Lexicalized Tree Adjoining Grammar (LTAG), Lexical Functional Grammar (LFG), Combinatory Categorial Grammar (CCG), etc., to conduct deep syntactic and semantic analysis of sentences.
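As a concrete illustration of dependency parsing (a sketch of mine, not the article's method), the snippet below uses the spaCy library and its small English model, assuming both are installed (pip install spacy, then python -m spacy download en_core_web_sm):

```python
import spacy

# Load spaCy's small English pipeline (must be downloaded beforehand).
nlp = spacy.load("en_core_web_sm")

doc = nlp("The cat sat on the mat.")
for token in doc:
    # token.dep_ is the dependency label; token.head is the governing word.
    print(f"{token.text:>5} --{token.dep_}--> {token.head.text}")
```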
Semantic Analysis
The ultimate goal of semantic analysis is to understand the true semantics expressed by a sentence. However, what representation form semantic content should take has long troubled researchers, and there is still no unified answer to this question. Semantic role labeling is a relatively mature shallow semantic analysis technique.
In summary, natural language processing systems typically adopt a cascaded architecture: word segmentation, part-of-speech tagging, syntactic analysis, and semantic analysis are treated as separate stages, with a model trained for each. At run time, given an input sentence, each module is applied in sequence, ultimately producing all the analysis results.
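This cascade is visible in off-the-shelf toolkits. As an illustrative sketch, spaCy exposes its internal pipeline as an ordered list of components, each consuming the previous component's output:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

# Each named component runs in order over the shared Doc object.
print(nlp.pipe_names)
# e.g. ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
```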
In recent years, researchers have proposed many effective joint models that learn and decode multiple tasks together, such as joint word segmentation and part-of-speech tagging; joint part-of-speech tagging and syntactic parsing; joint word segmentation, part-of-speech tagging, and syntactic parsing; and joint syntactic and semantic analysis, all achieving good results.
Applied Technologies
Applied technologies in natural language processing, in turn, often build on the foundational technologies above; they include text clustering, text classification, text summarization, sentiment analysis, question answering (QA), machine translation (MT), information extraction, information recommendation, and information retrieval (IR).
Since each task involves many components, I will briefly summarize these tasks here and provide detailed summaries of various technologies later as my learning deepens.
Text Classification: The text classification task automatically assigns one or more predefined category labels to a given document based on its content or topic; it includes both single-label and multi-label classification.
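As a minimal sketch (mine, not the article's; the toy corpus is invented), a classic bag-of-words classifier with scikit-learn, assuming it is installed:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented toy corpus with two predefined labels.
docs = ["the striker scored a late goal",
        "stocks fell sharply on rate fears",
        "the keeper saved a penalty",
        "the central bank raised interest rates"]
labels = ["sports", "finance", "sports", "finance"]

# TF-IDF features feeding a linear classifier.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(docs, labels)

# Likely ['sports'], given the term overlap with the training texts.
print(clf.predict(["a penalty decided the match"]))
```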
Text Clustering: This task partitions a collection of documents into several subsets based on the content or thematic similarity between documents, where documents within each subset are highly similar, while the similarity between subsets is low.
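Again as an illustrative sketch with scikit-learn (the documents are invented), clustering TF-IDF document vectors with k-means:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["deep learning for image recognition",
        "convolutional networks classify image data",
        "bond yields rose after the inflation report",
        "the stock market rallied on strong earnings"]

# Cluster TF-IDF document vectors into two groups.
X = TfidfVectorizer().fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)  # documents sharing a label fell into the same cluster
```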
Text Summarization: The text summarization task refers to compressing and refining the original text to provide users with a concise description.
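One crude extractive approach, sketched below under my own assumption (keep the sentences with the highest average TF-IDF weight; real summarizers are far more sophisticated):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = ["Natural language processing lets computers work with human language.",
             "It has a long history.",
             "Deep learning has recently improved translation and question answering.",
             "Many open problems remain."]

# Score each sentence by the mean TF-IDF weight of its terms.
X = TfidfVectorizer().fit_transform(sentences)
scores = np.asarray(X.mean(axis=1)).ravel()

# Keep the two highest-scoring sentences, in their original order.
top = sorted(np.argsort(scores)[-2:])
print(" ".join(sentences[i] for i in top))
```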
Sentiment Analysis: The sentiment analysis task refers to using computers to analyze and mine opinions, sentiments, attitudes, and emotions from text data.
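A quick illustration for English text using NLTK's lexicon-based VADER analyzer (assuming NLTK is installed; the lexicon is downloaded once on first use):

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-time lexicon download

sia = SentimentIntensityAnalyzer()
# polarity_scores returns negative/neutral/positive proportions
# plus a combined "compound" score in [-1, 1].
print(sia.polarity_scores("This phone is absolutely wonderful!"))
print(sia.polarity_scores("The battery life is terrible."))
```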
Question Answering: Question answering refers to using computers to automatically answer questions posed by users to meet their knowledge needs.
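As a toy sketch of retrieval-based question answering (the tiny knowledge base is invented for illustration), match the user's question against stored questions by TF-IDF cosine similarity and return the paired answer:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented question-answer pairs standing in for a knowledge base.
kb = {
    "What is the capital of France?": "Paris is the capital of France.",
    "Who invented the telephone?": "Alexander Graham Bell invented the telephone.",
}
questions = list(kb)
vec = TfidfVectorizer().fit(questions)

def answer(user_question):
    # Return the answer whose stored question is most similar.
    sims = cosine_similarity(vec.transform([user_question]),
                             vec.transform(questions)).ravel()
    return kb[questions[sims.argmax()]]

print(answer("Tell me the capital of France"))
```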
Machine Translation: Machine translation refers to using computers to automatically translate from one natural language to another. The language being translated is called the source language, while the language being translated into is called the target language.
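For a quick taste (a sketch assuming the Hugging Face transformers library is installed; the first call downloads a default pretrained checkpoint for this task):

```python
from transformers import pipeline

# English (source language) to French (target language); the pipeline
# falls back to a default pretrained model when none is specified.
translator = pipeline("translation_en_to_fr")
print(translator("Machine translation converts one natural language into another."))
```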
Information Extraction: Information extraction refers to extracting specified types of information (such as entities, attributes, relationships, events, product records, etc.) from unstructured/semi-structured texts (such as web pages, news articles, academic papers, microblogs, etc.), and converting unstructured text into structured information through information merging, redundancy elimination, and conflict resolution techniques.
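Named entity recognition, a core sub-task of information extraction, can be sketched with spaCy's pretrained English model (same installation assumptions as earlier):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple was founded by Steve Jobs in Cupertino in 1976.")

# Each recognized entity carries a text span and a type label.
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. Apple ORG, Steve Jobs PERSON, ...
```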
Information Recommendation: Information recommendation identifies, from a large stream of incoming information, the items that match a user's habits, preferences, or interests.
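A minimal content-based sketch (user history and candidate items are invented): represent the user's reading history and each candidate article as TF-IDF vectors and recommend the most similar candidate:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

candidates = ["deep learning for image recognition",
              "bond yields and inflation outlook",
              "word embeddings in natural language processing"]
user_history = ["a tutorial on neural networks",
                "introduction to natural language processing"]

# Build one vector for the user profile and one per candidate article.
vec = TfidfVectorizer().fit(candidates + user_history)
profile = vec.transform([" ".join(user_history)])
scores = cosine_similarity(profile, vec.transform(candidates)).ravel()

print(candidates[scores.argmax()])  # the candidate closest to the profile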
Information Retrieval: Information retrieval refers to organizing information in a structured way and, through search, finding the items that satisfy a user's information needs.
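A classic building block here is the inverted index; the toy sketch below (documents invented) implements boolean AND retrieval over one:

```python
from collections import defaultdict

docs = {1: "natural language processing with deep learning",
        2: "statistical methods for machine translation",
        3: "deep learning for computer vision"}

# Inverted index: term -> set of ids of documents containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def search(query):
    # Boolean AND retrieval: documents containing every query term.
    postings = [index[term] for term in query.split()]
    return set.intersection(*postings) if postings else set()

print(search("deep learning"))  # -> {1, 3}
```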
References:
1. Statistical Natural Language Processing
2. Chinese Information Processing Report – 2016
This article was contributed by Le Yuquan (yuquanle).