PromptCLUE: Large-Scale Multi-Task Prompt Pre-Trained Chinese Open Source Model


The MLNLP community is a well-known machine learning and natural language processing community in China and abroad, whose members include NLP master's and PhD students, university faculty, and industry researchers.
The community's vision is to promote exchange and progress between academia and industry in natural language processing and machine learning, at home and abroad, especially for beginners.
Reprinted from | CLUE Chinese Language Understanding Evaluation Benchmark

1. Introduction

PromptCLUE: a large-scale, multi-task, prompt pre-trained open-source model for Chinese.
It achieves three unifications for Chinese NLP: a unified model framework, a unified task format, and a unified application method.
It supports dozens of task types with good zero-shot and few-shot capabilities. For understanding tasks such as classification, sentiment analysis, and extraction, custom label sets can be defined; for generation tasks, free-form sampling can be used.
Pre-trained on hundreds of billions of Chinese tokens and further trained on hundreds of millions of Chinese task examples spanning more than 150 tasks. Average task performance improved by more than 7 points over the base version; the model has stronger understanding, generation, and extraction capabilities, and adds support for text rewriting, error correction, and knowledge-graph question answering.
Unified model framework: adopts a text-to-text generative pre-training model for unified modeling.
Unified task format: Prompt unifies the differences between different NLP tasks, transforming them into a unified text-to-text data format.
Unified application method: downstream applications are expressed as prompts adapted to the target task, so the trained model can be used off the shelf for zero-shot/few-shot inference (a minimal usage sketch follows the project links below).
Project address:
https://github.com/clue-ai/PromptCLUE
DEMO address:
www.CLUEbenchmarks.com/clueai.html
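
Because PromptCLUE keeps the standard text-to-text (T5) interface, it can be driven from Hugging Face transformers in a few lines. The sketch below is a minimal zero-shot example; the checkpoint ID ClueAI/PromptCLUE-base and the exact Chinese prompt wording are assumptions to be checked against the GitHub repository.

```python
# Minimal zero-shot sketch, assuming the checkpoint is published on the
# Hugging Face Hub as "ClueAI/PromptCLUE-base" (check the repo for the
# exact name). The model follows the standard T5 text-to-text interface.
from transformers import T5Tokenizer, T5ForConditionalGeneration

model_name = "ClueAI/PromptCLUE-base"  # assumed checkpoint ID
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# Prompt mirrors the intent-classification example in section 6
# (illustrative Chinese rendering): task prefix, input, options, answer cue.
prompt = "意图分类：\n帮我订一个周日上海浦东的房间\n选项：闹钟，文学，酒店，艺术，体育，健康，天气，其他\n答案："

inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))  # e.g. "酒店"
```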

2. Effect Comparison – 16 Types of Chinese Tasks

[Figure: effect comparison of PromptCLUE across the 16 types of Chinese tasks]

3. Related Technologies and Training Process

  1. The three major unifications: unified model framework (text-to-text), unified task format (prompt), unified application method (zero-shot/few-shot). (T0)
  2. Large-scale pre-training: starting from t5-large, trained on hundreds of gigabytes of Chinese corpus for 1 million steps, covering a cumulative 1.5 trillion Chinese character-level tokens.
  3. Large-scale task data: 16 task types spanning hundreds of tasks, accumulating hundreds of millions of task examples.
  4. Mixed pre-training: downstream task data is used as pre-training corpus, and downstream tasks are trained jointly with the pre-training corpus; this reduces catastrophic forgetting of tasks and narrows the gap between pre-training and downstream tasks (ExT5).
  5. Mixed sampling: with many tasks of widely varying sizes, every training batch samples across all tasks proportionally, smoothing the sampling probability by task size while capping how many examples a single task can contribute. Smoothed sampling reduces the bias introduced by task-size imbalance, and mixing tasks within each batch reduces negative transfer between heterogeneous tasks (T5). A sketch of this sampling scheme follows the list.
  6. Staged training: in pre-training, the sequence length is increased in stages (128, then 512) to speed up pre-training (BERT); in downstream training, the learning rate and sequence length are varied and the per-task data cap is gradually lowered, to better adapt to the different downstream tasks.
  7. Additional language-model training: following t5.1.1, prefix-LM training is used in addition to span-corruption unsupervised training, strengthening generation capability (LM adapted).
  8. Additional encoder and decoder training: pre-training corpora are constructed from downstream task data and added to pre-training, strengthening the encoder's understanding capability and the decoder's generation capability (see UIE).
  9. Rebuilt model vocabulary: sentencepiece is used to learn a vocabulary on hundreds of billions of tokens, better matching Chinese usage.
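
To make the mixed-sampling step (item 5) concrete, here is a minimal sketch of size-capped, smoothed task sampling in the spirit of T5's examples-proportional mixing; the cap and smoothing exponent are illustrative values, not figures reported in the post.

```python
import numpy as np

def task_mixing_weights(task_sizes, cap=2**15, alpha=0.5):
    """Sampling probability per task: proportional to dataset size,
    clipped at `cap` examples, then smoothed with exponent `alpha`."""
    clipped = np.minimum(np.asarray(task_sizes, dtype=float), cap)
    smoothed = clipped ** alpha
    return smoothed / smoothed.sum()

# Three tasks with very different sizes share every batch.
sizes = [1_000_000, 50_000, 2_000]
probs = task_mixing_weights(sizes)

rng = np.random.default_rng(0)
# Sample a task per example, so heterogeneous tasks are mixed within a batch.
batch_task_ids = rng.choice(len(sizes), size=32, p=probs)
```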
4. 『Supported Tasks (Partial)』

Intent Classification
News Classification
Sentiment Analysis
Natural Language Inference
Reading Comprehension
Reading Comprehension - Free Style
Summarization
Translation - Chinese to English
Translation - English to Chinese
General Information Extraction
Resume Information Extraction
Medical Information Extraction
E-commerce Customer Demand Analysis
Medical Semantic Similarity
Question Generation
Coreference Resolution
Keyword Extraction
Sentiment Tendency

Article Generation from Title
Knowledge Graph Question Answering
Similar Sentence Generation/Rewriting
Error Correction
Headword Extraction
……
5. 『Performance on pCLUE Benchmark』

[Figure: PromptCLUE results on the pCLUE benchmark]
6. 『Example Input』

News Classification (classify)
Input:
Classification task:
The fund Taihe, sold off over its low discount rate, fell 7.15%. Securities Times reporter Zhu Jingfeng reported that because its discount rate was the lowest among listed closed-end funds, Taihe Fund was heavily sold off by investors yesterday, falling 7.15%, far more than the broader market. Trading data showed that Taihe Fund opened higher with the market but then drifted downward, accelerating its decline in the afternoon with almost no meaningful rebound. By the close, while the CSI 300 index fell only 2.56%, Taihe Fund had dropped 7.15%, the largest decline among all closed-end funds; most closed-end funds fell about 2% yesterday.
Options: Finance, Entertainment, Politics, Stocks
Answer:

Model output:
Finance
-----------------
Intent Classification (classify)
Input:
Intent classification:
Help me book a room in Pudong, Shanghai on Sunday.
Options: Alarm, Literature, Hotel, Art, Sports, Health, Weather, Others
Answer:

Model output:
Hotel
-----------------
Sentiment Analysis (classify)
Input:
Sentiment analysis:
This looks okay, but I actually don't like it.
Options: Positive, Negative
Answer:

Model output:
Negative
-----------------
Inference (generate)
Input:
Infer the relationship between the following premise and hypothesis:
Premise: Sorry, this is how it is.
Hypothesis: This is how it is, no need to apologize.
Options: Neutral, Entailment, Contradiction
Answer:

Model output:
Contradiction
-----------------
Reading Comprehension (generate)
Input:
Read the article and provide the answer:
Paragraph:
The Hong Kong dollar's effective exchange rate index (Effective Exchange Rate Index for the Hong Kong Dollar) is compiled by the Hong Kong government's Census and Statistics Department to reflect the changes in the nominal effective exchange rate of the Hong Kong dollar against the currencies of Hong Kong's major trading partners. The weighting is based on the average trade pattern from 1999 to 2000, but the government has not published a detailed calculation formula. The old Hong Kong dollar index's base date was January 1, 2000, with a base of 100 points. From January 3, 2012, the new series of the Hong Kong dollar index (including 15 currencies and with 2010 as the base = 100) has replaced the old Hong Kong dollar index series. The role of the Hong Kong dollar index is mainly to reflect the price changes of goods and services in Hong Kong relative to its major trading partners and is generally regarded as an indicator of Hong Kong's price competitiveness.
Question: How is the weighting of the Hong Kong dollar index determined?
Answer:

Model output:
Based on the average trade pattern from 1999 to 2000.
-----------------
Reading Comprehension - Free Style (generate)
Input:
Read the following dialogue and answer the question.
Male: Why are you so late to work today? Female: I worked late yesterday, and I also have a cold. Male: Then go back and rest; I will help you take a leave. Female: Thank you.
Question: How is the female doing?
Options: Working, Having a cold, On the phone, Going on a business trip.
Answer:

Model output:
Having a cold
-----------------
Summarization (generate)
Input:
Generate a summary for the following article:
On September 5 at 12:52 PM Beijing time, a magnitude 6.8 earthquake struck Luding County, Ganzi Tibetan Autonomous Prefecture, Sichuan. After the earthquake, the leadership attached great importance and issued instructions requiring that saving lives be the top priority and that all-out rescue efforts be made to minimize casualties.
Answer:

Model output:
A 6.8 magnitude earthquake struck Luding County, Ganzi on September 5.
-----------------
Translation - Chinese to English (generate)
Input:
Translate into English:
议长去了台湾，中国人民很愤怒。
Answer:

Model output:
The speaker went to Taiwan, and the Chinese people are very angry.
-----------------
Translation - English to Chinese (generate)
Input:
Translate into Chinese:
This is a dialogue robot that can talk to people.
Answer:

Model output:
这是一台可以与人交谈的对话机器人。
-----------------
General Information Extraction (generate)
Input:
Extract information:
According to Xinhua News Agency, the government of Qingcheng District, Qingyuan City, Guangdong Province announced yesterday that Chen Bo, director of the district's Environmental Protection Bureau, who was recently reported for allegedly extorting enterprises and claiming he could "destroy a factory in minutes," has been removed from his post as a member of the Qingcheng District Committee.
Question: Organization names, person names, positions
Answer:

Model output:
Organization names: Xinhua News Agency, Qingcheng District Government, Qingcheng District Environmental Protection Bureau, Qingcheng District Committee.
Person names: Chen Bo.
Positions: Director, Committee member.
-----------------
Resume Information Extraction (generate)
Input:
Read the text and extract key information:
Zhang Xuanwu, born in 2000, Chinese nationality, no foreign residency rights, PhD degree, currently the Technical Director of Hangzhou Xiansuo Technology.
Question: Organization, person name, position, origin, major, nationality, education, ethnicity
Answer:

Model output:
Person name: Zhang Xuanwu.
Position: Technical Director of Hangzhou Xiansuo Technology.
Nationality: Chinese nationality.
Education: PhD degree.
-----------------
Medical Information Extraction (generate)
Input:
Extract information from the text:
The patient is in good spirits, has a good diet, sleeps well, and normal bowel movements. The patient's dizziness symptoms have reduced significantly after comprehensive treatment, and the patient continues to take medication to improve cerebral blood vessels and adjust blood pressure changes.
Question: Symptoms, treatment, examination, body parts, diseases
Answer:

Model output:
Symptoms: Dizziness.
Treatment: Medication to improve cerebral blood vessels and adjust blood pressure changes.
Body parts: Bowel movements.
-----------------
E-commerce Customer Demand Analysis (classify)
Input:
E-commerce customer inquiry classification:
Received but not quite fitting, can I return or exchange?
Options: Buyer inquires whether the product supports Huabei payment, buyer expresses interest in the store, buyer inquires about return and exchange rules, buyer needs product recommendations.
Answer:

Model output:
Buyer inquires about return and exchange rules.
-----------------
Medical Semantic Similarity (classify)
Input:
Do the following sentences express the same meaning:
Text 1: What to do about numb legs from diabetes?
Text 2: How to control lifestyle with diabetes?
Options: Similar, Not similar.
Answer:

Model output:
Not similar.
-----------------
Question Generation (generate)
Input:
Question generation:
On September 22, 2022, the Ministry of Commerce held a regular press conference, and the spokesperson of the Ministry of Commerce, Shu Jueting, stated that from January to August this year, China actually used foreign capital of 138.4 billion US dollars, an increase of 20.2%; among them, EU investment in China increased by 123.7% (including data from free port investments). This fully indicates that foreign investors, including the EU, continue to have confidence in the Chinese market and hope to deepen investment cooperation with China.
Answer:

Model output:
1. How much did EU investment in China increase from January to August?
2. Did the increase in EU investment in China meet expectations?
3. What results has the EU achieved in investment cooperation with China?
-----------------
Coreference Resolution (generate)
Input:
Coreference resolution:
Paragraph:
Shaoping and Runye entered her second uncle's yard, and Runsheng walked over and said to him (pronoun): "I looked for you in the dormitory twice; where have you been?"
Question: Who does the pronoun "he" refer to?
Answer:

Model output:
Shaoping.
-----------------
Keyword Extraction (generate)
Input:
Extract keywords:
On the 21st local time, the US Federal Reserve announced a 75-basis-point rate hike, raising the federal funds rate target range to between 3.00% and 3.25%, in line with market expectations. This is the Fed's fifth rate hike this year and its third consecutive one, marking the most aggressive pace of rate hikes since 1981.
Keywords:

Model output:
Federal Reserve, rate target range, rate hike, basis points.
-----------------
Sentiment Tendency (classify)
Input:
What kind of sentiment is contained in the text:
Super cute handsome guy, love it...
Options: Disgust, Like, Happy, Sad, Surprised, Angry, Afraid.
Answer:

Model output:
Like.
-----------------
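
All of the examples above share one skeleton: a task prefix, the input text, an optional question, optional options, and a trailing answer cue. The helper below is an illustrative sketch of that format (build_prompt is not part of the official repo); it reproduces, for example, the sentiment-analysis prompt.

```python
from typing import List, Optional

def build_prompt(task_prefix: str, text: str,
                 question: Optional[str] = None,
                 options: Optional[List[str]] = None) -> str:
    """Assemble a PromptCLUE-style prompt: task prefix, input text,
    optional question and options, ending with the answer cue (答案：)."""
    parts = [f"{task_prefix}：", text]
    if question is not None:
        parts.append(f"问题：{question}")
    if options is not None:
        parts.append("选项：" + "，".join(options))
    parts.append("答案：")
    return "\n".join(parts)

# Mirrors the sentiment-analysis example above (illustrative Chinese rendering).
print(build_prompt("情感分析", "这个看上去还可以，但其实我不喜欢",
                   options=["积极", "消极"]))
```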

7. Technical Communication and Feedback

Related Materials
  1. T5: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
  2. T5 repo
  3. T0: Multitask Prompted Training Enables Zero-Shot Task Generalization
  4. PromptSource: An Integrated Development Environment and Repository for Natural Language Prompts
  5. Post-Prompt Era | NLP Unified Paradigm: Pre-training + Large-Scale Multi-Task Learning
  6. ZeroPrompt: Scaling Prompt-Based Pretraining to 1,000 Tasks Improves Zero-Shot Generalization
  7. ExT5: Towards Extreme Multi-Task Scaling for Transfer Learning
  8. Natural-Instructions v2: Benchmarking Generalization via In-Context Instructions on 1,600+ Language Tasks
  9. LM adapted: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
  10. UIE: Unified Structure Generation for Universal Information Extraction

