Reprinted from | PaperWeekly
Author | Albert Yang
Affiliation | Amazon/Georgia Tech
Research Direction | NLP
As a procrastinator, I’m writing this summary of the ACL conference just before NAACL 🙁 .
At the onsite conference, I ran into my former boss Bonnie Webber (ACL 2020 Lifetime Achievement Award) after a long time. Coincidentally, she hosted my boss Diyi Yang’s Rising Star talk, and Diyi also had an outstanding paper and a tutorial, which probably hints that Diyi will soon become a legend in the field 🙂 . Below, I look at NLP hotspots and the latest interesting work, mainly from the perspective of large models, drawing on the ACL 2022 tutorials, workshops, and invited talks (the next big ideas, keynote, and rising star talks).
1
Continuing to Pre-train Large Models Remains One of the Major Directions in Industry
In the tutorial “Zero- and Few-Shot NLP with Pretrained Language Models”, Iz Beltagy used the final part to discuss what to pay attention to when pre-training.
In addition to standard model architectures and efficient training methods, I think two points are worth noting. First, before starting pre-training, one should use empirical scaling formulas to estimate the optimal model size for the available compute. For example, an OPT-175B-scale model requires roughly 1,000 A100 80GB GPUs training for about two months; even in industry, very few labs have such resources, so running multiple full training attempts to find the optimal model is simply not an option. Second, the selection and construction of pre-training data deserve more attention.
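To make the first point concrete, here is a back-of-the-envelope sketch of compute-optimal sizing. It is my own simplification, not the tutorial’s formula: it assumes the common C ≈ 6·N·D estimate of training FLOPs and a Chinchilla-style rule of thumb of roughly 20 training tokens per parameter, and the throughput figure is purely illustrative.

```python
# Back-of-the-envelope compute-optimal sizing (a simplification, not the
# tutorial's formula). Assumes C ~ 6*N*D training FLOPs and a Chinchilla-style
# rule of thumb of roughly 20 training tokens per parameter.

def compute_optimal(total_flops, tokens_per_param=20.0):
    """Return a rough (n_params, n_tokens) pair that exhausts the FLOP budget."""
    # C = 6 * N * D with D = tokens_per_param * N  =>  N = sqrt(C / (6 * ratio))
    n_params = (total_flops / (6.0 * tokens_per_param)) ** 0.5
    return n_params, tokens_per_param * n_params

# Example: ~1,000 GPUs for two months at an assumed ~150 TFLOP/s effective
# throughput each (an illustrative utilization figure, not a measurement).
budget = 1000 * 150e12 * 3600 * 24 * 60
n, d = compute_optimal(budget)
print(f"~{n / 1e9:.0f}B parameters on ~{d / 1e12:.1f}T tokens")
```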
Additionally, there was a workshop specifically discussing model pre-training: “Workshop on Challenges & Perspectives in Creating Large Language Models”.
2
Let Large Models Solve More Comprehensive NLP Problems and Problems Beyond NLP
This includes extending large-scale Transformer models to multi-task, multimodal, and multilingual settings.
2.1 Cross-task Generalization
2.1.1 Instructions as Task Descriptions
Similar to FLAN, T0, and InstructGPT, “NaturalInstructions” also uses instructions as task descriptions within the prompt, so that a model meta-trained on many known tasks can generalize to unseen tasks from their instructions alone (a schematic instruction-style prompt is sketched after this list):
- “Cross-Task Generalization via Natural Language Crowdsourcing Instructions” (and the subsequent “NaturalInstructions V2”)
- “MetaICL: Learning to Learn In Context” (meta-training means the model sees instructions for different tasks during the training phase, rather than only being given a task instruction at inference time as with GPT-3)
- “Adapting Language Models for Zero-shot Learning by Meta-tuning on Dataset and Prompt Collections” (earlier work)
- “Meta-learning via Language Model In-context Tuning”
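As a rough illustration of what “instructions as part of the prompt” means, the snippet below assembles a task definition, a couple of demonstrations, and a new input into a single prompt. The field names and wording are mine and do not follow any dataset’s exact format:

```python
# Illustrative instruction-style prompt (field names and wording are made up,
# not the exact NaturalInstructions / FLAN / T0 format).
definition = "Given a product review, classify its sentiment as 'positive' or 'negative'."
demonstrations = [
    ("The battery died within a week.", "negative"),
    ("Exactly what I needed, works great.", "positive"),
]
test_input = "Shipping was slow but the product itself is excellent."

prompt = f"Definition: {definition}\n\n"
for x, y in demonstrations:
    prompt += f"Input: {x}\nOutput: {y}\n\n"
prompt += f"Input: {test_input}\nOutput:"
print(prompt)  # this string is fed as-is to the frozen language model
```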
2.1.2 Continual / Lifelong Learning
Some specific methods can help with Continual/Lifelong Learning, such as “Continual Sequence Generation with Adaptive Compositional Modules”, which uses a method similar to MoE to combine different task modules.
2.2 Multimodal Learning
My multimodality paper list has not been updated for a few months and is somewhat out of date:
https://github.com/JingfengYang/Multi-modal-Deep-Learning
2.2.1 Vision
As Transformers show increasingly strong performance on vision tasks, multimodal pre-training received significant attention at ACL. The latest developments are covered in the tutorial “Vision-Language Pretraining: Current Trends and the Future”. However, current vision-language pre-training objectives mostly focus on masking text tokens or predicting whether an image and a text match; masked-image reconstruction (as in MAE) has not yet brought comparable gains in multimodal pre-training. The tutorial pointed out that this is a promising direction for a breakthrough.
Indeed, because words carry dense semantic information, using them as the supervision target works better (the workshop “Learning with Natural Language Supervision” hosted by Jacob Andreas also emphasized this point). How to effectively use the semantically sparser signal in images as a pre-training target (e.g., discrete visual tokens as in BEiT) remains to be explored.
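As a toy illustration of the kind of objective being discussed (a minimal sketch of my own, not any particular paper’s recipe), the snippet below trains a single Transformer encoder to recover masked text tokens and masked discrete visual tokens (BEiT-style codes from a frozen image tokenizer), so that both modalities get a semantically meaningful prediction target:

```python
# Toy sketch of joint masked prediction over text tokens and discrete visual
# tokens (BEiT-style codes from a frozen image tokenizer). Sizes are arbitrary.
import torch
import torch.nn as nn
import torch.nn.functional as F

text_vocab, visual_vocab, d = 30522, 8192, 256
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True), num_layers=2
)
text_emb, vis_emb = nn.Embedding(text_vocab, d), nn.Embedding(visual_vocab, d)
text_head, vis_head = nn.Linear(d, text_vocab), nn.Linear(d, visual_vocab)
mask_vec = torch.zeros(d)  # shared [MASK] embedding for hidden positions

B, Lt, Lv = 2, 16, 49
text_ids = torch.randint(text_vocab, (B, Lt))   # caption token ids
vis_ids = torch.randint(visual_vocab, (B, Lv))  # visual codes for image patches
text_m = torch.rand(B, Lt) < 0.15               # text positions to predict
vis_m = torch.rand(B, Lv) < 0.4                 # patch positions to predict

t = torch.where(text_m.unsqueeze(-1), mask_vec, text_emb(text_ids))
v = torch.where(vis_m.unsqueeze(-1), mask_vec, vis_emb(vis_ids))
h = encoder(torch.cat([t, v], dim=1))           # one joint sequence
h_t, h_v = h[:, :Lt], h[:, Lt:]

loss = F.cross_entropy(text_head(h_t)[text_m], text_ids[text_m]) + \
       F.cross_entropy(vis_head(h_v)[vis_m], vis_ids[vis_m])
```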
2.2.2 Tables
Multimodal pre-training also covers other modalities, such as language and tables. As with vision-language pre-training, designing better pre-training objectives and achieving better alignment between tables and text are worth exploring. The paper I worked on at Google explored this direction further: “TableFormer: Robust Transformer Modeling for Table-Text Encoding”.
2.2.3 Code
Additionally, code pre-training is being adopted by more and more companies in industry. Besides OpenAI’s Codex and the paid GitHub/Microsoft Copilot, AWS has released CodeWhisperer, and Luke at Meta is also pre-training code generation models. Semantic parsing is likely to be dominated by these models, or at least to rely on code-generation models to produce data and alleviate the shortage of training data (an approach Jacob Andreas strongly recommended when we chatted).
2.3 Multilingual Learning
2.3.1 Interesting Directions
In addition to the continuous emergence of more multilingual pre-training models and applications for more downstream tasks (e.g., “mLUKE: The Power of Entity Representations in Multilingual Pretrained Language Models”), I find the following aspects particularly interesting and worth exploring:
1. “Multi Task Learning For Zero Shot Performance Prediction of Multilingual Models” proposes a multi-task framework for predicting the zero-shot cross-lingual transfer performance of multilingual models without evaluating on the target low-resource language, i.e., even when no annotated test set exists in that language.
2. Why pre-trained multilingual models work so well remains an open question. “Cross-Lingual Ability of Multilingual Masked Language Models: A Study of Language Structure” argues that vocabulary overlap (anchors) is not the source of their cross-lingual ability, nor is constituent order; rather, it is semantic composition. Previous studies of word anchors have reached similar or conflicting conclusions.
3. Regarding how to pre-train better multilingual models and cover a more diverse range of languages, using subwords or characters within the same language family (like Indo-European languages) can provide more shared vocabulary as anchors (for example, “Canine: Pre-training an Efficient Tokenization-Free Encoder for Language Representation”).
For languages that are very different, such as Chinese and English, I believe we should also find ways to help the model learn shared grammatical structures between different languages, such as Universal Dependency (UD) structures, which have been shown to significantly aid zero-shot cross-lingual semantic parsing in our previous work (“Frustratingly Simple but Surprisingly Strong: Using Language-Independent Features for Zero-shot Cross-lingual Semantic Parsing”).
2.3.2 Special Theme
This year’s ACL special theme is “Language Diversity: from Low-Resource to Endangered Languages”. In the rising star talk, Sebastian Ruder gave a presentation titled “Scaling NLP Systems to the Next 1000 Languages”. Indeed, the NLP issues for languages with extremely limited corpora are critically important from a social impact and fairness perspective.
3
Making Good Use of Large Models
The academic community does not have vast computational resources for pre-training, but many valuable questions remain, chief among them how to make effective use of large pre-trained models.
3.1 Decoding / Sampling
For extremely large models, fine-tuning is infeasible in most scenarios, so designing more effective decoding and sampling methods to directly exploit the model’s generative ability is a key research area. For instance, Ryan Cotterell presented a sampling algorithm in “Typical Decoding for Natural Language Generation” that generates more natural text and reduces repetitive generation.
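For intuition, here is a minimal NumPy sketch of one decoding step as I understand the locally typical sampling idea (keep the tokens whose surprisal is closest to the distribution’s entropy until their mass reaches a threshold, then sample from that set); it is a reading of the paper’s idea, not the authors’ reference implementation:

```python
# One step of locally typical sampling (a reading of the idea, not the
# reference code): keep tokens whose surprisal is closest to the entropy of
# the distribution until their mass reaches tau, then sample from that set.
import numpy as np

def typical_sample(logits, tau=0.95, rng=np.random.default_rng()):
    p = np.exp(logits - logits.max())
    p /= p.sum()
    entropy = -(p * np.log(p + 1e-12)).sum()
    score = np.abs(-np.log(p + 1e-12) - entropy)   # distance from "typicality"
    order = np.argsort(score)                      # most typical tokens first
    keep = order[: np.searchsorted(np.cumsum(p[order]), tau) + 1]
    q = p[keep] / p[keep].sum()
    return rng.choice(keep, p=q)

next_token = typical_sample(np.random.randn(50))   # toy 50-token vocabulary
```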
Designing better constrained decoding algorithms for controllable generation also remains a focus, e.g., “COLD Decoding: Energy-based Constrained Text Generation with Langevin Dynamics”. I particularly like the constrained decoding approach to information extraction (IE) in “Multilingual Autoregressive Entity Linking” (our work “SEQZERO: Few-shot Compositional Semantic Parsing with Sequential Prompts and Zero-shot Models” uses a similar method).
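The flavor of that approach is prefix-trie constrained decoding: at each step the decoder may only emit tokens that keep the output on the path of some valid entity name. A minimal sketch follows; the token ids are made up for illustration, and in practice they would come from the model’s tokenizer with the prompt prefix stripped before the lookup:

```python
# Minimal prefix-trie constrained decoding in the spirit of GENRE/mGENRE:
# at each step, only tokens that keep the output on the path of some valid
# entity name are allowed. Token ids here are made up for illustration.

def build_trie(sequences):
    trie = {}
    for seq in sequences:
        node = trie
        for tok in seq:
            node = node.setdefault(tok, {})
        node[None] = {}                 # end-of-entity marker
    return trie

entity_token_ids = [[5, 12, 7], [5, 30], [9, 2, 2, 4]]   # tokenized entity names
trie = build_trie(entity_token_ids)

def allowed_next_tokens(prefix):
    """Tokens that keep `prefix` on a path toward some valid entity."""
    node = trie
    for tok in prefix:
        if tok not in node:
            return []
        node = node[tok]
    return [t for t in node if t is not None]

print(allowed_next_tokens([5]))   # -> [12, 30]
# With Hugging Face transformers this can be plugged into generation, e.g.
# model.generate(..., prefix_allowed_tokens_fn=lambda i, ids: allowed_next_tokens(ids.tolist()))
# (stripping the prompt tokens from `ids` first).
```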
Moreover, non-autoregressive generation/multi-stage generation remains a common approach (I have made similar attempts in my previous work “Planning and Generating Natural and Diverse Disfluent Texts as Augmentation for Disfluency Detection”).
3.2 Prompt
Prompt-based methods have become a primary way to leverage large models (Tsinghua’s OpenPrompt won the Best Demo award). Beyond plain prompts, in-context learning (providing few-shot input-output examples, as with GPT-3), using generated explanations to help the model, and including instructions in the prompt have all become common ways to further improve results. Some interesting papers mentioned in the conference/tutorials/talks include:
- “Noisy Channel Language Model Prompting for Few-Shot Text Classification”
- “Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?”
- “Can Explanations Be Useful for Calibrating Black Box Models?”
- “The Unreliability of Explanations in Few-Shot In-Context Learning”
- “Cross-Task Generalization via Natural Language Crowdsourcing Instructions”
- “Maieutic Prompting: Logically Consistent Reasoning with Recursive Explanations”
3.3 Efficient Models
Designing more efficient models (model compression, quantization, adapters, etc.) remains a hot topic, for example (see the generic pruning sketch below):
- “Structured Pruning Learns Compact and Accurate Models”
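For flavor only: the snippet below applies plain magnitude pruning with PyTorch’s built-in pruning utilities. This is not the method of the paper above (which learns structured pruning masks during training); it is just the simplest baseline of the same family:

```python
# Plain magnitude pruning with PyTorch's built-in utilities: zero out the 30%
# smallest-magnitude weights in each linear layer of a toy model. This is NOT
# the structured, learned pruning of the paper above.
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 2))
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")   # bake the mask into the weights

sparsity = (model[0].weight == 0).float().mean().item()
print(f"layer-0 weight sparsity: {sparsity:.2f}")   # ~0.30
```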
3.4 Language Models as KG
Viewing large models as knowledge bases can help us generate knowledge that aids in solving tasks:
- “Generated Knowledge Prompting for Commonsense Reasoning”
3.5 Language Models to Generate Data
The strong generative and zero/few-shot abilities of large models can be used to generate labeled data and augment existing data, e.g., “Generating Data to Mitigate Spurious Correlations in Natural Language Inference Datasets”. Our EMNLP 2020 work “Planning and Generating Natural and Diverse Disfluent Texts as Augmentation for Disfluency Detection” can also be considered one of the earliest works to use a pre-trained model (GPT-2) to generate augmented data; at the time, BART and T5 had just been released and GPT-3 was not yet available.
3.6 Zero/Few-shot Learning & Learning with Limited Data
In the era of large models, we can learn better in limited-data and zero/few-shot settings. Two extremely popular tutorials are excellent starting points for the related work:
- “Learning with Limited Text Data”: my boss Diyi Yang covered data augmentation (see also our recent work “SUBS: Subtree Substitution for Compositional Semantic Parsing”), Colin Raffel presented a unified framework for understanding various semi-supervised learning methods, and Ankur Parikh approached the topic from a multilingual perspective (thanks to Ankur for internally reviewing our ACL TableFormer paper at Google; we mention him in the acknowledgments).
- “Zero- and Few-Shot NLP with Pretrained Language Models”: covered prompting/in-context learning, instructions/task descriptions, adapters, meta-training, evaluation, and pre-training.
4
What Large Models Cannot Achieve
Meanwhile, the academic community focuses more on the problems that large models cannot solve, whether because of the nature of the model or the problem itself, or because of inherent flaws in the pre-training framework.
4.1 Ambiguity
Yejin Choi devoted part of her keynote to ambiguity, which accounts for a large share of these problems. She argued that ambiguity is an inherent property of natural language and that natural language understanding is not a strict classification problem (“language understanding is not categorization”); we should embrace the ambiguity that is everywhere. Even in POS tagging, the most basic NLP task, the definitions of POS tags change over time; given different contexts, the NLI relation between two sentences may shift from entailment to contradiction (“Partial-input baselines show that NLI models can ignore context, but they don’t.”).
Sentiment classification initially had only positive and negative labels, later introducing a neutral label; due to individual differences among annotators, human annotations inevitably contain ambiguity and bias (“Annotators with Attitudes: How Annotator Beliefs And Identities Bias Toxic Language Detection”); there are also datasets like AmbigQA and SituatedQA in automatic question answering (Eunsol Choi emphasized in the rising star talk that the answers to the same question may vary with temporal and geographical contexts).
In nonmonotonic reasoning, newly introduced knowledge can overturn earlier inferences and conclusions. Temporal modeling itself has also become a hot area recently (e.g., temporal knowledge graph completion and the modeling of temporal/event data).
Furthermore, there is a lot of interesting work on how models handle ambiguous data and how ambiguous data can be leveraged to improve models. In her rising star talk, Swabha Swayamdipta highlighted work on discovering ambiguity via training dynamics and on generating ambiguous data to improve (OOD) generalization (“WANLI: Worker and AI Collaboration for Natural Language Inference Dataset Creation”).
4.2 Reasoning / Logic / Structure
In the “next big ideas” talks, the inherent limitations of large models with respect to logic, reasoning, and structure were emphasized once again. Heng Ji stressed the importance of structure for multilingual transfer (for example, our previous work “Frustratingly Simple but Surprisingly Strong: Using Language-Independent Features for Zero-shot Cross-lingual Semantic Parsing”), long text understanding, and multimodal generalization.
Dan Roth argued that the decision process of decomposing, recomposing, and planning over knowledge is the key to reasoning (e.g., temporal/numerical reasoning), and that leveraging various incidental supervision signals (such as comparable texts or language-world mappings) is one way to learn this decision process. It feels somewhat similar to Zhiting Hu’s Panoramic Learning, i.e., training AI agents with ALL types of experiences, haha.
The debate over whether symbolic reasoning is still necessary continues; on one hand, Hang Li and others emphasize the importance of logic (combining Neural Prediction and Symbolic Prediction using a method similar to MoE). On the other hand, Yejin Choi, in the continuum part of her keynote, stated that with the success of large models, “language, knowledge, reasoning” should be integrated in the era of large models, and we have previously overemphasized the role of form and logic (“Reasoning is intuitive inference where logic plays a marginal role”); using formal language and logic to cover all variations in natural language is impossible.
4.3 Out-of-distribution (OOD) Generalization & Robustness
The ability of large models to generalize on out-of-distribution data remains one of the most critical issues in their practical applications.
In language, the focus on compositionality has significantly increased. During a conversation with Luke, he mentioned that their recent code pretrained model showed significant improvements in compositional generalization as the model size increased. In discussions with Jacob Andreas, he emphasized the role of data in compositionality (including data augmentation and leveraging large models to generate data). Sash (Alexander Rush) seems to have recently shown great interest in compositionality; unfortunately, I haven’t found an opportunity to chat with him.
Additionally, incremental prompting with large models has become a relatively popular way to improve compositional generalization. More details can be found in my article on compositionality and our two NAACL papers: “Can we achieve general artificial intelligence within a decade? First, let’s clarify the research on compositional generalization!”
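As a schematic of what incremental prompting looks like (in the least-to-most style; the decomposition and answering prompts, and the llm stub, are hypothetical placeholders rather than any paper’s exact procedure):

```python
# Schematic incremental (least-to-most style) prompting. `llm` is a hypothetical
# stand-in for any large-model completion API; the prompts are illustrative only.

def llm(prompt):
    raise NotImplementedError("plug in your favorite language model API here")

def incremental_answer(question):
    subquestions = llm(
        f"Decompose the question into simpler subquestions, one per line:\n{question}"
    ).splitlines()
    context = ""
    for sub in subquestions:                   # answer the easy pieces first,
        ans = llm(f"{context}Q: {sub}\nA:")    # feeding earlier answers back in
        context += f"Q: {sub}\nA: {ans}\n"
    return llm(f"{context}Q: {question}\nA:")  # final, compositional answer
```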
Regarding robustness, using out-of-distribution/perturbed data to attack models to test or enhance them continues to yield new papers (such as our “TableFormer: Robust Transformer Modeling for Table-Text Encoding”).
4.4 Long Document Understanding / Generation
This includes the understanding and generation of corpus/discourse/story/screenplay/long dialogue/movie/TV series, etc.
Understanding and generating long texts remains one of the biggest challenges for large models. One solution is to increase the sequence length the model can encode and make self-attention more efficient; another is to retrieve the important short passages before encoding; yet another is multi-level encoding or decoding guided by structure. In the “next big ideas” talks, Heng Ji emphasized the importance of corpus-level IE, while Mirella Lapata highlighted the significance of stories.
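A minimal sketch of the multi-level encoding idea mentioned above (shapes and layer sizes are illustrative; no specific paper’s architecture is implied): encode fixed-size chunks independently with a short-context encoder, then let a second encoder attend over the chunk representations.

```python
# Minimal two-level encoder for long documents: a short-context encoder over
# fixed-size chunks, then a second encoder attending across chunk vectors.
# Sizes are arbitrary; no specific paper's architecture is implied.
import torch
import torch.nn as nn

d, chunk_len, n_chunks = 256, 128, 16
token_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True), num_layers=2
)
chunk_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True), num_layers=2
)

doc = torch.randn(1, n_chunks * chunk_len, d)        # embeddings of a long document
chunks = doc.view(n_chunks, chunk_len, d)            # split into short chunks
chunk_vecs = token_encoder(chunks).mean(dim=1)       # one vector per chunk
doc_repr = chunk_encoder(chunk_vecs.unsqueeze(0))    # cross-chunk attention
print(doc_repr.shape)                                # torch.Size([1, 16, 256])
```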
4.5 Knowledge
Regarding knowledge graphs (KGs) in the era of large models, Heng Ji mentioned several possible uses: 1) feeding them into pre-trained LMs; 2) GNNs; 3) structural constraints during inference; 4) structure alignment via weak supervision and self-supervised learning.
Large models themselves can also serve as knowledge bases (generating knowledge) or assist in the construction of KGs, as Yejin Choi has a series of works on constructing and utilizing commonsense KGs.
Semi-parametric methods have also become mainstream, and retrieval-augmented methods have been widely applied in understanding and generation tasks. There are still many interesting works emerging in this area, such as “Training Language Models with Memory Augmentation”.
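For readers unfamiliar with the pattern, here is a bare-bones sketch of retrieval augmentation: embed the query, pull the nearest passages from an external memory, and condition the model on them. The embed function below is a random-projection placeholder standing in for a real encoder, and the assembled prompt would be passed to a language model:

```python
# Bare-bones retrieval augmentation: embed the query, pull nearest passages
# from an external memory, and build a prompt conditioned on them. `embed` is
# a random placeholder standing in for a real encoder.
import numpy as np

memory = [
    "Passage about topic A ...",
    "Passage about topic B ...",
    "Passage about topic C ...",
]

def embed(text):
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(64)

index = np.stack([embed(p) for p in memory])          # (num_passages, dim)

def retrieve(query, k=2):
    q = embed(query)
    scores = index @ q / (np.linalg.norm(index, axis=1) * np.linalg.norm(q))
    return [memory[i] for i in np.argsort(-scores)[:k]]

def build_prompt(query):
    context = "\n".join(retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"  # fed to the LM

print(build_prompt("What is topic B?"))
```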
Additionally, the “Semiparametric Methods in NLP: Decoupling Logic from Knowledge” workshop was one of my favorites, covering most of the related directions; the DeepMind work mentioned there on using retrieval for protein structure prediction truly amazed me, having been away from biology for a long time.
4.6 Problem Definition / Dataset Creation / Evaluation
Eduard Hovy said in the big ideas talk that we should think about the problems themselves, identify the wrong/worst-case/never-seen cases, understand “why things go wrong”, and then look for solutions. This is also an approach I have always believed in for research and engineering: conduct thorough error analysis to identify the issues and then address them.
On the other hand, what is most important in NLP should not be the model itself; humans should engage proactively to better define problems, construct datasets, and conduct better evaluations (evaluation remains a major challenge in generation).
5
The Purpose of Large LMs: To Help People Instead of Replacing Them
5.1 Interactive Learning / Human-in-the-loop / Human-AI Collaboration
Eduard Hovy mentioned in the big ideas talk that besides the relatively objective knowledge in LMs or on the web (commonsense knowledge about schemas mined from the web/LMs), human and social knowledge is also extremely important (commonsense knowledge about people and people in groups: roles).
Moreover, humans should guide models toward desired goals, which I believe is part of why interactive learning and human-in-the-loop learning are popular research topics. Interesting works include Eunsol Choi’s “Simulating Bandit Learning from User Feedback for Extractive Question Answering” and the paper Yejin mentioned, “Reframing Human-AI Collaboration for Generating Free-Text Explanations”.
5.2 SocialNLP
My boss Diyi Yang’s rising star talk detailed how human and social factors should play a greater role in NLP (I’m glad to witness my lifetime achievement award-winning boss Bonnie hosting the Rising Star talk). Additionally, Diyi’s outstanding paper “Inducing Positive Perspectives with Text Reframing” defines the socially impactful issue of “positive reframing”, and I’m happy to have made a small contribution to this work.
5.3 Complex Tasks
As the capabilities of large models continue to grow, they may be able to tackle more complex tasks that humans care about, such as story understanding and storytelling that Mirella Lapata mentioned. I particularly like her point that “stories make us human”.
5.4 Security/Privacy
The security issues of large models remain a focus, with a workshop on “Federated Learning for Natural Language Processing” at this ACL. Privacy-related articles continue to be worth attention, such as “Are Large Pre-Trained Language Models Leaking Your Personal Information?”.
5.5 Personalization
Personalization has garnered significant attention in both industry (search, recommendations, advertising) and academia. I was quite surprised when Jason Eisner mentioned in our conversation that he is also very interested in personalization and looks forward to collaborating with the industry.
6
Conclusion
The onsite conference experience was quite good; I was most happy to have ample time for face-to-face communication with many big names, and I learned a lot from the papers/talks/tutorials.