Reported by New Intelligence
Source: Mihail Eric
Editors: Yuanzi, Daming
[New Intelligence Guide] An Alexa AI machine learning scientist reviews ACL 2019, distilling the current trends in NLP across several fronts, including reducing bias, practical applications, and the ability to integrate knowledge into models.
This week at ACL 2019 in Florence, one attendee came away with a great deal to reflect on: Mihail Eric, a machine learning scientist at Alexa AI.
His work at Alexa centers on natural language understanding and dialogue. The conference gathered some of the best NLP researchers from around the world, and the caliber of the work on display was self-evident.
Attending a conference like this feels like wading through a torrent of knowledge: you are surrounded by papers, talks, ideas, and exceptionally talented people, and lingering too long in one place inevitably means missing something happening elsewhere.
Fortunately, Eric excels at distilling. He organized his observations from a week at ACL 2019, giving us a more direct view of the dynamics and trends in NLP in 2019, as well as where the field is heading.
ACL Chair Zhou Ming pointed out in his opening speech that this year’s ACL is the largest in history, with over 2,900 papers submitted, a 75% increase over 2018! The field of natural language processing is in high demand, with academic and industry enthusiasm reaching unprecedented heights.

However, globally, the development of NLP research is extremely unbalanced, with almost all breakthrough achievements concentrated in the United States and China, which are far ahead of other countries and regions in research levels. This situation poses a risk of regional bias and a lack of diverse perspectives.
Zhou Ming, based on his experience in the NLP community in the Asia-Pacific region, pointed out a possible solution: to hold more academic conferences and activities in underrepresented regions to stimulate local NLP research enthusiasm. There are already relevant examples, such as the Deep Learning Indaba event held in Africa.
In addition to regional bias, there is also gender bias. Some papers have empirically highlighted these facts. For example, Stanovsky et al. demonstrated that four industrial machine translation systems and two current state-of-the-art (SOTA) models are very prone to gender-biased translation errors.
The NLP community is well aware of this issue. Many interesting works have been proposed to try to address the aforementioned translation problems, such as Kaneko et al. developing a method to remove biased vocabulary embeddings while retaining non-discriminatory gender-related information.
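For readers who want a concrete sense of how such embedding bias is typically measured and removed, here is a minimal Python sketch. It is not Kaneko et al.'s method: the toy embeddings, the definitional word pairs, and the simple "hard" projection step are all illustrative assumptions.

```python
# A minimal sketch (not Kaneko et al.'s method) of how gender bias in word
# embeddings is commonly quantified: project a word vector onto a gender
# direction estimated from definitional pairs such as ("he", "she").
import numpy as np

def gender_direction(emb, pairs=(("he", "she"), ("man", "woman"))):
    # Average the difference vectors of the definitional pairs to get an
    # approximate "gender direction" in embedding space.
    direction = np.mean([emb[a] - emb[b] for a, b in pairs], axis=0)
    return direction / np.linalg.norm(direction)

def gender_bias(emb, word, direction):
    v = emb[word] / np.linalg.norm(emb[word])
    # Cosine projection onto the gender direction: sign indicates which
    # gender the word leans toward, magnitude indicates how strongly.
    return float(v @ direction)

def hard_debias(emb, word, direction):
    # The simplest form of debiasing removes the gender component entirely;
    # Kaneko et al. instead try to keep non-discriminatory gender information.
    v = emb[word]
    return v - (v @ direction) * direction

# Toy 3-d embeddings purely for illustration (random, so values are meaningless).
emb = {w: np.random.default_rng(i).normal(size=3)
       for i, w in enumerate(["he", "she", "man", "woman", "doctor"])}
d = gender_direction(emb)
print(gender_bias(emb, "doctor", d))
print(gender_bias({**emb, "doctor": hard_debias(emb, "doctor", d)}, "doctor", d))
```

After the hard projection the bias score drops to (numerically) zero; the research challenge is removing only the discriminatory component rather than all gender-related information.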
This year's ACL also hosted, for the first time, a workshop dedicated to gender bias in natural language processing, broadening the scope of the conference's workshops to bring together researchers working on these issues, raise awareness, and foster productive discussion.
Of course, there is still much work to be done in the NLP community, but it is encouraging to see the community taking proactive measures to mitigate bias issues.
The current state of NLP research is exciting.
The field has reached a point where techniques and applications are converging: the models and tools being developed can be deployed in many scenarios to solve real problems, and the wide range of NLP applications showcased at the conference made this clear.
In an era when neural fake news has become a serious problem, verifying the authenticity of content is increasingly important. Hengli Hu's work built a system that uses acoustic and linguistic features to detect concealed information in text and speech, outperforming humans by 15%!
In the health sector, Shardlow et al. developed a neural network model to make clinical information written by doctors more readable for patients through a domain-specific phrase list. In related research, Du et al. proposed tasks for extracting symptoms from clinical dialogues and baseline models to reduce the time primary care physicians spend interacting with clinical literature systems.
This year's ACL also included a workshop dedicated to applying NLP to biological problems. Fauqueur et al. proposed techniques for extracting new facts from biomedical literature without training data or handcrafted rules. Rajagopal and Vyas et al. adapted semantic role labeling systems to biological processes by pre-training an LSTM-CRF model on large datasets and then fine-tuning it on a low-resource corpus, achieving a 21 F1 point improvement on a standard dataset!
Other cool applications of NLP include research by Zhang et al., which proposed the problem of email subject line generation (similar to Gmail’s smart reply but for generating email subjects), which has shown promising results in both automatic and human evaluations.
Just as neural networks suddenly revolutionized the field of computer vision in 2011, the story of deep learning in natural language processing is also one of “explosive and rapid growth”.
From 2015 to 2017, most NLP tasks could be tackled with a relatively simple formula: embed the text input as continuous vector representations, encode those representations, attend over the encoded representations, and then predict for the task. Matthew Honnibal described this formalism well in an article.
Although conceptually simple, the embed, encode, attend, predict formula once achieved SOTA results on almost every kind of NLP task, including machine translation, question answering, and natural language inference.
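As a concrete illustration of that recipe, here is a minimal PyTorch sketch; the layer sizes, the BiLSTM encoder, and the simple attention pooling are illustrative assumptions rather than any particular paper's architecture.

```python
# A minimal sketch of the "embed, encode, attend, predict" recipe described
# by Matthew Honnibal; all sizes and the pooling choice are assumptions.
import torch
import torch.nn as nn

class EmbedEncodeAttendPredict(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hidden=256, num_classes=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)           # embed
        self.encode = nn.LSTM(emb_dim, hidden, batch_first=True,
                              bidirectional=True)                # encode
        self.attn = nn.Linear(2 * hidden, 1)                     # attend
        self.predict = nn.Linear(2 * hidden, num_classes)        # predict

    def forward(self, token_ids):
        x = self.embed(token_ids)                     # (B, T, E)
        h, _ = self.encode(x)                         # (B, T, 2H)
        weights = torch.softmax(self.attn(h), dim=1)  # attention over time steps
        pooled = (weights * h).sum(dim=1)             # weighted sum -> (B, 2H)
        return self.predict(pooled)                   # task logits

# Toy usage: a batch of 4 random token sequences of length 20.
logits = EmbedEncodeAttendPredict(vocab_size=10000)(
    torch.randint(0, 10000, (4, 20)))
```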
Today, powerful pretrained representations trained with some flavor of language-modeling objective, such as ELMo, OpenAI GPT, and BERT, have become standard: these models are pretrained at massive scale and then fine-tuned on smaller domain-specific corpora. Indeed, this strategy has achieved striking SOTA results on existing NLP benchmarks.
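The pretrain-then-fine-tune workflow is easy to picture in code. Below is a minimal sketch using the Hugging Face transformers library; the checkpoint name, the toy two-example "domain corpus", and the training-loop details are assumptions made only for illustration.

```python
# A minimal sketch of the pretrain-then-fine-tune paradigm with the
# Hugging Face transformers library; model name, labels, and loop are toy.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)    # pretrained encoder + new task head

texts = ["the movie was great", "the plot made no sense"]   # toy domain data
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):                         # a few fine-tuning steps
    out = model(**batch, labels=labels)    # loss computed against task labels
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

The key design choice is that only a small task-specific head is new; the bulk of the parameters come from large-scale pretraining and are merely nudged by fine-tuning.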
Dai and Yang et al. sought to push transformer-based language models even further, greatly improving speed and achieving SOTA perplexity numbers. Another work very representative of this new paradigm is Liu and He et al., who used BERT-based architectures to top the GLUE benchmark leaderboard (at the time of submission).
Beyond these works themselves, a recurring question in conversations around the conference was whether many architectures would gain a few percentage points simply by swapping in something like BERT. The question then becomes: does this new paradigm render many of the modeling innovations in NLP insignificant?
Eric's personal answer is no. Overall, there is still plenty of unexplored work that will be crucial for driving the next iteration of progress in NLP.
While existing pretrained language models are very powerful, the way they are trained from raw text corpora places few constraints on their learning. In other words, what these models learn is largely unconstrained, and their superior performance may simply come from having seen many instances of text sequences in different contexts across huge datasets. Can we broaden the sources of knowledge these models draw on and provide them with more information, extending the capabilities of NLP models beyond this?
ACL featured many papers attempting to address this issue. For instance, researchers have used typed entity embeddings and alignments to an underlying knowledge graph to enhance BERT representations, allowing their models to outperform BERT on entity typing and relation classification. Others tackled the issue with KT-NET, which uses an attention mechanism to fuse selected information from knowledge bases such as WordNet and NELL, setting a new SOTA on SQuAD 1.1.
Another notable paper is by Logan et al., which proposed a knowledge graph language model, a generative architecture that can selectively copy facts from knowledge graphs relevant to the underlying context, outperforming strong baseline language models.
While integrating knowledge into neural models is indeed a challenge, the current results seem promising!
It is well known that neural networks are black box models, making it particularly difficult to truly understand the learned decision functions. Whether the pursuit of complete interpretability for these models is absolutely necessary is debatable, but it can be said that some understanding of the internal structure of models can provide useful information for future architectural designs. Several good papers at ACL provide new insights into understanding existing models.
Serrano et al. showed that while attention weights are often taken to indicate which parts of the input a model relies on, in some cases alternative ranking metrics explain the model's decision process more effectively.
Jawahar et al. explored the linguistic structure learned by BERT, showing that BERT's layers capture rich linguistic information: surface features in the lower layers, syntactic features in the middle layers, and semantic features in the upper layers. The authors further suggest that deeper layers are needed to learn long-distance dependency information.
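A common way to probe what each layer has learned, in the spirit of Jawahar et al.'s analysis, is to fit a simple classifier on the representations from every layer and compare accuracies. The sketch below shows the idea; the toy sentences, labels, and the choice of a logistic-regression probe on the [CLS] vector are assumptions, not the paper's exact setup.

```python
# A minimal layer-wise probing sketch: extract each BERT layer's [CLS]
# representation and fit a simple linear probe per layer (toy data only).
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

sentences = ["the cat sat on the mat", "colorless green ideas sleep furiously"]
labels = [0, 1]                             # toy labels for the probing task

with torch.no_grad():
    batch = tokenizer(sentences, padding=True, return_tensors="pt")
    hidden_states = model(**batch).hidden_states   # embeddings + 12 layers

for layer, h in enumerate(hidden_states):
    cls = h[:, 0, :].numpy()                # [CLS] vector at this layer
    probe = LogisticRegression(max_iter=1000).fit(cls, labels)
    print(f"layer {layer}: probe accuracy {probe.score(cls, labels):.2f}")
```

With a real probing dataset, the layer at which probe accuracy peaks hints at where a given kind of linguistic information is encoded.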
Other papers also address model interpretability. Gehrmann et al. developed a tool for detecting neural fake text by visualizing how likely each token is under a language model's predicted distribution, improving human detection accuracy by nearly 20%. Sydorova et al. studied post-hoc interpretability methods such as LIME on QA systems, demonstrating that certain techniques can help humans identify the best-performing model among several QA models.
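The intuition behind Gehrmann et al.'s tool can be sketched in a few lines: for each token, ask how highly a language model ranks it, since machine-generated text tends to consist of consistently high-ranked, unsurprising tokens. Using GPT-2 as the scoring model here is an assumption made for illustration.

```python
# A minimal sketch of the rank-visualization idea behind detecting
# machine-generated text: low ranks everywhere suggest model-written text.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2")

def token_ranks(text):
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = lm(ids).logits                    # (1, T, vocab)
    ranks = []
    for t in range(1, ids.shape[1]):               # rank of each actual token
        order = torch.argsort(logits[0, t - 1], descending=True)
        ranks.append((order == ids[0, t]).nonzero().item())
    return ranks

print(token_ranks("The quick brown fox jumps over the lazy dog."))
```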
Evaluating natural language generation remains a highly contested topic, which makes this year's rethinking of evaluation all the more welcome.
Maxime Peyrard demonstrated that certain automatic summarization metrics are inconsistent with one another when scoring summaries in particular scoring ranges. Clark et al. proposed a new evaluation metric for generated text based on sentence mover's similarity, which correlates more strongly with human judgments than standard ROUGE.
Text generated by models often suffers from factual errors and false statements. Falke et al. studied whether natural language inference systems could be used to rerank generated outputs as a way to address this problem. They found that off-the-shelf NLI systems do not yet transfer well to this downstream task, and they provided tools needed to bring such inference systems up to the required performance.
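The reranking idea itself is straightforward to sketch: score every candidate output by how strongly the source text entails it under an NLI model, then keep the best-supported candidate. The checkpoint name and the entailment label index below are assumptions about one off-the-shelf NLI model, not Falke et al.'s exact setup.

```python
# A minimal sketch of NLI-based reranking: prefer the candidate that the
# source document most strongly entails (label index is an assumption).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("roberta-large-mnli")
nli = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")
ENTAILMENT = 2   # assumed index of the "entailment" class for this checkpoint

def rerank(source, candidates):
    scores = []
    for cand in candidates:
        batch = tokenizer(source, cand, truncation=True, return_tensors="pt")
        with torch.no_grad():
            probs = torch.softmax(nli(**batch).logits, dim=-1)
        scores.append(probs[0, ENTAILMENT].item())   # entailment probability
    return max(zip(scores, candidates))              # best-supported candidate

best = rerank("The company reported record profits in 2018.",
              ["Profits hit a record in 2018.", "The company went bankrupt."])
print(best)
```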
Maxime Peyrard also conducted more foundational research work, theoretically rigorously defining concepts such as redundancy, relevance, and informativeness.
Besides evaluation-related work, the Sankar team questioned the hypothesis that traditional recurrent networks and transformer-based seq2seq dialogue models can learn from dialogue history. Specifically, they showed that these models are not very sensitive to certain perturbations applied to context, challenging the effectiveness of dialogue natural language generators.
We often use benchmarks to measure task performance and performance improvements, and many of these models have approached or exceeded human performance on existing NLP benchmarks. So what do we do next?
This is the question posed by the Zellers team. In early research, they introduced a challenge dataset for commonsense NLP problems, but shortly after its release, it was found that BERT had achieved near-human performance. To address this, the authors proposed a follow-up dataset developed using adversarial filtering techniques to select examples that are difficult for BERT and other models to answer. In the process, they significantly increased the complexity of the benchmark.
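Adversarial filtering can be sketched as a simple loop: repeatedly train a discriminator on part of the candidate pool and keep only the examples it gets wrong, so machine-guessable items are filtered out. The random features and logistic-regression adversary below are stand-ins; Zellers et al. use strong pretrained models as the adversary.

```python
# A minimal sketch of the adversarial-filtering idea: keep the examples a
# trained adversary fails on, dropping easy (machine-guessable) ones.
import numpy as np
from sklearn.linear_model import LogisticRegression

def adversarial_filter(features, labels, rounds=3, seed=0):
    rng = np.random.default_rng(seed)
    keep = np.arange(len(labels))
    for _ in range(rounds):
        idx = rng.permutation(keep)
        train, test = idx[: len(idx) // 2], idx[len(idx) // 2 :]
        clf = LogisticRegression(max_iter=1000).fit(features[train], labels[train])
        preds = clf.predict(features[test])
        hard = test[preds != labels[test]]        # examples the adversary got wrong
        keep = np.concatenate([train, hard])      # drop the easy part of `test`
    return np.unique(keep)

# Toy usage: 200 random "examples" with 8 features and binary labels.
X = np.random.default_rng(1).normal(size=(200, 8))
y = np.random.default_rng(2).integers(0, 2, size=200)
surviving = adversarial_filter(X, y)
print(len(surviving), "examples survive filtering")
```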
BERT is certainly not perfect, however. Nangia et al. showed that BERT-based models struggle on low-resource sentence classification tasks, and proposed a follow-up natural language understanding benchmark, called SuperGLUE, specifically designed to evaluate such tasks.
Another study, by McCoy et al., showed that BERT models for natural language inference learn very simple syntactic heuristics that do not generalize to other instances of the task. They also released an evaluation set for determining whether models rely on these heuristics or actually solve the more general reasoning problem.
In summary, Eric's feeling is that most current models still solve datasets rather than the underlying tasks. The models we build are remarkably good at picking up and exploiting dataset-specific biases, and in the process the evaluation metrics we devise paint a rather misleading picture. This recalls Goodhart's Law: when a measure becomes a target, it ceases to be a good measure. So how do we move forward?
Given that these benchmarks are proxies for natural language tasks, and given how quickly models are improving, it seems unreasonable to keep benchmarks static. Instead, a particularly promising path, in Eric's view, is to develop a suite of evolving benchmarks of increasing difficulty, each one pushing natural language capabilities a bit further. Perhaps only in the limit of that progression can machines approach human-level performance on language.
From the papers presented at this ACL, the NLP field is thriving! The community is in a very exciting period with many promising research directions. Despite substantial progress made in the NLP field over the past year, many prominent challenges and unresolved issues remain to be addressed.
