Gzip + kNN Text Classification Beats Transformer with 14 Lines of Code

A few days ago, the ACL 2023 awards were announced, attracting significant attention.
Among the many accepted papers, one titled “Low-Resource Text Classification: A Parameter-Free Classification Method with Compressors” has generated considerable discussion. The paper is a joint effort by the University of Waterloo and AFAIK, yet it is neither an award winner nor even a main conference paper.

Paper Link:

https://aclanthology.org/2023.findings-acl.426.pdf

Code Link:

https://github.com/bazingagin/npc_gzip

Assistant Professor William Wang from UCSB described this paper as more innovative than 95% of the main conference papers at ACL, and said he does not understand why it was accepted only as a Findings paper.
Some netizens called it the most ridiculous result they have seen this year.
What is it about this paper that has earned such praise? Let’s take a look at the details.
The paper focuses on the task of text classification. As one of the most fundamental tasks in Natural Language Processing (NLP), text classification has improved substantially with the help of neural networks. However, most neural networks are data-hungry, and their data requirements grow with the number of model parameters.
Among these models, Deep Neural Networks (DNNs) are often used for text classification because of their high accuracy. But DNNs are computationally intensive: using and optimizing them in practice, and transferring them to out-of-distribution (OOD) data, is very costly.
Researchers have therefore begun to look for lightweight alternatives to DNNs, and using compressors for text classification has started to attract attention. Several groups have entered this area, but existing compressor-based methods still cannot match the performance of neural networks.
To close this gap, the paper proposes a text classification method that combines a lossless compressor (such as gzip) with a k-nearest-neighbor (kNN) classifier.
The method has no trainable parameters. Experiments on seven in-distribution datasets and five out-of-distribution datasets show that, with a compressor as simple as gzip, it achieves results comparable to DNNs on six of the seven in-distribution datasets and outperforms all methods, including BERT, on the five out-of-distribution datasets. In few-shot settings, it outperforms all models by a large margin.
Netizens were likewise surprised that gzip + kNN could outperform BERT and other neural methods on text classification. Even more surprisingly, the algorithm involves no training, no tuning, and no parameters: the entire method is just 14 lines of code.

Method Overview

The method in this paper includes a lossless compressor, a distance metric based on the compressor, and a K-Nearest Neighbor classifier. The lossless compressor aims to represent information using as few bits as possible by assigning shorter codes to higher probability symbols.
The idea of using compressors for classification is based on two intuitive insights: 1) compressors are good at capturing patterns, and 2) objects of the same category have stronger regularities than those of different categories.
For example, suppose x_1 and x_2 belong to the same category while x_3 belongs to a different one. Denoting by C(·) the compressed length, we find that C(x_1x_2) − C(x_1) < C(x_1x_3) − C(x_1), where C(x_1x_2) is the compressed length of the concatenation of x_1 and x_2. A quick way to check this in practice is sketched below.
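As a quick illustration (the sentences here are made up for this post, not taken from the paper), gzip’s compressed lengths already behave this way:

```python
import gzip

def C(s: str) -> int:
    """Compressed length of a string under gzip, in bytes."""
    return len(gzip.compress(s.encode()))

# x1 and x2 talk about the same thing; x3 is about something else.
x1 = "the cat sat on the mat and watched the other cat"
x2 = "a cat lay on the mat and looked at the other cat"
x3 = "quarterly revenue fell as interest rates kept rising"

# Extra bytes needed to compress x2 (resp. x3) together with x1.
# Because x2 reuses long substrings of x1, its extra cost is
# typically much smaller than x3's.
print(C(x1 + " " + x2) - C(x1))
print(C(x1 + " " + x3) - C(x1))
```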
This intuition can be formalized into a distance metric derived from Kolmogorov complexity. To measure the information shared by two objects, Bennett et al., in their 1998 paper “Information Distance”, defined the information distance E(x, y) as the length of the shortest binary program that transforms x into y:
E(x, y) = max{K(x|y), K(y|x)}, where K(·|·) denotes conditional Kolmogorov complexity.
Kolmogorov complexity is incomputable, and so is E(x, y). Li et al. therefore proposed a normalized, computable version of information distance in their 2004 paper “The Similarity Metric”: the Normalized Compression Distance (NCD), which approximates the Kolmogorov complexity K(x) with the compressed length C(x). It is defined as follows:
NCD(x, y) = (C(xy) − min{C(x), C(y)}) / max{C(x), C(y)}
The rationale for using compressed length is that the length of x after maximal compression approaches K(x): in general, the higher the compression ratio, the closer C(x) is to K(x).
The researchers note that since the main experiments use gzip as the compressor, C(x) here is the length of x after gzip compression, and C(xy) is the compressed length of the concatenation of x and y. With the distance matrix given by NCD, a k-nearest-neighbor classifier can then be used for classification.
The method can be implemented in the 14 lines of Python shown below. The inputs are training_set and test_set, both arrays of (text, label) pairs, plus the neighbor count k.
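The original figure is a screenshot of that code; the sketch below is a reconstruction of the same routine (compute the NCD between each test text and every training text, then take a majority vote over the k nearest neighbors), with variable names and the function wrapper chosen for readability rather than copied from the paper:

```python
import gzip
import numpy as np

def compressed_len(s: str) -> int:
    """gzip-compressed length of a string, used as a stand-in for K(x)."""
    return len(gzip.compress(s.encode()))

def classify(training_set, test_set, k):
    """Predict a label for each test text via NCD distances and a kNN vote."""
    predictions = []
    # Cache the compressed length of each training text.
    train_lens = [compressed_len(x2) for x2, _ in training_set]
    for x1, _ in test_set:
        Cx1 = compressed_len(x1)
        distances = []
        for (x2, _), Cx2 in zip(training_set, train_lens):
            Cx1x2 = compressed_len(" ".join([x1, x2]))
            ncd = (Cx1x2 - min(Cx1, Cx2)) / max(Cx1, Cx2)
            distances.append(ncd)
        top_k_idx = np.argsort(distances)[:k]
        top_k_labels = [training_set[i][1] for i in top_k_idx]
        # Majority vote among the k nearest training texts.
        predictions.append(max(set(top_k_labels), key=top_k_labels.count))
    return predictions
```

The paper’s own snippet folds the same logic into 14 lines without a function wrapper; caching train_lens above is only a small efficiency touch added here.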
This makes the method a simple, lightweight, and general alternative to DNNs. It is simple in that it requires no pre-training or training; lightweight in that it classifies without parameters or GPU resources; and general in that the compressor is independent of the data type and the non-parametric approach introduces no prior assumptions.

Experimental Results

In the experimental section, the researchers selected various datasets to study the impact of training sample size, number of categories, text length, and distribution differences on accuracy. The details of each dataset are shown in Table 1.
The researchers compared their method with 1) neural network methods that require training and 2) non-parametric methods that use a kNN classifier directly, with or without pre-training on external data. They also compared it with three other non-parametric baselines, namely word2vec, pre-trained Sentence-BERT (SentBERT), and text length (TextLength), all of which use a kNN classifier.
Results on In-Distribution Datasets
The researchers trained all baseline methods on the seven in-distribution datasets, and the results in Table 3 show that gzip performs very well on AG News, R8, and R52. Fine-tuned BERT achieved the best performance of all methods on AG News, while gzip, without any pre-training, achieved competitive results, only 0.007 lower than BERT.
On R8 and R52, the only non-pre-trained neural network that outperformed gzip was HAN. On the YahooAnswers dataset, gzip’s accuracy was about 7% lower than that of general neural methods. This may be due to the larger vocabulary on that dataset, making it difficult for the compressor to compress.
Thus, it can be seen that gzip performs poorly on very large datasets (like YahooAnswers) but is very competitive on medium to small datasets.
The researchers listed the average test accuracy of all baseline models (excluding TextLength, which had very low accuracy) in Table 4. The results showed that gzip was above or close to the average on all datasets except YahooAnswers.
Results on Out-Of-Distribution (OOD) Datasets
In Table 5, without any pre-training or fine-tuning, gzip outperformed BERT and mBERT on all datasets. The results indicate that gzip outperformed both pre-trained and non-pre-trained deep learning methods on OOD datasets, demonstrating the method’s generalizability across dataset distributions.
Few-Shot Learning
The researchers also compared gzip with deep learning methods under few-shot settings, conducting experiments on AG News, DBpedia, and SogouNews with non-pre-trained and pre-trained deep neural networks.
The results are shown in Figure 2: gzip outperformed the non-pre-trained models in the 5-, 10-, and 50-shot settings on all three datasets. With as few as n = 5 shots, gzip’s advantage over the non-pre-trained models was especially large; in the 5-shot setting on AG News, its accuracy was 115% higher than fastText’s.
Additionally, in the 100-shot setting, gzip also performed better than non-pre-trained models on AG News and SogouNews, though it slightly underperformed on DBpedia.
The researchers further studied the comparison results between gzip and DNN methods under the 5-shot setting across five OOD datasets. The results showed that gzip significantly outperformed all deep learning methods. In the corresponding datasets, this method increased accuracy over BERT by 91%, 40%, 59%, 58%, and 194%, and over mBERT by 100%, 67%, 40%, 12%, and 130% respectively.

