Paper Introduction
“Low-Resource” Text Classification: A Parameter-Free Classification Method with Compressors
https://aclanthology.org/2023.findings-acl.426/
This paper introduces a new method for text classification that provides a non-parametric alternative to Deep Neural Networks (DNNs). Although DNNs are widely used for their high accuracy, they require large amounts of labeled data and millions of parameters, which makes them computationally expensive, and optimizing them or applying them to out-of-distribution (OOD) settings can be challenging in practice.
To address these issues, the authors propose a lightweight and general method for text classification that combines a simple compressor (e.g., gzip) with a k-nearest neighbors (KNN) classifier. Notably, the method requires no training parameters.
Paper Approach
The paper combines a lossless compressor with a compressor-based distance metric and a k-nearest neighbors (kNN) classifier. The compressor captures the regularity of the text, and the distance metric turns that regularity into similarity scores.
The intuition behind using compressors for classification is:

- Compressors are good at capturing regularity.
- Objects from the same category share more regularity than those from different categories; concatenating a sample with samples from its own category therefore compresses better (yields a shorter compressed length) than concatenating it with samples from other categories.
Using gzip as the compressor, C(x) denotes the length of x after gzip compression, and C(xy) denotes the compressed length of the concatenation of x and y. These quantities define the Normalized Compression Distance (NCD):

NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))

With the pairwise distance matrix provided by NCD, we can use k-nearest neighbors for classification.
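A minimal Python sketch of this procedure, assuming gzip as the compressor and a simple majority vote over the k nearest neighbors (the helper names `ncd` and `classify` are illustrative, not taken from the paper):

```python
import gzip
import numpy as np

def ncd(x: str, y: str) -> float:
    """Normalized Compression Distance computed from gzip-compressed lengths."""
    cx = len(gzip.compress(x.encode("utf-8")))
    cy = len(gzip.compress(y.encode("utf-8")))
    cxy = len(gzip.compress((x + " " + y).encode("utf-8")))
    return (cxy - min(cx, cy)) / max(cx, cy)

def classify(test_text, train_texts, train_labels, k=3):
    """Label test_text by majority vote among its k nearest training samples under NCD."""
    distances = [ncd(test_text, t) for t in train_texts]
    nearest = np.argsort(distances)[:k]
    top_labels = [train_labels[i] for i in nearest]
    return max(set(top_labels), key=top_labels.count)
```

Because gzip operates on raw bytes, the same code works on any language without tokenization, a vocabulary, or any training step.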
Paper Experiments
The paper trains all benchmark models on seven datasets using their complete training sets. The proposed method performs particularly well on AG News, R8, and R52. On AG News, fine-tuned BERT achieved the highest accuracy of all methods, while the paper's method, with no pre-training at all, came within 0.007 of BERT.
On R8 and R52, the only non-pre-trained neural network that outperforms the paper's method is HAN. On YahooAnswers, gzip's accuracy is about 7 points lower than the average neural method, possibly because YahooAnswers has a large vocabulary, which makes the text harder for the compressor to compress effectively.
Paper Results Reproduction
English Tweet Sentiment Classification
In the experiment for English tweet sentiment classification, we used different numbers of training samples and recorded the accuracy and time consumption of both TFIDF + logistic regression (LR) and gzip + k-nearest neighbors (KNN) methods on the test set. The test set contains a total of 1000 samples.
| Training Set Sample Size | 100 | 500 | 1000 | 5000 |
|---|---|---|---|---|
| TFIDF + LR Accuracy | 0.377 | 0.455 | 0.453 | 0.484 |
| TFIDF + LR Time (seconds) | 0.375 | 0.391 | 0.399 | 0.536 |
| gzip + KNN Accuracy | 0.380 | 0.395 | 0.418 | 0.453 |
| gzip + KNN Time (seconds) | 4.23 | 19.00 | 37.50 | 180.00 |
- In terms of time consumption, TFIDF + LR computes quickly, while gzip + KNN takes considerably longer, especially as the number of samples grows.
- gzip + KNN performs relatively well with a small number of training samples, but as the sample size increases, TFIDF + LR catches up and surpasses it.
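For reference, a minimal sketch of the TFIDF + LR baseline used in this comparison, assuming scikit-learn; the dataset variable names are hypothetical placeholders for the tweet data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical variables: train_texts, train_labels, test_texts, test_labels
# hold the sampled training split and the 1000-sample test set.
baseline = make_pipeline(
    TfidfVectorizer(),                  # sparse TF-IDF features over word tokens
    LogisticRegression(max_iter=1000),  # linear classifier on top of the features
)
baseline.fit(train_texts, train_labels)
print("accuracy:", baseline.score(test_texts, test_labels))
```

The gap in running time comes from the fact that gzip + KNN must compute an NCD (three gzip compressions per pair in the sketch above) against every training sample for every test sample, whereas the linear baseline is fit once and then scores the whole test set in a single pass.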
Chinese Intent Recognition
In the experiment for Chinese intent recognition, we used different numbers of training samples and recorded the accuracy and time consumption of both TFIDF + logistic regression (LR) and gzip + k-nearest neighbors (KNN) methods on the test set. The test set contains a total of 1000 samples.
| Training Set Sample Size | 100 | 500 | 1000 | 5000 |
|---|---|---|---|---|
| TFIDF + LR Accuracy | 0.417 | 0.616 | 0.644 | 0.663 |
| TFIDF + LR Time (seconds) | 0.306 | 0.508 | 0.571 | 1.49 |
| gzip + KNN Accuracy | 0.443 | 0.547 | 0.576 | 0.635 |
| gzip + KNN Time (seconds) | 4.12 | 18.60 | 36.70 | 180.00 |
The overall pattern is similar to the English dataset: gzip + KNN does relatively better with fewer training samples, but once the training set is large enough, it falls behind in both accuracy and speed.
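Both tables report wall-clock time for classifying the 1000-sample test set; a minimal sketch of how a single run could be timed, assuming the `classify` function from the method sketch above and hypothetical variables for the sampled training split and the test set:

```python
import time

def evaluate_gzip_knn(train_texts, train_labels, test_texts, test_labels, k=3):
    """Return (accuracy, elapsed_seconds) for gzip + KNN on one training-set size."""
    start = time.perf_counter()
    predictions = [classify(t, train_texts, train_labels, k=k) for t in test_texts]
    elapsed = time.perf_counter() - start
    accuracy = sum(p == y for p, y in zip(predictions, test_labels)) / len(test_labels)
    return accuracy, elapsed
```

Since every test sample is compared against every training sample, the cost grows roughly linearly with the training-set size, consistent with the jump from about 4 seconds at 100 samples to about 180 seconds at 5000 in both tables.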
Paper Method Summary
- The core idea of the paper is to use lossless compressors to capture the regularity of text and convert it into similarity scores through a compressor-based distance metric.
- With few training samples, gzip + KNN performs well, but its runtime grows quickly with the sample size, so it is practical for few-shot classification rather than large-scale use.