AFS: An Attention-Based Mechanism for Supervised Feature Selection

This article presents the AAAI 2019 paper "AFS: An Attention-Based Mechanism for Supervised Feature Selection", which proposes a supervised feature selection mechanism built on the attention mechanism.
Details are as follows:

AFS: An Attention-Based Mechanism for Supervised Feature Selection

  • Paper link: https://arxiv.org/abs/1902.11074

  • Project link: https://github.com/upup123/AAAI-2019-AFS

01

Background and Introduction
Feature selection is generally an effective data preprocessing step, especially for machine learning tasks involving high-dimensional data. However, existing feature selection techniques struggle with computational complexity, scalability, and stability on noisy data. For example, many existing algorithms require loading the entire dataset into memory before computation, which becomes infeasible when the dataset scales to terabytes, and most feature selection algorithms exhibit low stability when perturbations are introduced into the training data.
The attention mechanism is a technique that focuses on the most relevant information rather than using all available information, and it has achieved significant success in various machine learning tasks. The authors found that the attention generation process is very similar to the feature selection process, as both focus on selecting a subset of data from high-dimensional datasets.
Therefore, the authors propose a new supervised feature selection framework called AFS, which consists of two detachable modules: an attention module for feature weight generation and a learning module for problem modeling. The attention module reduces the correlation between each feature and the supervised target to a binary classification problem (select or not select), where each feature is handled by a shallow attention network that generates attention weights for both classification and regression feature selection problems. Feature weights are then obtained by adjusting the distribution of each feature's selection pattern through backpropagation.
This paper’s main contributions are as follows:
  1. A new attention-based supervised feature selection architecture is proposed: this architecture consists of an attention-based feature weight generation module and a learning module, and its loosely coupled design allows different modules to be trained or initialized separately.

  2. An attention-based feature weight generation mechanism is proposed, transforming the feature weight generation problem into a feature selection pattern problem solvable by the available attention mechanism.

  3. A model reuse mechanism for computational optimization is proposed, allowing for the direct reuse of existing models, effectively reducing the computational complexity of generating feature weights.

  4. A hybrid initialization method for small datasets is proposed, combining existing feature selection methods for weight initialization to address the issue of insufficient data in small datasets for generating feature weights.

02

Related Work
Feature Selection Methods: Supervised feature selection methods are generally divided into wrapper, filter, and embedded methods. Wrapper methods rely on the predictive accuracy of a predefined learning algorithm to evaluate selected feature subsets; filter methods rely solely on general characteristics of the training data to assess feature weights; embedded methods interact with the learning algorithm and evaluate feature sets based on that interaction.
Attention Mechanism: The neural network attention mechanism takes an input and its context and returns a vector that focuses on the information relevant to that context. For inputs with spatial structure (such as images), attention is constructed to focus on the salient parts of the image; for inputs with temporal structure (such as language and video), attention is used to capture the relationship between the current input and previous inputs through recurrent neural networks (RNN, LSTM). Overall, the attention mechanism typically provides domain-specific solutions for data with particular structures. This paper focuses on the feature selection problem for conventional data without such prior knowledge.

03

AFS Architecture
As shown in Figure 1, AFS consists of two main modules: the attention module and the learning module. The attention module is located at the top of AFS and is responsible for calculating the weights of all features; the learning module aims to find the optimal correlation between weighted features and supervised targets by solving optimization problems.
The AFS architecture links the supervised target and the features through backpropagation, continuously correcting the feature weights during training. In addition, AFS formulates the correlation between the supervised target and the features jointly through the attention module and the learning module. Because AFS is loosely coupled, the attention module and the learning module can be trained separately to match a specific task, and the parameters of the attention module can be generated with much lower computational overhead by reusing existing models. The AFS architecture also provides a hybrid initialization method that uses existing feature selection algorithms to initialize the weights of the attention module.

[Figure 1: The AFS architecture, with the attention module on top generating feature weights and the learning module below performing problem modeling]

Attention Module: To capture the correlation between features and the supervised target, this paper recasts the correlation problem as a binary classification problem for each feature: for a given supervised target, should this feature be selected or not? Feature weights are then generated from the distribution of these feature selection patterns.
First, a neural network is used to extract the intrinsic relationships among the original inputs, E = Tanh(X^T W_1 + b_1), compressing the original feature domain into a smaller vector while retaining most of the information (discarding some redundant features and noise). Because the Tanh function takes both positive and negative values, it preserves important information during this extraction.
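As a small illustration (a sketch in PyTorch rather than the authors' TensorFlow implementation; the layer sizes are placeholders), the extraction step is a single dense layer with a Tanh activation:

```python
import torch
import torch.nn as nn

class FeatureExtraction(nn.Module):
    """Compress the raw input X into a smaller representation E = Tanh(X W1 + b1)."""
    def __init__(self, num_features: int, e_dim: int):
        super().__init__()
        self.dense = nn.Linear(num_features, e_dim)  # holds W1 and b1

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_features) -> E: (batch, e_dim), with e_dim < num_features
        return torch.tanh(self.dense(x))
```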
Second, the extracted representation E is used as input, and each feature X_k is assigned a shallow neural network that determines its selection probability. Each attention unit in the attention layer produces two values: for the k-th feature, p_k and n_k represent the selected/unselected scores (where h_k^L denotes the output of the L-th hidden layer of the k-th attention network):

[Equations: p_k and n_k are computed from h_k^L, the output of the last hidden layer of the k-th attention network]

Since p_k and n_k may be very close, a softmax is applied to produce a differentiable result that amplifies the difference between selecting and not selecting the feature. Only the probability of the feature being selected is kept as its attention value: a_k = exp(p_k) / (exp(p_k) + exp(n_k)). These values form the attention matrix A, and the weight s_k of the k-th feature is then computed from A by averaging over all samples.
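A minimal sketch of this step (again in PyTorch, with an assumed hidden size and ReLU hidden activation that the paper does not specify) could look like:

```python
import torch
import torch.nn as nn

class AttentionModule(nn.Module):
    """For each feature k, a shallow net maps E to (p_k, n_k); a softmax over the pair
    gives the selection probability a_k, and s_k averages a_k over all samples."""
    def __init__(self, num_features: int, e_dim: int, attn_hidden: int = 8):
        super().__init__()
        # one shallow attention network per feature
        self.attn_nets = nn.ModuleList([
            nn.Sequential(nn.Linear(e_dim, attn_hidden), nn.ReLU(),
                          nn.Linear(attn_hidden, 2))   # outputs (p_k, n_k)
            for _ in range(num_features)
        ])

    def forward(self, e: torch.Tensor) -> torch.Tensor:
        # e: (batch, e_dim) -> A: (batch, num_features), each entry in (0, 1)
        a_cols = []
        for net in self.attn_nets:
            logits = net(e)                                     # (batch, 2): [p_k, n_k]
            a_cols.append(torch.softmax(logits, dim=1)[:, 0])   # probability of selection
        return torch.stack(a_cols, dim=1)

# Feature weights: A = attention(E); s = A.mean(dim=0) gives each s_k in [0, 1].
```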
The attention module has several advantages: 1) feature weights are generated from feature selection patterns, allowing the neural network to take the intrinsic relationships between features into account; 2) feature weights are always constrained between 0 and 1, which accelerates training convergence and facilitates training through backpropagation; 3) redundant features are suppressed by the preceding network E, since the smaller size of E forces some of the information from redundant features to be discarded.
Learning Module: The feature vector X is related to the attention weights A by element-wise multiplication, G = X * A, so continuously adjusting A during training amounts to trading off between selecting and not selecting each feature. The learning module is trained by backpropagation on the following objective (R denotes the L2-norm regularization term, which accelerates optimization and prevents overfitting; λ controls the strength of the regularization; the loss function depends on the prediction task: classification tasks generally use the cross-entropy loss, while regression tasks use the MSE loss):

[Objective: the task loss computed on G = X * A plus λ times the L2-norm regularization term R]
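A single training step under this objective might look like the following sketch (PyTorch, classification case; `extractor`, `attention`, and `learner` stand for the modules sketched above, and regularizing only the learning-module weights is an assumption):

```python
import torch.nn as nn

def afs_training_step(x, y, extractor, attention, learner, optimizer, lam=1e-4):
    """One AFS training step (sketch): weight the inputs by attention, predict, regularize."""
    e = extractor(x)               # E = Tanh(X W1 + b1)
    a = attention(e)               # per-sample attention weights A, values in (0, 1)
    g = x * a                      # G = X * A, element-wise weighting of the features
    logits = learner(g)            # any task-specific network (DNN / CNN / RNN)

    loss = nn.functional.cross_entropy(logits, y)             # CE for classification, MSE for regression
    l2 = sum((p ** 2).sum() for p in learner.parameters())    # R: L2-norm term
    total = loss + lam * l2

    optimizer.zero_grad()
    total.backward()               # backpropagation also updates the attention weights
    optimizer.step()
    return total.item()
```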

Furthermore, for specific learning problems, AFS uses the most suitable network structure for the specific task. This mainly includes: Deep Neural Networks (DNN), Convolutional Neural Networks (CNN), and Recurrent Neural Networks (RNN).
Reuse Mechanism: As seen in Figure 1, the computational complexity of the AFS architecture arises from the attention module and learning module. Since the AFS structure is detachable, the training of the learning module and attention module can be conducted separately. This design allows for the direct reuse of existing models (e.g., pre-trained VGG and ResNet networks) in AFS.
Furthermore, a trained model can be used to initialize the learning-module parameters of AFS; this variant is referred to as AFS-R. Since the parameters of the learning module have already converged, training AFS-R only requires fine-tuning: either fine-tuning both the attention module and the learning module (AFS-R-GlobalTune), or fixing the learning module and training only the attention module (AFS-R-LocalTune).
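The two strategies differ only in which parameters receive gradients; a hedged sketch (the optimizer choice and learning rate are assumptions):

```python
import torch

def configure_fine_tuning(attention, learner, mode: str, lr: float = 1e-3):
    """Build an optimizer for the two AFS-R strategies: 'local' freezes the pre-trained
    learning module and tunes only the attention module; 'global' fine-tunes both."""
    if mode == "local":            # AFS-R-LocalTune
        for p in learner.parameters():
            p.requires_grad = False
        params = list(attention.parameters())
    elif mode == "global":         # AFS-R-GlobalTune
        params = list(attention.parameters()) + list(learner.parameters())
    else:
        raise ValueError("mode must be 'local' or 'global'")
    return torch.optim.Adam(params, lr=lr)
```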
Hybrid Initialization: Because AFS performance depends heavily on the number of samples in the dataset, a small sample size may not provide enough gradient signal through backpropagation to adjust the entire neural network. To extend the applicability of AFS to small datasets, this paper proposes a hybrid initialization method that uses the output of an existing feature selection method as the initial feature weights.
This hybrid initialization method consists of three main steps: 1) use an existing feature selection method to generate feature weights and normalize them to the range [0, 1] with Min-Max normalization; 2) pre-train the attention module, using the feature weights produced in the previous step as the training targets for every sample and adaptive moment estimation (Adam) as the optimizer; 3) train the AFS network with the normal training process.
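A sketch of this pre-training phase, assuming an MSE loss against the normalized base weights (the paper does not specify the exact pre-training loss) and the PyTorch modules sketched earlier:

```python
import torch
import torch.nn as nn

def hybrid_initialization(extractor, attention, x, base_weights, steps=1000, lr=1e-3):
    """Pre-train the attention module so that its output matches weights produced
    by an existing feature selection method (steps 1 and 2 of hybrid initialization)."""
    w = torch.as_tensor(base_weights, dtype=torch.float32)
    w = (w - w.min()) / (w.max() - w.min() + 1e-12)    # step 1: Min-Max normalization to [0, 1]

    target = w.unsqueeze(0).expand(x.shape[0], -1)     # same target weights for every sample
    params = list(extractor.parameters()) + list(attention.parameters())
    optimizer = torch.optim.Adam(params, lr=lr)
    for _ in range(steps):                             # step 2: pre-train the attention module
        a = attention(extractor(x))
        loss = nn.functional.mse_loss(a, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # step 3: continue with the normal AFS training loop
```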

04

Experimental Results
Questions: Experiments were conducted to answer the following research questions: 1) Is AFS superior to state-of-the-art feature selection methods? 2) How much can computational complexity be reduced by reusing existing models? 3) How can existing feature selection methods be combined with AFS to improve feature selection performance?
Datasets: An overview of the datasets is shown in Table 1, where n-MNIST-AWGN denotes MNIST with additive white Gaussian noise, n-MNIST-MB denotes motion blur, and n-MNIST-RCAWGN denotes a combination of reduced contrast and additive white Gaussian noise. These n-MNIST datasets provide a good basis for evaluating feature selection stability.

[Table 1: Overview of the datasets]

Evaluation Protocols: Feature weights are obtained from the training data and sorted in descending order; a given number of top-ranked features is selected as the feature subset, and the accuracy achieved with that feature subset on the test set serves as the performance metric.
Parameter Settings: Model parameters are initialized using a truncated normal distribution with a mean of 0 and a standard deviation of 0.1, the model is optimized using the Adam optimizer, the training batch size is set to 100, the regularization weight is set to 0.0001, and the training steps are set to 3000.
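For concreteness, the top-K evaluation could be implemented roughly as follows (a sketch; the benchmark classifier here is a stand-in, not necessarily the one used in the paper):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def evaluate_top_k(s, k, X_train, y_train, X_test, y_test):
    """Rank features by weight, keep the top k, refit a benchmark classifier,
    and report its accuracy on the test set."""
    top_k = np.argsort(s)[::-1][:k]          # indices of the k largest feature weights
    clf = LogisticRegression(max_iter=1000)  # stand-in benchmark classifier
    clf.fit(X_train[:, top_k], y_train)
    return clf.score(X_test[:, top_k], y_test)
```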
Experiments on MNIST variants (Q1):
Feature weights from each method are sorted numerically, the top-K features are selected according to this order and fed into a benchmark classifier fitted on the training data, and the average classification accuracy is reported. The following figures show the resulting accuracy of the different feature selection methods on MNIST and its variants, from which we can observe:
  1. AFS achieved the best accuracy across all four datasets and nearly all feature selection ranges, significantly outperforming other comparative methods.

  2. For different types of noise, AFS achieved optimal feature selection stability, demonstrating consistently good performance regardless of the type of noise introduced.

  3. Moreover, as shown in Table 2, when the number of selected features is between 15 and 85, AFS significantly outperformed the other five methods, further indicating that AFS has the most accurate feature weight ranking, which is a crucial advantage in many modeling processes.

  4. Table 3 presents the computational overhead of different feature selection methods, measured by the execution time of the feature weight generation process, revealing that the AFS algorithm has moderate computational complexity while providing a substantial performance boost.

[Figure: Classification accuracy of the compared feature selection methods on MNIST and its noisy variants]

[Table 2: Accuracy of each method when 15 to 85 features are selected]

[Table 3: Execution time of the feature weight generation process for each method]

Model Reuse (Q2):
To evaluate how much reusing an existing model reduces computational complexity, a DNN model achieving 98.4% classification accuracy on the MNIST test set was used directly as the learning module to test the AFS-R-GlobalTune and AFS-R-LocalTune strategies:
As shown in Table 4, when the number of tuning steps is small, the AFS-R-LocalTune strategy generally achieves better accuracy than AFS-R-GlobalTune, as it has far fewer parameters to tune; as the number of tuning steps increases, AFS-R-GlobalTune gradually achieves better accuracy. The computational overhead of AFS-R-GlobalTune is approximately 25 s, about 11.5% of that of AFS, and global tuning requires more time than local tuning for the same number of steps because more parameters need to be tuned.

[Table 4: Accuracy and runtime of AFS-R-GlobalTune and AFS-R-LocalTune under different numbers of tuning steps]

Performance on L/S Datasets (Q3):
For the hybrid initialization method, Fisher score and ReliefF were used as the base feature selection methods, and the extended variants are denoted AFS-Fisher and AFS-ReliefF. As shown in Figure 5, the hybrid initialization method significantly outperforms the other solutions on small datasets, indicating that it can greatly enhance the accuracy of existing feature selection algorithms. It was also found that a higher sample-to-feature ratio may help the hybrid initialization method achieve larger performance improvements.

[Figure 5: Performance of AFS-Fisher and AFS-ReliefF on the small datasets]

05

Conclusion
This paper proposes a new feature selection architecture that extends the attention mechanism to general feature selection tasks. By casting the feature selection problem as a binary classification problem for each feature, the weight of each feature can be identified from its feature selection pattern. Moreover, the highly decoupled design allows different modules to be trained separately or reused. Future work may consider developing domain-optimized solutions for specific structured data and reducing the computational cost of AFS on ultra-high-dimensional datasets.