Feature Importance Analysis and Selection with XGBoost in Python

The benefit of using ensemble decision tree methods like gradient boosting is that they can automatically provide estimates of feature importance from trained predictive models.
In this article, you will discover how to estimate the importance of features for predictive modeling problems using the XGBoost library in Python. After reading this article, you will know:
  • How to calculate feature importance using the gradient boosting algorithm.
  • How to plot feature importance calculated by the XGBoost model in Python.
  • How to perform feature selection using feature importance calculated by XGBoost.
Feature Importance in Gradient Boosting
The benefit of gradient boosting is that after the boosted trees have been built, it is relatively straightforward to retrieve an importance score for each attribute. Generally, importance provides a score that indicates how useful or valuable each feature was in the construction of the boosted decision trees within the model. The more an attribute is used to make key decisions within the decision trees, the higher its relative importance.
This importance is calculated explicitly for each attribute in the dataset, allowing attributes to be ranked and compared against each other. For a single decision tree, importance is calculated as the amount by which each attribute's split points improve the performance measure, weighted by the number of observations the node is responsible for. The performance measure may be the purity (Gini index) used to select the split points, or another more specific error function. The feature importances are then averaged across all decision trees in the model. For more technical information on how feature importance is calculated in boosted decision trees, see Section 10.13.1 "Relative Importance of Predictor Variables" (page 367) of "The Elements of Statistical Learning: Data Mining, Inference, and Prediction." Also see Matthew Drury's answer to the StackOverflow question "Relative Variable Importance in Boosting," where he gives a very detailed and practical answer.
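As an aside, the booster inside a fitted XGBClassifier can report several importance definitions (split count, gain, cover), not just the single score exposed later through feature_importances_. The short sketch below is not part of the original example and uses synthetic data purely for illustration.
# sketch: inspect the different importance types reported by the booster
from sklearn.datasets import make_classification
from xgboost import XGBClassifier
# synthetic data for illustration only
X, y = make_classification(n_samples=200, n_features=8, random_state=7)
model = XGBClassifier()
model.fit(X, y)
booster = model.get_booster()
for importance_type in ('weight', 'gain', 'cover'):
 # 'weight': number of times a feature is used to split the data
 # 'gain': average improvement in the objective brought by its splits
 # 'cover': average number of observations covered by its splits
 print(importance_type, booster.get_score(importance_type=importance_type))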
Manually Plotting Feature Importance
A trained XGBoost model automatically calculates the feature importance in your predictive modeling problem. These importance scores can be obtained from the feature_importances_ member variable of the trained model. For example, they can be printed directly as follows:
print(model.feature_importances_)
We can plot these scores directly in a bar chart to visually represent the relative importance of each feature in the dataset. For example:
# plot
pyplot.bar(range(len(model.feature_importances_)), model.feature_importances_)
pyplot.show()
We can demonstrate this by training the XGBoost model on the Pima Indians diabetes dataset and creating a bar chart based on the calculated feature importance.
Download the dataset and place it in the current working directory.
Dataset file:

https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv

Dataset details:

https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.names
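If you prefer to fetch the file from a script rather than download it by hand, a small helper along the following lines will do; it is not part of the original example.
# optional: download the dataset into the current working directory
from urllib.request import urlretrieve
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv'
urlretrieve(url, 'pima-indians-diabetes.csv')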

# plot feature importance manually
from numpy import loadtxt
from xgboost import XGBClassifier
from matplotlib import pyplot
# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
# split data into X and y
X = dataset[:,0:8]
y = dataset[:,8]
# fit model on training data
model = XGBClassifier()
model.fit(X, y)
# feature importance
print(model.feature_importances_)
# plot
pyplot.bar(range(len(model.feature_importances_)), model.feature_importances_)
pyplot.show()
Note: Your results may vary due to the randomness of the algorithm or evaluation procedure, or differences in numerical precision. Consider running this example a few times and comparing the average results.
Running this example will output the importance scores.
[ 0.089701    0.17109634  0.08139535  0.04651163  0.10465116  0.2026578 0.1627907   0.14119601]
We also obtain a bar chart of relative importance.

[Figure: bar chart of feature importance scores, with bars ordered by feature index]

The downside of this chart is that the features are sorted by their input index rather than their importance. We can sort the features before plotting.
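A few lines of NumPy are enough to do this by hand. The sketch below is a variation on the example above (not from the original article): it reorders the bars by score and labels each one with its feature index.
# sketch: manually sort importance scores before plotting
from numpy import loadtxt
from numpy import argsort
from xgboost import XGBClassifier
from matplotlib import pyplot
# load data and fit the model as before
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
X = dataset[:,0:8]
y = dataset[:,8]
model = XGBClassifier()
model.fit(X, y)
# sort the scores from least to most important and keep track of the feature indices
importances = model.feature_importances_
order = argsort(importances)
labels = ['f%d' % i for i in order]
pyplot.bar(range(len(order)), importances[order], tick_label=labels)
pyplot.show()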
Fortunately, there is a built-in plotting function that can help us.
Using the built-in feature importance plot function provided by the XGBoost library, we can plot features in order of their importance. This function is called plot_importance() and can be used as follows:
# plot feature importance
plot_importance(model)
pyplot.show()
For example, here is the complete code snippet that uses the built-in plot_importance() function to plot the feature importance of the Pima Indians dataset.
# plot feature importance using built-in function
from numpy import loadtxt
from xgboost import XGBClassifier
from xgboost import plot_importance
from matplotlib import pyplot
# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
# split data into X and y
X = dataset[:,0:8]
y = dataset[:,8]
# fit model on training data
model = XGBClassifier()
model.fit(X, y)
# plot feature importance
plot_importance(model)
pyplot.show()
Note: Your results may vary due to the randomness of the algorithm or evaluation procedure, or differences in numerical precision. Consider running this example a few times and comparing the average results.
Running this example will provide us with a more useful bar chart.

[Figure: feature importance bar chart produced by plot_importance(), with features sorted by importance score]

You can see that the features are automatically named according to their index in the input array (X) from F0 to F7. By manually mapping these indices to the names in the problem description, we can see that the chart shows F5 (Body Mass Index) with the highest importance, while F3 (Skin Fold Thickness) has the lowest importance.
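If you would rather see the real attribute names on the plot, one option (not shown in the original example) is to fit the model on a pandas DataFrame with named columns; XGBoost then carries the column names through to plot_importance. The column names below follow the dataset description and are my own choice.
# sketch: fit on named columns so plot_importance shows attribute names
from pandas import read_csv
from xgboost import XGBClassifier
from xgboost import plot_importance
from matplotlib import pyplot
# column names based on the dataset description (assumed labels)
names = ['pregnancies', 'glucose', 'blood_pressure', 'skin_thickness',
 'insulin', 'bmi', 'pedigree', 'age', 'class']
data = read_csv('pima-indians-diabetes.csv', header=None, names=names)
X = data[names[:-1]]
y = data['class']
model = XGBClassifier()
model.fit(X, y)
plot_importance(model)
pyplot.show()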
Feature Selection Using XGBoost Feature Importance Scores
Feature importance scores can be used for feature selection in scikit-learn. This is done using the SelectFromModel class, which takes a model and can transform a dataset into a subset with the selected features. This class can take a pre-trained model, such as one trained on the entire training dataset. It can then use a threshold to decide which features to select. The threshold is applied when you call the transform() method on the SelectFromModel instance, so the same features are consistently selected on both the training and test datasets.
In the example below, we first train an XGBoost model on the entire training dataset and evaluate it on the test dataset. We then use the feature importance calculated from the training data to wrap the model in a SelectFromModel instance, use it to select features from the training dataset, train a model on the selected feature subset, and evaluate that model on the test set, applying the same feature selection scheme to it.
For example:
# select features using threshold
selection = SelectFromModel(model, threshold=thresh, prefit=True)
select_X_train = selection.transform(X_train)
# train model
selection_model = XGBClassifier()
selection_model.fit(select_X_train, y_train)
# eval model
select_X_test = selection.transform(X_test)
y_pred = selection_model.predict(select_X_test)
Out of curiosity, we can test multiple thresholds for selecting features by importance. Specifically, using the importance score of each input variable as a threshold essentially lets us try every feature subset ranked by importance, from all features down to the single most important feature.
Below is the complete code snippet:
# use feature importance for feature selection
from numpy import loadtxt
from numpy import sort
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import SelectFromModel
# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
# split data into X and y
X = dataset[:,0:8]
Y = dataset[:,8]
# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=7)
# fit model on all training data
model = XGBClassifier()
model.fit(X_train, y_train)
# make predictions for test data and evaluate
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
# Fit model using each importance as a threshold
thresholds = sort(model.feature_importances_)
for thresh in thresholds:
 # select features using threshold
 selection = SelectFromModel(model, threshold=thresh, prefit=True)
 select_X_train = selection.transform(X_train)
 # train model
 selection_model = XGBClassifier()
 selection_model.fit(select_X_train, y_train)
 # eval model
 select_X_test = selection.transform(X_test)
 y_pred = selection_model.predict(select_X_test)
 predictions = [round(value) for value in y_pred]
 accuracy = accuracy_score(y_test, predictions)
 print("Thresh=%.3f, n=%d, Accuracy: %.2f%%" % (thresh, select_X_train.shape[1], accuracy*100.0))
Note: If you are using XGBoost 1.0.2 (and possibly other versions), there is a bug in the XGBClassifier class that can lead to the error:
KeyError: 'weight'
This can be resolved by using a custom XGBClassifier class that returns None for the coef_ property. Below is the complete example.
# use feature importance for feature selection, with fix for xgboost 1.0.2
from numpy import loadtxt
from numpy import sort
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import SelectFromModel
 
# define custom class to fix bug in xgboost 1.0.2
class MyXGBClassifier(XGBClassifier):
 @property
 def coef_(self):
  return None
 
# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
# split data into X and y
X = dataset[:,0:8]
Y = dataset[:,8]
# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=7)
# fit model on all training data
model = MyXGBClassifier()
model.fit(X_train, y_train)
# make predictions for test data and evaluate
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
# Fit model using each importance as a threshold
thresholds = sort(model.feature_importances_)
for thresh in thresholds:
 # select features using threshold
 selection = SelectFromModel(model, threshold=thresh, prefit=True)
 select_X_train = selection.transform(X_train)
 # train model
 selection_model = XGBClassifier()
 selection_model.fit(select_X_train, y_train)
 # eval model
 select_X_test = selection.transform(X_test)
 predictions = selection_model.predict(select_X_test)
 accuracy = accuracy_score(y_test, predictions)
 print("Thresh=%.3f, n=%d, Accuracy: %.2f%%" % (thresh, select_X_train.shape[1], accuracy*100.0))
Note: Your results may vary due to the randomness of the algorithm or evaluation procedure, or differences in numerical precision. Consider running this example a few times and comparing the average results.
Running this example will print the following output.
Accuracy: 77.95%
Thresh=0.071, n=8, Accuracy: 77.95%
Thresh=0.073, n=7, Accuracy: 76.38%
Thresh=0.084, n=6, Accuracy: 77.56%
Thresh=0.090, n=5, Accuracy: 76.38%
Thresh=0.128, n=4, Accuracy: 76.38%
Thresh=0.160, n=3, Accuracy: 74.80%
Thresh=0.186, n=2, Accuracy: 71.65%
Thresh=0.208, n=1, Accuracy: 63.78%
We can see that the performance of the model generally decreases as the number of selected features decreases. There is a trade-off between the number of features and the accuracy on the test set, and we might decide to accept a less complex model (with fewer attributes, e.g., n=4) in exchange for a moderate decrease in estimated accuracy, from 77.95% down to 76.38%.
This is probably a wash on such a small dataset, but it may be a more useful strategy on larger datasets, using cross-validation as the model evaluation scheme.
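As a sketch of that idea (not part of the original code), the threshold loop can score each feature subset with k-fold cross-validation instead of a single train/test split. If you hit the KeyError: 'weight' bug mentioned above, the same MyXGBClassifier workaround applies here.
# sketch: evaluate each importance threshold with 10-fold cross-validation
from numpy import loadtxt
from numpy import sort
from xgboost import XGBClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import cross_val_score
# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
X = dataset[:,0:8]
y = dataset[:,8]
# fit a model on all the data to obtain importance scores
model = XGBClassifier()
model.fit(X, y)
# score each feature subset with cross-validation
for thresh in sort(model.feature_importances_):
 selection = SelectFromModel(model, threshold=thresh, prefit=True)
 select_X = selection.transform(X)
 scores = cross_val_score(XGBClassifier(), select_X, y, cv=10, scoring='accuracy')
 print("Thresh=%.3f, n=%d, CV Accuracy: %.2f%%" % (thresh, select_X.shape[1], scores.mean()*100.0))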

Author: Yishui Hancheng, CSDN Blog Expert, personal research directions: Machine Learning, Deep Learning, NLP, CV

Blog: http://yishuihancheng.blog.csdn.net
