Selected from Ancestry
Author: Tyler Folkman
Translated by Machine Heart
Contributors: Liu Xiaokun, Li Zenan
Models built with XGBoost often achieve good results in Kaggle and other data science competitions, which has made it popular (see: Why Does XGBoost Perform So Well in Machine Learning Competitions?). This article analyzes the prediction process of an XGBoost machine learning model on a concrete dataset and uses visualization techniques to better understand how the model arrives at its predictions.
As industrial applications of machine learning continue to grow, understanding, interpreting, and explaining how machine learning models work is becoming an increasingly clear trend. For non-deep-learning machine learning classification problems, XGBoost is the most popular library. Because it scales well to large datasets and supports multiple languages, it is particularly useful in commercial environments. For example, with XGBoost it is easy to train a model in Python and deploy it to a Java production environment.
Although XGBoost can achieve high accuracy, the process by which it reaches those decisions is still not very transparent. This lack of transparency can be a serious flaw when results are handed directly to clients, because understanding the reasons behind an outcome is very useful. Companies that turn to machine learning to understand their data also need to understand the predictions their models make, and this is becoming increasingly important. For instance, no one wants a credit institution to use a machine learning model to predict a user's creditworthiness without being able to explain how those predictions are made.
Another example: suppose our machine learning model says that a marriage record and a birth record refer to the same person (a record linkage task), but the dates on the records imply a marriage between a very old person and a very young person. We might question why the model linked them. In such cases, understanding why the model made that prediction is very valuable. It might turn out that the model weighed the uniqueness of the names and locations and made the correct prediction. But it might also be that the model's features did not properly account for the age gap between the records. In that case, understanding the model's prediction helps us find ways to improve its performance.
In this article, we will introduce some techniques to better understand the prediction process of XGBoost. This allows us to leverage the power of gradient boosting while still understanding the model’s decision-making process.
To explain these techniques, we will use the Titanic dataset. This dataset contains information about each Titanic passenger (including whether they survived). Our goal is to predict whether a passenger survived and understand the process behind that prediction. Even with this data, we can see the importance of understanding model decisions. Imagine if we had a dataset of passengers from a recent shipwreck. The purpose of building such a predictive model is not actually to predict the outcome itself, but understanding the prediction process can help us learn how to maximize the number of survivors in such accidents.
import pandas as pd
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import operator
import matplotlib.pyplot as plt
import seaborn as sns
import lime.lime_tabular
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer  # Imputer (sklearn.preprocessing) was removed in newer scikit-learn releases
import numpy as np
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in newer scikit-learn releases
%matplotlib inline
The first thing we need to do is observe our data, which you can find on Kaggle (https://www.kaggle.com/c/titanic/data). After obtaining the dataset, we will perform some simple data cleaning. That is:
- Remove names and passenger IDs
- Convert categorical variables into dummy variables
- Fill in missing values with the median
These cleaning techniques are very simple; the goal of this article is not to discuss data cleaning but to explain XGBoost, so these are quick and reasonable cleaning steps to prepare the model for training.
data = pd.read_csv("./data/titantic/train.csv")
y = data.Survived
X = data.drop(["Survived", "Name", "PassengerId"], axis=1)
X = pd.get_dummies(X)
Now let’s split the dataset into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
Then we construct a training pipeline and run a small grid search over a few hyperparameters.
pipeline = Pipeline([('imputer', SimpleImputer(strategy='median')), ('model', XGBClassifier())])
parameters = dict(model__max_depth=[3, 5, 7],
                  model__learning_rate=[.01, .1],
                  model__n_estimators=[100, 500])
cv = GridSearchCV(pipeline, param_grid=parameters)
cv.fit(X_train, y_train)
Next, let’s check the test results. For simplicity, we will use the same metric as Kaggle: accuracy.
test_predictions = cv.predict(X_test)
print("Test Accuracy: {}".format(accuracy_score(y_test, test_predictions)))
Test Accuracy: 0.8101694915254237
So far, we have achieved a decent accuracy, ranking in the top 500 out of about 9,000 competitors on Kaggle. Thus, we still have room for further improvement, but we will leave that as an exercise for the reader.
Let's move on to understanding what the model has learned. A common method is to use the feature importances provided by XGBoost: the higher a feature's importance, the more that feature contributes to improving the model's predictions. Next, we rank the features by importance and compare their relative magnitudes.
fi = list(zip(X.columns, cv.best_estimator_.named_steps['model'].feature_importances_))
fi.sort(key=operator.itemgetter(1), reverse=True)
top_10 = fi[:10]
x = [x[0] for x in top_10]  # feature names
y = [x[1] for x in top_10]  # importance values
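The original post shows the resulting importances as a bar chart, but the plotting call itself is not included in the snippet above. A minimal sketch that draws a similar chart from the x (feature names) and y (importance values) lists, using the seaborn and matplotlib imports from earlier, might look like this:
# Hypothetical plotting step (not part of the original snippet): horizontal bar
# chart of the top 10 feature importances computed above.
plt.figure(figsize=(10, 5))
sns.barplot(x=y, y=x)  # importance values on the horizontal axis, feature names on the vertical axis
plt.xlabel("Feature importance")
plt.show()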
From the figure above, we can see that fare and age are important features. We can further examine the distribution of fares for survivors versus non-survivors:
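The figure from the original post is not reproduced here; a hedged sketch of an equivalent check, assuming the data DataFrame loaded earlier, could be:
# Hypothetical sketch (not from the original code): average fare by outcome and
# the fare distributions for survivors vs. non-survivors.
print(data.groupby("Survived")["Fare"].mean())
sns.kdeplot(data.loc[data.Survived == 1, "Fare"], label="Survived")
sns.kdeplot(data.loc[data.Survived == 0, "Fare"], label="Not Survived")
plt.xlabel("Fare")
plt.legend()
plt.show()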
We can clearly see that survivors had a much higher average fare compared to non-survivors, so considering fare as an important feature seems reasonable.
Feature importance is a good way to understand feature significance in general. But suppose an anomaly occurs: the model predicts that a particular high-fare passenger will not survive. High fare clearly does not guarantee survival, so we would want to analyze which other features led the model to predict that this passenger will not survive.
This kind of individual-level analysis can be very useful for production machine learning systems. Consider other examples, such as using a model to predict whether someone can obtain a loan. We know that credit score will be an important feature for the model, but what if a customer with a high credit score is denied by the model? How do we explain that to the customer? And how do we explain it to the management?
Fortunately, researchers at the University of Washington have recently published work on explaining the predictions of any classifier. Their method, called LIME, is open-sourced on GitHub (https://github.com/marcotcr/lime). This article does not go into the details; you can refer to the paper (https://arxiv.org/pdf/1602.04938.pdf).
Next, let's apply LIME to our model. Basically, we first need to define an explainer and hand it our training data (we need to make sure the dataset we pass to the explainer is exactly the imputed data that the model was trained on):
X_train_imputed = cv.best_estimator_.named_steps['imputer'].transform(X_train)
explainer = lime.lime_tabular.LimeTabularExplainer(X_train_imputed,
                                                   feature_names=X_train.columns.tolist(),
                                                   class_names=["Not Survived", "Survived"],
                                                   discretize_continuous=True)
Then we must define a function that takes an array of feature values and returns an array with the probability of each class:
model = cv.best_estimator_.named_steps['model']

def xgb_prediction(X_array_in):
    if len(X_array_in.shape) < 2:
        X_array_in = np.expand_dims(X_array_in, 0)
    return model.predict_proba(X_array_in)
Finally, we pass an example to the explainer, along with the number of features and labels to display:
X_test_imputed = cv.best_estimator_.named_steps['imputer'].transform(X_test)
exp = explainer.explain_instance(X_test_imputed[1], xgb_prediction, num_features=5, top_labels=1)
exp.show_in_notebook(show_table=True, show_all=False)
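The show_in_notebook output renders as an interactive widget; outside a notebook, the same explanation can also be printed as (feature rule, weight) pairs via LIME's Explanation API (as_list and available_labels; shown here only as an optional aside, not part of the original code):
# Optional: print the explanation as (feature rule, weight) pairs for the explained label.
label = exp.available_labels()[0]
print(exp.as_list(label=label))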
Here we have an example where there is a 76% probability of not surviving. We also want to see which features contribute the most to which class and how important they are. For example, when Sex = Female, the chances of survival are greater. Let’s take a look at the bar chart:
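The bar chart from the original post is not reproduced here; a minimal sketch of an equivalent plot over the training data (again assuming the data DataFrame loaded earlier) might be:
# Hypothetical sketch (not from the original code): survival rate by sex.
sns.barplot(x="Sex", y="Survived", data=data)
plt.ylabel("Survival rate")
plt.show()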
This seems reasonable. If you are female, it greatly increases your chances of survival in the training data. So why is the prediction result “Not Survived”? It seems that Pclass = 2.0 significantly lowers the survival rate. Let’s take a look:
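Again, the corresponding figure is missing here; a hedged way to check the same thing from the training data is:
# Hypothetical sketch (not from the original code): survival rate by passenger class.
print(data.groupby("Pclass")["Survived"].mean())
sns.barplot(x="Pclass", y="Survived", data=data)
plt.ylabel("Survival rate")
plt.show()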
It seems that the survival rate for Pclass equal to 2 is indeed relatively low, so we now understand our prediction result a bit better. But looking at the top 5 features displayed by LIME, it seems this person should still have survived. Let's check their label:
y_test.values[1]
>>> 1
This person did survive, so our model was wrong! Thanks to LIME, we have some insight into the issue: it seems that Pclass may need to be discarded. This approach can help us find ways to improve the model.
This article has introduced some simple and effective ways to understand XGBoost. We hope these methods help you make good use of XGBoost and enable your models to make better-informed inferences.
Original article link: https://blogs.ancestry.com/ancestry/2017/12/18/understanding-machine-learning-xgboost/
This article is compiled by Machine Heart, for reprint authorization, please contact this public account.