Mathematical Derivation and Pure Python Implementation of Machine Learning Algorithm: XGBoost


Since Chen Tianqi introduced XGBoost, the model has been a frequent workhorse in major data competitions; its greatest advantages are speed and effectiveness. XGBoost belongs to the same lineage as GBDT and is a boosting ensemble learning algorithm, but it improves on GBDT in several respects.
The full name of XGBoost is eXtreme Gradient Boosting, i.e. extreme gradient boosting trees. To understand the whole XGBoost system in depth, it is worth carefully studying Chen Tianqi's paper XGBoost: A Scalable Tree Boosting System, focusing on the derivation of the loss function. This article covers only the most important parts of the model: the mathematical derivation of the XGBoost loss function and the gain calculation used for node splitting.
Derivation of XGBoost Principles
Since XGBoost is still part of the GBDT family, it must also be an additive model composed of multiple base models, so XGBoost can be represented as:
$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i)$$
Assuming that the tree model to be trained in the $t$-th iteration is $f_t(x)$, we have:
$$\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i)$$
Now let’s get to the point and derive the XGBoost loss function:
The original form of the loss function can be expressed as:
$$Obj = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega\left(f_k\right)$$
Here $\sum_{k=1}^{K}\Omega(f_k)$ is the regularization term of the loss function, representing the total complexity of all trees; its purpose is to prevent the model from overfitting.
XGBoost comes from GBDT and likewise uses the forward stepwise algorithm. Taking the model at the $t$-th step as an example, its prediction for the $i$-th sample $x_i$ is:
$$\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i)$$
Here $\hat{y}_i^{(t-1)}$ is the prediction given by the model after step $t-1$, which is a known constant, and $f_t(x_i)$ is the prediction of the tree added at step $t$. Therefore, the objective function can be rewritten as:
$$Obj^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \sum_{k=1}^{K} \Omega\left(f_k\right)$$
We can also split the regularization term: since the structures of the first $t-1$ trees are already determined, their total complexity can be treated as a constant:
$$\sum_{k=1}^{K} \Omega\left(f_k\right) = \Omega\left(f_t\right) + \text{constant}$$
Then, applying the second-order Taylor formula $f(x + \Delta x) \simeq f(x) + f'(x)\,\Delta x + \frac{1}{2} f''(x)\,\Delta x^{2}$ (this is where a quick calculus refresher helps), with $\hat{y}_i^{(t-1)}$ playing the role of $x$ and $f_t(x_i)$ the role of $\Delta x$, we can rewrite the loss function as:
$$Obj^{(t)} \simeq \sum_{i=1}^{n} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^{2}(x_i) \right] + \Omega\left(f_t\right) + \text{constant}$$
Here $g_i = \partial_{\hat{y}^{(t-1)}} l\left(y_i, \hat{y}^{(t-1)}\right)$ is the first derivative of the loss function and $h_i = \partial^{2}_{\hat{y}^{(t-1)}} l\left(y_i, \hat{y}^{(t-1)}\right)$ is its second derivative. Note that the derivatives are taken with respect to $\hat{y}^{(t-1)}$. Compared with GBDT, XGBoost uses second-order derivative information, so if you want to customize the loss function, the primary requirement is that it be twice differentiable.
For example, using the squared loss function
$$l\left(y_i, \hat{y}^{(t-1)}\right) = \left(y_i - \hat{y}^{(t-1)}\right)^{2}$$
we have:
$$g_i = \partial_{\hat{y}^{(t-1)}} \left(y_i - \hat{y}^{(t-1)}\right)^{2} = -2\left(y_i - \hat{y}^{(t-1)}\right), \qquad h_i = \partial^{2}_{\hat{y}^{(t-1)}} \left(y_i - \hat{y}^{(t-1)}\right)^{2} = 2$$
Substituting these first- and second-order derivatives into the second-order Taylor expansion above gives the approximate expression of the loss function. Since $l\left(y_i, \hat{y}_i^{(t-1)}\right)$ and the complexity of the previous trees are constants with respect to $f_t$, removing the constant terms yields the simplified loss function:
$$Obj^{(t)} \simeq \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^{2}(x_i) \right] + \Omega\left(f_t\right)$$
So at each step we only need the first and second derivatives of the loss at the current predictions, optimize this objective to obtain $f_t(x)$, and finally sum the per-step trees according to the additive model to obtain the boosted model.
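As a quick sanity check of the squared-loss derivatives above, here is a minimal NumPy sketch (the array values are made up purely for illustration):

import numpy as np

# Squared loss: l(y, y_hat) = (y - y_hat)^2  =>  g = -2 * (y - y_hat),  h = 2
y = np.array([1.0, 0.0, 3.0])        # true targets
y_hat = np.array([0.5, 0.2, 2.0])    # predictions accumulated from the previous t-1 trees

g = -2.0 * (y - y_hat)               # first-order gradients g_i, here [-1.0, 0.4, -2.0]
h = 2.0 * np.ones_like(y)            # second-order gradients h_i (constant 2 for squared loss)
print(g, h)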
Next, redefine a decision tree. A tree consists of two parts: the weight vector $w$ of the leaf nodes, and the mapping $q$ from instances (samples) to leaf nodes, which is essentially the tree's branching structure.
The mathematical representation of a tree is:
$$f_t(x) = w_{q(x)}, \qquad w \in \mathbb{R}^{T}, \quad q: \mathbb{R}^{d} \to \{1, 2, \dots, T\}$$
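For instance (with made-up numbers), a tree with $T = 3$ leaves and weight vector $w = (0.2, -0.1, 0.4)$ assigns to any sample $x$ that is routed to leaf 2 the prediction $f_t(x) = w_{q(x)} = w_2 = -0.1$.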
Next, let's look at the regularization term that defines model complexity. The complexity of a single tree is determined by its number of leaf nodes $T$ and its leaf weights $w$; the regularization term of the loss is then the accumulated complexity over all trees. For a single tree the mathematical expression is:
$$\Omega\left(f_t\right) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^{2}$$
Then we regroup the samples by leaf node. All samples that fall into the $j$-th leaf node form the index set $I_j = \{\, i \mid q(x_i) = j \,\}$, so the XGBoost objective can be rewritten as:
$$Obj^{(t)} \simeq \sum_{j=1}^{T} \left[ \left(\sum_{i \in I_j} g_i\right) w_j + \frac{1}{2} \left(\sum_{i \in I_j} h_i + \lambda\right) w_j^{2} \right] + \gamma T$$
Define $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$ to simplify the expression, with the following meaning:
  • $G_j$: the sum of the first-order partial derivatives of the samples contained in leaf node $j$; once the tree structure is fixed it is a constant;
  • $H_j$: the sum of the second-order partial derivatives of the samples contained in leaf node $j$; likewise a constant.
Substituting $G_j$ and $H_j$ into the objective above, we obtain the final expression of the loss function:
$$Obj^{(t)} = \sum_{j=1}^{T} \left[ G_j w_j + \frac{1}{2} \left(H_j + \lambda\right) w_j^{2} \right] + \gamma T$$
For each leaf node $j$ we can separate its part of the objective:
$$f(w_j) = G_j w_j + \frac{1}{2} \left(H_j + \lambda\right) w_j^{2}$$
From the previous derivation we know that $G_j$ and $H_j$ can be computed from the first $t-1$ trees, so this is a univariate quadratic function of the leaf weight $w_j$ alone. Moreover, the leaf nodes are independent of one another, so when each leaf attains its optimum the whole loss function attains its optimum as well. When the tree structure is fixed, setting the derivative of the expression above to zero (equivalently, using the vertex formula for a univariate quadratic) gives the optimal point, and substituting it back into the final XGBoost loss gives the optimal value:
$$w_j^{*} = -\frac{G_j}{H_j + \lambda}, \qquad Obj = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^{2}}{H_j + \lambda} + \gamma T$$
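As a quick numeric illustration (values made up): if a leaf collects $G_j = 2.4$ and $H_j = 1.6$ and we use $\lambda = 1$, then $w_j^{*} = -2.4 / (1.6 + 1) \approx -0.92$, and that leaf contributes $-\frac{1}{2} \cdot 2.4^{2} / 2.6 \approx -1.11$ to the objective (plus $\gamma$ for being a leaf).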
The above is the complete derivation process of the XGBoost loss function. The basic framework is still the GBDT framework, but the biggest feature of XGBoost is that it uses the second derivative information of the loss function, aiming to more accurately approximate the real loss.
The following figure shows the calculation of leaf node weights given in the XGBoost paper:
[Figure: calculation of the leaf node weights as given in the XGBoost paper]
By using second-order derivative information, XGBoost's optimization more closely approximates the true loss. The node-splitting procedure itself is essentially the same as that of a CART tree; what differs is how the information gain of a split is computed.
Assume the model splits a node on some feature into a left child $L$ and a right child $R$, with $G_L = \sum_{i \in I_L} g_i$, $H_L = \sum_{i \in I_L} h_i$ and analogously $G_R$, $H_R$. The objective contributed by this node before splitting (keeping it as a single leaf) can be written as:
$$Obj_{\text{before}} = -\frac{1}{2} \cdot \frac{\left(G_L + G_R\right)^{2}}{H_L + H_R + \lambda} + \gamma$$
The objective function after splitting is:
$$Obj_{\text{after}} = -\frac{1}{2} \left[ \frac{G_L^{2}}{H_L + \lambda} + \frac{G_R^{2}}{H_R + \lambda} \right] + 2\gamma$$
Then the gain of the split with respect to the objective is:
$$Gain = Obj_{\text{before}} - Obj_{\text{after}} = \frac{1}{2} \left[ \frac{G_L^{2}}{H_L + \lambda} + \frac{G_R^{2}}{H_R + \lambda} - \frac{\left(G_L + G_R\right)^{2}}{H_L + H_R + \lambda} \right] - \gamma$$
If $Gain > 0$, i.e. the objective decreases after the node is split into two leaves, the split is worth keeping. In practice, we need to traverse all features (and their candidate split points) to find the best splitting feature and split value.
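To make the gain formula concrete, here is a minimal NumPy sketch of the split-gain computation; the function and variable names are purely illustrative, not part of any library:

import numpy as np

def leaf_score(G, H, lam):
    """Contribution of one leaf to the objective: -0.5 * G^2 / (H + lambda)."""
    return -0.5 * G ** 2 / (H + lam)

def split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0):
    """Gain of splitting a node into left/right children, per the formula above."""
    G_L, H_L = np.sum(g_left), np.sum(h_left)
    G_R, H_R = np.sum(g_right), np.sum(h_right)
    before = leaf_score(G_L + G_R, H_L + H_R, lam)          # node kept as a single leaf
    after = leaf_score(G_L, H_L, lam) + leaf_score(G_R, H_R, lam)
    return (before - after) - gamma                          # positive gain => split is worthwhile

# toy example with made-up gradients/hessians
g = np.array([-0.4, 0.6, -0.8, 0.3])
h = np.array([0.24, 0.24, 0.16, 0.21])
print(split_gain(g[:2], h[:2], g[2:], h[2:]))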
This is the core mathematical derivation part of the XGBoost model. The complete XGBoost model has many engineering details, which will not be elaborated here. It is best for readers to carefully study the XGBoost paper. The complete XGBoost derivation diagram is as follows.
[Figure: the complete XGBoost derivation diagram]
XGBoost Algorithm Implementation
With the experience of implementing the GBDT algorithm, implementing XGBoost is not too difficult. Most of the underlying code is similar; the main changes are in the calculation of information gain, the calculation of leaf scores, and the use of the second-derivative information of the loss function. First, the code framework:

[Figure: code framework of the XGBoost implementation]

The definitions of the underlying decision tree nodes and of the base tree model are essentially unchanged, so they are not listed here; refer to Lecture 15 on the implementation of GBDT. The main task is to inherit from the underlying tree nodes and base tree in order to define the fitting method of a single XGBoost tree and of the XGBoost tree ensemble.
Define the XGBoost single tree model as follows. Note that the true labels y and the current predictions y_pred are concatenated column-wise into one matrix so that the base Tree class, which accepts only a single target matrix, can be reused; the _split helper separates them again:
class XGBoostTree(Tree):
    # Node splitting
    def _split(self, y):
        col = int(np.shape(y)[1] / 2)
        y, y_pred = y[:, :col], y[:, col:]
        return y, y_pred

    # Information gain calculation formula
    def _gain(self, y, y_pred):
        Gradient = np.power((y * self.loss.gradient(y, y_pred)).sum(), 2)
        Hessian = self.loss.hess(y, y_pred).sum()
        return 0.5 * (Gradient / Hessian)

    # Tree splitting gain calculation
    def _gain_by_taylor(self, y, y1, y2):
        # Node splitting
        y, y_pred = self._split(y)
        y1, y1_pred = self._split(y1)
        y2, y2_pred = self._split(y2)

        true_gain = self._gain(y1, y1_pred)
        false_gain = self._gain(y2, y2_pred)
        gain = self._gain(y, y_pred)
        return true_gain + false_gain - gain

    # Optimal weight for leaf nodes
    def _approximate_update(self, y):
        # y is split into y, y_pred
        y, y_pred = self._split(y)
        # Newton's method
        gradient = np.sum(y * self.loss.gradient(y, y_pred), axis=0)
        hessian = np.sum(self.loss.hess(y, y_pred), axis=0)
        update_approximation = gradient / hessian

        return update_approximation

    # Tree fitting method
    def fit(self, X, y):
        self._impurity_calculation = self._gain_by_taylor
        self._leaf_value_calculation = self._approximate_update
        super(XGBoostTree, self).fit(X, y)
Then define the XGBoost model based on the forward stepwise algorithm:
class XGBoost(object):

    def __init__(self, n_estimators=200, learning_rate=0.001, min_samples_split=2,
                 min_impurity=1e-7, max_depth=2):
        self.n_estimators = n_estimators            # Number of trees
        self.learning_rate = learning_rate          # Step size for weight updates
        self.min_samples_split = min_samples_split  # Minimum number of samples required to split
        self.min_impurity = min_impurity            # Minimum gain required to continue splitting
        self.max_depth = max_depth                  # Maximum depth of a tree

        # Cross-entropy (log) loss
        self.loss = LogisticLoss()

        # Initialize the trees of the additive model
        self.estimators = []
        for _ in range(n_estimators):
            tree = XGBoostTree(
                    min_samples_split=self.min_samples_split,
                    min_impurity=min_impurity,
                    max_depth=self.max_depth,
                    loss=self.loss)
            self.estimators.append(tree)
    def fit(self, X, y):
        # One-hot encode the labels
        y = to_categorical(y)

        y_pred = np.zeros(np.shape(y))
        # Forward stepwise training: fit each tree to the current predictions
        for i in range(self.n_estimators):
            tree = self.estimators[i]
            y_and_pred = np.concatenate((y, y_pred), axis=1)
            tree.fit(X, y_and_pred)
            update_pred = tree.predict(X)
            y_pred -= np.multiply(self.learning_rate, update_pred)

    def predict(self, X):
        y_pred = None
        # Prediction: accumulate the (scaled) outputs of all trees
        for tree in self.estimators:
            update_pred = tree.predict(X)
            if y_pred is None:
                y_pred = np.zeros_like(update_pred)
            y_pred -= np.multiply(self.learning_rate, update_pred)

        # Turn raw scores into probabilities (softmax)
        y_pred = np.exp(y_pred) / np.sum(np.exp(y_pred), axis=1, keepdims=True)
        # Convert probability predictions to labels
        y_pred = np.argmax(y_pred, axis=1)
        return y_pred
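The classes above rely on a LogisticLoss object exposing gradient and hess methods and on a to_categorical one-hot encoder, both carried over from the earlier lectures' utilities. If you do not have them at hand, a minimal stand-in compatible with the calls above could look like the following sketch (this is only meant to make the code self-contained, not the lecture's exact implementation):

import numpy as np

class Sigmoid:
    def __call__(self, x):
        return 1.0 / (1.0 + np.exp(-x))

    def gradient(self, x):
        p = self.__call__(x)
        return p * (1.0 - p)

class LogisticLoss:
    """Log loss; gradient/hess are taken w.r.t. the raw (pre-sigmoid) predictions."""
    def __init__(self):
        self.sigmoid = Sigmoid()

    def gradient(self, y, y_pred):
        # First derivative: predicted probability minus the one-hot target
        return -(y - self.sigmoid(y_pred))

    def hess(self, y, y_pred):
        # Second derivative: p * (1 - p)
        p = self.sigmoid(y_pred)
        return p * (1.0 - p)

def to_categorical(x, n_classes=None):
    """One-hot encode an integer label vector."""
    if n_classes is None:
        n_classes = np.max(x) + 1
    one_hot = np.zeros((x.shape[0], n_classes))
    one_hot[np.arange(x.shape[0]), x] = 1.0
    return one_hot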
Using sklearn data as an example:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

data = datasets.load_iris()
X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=2)

clf = XGBoost()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
Accuracy: 0.9666666666666667
XGBoost is also available as a native library that we can call directly. You can install xgboost via pip:
pip install xgboost
Similarly, test using the sklearn dataset:
import xgboost as xgb
from xgboost import plot_importance
from matplotlib import pyplot as plt
# Set model parameters
params = {
    'booster': 'gbtree',
    'objective': 'multi:softmax',
    'num_class': 3,
    'gamma': 0.1,
    'max_depth': 2,
    'lambda': 2,
    'subsample': 0.7,
    'colsample_bytree': 0.7,
    'min_child_weight': 3,
    'silent': 1,
    'eta': 0.001,
    'seed': 1000,
    'nthread': 4,
}
plst = list(params.items())

dtrain = xgb.DMatrix(X_train, y_train)
num_rounds = 200
model = xgb.train(plst, dtrain, num_rounds)

# Predict on the test set
dtest = xgb.DMatrix(X_test)
y_pred = model.predict(dtest)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Plot feature importance
plot_importance(model)
plt.show()
Accuracy: 0.9166666666666666
[Figure: feature importance plot produced by plot_importance]
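Besides the native DMatrix interface used above, the xgboost package also ships a scikit-learn style wrapper, which is often more convenient. A minimal usage sketch on the same iris split as above (the hyperparameter values here are illustrative only):

from xgboost import XGBClassifier

# scikit-learn style interface: fit/predict directly on numpy arrays
clf = XGBClassifier(n_estimators=200, max_depth=2, learning_rate=0.1)
clf.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, clf.predict(X_test)))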
