This article is a part of Chapter 10 of the book “Introduction to Machine Learning Basics” (by Huang Haiguang).
10.5 XGBoost Algorithm
XGBoost is a machine learning algorithm based on the Gradient Boosting Decision Tree (GBDT), created in February 2014 by Chen Tianqi, then a PhD student at the University of Washington. The algorithm combines excellent learning performance with high training efficiency, which has made it shine in data science competitions.
XGBoost is an open-source toolkit for large-scale, parallel boosted trees that is significantly faster than other common boosting-tree toolkits. Both XGBoost and GBDT are boosting methods; aside from differences in engineering implementation and problem-solving details, the main difference lies in the definition of the objective function.
10.5.1 Algorithm Insights
The principles of the XGBoost algorithm mainly include the following aspects:
1. Prevent Overfitting
The generalization error of machine learning algorithms can be divided into bias and variance. Bias refers to the degree of deviation between the expected prediction of the algorithm and the true prediction, reflecting the model’s fitting ability. Variance measures how changes in the training set of the same size lead to changes in learning performance, characterizing the impact of data perturbations.
As shown in Figure 10-9, as the complexity of the machine learning model increases, bias decreases, but variance increases. When the model is simpler, the model’s variance is low, but bias tends to be higher. In other words, high variance leads to overfitting, while high bias leads to underfitting.
XGBoost addresses overfitting by penalizing the weights of the leaf nodes. The penalty is equivalent to adding a regularization term, so the objective function becomes: training loss plus regularization term.
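As a practical illustration (a minimal sketch, not from the book; the data and parameter values are made up), the xgboost library exposes this regularization through `reg_lambda` (the L2 penalty on leaf weights) and `gamma` (the minimum loss reduction required to add a new leaf):

```python
import numpy as np
import xgboost as xgb

# Toy data; the point is only the regularization settings.
X = np.random.rand(500, 5)
y = X[:, 0] + 0.1 * np.random.randn(500)

model = xgb.XGBRegressor(
    n_estimators=200,
    max_depth=4,
    learning_rate=0.1,
    reg_lambda=1.0,   # L2 penalty on leaf weights
    gamma=0.1,        # penalty for adding a leaf (minimum split loss)
)
model.fit(X, y)
```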
2. Use Second-Order Taylor Expansion to Accelerate Convergence
GBDT optimizes its loss function with first-order information, i.e., gradient descent. XGBoost instead adopts Newton's method, a second-order approach: it performs a second-order Taylor expansion of the loss function and uses the first two derivative terms as the improved residual. Gradient descent uses only the first derivative of the objective function, whereas Newton's method also uses the second derivative, which captures how the gradient itself changes and therefore makes convergence easier. Figure 10-10 shows the iterative paths of Newton's method (left) and gradient descent (right); the left path follows the optimal descent direction more closely. This use of second-order information is the reason for XGBoost's accelerated convergence.
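A minimal sketch (not from the book) of the difference on a one-dimensional quadratic function: gradient descent uses only the first derivative and approaches the minimum gradually, while Newton's method also uses the second derivative and lands on the minimum of this quadratic in a single step.

```python
# Compare gradient descent and Newton's method on f(x) = (x - 3)^2.
def f_prime(x):
    return 2.0 * (x - 3.0)      # first derivative

def f_double_prime(x):
    return 2.0                  # second derivative

x_gd, x_newton, lr = 0.0, 0.0, 0.1
for _ in range(10):
    x_gd -= lr * f_prime(x_gd)                                  # gradient step
    x_newton -= f_prime(x_newton) / f_double_prime(x_newton)    # Newton step

print(x_gd)      # still approaching 3
print(x_newton)  # exactly 3 after the first iteration
```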
3. Use Derivative Statistics for Tree Construction Splitting Conditions
When searching for the best split point, XGBoost implements the exact greedy algorithm: it traverses all possible split points on a feature, calculates the corresponding loss reduction, and selects the optimal split point. The importance of a feature is determined by the total number of times it is used for splitting across all trees: after the gains are calculated, the feature with the highest gain is selected for the split, and its importance count is incremented by 1.
4. Support Parallel Computing with Multi-threading Optimization
XGBoost’s parallelism does not mean that each tree is trained in parallel. XGBoost still follows the boosting philosophy, so each tree can only be trained after the previous tree has finished. The parallelism in XGBoost is parallelism across features: before training, each feature is pre-sorted by its values and stored in a block structure, which can be reused when searching for feature split points. With features stored in blocks, multi-threading can be used to search for the best split point of each feature in parallel.
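As an illustration (a minimal sketch, not from the book; the data and parameter values are arbitrary), the number of threads used for this feature-parallel split search can be set through the `nthread` parameter of xgboost's native interface, with `tree_method="exact"` selecting the exact greedy algorithm:

```python
import numpy as np
import xgboost as xgb

# Toy data; the point is only the threading and tree-method settings.
X = np.random.rand(1000, 10)
y = X[:, 0] - X[:, 1] + 0.1 * np.random.randn(1000)
dtrain = xgb.DMatrix(X, label=y)

params = {
    "tree_method": "exact",   # exact greedy split finding
    "nthread": 4,             # threads used to scan features in parallel
    "max_depth": 4,
    "eta": 0.1,
}
booster = xgb.train(params, dtrain, num_boost_round=50)
```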
10.5.2 Derivation of XGBoost Algorithm
The main base learner in XGBoost is CART (Classification and Regression Trees). The model predicts with an additive ensemble of trees, $\hat{y}_i = \sum_{k=1}^{K} f_k(x_i)$, where each tree $f_k(x) = w_{q(x)}$ is described by its vector of leaf weights $w$ and a function $q$ that assigns each sample to a leaf, and $K$ is the number of trees; the trees built in earlier rounds are fixed, so their complexity enters the objective only as a constant. Assuming the model optimized in the previous iteration gives predictions $\hat{y}_i^{(t-1)}$, the objective function at step $t$ is:

$$Obj^{(t)} = \sum_{i=1}^{n} l\left(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \sum_{k=1}^{t} \Omega(f_k) \tag{10.14}$$
In formula (10.14), the first part is the training error between the predicted values and the true target values, and the second part is the sum of the complexities of each tree.
Using the Taylor expansion formula, a second-order Taylor expansion is applied:

$$f(x + \Delta x) \approx f(x) + f'(x)\,\Delta x + \frac{1}{2} f''(x)\,\Delta x^2 \tag{10.15}$$
Regarding $\hat{y}_i^{(t-1)}$ as $x$ and $f_t(x_i)$ as $\Delta x$, we can see that:

$$Obj^{(t)} \approx \sum_{i=1}^{n} \left[ l\left(y_i,\ \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \sum_{k=1}^{t} \Omega(f_k) \tag{10.16}$$
In formula (10.16), $g_i$ is the first derivative of the loss with respect to $\hat{y}_i^{(t-1)}$:

$$g_i = \partial_{\hat{y}_i^{(t-1)}}\, l\left(y_i,\ \hat{y}_i^{(t-1)}\right) \tag{10.17}$$

and $h_i$ is the second derivative:

$$h_i = \partial^2_{\hat{y}_i^{(t-1)}}\, l\left(y_i,\ \hat{y}_i^{(t-1)}\right) \tag{10.18}$$
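As a concrete illustration (a minimal sketch assuming the squared-error loss, which the text above does not fix), the two statistics can be computed directly from the previous round's predictions:

```python
import numpy as np

# g_i and h_i for the squared-error loss l(y, y_hat) = (y - y_hat)^2,
# differentiated with respect to the previous round's prediction y_hat^(t-1).
def grad_hess_squared_error(y_true, y_pred_prev):
    g = 2.0 * (y_pred_prev - y_true)      # g_i: first derivative
    h = np.full_like(y_pred_prev, 2.0)    # h_i: second derivative (constant)
    return g, h

y_true = np.array([1.0, 0.0, 3.0])        # made-up labels
y_prev = np.array([0.5, 0.2, 2.0])        # made-up previous-round predictions
g, h = grad_hess_squared_error(y_true, y_prev)
```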
Since the complexities of the first $t-1$ trees are already fixed constants, formula (10.14) can be transformed into:

$$Obj^{(t)} \approx \sum_{i=1}^{n} \left[ l\left(y_i,\ \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) \tag{10.19}$$
Here, we first refine the definition of a tree as follows:

$$f_t(x) = w_{q(x)}, \qquad w \in \mathbb{R}^{T}, \qquad q:\ \mathbb{R}^{d} \to \{1, 2, \dots, T\} \tag{10.20}$$

where $w$ is the vector of leaf weights, $q$ maps a sample to a leaf index, and $T$ is the number of leaves.
In XGBoost, we define the complexity as:

$$\Omega(f_t) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 \tag{10.21}$$
Thus, the objective function can be written as:

$$Obj^{(t)} \approx \sum_{i=1}^{n} \left[ l\left(y_i,\ \hat{y}_i^{(t-1)}\right) + g_i w_{q(x_i)} + \frac{1}{2} h_i w_{q(x_i)}^2 \right] + \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 \tag{10.22}$$
It is evident that $l\left(y_i,\ \hat{y}_i^{(t-1)}\right)$ is a constant; removing it does not affect the optimization of the objective function, so the objective can be transformed into:

$$Obj^{(t)} = \sum_{i=1}^{n} \left[ g_i w_{q(x_i)} + \frac{1}{2} h_i w_{q(x_i)}^2 \right] + \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 \tag{10.23}$$
Let $I_j = \{\, i \mid q(x_i) = j \,\}$ be the index set of data points assigned to leaf $j$. Since all data points on the same leaf receive the same score $w_j$, the summation can be re-indexed over leaves instead of over samples; therefore, formula (10.23) is equivalent to:

$$Obj^{(t)} = \sum_{j=1}^{T} \left[ \left( \sum_{i \in I_j} g_i \right) w_j + \frac{1}{2} \left( \sum_{i \in I_j} h_i \right) w_j^2 \right] + \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 \tag{10.24}$$
That is:

$$Obj^{(t)} = \sum_{j=1}^{T} \left[ \left( \sum_{i \in I_j} g_i \right) w_j + \frac{1}{2} \left( \sum_{i \in I_j} h_i + \lambda \right) w_j^2 \right] + \gamma T \tag{10.25}$$
We can further compress the expression by defining

$$G_j = \sum_{i \in I_j} g_i, \qquad H_j = \sum_{i \in I_j} h_i \tag{10.26}$$

resulting in:

$$Obj^{(t)} = \sum_{j=1}^{T} \left[ G_j w_j + \frac{1}{2} \left( H_j + \lambda \right) w_j^2 \right] + \gamma T \tag{10.27}$$
If the tree structure (that is, the function $q$) is given, each leaf weight $w_j$ can be optimized independently. Setting the derivative of the above expression with respect to $w_j$ to zero to minimize it, we get:

$$\frac{\partial Obj^{(t)}}{\partial w_j} = G_j + \left( H_j + \lambda \right) w_j = 0 \tag{10.28}$$
Then:

$$w_j^{*} = -\frac{G_j}{H_j + \lambda} \tag{10.29}$$
And obtain the corresponding optimal value of the objective function:

$$Obj^{*} = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T \tag{10.30}$$
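The following sketch (with made-up numbers, not the values of Figure 10-11 or Figure 10-12) shows how the optimal leaf weights and the structure score of formula (10.30) are computed once the per-leaf sums $G_j$ and $H_j$ are known:

```python
import numpy as np

# Toy example: three leaves with assumed per-leaf gradient/hessian sums.
G = np.array([-2.0, 0.5, 1.5])    # G_j: sum of g_i over each leaf
H = np.array([3.0, 1.0, 2.0])     # H_j: sum of h_i over each leaf
lam, gamma = 1.0, 0.5             # lambda and gamma regularization parameters
T = len(G)                        # number of leaves

w_star = -G / (H + lam)                                   # optimal leaf weights
obj_star = -0.5 * np.sum(G**2 / (H + lam)) + gamma * T    # structure score, formula (10.30)
print(w_star, obj_star)
```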
Taking Figure 10-11 as an example:
Based on the known data, the corresponding parameters are obtained as shown in Figure 10-12:
Substituting the data in Figure 10-12 into formula (10.30) gives the structure score of this tree. The smaller this score, the better the structure of the tree.
When searching for the best split point, XGBoost uses the exact greedy algorithm: it traverses all possible split points on a feature, calculates the corresponding loss reduction, and selects the optimal split point. The procedure of the exact greedy algorithm is shown in Algorithm 10.1.
Algorithm 10.1 Exact Greedy Algorithm
Algorithm Idea:
Since the structure of the tree is unknown in advance, it is impossible to enumerate all possible tree structures. XGBoost therefore grows the tree greedily: starting from the root node, it splits nodes one by one, traversing all attributes and all candidate values of each attribute; in the resulting binary tree, the number of splits equals the number of leaf nodes minus 1. Let the sample sets assigned to the left and right subtrees be $I_L$ and $I_R$, and let $I = I_L \cup I_R$; then the loss reduction caused by splitting this node is:

$$Gain = \frac{1}{2}\left[ \frac{\left(\sum_{i \in I_L} g_i\right)^2}{\sum_{i \in I_L} h_i + \lambda} + \frac{\left(\sum_{i \in I_R} g_i\right)^2}{\sum_{i \in I_R} h_i + \lambda} - \frac{\left(\sum_{i \in I} g_i\right)^2}{\sum_{i \in I} h_i + \lambda} \right] - \gamma \tag{10.31}$$
That is:

$$Gain = \frac{1}{2}\left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{\left(G_L + G_R\right)^2}{H_L + H_R + \lambda} \right] - \gamma \tag{10.32}$$

where $G_L = \sum_{i \in I_L} g_i$, $H_L = \sum_{i \in I_L} h_i$, $G_R = \sum_{i \in I_R} g_i$, and $H_R = \sum_{i \in I_R} h_i$.
Here, $\frac{G_L^2}{H_L + \lambda}$ is the score of the left subtree after splitting, $\frac{G_R^2}{H_R + \lambda}$ is the score of the right subtree after splitting, $\frac{(G_L + G_R)^2}{H_L + H_R + \lambda}$ is the score of the node if it is not split, and $\gamma$ is the regularization cost of the new leaf introduced by the split. We aim to find the attribute and the split point that maximize this value.
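The sketch below (not the book's code; the function name and default parameter values are illustrative) scans a single feature in the exact greedy manner: sort the samples by feature value, sweep every candidate threshold while accumulating $G_L$ and $H_L$, and keep the split with the largest gain.

```python
import numpy as np

def best_split(x, g, h, lam=1.0, gamma=0.0):
    """Exact greedy search for the best threshold on one feature."""
    order = np.argsort(x)
    x, g, h = x[order], g[order], h[order]
    G, H = g.sum(), h.sum()                      # totals for the unsplit node
    G_L = H_L = 0.0
    best_gain, best_threshold = 0.0, None
    for i in range(len(x) - 1):
        G_L += g[i]
        H_L += h[i]
        if x[i] == x[i + 1]:
            continue                             # cannot split equal values
        G_R, H_R = G - G_L, H - H_L
        gain = 0.5 * (G_L**2 / (H_L + lam)
                      + G_R**2 / (H_R + lam)
                      - G**2 / (H + lam)) - gamma
        if gain > best_gain:
            best_gain = gain
            best_threshold = (x[i] + x[i + 1]) / 2.0
    return best_threshold, best_gain
```

In XGBoost this scan is repeated for every feature (and, thanks to the block structure described earlier, the features can be scanned in parallel); the feature and threshold with the largest gain are chosen for the split.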
10.5.3 Summary of XGBoost Algorithm
XGBoost is one of the most popular algorithms in algorithm competitions, pushing the optimization of GBDT to an extreme:
(1) When generating CART trees, XGBoost takes the complexity of the tree into account, whereas GBDT does not; GBDT considers tree complexity only in the pruning step.
(2) XGBoost fits the second-order Taylor expansion of the previous round's loss function, while GBDT fits only the first-order expansion. XGBoost therefore achieves higher accuracy and needs fewer iterations to reach the same training effect.
(3) Both XGBoost and GBDT improve model performance iteratively, but XGBoost can use multiple threads when selecting the best split points, which significantly improves running speed. Microsoft later released LightGBM, which further optimizes memory usage and running speed, but in terms of the algorithm itself its optimizations do not go beyond those of XGBoost.