Official WeChat account of Tsinghua Big Data Software Team
Source: Zhihu
This article is about 5,000 words long; a 5-minute read is recommended.
This article systematically summarizes decision trees, random forests, GBDT, and XGBoost.
1、Decision Tree

A decision tree is a supervised classification model. In essence, it repeatedly selects the feature whose split yields the largest information gain and partitions the input on it, until a stopping condition is met or the purity of the leaf nodes reaches a threshold. The diagram below is an example of a decision tree.

According to the splitting criterion, decision trees can be divided into the ID3, C4.5, and CART algorithms.

1. ID3 Algorithm: Selects the Optimal Partition Attribute Based on Information Gain

Information gain is computed from information entropy, a measure of the purity of a sample set: the smaller the entropy, the purer the dataset. Suppose a decision tree is built on dataset D containing $|\mathcal{Y}|$ classes, and let $p_k$ in formula (1) be the proportion of samples of class k in D:

$$\mathrm{Ent}(D) = -\sum_{k=1}^{|\mathcal{Y}|} p_k \log_2 p_k \qquad (1)$$

Formula (2) gives the information entropy that remains after splitting on attribute A, where $D_i$ is the set of samples in the i-th of the n branches formed by A; it is the weighted total entropy of those n branches:

$$\sum_{i=1}^{n} \frac{|D_i|}{|D|}\,\mathrm{Ent}(D_i) \qquad (2)$$

Formula (3) is the difference between the entropy before and after splitting on A, i.e., the information gain; the larger, the better:

$$\mathrm{Gain}(D, A) = \mathrm{Ent}(D) - \sum_{i=1}^{n} \frac{|D_i|}{|D|}\,\mathrm{Ent}(D_i) \qquad (3)$$

Suppose every record has an 'ID' attribute. If we split on ID, the attribute has as many distinct values as there are samples, so every leaf contains only one sample: purity is maximal and the information gain is the largest possible, yet the resulting tree is meaningless. In other words, ID3 is biased toward attributes with many distinct values. To reduce this effect, the C4.5 algorithm was proposed.

2. C4.5: Selects the Optimal Partition Attribute Based on Information Gain Ratio

The information gain ratio introduces a penalty term, known as the split information, to penalize attributes with many values:

$$\mathrm{Gain\_ratio}(D, A) = \frac{\mathrm{Gain}(D, A)}{\mathrm{IV}(A)}, \qquad \mathrm{IV}(A) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|}\log_2\frac{|D_i|}{|D|}$$

IV(A) is determined by the number of distinct values of attribute A: the more values, the larger IV(A) and the smaller the gain ratio, which counteracts the model's preference for attributes with many values. However, splitting by this rule alone would make the model prefer attributes with few values. Therefore, C4.5 first identifies the candidate attributes whose information gain is above average, and then selects among them the one with the highest gain ratio.

For continuous attributes the number of possible values is no longer limited, so a discretization technique such as bi-partitioning is used: the attribute values are sorted in ascending order, the midpoints between adjacent values are taken as candidate split points, samples smaller than the split point go to the left subtree and the rest to the right subtree, and the candidate with the highest information gain ratio is chosen for splitting.
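To make formulas (1)-(3) and the gain ratio concrete, here is a minimal Python sketch; the function names and the toy "outlook/play" data are our own illustrations, not from the original article.

```python
import math
from collections import Counter

def entropy(labels):
    """Ent(D) = -sum_k p_k * log2(p_k), formula (1)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Gain(D, A): entropy before splitting minus the weighted entropy
    of the branches formed by each value of attribute A, formulas (2)-(3)."""
    n = len(labels)
    branches = {}
    for v, y in zip(values, labels):
        branches.setdefault(v, []).append(y)
    conditional = sum(len(ys) / n * entropy(ys) for ys in branches.values())
    return entropy(labels) - conditional

def gain_ratio(values, labels):
    """C4.5 criterion: Gain(D, A) / IV(A); IV(A) penalises attributes
    with many distinct values."""
    n = len(labels)
    iv = -sum((c / n) * math.log2(c / n) for c in Counter(values).values())
    return information_gain(values, labels) / iv if iv > 0 else 0.0

# Toy example: one categorical attribute vs. a binary label.
outlook = ["sunny", "sunny", "overcast", "rain", "rain", "overcast"]
play    = ["no",    "no",    "yes",      "yes",  "no",   "yes"]
print(information_gain(outlook, play), gain_ratio(outlook, play))
```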
3. CART: Selects the Optimal Partition Attribute Based on the Gini Index; Can Be Used for Classification and Regression

CART is a binary tree built by binary splitting: each split divides the data into two parts that go into the left and right subtrees. Every non-leaf node has exactly two children, so a CART tree has one more leaf node than it has non-leaf nodes.

Compared to ID3 and C4.5, CART is more widely applicable because it can be used for both classification and regression. For classification, CART chooses splits by the Gini index, which, like information entropy, describes purity; each split aims to reduce the Gini index as much as possible. With $p_k$ the proportion of class k in dataset D,

$$\mathrm{Gini}(D) = 1 - \sum_{k=1}^{|\mathcal{Y}|} p_k^2, \qquad \mathrm{Gini\_index}(D, A) = \sum_{i=1}^{n} \frac{|D_i|}{|D|}\,\mathrm{Gini}(D_i)$$

Gini(D) reflects the purity of dataset D: the smaller the value, the higher the purity. From the candidate set we select the attribute that minimizes the Gini index after partitioning as the optimal partition attribute.

Classification Tree and Regression Tree

First, classification trees. ID3 and C4.5 exhaustively try every threshold of every feature at each branch and pick the split "feature <= threshold vs. feature > threshold" that maximizes the information gain (equivalently, minimizes the weighted entropy of the two resulting branches). The chosen split creates two new nodes, and the same procedure is applied recursively until every sample in a leaf has the same label (say, the same gender in a gender-prediction task) or a preset stopping condition is met. If a final leaf node is not pure, the majority label is assigned to it.

A regression tree works in much the same way, except that every node (not only the leaves) carries a predicted value; for example, the predicted age at a node is the average age of all samples in that node. When branching, every threshold of every feature is again tried exhaustively, but the criterion is minimum mean squared error: the sum over the node's samples of (actual age − predicted age)^2, divided by N. Intuitively, the more samples whose ages are predicted wrongly, and the further off those predictions are, the larger the mean squared error, so minimizing it yields the most reliable split. Branching continues until every sample in a leaf has the same age (which is very hard to achieve) or a preset stopping condition, such as a limit on the number of leaves, is reached. If a final leaf is not pure, the average age of its samples is used as that leaf's predicted age.
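A small sketch of the two CART-style criteria just described: the Gini index for classification and the minimum-squared-error split search of a regression tree. The function names and the toy "age" data are made up for illustration.

```python
import numpy as np

def gini(labels):
    """Gini(D) = 1 - sum_k p_k^2; smaller means purer."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_mse_split(x, y):
    """Regression-tree style split search on one numeric feature:
    try the midpoints between sorted values and keep the threshold
    that minimises the total squared error of the two children."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    best_t, best_err = None, np.inf
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue
        t = (x[i] + x[i - 1]) / 2.0
        left, right = y[:i], y[i:]
        err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if err < best_err:
            best_t, best_err = t, err
    return best_t, best_err

ages = np.array([12.0, 14.0, 15.0, 25.0, 30.0, 35.0])
hours_online = np.array([6.0, 5.5, 5.0, 2.0, 1.5, 1.0])
print(best_mse_split(hours_online, ages))      # best threshold and its squared error
print(gini(["A", "A", "B", "B", "B"]))         # impurity of a small label set
```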
2、Random Forest

First, a word on ensemble classifiers: an ensemble combines the results of multiple classifiers, by voting or averaging, to produce the final prediction.

1. Benefits of Building Ensemble Classifiers:

(1) Improved accuracy: combining the classification results of several models yields a more reasonable decision boundary, reducing overall error and giving better classification results.

(2) Handling very large or very small datasets: a large dataset can be divided into subsets, with one classifier built per subset; a small dataset can be bootstrap-sampled into several different datasets, each used to build a classifier.

(3) Complex decision boundaries: if the true boundary is too complex for a single linear model to describe, several linear classifiers can be trained on different regions of the data and then combined.

(4) Better suitability for multi-source heterogeneous data (different storage formats such as relational and non-relational, and different data types such as time series, discrete, continuous, and network-structured data).

A random forest is an ensemble of decision trees, and its randomness shows up in two places: random selection of data and random selection of features.

(1) Random selection of data

First, a bootstrap sample is drawn from the original dataset to build a subset with the same number of elements as the original dataset; elements may be repeated within a subset and across different subsets. Each subset is used to build one sub-decision tree. When new data needs to be classified by the random forest, it is fed into every sub-decision tree, each tree outputs a result, and the forest's output is decided by voting. As in the figure below, if the forest contains 3 sub-decision trees and 2 of them classify a sample as class A while 1 classifies it as class B, the random forest's result is class A.

(2) Random selection of candidate features

Similarly to the data, during the splitting process of each subtree not all candidate features are considered: a certain number of features is drawn at random from the full candidate set, and the optimal feature is chosen only among those. This makes the trees in the forest differ from one another, increasing the diversity of the system and thereby improving classification performance.

Example diagram of ensemble trees
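As a usage sketch (assuming scikit-learn is available, on a synthetic dataset), the two kinds of randomness described above correspond directly to the bootstrap and max_features options of RandomForestClassifier.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,      # number of sub-decision trees that vote
    bootstrap=True,        # each tree sees a bootstrap sample of the rows
    max_features="sqrt",   # each split considers a random subset of the features
    random_state=0,
)
forest.fit(X_tr, y_tr)
print(forest.score(X_te, y_te))   # accuracy of the majority vote of the trees
```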
3、GBDT and XGBoost

1. Bagging and Boosting

Before discussing GBDT and XGBoost, some background on Bagging and Boosting.

Bagging is a parallel learning algorithm with a simple idea: each round, a dataset of the same size as the original is sampled with replacement under a uniform distribution (so a sample point can appear multiple times), a classifier is built on each generated dataset, and the classifiers are then combined.

Boosting uses a different sample distribution in each round: at every iteration, the weights of the samples misclassified in the previous round are increased, so that later models pay more attention to the hard-to-classify samples. This continuous, self-correcting learning process is the essence of the Boosting idea. After the iterations, the base classifiers from each round are combined; how to adjust the sample weights and how to combine the classifiers are the key design questions.

Structure diagram of Boosting algorithm

Take the famous Adaboost algorithm as an example. Given a dataset of N samples, each with its original label, the sample weights are initialized to 1/N. In each round, the classification error rate of the current model is computed on the weighted data, and the model's coefficient is derived from that error rate (for a base classifier with error rate ε the coefficient is α = ½ ln((1 − ε)/ε)). Based on the model's predictions, the data distribution is then updated: misclassified samples get a higher probability of being sampled, so the model focuses on them in the next iteration. The final classifier is the weighted combination of all base classifiers.

2. GBDT

GBDT is a gradient boosting algorithm that uses decision trees (CART) as base learners; its trees are fitted iteratively and are regression trees rather than classification trees. "Boost" means enhancement: Boosting algorithms are iterative procedures in which each new round of training improves on the previous result, as the Adaboost discussion above illustrates.

The core of GBDT is that each tree learns the residual of the conclusions of all previous trees, i.e., the amount that must be added to the current prediction to reach the true value. For example, if A's true age is 18 and the first tree predicts 12, the residual is 6, so in the second tree A's target is set to 6. If the second tree indeed places A in the leaf worth 6, the accumulated conclusions of the two trees equal A's true age; if the second tree outputs 5, a residual of 1 remains, and in the third tree A's target is set to 1 for further learning.
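A bare-bones sketch of this residual-fitting idea (with squared loss, the residual is exactly the negative gradient). It assumes scikit-learn's DecisionTreeRegressor as the base learner and uses a made-up toy "age" dataset; it illustrates the idea rather than reproducing GBDT's full algorithm.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) * 10 + 18 + rng.normal(0, 0.5, size=200)   # hypothetical "ages"

learning_rate = 0.1
prediction = np.full_like(y, y.mean())   # start from the mean age
trees = []

for _ in range(100):
    residual = y - prediction                        # what previous trees got wrong
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    prediction += learning_rate * tree.predict(X)    # add a small correction
    trees.append(tree)

print(np.mean((y - prediction) ** 2))   # training MSE shrinks as trees are added
```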
3. XGBoost

XGBoost takes the numerical optimization further than GBDT; most importantly, it treats the loss function (the error between predicted and true values) more carefully. As in GBDT, the model's prediction is still the sum of the predictions of all the trees.

The loss function is expanded using both first- and second-order derivatives, as shown below. A good model needs two basic qualities: good accuracy (a good fit) and simplicity (complex models tend to overfit and are less stable). Accordingly, the objective we construct has two terms: an error term measuring the model's fit, and a regularization term penalizing model complexity:

$$\mathrm{Obj} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{k} \Omega(f_k)$$

Common error terms include the squared error and the logistic loss; common regularization terms include L1 and L2 regularization, where L1 sums the absolute values of the model parameters and L2 sums their squares.

In each iteration a new tree $f_t$ is added to fit the residual between the predictions of the previous trees and the true values, i.e., $\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i)$; the circled part of the objective in the figure is exactly this residual term.

First, expand the training error. XGBoost performs a second-order Taylor expansion of the loss around the previous prediction, using both the first derivative $g_i = \partial_{\hat{y}^{(t-1)}}\, l(y_i, \hat{y}_i^{(t-1)})$ and the second derivative $h_i = \partial^2_{\hat{y}^{(t-1)}}\, l(y_i, \hat{y}_i^{(t-1)})$:

$$\mathrm{Obj}^{(t)} \approx \sum_{i=1}^{n}\left[g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)\right] + \Omega(f_t) + \text{const}$$

Next, the regularization term in the objective. The complexity of a tree can be measured by its number of branches, which is captured by the number of leaf nodes, so the complexity is written as

$$\Omega(f_t) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2,$$

where the first term counts the leaf nodes T and the second is the L2 regularization of the leaf weights w, which keeps the leaves from growing excessive.

At this point, each iteration effectively adds one tree to the existing model. In the objective a tree is written as $f_t(x) = w_{q(x)}$, which encodes both the tree structure and the leaf weights: w is the vector of leaf weights (the predicted scores) and q maps each sample to a leaf index (the tree structure). Grouping samples by leaf, with $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$, taking the derivative of the final objective with respect to each leaf weight $w_j$ and substituting the optimum back in gives

$$w_j^* = -\frac{G_j}{H_j + \lambda}, \qquad \mathrm{Obj}^* = -\frac{1}{2}\sum_{j=1}^{T}\frac{G_j^2}{H_j + \lambda} + \gamma T,$$

so the value of the objective is determined entirely by these per-leaf gradient statistics. The iterations of XGBoost therefore select the optimal split point by the gain

$$\mathrm{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}\right] - \gamma.$$

So how do we grow an excellent ensemble tree?

One method is the greedy algorithm: traverse all features within a node, compute the gain of every candidate split with the formula above, and split at the point with the highest gain. The penalty γ for each new leaf corresponds to pruning: when the gain falls below the threshold, the split is not made. This exact method does not suit large datasets, so an approximate algorithm is needed.

The other method: instead of enumerating every feature value when searching for split points, XGBoost aggregates statistics over the feature values, builds a histogram of their distribution, and forms several buckets of equal weight; the bucket boundaries are then used as candidate split points. In the approximate-algorithm formula, the values of feature k are sorted and the rank function

$$r_k(z) = \frac{\sum_{(x, h) \in D_k,\ x < z} h}{\sum_{(x, h) \in D_k} h}$$

gives the proportion of the total weight of feature k carried by values smaller than z, indicating how important those values are. The feature values are divided into buckets of roughly equal proportion, and the bucket boundaries form the candidate split set; the selection condition is that the rank difference between two adjacent candidate split points is smaller than a threshold ε (so there are roughly 1/ε buckets).

Based on the above, we can summarize the innovations of XGBoost compared to GBDT (a parameter-level sketch follows this list):

1. Traditional GBDT uses CART as the base learner, while XGBoost also supports linear base learners, in which case it is equivalent to logistic regression (for classification) or linear regression (for regression) with L1 and L2 regularization terms.

2. Traditional GBDT uses only first-order derivative information during optimization, while XGBoost performs a second-order Taylor expansion of the loss, using both first and second derivatives. The XGBoost tool also supports custom loss functions, as long as they are twice differentiable.

3. XGBoost adds a regularization term to the objective to control model complexity: the number of leaf nodes and the squared L2 norm of the leaf output scores. From the bias-variance tradeoff point of view, the regularization term reduces the model's variance, yielding a simpler model that is less prone to overfitting, which is another advantage over traditional GBDT.

4. Shrinkage, i.e., the learning rate (eta in XGBoost). Each newly added tree is scaled by a coefficient smaller than 1, slowing down the optimization; approaching the optimal model in smaller steps makes overfitting easier to avoid than taking large steps.

5. Column subsampling. Borrowing from random forests, XGBoost supports column subsampling (not all features are used for each tree), which both reduces overfitting and reduces computation, another feature that distinguishes it from traditional GBDT.

6. Ignoring missing values: when searching for split points, samples with a missing value for the feature are not included in the statistics; only the non-missing feature values are traversed, which reduces the cost of split finding for sparse, discrete features.

7. A default direction for missing values: a branch direction can be learned for missing (or specified) values. For completeness, both cases are tried, assigning samples with missing feature values to the left and to the right child, and the direction with the larger gain becomes the default. This greatly improves the efficiency of the algorithm.

8. Parallelism: before training, each feature is pre-sorted and the candidate split points are stored in a block structure that is reused in later iterations, greatly reducing computation. When a node is split, the gain of every feature must be computed and the feature with the maximum gain chosen, so the gain computations for the different features can be performed in parallel with multiple threads to find the optimal split point across feature attributes.
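As a hedged illustration of how several of the points above surface in practice, the sketch below uses the xgboost Python package on a synthetic dataset; the parameter values are arbitrary and only meant to show which options correspond to which point.

```python
import numpy as np
import xgboost as xgb
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=1000, n_features=20, random_state=0)
X[::7, 3] = np.nan            # missing values get a learned default direction (points 6-7)

model = xgb.XGBRegressor(
    n_estimators=200,
    learning_rate=0.1,        # eta / shrinkage (point 4)
    max_depth=4,
    subsample=0.8,            # row subsampling
    colsample_bytree=0.8,     # column subsampling borrowed from random forests (point 5)
    reg_lambda=1.0,           # L2 penalty on leaf weights (point 3)
    gamma=0.1,                # minimum gain to split, i.e. the new-leaf penalty
    tree_method="hist",       # histogram-based approximate split finding
)
model.fit(X, y)
print(model.predict(X[:5]))
```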
Editor: Wang Jing