Machine Learning
Author: louwill
Machine Learning Lab
Although deep learning is currently very popular, Boosting algorithms represented by XGBoost, LightGBM, and CatBoost still have a wide range of applications. Leaving aside unstructured data such as images, text, speech, and video, where deep learning excels, Boosting algorithms remain the first choice for structured (tabular) data with relatively few training samples. This article briefly explains the connections and differences among the three major Boosting algorithms mentioned above and compares them on a practical data case. It then introduces commonly used hyperparameter tuning methods for Boosting algorithms, including grid search, random search, and Bayesian optimization, along with corresponding code examples.
Comparison of the Three Major Boosting Algorithms
First, XGBoost, LightGBM, and CatBoost are currently classic state-of-the-art (SOTA) Boosting algorithms, all of which can be classified into the gradient boosting decision tree algorithm series. The three models are all ensemble learning frameworks based on decision trees, with XGBoost being an improvement over the original GBDT algorithm, while LightGBM and CatBoost have further optimizations based on XGBoost, each having its own advantages in terms of accuracy and speed.
This article does not discuss the detailed principles of the three models; please refer to 【Original Release】 Machine Learning Formula Derivation and Code Implementation 30 Lectures.pdf. So what are the major differences among these three Boosting algorithms? There are two main aspects. The first is that the tree construction methods of the three models are different. XGBoost uses a level-wise growth strategy for decision tree construction, LightGBM uses a leaf-wise growth strategy, while CatBoost uses a symmetric tree structure, where its decision trees are complete binary trees. The second significant difference is in the handling of categorical features. XGBoost itself does not have the capability to automatically handle categorical features; for categorical features in the data, we need to manually convert them into numerical values before inputting them into the model. In LightGBM, categorical feature names need to be specified, and the algorithm can handle them automatically; CatBoost is known for handling categorical features efficiently through target variable statistics and other feature encoding methods.
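To make the categorical-feature difference concrete, the following is a minimal sketch of how each library is typically told about categorical columns. The tiny DataFrame, the column names, and the variable names here are made up purely for illustration and are not part of the flights example below.
import pandas as pd
import xgboost as xgb
import lightgbm as lgb
import catboost as cb
# A tiny made-up DataFrame with one categorical and one numeric feature
df = pd.DataFrame({"airline": ["AA", "UA", "DL", "AA", "UA", "DL"],
                   "distance": [300, 1200, 800, 450, 950, 600],
                   "delayed": [0, 1, 0, 1, 0, 1]})
X, y = df[["airline", "distance"]], df["delayed"]
cat_cols = ["airline"]
# XGBoost: integer-encode categorical columns manually before building the DMatrix
X_xgb = X.copy()
X_xgb["airline"] = X_xgb["airline"].astype("category").cat.codes
dtrain_xgb = xgb.DMatrix(X_xgb, label=y)
# LightGBM: convert the column to pandas category dtype and declare it as a categorical feature
X_lgb = X.copy()
X_lgb["airline"] = X_lgb["airline"].astype("category")
dtrain_lgb = lgb.Dataset(X_lgb, label=y, categorical_feature=cat_cols)
# CatBoost: pass the raw string column and tell the model which columns are categorical
model_cb = cb.CatBoostClassifier(iterations=10, verbose=False)
model_cb.fit(X, y, cat_features=cat_cols)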
Next, we will use the Kaggle 2015 flight delay dataset as an example to experiment with the XGBoost, LightGBM, and CatBoost models. Figure 1 is an introduction to the flights dataset.
Figure 1 Flights Dataset
The complete dataset contains over 5 million flight records, with 31 features. For demonstration purposes, we sampled 1% of the original dataset and selected 11 features. After preprocessing, we rebuilt the training dataset, aiming to construct a binary classification model for whether a flight is delayed. The data reading and simple preprocessing process is shown in Code 1.
Code 1 Data Processing
# Import pandas and sklearn data partition module
import pandas as pd
from sklearn.model_selection import train_test_split
# Read flights dataset
flights = pd.read_csv('flights.csv')
# Sample 1% of the dataset
flights = flights.sample(frac=0.01, random_state=10)
# Feature sampling, select 11 specified features
flights = flights[["MONTH", "DAY", "DAY_OF_WEEK", "AIRLINE","FLIGHT_NUMBER","DESTINATION_AIRPORT", "ORIGIN_AIRPORT","AIR_TIME","DEPARTURE_TIME", "DISTANCE", "ARRIVAL_DELAY"]]
# Discretize labels, only delays over 10 minutes count as delayed
flights["ARRIVAL_DELAY"] = (flights["ARRIVAL_DELAY"]>10)*1
# Categorical features
cat_cols = ["AIRLINE", "FLIGHT_NUMBER", "DESTINATION_AIRPORT","ORIGIN_AIRPORT"]
# Categorical feature encoding
for item in cat_cols:
    flights[item] = flights[item].astype("category").cat.codes + 1
# Data partition
X_train, X_test, y_train, y_test = train_test_split(
    flights.drop(["ARRIVAL_DELAY"], axis=1),
    flights["ARRIVAL_DELAY"],
    random_state=10, test_size=0.3)
# Print sizes of the partitioned datasets
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
Output:
(39956, 10) (39956,) (17125, 10) (17125,)
In Code 1, we first read the original flights dataset. Since the original dataset is too large, we sampled 1%, selected 11 features, and constructed a dataset with 57081 flight records and 11 features. Then we performed simple preprocessing on the sampled dataset, first binarizing the training labels, converting delays greater than 10 minutes to 1 (delayed) and delays of 10 minutes or less to 0 (not delayed), and then encoding categorical features such as “AIRLINE”, “FLIGHT_NUMBER”, “DESTINATION_AIRPORT”, and “ORIGIN_AIRPORT”. Finally, we partitioned the dataset, resulting in 39956 training samples and 17125 testing samples.
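Since AUC will be used to evaluate the models, it can also be worth checking how imbalanced the binarized label is. A one-line check on the flights DataFrame from Code 1:
# Proportion of delayed (1) vs. not delayed (0) flights in the sampled data
print(flights["ARRIVAL_DELAY"].value_counts(normalize=True))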
XGBoost
Now we will test the performance of the three models on this dataset. First, let’s look at XGBoost, as shown in Code 2.
Code 2 XGBoost
# Import xgboost module
import xgboost as xgb
# Import model evaluation AUC function
from sklearn.metrics import roc_auc_score
# Set model hyperparameters
params = {
'booster': 'gbtree',
'objective': 'binary:logistic',
'gamma': 0.1,
'max_depth': 8,
'lambda': 2,
'subsample': 0.7,
'colsample_bytree': 0.7,
'min_child_weight': 3,
'eta': 0.001,
'seed': 1000,
'nthread': 4,
}
# Wrap xgboost dataset
dtrain = xgb.DMatrix(X_train, y_train)
# Number of training rounds, i.e., number of trees
num_rounds = 500
# Model training
model_xgb = xgb.train(params, dtrain, num_rounds)
# Predict on the test set
dtest = xgb.DMatrix(X_test)
y_pred = model_xgb.predict(dtest)
print('AUC of testset based on XGBoost: ', roc_auc_score(y_test, y_pred))
Output:
AUC of testset based on XGBoost: 0.6845368959487046
In Code 2, we tested the performance of XGBoost on the flights dataset, imported relevant modules, and set model hyperparameters. We fitted the XGBoost model based on the training set, and finally used the trained model for predictions on the test set, obtaining an AUC of 0.6845 for the test set.
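As a side note, a common refinement of Code 2 is to hold out part of the training data as a validation set and stop adding trees once the validation AUC stops improving. The following is a sketch of that idea, reusing params, num_rounds, and the data split from Codes 1 and 2; the names X_tr, X_val, dtr, dval, and model_es are new here.
from sklearn.model_selection import train_test_split
# Hold out 20% of the training data as a validation set for early stopping
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=10)
dtr = xgb.DMatrix(X_tr, y_tr)
dval = xgb.DMatrix(X_val, y_val)
# Track AUC on the validation set and stop if it has not improved for 50 rounds
es_params = dict(params, eval_metric='auc')
model_es = xgb.train(es_params, dtr, num_boost_round=num_rounds,
                     evals=[(dval, 'val')], early_stopping_rounds=50)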
LightGBM
The testing process for LightGBM on the flights dataset is shown in Code 3.
Code 3 LightGBM
# Import lightgbm module
import lightgbm as lgb
dtrain = lgb.Dataset(X_train, label=y_train)
params = {"max_depth": 5, "learning_rate" : 0.05, "num_leaves": 500, "n_estimators": 300}
# Specify categorical features
cate_features_name = ["MONTH","DAY","DAY_OF_WEEK","AIRLINE","DESTINATION_AIRPORT", "ORIGIN_AIRPORT"]
# LightGBM model fitting
model_lgb = lgb.train(params, dtrain, categorical_feature=cate_features_name)
# Predict on the test set
y_pred = model_lgb.predict(X_test)
print('AUC of testset based on LightGBM: ', roc_auc_score(y_test, y_pred))
Output:
AUC of testset based on LightGBM: 0.6873707383550387
In Code 3, we tested the performance of LightGBM on the flights dataset, imported relevant modules, and set model hyperparameters. We fitted the LightGBM model based on the training set, and finally used the trained model for predictions on the test set, obtaining an AUC of 0.6873, which is similar to XGBoost’s performance.
CatBoost
The testing process for CatBoost on the flights dataset is shown in Code 4.
Code 4 CatBoost
# Import catboost module
import catboost as cb
# Categorical feature indices
cat_features_index = [0,1,2,3,4,5,6]
# Create CatBoost model instance
model_cb = cb.CatBoostClassifier(eval_metric="AUC", one_hot_max_size=50, depth=6, iterations=300, l2_leaf_reg=1, learning_rate=0.1)
# Fit CatBoost model
model_cb.fit(X_train, y_train, cat_features=cat_features_index)
# Predict on the test set
y_pred = model_cb.predict(X_test)
print('AUC of testset based on CatBoost: ', roc_auc_score(y_test, y_pred))
Output:
AUC of testset based on CatBoost: 0.5463773041667715
In Code 4, we tested the performance of CatBoost on the flights dataset, imported relevant modules, and set model hyperparameters. We fitted the CatBoost model based on the training set and finally used the trained model for predictions on the test set, obtaining an AUC of approximately 0.546, which is significantly lower than XGBoost and LightGBM on this dataset. Table 1 shows a comprehensive comparison of the three models on the flights dataset.
From the comprehensive comparison results in Table 1, it can be seen that LightGBM outperforms both XGBoost and CatBoost in terms of both accuracy and speed. Of course, we only compared the three models directly on the dataset without further feature engineering and hyperparameter tuning; the results in Table 1 can all be optimized further.
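The speed side of such a comparison can be reproduced with a simple wall-clock timing of each training run. The following is a rough sketch: the helper function, the abbreviated parameter dictionaries, and the variable names are illustrative, and the data objects come from Codes 1 to 4.
import time
def timed(name, fit_fn):
    # Run a training function and print the elapsed wall-clock time
    start = time.time()
    fit_fn()
    print(f'{name} training time: {time.time() - start:.2f} s')
# Abbreviated parameter sets, roughly mirroring Codes 2 and 3
xgb_params = {'booster': 'gbtree', 'objective': 'binary:logistic', 'max_depth': 8, 'eta': 0.001}
lgb_params = {"max_depth": 5, "learning_rate": 0.05, "num_leaves": 500, "n_estimators": 300}
# Re-run each training step and time it
timed('XGBoost', lambda: xgb.train(xgb_params, xgb.DMatrix(X_train, y_train), 500))
timed('LightGBM', lambda: lgb.train(lgb_params, lgb.Dataset(X_train, label=y_train)))
timed('CatBoost', lambda: cb.CatBoostClassifier(iterations=300, depth=6, learning_rate=0.1,
                                                verbose=False).fit(X_train, y_train,
                                                                   cat_features=cat_features_index))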
Common Hyperparameter Tuning Methods
Machine learning models have many parameters that need to be manually set in advance, such as the batch size for training neural networks, and tree-related parameters for ensemble learning models like XGBoost. We refer to these parameters, which are not obtained through model training, as hyperparameters. The process of manually adjusting hyperparameters is what we commonly know as tuning. Common tuning methods in machine learning include grid search, random search, and Bayesian optimization.
Grid Search
Grid search is a commonly used hyperparameter tuning method, often used to optimize three or fewer hyperparameters. It is essentially an exhaustive method. For each hyperparameter, the user selects a small finite set to explore. Then, the Cartesian product of these hyperparameters generates several groups of hyperparameters. Grid search uses each group of hyperparameters to train the model and selects the hyperparameters with the smallest validation set error as the best hyperparameters.
For example, if we have three hyperparameters a, b, and c to optimize, with candidate values {1,2}, {3,4}, and {5,6}, respectively, then all possible combinations of parameter values form an 8-point 3-dimensional grid as follows: {(1,3,5),(1,3,6),(1,4,5),(1,4,6),(2,3,5),(2,3,6),(2,4,5),(2,4,6)}. Grid search traverses these 8 possible combinations of parameter values, conducting training and validation to ultimately find the optimal hyperparameters.
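This Cartesian-product grid can be enumerated directly, for example with itertools; a, b, c and their candidate values below are the made-up ones from the example above.
from itertools import product
# Candidate values for the three made-up hyperparameters a, b, c
grid = {'a': [1, 2], 'b': [3, 4], 'c': [5, 6]}
# Cartesian product: every combination of one value per hyperparameter
combinations = list(product(*grid.values()))
print(len(combinations))  # 8
print(combinations)       # [(1, 3, 5), (1, 3, 6), ..., (2, 4, 6)]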
In sklearn, grid search is implemented through the GridSearchCV class in the model_selection module, and the tuning process includes cross-validation. We will again use the aforementioned flights dataset as an example to demonstrate grid search for XGBoost.
Code 5 Grid Search
### Grid Search Example Based on XGBoost
# Import GridSearch module
from sklearn.model_selection import GridSearchCV
# Create XGBoost classifier instance
model = xgb.XGBClassifier()
# List of parameters to search
param_lst = {"max_depth": [3,5,7], "min_child_weight" : [1,3,6], "n_estimators": [100,200,300], "learning_rate": [0.01, 0.05, 0.1] }
# Create grid search
grid_search = GridSearchCV(model, param_grid=param_lst, cv=3, verbose=10, n_jobs=-1)
# Execute search based on flights dataset
grid_search.fit(X_train, y_train)
# Output search results
print(grid_search.best_estimator_)
Output:
XGBClassifier(max_depth=5, min_child_weight=6, n_estimators=300)
Code 5 provides an example of grid search based on XGBoost. We first create an XGBoost classifier instance, then provide the parameters to be searched and their corresponding ranges, create a grid search object based on GridSearchCV, and finally fit the training data and output the best parameters found by the search. It can be seen that when the maximum depth of the tree is 5, the minimum child weight is 6, and the number of trees is 300, the model achieves relatively optimal performance.
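Besides best_estimator_, the fitted search object also exposes the best parameter dictionary and its mean cross-validation score, and since GridSearchCV refits the best model on the full training set by default, it can be evaluated on the held-out test set directly. A short sketch, reusing the grid_search object from Code 5:
# Best parameter combination and its mean cross-validation score
print(grid_search.best_params_)
print(grid_search.best_score_)
# Evaluate the refitted best model on the test set
y_pred = grid_search.best_estimator_.predict_proba(X_test)[:, 1]
print('AUC of testset based on grid-searched XGBoost: ', roc_auc_score(y_test, y_pred))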
Random Search
Random search, as the name implies, randomly searches for the optimal hyperparameters within specified ranges or distributions. Compared to grid search, it does not try every combination in the given grid; instead, a fixed number of parameter combinations is sampled from the given distributions, and only these sampled combinations are evaluated. Random search can therefore be a more efficient tuning method than grid search. In sklearn, random search is implemented through the RandomizedSearchCV class in the model_selection module. An example of random search tuning based on XGBoost is shown in Code 6.
Code 6 Random Search
### Random Search Example Based on XGBoost
# Import RandomizedSearchCV module
from sklearn.model_selection import RandomizedSearchCV
# Create XGBoost classifier instance
model = xgb.XGBClassifier()
# Candidate values to sample parameters from
param_lst = {"max_depth": [3,5,7], "min_child_weight": [1,3,6], "n_estimators": [100,200,300], "learning_rate": [0.01, 0.05, 0.1]}
# Create random search, sampling a fixed number of parameter combinations
random_search = RandomizedSearchCV(model, param_distributions=param_lst, n_iter=10, cv=3, verbose=10, n_jobs=-1, random_state=10)
# Execute search based on flights dataset
random_search.fit(X_train, y_train)
# Output search results
print(random_search.best_estimator_)
Output:
XGBClassifier(max_depth=5, min_child_weight=6, n_estimators=300)
Code 6 provides an example of random search usage, which is essentially similar to grid search. It can be seen that the random search results indicate that when the number of trees is 300, the minimum child weight is 6, the maximum depth is 5, and the learning rate is 0.1, the model achieves optimal performance.
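A practical advantage of RandomizedSearchCV is that continuous distributions can be sampled instead of fixed candidate lists. The following is a hedged sketch using scipy.stats; the particular distributions and ranges are illustrative choices, not the ones used above.
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV
# Sample integer and continuous hyperparameters from distributions
param_dist = {
    "max_depth": randint(3, 10),           # integers in [3, 9]
    "min_child_weight": randint(1, 10),    # integers in [1, 9]
    "n_estimators": randint(100, 500),     # integers in [100, 499]
    "learning_rate": uniform(0.01, 0.19),  # floats in [0.01, 0.2]
}
rand_search = RandomizedSearchCV(xgb.XGBClassifier(), param_distributions=param_dist,
                                 n_iter=20, cv=3, random_state=10, n_jobs=-1)
rand_search.fit(X_train, y_train)
print(rand_search.best_params_)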
Bayesian Optimization
In addition to the aforementioned two tuning methods, this section introduces a third method, which may also be the best tuning method, namely Bayesian optimization. Bayesian optimization is a parameter optimization method based on Gaussian processes and Bayesian theorem, widely used in hyperparameter tuning of machine learning models in recent years. Here, we will not delve into the mathematical principles of Gaussian processes and Bayesian optimization, but will simply demonstrate the basic usage and tuning examples of Bayesian optimization.
Bayesian optimization, like other optimization methods, aims to find the parameter values that maximize the objective function. As a sequential optimization problem, Bayesian optimization needs to choose the most promising point to evaluate at each iteration, which is the key issue in Bayesian optimization and is addressed by the aforementioned Gaussian process. The mathematical details of Bayesian optimization, including Gaussian processes and acquisition functions such as the Upper Confidence Bound (UCB) and Expected Improvement (EI), are not elaborated on in this section due to space limitations. Bayesian optimization can be used directly through the third-party library BayesianOptimization (bayes_opt); a minimal toy usage is sketched below, and an XGBoost tuning example is shown in Code 7.
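As a warm-up, here is a minimal sketch of the bayes_opt API on a made-up objective function; the function black_box, its bounds, and the iteration counts are purely illustrative.
from bayes_opt import BayesianOptimization
# A made-up objective: the optimizer only sees its inputs and outputs
def black_box(x, y):
    return -x ** 2 - (y - 1) ** 2 + 1
# Search ranges for each parameter
pbounds = {'x': (-2, 2), 'y': (-3, 3)}
optimizer = BayesianOptimization(f=black_box, pbounds=pbounds, random_state=1)
# 2 random initial points, then 10 Bayesian optimization steps
optimizer.maximize(init_points=2, n_iter=10)
# Best parameters and objective value found so far
print(optimizer.max)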
Code 7 Bayesian Optimization
### Bayesian Optimization Example Based on XGBoost
# Import xgboost module
import xgboost as xgb
# Import Bayesian optimization module
from bayes_opt import BayesianOptimization
# Define the objective optimization function
def xgb_evaluate(min_child_weight, colsample_bytree, max_depth, subsample, gamma, alpha):
    # Specify the hyperparameters to optimize
    params['min_child_weight'] = int(min_child_weight)
    params['colsample_bytree'] = max(min(colsample_bytree, 1), 0)
    params['max_depth'] = int(max_depth)
    params['subsample'] = max(min(subsample, 1), 0)
    params['gamma'] = max(gamma, 0)
    params['alpha'] = max(alpha, 0)
    # Run xgb cross-validation with early stopping
    cv_result = xgb.cv(params, dtrain, num_boost_round=num_rounds, nfold=5,
                       seed=random_state, callbacks=[xgb.callback.early_stop(50)])
    # Return the final mean test AUC as the optimization target
    return cv_result['test-auc-mean'].values[-1]
# Define relevant parameters
num_rounds = 3000
random_state = 2021
num_iter = 25
init_points = 5
params = {'eta': 0.1, 'silent': 1, 'eval_metric': 'auc',
          'verbose_eval': True, 'seed': random_state}
# Rebuild the xgboost DMatrix (dtrain was overwritten by the LightGBM Dataset in Code 3)
dtrain = xgb.DMatrix(X_train, y_train)
# Create Bayesian optimization instance
# and set parameter search range
xgbBO = BayesianOptimization(xgb_evaluate,
                             {'min_child_weight': (1, 20),
                              'colsample_bytree': (0.1, 1),
                              'max_depth': (5, 15),
                              'subsample': (0.5, 1),
                              'gamma': (0, 10),
                              'alpha': (0, 10)})
# Execute tuning process
xgbBO.maximize(init_points=init_points, n_iter=num_iter)
Code 7 provides an example of Bayesian optimization based on XGBoost. Before executing Bayesian optimization, we need to define an objective function for optimization based on XGBoost’s cross-validation xgb.cv, obtaining the xgb.cv cross-validation results and using the test set AUC as the precision metric for optimization. Finally, we pass the defined objective optimization function and hyperparameter search range into the Bayesian optimization function BayesianOptimization, specify the initialization points and number of iterations, and execute Bayesian optimization.
Figure 2 Bayesian Optimization Results
Part of the optimization process is shown in Figure 2. It can be seen that Bayesian optimization reached its optimal point during the 23rd iteration, with the parameters alpha at 4.099, column sampling ratio at 0.1, gamma at 0, maximum tree depth at 5, minimum child weight at 5.377, and subsampling ratio at 1.0, resulting in a test set AUC of 0.72.
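These best values can also be read programmatically from the optimizer object rather than from the printed log. A short sketch, assuming the xgbBO instance from Code 7:
# Best observed target value (mean test AUC) and the corresponding parameters
best = xgbBO.max
print('Best CV AUC: ', best['target'])
print('Best parameters: ', best['params'])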
Conclusion
This chapter offers a brief overall comparison building on the previous chapters' content on ensemble learning, together with commonly used hyperparameter tuning methods and examples. Using a concrete dataset, we compared the accuracy and speed of the three commonly used Boosting ensemble learning models: XGBoost, LightGBM, and CatBoost. However, because of the particular dataset and differences in tuning, the comparison results are for demonstration purposes only and do not mean that LightGBM is necessarily superior to CatBoost.
We then introduced the three commonly used hyperparameter tuning methods: grid search, random search, and Bayesian optimization, and demonstrated each of them on the same dataset; due to space constraints, we did not delve deeply into the mathematical principles of each method.