Stochastic Gradient Boosting with XGBoost and Scikit-Learn

A simple technique for ensembling decision trees involves training each tree on a random subsample of the training dataset. When each tree is trained on a random subset of the rows of the training data, this is known as bagging. When a random subset of the columns (features) is also considered when choosing each split point, this is referred to as a random forest. These same techniques can be used with gradient tree boosting models, in a technique called stochastic gradient boosting.
In this article, you will discover stochastic gradient boosting and how to tune the sampling parameters of XGBoost with scikit-learn in Python. After reading this article, you will know:
  • The rationale for training trees on subsamples of the data, and how this can be applied to gradient boosting.
  • How to tune row-based subsampling in XGBoost using scikit-learn.
  • How to tune column-based subsampling in XGBoost, both per tree and per split.
Stochastic Gradient Boosting
Gradient boosting is a greedy process. New decision trees are added to the model to correct the residuals of the existing model. A greedy search process is used to create each decision tree, selecting the split points that minimize the objective function. This may lead to trees repeatedly using the same attributes, or even the same split points.
Bagging is a technique for creating a collection of decision trees, each built from a different random subset of the rows of the training data. The effect is that the randomness of the samples produces slightly different trees, and this diversity yields better performance from the collection of trees as a whole. Random forests take this a step further by also subsampling the features (columns) when selecting split points, further increasing the overall diversity of the trees. These same techniques can be used in the construction of the decision trees in gradient boosting, and this variation is called stochastic gradient boosting. Typically, aggressive subsamples of the training data are used, such as 40% to 80%.
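In the XGBoost scikit-learn wrapper, these three sources of randomness map directly onto constructor parameters. The following minimal sketch shows where each one is set; the values are illustrative only, not tuned for any particular dataset:
# Sketch: the three parameters that control stochastic gradient boosting
# in XGBoost (illustrative values, not tuned)
from xgboost import XGBClassifier
model = XGBClassifier(
    subsample=0.5,         # rows: each tree sees a random 50% of the training rows
    colsample_bytree=0.5,  # columns per tree: each tree sees a random 50% of the columns
    colsample_bylevel=0.5  # columns per level: resampled again at each depth level of the tree
)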
Tutorial Overview
In this tutorial, we will look at the effect of different subsampling techniques in gradient boosting. We will tune three different flavors of stochastic gradient boosting supported by the XGBoost library in Python, specifically:
  • Subsampling of rows in the dataset when creating each tree.
  • Subsampling of columns in the dataset when creating each tree.
  • Subsampling of columns for each split point when creating each tree.
Problem Description: Otto Dataset
In this tutorial, we will use the “Otto Group Product Classification Challenge” dataset. This dataset is available for free from Kaggle (you will need to register with Kaggle to download it). You can download the training dataset train.csv.zip from the “Data” page and place the extracted train.csv file in your working directory. This dataset describes more than 61,000 products with 93 obfuscated features, grouped into 9 product categories (e.g., fashion, electronics, etc.). The input attributes are counts of different events of some kind. The goal is to predict, for each new product, an array of probabilities over the 9 categories, and models are evaluated using multiclass logarithmic loss (also called cross-entropy). The competition concluded in May 2015, and the large number of examples, the difficulty of the problem, and the fact that little data preparation is required (other than encoding the string class variable as integers) make this dataset a good challenge for XGBoost.
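Assuming train.csv has been extracted into the working directory, a quick sanity check like the sketch below can confirm the data loaded as expected. The column name 'target' and the expected shape come from the Kaggle data description, so treat them as assumptions:
# Sketch: quick sanity check of the Otto training data (assumes train.csv
# is in the working directory; expected shape is based on the Kaggle description)
from pandas import read_csv
data = read_csv('train.csv')
print(data.shape)                      # roughly (61878, 95): id, 93 features, target
print(data['target'].value_counts())  # distribution of the product categories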
Tuning Row Subsampling in XGBoost
Row subsampling involves selecting a random sample of the training dataset without replacement. Row subsampling can be specified via the subsample parameter of the XGBoost scikit-learn wrapper class. The default value is 1.0, which means no subsampling is performed. We can use the grid search capability built into scikit-learn to evaluate the effect on log loss of subsample values from 0.1 to 1.0 on the Otto dataset.
[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 1.0]
There are 9 values of subsample, and each model will be evaluated using 10-fold cross-validation, meaning 9×10, or 90, models must be trained and tested.
The complete code listing is provided below.
# XGBoost on Otto dataset, tune subsample
from pandas import read_csv
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder
import matplotlib
matplotlib.use('Agg')
from matplotlib import pyplot
# load data
data = read_csv('train.csv')
dataset = data.values
# split data into X and y
X = dataset[:,0:94]
y = dataset[:,94]
# encode string class values as integers
label_encoded_y = LabelEncoder().fit_transform(y)
# grid search
model = XGBClassifier()
subsample = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 1.0]
param_grid = dict(subsample=subsample)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold)
grid_result = grid_search.fit(X, label_encoded_y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
# plot
pyplot.errorbar(subsample, means, yerr=stds)
pyplot.title("XGBoost subsample vs Log Loss")
pyplot.xlabel('subsample')
pyplot.ylabel('Log Loss')
pyplot.savefig('subsample.png')
Running this example will print the best configuration and the log loss for each test configuration.
Note: Due to the randomness of the algorithm or evaluation procedure, or differences in numerical precision, your results may vary. Consider running the example several times and comparing the average results.
We can see that the best result obtained was subsample = 0.3, that is, training each tree on a 30% sample of the training dataset.
Best: -0.000647 using {'subsample': 0.3}
-0.001156 (0.000286) with: {'subsample': 0.1}
-0.000765 (0.000430) with: {'subsample': 0.2}
-0.000647 (0.000471) with: {'subsample': 0.3}
-0.000659 (0.000635) with: {'subsample': 0.4}
-0.000717 (0.000849) with: {'subsample': 0.5}
-0.000773 (0.000998) with: {'subsample': 0.6}
-0.000877 (0.001179) with: {'subsample': 0.7}
-0.001007 (0.001371) with: {'subsample': 0.8}
-0.001239 (0.001730) with: {'subsample': 1.0}
We can plot these mean and standard deviation log loss values to get a better picture of how performance varies with the subsample value.
We can see that 30% does indeed give the best mean performance, but we can also see that as the ratio increases, the variance of the performance grows markedly. Interestingly, the mean performance for every subsample value beats the mean performance with no subsampling (subsample = 1.0).
Tuning Column Subsampling by Tree in XGBoost
We can also create a random sample of the features (or columns) to use before creating each decision tree in the boosted model. In the XGBoost scikit-learn wrapper, this is controlled by the colsample_bytree parameter. The default value is 1.0, meaning all columns are used in each decision tree. We can evaluate values of colsample_bytree between 0.1 and 1.0, using the same grid of values as before.
[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 1.0]
The complete example is as follows:
# XGBoost on Otto dataset, tune colsample_bytree
from pandas import read_csv
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder
import matplotlib
matplotlib.use('Agg')
from matplotlib import pyplot
# load data
data = read_csv('train.csv')
dataset = data.values
# split data into X and y
X = dataset[:,0:94]
y = dataset[:,94]
# encode string class values as integers
label_encoded_y = LabelEncoder().fit_transform(y)
# grid search
model = XGBClassifier()
colsample_bytree = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 1.0]
param_grid = dict(colsample_bytree=colsample_bytree)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold)
grid_result = grid_search.fit(X, label_encoded_y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
# plot
pyplot.errorbar(colsample_bytree, means, yerr=stds)
pyplot.title("XGBoost colsample_bytree vs Log Loss")
pyplot.xlabel('colsample_bytree')
pyplot.ylabel('Log Loss')
pyplot.savefig('colsample_bytree.png')
Running this example will print the best configuration and the log loss for each test configuration.
Note: Due to the randomness of the algorithm or evaluation procedure, or differences in numerical precision, your results may vary. Consider running the example several times and comparing the average results.
We can see that the model performs best with colsample_bytree = 1.0. This suggests that column subsampling per tree does not add value on this problem.
Best: -0.001239 using {'colsample_bytree': 1.0}
-0.298955 (0.002177) with: {'colsample_bytree': 0.1}
-0.092441 (0.000798) with: {'colsample_bytree': 0.2}
-0.029993 (0.000459) with: {'colsample_bytree': 0.3}
-0.010435 (0.000669) with: {'colsample_bytree': 0.4}
-0.004176 (0.000916) with: {'colsample_bytree': 0.5}
-0.002614 (0.001062) with: {'colsample_bytree': 0.6}
-0.001694 (0.001221) with: {'colsample_bytree': 0.7}
-0.001306 (0.001435) with: {'colsample_bytree': 0.8}
-0.001239 (0.001730) with: {'colsample_bytree': 1.0}
Plotting the results, we can see a plateau in the performance of the model (at least at this scale), for colsample_bytree values from 0.5 up to 1.0.
Tuning Column Subsampling by Split in XGBoost
Rather than subsampling the columns once for each tree, we can subsample them at each split in the decision tree. In principle, this is the approach used in random forests. We can set the size of the sample of columns used at each split with the colsample_bylevel parameter of the XGBoost scikit-learn wrapper class. (Strictly speaking, colsample_bylevel resamples the columns once for each depth level of the tree rather than at every individual split; newer versions of XGBoost also offer a colsample_bynode parameter for true per-split sampling.) As before, we will vary the ratio from 10% up to the default of 100%.
The complete code listing is provided below.
# XGBoost on Otto dataset, tune colsample_bylevel
from pandas import read_csv
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder
import matplotlib
matplotlib.use('Agg')
from matplotlib import pyplot
# load data
data = read_csv('train.csv')
dataset = data.values
# split data into X and y
X = dataset[:,0:94]
y = dataset[:,94]
# encode string class values as integers
label_encoded_y = LabelEncoder().fit_transform(y)
# grid search
model = XGBClassifier()
colsample_bylevel = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 1.0]
param_grid = dict(colsample_bylevel=colsample_bylevel)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold)
grid_result = grid_search.fit(X, label_encoded_y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
# plot
pyplot.errorbar(colsample_bylevel, means, yerr=stds)
pyplot.title("XGBoost colsample_bylevel vs Log Loss")
pyplot.xlabel('colsample_bylevel')
pyplot.ylabel('Log Loss')
pyplot.savefig('colsample_bylevel.png')
Running this example will print the best configuration and the log loss for each test configuration.
Note: Due to the randomness of the algorithm or evaluation procedure, or differences in numerical precision, your results may vary. Consider running the example several times and comparing the average results.
We can see that the best result is achieved by setting colsample_bylevel to 70%, giving a (negative) log loss of -0.001062, which improves on the -0.001239 seen when column sampling per tree was set to 100%.
This is worth remembering: if the per-tree results suggest using 100% of the columns, do not abandon column subsampling altogether, but try subsampling columns by split instead.
Best: -0.001062 using {'colsample_bylevel': 0.7}
-0.159455 (0.007028) with: {'colsample_bylevel': 0.1}
-0.034391 (0.003533) with: {'colsample_bylevel': 0.2}
-0.007619 (0.000451) with: {'colsample_bylevel': 0.3}
-0.002982 (0.000726) with: {'colsample_bylevel': 0.4}
-0.001410 (0.000946) with: {'colsample_bylevel': 0.5}
-0.001182 (0.001144) with: {'colsample_bylevel': 0.6}
-0.001062 (0.001221) with: {'colsample_bylevel': 0.7}
-0.001071 (0.001427) with: {'colsample_bylevel': 0.8}
-0.001239 (0.001730) with: {'colsample_bylevel': 1.0}
We can plot how performance changes with each colsample_bylevel value. The results show relatively low variance and, at this scale, performance appears to stabilize for values of 0.3 and above.
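So far, each parameter has been tuned in isolation. A natural next step is to search over all three at once, at a correspondingly larger computational cost. The sketch below reuses X, label_encoded_y, kfold, and the imports from the listings above; the value grids are illustrative choices informed by the individual results, not an exhaustive search:
# Sketch: joint grid search over all three stochastic gradient boosting
# parameters (3 x 3 x 3 = 27 candidates, each evaluated with 10-fold CV).
# Reuses X, label_encoded_y, kfold, and imports from the listings above.
param_grid = dict(
    subsample=[0.3, 0.5, 1.0],
    colsample_bytree=[0.6, 0.8, 1.0],
    colsample_bylevel=[0.5, 0.7, 1.0]
)
grid_search = GridSearchCV(XGBClassifier(), param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold)
grid_result = grid_search.fit(X, label_encoded_y)
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))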

Author: Yishui Hancheng, CSDN Blog Expert, personal research direction: Machine Learning, Deep Learning, NLP, CV

Blog: http://yishuihancheng.blog.csdn.net
