XGBoost Tutorial: A Comprehensive Guide


Source: Machine Learning Algorithms

This article is about 8400 words long and is recommended for a 10-minute read.
This article provides a detailed explanation of the engineering application methods of XGBoost.

This illustrated machine learning series demonstrates the application process of machine learning algorithms in a case-driven, code-driven manner, building the ability to construct scenario modeling solutions and tune their performance. XGBoost is a very powerful boosting algorithm toolkit and the model of choice for many large companies’ machine learning solutions, with excellent performance in parallel computing efficiency, handling of missing values, and control of overfitting.


https://www.showmeai.tech/article-detail/204

XGBoost is short for eXtreme Gradient Boosting, which is a very powerful boosting algorithm toolkit. Its excellent performance (effectiveness and speed) has kept it at the top of the data science competition solution rankings for a long time. Many large companies still prefer this model for their machine learning solutions. XGBoost demonstrates excellent performance in parallel computing efficiency, handling missing values, controlling overfitting, and predictive generalization ability.

1. XGBoost Installation

XGBoost, as a commonly used powerful Python machine learning tool library, is relatively easy to install.

Python and IDE Environment Setup


For Python environment and IDE setup, refer to the ShowMeAI article Illustrated Python | Installation and Environment Setup [2].

Library Installation

(1) Linux/Mac and Other Systems

On these systems, installation can be completed easily with pip: enter the following command in the terminal and wait for the installation to finish.

pip install xgboost

You can also use a pip mirror (for example, the Tsinghua mirror in China) for faster installation.

pip install -i https://pypi.tuna.tsinghua.edu.cn/simple xgboost

(2) Windows System

For Windows systems, an efficient and convenient installation method is to download the appropriate version of the XGBoost wheel from the website http://www.lfd.uci.edu/~gohlke/pythonlibs/ and install it with the following command.

pip install xgboost-1.5.1-cp310-cp310-win32.whl
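
After installing on any platform, a quick way to confirm that the package is usable is to import it and print its version. This small check is added here for convenience and is not part of the original tutorial.

# Optional sanity check: confirm that xgboost imports correctly and report its version
import xgboost as xgb

print(xgb.__version__)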

2. Reading Data with XGBoost

The first step in applying XGBoost is to load the required data into a format supported by the library. XGBoost can load data in several formats for training and modeling:

  • Text data in libsvm format.
  • 2D NumPy arrays.
  • XGBoost’s binary cache files.

The loaded data is stored in a DMatrix object.

XGBoost’s SKLearn interface also supports DataFrame-format data (see the ShowMeAI article Python Data Analysis | Core Functions of Pandas [3] for data processing). Below are the loading methods for the different data formats in XGBoost; a DataFrame-based sketch follows the list.

  • Load data in libsvm format
dtrain1 = xgb.DMatrix('train.svm.txt')
  • Load binary cache files
dtrain2 = xgb.DMatrix('train.svm.buffer')
  • Load Numpy arrays
data = np.random.rand(5,10) # 5 entities, each contains 10 features
label = np.random.randint(2, size=5) # binary target
dtrain = xgb.DMatrix( data, label=label)
  • Convert scipy.sparse format data to DMatrix format
csr = scipy.sparse.csr_matrix( (dat, (row,col)) )
dtrain = xgb.DMatrix( csr ) 
  • Save DMatrix data in XGBoost’s binary format to speed up loading next time, using the following method
dtrain = xgb.DMatrix('train.svm.txt')
dtrain.save_binary("train.buffer")
  • Handle missing values in DMatrix as follows
dtrain = xgb.DMatrix( data, label=label, missing = -999.0)
  • When you need to set sample weights, you can use the following method
w = np.random.rand(5,1)
dtrain = xgb.DMatrix( data, label=label, missing = -999.0, weight=w)
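
In addition to the formats listed above, recent XGBoost versions also let you build a DMatrix directly from a pandas DataFrame, which pairs naturally with the SKLearn-style workflow mentioned earlier. The following is a minimal sketch; it assumes the Pima-Indians-Diabetes.csv file used later in this article, with an Outcome column as the label.

# Sketch: build a DMatrix directly from a pandas DataFrame (recent XGBoost versions)
import pandas as pd
import xgboost as xgb

df = pd.read_csv('./data/Pima-Indians-Diabetes.csv')
dtrain_df = xgb.DMatrix(df.drop(columns=['Outcome']), label=df['Outcome'])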

3. Different Modeling Methods of XGBoost

Built-in Modeling Method: libsvm Format Data Source

XGBoost has built-in modeling methods with the following data formats and core training methods:

  • Data in DMatrix format
  • Training based on the xgb.train interface

Below is a simple example from the official documentation demonstrating the process of reading libsvm format data (to DMatrix format) and specifying parameters for modeling.

# Import the library
import numpy as np
import scipy.sparse
import pickle
import xgboost as xgb

# Read data from libsvm file for binary classification
# The data is in libsvm format, as follows
#1 3:1 10:1 11:1 21:1 30:1 34:1 36:1 40:1 41:1 53:1 58:1 65:1 69:1 77:1 86:1 88:1 92:1 95:1 102:1 105:1 117:1 124:1
#0 3:1 10:1 20:1 21:1 23:1 34:1 36:1 39:1 41:1 53:1 56:1 65:1 69:1 77:1 86:1 88:1 92:1 95:1 102:1 106:1 116:1 120:1
#0 1:1 10:1 19:1 21:1 24:1 34:1 36:1 39:1 42:1 53:1 56:1 65:1 69:1 77:1 86:1 88:1 92:1 95:1 102:1 106:1 116:1 122:1
dtrain = xgb.DMatrix('./data/agaricus.txt.train')
dtest = xgb.DMatrix('./data/agaricus.txt.test')

# Set hyperparameters
# Mainly tree depth, learning rate, objective function
param = {'max_depth':2, 'eta':1, 'silent':1, 'objective':'binary:logistic' }

# Set watchlist to observe model status during modeling
watchlist  = [(dtest,'eval'), (dtrain,'train')]
num_round = 2
bst = xgb.train(param, dtrain, num_round, watchlist)

# Use the model for prediction
preds = bst.predict(dtest)

# Check accuracy
labels = dtest.get_label()
print('Error rate: %f' % 
       (sum(1 for i in range(len(preds)) if int(preds[i]>0.5)!=labels[i]) /float(len(preds))))

# Save model
bst.save_model('./model/0001.model')
[0]  eval-error:0.042831  train-error:0.046522
[1]  eval-error:0.021726  train-error:0.022263
Error rate: 0.021726
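
Since the example saves the trained booster to ./model/0001.model, it is worth noting how to load it back later; the short sketch below uses the standard Booster.load_model call and reuses dtest from the example above.

# Sketch: reload the saved booster and predict again
bst_loaded = xgb.Booster()
bst_loaded.load_model('./model/0001.model')
preds_loaded = bst_loaded.predict(dtest)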

Built-in Modeling Method: csv Format Data Source

In the following example, the input data source is a csv file. We use the familiar Pandas library (refer to the ShowMeAI Data Analysis Series tutorials [4] and Data Science Tools Quick Reference | Pandas User Guide [5]) to read the data into DataFrame format, then build DMatrix-format input and train with the built-in modeling method.

# Pima Indian Diabetes dataset includes many fields: number of pregnancies, plasma glucose concentration during oral glucose tolerance test, diastolic pressure (mm Hg), triceps skin fold thickness (mm),
# 2-hour serum insulin (μU/ml), body mass index (kg/(height(m)^2)), diabetes pedigree function, age (years)
import pandas as pd
data = pd.read_csv('./data/Pima-Indians-Diabetes.csv')
data.head()
# Import the library
import numpy as np
import pandas as pd
import pickle
import xgboost as xgb
from sklearn.model_selection import train_test_split

# Read data with pandas
data = pd.read_csv('./data/Pima-Indians-Diabetes.csv')

# Split the data
train, test = train_test_split(data)

# Convert to Dmatrix format
feature_columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
target_column = 'Outcome'

# Extract numpy array values from Dataframe to initialize DMatrix object
xgtrain = xgb.DMatrix(train[feature_columns].values, train[target_column].values)
xgtest = xgb.DMatrix(test[feature_columns].values, test[target_column].values)

# Set parameters
param = {'max_depth':5, 'eta':0.1, 'silent':1, 'subsample':0.7, 'colsample_bytree':0.7, 'objective':'binary:logistic' }

# Set watchlist to observe model status
watchlist  = [(xgtest,'eval'), (xgtrain,'train')]
num_round = 10
bst = xgb.train(param, xgtrain, num_round, watchlist)

# Use the model for prediction
preds = bst.predict(xgtest)

# Check accuracy
labels = xgtest.get_label()
print('Error rate: %f' % 
       (sum(1 for i in range(len(preds)) if int(preds[i]>0.5)!=labels[i]) /float(len(preds))))

# Save model
bst.save_model('./model/0002.model')
[0]  eval-error:0.354167  train-error:0.194444
[1]  eval-error:0.34375   train-error:0.170139
[2]  eval-error:0.322917  train-error:0.170139
[3]  eval-error:0.28125   train-error:0.161458
[4]  eval-error:0.302083  train-error:0.147569
[5]  eval-error:0.286458  train-error:0.138889
[6]  eval-error:0.296875  train-error:0.142361
[7]  eval-error:0.291667  train-error:0.144097
[8]  eval-error:0.302083  train-error:0.130208
[9]  eval-error:0.291667  train-error:0.130208
Error rate: 0.291667

Estimator Modeling Method: SKLearn Interface + Dataframe

XGBoost also supports modeling through the unified estimator interface in SKLearn. Below is a typical reference case with training and test sets read into DataFrame format; you can directly initialize an XGBClassifier and fit it. The usage and interface are consistent with other estimators in SKLearn.

# Import the library
import numpy as np
import pandas as pd
import pickle
import joblib  # used below to save the fitted model
import xgboost as xgb
from sklearn.model_selection import train_test_split

# Read data with pandas
data = pd.read_csv('./data/Pima-Indians-Diabetes.csv')

# Split the data
train, test = train_test_split(data)

# Feature columns
feature_columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
# Target column
target_column = 'Outcome'

# Initialize model
xgb_classifier = xgb.XGBClassifier(n_estimators=20,\
                                   max_depth=4, \
                                   learning_rate=0.1, \
                                   subsample=0.7, \
                                   colsample_bytree=0.7, \
                                   eval_metric='error')

# Fit model to Dataframe format data
xgb_classifier.fit(train[feature_columns], train[target_column])

# Use the model for prediction
preds = xgb_classifier.predict(test[feature_columns])

# Check accuracy
print('Error rate: %f' % ((preds != test[target_column]).sum() / float(test.shape[0])))

# Save model
joblib.dump(xgb_classifier, './model/0003.model')
Error rate: 0.265625

['./model/0003.model']

4. Model Parameter Tuning and Advanced Features

XGBoost Parameter Explanation


Before running XGBoost, you must set three types of parameters: General parameters, Booster parameters, and Task parameters:

General parameters:

These control which booster is used in the boosting process; the common choices are the tree booster (gbtree) and the linear booster (gblinear).

Booster parameters:

These depend on the chosen booster and include parameters specific to the tree booster and the linear booster.

Task parameters:

These control the learning scenario; for example, regression problems use different task parameters than classification or ranking.
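
As an illustration of how the three groups fit together, the parameter dictionary below mixes a general parameter (booster), tree booster parameters (max_depth, eta), and a task parameter (objective); the values are illustrative only.

# Sketch: one parameter dictionary combining the three parameter groups (illustrative values)
param = {
    'booster': 'gbtree',             # general parameter: use the tree booster
    'max_depth': 4,                  # tree booster parameter
    'eta': 0.1,                      # tree booster parameter (learning rate)
    'objective': 'binary:logistic'   # task parameter
}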

(1) General Parameters


booster [default=gbtree]

There are two models to choose from: gbtree and gblinear. gbtree uses tree-based models for boosting calculations, while gblinear uses linear models for boosting calculations. The default value is gbtree.

silent [default=0]

Setting this to 1 runs in silent mode without printing runtime information, while setting it to 0 prints runtime messages. The default value is 0.

nthread

The number of threads used to run XGBoost. The default value is the maximum number of threads available on the current system.

num_pbuffer

The size of the prediction buffer, usually set to the number of training instances. The buffer is used to save the prediction results of the last boosting step and does not need to be set manually.

num_feature

The number of features used in boosting. XGBoost sets this automatically; it does not need to be set manually.

(2) Tree Model Booster Parameters


eta [default=0.3]

The shrinkage step size used in updates to prevent overfitting. After each boosting step, the algorithm obtains the weights of the newly added features, and eta shrinks these weights to make the boosting process more conservative. The default value is 0.3, with a range of (0,1).

gamma [default=0]

The minimum loss reduction required for a tree to further split and grow. The larger the value, the more conservative the algorithm will be. The range is (0,∞).

max_depth [default=6]

The maximum depth of the tree. The default value is 6 with a range of (0,∞).

min_child_weight [default=1]

The minimum sum of instance weights required in a child node. If the sum of instance weights in a leaf node is less than min_child_weight, the splitting process stops. In linear regression tasks, this simply corresponds to the minimum number of instances needed in each node. The larger the value, the more conservative the algorithm. The range is (0,∞).

max_delta_step [default=0]

The maximum step allowed for each tree’s weight estimation. A value of 0 means no constraint; a positive value makes the update more conservative. This parameter is usually not needed, but it may help in logistic regression when the classes are extremely imbalanced. Setting it to a value between 1 and 10 can help control the update. The range is (0,∞).

subsample [default=1]

The fraction of the training instances sampled from the entire training set to grow each tree. If set to 0.5, XGBoost randomly draws half of the training data to build the tree model, which helps prevent overfitting. The range is (0,1).

colsample_bytree [default=1]

The proportion of features sampled when building the tree. The default value is 1 with a range of (0,1).

(3) Linear Booster Parameters


lambda [default=0]

L2 regularization penalty coefficient.

alpha [default=0]

L1 regularization penalty coefficient.

lambda_bias

L2 regularization on the bias term. The default value is 0 (there is no L1 regularization on the bias because it is not important).

(4) Task Parameters


objective [ default=reg:linear ]

Defines the learning task and corresponding learning objectives.

Optional objective functions include:

  • reg:linear: Linear regression.

  • reg:logistic: Logistic regression.

  • binary:logistic: Binary classification logistic regression problem, output as probability.

  • binary:logitraw: Binary classification logistic regression problem, output is the raw score w^T x before the logistic transformation.

  • count:poisson: Poisson regression for count data, output is the mean of the Poisson distribution. In Poisson regression, the default value of max_delta_step is 0.7 (used to safeguard optimization).

  • multi:softmax: Allows XGBoost to use the softmax objective function to handle multi-class problems, while the parameter num_class (number of classes) must be set.

  • multi:softprob: Similar to softmax, but outputs a vector of ndata * nclass, which can be reshaped into a matrix of ndata rows and nclass columns. Each row indicates the probability of the sample belonging to each class.

  • rank:pairwise: Set XGBoost to do ranking tasks by minimizing the pairwise loss.

base_score [ default=0.5 ]

  • Initial prediction score for all instances, global bias;

  • With a sufficient number of iterations, changing this value will not have a significant impact.

eval_metric [ default according to objective ]

Evaluation metrics required for validation data. Different objective functions will have default evaluation metrics (rmse for regression, and error for classification, mean average precision for ranking).

Users can add multiple evaluation metrics. Python users should pass the parameters as a list of pairs rather than as a dict, so that additional eval_metric entries are not overwritten (see the sketch at the end of this subsection).

Available options include:

  • rmse: root mean square error

  • logloss: negative log-likelihood

  • error: Binary classification error rate. It is calculated as #(wrong cases)/#(all cases). For the predictions, the evaluation will regard the instances with prediction value larger than 0.5 as positive instances, and the others as negative instances.

  • merror: Multiclass classification error rate. It is calculated as #(wrong cases)/#(all cases).

  • mlogloss: Multiclass logloss

  • auc: Area under the curve for ranking evaluation.

  • ndcg: Normalized Discounted Cumulative Gain

  • map: Mean average precision

  • ndcg@n, map@n: n can be assigned as an integer to cut off the top positions in the lists for evaluation.

  • ndcg-, map-, ndcg@n-, map@n-: In XGBoost, NDCG and MAP evaluate the score of a list without any positive samples as 1. Adding "-" to the metric name makes XGBoost evaluate these scores as 0, to be consistent under some conditions.

seed [ default=0 ]

Random seed. The default value is 0.
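
As a sketch of the list-of-pairs convention mentioned above, the snippet below passes two evaluation metrics to xgb.train; dtrain and watchlist are assumed to be defined as in the earlier examples.

# Sketch: supplying several evaluation metrics as a list of (key, value) pairs
param = {'max_depth': 2, 'eta': 1, 'objective': 'binary:logistic'}
plst = list(param.items()) + [('eval_metric', 'logloss'), ('eval_metric', 'auc')]
bst = xgb.train(plst, dtrain, num_boost_round=2, evals=watchlist)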

Built-in Parameter Optimization

(1) Cross-validation

XGBoost comes with some methods for experiments and parameter tuning, such as the cross-validation method xgb.cv.

xgb.cv(param, dtrain, num_round, nfold=5, metrics={'error'}, seed = 0)
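
When pandas is installed, xgb.cv returns a DataFrame containing the per-round evaluation history, which can be inspected to choose the number of boosting rounds; a brief sketch (reusing param, dtrain and num_round from the earlier examples):

# Sketch: inspect the cross-validation history returned by xgb.cv
cv_results = xgb.cv(param, dtrain, num_round, nfold=5, metrics={'error'}, seed=0)
print(cv_results)                              # columns such as 'test-error-mean', 'test-error-std'
print(cv_results['test-error-mean'].idxmin())  # round with the lowest mean validation error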

(2) Adding Preprocessing

We can add some settings during the modeling process to the cross-validation phase, such as weighting different categories of samples. You can refer to the following code example:

# Calculate the ratio of positive to negative samples and adjust sample weights
def fpreproc(dtrain, dtest, param):
    label = dtrain.get_label()
    ratio = float(np.sum(label == 0)) / np.sum(label==1)
    param['scale_pos_weight'] = ratio
    return (dtrain, dtest, param)

# First preprocess to calculate sample weights, then perform cross-validation
xgb.cv(param, dtrain, num_round, nfold=5,
       metrics={'auc'}, seed = 0, fpreproc = fpreproc)

(3) Custom Loss Function and Evaluation Criteria

XGBoost supports custom loss functions and evaluation criteria during training. The definition of the loss function needs to return the first and second derivatives of the loss function, while the evaluation criteria need to compute the difference between the data’s label and predicted values. The loss function is used in the tree structure learning during the training process, while the evaluation criteria are often used for effect evaluation on the validation set.

print('Using custom loss function for cross-validation')
# Custom loss function, needs to provide first and second derivatives of the loss function
def logregobj(preds, dtrain):
    labels = dtrain.get_label()
    preds = 1.0 / (1.0 + np.exp(-preds))
    grad = preds - labels
    hess = preds * (1.0-preds)
    return grad, hess

# Custom evaluation criteria to evaluate the difference between predicted values and standard answers
def evalerror(preds, dtrain):
    labels = dtrain.get_label()
    return 'error', float(sum(labels != (preds > 0.0))) / len(labels)

watchlist  = [(dtest,'eval'), (dtrain,'train')]
param = {'max_depth':3, 'eta':0.1, 'silent':1}
num_round = 5
# Custom loss function training
bst = xgb.train(param, dtrain, num_round, watchlist, logregobj, evalerror)
# Cross-validation
xgb.cv(param, dtrain, num_round, nfold = 5, seed = 0, obj = logregobj, feval=evalerror)
Using custom loss function for cross-validation
[0]  eval-rmse:0.306901   train-rmse:0.306164  eval-error:0.518312  train-error:0.517887
[1]  eval-rmse:0.179189   train-rmse:0.177278  eval-error:0.518312  train-error:0.517887
[2]  eval-rmse:0.172565   train-rmse:0.171728  eval-error:0.016139  train-error:0.014433
[3]  eval-rmse:0.269612   train-rmse:0.27111   eval-error:0.016139  train-error:0.014433
[4]  eval-rmse:0.396903   train-rmse:0.398256  eval-error:0.016139  train-error:0.014433

(4) Use Only the First n Trees for Prediction

For boosting models, many base learners (in XGBoost, often many trees) will be trained. You can complete predictions using only the ensemble of the first n trees after full training.

#!/usr/bin/python
import numpy as np
import pandas as pd
import pickle
import xgboost as xgb
from sklearn.model_selection import train_test_split

# Basic example, read data from csv file, perform binary classification

# Read data with pandas
data = pd.read_csv('./data/Pima-Indians-Diabetes.csv')

# Split the data
train, test = train_test_split(data)

# Convert to Dmatrix format
feature_columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
target_column = 'Outcome'
xgtrain = xgb.DMatrix(train[feature_columns].values, train[target_column].values)
xgtest = xgb.DMatrix(test[feature_columns].values, test[target_column].values)

# Set parameters
param = {'max_depth':5, 'eta':0.1, 'silent':1, 'subsample':0.7, 'colsample_bytree':0.7, 'objective':'binary:logistic' }

# Set watchlist to observe model status
watchlist  = [(xgtest,'eval'), (xgtrain,'train')]
num_round = 10
bst = xgb.train(param, xgtrain, num_round, watchlist)

# Use only the 1st tree for prediction
ypred1 = bst.predict(xgtest, ntree_limit=1)
# Use the first 9 trees for prediction
ypred2 = bst.predict(xgtest, ntree_limit=9)
label = xgtest.get_label()
print('Error rate using the first tree: %f' % (np.sum((ypred1>0.5)!=label) /float(len(label))))
print('Error rate using the first 9 trees: %f' % (np.sum((ypred2>0.5)!=label) /float(len(label))))
[0]  eval-error:0.255208  train-error:0.196181
[1]  eval-error:0.234375  train-error:0.175347
[2]  eval-error:0.25   train-error:0.163194
[3]  eval-error:0.229167  train-error:0.149306
[4]  eval-error:0.213542  train-error:0.154514
[5]  eval-error:0.21875   train-error:0.152778
[6]  eval-error:0.21875   train-error:0.154514
[7]  eval-error:0.213542  train-error:0.138889
[8]  eval-error:0.1875 train-error:0.147569
[9]  eval-error:0.1875 train-error:0.144097
Error rate using the first tree: 0.255208
Error rate using the first 9 trees: 0.187500
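
Note that ntree_limit belongs to older XGBoost releases; in newer versions (roughly 1.4 onwards) the equivalent argument is iteration_range. A hedged sketch, assuming such a version:

# Sketch for newer XGBoost versions: iteration_range replaces ntree_limit
ypred1 = bst.predict(xgtest, iteration_range=(0, 1))  # use only the first tree
ypred2 = bst.predict(xgtest, iteration_range=(0, 9))  # use the first 9 trees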

Estimator Parameter Optimization

(1) SKLearn Style Interface Experimental Evaluation

XGBoost provides an SKLearn-style estimator interface, and the overall usage is consistent with other estimators in SKLearn. Below is a manual cross-validation example; note that here we directly use XGBClassifier and XGBRegressor to fit and evaluate the data.

import pickle
import xgboost as xgb

import numpy as np
from sklearn.model_selection import KFold, train_test_split, GridSearchCV
from sklearn.metrics import confusion_matrix, mean_squared_error
from sklearn.datasets import load_iris, load_digits, load_boston

rng = np.random.RandomState(31337)

# Binary classification: confusion matrix
print("Binary classification problem for digits 0 and 1")
digits = load_digits(n_class=2)
y = digits['target']
X = digits['data']
# Data splitting object
kf = KFold(n_splits=2, shuffle=True, random_state=rng)
print("Cross-validation on 2-fold data")
# 2-fold cross-validation
for train_index, test_index in kf.split(X):
    xgb_model = xgb.XGBClassifier().fit(X[train_index],y[train_index])
    predictions = xgb_model.predict(X[test_index])
    actuals = y[test_index]
    print("Confusion matrix:")
    print(confusion_matrix(actuals, predictions))

# Multiclass: confusion matrix
print("\nIris: Multiclass")
iris = load_iris()
y = iris['target']
X = iris['data']
kf = KFold(n_splits=2, shuffle=True, random_state=rng)
print("Cross-validation on 2-fold data")
for train_index, test_index in kf.split(X):
    xgb_model = xgb.XGBClassifier().fit(X[train_index],y[train_index])
    predictions = xgb_model.predict(X[test_index])
    actuals = y[test_index]
    print("Confusion matrix:")
    print(confusion_matrix(actuals, predictions))

# Regression problem: MSE
print("\nBoston housing price regression prediction problem")
boston = load_boston()
y = boston['target']
X = boston['data']
kf = KFold(n_splits=2, shuffle=True, random_state=rng)
print("Cross-validation on 2-fold data")
for train_index, test_index in kf.split(X):
    xgb_model = xgb.XGBRegressor().fit(X[train_index],y[train_index])
    predictions = xgb_model.predict(X[test_index])
    actuals = y[test_index]
    print("MSE:",mean_squared_error(actuals, predictions))
Binary classification problem for digits 0 and 1
Cross-validation on 2-fold data
Confusion matrix:
[[87  0]
 [ 1 92]]
Confusion matrix:
[[91  0]
 [ 3 86]]

Iris: Multiclass
Cross-validation on 2-fold data
Confusion matrix:
[[19  0  0]
 [ 0 31  3]
 [ 0  1 21]]
Confusion matrix:
[[31  0  0]
 [ 0 16  0]
 [ 0  3 25]]

Boston housing price regression prediction problem
Cross-validation on 2-fold data
MSE: 9.860776812557337
MSE: 15.942418468446029

(2) Grid Search Parameter Tuning

As mentioned, the XGBoost estimator interface is used in the same way as other estimators in SKLearn, so we can also apply SKLearn’s hyperparameter tuning methods to the model. Below is a typical code example of grid-search tuning: we provide a dictionary of candidate parameter lists and use GridSearchCV to run cross-validated evaluation and select the optimal hyperparameters for XGBoost from the candidates.

print("Parameter Optimization:")
y = boston['target']
X = boston['data']
xgb_model = xgb.XGBRegressor()
clf = GridSearchCV(xgb_model,
                   {'max_depth': [2,4,6],
                    'n_estimators': [50,100,200]}, verbose=1)
clf.fit(X,y)
print(clf.best_score_)
print(clf.best_params_)
Parameter Optimization:
Fitting 3 folds for each of 9 candidates, totaling 27 fits

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.

0.6001029721598573
{'max_depth': 4, 'n_estimators': 100}

[Parallel(n_jobs=1)]: Done  27 out of  27 | elapsed:    1.3s finished
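
Because the estimator follows the SKLearn convention, other SKLearn search strategies work the same way; for example, a randomized search over the same parameter space (a sketch not in the original article, reusing X and y from above):

# Sketch: randomized hyperparameter search with the SKLearn-style XGBoost estimator
from sklearn.model_selection import RandomizedSearchCV

rnd_search = RandomizedSearchCV(xgb.XGBRegressor(),
                                {'max_depth': [2, 4, 6],
                                 'n_estimators': [50, 100, 200]},
                                n_iter=5, random_state=0, verbose=1)
rnd_search.fit(X, y)
print(rnd_search.best_params_)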

(3) Early Stopping

XGBoost models can sometimes overfit the training set because new trees keep being added (to correct samples that are not yet fitted correctly). Early stopping is an effective strategy: while trees are being added to fit the training set, performance on a validation set is monitored, and if the evaluation metric does not improve for a certain number of rounds, training stops and rolls back to the best point on the validation set, which is kept as the best model. Below is the corresponding code example, where the parameter early_stopping_rounds sets the maximum number of rounds allowed without improvement on the validation set, and eval_set specifies the validation dataset.

# Learn the model on the training set, adding one tree at a time, monitoring the effect on the validation set. When the effect on the validation set no longer improves, stop adding and growing trees
X = digits['data']
y = digits['target']
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
clf = xgb.XGBClassifier()
clf.fit(X_train, y_train, early_stopping_rounds=10, eval_metric="auc",
        eval_set=[(X_val, y_val)])
[0]  validation_0-auc:0.999497
Will train until validation_0-auc hasn't improved in 10 rounds.
[1]  validation_0-auc:0.999497
[2]  validation_0-auc:0.999497
[3]  validation_0-auc:0.999749
[4]  validation_0-auc:0.999749
[5]  validation_0-auc:0.999749
[6]  validation_0-auc:0.999749
[7]  validation_0-auc:0.999749
[8]  validation_0-auc:0.999749
[9]  validation_0-auc:0.999749
[10] validation_0-auc:1
[11] validation_0-auc:1
[12] validation_0-auc:1
[13] validation_0-auc:1
[14] validation_0-auc:1
[15] validation_0-auc:1
[16] validation_0-auc:1
[17] validation_0-auc:1
[18] validation_0-auc:1
[19] validation_0-auc:1
[20] validation_0-auc:1
Stopping. Best iteration:
[10] validation_0-auc:1
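
In newer XGBoost releases (roughly 1.6 onwards) early_stopping_rounds and eval_metric are configured on the estimator constructor rather than passed to fit; a hedged sketch assuming such a version:

# Sketch for newer XGBoost versions: early stopping configured on the constructor
clf = xgb.XGBClassifier(early_stopping_rounds=10, eval_metric="auc")
clf.fit(X_train, y_train, eval_set=[(X_val, y_val)])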

(4) Feature Importance

During the modeling process, XGBoost can also learn the corresponding feature importance information, which is saved in the model’s feature_importances_ attribute. Below is the code for visualizing feature importance:

iris = load_iris()
y = iris['target']
X = iris['data']
xgb_model = xgb.XGBClassifier().fit(X,y)

print('Feature ranking:')
feature_names=['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
feature_importances = xgb_model.feature_importances_
indices = np.argsort(feature_importances)[::-1]

for index in indices:
    print("Feature %s importance: %f" %(feature_names[index], feature_importances[index]))

%matplotlib inline
import matplotlib.pyplot as plt
plt.figure(figsize=(16,8))
plt.title("Feature importances")
plt.bar(range(len(feature_importances)), feature_importances[indices], color='b')
plt.xticks(range(len(feature_importances)), np.array(feature_names)[indices], color='b')
Feature ranking:
Feature petal_length importance: 0.415567
Feature petal_width importance: 0.291557
Feature sepal_length importance: 0.179420
Feature sepal_width importance: 0.113456
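
XGBoost also provides a built-in plotting helper, xgb.plot_importance, which draws the importance chart directly from the fitted model; a short sketch reusing xgb_model from above:

# Sketch: the built-in importance plot (requires matplotlib)
import matplotlib.pyplot as plt

xgb.plot_importance(xgb_model)
plt.show()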

(5) Parallel Training Acceleration

When multiple computing resources are available, XGBoost training and parameter search can be accelerated in parallel. Below is a sample code:

import os

if __name__ == "__main__":
    try:
        from multiprocessing import set_start_method
    except ImportError:
        raise ImportError("Unable to import multiprocessing.set_start_method."
                          " This example only runs on Python 3.4")
    # set_start_method("forkserver")

    import numpy as np
    from sklearn.model_selection import GridSearchCV
    from sklearn.datasets import load_boston
    import xgboost as xgb

    rng = np.random.RandomState(31337)

    print("Parallel Parameter optimization")
    boston = load_boston()

    os.environ["OMP_NUM_THREADS"] = "2"  # or to whatever you want
    y = boston['target']
    X = boston['data']
    xgb_model = xgb.XGBRegressor()
    clf = GridSearchCV(xgb_model, {'max_depth': [2, 4, 6],
                                   'n_estimators': [50, 100, 200]}, verbose=1,
                       n_jobs=2)
    clf.fit(X, y)
    print(clf.best_score_)
    print(clf.best_params_)
Parallel Parameter optimization
Fitting 3 folds for each of 9 candidates, totaling 27 fits

[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  24 out of 27 | elapsed:    2.2s remaining:    0.3s

0.6001029721598573
{'max_depth': 4, 'n_estimators': 100}

[Parallel(n_jobs=2)]: Done  27 out of  27 | elapsed:    2.4s finished

References

[1] Illustrated Machine Learning | Detailed Explanation of XGBoost Model: https://www.showmeai.tech/article-detail/194
[2] Illustrated Python | Installation and Environment Setup: https://www.showmeai.tech/article-detail/65
[3] Python Data Analysis | Core Functions of Pandas: https://www.showmeai.tech/article-detail/146
[4] Data Analysis Series Tutorials: https://www.showmeai.tech/tutorials/33
[5] Data Science Tools Quick Reference | Pandas User Guide: https://www.showmeai.tech/article-detail/101

Editor: Wang Jing

Proofreader: Qiu Tingting

