XGBoost Chinese Documentation Now Open

Organized by Machine Heart

Author: Jiang Siyuan

Recently, ApacheCN opened the XGBoost Chinese documentation project, which provides installation steps, usage tutorials, and tuning tips for XGBoost. The project has currently completed about 90% of the original English documentation. Machine Heart briefly introduces the documentation here and hopes that readers will help improve it.

Gradient boosting trees have proven effective for classification and regression tasks in predictive mining. Previously, the boosting tree algorithm of choice was MART (multiple additive regression trees). Since 2015, however, a new and consistently winning algorithm has emerged: XGBoost. This algorithm re-implements tree boosting and has achieved excellent results in Kaggle and other data science competitions, which is why it has become so popular.

Before introducing XGBoost, proposed by Chen Tianqi and others, we need to understand a few concepts behind boosting methods. Boosting methods are learning algorithms that fit the data with multiple simpler models, also known as base learners or weak learners. Boosting learns multiple classifiers by changing the weights of the training samples and combines these classifiers linearly to improve classification performance.

The AdaBoost algorithm, for example, increases the weights of the samples misclassified by the previous round's weak classifier and decreases the weights of the correctly classified samples. The misclassified data therefore receive more attention from the next weak classifier, so the classification problem can be solved by a sequence of weak classifiers.
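
The documentation itself does not include an AdaBoost example, but as a rough illustration of this reweighting idea, the following sketch trains scikit-learn's AdaBoostClassifier (whose default weak learners are depth-1 decision trees) on synthetic data; the dataset and parameter values are assumptions chosen purely for demonstration.

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (illustrative only)
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 50 boosting rounds; each round reweights the samples that the previous
# weak classifier (a depth-1 tree by default) got wrong
clf = AdaBoostClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))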

If we combine the boosting algorithm with tree methods, we obtain the boosting tree algorithm, which has produced outstanding results in many Kaggle competitions. A boosted tree model can be viewed as an adaptive basis function model whose basis functions are classification and regression trees. Because the boosted tree model is the sum of multiple tree models, it is also referred to as a tree ensemble or additive tree model. In general, boosted trees use very shallow classification and regression trees, i.e., trees with only a few leaf nodes. Compared with deeper trees, such trees have lower variance but greater bias.
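
To make the additive tree idea concrete, here is a minimal sketch (not taken from the documentation) that repeatedly fits a shallow regression tree to the residuals of the current model and adds it to the ensemble; the synthetic data, learning rate, and tree depth are illustrative assumptions.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
trees = []
pred = np.zeros_like(y)                          # start from the zero model
for _ in range(100):
    residual = y - pred                          # negative gradient of squared error
    tree = DecisionTreeRegressor(max_depth=2)    # shallow base learner: low variance, higher bias
    tree.fit(X, residual)
    trees.append(tree)
    pred += learning_rate * tree.predict(X)      # additive update: the model is a sum of trees

print("training MSE:", np.mean((y - pred) ** 2))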

Therefore, with the help of boosted tree models (i.e., adaptively determined neighborhoods), MART and XGBoost can generally achieve a better fit than other methods. They can perform automatic feature selection and capture high-order interactions without breaking down.

Comparing MART and XGBoost: while MART sets the same number of leaf nodes for all trees, XGBoost can grow deeper trees by setting Tmax together with a regularization parameter while still keeping the variance low. Compared with the gradient boosting used in MART, the Newton boosting used by XGBoost is likely to learn better tree structures. XGBoost also includes an additional randomization parameter, column subsampling, which helps further reduce the correlation between individual trees.
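
As a hedged sketch of how these knobs appear in practice, the parameter dictionary below sets a maximum tree depth, L2 regularization, a minimum split gain, and column subsampling; the specific values are illustrative assumptions rather than recommendations from the documentation.

import xgboost as xgb

# Illustrative values only: deeper trees are allowed via max_depth, but
# growth is constrained by regularization, and each tree sees only a
# random subset of the columns (column subsampling).
param = {
    'max_depth': 6,            # cap on tree depth (a Tmax-style limit)
    'lambda': 1.0,             # L2 regularization on leaf weights
    'gamma': 0.1,              # minimum loss reduction required to make a split
    'colsample_bytree': 0.8,   # column subsampling to decorrelate the trees
    'objective': 'binary:logistic',
}
# bst = xgb.train(param, dtrain, num_boost_round=100)  # dtrain: an xgb.DMatrix as in the usage example below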

In summary, XGBoost outperforms the general MART algorithm in many aspects, bringing an improved method for boosting trees.

In this recent project, ApacheCN has opened Chinese documentation for XGBoost: scalable and flexible gradient boosting.

This project includes a complete installation guide:

The installation page provides instructions on how to build and install the xgboost package on various operating systems. It consists of the following two steps:

1. First, build the shared library from the C++ code (libxgboost.so for Linux/OSX and libxgboost.dll for Windows).

  • Exception: For the installation of the R package, please refer directly to the R package section.

2. Then, install the relevant programming language packages (e.g., Python package).

Important: The latest version of xgboost uses submodules to maintain the package, so when you clone the repo, remember to use the recursive option as follows.

git clone --recursive https://github.com/dmlc/xgboost

For Windows users using GitHub tools, you can open Git Shell and enter the following commands.

git submodule init
git submodule update

If you encounter any issues during installation, please first refer to the troubleshooting section. If the instructions there do not solve your problem, feel free to ask a question at xgboost-doc-zh/issues, or, better yet, open a pull request if you can resolve the issue yourself.

Contents of the installation page

  • Build Shared Library

  1. Build on Ubuntu/Debian

  2. Build on OSX

  3. Build on Windows

  4. Custom Build

  • Python Package Installation

  • R Package Installation

  • Troubleshooting

In addition to installation, the learning tutorial page covers boosted trees, distributed XGBoost on AWS YARN, and the DART booster. These three tutorials include detailed derivations or implementation steps and are the official tutorials shipped with the XGBoost package.
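
For readers who want to try the DART booster directly, here is a hedged sketch based on the official DART tutorial; the data files follow the usage example later in this article, and the dropout-related values are illustrative assumptions.

import xgboost as xgb

# The demo agaricus files are the same ones used in the usage example below
dtrain = xgb.DMatrix('demo/data/agaricus.txt.train')
dtest = xgb.DMatrix('demo/data/agaricus.txt.test')

param = {
    'booster': 'dart',           # use DART (dropout) instead of the default gbtree
    'max_depth': 2,
    'eta': 1,
    'objective': 'binary:logistic',
    'sample_type': 'uniform',    # how trees are selected for dropout
    'normalize_type': 'tree',    # how new trees are weighted after dropout
    'rate_drop': 0.1,            # fraction of previous trees dropped each round
    'skip_drop': 0.5,            # probability of skipping dropout in a round
}
num_round = 50
bst = xgb.train(param, dtrain, num_round)
# The official DART tutorial notes that prediction should use the full set of
# trees so that dropout is not applied at prediction time.
preds = bst.predict(dtest)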

When we actually use XGBoost, a very important step is parameter tuning. The parameter tuning section of the documentation explains how to understand the bias-variance trade-off, control overfitting, and handle imbalanced datasets.
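
As a hedged illustration of the kinds of knobs that section discusses, the sketch below limits model complexity, adds randomness, and reweights the positive class for an imbalanced dataset; all values are assumptions for demonstration, not recommendations from the documentation.

import xgboost as xgb

# Illustrative tuning choices only
param = {
    'objective': 'binary:logistic',
    'eta': 0.1,                # smaller step size, usually compensated by more rounds
    'max_depth': 4,            # shallower trees to limit model complexity
    'min_child_weight': 5,     # require more evidence in a leaf before splitting further
    'subsample': 0.8,          # row subsampling per tree
    'colsample_bytree': 0.8,   # column subsampling per tree
    'scale_pos_weight': 10,    # e.g. roughly (#negative / #positive) for imbalanced data
    'eval_metric': 'auc',      # AUC is a common metric when classes are imbalanced
}
# bst = xgb.train(param, dtrain, num_boost_round=200)  # dtrain: an xgb.DMatrix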

Furthermore, the documentation provides essential details of the XGBoost workflow, including the data interfaces (LibSVM text format, NumPy 2D arrays, and XGBoost binary cache files), parameter settings, training, prediction, and plotting. Here is a Python summary:

import xgboost as xgb
# Read data
dtrain = xgb.DMatrix('demo/data/agaricus.txt.train')
dtest = xgb.DMatrix('demo/data/agaricus.txt.test')
# Specify parameters through map
param = {'max_depth': 2, 'eta': 1, 'silent': 1, 'objective': 'binary:logistic'}
num_round = 2
bst = xgb.train(param, dtrain, num_round)
# Predict
preds = bst.predict(dtest)
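
Beyond the LibSVM text files above, the other data interfaces mentioned in the documentation can be exercised with a short, hedged sketch like the following; the random data are an assumption for demonstration, and the plotting helpers require matplotlib.

import numpy as np
import xgboost as xgb

# A DMatrix can also be built from a NumPy 2D array plus a label vector
data = np.random.rand(100, 10)
label = np.random.randint(2, size=100)
dtrain = xgb.DMatrix(data, label=label)

# Save to XGBoost's binary cache format and reload it (faster than parsing text)
dtrain.save_binary('train.buffer')
dtrain2 = xgb.DMatrix('train.buffer')

bst = xgb.train({'max_depth': 2, 'eta': 1, 'objective': 'binary:logistic'}, dtrain2, 10)

# Plotting helpers (uncomment if matplotlib is installed)
# xgb.plot_importance(bst)
# xgb.plot_tree(bst, num_trees=0)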

This article was compiled by Machine Heart. Please contact this official account for authorization to reprint.
