Su Gaosheng holds a master's degree in statistics from Southwestern University of Finance and Economics and currently works at China Telecom, mainly responsible for big data analysis and data modeling for existing enterprise customers. Research direction: machine learning. Favorite programming language: R, no exceptions.
E-mail: [email protected]
Zero, Case Background Introduction and Modeling Idea Explanation
1. Background Introduction
The data used in this case is from the Kaggle competition “Santander Customer Satisfaction”. This case is an imbalanced binary classification problem, aiming to maximize the AUC value (area under the ROC curve). The competition link is: https://www.kaggle.com/c/santander-customer-satisfaction. This competition has now ended.
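As a reminder of what the competition metric measures, here is a minimal base-R sketch of AUC computed through the rank-sum identity. The function name `auc_rank` and the toy data are hypothetical and only illustrate the metric; the rest of this case relies on mlr's built-in `auc` measure.

```r
# AUC via the rank-sum (Mann-Whitney) identity.
# labels: 0/1 vector; scores: predicted probability of class 1.
auc_rank <- function(labels, scores) {
  pos <- labels == 1
  n_pos <- sum(pos)
  n_neg <- sum(!pos)
  r <- rank(scores)  # tied scores receive average ranks
  (sum(r[pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hypothetical toy data: 2 positives among 6 cases
toy_labels <- c(0, 0, 1, 0, 1, 0)
toy_scores <- c(0.10, 0.40, 0.35, 0.20, 0.80, 0.05)
toy_auc <- auc_rank(toy_labels, toy_scores)
print(toy_auc)
```

The value is the share of (positive, negative) pairs in which the positive case gets the higher score, so it does not depend on any classification threshold.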
2. Modeling Idea
This document uses the mlr package in R (a comprehensive machine learning package) to call the XGBoost algorithm for classification.
1) Read data;
2) Data exploration: set up parallel computing, fill in missing values, check whether the classes are balanced, remove constant columns, and keep only the fields present in both the training dataset and the test dataset.
3) Feature selection:
I. Handle class imbalance (options include oversampling, undersampling, and ensemble methods). This case uses oversampling through the oversample function in the mlr package, and first determines a suitable oversampling ratio;
II. Use the generateFilterValuesData function of the mlr package to compute information gain, and keep the features that account for 95% of the cumulative information gain;
4) Parameter tuning: tune the oversampling rate, eta, max_depth, min_child_weight, gamma, colsample_bytree, etc. one at a time, and repeat the process until the results are satisfactory;
5) Ensemble prediction results: randomly select values within a suitable range for each parameter to build XGBoost models, and average the predictions of multiple models; the AUC of the results produced by the program in this case is .816584.
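Step 5 draws random parameter combinations instead of training on the full grid. The sketch below shows that idea on a small hypothetical grid (`toy_grid`, with made-up ranges); the real ranges appear in the last section of this case.

```r
set.seed(1)

# Hypothetical small grid: 3 * 2 * 2 = 12 parameter combinations
toy_grid <- expand.grid(
  max_depth = 10:12,
  gamma = c(25, 30),
  colsample_bytree = c(0.7, 0.9)
)

# Draw 4 distinct combinations; each row would configure one model
picked <- toy_grid[sample(nrow(toy_grid), 4), ]
print(picked)
```

Sampling rows without replacement guarantees that no two models in the ensemble share exactly the same configuration.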
One, Read Data
options(java.parameters = "-Xmx8g") ## Used during feature selection, but must be set before loading the package
library(readr)
xgb_tr1 <- read_csv("C:/Users/Administrator/kaggle/scs/train.csv")
xgb_te1 <- read_csv("C:/Users/Administrator/kaggle/scs/test.csv")
Two, Data Exploration
1. Set up parallel computing
library(dplyr)
library(mlr)
library(parallelMap)
parallelStartSocket(4)
2. Preliminary exploration of each column of data
summarizeColumns(xgb_tr1)
3. Handle missing values: impute missing values in integer and numeric columns with the column mean
imp_tr1 <- impute(
as.data.frame(xgb_tr1),
classes = list(
integer = imputeMean(),
numeric = imputeMean()
)
)
imp_te1 <- impute(
as.data.frame(xgb_te1),
classes = list(
integer = imputeMean(),
numeric = imputeMean()
)
)
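For readers who want to see what mean imputation does without mlr, here is a minimal base-R sketch on a hypothetical toy data frame. The helper `impute_mean` is an illustrative name, not part of mlr.

```r
# Hypothetical toy data with one missing value per column
toy <- data.frame(a = c(1, NA, 3), b = c(NA, 4L, 6L))

# Replace NA in numeric or integer columns with the column mean
impute_mean <- function(col) {
  if (is.numeric(col)) {
    col[is.na(col)] <- mean(col, na.rm = TRUE)
  }
  col
}

toy_imp <- as.data.frame(lapply(toy, impute_mean))
print(toy_imp)
```

Note that an integer column becomes a double column whenever its mean is inserted. The sketch imputes a single data frame; whether the test set should reuse the training means is a separate design choice that the code above leaves open.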
4. Observe the proportion of training data categories – data category imbalance
table(xgb_tr1$TARGET)
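Raw counts are easier to interpret as shares. A small sketch on a hypothetical target vector (`toy_target`, made up for illustration) shows how the imbalance could be quantified:

```r
# Hypothetical target: 96 negatives and 4 positives
toy_target <- c(rep(0, 96), rep(1, 4))

class_counts <- table(toy_target)
class_share <- prop.table(class_counts)  # share of each class
print(round(class_share, 3))

# Majority-to-minority ratio, a rough guide for the oversampling rate
imbalance_ratio <- max(class_counts) / min(class_counts)
print(imbalance_ratio)
```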
5. Remove constant columns from the dataset
xgb_tr2 <- removeConstantFeatures(imp_tr1$data)
xgb_te2 <- removeConstantFeatures(imp_te1$data)
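The idea behind removing constant columns can be sketched in base R as follows. This is only an illustration on hypothetical data; mlr's removeConstantFeatures has additional options that are not reproduced here.

```r
# Hypothetical toy data: column b never changes
toy <- data.frame(
  a = c(1, 2, 3),
  b = c(0, 0, 0),
  c = c("x", "x", "y"),
  stringsAsFactors = FALSE
)

# A column is informative only if it has more than one distinct value
is_varying <- vapply(toy, function(col) length(unique(col)) > 1, logical(1))
toy_clean <- toy[, is_varying, drop = FALSE]
print(colnames(toy_clean))
```

Because the training and test sets are filtered separately, they can end up with different columns, which is why the next step keeps only the columns they share.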
6. Retain the same columns in both the training dataset and the test dataset
tr2_name <- data.frame(tr2_name = colnames(xgb_tr2), stringsAsFactors = FALSE)
te2_name <- data.frame(te2_name = colnames(xgb_te2), stringsAsFactors = FALSE)
tr2_name_inner <- tr2_name %>%
  inner_join(te2_name, by = c('tr2_name' = 'te2_name'))
TARGET <- data.frame(TARGET = xgb_tr2$TARGET)
## Indexing from 2 drops the first shared column (ID in the competition data)
xgb_tr2 <- xgb_tr2[, c(tr2_name_inner$tr2_name[2:dim(tr2_name_inner)[1]])]
xgb_te2 <- xgb_te2[, c(tr2_name_inner$tr2_name[2:dim(tr2_name_inner)[1]])]
xgb_tr2 <- cbind(xgb_tr2, TARGET)
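The same column alignment can be expressed with base R's intersect. The sketch below uses hypothetical toy frames and assumes the identifier column is named ID, so it can be dropped by name instead of by position.

```r
# Hypothetical toy frames with partially overlapping columns
toy_train <- data.frame(ID = 1:3, var_a = c(1, 2, 3), var_b = c(0, 1, 0),
                        TARGET = c(0, 1, 0))
toy_test <- data.frame(ID = 4:5, var_a = c(4, 5), var_c = c(9, 9))

# Columns present in both sets, without the identifier
shared <- setdiff(intersect(colnames(toy_train), colnames(toy_test)), "ID")

toy_train2 <- toy_train[, c(shared, "TARGET"), drop = FALSE]
toy_test2 <- toy_test[, shared, drop = FALSE]
print(shared)
```

Dropping ID by name is slightly more robust than positional indexing, since it does not depend on the identifier being the first shared column.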
Three, Feature Selection – Information Gain
1. Build Basic Task
xgb_tr2$TARGET <- factor(xgb_tr2$TARGET)
xgb.task <- makeClassifTask(data = xgb_tr2, target = 'TARGET')
set.seed(0)
2. Oversampling Grid Search – Search for Oversampling Ratio
##### 1) Search Grid
grid_search <- expand.grid(
over_rate = seq(1, 30, 2))
##### 2) AUC Value Collection
perf_overrate_1 <- numeric(length = dim(grid_search)[1])
##### 3) Training
for (i in 1:dim(grid_search)[1]) {
  ## Oversampling task: take the rate from the grid, not the loop index
  xgb.task.over <- oversample(xgb.task, rate = grid_search$over_rate[i])
  ## Learning parameters
  xgb.ps <- makeParamSet(
    makeDiscreteParam('eta', values = .1)
  )
  ## Tuning control (grid search)
  xgb.ctrl <- makeTuneMultiCritControlGrid()
  ## Resampling description (stratified cross-validation)
  xgb.rdesc <- makeResampleDesc('CV', stratify = TRUE)
  ## Build learner
  xgb.learner <- makeLearner(
    'classif.xgboost',
    predict.type = 'prob'
  )
  ## Learning
  res <- tuneParamsMultiCrit(
    learner = xgb.learner,
    task = xgb.task.over,
    resampling = xgb.rdesc,
    par.set = xgb.ps,
    measures = list(kappa, tpr, tnr, auc),
    ### You can choose according to evaluation criteria
    control = xgb.ctrl,
    show.info = TRUE
  )
  ## AUC value
  perf_overrate_1[i] <- as.data.frame(trafoOptPath(res$opt.path))$auc.test.mean
}
##### 4) Result Table, the XXth model has the largest AUC
cat("Model ", which.max(perf_overrate_1), " is largest auc: ", max(perf_overrate_1), sep = "")
##### 5) The parameters of the model with the largest AUC are as follows:
print(grid_search[which.max(perf_overrate_1), ])
Conclusion: Take the oversampling ratio as rate=15
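To make the meaning of the rate concrete, here is a rough base-R illustration of oversampling the minority class by a given factor. The helper `oversample_minority` and the toy data are hypothetical, the sketch assumes rate >= 1, and it is not mlr's implementation.

```r
# Grow the minority class to about rate * (original minority size)
# by re-drawing minority rows with replacement. Assumes rate >= 1.
oversample_minority <- function(df, target, rate) {
  counts <- table(df[[target]])
  minority <- names(counts)[which.min(counts)]
  min_idx <- which(df[[target]] == minority)
  n_extra <- round(length(min_idx) * rate) - length(min_idx)
  # Sample positions first so a single minority row is handled safely
  extra_idx <- min_idx[sample.int(length(min_idx), n_extra, replace = TRUE)]
  df[c(seq_len(nrow(df)), extra_idx), , drop = FALSE]
}

set.seed(0)
toy <- data.frame(x = 1:100, TARGET = c(rep(0, 96), rep(1, 4)))
toy_over <- oversample_minority(toy, "TARGET", rate = 15)
over_counts <- table(toy_over$TARGET)
print(over_counts)
```

With 4 minority rows and rate = 15, the minority class grows to 60 rows while the 96 majority rows stay unchanged.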
3. Feature Selection
##### 1) Feature Selection
xgb.task.over <- oversample(xgb.task, rate = 15)
fv_time <- system.time(
  fv <- generateFilterValuesData(
    xgb.task.over,
    method = c('information.gain')
  )
)
##### 2) Plot to View
# plotFilterValues(fv)
# plotFilterValuesGGVIS(fv)
##### 3) Extract 95% Information Gain
fv_data2 <- fv$data %>%
  arrange(desc(information.gain)) %>%
  mutate(info_gain_cul = cumsum(information.gain) / sum(information.gain))
fv_data2_filter <- fv_data2 %>% filter(info_gain_cul <= 0.9508198)
dim(fv_data2_filter)
fv_feature <- fv_data2_filter$name
xgb_tr3 <- xgb_tr2[, c(fv_feature, 'TARGET')]
xgb_te3 <- xgb_te2[, fv_feature]
##### 4) Write Data
write_csv(xgb_tr3, 'C:/Users/Documents/kaggle/scs/xgb_tr3.csv')
write_csv(xgb_te3, 'C:/Users/Documents/kaggle/scs/xgb_te3.csv')
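The cumulative cutoff used above can be checked on a small example. The sketch below applies the same logic in base R to a hypothetical table of filter values (`toy_fv`), with an assumed threshold of 0.95.

```r
# Hypothetical filter values for four features
toy_fv <- data.frame(
  name = c("var_a", "var_b", "var_c", "var_d"),
  information.gain = c(0.02, 0.04, 0.01, 0.03),
  stringsAsFactors = FALSE
)

# Sort by gain, then compute each feature's cumulative share of the total
toy_fv <- toy_fv[order(-toy_fv$information.gain), ]
toy_fv$cum_share <- cumsum(toy_fv$information.gain) / sum(toy_fv$information.gain)

# Keep features until the cumulative share would exceed the threshold
kept <- toy_fv$name[toy_fv$cum_share <= 0.95]
print(kept)
```

Here the cumulative shares are 0.4, 0.7, 0.9 and 1.0, so the weakest feature is dropped and the other three are kept.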
Four, Parameter Tuning – Oversampling & Undersampling
1. Build Basic Task
library(mlr)
xgb.task <- makeClassifTask(data = xgb_tr3, target = 'TARGET')
2. Oversampling Grid Search – Search for Oversampling Ratio
##### 1) Search Grid
grid_search <- expand.grid(
over_rate = seq(1, 30, 2)
)
##### 2) AUC Value Collection
perf_overrate_1 <- numeric(length = dim(grid_search)[1])
##### 3) Training
for (i in 1:dim(grid_search)[1]) {
  ## Oversampling task: take the rate from the grid, not the loop index
  xgb.task.over <- oversample(xgb.task, rate = grid_search$over_rate[i])
  ## Learning parameters
  xgb.ps <- makeParamSet(
    makeDiscreteParam('eta', values = .1)
  )
  ## Tuning control (grid search)
  xgb.ctrl <- makeTuneMultiCritControlGrid()
  ## Resampling description (stratified cross-validation)
  xgb.rdesc <- makeResampleDesc('CV', stratify = TRUE)
  ## Build learner
  xgb.learner <- makeLearner(
    'classif.xgboost',
    predict.type = 'prob'
  )
  ## Learning
  res <- tuneParamsMultiCrit(
    learner = xgb.learner,
    task = xgb.task.over,
    resampling = xgb.rdesc,
    par.set = xgb.ps,
    measures = list(kappa, tpr, tnr, auc),
    ### You can choose according to evaluation criteria
    control = xgb.ctrl,
    show.info = TRUE
  )
  ## AUC value
  perf_overrate_1[i] <- as.data.frame(trafoOptPath(res$opt.path))$auc.test.mean
}
##### 4) Result Table, the XXth model has the largest AUC
cat("Model ", which.max(perf_overrate_1), " is largest auc: ", max(perf_overrate_1), sep = "")
##### 5) The parameters of the model with the largest AUC are as follows:
print(grid_search[which.max(perf_overrate_1), ])
Conclusion: Take rate = 19
3. Oversampling Grid Search – Search for Learning Rate
##### 1) Learning Task
xgb.task.over <- oversample(xgb.task, rate = 19)
##### 2) Learning Parameters
xgb.ps <- makeParamSet(
  makeDiscreteParam('eta', values = 2 ^ (-(8:1)))
)
##### 3) Tuning Control (Grid Search)
xgb.ctrl <- makeTuneMultiCritControlGrid()
##### 4) Resampling Description (Stratified Cross-Validation)
xgb.rdesc <- makeResampleDesc('CV', stratify = TRUE)
##### 5) Build Learner
xgb.learner <- makeLearner(
  'classif.xgboost',
  predict.type = 'prob'
)
##### 6) Learning
res <- tuneParamsMultiCrit(
  learner = xgb.learner,
  task = xgb.task.over,
  resampling = xgb.rdesc,
  par.set = xgb.ps,
  measures = list(kappa, tpr, tnr, auc),
  ### You can choose according to evaluation criteria
  control = xgb.ctrl,
  show.info = TRUE
)
##### 7) AUC Value Collection
perf_eta_1 <- as.data.frame(trafoOptPath(res$opt.path))
Conclusion: AUC is completely insensitive to eta.
4. Oversampling Grid Search – Search for Maximum Tree Depth
##### 1) Learning Task
xgb.task.over <- oversample(xgb.task, rate = 19)
##### 2) Learning Parameters
xgb.ps <- makeParamSet(
  makeDiscreteParam('eta', values = .1),
  makeDiscreteParam('max_depth', values = seq(4, 25, 1)),
  makeDiscreteParam('gamma', values = 10)
)
##### 3) Tuning Control (Grid Search)
xgb.ctrl <- makeTuneMultiCritControlGrid()
##### 4) Resampling Description (Stratified Cross-Validation)
xgb.rdesc <- makeResampleDesc('CV', stratify = TRUE)
##### 5) Build Learner
xgb.learner <- makeLearner(
  'classif.xgboost',
  predict.type = 'prob'
)
##### 6) Learning
res <- tuneParamsMultiCrit(
  learner = xgb.learner,
  task = xgb.task.over,
  resampling = xgb.rdesc,
  par.set = xgb.ps,
  measures = list(kappa, tpr, tnr, auc),
  ### You can choose according to evaluation criteria
  control = xgb.ctrl,
  show.info = TRUE
)
##### 7) AUC Value Collection
perf_maxdepth_1 <- as.data.frame(trafoOptPath(res$opt.path))
plot(perf_maxdepth_1$auc.test.mean)
Conclusion: AUC still increases monotonically with max_depth, but the rate of increase slows down; therefore set max_depth = 15 (the inflection point) and do not increase it further.
5. Oversampling Grid Search – Search for Gamma
##### 1) Learning Task:
xgb.task.over <- oversample(xgb.task, rate = 15)
##### 2) Learning Parameters
xgb.ps <- makeParamSet(
  makeDiscreteParam('eta', values = .1),
  makeDiscreteParam('max_depth', values = 15),
  makeDiscreteParam('min_child_weight', values = 2),
  makeDiscreteParam('gamma', values = 2^(-3:3))
)
perf_gamma_1 <- numeric(length = length(xgb.ps$pars$gamma$values))
##### 3) Tuning Control (Grid Search)
xgb.ctrl <- makeTuneMultiCritControlGrid()
##### 4) Resampling Description (Stratified Cross-Validation)
xgb.rdesc <- makeResampleDesc('CV', stratify = TRUE)
##### 5) Build Learner
xgb.learner <- makeLearner(
  'classif.xgboost',
  predict.type = 'prob'
)
##### 6) Learning
res <- tuneParamsMultiCrit(
  learner = xgb.learner,
  task = xgb.task.over,
  resampling = xgb.rdesc,
  par.set = xgb.ps,
  measures = list(kappa, tpr, tnr, auc), ### You can choose according to evaluation criteria
  control = xgb.ctrl,
  show.info = TRUE
)
##### 7) AUC Value Collection
perf_gamma_1 <- as.data.frame(trafoOptPath(res$opt.path))
Conclusion: AUC decreases monotonically with gamma, but the decrease is small.
6. Oversampling Grid Search Again – Gamma
##### 1) Learning Task
xgb.task.over <- oversample(xgb.task, rate = 19)
##### 2) Learning Parameters
xgb.ps <- makeParamSet(
  makeDiscreteParam('eta', values = .1),
  makeDiscreteParam('max_depth', values = 15),
  makeDiscreteParam('min_child_weight', values = 1),
  makeDiscreteParam('gamma', values = seq(10, 45, by = 2))
)
perf_gamma_2 <- numeric(length = length(xgb.ps$pars$gamma$values))
##### 3) Tuning Control (Grid Search)
xgb.ctrl <- makeTuneMultiCritControlGrid()
##### 4) Resampling Description (Stratified Cross-Validation)
xgb.rdesc <- makeResampleDesc('CV', stratify = TRUE)
##### 5) Build Learner
xgb.learner <- makeLearner(
  'classif.xgboost',
  predict.type = 'prob'
)
##### 6) Learning
res <- tuneParamsMultiCrit(
  learner = xgb.learner,
  task = xgb.task.over,
  resampling = xgb.rdesc,
  par.set = xgb.ps,
  measures = list(kappa, tpr, tnr, auc), ### You can choose according to evaluation criteria
  control = xgb.ctrl,
  show.info = TRUE
)
##### 7) AUC Value Collection
perf_gamma_2 <- as.data.frame(trafoOptPath(res$opt.path))
Conclusion: AUC decreases monotonically with gamma, so tentatively set gamma = 23.
7. Oversampling Grid Search – Search for min_child_weight
##### 1) Learning Task
xgb.task.over <- oversample(xgb.task, rate = 19)
##### 2) Learning Parameters
xgb.ps <- makeParamSet(
  makeDiscreteParam('eta', values = .1),
  makeDiscreteParam('max_depth', values = 15),
  makeDiscreteParam('gamma', values = 23),
  makeDiscreteParam('min_child_weight', values = 2 ^ (0:5))
)
##### 3) Tuning Control (Grid Search)
xgb.ctrl <- makeTuneMultiCritControlGrid()
##### 4) Resampling Description (Stratified Cross-Validation)
xgb.rdesc <- makeResampleDesc('CV', stratify = TRUE)
##### 5) Build Learner
xgb.learner <- makeLearner(
  'classif.xgboost',
  predict.type = 'prob'
)
##### 6) Learning
res <- tuneParamsMultiCrit(
  learner = xgb.learner,
  task = xgb.task.over,
  resampling = xgb.rdesc,
  par.set = xgb.ps,
  measures = list(kappa, tpr, tnr, auc), ### You can choose according to evaluation criteria
  control = xgb.ctrl,
  show.info = TRUE
)
##### 7) AUC Value Collection
perf_minchildweight_1 <- as.data.frame(trafoOptPath(res$opt.path))
Conclusion: AUC first rises and then falls as min_child_weight increases, reaching its maximum of 0.9191293 at min_child_weight = 2, so set min_child_weight = 2.
8. Oversampling Grid Search – colsample_bytree
##### 1) Learning Task
xgb.task.over <- oversample(xgb.task, rate = 19)
##### 2) Learning Parameters
xgb.ps <- makeParamSet(
  makeDiscreteParam('eta', values = .1),
  makeDiscreteParam('max_depth', values = 15),
  makeDiscreteParam('min_child_weight', values = 2),
  makeDiscreteParam('gamma', values = 23),
  makeDiscreteParam('colsample_bytree', values = seq(.1, 1, .1))
)
##### 3) Tuning Control (Grid Search)
xgb.ctrl <- makeTuneMultiCritControlGrid()
##### 4) Resampling Description (Stratified Cross-Validation)
xgb.rdesc <- makeResampleDesc('CV', stratify = TRUE)
##### 5) Build Learner
xgb.learner <- makeLearner(
  'classif.xgboost',
  predict.type = 'prob'
)
##### 6) Learning
res <- tuneParamsMultiCrit(
  learner = xgb.learner,
  task = xgb.task.over,
  resampling = xgb.rdesc,
  par.set = xgb.ps,
  measures = list(kappa, tpr, tnr, auc), ### You can choose according to evaluation criteria
  control = xgb.ctrl,
  show.info = TRUE
)
##### 7) AUC Value Collection
perf_colsamplebytree_1 <- as.data.frame(trafoOptPath(res$opt.path))
Conclusion: AUC monotonically increases with colsample_bytree, so set colsample_bytree = 1.
9. Oversampling Grid Search – Search for Oversampling Ratio Again
##### 1) Search Grid
grid_search <- expand.grid(
over_rate = seq(1, 30)
)
##### 2) AUC Value Collection
perf_overrate <- numeric(length = dim(grid_search)[1])
##### 3) Training
for (i in 1:dim(grid_search)[1]) {
  ## Oversampling task: take the rate from the grid
  xgb.task.over <- oversample(xgb.task, rate = grid_search$over_rate[i])
  ## Learning parameters
  xgb.ps <- makeParamSet(
    makeDiscreteParam('eta', values = .1),
    makeDiscreteParam('max_depth', values = 15),
    makeDiscreteParam('min_child_weight', values = 2),
    makeDiscreteParam('gamma', values = 23),
    makeDiscreteParam('colsample_bytree', values = 1)
  )
  ## Tuning control (grid search)
  xgb.ctrl <- makeTuneMultiCritControlGrid()
  ## Resampling description (stratified cross-validation)
  xgb.rdesc <- makeResampleDesc('CV', stratify = TRUE)
  ## Build learner
  xgb.learner <- makeLearner(
    'classif.xgboost',
    predict.type = 'prob'
  )
  ## Learning
  res <- tuneParamsMultiCrit(
    learner = xgb.learner,
    task = xgb.task.over,
    resampling = xgb.rdesc,
    par.set = xgb.ps,
    measures = list(kappa, tpr, tnr, auc), ### You can choose according to evaluation criteria
    control = xgb.ctrl,
    show.info = TRUE
  )
  ## AUC value
  perf_overrate[i] <- as.data.frame(trafoOptPath(res$opt.path))$auc.test.mean
}
##### 4) Result Table, the XXth model has the largest AUC
cat("Model ", which.max(perf_overrate), " is largest auc: ", max(perf_overrate), sep = "")
##### 5) The parameters of the model with the largest AUC are as follows:
print(grid_search[which.max(perf_overrate), ])
Conclusion: AUC first rises and then falls as over_rate increases, peaking at 0.9232057 when over_rate = 25.
However, the plot(perf_overrate) graph shows that over_rate = 18 is the most suitable setting.
10. Oversampling Grid Search – max_depth
##### 1) Learning Task
xgb.task.over <- oversample(xgb.task, rate = 18)
##### 2) Learning Parameters
xgb.ps <- makeParamSet(
  makeDiscreteParam('eta', values = .1),
  makeDiscreteParam('max_depth', values = seq(5, 29, 2)),
  makeDiscreteParam('min_child_weight', values = 2),
  makeDiscreteParam('gamma', values = 23),
  makeDiscreteParam('colsample_bytree', values = 1)
)
##### 3) Tuning Control (Grid Search)
xgb.ctrl <- makeTuneMultiCritControlGrid()
##### 4) Resampling Description (Stratified Cross-Validation)
xgb.rdesc <- makeResampleDesc('CV', stratify = TRUE)
##### 5) Build Learner
xgb.learner <- makeLearner(
  'classif.xgboost',
  predict.type = 'prob'
)
##### 6) Learning
res <- tuneParamsMultiCrit(
  learner = xgb.learner,
  task = xgb.task.over,
  resampling = xgb.rdesc,
  par.set = xgb.ps,
  measures = list(kappa, tpr, tnr, auc), ### You can choose according to evaluation criteria
  control = xgb.ctrl,
  show.info = TRUE
)
##### 7) AUC Value Collection
perf_maxdepth_2 <- as.data.frame(trafoOptPath(res$opt.path))
Conclusion: AUC increases with max_depth, but the computational load also grows as the trees get deeper; therefore set max_depth = 17 (the inflection point) and do not increase it further.
11. Train Multiple Models with the Above Parameters and Integrate Results
##### 0) Parameters
set.seed(1)
grid_search <- expand.grid(
  over_rate = sample(13:29, 5, replace = FALSE),
  max_depth = sample(10:25, 5, replace = FALSE),
  min_child_weight = sample(2:4, 2, replace = FALSE),
  gamma = sample(25:40, 10, replace = FALSE),
  colsample_bytree = sample(seq(.7, .95, .02), 10, replace = FALSE)
)
## The grid has 5 * 5 * 2 * 10 * 10 = 5000 rows; draw 100 of them at random
sample_ind <- sample(nrow(grid_search), 100, replace = FALSE)
xgb.pred <- list()
grid_search2 <- grid_search[sample_ind, ]
for (i in 1:nrow(grid_search2)) {
  ##### 1) Build Learning Task
  xgb.task.over <- oversample(xgb.task, rate = grid_search2[i, 'over_rate'])
  ##### 2) Set Model Parameters
  xgb.ps <- list(
    eta = .1,
    max_depth = grid_search2[i, 'max_depth'],
    min_child_weight = grid_search2[i, 'min_child_weight'],
    gamma = grid_search2[i, 'gamma'],
    colsample_bytree = grid_search2[i, 'colsample_bytree']
  )
  ##### 3) Build Learner
  xgb.lrn.over <- makeLearner(
    cl = 'classif.xgboost',
    predict.type = 'prob',
    fix.factors.prediction = FALSE,
    par.vals = xgb.ps
  )
  ##### 4) Train Model
  xgb.train.over <- train(
    learner = xgb.lrn.over,
    task = xgb.task.over
  )
  ##### 5) Prediction
  xgb.pred[[i]] <- predict(xgb.train.over, newdata = xgb_te3)
}
##### Ensemble Prediction Results
xgb.pred1 <- list()
for (i in 1:nrow(grid_search2)) {
  ## Predicted probability of class 1 from model i
  xgb.pred1[[i]] <- xgb.pred[[i]]$data$prob.1
}
## One column per model, one row per test observation
xgb.pred2 <- matrix(unlist(xgb.pred1), ncol = nrow(grid_search2))
## Average the probabilities across models
xgb.pred3 <- data.frame(prob1 = rowMeans(xgb.pred2))
##### Output Results
write_csv(xgb.pred3, "C:/Users/Administrator/kaggle/scs/xgb.pred.over1.csv")
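The averaging step can be verified on a small example. The sketch below uses three hypothetical prediction vectors (`toy_pred_list`) in place of the 100 model outputs.

```r
# Hypothetical class-1 probabilities from three models for three test rows
toy_pred_list <- list(
  c(0.10, 0.80, 0.30),
  c(0.20, 0.60, 0.40),
  c(0.30, 0.70, 0.20)
)

# One column per model, one row per observation
toy_pred_mat <- do.call(cbind, toy_pred_list)

# Simple unweighted average across models
toy_avg <- rowMeans(toy_pred_mat)
print(toy_avg)
```

Since AUC depends only on the ranking of the scores, an unweighted average is a reasonable default; weighting models by their cross-validated AUC would be one possible variation.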