Understanding Ten Models: Lasso, Bayesian, KNN, Logistic, Decision Trees, Random Forests, SVM, Neural Networks, XGBoost, LightGBM

Machine Learning

Machine learning is an important branch of artificial intelligence, playing an increasingly significant role in areas such as data analysis, image recognition, and natural language processing. Its basic concern is how to enable computers to learn from data and make predictions. R, as a powerful tool for statistical analysis and graphics, holds a firm place in machine learning thanks to its rich package ecosystem and flexible data handling. Today we begin the first article in this R machine learning series, focusing on data preparation and batch installation of packages.

Definition and Working Principle of Machine Learning

Machine learning is a discipline that studies how to enable computer systems to learn from data and improve performance. It achieves automatic processing and analysis of data by training models to identify patterns, predict trends, and make decisions.

Machine learning algorithms learn from large amounts of data, extracting useful features and building models to predict new data. These models can be continuously optimized to adapt to different types of data and tasks. Common machine learning algorithms include KNN, decision trees, random forests, Bayesian methods, etc.

First, we open RStudio. You are probably already familiar with installing and loading individual packages:

rm(list = ls())    # Remove all objects from the workspace
install.packages() # Install a package (pass the package name)
library()          # Load an installed package

Then, to read and write Excel files, we need the readxl and writexl packages:

# Read Excel data
install.packages("readxl")
library(readxl) # Load package
data <- read_excel("path/example_data.xlsx")

# Write Excel data back out
install.packages("writexl")
library(writexl) # Load package
write_xlsx(data, "path/summary.xlsx")

Today, we will use a sample data set of 1,000 cases, which we read into RStudio with readxl.
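Before going further, it is worth confirming that the import worked. A minimal sketch, assuming the placeholder path above and one row per case:

str(data)     # check variable types (numeric indicators, outcome, etc.)
head(data)    # preview the first few rows
dim(data)     # should report 1000 rows for this example
summary(data) # quick overview of each variable's distribution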


Next, we batch-install the packages we will need. We can check, install, and load everything in one loop:

# Define the list of packages needed for machine learning; check whether each
# package is installed, and install and load it if not
packages <- c("readxl", "ggplot2", "caret", "lattice", "gmodels",
              "glmnet", "Matrix", "pROC", "Hmisc", "rms",
              "tidyverse", "Boruta", "car", "carData",
              "rmda", "dplyr", "rpart", "rattle", "tibble", "bitops",
              "probably", "tidymodels", "fastshap",
              "shapviz", "e1071")

for (pkg in packages) {
  if (!require(pkg, character.only = TRUE, quietly = TRUE)) { # character.only = TRUE lets require() accept a variable
    install.packages(pkg, dependencies = TRUE)
    library(pkg, character.only = TRUE)
  }
}

# Load all the packages at once
lapply(packages, library, character.only = TRUE)
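After the loop, you can also confirm that nothing silently failed to install. A quick check, as a sketch:

# List any packages that are still not installed (an empty result means success)
missing <- packages[!sapply(packages, requireNamespace, quietly = TRUE)]
missing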

Alternatively, we can install everything in one batch:

# Define the package list and install everything in one batch
packages <- c("readxl", "ggplot2", "caret",
              "lattice", "gmodels", "glmnet", "Matrix", "pROC",
              "Hmisc", "rms", "tidyverse", "Boruta", "car",
              "carData", "rmda", "dplyr", "rpart", "rattle", "tibble",
              "bitops", "probably", "tidymodels", "fastshap",
              "shapviz", "e1071")

install.packages(packages)

# Load multiple packages at once
lapply(packages, library, character.only = TRUE)

Furthermore, we can run some simple statistical analyses:

# Independent-samples t-tests (equal variances assumed)
t.test(indicator1 ~ outcome, data = data, var.equal = TRUE)
t.test(indicator2 ~ outcome, data = data, var.equal = TRUE)
t.test(indicator3 ~ outcome, data = data, var.equal = TRUE)
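Note that var.equal = TRUE assumes the two groups have equal variances. If that assumption is doubtful, you can test it first, or simply use Welch's t-test (R's default). A sketch with the same hypothetical column names:

var.test(indicator1 ~ outcome, data = data) # F test for equality of variances
t.test(indicator1 ~ outcome, data = data)   # Welch's t-test (var.equal = FALSE by default)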


# Chi-square test (CrossTable() is from the gmodels package)
CrossTable(data$outcome, data$indicator8, expected = TRUE, chisq = TRUE,
           fisher = TRUE, mcnemar = TRUE, format = "SPSS")
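If gmodels is unavailable, the same tests can be run in base R. A minimal sketch with the same hypothetical columns:

tab <- table(data$outcome, data$indicator8) # build the contingency table
chisq.test(tab)  # Pearson chi-square test
fisher.test(tab) # Fisher's exact test, preferred when expected counts are small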


# Logistic regression
data$Group <- as.factor(data$outcome)
model1 <- glm(Group ~ indicator1, data = data, family = "binomial")
summary(model1)
model2 <- glm(Group ~ indicator1 + indicator2, data = data, family = "binomial")
summary(model2)
model3 <- glm(Group ~ indicator1 + indicator2 + indicator3, data = data, family = "binomial")
summary(model3)
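Because the three models are nested, they can be compared with likelihood-ratio tests, and the coefficients are easier to interpret as odds ratios. A sketch (the confidence intervals use profile likelihood, so this may take a moment):

exp(cbind(OR = coef(model3), confint(model3))) # odds ratios with 95% CIs
anova(model1, model2, model3, test = "Chisq")  # likelihood-ratio tests between nested models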


# ROC curves (roc() is from the pROC package)
roc1 <- roc(data$outcome, data$indicator1); roc1
roc2 <- roc(data$outcome, data$indicator2); roc2
roc3 <- roc(data$outcome, data$indicator3); roc3

plot(roc1,
     max.auc.polygon = FALSE,           # Do not shade the maximum-AUC region
     smooth = FALSE,                    # Draw the unsmoothed curve
     main = "Comparison of ROC curves", # Add title
     col = "red",                       # Curve color
     legacy.axes = TRUE)                # x-axis runs from 0 to 1 as 1 - specificity

plot.roc(roc2,
         add = TRUE,     # Add curve to the existing plot
         col = "orange", # Curve color is orange
         smooth = FALSE) # Draw the unsmoothed curve

plot.roc(roc3,
         add = TRUE,     # Add curve to the existing plot
         col = "yellow", # Curve color is yellow
         smooth = FALSE) # Draw the unsmoothed curve
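pROC can also attach confidence intervals to the AUCs and formally compare two curves, and a legend makes the combined plot readable. A sketch:

ci.auc(roc1)         # 95% CI for the AUC of indicator1
roc.test(roc1, roc3) # DeLong test comparing two ROC curves
legend("bottomright", legend = c("indicator1", "indicator2", "indicator3"),
       col = c("red", "orange", "yellow"), lwd = 2)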


# Nomogram (datadist() and lrm() are from the rms package)
dd <- datadist(data)
options(datadist = "dd")
data$Group <- as.factor(data$outcome)
f_lrm <- lrm(Group ~ indicator1 + indicator2 + indicator3 + indicator4 +
               indicator5 + indicator6 + indicator7 + indicator8, data = data)
summary(f_lrm)

par(mgp = c(1.6, 0.6, 0), mar = c(5, 5, 3, 1))
nomo <- nomogram(f_lrm,
                 fun = function(x) 1 / (1 + exp(-x)), # inverse logit: map to probability scale
                 fun.at = c(0.01, 0.05, 0.2, 0.5, 0.8, 0.95, 1),
                 funlabel = "Prob of outcome",
                 conf.int = FALSE,
                 abbrev = FALSE) # "nomo" avoids shadowing the nomogram() function
plot(nomo)
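Before relying on the nomogram, the rms workflow lets you bootstrap-validate and calibrate the model. Note that lrm() must keep the design matrix (x = TRUE, y = TRUE) for resampling to work; a sketch, with B = 200 as an arbitrary number of bootstrap repetitions:

f_val <- update(f_lrm, x = TRUE, y = TRUE)        # refit, keeping the design matrix
validate(f_val, method = "boot", B = 200)         # optimism-corrected indices (Dxy, slope, ...)
cal <- calibrate(f_val, method = "boot", B = 200) # bootstrap calibration
plot(cal)                                         # calibration curve: predicted vs. observed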

Training Set and Test Set

Splitting the data into a training set and a test set is a standard practice in machine learning: the data are partitioned so that the model can be developed on one portion and evaluated on the other.

The training set is the group of data used to fit the model; in supervised learning, these samples carry labels indicating the attribute to be predicted. To train an effective model, the training set should be large enough, and diverse enough to represent the scenarios the model will encounter.

The test set is a group of samples held out from training and used to verify the trained model's accuracy. Because these samples are unseen during training, they provide an assessment of performance free of the bias of the training-set sampling, giving a more reliable picture of how the model will generalize.

So how do we partition the test and training sets?

# Random sampling to partition the training and test sets
# (note: sample() draws a simple random split, not a stratified one)
set.seed(123)
train <- sample(1:nrow(data), nrow(data) * 7 / 10) # 70% for the training set, remaining 30% for the test set
Train <- data[train, ]    # Training set
Test <- data[-train, ]    # Test set
All <- rbind(Train, Test) # Recombine the split data
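The sample() call above draws a simple random split. If the outcome is imbalanced, caret's createDataPartition() keeps the class ratio similar in both sets; a sketch using the hypothetical outcome column:

set.seed(123)
idx <- caret::createDataPartition(factor(data$outcome), p = 0.7, list = FALSE)[, 1] # row indices of training cases
Train <- data[idx, ]
Test <- data[-idx, ]
prop.table(table(Train$outcome)) # class proportions in the training set
prop.table(table(Test$outcome))  # should be close to the training proportions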


# Export the splits to Excel (writexl was already installed and loaded above)
write_xlsx(Train, "C:/Users/L/Desktop/Train.xlsx")
write_xlsx(Test, "C:/Users/L/Desktop/Test.xlsx")
write_xlsx(All, "C:/Users/L/Desktop/All.xlsx")

