Every algorithm has its applicable range, and understanding its pros and cons helps avoid errors caused by inappropriate use. This article summarizes the pros and cons of common machine learning algorithms for reference. The content is drawn from “Machine Learning: Using R, Tidyverse, and mlr” (Algorithms 1 to 17) and “Neural Networks: Implementation in R” (Algorithm 18), both of which use the R language. The labels in parentheses after each algorithm name (Algorithms 1 to 17) indicate how the books categorize that algorithm.
Corresponding English terms:
kNN: k-Nearest Neighbors
LDA: Linear Discriminant Analysis
QDA: Quadratic Discriminant Analysis
SVM: Support Vector Machine
rpart: Recursive Partitioning (a decision tree algorithm)
XGBoost: Extreme Gradient Boosting
GAM: Generalized Additive Model (a nonlinear regression algorithm)
LASSO: Least Absolute Shrinkage and Selection Operator
OLS: Ordinary Least Squares
PCA: Principal Component Analysis
t-SNE: t-Distributed Stochastic Neighbor Embedding
UMAP: Uniform Manifold Approximation and Projection
SOM: Self-Organizing Map
LLE: Locally Linear Embedding
OPTICS: Ordering Points To Identify the Clustering Structure
DBSCAN: Density-Based Spatial Clustering of Applications with Noise
1. kNN Algorithm (Classification or Regression)
Pros:
1. The algorithm is simple and easy to understand.
2. There is essentially no computational cost during the learning process; all of the computation happens at prediction time (kNN is a "lazy" learner).
3. It makes no assumptions about the data, such as how it is distributed.
Cons:
1. Cannot directly handle categorical variables (must re-encode categorical variables or use different distance measures).
2. When the training set is large, the workload of calculating the distance between new data and all samples in the training set can be very large.
3. The model cannot explain the true relationships in the data.
4. Noisy data and outliers have a significant impact on prediction accuracy.
5. In high-dimensional datasets, the performance of the kNN algorithm is often poor. In short, as the number of dimensions grows, the distances between any two samples become nearly indistinguishable (the curse of dimensionality), making it very difficult to find meaningful nearest neighbors.
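Below is a minimal kNN sketch in R using the class package and the built-in iris data, both chosen here purely for illustration (the book itself works through mlr). Because kNN is distance-based, predictors on very different scales would normally be standardized first.

```r
library(class)

set.seed(42)
idx   <- sample(nrow(iris), 100)          # random 100-row training split
train <- iris[idx, 1:4]                   # four continuous predictors
test  <- iris[-idx, 1:4]

# All of the work happens at prediction time: knn() computes the distance from
# each test sample to every training sample and takes a majority vote of the
# k nearest neighbors.
pred <- knn(train = train, test = test, cl = iris$Species[idx], k = 5)

mean(pred == iris$Species[-idx])          # simple accuracy estimate
```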
2. Logistic Regression Algorithm (Classification)
Pros:
1. Can handle both continuous and categorical predictor variables.
2. Model parameters are easy to understand.
3. Predictor variables do not need to be assumed to follow a normal distribution.
Cons:
1. Logistic regression breaks down when the classes are completely separated: the maximum-likelihood estimates do not converge and the coefficients grow without bound.
2. It assumes that classes are linearly separable; in other words, it assumes that a hyperplane in n-dimensional space (where n is the number of predictor variables) can separate the classes. If a curved (non-linear) decision boundary is needed, logistic regression will perform poorly compared to other algorithms.
3. It assumes a linear relationship between each predictor variable and the log odds of class membership. For example, if low and high values of a predictor belong to one class while medium values belong to another, this linear relationship is broken.
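A minimal sketch with base R's glm(), using mtcars$am (0 = automatic, 1 = manual) as an illustrative binary outcome; the predictor choice is arbitrary, and this is not the book's mlr workflow. With completely separated classes, glm() typically warns that fitted probabilities of 0 or 1 occurred, which is the symptom of the first con above.

```r
# Logistic regression via base R's glm() with a binomial family
fit <- glm(am ~ hp + wt, family = binomial, data = mtcars)

# Coefficients are changes in the log odds per unit of each predictor,
# which is what makes the model easy to interpret.
summary(fit)$coefficients

# Predicted probability of a manual transmission for a hypothetical car
predict(fit, newdata = data.frame(hp = 110, wt = 2.8), type = "response")
```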
3. LDA and QDA Algorithms (Classification)
Pros:
1. Can compress high-dimensional feature space into a more manageable dimensional space.
2. Can be used for classification or as a preprocessing (dimensionality reduction) step for other classification algorithms, potentially improving the performance of those algorithms.
3. QDA can learn curved decision boundaries between categories (LDA cannot).
Cons:
1. Can only handle continuous predictor variables (although re-encoding categorical variables as numbers may help in some cases).
2. Assumes that the predictor variables are normally distributed within each class; if the data do not meet this assumption, performance will be affected.
3. LDA can only learn linear decision boundaries between categories (QDA can learn non-linear boundaries).
4. LDA assumes that all classes share the same covariance matrix; if they do not, performance will be affected (QDA does not have this restriction).
5. QDA is more flexible than LDA, making it more prone to overfitting.
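A minimal sketch using lda() and qda() from the MASS package on the iris data (an illustrative choice of implementation and data, not the book's mlr code):

```r
library(MASS)

fit_lda <- lda(Species ~ ., data = iris)   # shared covariance -> linear boundaries
fit_qda <- qda(Species ~ ., data = iris)   # per-class covariance -> curved boundaries

# LDA doubles as supervised dimensionality reduction: the discriminant scores
# project the 4 predictors onto at most (number of classes - 1) = 2 axes.
head(predict(fit_lda)$x)

# Training-set accuracy (optimistic; shown only for illustration)
mean(predict(fit_qda)$class == iris$Species)
```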
4. Naive Bayes Algorithm (Classification)
Pros:
1. Can handle both continuous and categorical predictor variables.
2. Has low training overhead.
3. Generally performs well in topic classification based on words contained in documents.
4. There are no hyperparameters to tune.
5. Outputs the probability that a new sample belongs to a certain category.
6. Can handle missing data.
Cons:
1. Assumes that continuous predictor variables are normally distributed within each class (the standard Gaussian naive Bayes assumption); if the data deviate markedly from this, model performance will be affected.
2. Assumes that predictor variables are conditionally independent of each other given the class, a condition that is often not met in practice. If this assumption is severely violated, model performance will be affected.
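A minimal sketch using naiveBayes() from the e1071 package (an illustrative choice of implementation and data, not the book's mlr code):

```r
library(e1071)

# No hyperparameters to tune: training just estimates per-class means and
# variances for continuous predictors and per-class frequency tables for
# categorical ones.
fit <- naiveBayes(Species ~ ., data = iris)

# type = "raw" returns the posterior probability of each class,
# one of the practical advantages listed above.
predict(fit, newdata = iris[c(1, 51, 101), ], type = "raw")
```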
5. SVM Algorithm (Classification)
Pros:
1. Very good at learning complex non-linear decision boundaries.
2. Performs well across various tasks.
3. Makes no assumptions about the distribution of predictor variables.
Cons:
1. One of the most computationally expensive algorithms.
2. Requires tuning multiple hyperparameters simultaneously.
3. Can only handle continuous predictor variables directly (although re-encoding categorical predictor variables numerically may help in some cases).
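A minimal sketch using svm() and tune() from the e1071 package; the kernel, cost, and gamma values are arbitrary illustrative choices, not recommendations, and this is not the book's mlr workflow.

```r
library(e1071)

# A radial-kernel SVM; cost and gamma are the hyperparameters that normally
# have to be tuned together.
fit <- svm(Species ~ ., data = iris, kernel = "radial", cost = 1, gamma = 0.25)
table(predict(fit, iris), iris$Species)

# tune() runs a small cross-validated grid search; on large datasets this is
# where the heavy computational cost of SVMs shows up.
tuned <- tune(svm, Species ~ ., data = iris,
              ranges = list(cost = c(0.1, 1, 10), gamma = c(0.1, 0.25, 0.5)))
tuned$best.parameters
```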
6. Decision Tree Algorithm (Classification)
Pros:
1. The construction of trees is straightforward, and each tree can be easily interpreted.
2. Can handle both categorical and continuous predictor variables.
3. Makes no assumptions about the distribution of predictor variables.
4. Can reasonably handle missing values.
5. Can handle continuous variables of different scales.
Cons:
An individual tree is prone to overfitting the training data, which is why single decision trees are rarely used on their own.
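A minimal sketch using the rpart package directly (the book drives the same algorithm through mlr); the dataset is an illustrative choice:

```r
library(rpart)

fit <- rpart(Species ~ ., data = iris, method = "class")

# The printed splits read directly as if/else rules, which is what makes a
# single tree so easy to interpret.
print(fit)

# The complexity-parameter table is the usual basis for pruning, the standard
# guard against a single tree's tendency to overfit.
printcp(fit)
```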
7. Random Forest and XGBoost Algorithms (Classification or Regression)
Pros:
1. Can handle both categorical and continuous predictor variables (although XGBoost requires some numerical encoding).
2. Makes no assumptions about the distribution of predictor variables.
3. Can reasonably handle missing values.
4. Can handle continuous variables of different scales.
5. Ensemble learning algorithms can significantly improve the performance of models based on single decision trees, and XGBoost is particularly good at reducing bias and variance.
Cons:
1. Compared to the rpart algorithm, while Random Forest reduces variance, it does not reduce bias (XGBoost can reduce both variance and bias).
2. XGBoost has many hyperparameters and generates decision trees sequentially, which can lead to significant computational overhead.
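A minimal sketch using the randomForest package and the classic xgboost() matrix interface; the packages, data, and hyperparameter values are illustrative assumptions rather than the book's exact workflow.

```r
library(randomForest)
library(xgboost)

set.seed(42)
rf <- randomForest(Species ~ ., data = iris, ntree = 500)
rf$confusion                                # out-of-bag confusion matrix

# xgboost expects an all-numeric matrix and integer class labels starting at 0,
# hence the numerical re-encoding mentioned in the pros above.
X <- as.matrix(iris[, 1:4])
y <- as.integer(iris$Species) - 1
xgb <- xgboost(data = X, label = y, nrounds = 50,
               objective = "multi:softmax", num_class = 3, verbose = 0)
mean(predict(xgb, X) == y)                  # training accuracy, for illustration only
```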
8. Linear Regression Algorithm (Regression)
Pros:
1. The generated model is very easy to understand.
2. Can handle both continuous and categorical predictor variables.
3. Very low computational overhead.
Cons:
1. Makes strong assumptions about the data, such as homoscedasticity, linearity, and normally distributed residuals (if these assumptions are violated, model performance may suffer).
2. Can only learn linear relationships in the data.
3. Cannot handle missing data.
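A minimal ordinary least squares sketch with base R's lm(); the dataset and predictors are illustrative choices only.

```r
# OLS linear regression on the built-in mtcars data
fit <- lm(mpg ~ wt + hp, data = mtcars)

summary(fit)            # coefficients, standard errors, R-squared

# The standard diagnostic plots are how the assumptions above (linearity,
# homoscedasticity, roughly normal residuals) are usually checked.
par(mfrow = c(2, 2))
plot(fit)
```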
Other algorithms’ pros and cons will be updated later!