Pros and Cons of the Top 10 Machine Learning Algorithms

1. Logistic Regression

The binary logistic regression model is a classification model defined by the conditional probability distribution P(Y|X), which takes the form of a parameterized logistic distribution. Here the random variable X takes real values, and the random variable Y takes the value 1 or 0. The model parameters are estimated from labeled training data, typically by maximum likelihood.

Advantages:

1. Low computational cost, easy to understand and implement;

2. Suitable for scenarios that require classification probabilities (see the sketch at the end of this section);

3. Robust to small amounts of noise in the data, and not significantly affected by mild multicollinearity.

Disadvantages:

1. Prone to underfitting; classification accuracy may not be high;

2. Performs poorly when data has missing values or when the feature space is very large.
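To make advantage 2 concrete, here is a minimal sketch using scikit-learn's LogisticRegression; the library choice, the synthetic data, and all parameter values are illustrative assumptions, not part of the original article.

```python
# Minimal logistic regression sketch (assumes scikit-learn is installed).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary data; Y takes the value 0 or 1, as in the definition above.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000)  # parameters fit by maximum likelihood
clf.fit(X_train, y_train)

# predict_proba returns estimates of P(Y|X), which is what makes logistic
# regression suitable when calibrated class probabilities are needed.
print(clf.predict_proba(X_test[:3]))
print("accuracy:", clf.score(X_test, y_test))
```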

2. Support Vector Machine

For binary, linearly separable learning tasks, SVM finds the hyperplane that separates the two classes with the maximum margin; maximizing the margin gives the hyperplane the best generalization ability.

Advantages:

1. Can solve ML problems in small sample cases;

2. Can improve generalization performance;

3. Can solve high-dimensional problems, avoiding the curse of dimensionality;

4. Can handle nonlinear problems;

5. Can avoid issues with neural networks regarding structure selection and local minima.

The choice of the parameters C and g affects classification performance:

C is the penalty coefficient: the larger C is, the more heavily misclassification is penalized, so the model fits the training data more closely but becomes prone to overfitting;

g (the RBF kernel parameter γ) controls how quickly the kernel function decays toward 0: the larger g is, the faster the kernel falls off and the more tightly the model wraps around the training data, which again makes it prone to overfitting.

Disadvantages:

1. Sensitive to missing data;

2. No universal solution for nonlinear problems, careful selection of kernel functions is necessary.

The main advantages of SVM algorithm are:

1) Effective in solving classification and regression problems with high-dimensional features, still performs well when the feature dimension exceeds the number of samples.

2) Only a portion of support vectors are used to make decisions for the hyperplane, no need to rely on all data.

3) A variety of kernel functions can be used, allowing flexibility in solving various nonlinear classification and regression problems.

4) High classification accuracy and strong generalization ability when sample size is not massive.

The main disadvantages of SVM algorithm are:

1) If the feature dimension far exceeds the number of samples, SVM performs only moderately well.

2) When sample size is very large and the kernel function maps to a very high dimension, the computation becomes excessive, making it unsuitable for large datasets.

3) There is no universal standard for selecting kernel functions for nonlinear problems, making it difficult to choose a suitable kernel function.

4) SVM is sensitive to missing data.

Practical notes on using SVM:

1) It is generally recommended to normalize the data before training; the test set must be normalized in the same way.

2) When the number of features is very high or the number of samples is far less than the number of features, using a linear kernel yields good results, and only the penalty coefficient C needs to be selected.

3) When selecting a kernel function, if a linear fit works poorly, it is generally recommended to use the default Gaussian kernel 'rbf'. In that case, the penalty coefficient C and the kernel parameter γ must be tuned carefully through multiple rounds of cross-validation to select appropriate values (see the sketch below).

4) In theory, the Gaussian kernel should not perform worse than the linear kernel, but that guarantee rests on spending more time tuning parameters. Therefore, prefer the linear kernel whenever it can solve the problem.
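As a concrete version of tips 1)–3), here is a minimal tuning sketch, assuming scikit-learn; the parameter grids and data are illustrative assumptions.

```python
# SVM tuning sketch: scale features, then grid-search C and gamma (scikit-learn assumed).
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Tip 1): normalizing inside a pipeline guarantees the test folds are scaled
# with statistics learned from the training folds only.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Tip 3): cross-validate the penalty coefficient C and kernel parameter gamma.
grid = GridSearchCV(pipe, {"svc__C": [0.1, 1, 10, 100],
                           "svc__gamma": [0.001, 0.01, 0.1, 1]}, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```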

3. Decision Tree

Decision tree learning is a heuristic algorithm: the core idea is to select a feature at each node using criteria such as information gain, and to construct the tree recursively.

Advantages:

1. Low computational complexity, easy to understand and interpret, making it easy to comprehend the meaning expressed by the decision tree;

2. Simple data preprocessing stage, and can handle missing data;

3. Can handle both numerical and categorical attributes, and can construct decision trees for datasets with many attributes;

4. It is a white-box model: given an observed example, the corresponding logical expression is easy to infer from the generated decision tree;

5. Can produce feasible and effective classification results for large datasets in a relatively short time.

Disadvantages:

1. For datasets in which the classes have unequal numbers of samples, information gain tends to favor attributes with more distinct values;

2. Sensitive to noisy data;

3. Prone to overfitting;

4. Ignores the correlation between attributes in the dataset;

5. Difficulties in handling missing data.

Advantages of Decision Trees:

1) Simple and intuitive, the generated decision tree is very intuitive.

2) Requires little preprocessing, no need for prior normalization, and can handle missing values.

3) The cost of predicting with a decision tree is O(log₂ m), where m is the number of samples.

4) Can handle both discrete and continuous values. Many algorithms focus only on discrete or continuous values.

5) Can handle multi-dimensional output classification problems.

6) Compared to black-box classification models like neural networks, decision trees can provide good logical explanations.

7) Can use cross-validation pruning to select models, thereby improving generalization ability.

8) Good fault tolerance for outliers, high robustness.

Disadvantages of Decision Tree Algorithm:

1) Decision tree algorithms are very prone to overfitting, which weakens generalization. This can be mitigated by setting a minimum number of samples per node and limiting the depth of the tree (see the sketch after this list).

2) The decision tree structure can change drastically due to minor changes in the samples. This can be addressed through ensemble learning methods.

3) Finding the optimal decision tree is an NP-hard problem; we generally use heuristic methods, making it easy to fall into local optima. This can be improved through ensemble learning methods.

4) Some complex relationships are difficult for decision trees to learn, such as XOR. This typically requires switching to neural network classification methods.

5) If certain features have a disproportionate sample ratio, the generated decision tree is likely to favor these features. This can be improved by adjusting sample weights.
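As a minimal sketch of the controls mentioned in point 1), assuming scikit-learn; the dataset and parameter values are illustrative.

```python
# Decision tree sketch: information-gain splits with overfitting controls (scikit-learn assumed).
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# criterion="entropy" selects splits by information gain; max_depth and
# min_samples_leaf are the overfitting controls described in point 1).
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3,
                              min_samples_leaf=5, random_state=0)
print(cross_val_score(tree, X, y, cv=5).mean())
```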

4. KNN Algorithm

A lazy classification method: it finds the k training objects closest to the test object in the training set and assigns the test object the class that dominates among those k neighbors.

Advantages:

1. Simple and effective, easy to understand and implement;

2. Low retraining cost when the class set or the training set changes;

3. Computation time and space scale linearly with the size of the training set;

4. The error rate asymptotically converges to the Bayes error rate, so KNN can serve as an approximation to the Bayes classifier;

5. Suitable for multi-modal classification and multi-label classification problems;

6. More suitable for samples whose class domains overlap or intersect heavily.

Disadvantages:

1. As a lazy learning method, it is slower at prediction time than eager learning algorithms;

2. Large computational load, requires trimming of sample points;

3. Performs poorly on imbalanced datasets; weighted voting methods can be used to improve performance;

4. The choice of k greatly affects classification performance; a small k is sensitive to noise, so the optimal k must be estimated, for example by cross-validation (see the sketch at the end of this section).

5. Limited interpretability, large computational load.

Main Advantages of KNN:

1) The theory is mature, the concept is simple, can be used for both classification and regression.

2) Can be used for nonlinear classification.

3) Training time complexity is lower than that of algorithms like SVM, only O(n).

4) Compared to algorithms like Naive Bayes, it makes no assumptions about data, has high accuracy, and is not sensitive to outliers.

5) Since the KNN method mainly relies on surrounding limited neighboring samples instead of using discriminative class domain methods to determine the class, it is more suitable for classification samples with a lot of overlap or intersection in class domains.

6) The algorithm is better suited to automatic classification of class domains with large sample sizes; for class domains with few samples, it is more prone to misclassification.

Main Disadvantages of KNN:

1) Large computational load, especially when the number of features is very high.

2) When samples are imbalanced, the prediction accuracy for rare classes is low.

3) Index structures such as KD-trees and ball trees require a lot of memory.

4) Using a lazy learning method, which essentially does not learn, leads to slower prediction times compared to algorithms like logistic regression.

5) Compared to decision tree models, KNN models have limited interpretability.
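A minimal sketch, assuming scikit-learn, of estimating k by cross-validation and using the distance-weighted voting mentioned above; all values are illustrative.

```python
# KNN sketch: cross-validate k and use distance-weighted voting (scikit-learn assumed).
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# A small k is noise-sensitive, so estimate a good k by cross-validation;
# weights="distance" implements the weighted voting suggested above for
# imbalanced data.
grid = GridSearchCV(KNeighborsClassifier(weights="distance"),
                    {"n_neighbors": [1, 3, 5, 7, 9, 11]}, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```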

5. Naive Bayes Algorithm

A Bayesian classifier works as follows: using the prior probability of each class together with Bayes' theorem and the attribute-independence assumption, it computes the posterior probability that an object belongs to each class, i.e., the probability of each class given the object, and assigns the object to the class with the maximum posterior probability.

Advantages:

1. Solid mathematical foundation, stable classification efficiency, easy to explain;

2. Requires very few estimated parameters, not very sensitive to missing data;

3. No complex iterative solving framework is needed, suitable for large-scale datasets.

Disadvantages:

1. The independence assumption between attributes often does not hold (consider using clustering algorithms to cluster attributes with high correlation);

2. Requires prior probabilities to be known, and the classification decision carries an inherent error rate.

Main Advantages of Naive Bayes:

1) The Naive Bayes model originates from classical mathematical theory, with stable classification efficiency.

2) Performs well on small-scale data, can handle multi-class tasks, suitable for incremental training, especially when data volume exceeds memory, we can train incrementally in batches.

3) Not very sensitive to missing data; the algorithm is relatively simple and is commonly used for text classification (see the sketch at the end of this section).

Main Disadvantages of Naive Bayes: 

1) In theory, the Naive Bayes model has the minimum error rate compared with other classification methods. In practice this often fails to hold, because the model assumes the attributes are independent given the output class, an assumption that rarely holds in real applications. When the number of attributes is large or the correlation between attributes is strong, classification quality suffers; when attribute correlation is low, Naive Bayes is at its best. Algorithms such as semi-naive Bayes moderately improve performance by modeling some of the correlations.

2) Requires knowledge of prior probabilities, and prior probabilities often depend on assumptions, with many possible models for the assumptions. Therefore, sometimes the predictive performance may be poor due to the assumptions of the prior model.

3) The classification decision has a certain error rate since we determine the posterior probabilities based on prior and data.

4) Sensitive to the expression form of input data.
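A minimal text-classification sketch, assuming scikit-learn; the toy corpus and labels are made up for illustration.

```python
# Naive Bayes text-classification sketch (scikit-learn assumed; corpus is a toy example).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = ["cheap pills buy now", "meeting agenda attached",
        "win money fast", "project status report"]
labels = ["spam", "ham", "spam", "ham"]

# Bag-of-words counts feed the multinomial model; treating words as
# independent given the class is exactly the "naive" assumption above.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(docs, labels)
print(model.predict(["buy cheap now"]))
```

MultinomialNB also exposes a partial_fit method, which matches the incremental, batch-by-batch training noted in advantage 2).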

6. Random Forest Algorithm

Main Advantages of RF:

1) Training can be highly parallelized, which gives it a speed advantage for training on large samples in the era of big data. This is its most significant advantage.

2) Because features can be selected at random for each tree-node split, it can still train models efficiently when the feature dimension is high.

3) After training, it can report the importance of each feature to the output (illustrated in the sketch at the end of this section).

4) Thanks to random sampling, the trained model has low variance and strong generalization ability.

5) Compared with boosting methods such as AdaBoost and GBDT, RF is relatively simple to implement.

6) Insensitive to partially missing features.

Main Disadvantages of RF:

1) On certain noisy sample sets, the RF model is prone to overfitting.

2) Features with many distinct values (and hence many possible split points) can exert an outsized influence on RF's decisions, degrading the fitted model.
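A minimal sketch, assuming scikit-learn, showing parallel training and the per-feature importances from advantage 3); the parameter values are illustrative.

```python
# Random forest sketch: parallel training and feature importances (scikit-learn assumed).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=15,
                           n_informative=5, random_state=0)

# n_jobs=-1 trains the trees in parallel (advantage 1); max_features="sqrt"
# is the random feature subset considered at each split (advantage 2).
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                            n_jobs=-1, random_state=0)
rf.fit(X, y)

# Advantage 3): per-feature importance, available after training.
print(rf.feature_importances_.round(3))
```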

7. AdaBoost Algorithm

Boosting starts from a weak learning algorithm and learns repeatedly to obtain a series of weak classifiers (base classifiers), then combines these weak classifiers into a strong classifier. Most boosting methods work by changing the probability distribution over the training data (the weights of the training examples) and calling the weak learner on each reweighted distribution.

Advantages:

1. High classification accuracy;

2. Various methods can be used to construct sub-classifiers, and the Adaboost algorithm provides a framework;

3. Simple, and no feature selection is required;

4. Not prone to overfitting in many practical settings.

Disadvantages:

1. If misclassified samples are repeatedly given too much weight, classifier selection can degrade (the weight-update rule needs to be improved);

2. Data imbalance issues can lead to a sharp decline in classification accuracy;

3. Training is time-consuming, and the algorithm is hard to scale;

4. Issues such as overfitting and low robustness may exist.

Adaboost’s main advantages are:

1) When used as a classifier, AdaBoost achieves high classification accuracy.

2) Under the AdaBoost framework, a wide range of regression and classification models can be used to construct the weak learners (illustrated in the sketch after this section), offering great flexibility.

3) As a simple binary classifier, it is easy to construct and its results are understandable.

4) Not prone to overfitting in many practical settings.

Main Disadvantages of Adaboost:

1) Sensitive to outliers: outliers may gain high weights during the iterations, degrading the final strong learner's prediction accuracy.
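A minimal AdaBoost sketch, assuming scikit-learn and boosting depth-1 decision stumps; the base learner and parameter values are illustrative choices.

```python
# AdaBoost sketch: boosting decision stumps (scikit-learn assumed).
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Each round reweights misclassified samples and fits a new weak learner;
# the framework accepts other base estimators as well. (The keyword is
# "estimator" in scikit-learn >= 1.2, "base_estimator" in older versions.)
ada = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                         n_estimators=100, random_state=0)
print(cross_val_score(ada, X, y, cv=5).mean())
```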

8. GBDT

Main Advantages of GBDT:

1) Can flexibly handle various types of data, including continuous and discrete values.

2) Predictive accuracy can be relatively high with little tuning time, especially compared with SVM.

3) Robust loss functions, such as the Huber and quantile losses, make it strongly robust to outliers (see the sketch after this section).

Main Disadvantages of GBDT:

1) Because the weak learners depend on one another, training is difficult to parallelize. Partial parallelism can, however, be achieved through subsampling, as in stochastic gradient boosting (SGBT).
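A minimal sketch, assuming scikit-learn, using the Huber loss mentioned above together with subsampling (the stochastic variant); parameters are illustrative.

```python
# GBDT sketch: gradient boosting with the robust Huber loss (scikit-learn assumed).
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

# loss="huber" (or "quantile") provides the outlier robustness noted above;
# subsample < 1.0 gives the stochastic gradient boosting variant. The trees
# themselves are still fit sequentially.
gbdt = GradientBoostingRegressor(loss="huber", n_estimators=200,
                                 subsample=0.8, random_state=0)
print(cross_val_score(gbdt, X, y, cv=5).mean())
```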

9. XGBoost Algorithm

1. Advantages of XGBoost over GBDT:

It incorporates the complexity of the tree model into the regularization term to avoid overfitting, thus improving generalization performance compared to GBDT.

The loss function is expanded using Taylor series, utilizing both first and second derivatives to speed up optimization.

GBDT only supports CART as the base learner, while XGBoost also supports linear classifiers as base learners.

Introduces feature subsampling, as in random forests, to avoid overfitting and reduce computation (see the sketch after this list).

When searching for the optimal split point, it implements an approximate greedy algorithm to speed up and reduce memory usage, and also considers handling missing values in sparse datasets.

XGBoost supports parallel processing. Its parallelism is not in generating the model (the trees are still built sequentially) but over features: features are sorted once and stored in memory in a block format, and this structure is reused in later iterations. The block structure also makes parallelization possible: at each node split, the gain of every feature is computed, these per-feature gain computations can run on multiple threads, and the feature with the highest gain is chosen for the split.
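A minimal sketch, assuming the third-party xgboost package and its scikit-learn-style wrapper; the parameter values are illustrative and chosen to touch the points above (regularization, feature subsampling, multi-threaded split search).

```python
# XGBoost sketch (assumes the xgboost package is installed).
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# reg_lambda weights the regularization term on tree complexity; colsample_bytree
# is the feature subsampling; n_jobs=-1 multi-threads the split search.
model = xgb.XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1,
                          reg_lambda=1.0, colsample_bytree=0.8, n_jobs=-1)
model.fit(X_train, y_train)
print("accuracy:", model.score(X_test, y_test))
```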

2. Shortcomings of XGBoost compared to LightGBM:

XGBoost uses pre-sorting: before iterating, it pre-sorts the features of each node so it can traverse the candidate split points and select the best one. When the data volume is large, this greedy approach becomes time-consuming. LightGBM instead uses histogram-based algorithms, which occupy less memory and have lower complexity for data splitting.

XGBoost grows decision trees level-wise, splitting all leaves of the same layer; this allows multi-threaded optimization and does not overfit easily, but many of those leaf splits have low gain and cause unnecessary overhead. LightGBM uses a depth-optimized, leaf-wise growth strategy, repeatedly selecting the leaf with the highest gain to split; this can produce deeper trees and overfitting, so a depth threshold is introduced to limit it.

10. Artificial Neural Networks

Advantages:

1. High classification accuracy, strong parallel distributed processing capability, and strong learning ability;

2. Strong robustness and fault tolerance to noise, can closely approximate complex nonlinear relationships, and possesses associative memory functions.

Disadvantages:

1. Neural networks require many parameters, such as the network topology and the initial values of the weights and thresholds (see the sketch after this list);

2. The learning process cannot be observed, and the output results are difficult to interpret, affecting the credibility and acceptability of the results;

3. Learning time can be excessively long, and may not even achieve the desired learning objectives.
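A minimal sketch, assuming scikit-learn's MLPClassifier; the topology and iteration budget are illustrative, and they are exactly the kind of parameters disadvantage 1 says must be chosen.

```python
# Small neural-network sketch (scikit-learn's MLPClassifier assumed).
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# hidden_layer_sizes fixes the topology; max_iter bounds the (possibly long)
# learning time; initial weights are drawn using random_state.
mlp = make_pipeline(StandardScaler(),
                    MLPClassifier(hidden_layer_sizes=(32, 16),
                                  max_iter=500, random_state=0))
mlp.fit(X, y)
print("training accuracy:", mlp.score(X, y))
```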

Source: Machine Learning Algorithms Explained

