Advancing in Machine Learning: Ensemble Learning to the Peak

1. What is Ensemble Learning?

Ensemble learning is an important branch of machine learning. It is a method that trains multiple different weak classifiers on the sample data and then combines these weak classifiers into a single strong classifier.

The basic structure of ensemble learning is as follows: first, a set of individual learners is generated, and then a certain strategy is used to combine them. The ensemble model is shown in the figure below:

[Figure: the basic ensemble model, in which the outputs of a set of individual learners are merged by a combination strategy]

In the ensemble model shown above, if the individual learners belong to the same category, such as all being decision trees or all being neural networks, it is called a homogeneous ensemble; if the individual learners include multiple types of learning algorithms, such as both decision trees and neural networks, it is called a heterogeneous ensemble.

  • Homogeneous ensemble: the individual learners are called “base learners”, and the corresponding learning algorithm is called the “base learning algorithm”. Most of today’s mainstream ensemble learning algorithms are homogeneous.
  • Heterogeneous ensemble: the individual learners are called “component learners”, or simply “individual learners”.

We want the generalization performance of the ensemble to be better than that of any single learner. Unity is strength, but the “weakest plank” effect is also at work: one bad learner can drag the whole ensemble down. So how do we get the benefit? This leads us to two key concepts in ensemble learning: accuracy and diversity. Accuracy means the individual learners must not be too weak; each needs a certain level of accuracy on its own. Diversity means the outputs of the individual learners should differ from one another. The three examples in the figure below illustrate the point: learners that are both accurate and diverse can significantly enhance ensemble performance.

[Figure: three examples of combining individual learners with different levels of accuracy and diversity]
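As a quick sanity check of this intuition, consider a hypothetical case (an illustrative assumption, not taken from the figure): three fully independent learners, each 60% accurate, combined by majority vote. The ensemble is correct whenever at least two of the three are:

```python
from math import comb

p = 0.6  # accuracy of each individual learner
# Majority vote over 3 independent learners succeeds when >= 2 vote correctly.
ensemble_acc = sum(comb(3, k) * p**k * (1 - p)**(3 - k) for k in (2, 3))
print(ensemble_acc)  # 0.648, better than any single learner's 0.6
```

If the three learners were identical (accurate but with zero diversity), the vote would simply reproduce the single learner’s 0.6, which is why diversity matters as much as accuracy.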

When all individual learners are constructed using the same algorithm, how can we ensure the diversity among the learners? There are two approaches:

  • Each time an individual learner is trained, sample the original dataset (with replacement) to obtain a different training set. A given training example may appear multiple times in the sampled set, or not at all. Training on these different sets yields individual learners that do not depend on one another. Bagging and Random Forest are representatives of this approach.

  • Continuously update the sample weights so that each new weak learner compensates for the “shortcomings” of the previous one, serially building up a stronger learner that drives the objective function sufficiently low. This approach is represented by the Boosting family of algorithms, including AdaBoost, GBDT, and XGBoost. (A minimal sketch of both approaches follows this list.)
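Here is a minimal sketch of the two ideas, assuming NumPy, binary labels in {-1, +1} for the boosting update, and illustrative function names (`bootstrap_sample` and `reweight` are not from any particular library):

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_sample(X, y):
    """Bagging idea: draw len(X) examples with replacement, so any
    example may appear several times in the sample, or not at all."""
    idx = rng.integers(0, len(X), size=len(X))
    return X[idx], y[idx]

def reweight(w, y_true, y_pred, alpha):
    """Boosting idea (an AdaBoost-style update): raise the weight of
    examples the previous learner got wrong, lower the rest."""
    # For a wrong prediction, y_true * y_pred = -1, so the factor grows.
    w = w * np.exp(-alpha * y_true * y_pred)
    return w / w.sum()  # renormalize to a distribution
```

In bagging, the resampled learners can be trained in parallel and combined by voting or averaging; in boosting, each learner depends on the weights left behind by its predecessor, so training is inherently sequential.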

2. What Problems Can Ensemble Learning Solve?

Before studying ensemble learning (EL), we need to understand why we introduce it, that is, what problems it can solve. Most of us have encountered practical projects where, because of a complex environment, noise, and other random uncertainties, a single recognition method, parameter setting, or feature cannot achieve the expected results. In such cases we need to bring multiple methods, perspectives, and parameters to bear on the problem. Sometimes, because the feature space is high-dimensional, a single model also runs into the “curse of dimensionality”. This is where ensemble learning comes in: it can address all of the problems above, which is why mastering it is crucial.


3. Classification of Ensemble Learning

EL can be divided into two main categories according to the relationship between the weak classifiers: boosting algorithms, in which the classifiers are built serially and each depends on those before it, and bagging algorithms, in which the classifiers are built independently and in parallel. Each has its commonly used EL algorithms:

Boosting:

  • Adaboost
  • GBDT
  • XGBoost

Bagging:

  • Random Forest
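To make the taxonomy concrete, here is a minimal sketch using scikit-learn (assuming the library is installed; XGBoost ships as its own `xgboost` package and is omitted here):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier,
                              GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import train_test_split

# A synthetic binary classification problem for demonstration.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "AdaBoost (boosting)": AdaBoostClassifier(n_estimators=100),
    "GBDT (boosting)": GradientBoostingClassifier(n_estimators=100),
    "Random Forest (bagging)": RandomForestClassifier(n_estimators=100),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(name, "test accuracy:", model.score(X_te, y_te))
```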

4. Core Problems of Ensemble Learning

After understanding the definition of ensemble learning and the problems it aims to solve, we should know how to implement ensemble learning. The implementation of ensemble learning mainly needs to consider two core issues:

1. How to train multiple different weak classifiers;

2. How to integrate multiple different weak classifiers into a strong classifier.

Regarding these core issues, there are different ideas and methods; boosting and bagging take different approaches to each of the two problems. The specifics will be detailed in the next article, but as a taste of the second issue, the sketch below shows the simplest combination strategy: hard majority voting.
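A minimal sketch, assuming NumPy (`majority_vote` is an illustrative name, not a library function):

```python
import numpy as np

def majority_vote(predictions):
    """Hard voting: predictions has shape (n_classifiers, n_samples);
    for each sample, return the most frequently predicted label."""
    predictions = np.asarray(predictions)
    return np.apply_along_axis(
        lambda col: np.bincount(col).argmax(), axis=0, arr=predictions)

# Three weak classifiers, four samples:
preds = [[0, 1, 1, 0],
         [0, 1, 0, 0],
         [1, 1, 1, 0]]
print(majority_vote(preds))  # -> [0 1 1 0]
```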
