1. The Role of Prediction and Classification in Machine Learning
- For example, the predictive role of a linear regression model: in the following example, the x-axis represents people's salt intake and the y-axis represents their running speed. Based on the existing data, machine learning can fit the best straight-line model using the least squares method (or another method). When a new sample appears, knowing its salt intake allows us to infer its running speed. This is the predictive role of machine learning.
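This idea can be sketched in a few lines of NumPy. The salt-intake and speed numbers below are invented for illustration, and `predict_speed` is a hypothetical helper, not something from the video:

```python
import numpy as np

# Invented example data: x = salt intake (g/day), y = running speed (km/h).
salt = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
speed = np.array([4.1, 5.9, 8.2, 9.8, 12.1, 13.9])

# Least-squares fit of the best straight line: speed ≈ slope * salt + intercept.
slope, intercept = np.polyfit(salt, speed, deg=1)

def predict_speed(salt_intake):
    """Infer the running speed of a new sample from its salt intake."""
    return slope * salt_intake + intercept

print(predict_speed(3.5))  # prediction for a new sample with salt intake 3.5
```

Because the least-squares line always passes through the mean point of the data, a new sample at the mean salt intake is predicted to run at the mean speed.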

- For example, the classification role of decision trees: the video suggests that, based on a series of yes/no questions (judgment nodes), a decision tree can classify whether people like StatQuest.
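A toy decision tree is just nested if/else judgment nodes. The two questions below are invented stand-ins for the questions in the video, and `likes_statquest` is a hypothetical name:

```python
# A toy decision tree: each if-statement is one judgment node.
def likes_statquest(likes_silly_songs: bool, likes_machine_learning: bool) -> bool:
    """Classify whether a person likes StatQuest via a series of yes/no questions."""
    if likes_machine_learning:
        return True   # node 1: machine-learning fans are classified as "likes it"
    if likes_silly_songs:
        return True   # node 2: fans of silly songs are also classified as "likes it"
    return False      # leaf: neither interest -> classified as "does not like it"

print(likes_statquest(False, True))   # True
print(likes_statquest(False, False))  # False
```

Real decision-tree learners choose the questions and their order automatically from data; this sketch only shows the classification structure.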
- Many other models can be used for prediction and classification. Besides linear models and decision trees, machine learning includes many more complex and advanced methods, such as deep learning with convolutional neural networks. The field develops rapidly, with new methods emerging almost every year. Regardless of the model, its performance in testing is what matters most; in other words, the model's ability to generalize to new data is crucial.
2. Training and Testing Datasets in Machine Learning
- Training data: its role is to train the model. The author believes the significance of training data is to give a new machine a batch of data from which it can learn patterns, thereby achieving the purpose of machine learning and training.
- Testing data: its role is to test the model and validate its generalization. The author believes testing data is a separate batch of data from the training data; applying the learned patterns to this new data validates how well the new machine generalizes.
- The significance of training data → testing data: in machine learning, we first train the model on the training data to find a suitable model; then we evaluate that model on the testing data to judge its effectiveness. If the model performs poorly on the testing data, its conclusions are limited to the training data and it lacks generalization ability.
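The train-then-test workflow can be sketched as follows. The dataset is synthetic (a linear salt-intake/speed relationship plus noise), and the 30/10 split size is an arbitrary choice for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented dataset: running speed depends linearly on salt intake, plus noise.
salt = np.linspace(1.0, 10.0, 40)
speed = 2.0 * salt + 3.0 + rng.normal(0.0, 0.5, size=salt.size)

# Randomly split: 30 samples to train the model, 10 held out to test it.
idx = rng.permutation(salt.size)
train, test = idx[:30], idx[30:]

# Train: learn the pattern from the training data only.
slope, intercept = np.polyfit(salt[train], speed[train], deg=1)

# Test: apply the learned pattern to the unseen testing data.
test_error = np.abs(speed[test] - (slope * salt[test] + intercept)).sum()
print(test_error)  # a small total error here suggests the model generalizes
```

The key discipline is that the testing samples play no part in fitting the model; they are touched only once, to measure generalization.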
3. Why Separate Training and Testing Sets?
- In the straight-line model (black dashed line), we predict people's running speed from their salt intake. We then calculate the sum of the absolute differences between the predicted speeds and the real speeds, i.e., the sum of absolute residuals.
- Similarly, in the curve model (green dashed line), we predict people's running speed from their salt intake and calculate the sum of the absolute differences between the predicted speeds and the real speeds, i.e., the sum of absolute residuals.
- Comparing the straight-line model (black dashed line) and the curve model (green dashed line) on the testing dataset means comparing their residual sums. The sum of residuals for the green curve is greater than the sum of residuals for the black line, indicating that the green curve performs worse on the testing data than the black line.
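The comparison can be reproduced on synthetic data. Here the "green curve" is approximated by a degree-7 polynomial that passes through every training point exactly, so it memorizes the training noise; all numbers and the `abs_residual_sum` helper are invented for this sketch:

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented data: the true relationship is a straight line plus noise.
train_x = np.arange(8.0)                      # 8 training points
train_y = 1.5 * train_x + 2.0 + rng.normal(0.0, 1.0, 8)
test_x = train_x[:-1] + 0.5                   # testing points between the training points
test_y = 1.5 * test_x + 2.0 + rng.normal(0.0, 1.0, 7)

# "Black line": straight-line fit.  "Green curve": degree-7 polynomial that
# fits every training point exactly.
line = np.polyfit(train_x, train_y, deg=1)
curve = np.polyfit(train_x, train_y, deg=7)

def abs_residual_sum(coeffs, xs, ys):
    """Sum of absolute differences between predicted and real speeds."""
    return np.abs(ys - np.polyval(coeffs, xs)).sum()

# On the training data the curve's residual sum is (near) zero, beating the line...
print(abs_residual_sum(curve, train_x, train_y) < abs_residual_sum(line, train_x, train_y))
# ...but on the testing data the curve typically loses — compare the two sums:
print(abs_residual_sum(line, test_x, test_y), abs_residual_sum(curve, test_x, test_y))
```

Running this shows the pattern from the video: the flexible curve wins on the data it was fitted to, while its wiggles between the training points usually inflate its residuals on new data.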
4. Bias-Variance Tradeoff
- Performing well on the training samples but poorly on the testing samples (often called overfitting) reflects the bias-variance tradeoff; keep this point in mind, as we will delve deeper into it later.
- In practical machine learning, we must weigh the machine's performance on the training dataset against its generalization to the testing dataset.
5. Summary
Editor: Lü Qiong
Reviewer: Luo Peng