Introduction to Machine Learning Basics

1. Overview of Machine Learning

1) What is Machine Learning

Artificial Intelligence (AI) is a scientific discipline that studies and develops theories, methods, technologies, and application systems for simulating, extending, and augmenting human intelligence. It is a broad, general concept; the ultimate goal of artificial intelligence is to enable computers to simulate human thinking and behavior.
Artificial intelligence began to rise in the 1950s, but development was slow at that time due to limitations in data and hardware devices.
Machine Learning (ML) is a subset of artificial intelligence and one way (though not the only way) to achieve it. It studies how computers can simulate or realize human learning behaviors to acquire new knowledge or skills, reorganizing existing knowledge structures to continuously improve performance. It began to flourish around the 1980s, giving rise to a large number of machine learning models grounded in mathematical statistics.
Deep Learning (DL) is a subset of machine learning inspired by the human brain, built from artificial neural networks (ANNs) that mimic structures found in the brain. In deep learning, learning happens through a deep, multi-layered “network” of interconnected “neurons”; “deep” usually refers to the number of hidden layers in the network. It exploded in growth after 2012 and is now widely used in many scenarios.
Let’s look at a classic definition of machine learning from a renowned foreign scholar, Tom Mitchell (1997): a computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
Put more simply, machine learning studies how computers can simulate human learning behaviors to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve themselves.
From a practical perspective, machine learning relies on big data, using various algorithms to allow machines to conduct deep statistical analysis of data for “self-learning”, enabling artificial intelligence systems to gain inductive reasoning and decision-making capabilities.
The classic spam filtering application makes the definition and its T, E, and P concrete: the task T is classifying incoming emails as spam or not spam; the experience E is the set of emails the system has observed, together with their labels; and the performance P is the proportion of emails classified correctly.
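To make this concrete in code, here is a minimal sketch using scikit-learn (a library choice of mine, not the article’s; the emails and labels are hypothetical):

```python
# A minimal spam-filtering sketch (hypothetical data; assumes scikit-learn).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# E (experience): previously seen emails with human labels.
emails = ["win a free prize now", "meeting at 10am tomorrow",
          "cheap pills, click here", "quarterly report attached"]
labels = [1, 0, 1, 0]                      # 1 = spam, 0 = not spam

# T (task): classify emails as spam vs. not spam.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(emails, labels)                  # learning from experience E

# P (performance): classification accuracy (scored here on the training
# data for brevity -- a real system would score a separate test set).
print(model.score(emails, labels))
print(model.predict(["free prize, click here"]))  # likely flagged as spam
```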

2) Three Elements of Machine Learning

The three elements of machine learning are data, model, and algorithm, and the three work closely together, as described below.

(1) Data

Data-Driven: being data-driven means relying on objective, quantitative data and supporting decisions through deliberate data collection and analysis. By contrast, experience-driven approaches rely on intuition, “shooting from the hip.”


(2) Model & Algorithm

Model: in AI’s data-driven paradigm, a model is a hypothesis function that makes decisions Y based on data X; it can take various forms, including computational and rule-based.
Algorithm: the specific computational method used to learn the model. Statistical learning selects the optimal model from a hypothesis space according to the training data and a learning strategy; the algorithm is the concrete procedure for solving for that optimal model, which is typically an optimization problem.
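To see the three elements in one place, here is a minimal numpy sketch (my own illustration under the definitions above): the hypothesis function y = w·x + b is the model, squared error is the learning strategy, and gradient descent is the algorithm that solves the resulting optimization problem.

```python
# Model vs. algorithm: a minimal sketch with numpy (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=100)
Y = 3.0 * X + 2.0 + rng.normal(0, 1, size=100)   # synthetic data

# Model: the hypothesis function that maps data X to decisions Y.
def hypothesis(x, w, b):
    return w * x + b

# Strategy: mean squared error between predictions and labels.
# Algorithm: gradient descent, a concrete way to solve the optimization.
w, b, lr = 0.0, 0.0, 0.01
for _ in range(2000):
    err = hypothesis(X, w, b) - Y
    w -= lr * 2 * np.mean(err * X)   # dL/dw
    b -= lr * 2 * np.mean(err)       # dL/db

print(w, b)   # should approach the true parameters (3, 2)
```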


3) Development History of Machine Learning

The term artificial intelligence first appeared in 1956, when researchers set out to explore effective solutions to certain problems. In 1960, the US Department of Defense used the concept of “neural networks” to train computers to mimic human reasoning processes.
Before 2010, technology giants such as Google and Microsoft improved machine learning algorithms, raising query accuracy to new heights. Since then, growth in data volume, advances in algorithms, and improvements in computing and storage capacity have driven machine learning further forward.

4) Core Technologies of Machine Learning

Classification: trains a model on labeled data and uses it to accurately classify and predict new samples (see the sketch after this list).
Clustering: identifies similarities and differences within massive datasets and groups the data into multiple categories according to their greatest commonalities.
Anomaly Detection: analyzes the distribution of data points to identify outliers that differ significantly from normal data.
Regression: trains a model on data with known attribute values to find the best-fitting parameters, then predicts the output value of new samples.
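The sketch below (an illustration of mine using scikit-learn and its built-in iris data; the particular estimators are just common choices) runs one off-the-shelf method per core task:

```python
# One common scikit-learn estimator per core task (illustrative choices).
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.cluster import KMeans
from sklearn.ensemble import IsolationForest

X, y = load_iris(return_X_y=True)

# Classification: learn from labeled data, predict categories.
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Clustering: group unlabeled data by similarity.
clusters = KMeans(n_clusters=3, n_init=10).fit_predict(X)

# Anomaly detection: flag points far from the bulk of the data.
outliers = IsolationForest(random_state=0).fit_predict(X)  # -1 = outlier

# Regression: predict a continuous value (here, one feature from the rest).
reg = LinearRegression().fit(X[:, :3], X[:, 3])

print(clf.predict(X[:2]), clusters[:5], outliers[:5], reg.predict(X[:2, :3]))
```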

5) Basic Process of Machine Learning

The machine learning workflow includes several steps: data preprocessing (Processing), model learning (Learning), model evaluation (Evaluation), and new sample prediction (Prediction).


Data Preprocessing: input (raw data + labels) → processing (feature processing and scaling, feature selection, dimensionality reduction, sampling) → output (training set + test set).
Model Learning: model selection, cross-validation, result evaluation, hyperparameter selection.
Model Evaluation: measuring the model’s score on the test dataset.
New Sample Prediction: applying the trained model to predict new samples (an end-to-end sketch follows this list).
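An end-to-end sketch of this workflow, assuming scikit-learn (the dataset and model are arbitrary illustrative choices):

```python
# The basic workflow: preprocessing -> learning -> evaluation -> prediction.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

# Data preprocessing: raw data + labels in, train/test split out.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Model learning: feature scaling + model, checked with cross-validation.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
print("CV accuracy:", cross_val_score(model, X_train, y_train, cv=5).mean())
model.fit(X_train, y_train)

# Model evaluation: score on the held-out test set.
print("Test accuracy:", model.score(X_test, y_test))

# New sample prediction: apply the trained model to unseen samples.
print(model.predict(X_test[:3]))
```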

6) Application Scenarios of Machine Learning

As a data-driven methodology, machine learning has been widely applied in data mining, computer vision, natural language processing, biometric recognition, search engines, medical diagnosis, credit card fraud detection, securities market analysis, DNA sequencing, speech and handwriting recognition, and robotics.


Intelligent Healthcare: Intelligent prosthetics, exoskeletons, healthcare robots, surgical robots, smart health management, etc.
Facial Recognition: access control systems, attendance systems, facial recognition security doors, electronic passports and IDs; facial recognition systems and networks can even be used to track down fugitives nationwide.
Control in Robotics: Industrial robots, robotic arms, multi-legged robots, vacuum robots, drones, etc.

2. Basic Terminology of Machine Learning


Supervised Learning: the training set carries label information; typical methods include classification and regression.
Unsupervised Learning: the training set carries no label information; typical methods include clustering and dimensionality reduction.
Reinforcement Learning: the learner interacts with an environment and learns from feedback (rewards) that is delayed and sparse.


Example/Sample: a single data point in the dataset.
Attribute/Feature: a property describing a sample, such as “color” or “root”.
Attribute Space/Sample Space/Input Space X: the space spanned by all attributes.
Feature Vector: the coordinate vector corresponding to a point in that space.
Label: information about an example’s outcome, e.g. ((color=green, root=curled, knock sound=dull), good melon), where “good melon” is the label (see the small sketch after this list).
Classification: when the predicted value is discrete, such as “good melon” vs. “bad melon”, the learning task is called classification.
Hypothesis: the learned model, corresponding to some underlying regularity in the data.
Ground Truth: the underlying regularity itself.
Learning Process: aims to find or approximate the ground truth.
Generalization Ability: the learned model’s ability to perform well on new samples. In general, the larger the training sample, the more likely learning is to produce a model with strong generalization ability.
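A tiny illustration tying these terms together (the melon attributes and values are the hypothetical ones above):

```python
# The melon example as data: each sample is a feature vector plus a label.
feature_names = ["color", "root", "knock_sound"]   # attributes/features
sample = ["green", "curled", "dull"]               # one feature vector in X
label = "good melon"                               # the label y

example = (sample, label)   # an example/sample = (feature vector, label)
print(dict(zip(feature_names, sample)), "->", label)
```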

3. Classification of Machine Learning Algorithms

1) Problem Scenarios of Machine Learning Algorithms

Over the past 30 years, machine learning has developed into a multidisciplinary field, drawing on probability theory, statistics, approximation theory, convex analysis, computational complexity theory, and more. Machine learning theory mainly concerns designing and analyzing algorithms that let computers “learn” automatically.
Machine learning algorithms automatically analyze data to discover patterns and use those patterns to make predictions on unknown data.
Because machine learning theory focuses on learning algorithms that are feasible and effective, and many inference problems are intractable, part of machine learning research develops tractable approximate algorithms.


The main categories of machine learning are: supervised learning, unsupervised learning, and reinforcement learning.


Supervised Learning: learns a function from a given training dataset and uses it to predict outputs for new data. The training set contains both inputs and outputs (features and targets), with the targets labeled by humans. Common supervised learning algorithms include regression analysis and statistical classification.
For more supervised learning algorithm models, see the ShowMeAI article AI Knowledge Skills Quick Reference | Machine Learning – Supervised Learning.
Unsupervised Learning: in contrast to supervised learning, the training set has no human-labeled targets. Common unsupervised learning algorithms include clustering and Generative Adversarial Networks (GANs).
For more unsupervised learning algorithm models, see the ShowMeAI article AI Knowledge Skills Quick Reference | Machine Learning – Unsupervised Learning.
Reinforcement Learning: learns how to act through observation. Each action affects the environment, and the learner adjusts its behavior based on the feedback it observes.

2) Classification Problems

Classification problems are a crucial part of machine learning. The goal is to decide, based on the features of known samples, which known class a new sample belongs to. Classification problems can be subdivided as follows (a code sketch follows the list):
Binary Classification: a classification task with exactly two categories.
Multiclass Classification: a classification task with more than two mutually exclusive categories.
Multilabel Classification: a task that assigns a set of target labels to each sample.
For more on machine learning classification algorithms, see the ShowMeAI articles on the KNN algorithm, logistic regression, naive Bayes, decision trees, random forest classification, GBDT, XGBoost, support vector machines, and more.
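The sketch referenced above, assuming scikit-learn (the toy arrays are hypothetical):

```python
# Binary, multiclass, and multilabel classification in scikit-learn.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X = np.random.default_rng(0).normal(size=(100, 4))   # toy features

# Binary: each sample belongs to one of exactly two classes.
y_binary = np.random.default_rng(1).integers(0, 2, size=100)
LogisticRegression().fit(X, y_binary)

# Multiclass: one of several mutually exclusive classes.
y_multi = np.random.default_rng(2).integers(0, 3, size=100)
LogisticRegression(max_iter=1000).fit(X, y_multi)

# Multilabel: each sample may carry several labels at once,
# encoded as a 0/1 indicator matrix (one column per label).
y_labels = np.random.default_rng(3).integers(0, 2, size=(100, 3))
OneVsRestClassifier(LogisticRegression()).fit(X, y_labels)
```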

3) Regression Problems

Regression problems predict a continuous output value. For more on machine learning regression algorithms, see the ShowMeAI articles on decision trees, random forest regression, GBDT, regression trees, support vector machines, and more.


4) Clustering Problems

Clustering problems group samples by similarity without using labels. For more on machine learning clustering algorithms, see the ShowMeAI clustering algorithms article.


5) Dimensionality Reduction Problems

Dimensionality reduction problems compress high-dimensional features into a lower-dimensional representation while preserving as much information as possible. For more on machine learning dimensionality reduction algorithms, see the ShowMeAI PCA dimensionality reduction article.

4. Model Evaluation and Selection in Machine Learning

1) Machine Learning and Data Fitting

The most typical supervised learning tasks are classification and regression. In classification problems, we learn a “decision boundary” that separates the data; in regression problems, we learn a curve that fits the distribution of the samples.

2) Training Set and Test Set

Taking house price estimation as an example, let’s walk through the concepts involved.
Training Set: the data used to train the model; simply put, the data used to determine the parameters of the fitted curve.
Test Set: the data used to test the accuracy of the trained model.
Of course, the test set does not guarantee the model is correct; it only indicates that similar data will yield similar results under this model. Because all parameters are adjusted and fitted to the existing training set during training, the model may overfit: the parameters fit the training data well but perform poorly on new data that needs to be predicted.

3) Empirical Error

The model learns from the training set data, and its error on the training set is called the “empirical error” (or training error). A smaller empirical error is not always better, however; what we want is good predictive performance on new, unseen data.

4) Overfitting

Overfitting refers to a model performing well on the training set but poorly on cross-validation and test sets, meaning the model’s predictions on unseen samples are mediocre, indicating poor generalization capability.


How can overfitting be prevented? Common methods include early stopping, data augmentation, regularization, and dropout (see the sketch after this list).
Regularization: adds a regularization term to the objective function, usually an L1 or L2 penalty. L1 regularization is based on the L1 norm and adds the sum of the absolute values of the parameters to the objective; L2 regularization is based on the L2 norm and adds the sum of their squares.
Data Augmentation: obtains more data that meets the requirements, i.e., data that is (approximately) independent and identically distributed with the existing data. Common approaches include collecting more data from the source, duplicating existing data with random noise added, resampling, and estimating the parameters of the data distribution from the current dataset and then sampling more data from that distribution.
Dropout: works by modifying the structure of the neural network itself, randomly deactivating a fraction of neurons during training.
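The sketch referenced above illustrates the L1 and L2 penalties, assuming scikit-learn (where Ridge implements the L2 penalty and Lasso the L1 penalty; the data is synthetic):

```python
# L2 (Ridge) and L1 (Lasso) penalties shrink coefficients to curb overfitting.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 20))                 # few samples, many features
y = X[:, 0] * 2.0 + rng.normal(size=30)       # only feature 0 truly matters

ols = LinearRegression().fit(X, y)
l2 = Ridge(alpha=1.0).fit(X, y)               # penalty: alpha * sum(w_i^2)
l1 = Lasso(alpha=0.1).fit(X, y)               # penalty: alpha * sum(|w_i|)

# The penalties pull the irrelevant coefficients toward zero;
# Lasso usually sets many of them exactly to zero.
print("OLS   |w|:", np.abs(ols.coef_).sum().round(2))
print("Ridge |w|:", np.abs(l2.coef_).sum().round(2))
print("Lasso nonzero coefficients:", np.count_nonzero(l1.coef_))
```

Lasso’s tendency to zero coefficients out entirely is also why the L1 penalty is often used for feature selection.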

5) Bias

Bias describes how far the model’s fit deviates from the truth. Given many training sets, the expected fitted model is the “average model”; bias is the difference between the true model and this average model.
A simple model yields a family of nearly straight lines, and their average is still close to a straight line, which differs significantly from the true curve. Simple models therefore typically have high bias.


A complex model yields highly variable functions; when averaged, the highs and lows cancel out, leaving an average model that differs little from the true curve. Complex models therefore generally exhibit low bias.


6) Variance

Variance describes the model’s stability: how much the fitted function fluctuates across different training sets. A simple model yields functions that are all close to horizontal straight lines, and the average model is also a horizontal line, so simple models have very low variance and are insensitive to fluctuations in the data.


A complex model yields highly irregular functions, while the average model is a smooth curve; complex models therefore have high variance and are very sensitive to fluctuations in the data.


7) Balance Between Bias and Variance

Model complexity trades bias off against variance: as complexity increases, bias falls while variance rises. For squared loss, the expected error decomposes into bias² + variance + irreducible noise, so the best model is the one whose complexity balances the two terms.
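The tradeoff can be checked numerically. The sketch below (numpy only; the degrees and sample sizes are arbitrary choices of mine) refits a simple and a complex polynomial on many resampled training sets, then compares the average model’s error (bias) with the spread of the individual fits (variance):

```python
# Bias-variance tradeoff: average many fits of a simple vs. complex model.
import numpy as np

rng = np.random.default_rng(0)
true_f = np.sin                               # the "truth"
x_grid = np.linspace(0, 3, 50)

def fit_predictions(degree, n_sets=200, n=20):
    preds = []
    for _ in range(n_sets):                   # many resampled training sets
        x = rng.uniform(0, 3, n)
        y = true_f(x) + rng.normal(0, 0.3, n)
        coefs = np.polyfit(x, y, degree)
        preds.append(np.polyval(coefs, x_grid))
    return np.array(preds)

for degree in (1, 9):                         # simple vs. complex model
    preds = fit_predictions(degree)
    avg = preds.mean(axis=0)                  # the "average model"
    bias2 = np.mean((avg - true_f(x_grid)) ** 2)   # squared bias
    var = preds.var(axis=0).mean()                 # variance across fits
    print(f"degree {degree}: bias^2 = {bias2:.3f}, variance = {var:.3f}")
```

Degree 1 (the simple model) shows high bias and low variance; degree 9 (the complex model) shows the opposite.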

8) Performance Metrics

Performance metrics are numerical standards that measure a model’s generalization ability and reflect the demands of the task at hand. Using different performance metrics can lead to different evaluation results. For more detail, see the ShowMeAI article Model Evaluation Methods and Criteria.

(1) Regression Problems

Judging the “goodness” of a model depends not only on the algorithm and data but also on the current task requirements. Common performance metrics for regression problems include: Mean Absolute Error, Mean Squared Error, Root Mean Squared Error, R-squared, etc.


Mean Absolute Error (MAE), also known as mean absolute deviation, is the average of the absolute deviations between all label values and the regression model’s predictions.
Mean Absolute Percentage Error (MAPE) is an improvement on MAE that considers the ratio of the absolute error to the true value.
Mean Squared Error (MSE) is the average of the squared deviations between all label values and the regression model’s predictions.
Root Mean Squared Error (RMSE), also known as the standard error, is the square root of MSE; it measures the deviation between observed and true values.
R-squared, the coefficient of determination, is the proportion of the total variation in the dependent variable that the regression model explains. The closer it is to 1, the better the model explains the data and describes its true distribution (the sketch below computes all five metrics).
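The sketch assumes scikit-learn (mean_absolute_percentage_error requires scikit-learn 0.24 or newer; the arrays are toy values):

```python
# Common regression metrics on a toy set of true vs. predicted values.
import numpy as np
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             mean_absolute_percentage_error, r2_score)

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 3.0, 8.0])

print("MAE :", mean_absolute_error(y_true, y_pred))
print("MAPE:", mean_absolute_percentage_error(y_true, y_pred))
print("MSE :", mean_squared_error(y_true, y_pred))
print("RMSE:", np.sqrt(mean_squared_error(y_true, y_pred)))  # sqrt of MSE
print("R^2 :", r2_score(y_true, y_pred))
```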

(2) Classification Problems

Common performance metrics for classification problems include error rate, accuracy, precision, recall, F1 score, the ROC curve, and AUC. For more detail, see the ShowMeAI article Model Evaluation Methods and Criteria.
Error Rate: The proportion of misclassified samples among the total samples.
Accuracy: The proportion of correctly classified samples among the total samples.
Precision: the proportion of results that are truly correct among those the model judges to be correct.
Recall (also known as sensitivity): the proportion of truly correct results the model retrieves out of all correct results in the dataset (retrieved and not retrieved alike).
F1 Score: a measure that considers precision and recall together, defined as their harmonic mean: F1 = 2PR / (P + R). Its general form, Fβ, lets us express different preferences for precision versus recall.
ROC Curve (Receiver Operating Characteristic Curve) comprehensively considers the quality of probability prediction sorting and reflects the “expected generalization performance” of the learner across different tasks. The vertical axis of the ROC curve represents the “true positive rate” (TPR), while the horizontal axis represents the “false positive rate” (FPR).
AUC (Area Under ROC Curve) is the area under the ROC curve, representing the quality of sample prediction sorting.
At a higher level, AUC can be understood via the anomalous-user example: a high AUC means the model can identify as many anomalous users as possible while keeping the false positive rate on normal users low (it does not misclassify large numbers of normal users as anomalous just to catch the anomalous ones). The sketch below computes these metrics.
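A sketch assuming scikit-learn (the labels and scores are toy values):

```python
# Common classification metrics on toy labels and predicted scores.
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_score = np.array([0.1, 0.4, 0.8, 0.3, 0.9, 0.2, 0.7, 0.6])  # P(class 1)
y_pred = (y_score >= 0.5).astype(int)        # threshold the scores

print("Accuracy  :", accuracy_score(y_true, y_pred))
print("Error rate:", 1 - accuracy_score(y_true, y_pred))
print("Precision :", precision_score(y_true, y_pred))
print("Recall    :", recall_score(y_true, y_pred))
print("F1        :", f1_score(y_true, y_pred))
print("AUC       :", roc_auc_score(y_true, y_score))  # uses scores, not labels
```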

9) Evaluation Methods

How can we evaluate reliably when unseen samples are not available? The key is to obtain a reliable test set: the test set (used for evaluation) must be mutually exclusive with the training set (used for learning).
Common evaluation methods include hold-out, cross-validation, and bootstrap; sketches of all three follow below. For more detail, see the ShowMeAI article Model Evaluation Methods and Criteria.
Hold-out is one of the most common evaluation methods in machine learning: a validation sample set is held back from the training data and used not for training but for model evaluation.
Cross-validation is another common evaluation method. K-fold cross-validation averages the results of k training runs on different splits, which reduces variance, makes the evaluation less sensitive to how the data happens to be partitioned, and uses the data more fully, leading to more stable results.
Bootstrap is a non-parametric method for estimating population quantities from small samples, widely used in evolutionary and ecological research. It generates a large number of pseudo-samples by sampling with replacement, then computes statistics on those pseudo-samples to estimate the distribution of the underlying data.
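Sketches of the three evaluation methods, assuming scikit-learn (the dataset and model are arbitrary illustrations; evaluating bootstrap on the out-of-bag points is one common convention):

```python
# Hold-out, k-fold cross-validation, and bootstrap resampling.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.utils import resample

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Hold-out: one mutually exclusive train/test split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
print("Hold-out:", model.fit(X_tr, y_tr).score(X_te, y_te))

# k-fold cross-validation: average over k splits for a stabler estimate.
print("5-fold CV:", cross_val_score(model, X, y, cv=5).mean())

# Bootstrap: sample with replacement to build a pseudo-sample;
# out-of-bag points (never drawn) serve as the evaluation set.
idx = resample(np.arange(len(X)), replace=True, random_state=0)
oob = np.setdiff1d(np.arange(len(X)), idx)
print("Bootstrap OOB:", model.fit(X[idx], y[idx]).score(X[oob], y[oob]))
```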

10) Model Tuning and Selection Criteria

We hope to find a model that has good expressiveness for the current problem and low complexity:
  • A model with good expressiveness can learn the patterns and rules in the training data well;

  • A low-complexity model has a smaller variance, is less prone to overfitting, and has better generalization.
Introduction to Machine Learning Basics

11) How to Choose the Optimal Model

(1) Validation Set Evaluation Selection

  • Split the data into training and validation sets.

  • For the prepared candidate hyperparameters, train the model on the training set and evaluate on the validation set.

(2) Grid Search/Random Search Cross Validation

  • Generate candidate hyperparameter sets through grid search/random search.

  • Evaluate the effect of each set of hyperparameters using cross-validation.

  • Select the best-performing hyperparameters (see the sketch below).
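A minimal grid-search sketch, assuming scikit-learn (the grid values are arbitrary):

```python
# Grid search with cross-validation over candidate hyperparameters.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}  # candidates
search = GridSearchCV(SVC(), param_grid, cv=5)   # 5-fold CV per combination
search.fit(X, y)

print(search.best_params_, search.best_score_)
# RandomizedSearchCV has the same interface but samples candidates randomly.
```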

(3) Bayesian Optimization

  • Hyperparameter tuning based on Bayesian optimization: a surrogate model of the validation score is built as a function of the hyperparameters, and each new trial is chosen where the surrogate predicts the most promise (a sketch follows).
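A sketch using the third-party Optuna library (an assumption of mine; the article names no tool, and any Bayesian-optimization package would serve):

```python
# Sequential model-based hyperparameter tuning with Optuna
# (library choice is an assumption, not from the article).
import optuna
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

def objective(trial):
    # Suggest hyperparameters; the sampler models past trials to pick them.
    c = trial.suggest_float("C", 1e-3, 1e2, log=True)
    gamma = trial.suggest_float("gamma", 1e-4, 1e1, log=True)
    return cross_val_score(SVC(C=c, gamma=gamma), X, y, cv=5).mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=30)
print(study.best_params, study.best_value)
```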