Introduction to Machine Learning: Knowledge Sharing

Definition of Machine Learning
Machine Learning essentially allows computers to learn patterns from data and to make predictions on new data based on the learned patterns.
Machine Learning encompasses methods such as clustering, classification, decision trees, Bayesian methods, neural networks, and deep learning.
The basic idea of Machine Learning is to mimic human learning behavior: we generally summarize rules from experience and use them to predict the future. The basic process of Machine Learning is as follows:
[Figure: Basic process of Machine Learning]
Development History of Machine Learning
The timeline of the development of Machine Learning is as follows:
[Figure: Development history of Machine Learning]
The proposal of the Turing test in the 1950s and Samuel's development of a checkers-playing program marked Machine Learning's official entry into its development phase.
  • The development from the mid-60s to the late 70s almost stagnated.
  • The proposal of the backpropagation (BP) algorithm in the 1980s for training multilayer perceptrons (MLPs) brought Machine Learning into a renaissance period.
  • The “decision tree” (ID3 algorithm) proposed in the 1990s, followed by the support vector machine (SVM) algorithm, shifted Machine Learning from a knowledge-driven to a data-driven approach.
  • In the early 21st century, Hinton proposed deep learning, revitalizing Machine Learning research.
Since 2012, with the growth of computing power and the support of massive training samples, deep learning has become a hot topic in Machine Learning research and has driven widespread industrial application.
Categories of Machine Learning
After decades of development, Machine Learning has derived many classification methods, which can be divided into supervised learning, semi-supervised learning, unsupervised learning, and reinforcement learning based on learning modes.
Supervised Learning
Supervised Learning learns a model from labeled training data and then uses the model to predict the labels of new data. The more accurate the training labels, the more accurate the learned model and the more precise its predictions.
Supervised Learning is mainly used for regression and classification.
Common regression algorithms in supervised learning include linear regression, regression trees, K-nearest neighbors, AdaBoost, neural networks, etc.
Common classification algorithms in supervised learning include naive Bayes, decision trees, SVM, logistic regression, K-nearest neighbors, AdaBoost, neural networks, etc.
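For readers who want a concrete starting point, here is a minimal supervised-classification sketch (assuming scikit-learn is installed), using logistic regression on scikit-learn's bundled iris dataset:

```python
# Minimal supervised-classification sketch: learn a model from labeled data,
# then predict labels for held-out samples (scikit-learn assumed available).
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                      # labeled data: features X, labels y
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=200)               # learn a model from labeled training data
model.fit(X_train, y_train)

y_pred = model.predict(X_test)                         # predict labels for new data
print("accuracy:", accuracy_score(y_test, y_pred))
```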
Semi-Supervised Learning
Semi-Supervised Learning is a learning mode that uses a small amount of labeled data and a large amount of unlabeled data.
Semi-Supervised Learning focuses on incorporating unlabeled samples into supervised classification algorithms to achieve semi-supervised classification.
Common semi-supervised learning algorithms include Pseudo-Label, Π-Model, Temporal Ensembling, Mean Teacher, VAT, UDA, MixMatch, ReMixMatch, FixMatch, etc.
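As a rough illustration of the Pseudo-Label idea only (a sketch on a synthetic dataset, assuming scikit-learn is installed), a model trained on a small labeled subset assigns confident predictions to unlabeled samples as pseudo-labels and is then retrained on both:

```python
# Pseudo-Label sketch: train on a small labeled set, pseudo-label confident
# unlabeled samples, then retrain on the augmented data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, random_state=0)
labeled = np.zeros(len(y), dtype=bool)
labeled[:50] = True                                    # only a small amount of labeled data

model = LogisticRegression(max_iter=500).fit(X[labeled], y[labeled])

proba = model.predict_proba(X[~labeled])
pseudo_y = model.predict(X[~labeled])                  # predicted labels for unlabeled data
confident = proba.max(axis=1) > 0.95                   # keep only high-confidence pseudo-labels

X_aug = np.vstack([X[labeled], X[~labeled][confident]])
y_aug = np.concatenate([y[labeled], pseudo_y[confident]])
model = LogisticRegression(max_iter=500).fit(X_aug, y_aug)  # retrain with pseudo-labeled data
```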
Unsupervised Learning
Unsupervised Learning is the process of finding hidden structures from unlabeled data.
Unsupervised Learning is mainly used for association analysis, clustering, and dimensionality reduction.
Common unsupervised learning algorithms include Sparse Auto-Encoder, Principal Component Analysis (PCA), K-Means algorithm, DBSCAN algorithm, Expectation-Maximization algorithm (EM), etc.
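A minimal unsupervised-learning sketch (assuming scikit-learn is installed), running K-Means on unlabeled synthetic data:

```python
# K-Means clustering: find hidden group structure in unlabeled data.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)   # true labels are ignored

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)    # discovered cluster centers
print(kmeans.labels_[:10])        # cluster assignment of the first 10 samples
```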
Reinforcement Learning
Reinforcement Learning is similar to supervised learning, but instead of training on a fixed set of labeled samples, it learns through trial and error by interacting with an environment.
In Reinforcement Learning, there are two interactive objects: the agent and the environment, along with four core elements: policy, reward function, value function, and environment model, where the environment model is optional.
Reinforcement Learning is commonly used in applications such as robot obstacle avoidance, board games, advertising, and recommendations.
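The following is a toy tabular Q-learning sketch (an illustrative assumption, not tied to any of the applications above): an agent on a five-state corridor learns by trial and error to move toward a rewarded goal state.

```python
# Tiny tabular Q-learning sketch: the agent receives a reward only at the
# rightmost state and learns, through trial and error, to walk right.
import numpy as np

n_states, n_actions = 5, 2             # actions: 0 = move left, 1 = move right
Q = np.zeros((n_states, n_actions))    # value estimates, updated from experience
alpha, gamma, epsilon = 0.1, 0.9, 0.1  # learning rate, discount, exploration rate

for episode in range(500):
    s = 0
    while s != n_states - 1:                               # episode ends at the goal state
        a = np.random.randint(n_actions) if np.random.rand() < epsilon else Q[s].argmax()
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == n_states - 1 else 0.0         # reward only at the goal
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])   # Q-learning update
        s = s_next

print(Q.argmax(axis=1))   # learned greedy action per state (1 = move right)
```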
To aid understanding, in the schematic below gray dots represent unlabeled data, while dots of other colors represent labeled data of different categories. The schematic of supervised learning, semi-supervised learning, unsupervised learning, and reinforcement learning is as follows:
[Figure: Schematic of supervised, semi-supervised, unsupervised, and reinforcement learning]
The Path to Machine Learning Applications
Machine Learning abstracts a real-world problem into a mathematical model, trains the model on historical data, applies the trained model to new data, and converts the results back into answers to the real-world problem. The general steps for implementing a Machine Learning application are as follows:
  • Abstract the real-world problem into a mathematical problem;
  • Prepare the data;
  • Select or create a model;
  • Train and evaluate the model;
  • Predict the results.
Here we walk through the Kaggle competition Dogs vs. Cats as a simple example; interested readers can experiment with it themselves.
1. Abstracting Real Problems into Mathematical Problems
Real Problem: Given an image, have the computer determine whether it shows a cat or a dog.
Mathematical Problem: A binary classification problem, where 1 indicates the classification result is a dog, and 0 indicates the classification result is a cat.
2. Data Preparation

Data download link: https://www.kaggle.com/c/dogs-vs-cats.

The downloaded Kaggle cats-and-dogs dataset consists of three files: train.zip, test.zip, and sample_submission.csv.
The training set contains 25,000 images of cats and dogs, with an equal number of each; each entry consists of the image itself and its file name, which follows the "type.num.jpg" convention (e.g., cat.0.jpg, dog.1.jpg).
[Figure: Training set example]
The test set contains 12,500 images of cats and dogs, without labeling whether it is a cat or a dog, and each image is named according to the “num.jpg” convention.
[Figure: Test set example]
The final predictions on the test set are written into sample_submission.csv in the required format.
[Figure: sample_submission.csv example]
We will divide the data into three parts: training set (60%), validation set (20%), and test set (20%) for subsequent validation and evaluation work.
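A possible sketch of this preparation step (the unzipped directory name `train/` is an assumption): labels are parsed from the "type.num.jpg" file names, and the files are shuffled and split 60% / 20% / 20% into training, validation, and test sets.

```python
# Parse labels from "type.num.jpg" file names and split the data 60/20/20.
import os
import random

files = [f for f in os.listdir("train") if f.endswith(".jpg")]   # assumed unzip directory
labels = [1 if f.startswith("dog") else 0 for f in files]        # 1 = dog, 0 = cat

pairs = list(zip(files, labels))
random.seed(0)
random.shuffle(pairs)

n = len(pairs)
train_set = pairs[: int(0.6 * n)]
val_set   = pairs[int(0.6 * n): int(0.8 * n)]
test_set  = pairs[int(0.8 * n):]
print(len(train_set), len(val_set), len(test_set))
```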
3. Model Selection
There are many models in Machine Learning, and the choice of which model to use should be based on data type, sample size, and the problem itself.
For this problem, which deals mainly with image data, a Convolutional Neural Network (CNN) is a natural choice for binary classification; one advantage of a CNN is that it avoids manual image preprocessing such as hand-crafted feature extraction. The structure of the CNN for cat-and-dog recognition is shown below:
The lowest layer is the input layer, which reads in the image as the network's input; the topmost layer is the output layer, which predicts the category of the input image. Since we only need to distinguish cats from dogs, the output layer has only two neurons. The layers between the input and output layers are hidden layers (here, convolutional layers), and we use three of them.
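One possible realization of this structure, written as a Keras sketch (TensorFlow is assumed to be available; the 128x128 input size is an assumption):

```python
# A three-hidden-layer CNN for the cat/dog binary classification described above.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(128, 128, 3)),             # input layer: the image
    layers.Conv2D(32, 3, activation="relu"),       # hidden (convolutional) layer 1
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),       # hidden (convolutional) layer 2
    layers.MaxPooling2D(),
    layers.Conv2D(128, 3, activation="relu"),      # hidden (convolutional) layer 3
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(2, activation="softmax"),         # output layer: two units (cat / dog)
])
model.summary()
```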
4. Model Training and Evaluation
We predefine a loss function to compute the loss value and evaluate the trained model using accuracy. The loss function LogLoss serves as the model evaluation metric:
LogLoss = -\frac{1}{n}\sum_{i=1}^{n}\left[ y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i) \right]

where n is the number of images, y_i = 1 if image i is a dog and 0 if it is a cat, and \hat{y}_i is the predicted probability that image i is a dog.
Accuracy measures the proportion of the algorithm's predictions that are correct:
Accuracy = \frac{TP + TN}{TP + TN + FP + FN}
TP (True Positive): the number of positive samples correctly predicted as positive.
FP (False Positive): the number of negative samples incorrectly predicted as positive.
TN (True Negative): the number of negative samples correctly predicted as negative.
FN (False Negative): the number of positive samples incorrectly predicted as negative.
[Figure: Loss and accuracy during training]
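Continuing the Keras sketch above, training and evaluation might look roughly like this; the placeholder arrays stand in for the real preprocessed images and labels, and `sparse_categorical_crossentropy` is the LogLoss criterion for the two-unit softmax output:

```python
# Training/evaluation sketch, continuing with the `model` defined earlier.
import numpy as np

# Placeholder arrays stand in for the real preprocessed images and labels.
train_images = np.random.rand(100, 128, 128, 3); train_labels = np.random.randint(0, 2, 100)
val_images   = np.random.rand(20, 128, 128, 3);  val_labels   = np.random.randint(0, 2, 20)

model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",   # the LogLoss criterion described above
    metrics=["accuracy"],                     # accuracy as defined above
)
history = model.fit(
    train_images, train_labels,
    validation_data=(val_images, val_labels),
    epochs=10, batch_size=32,
)
# history.history contains the loss and accuracy curves shown in the figure above.
```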
5. Prediction Results
Using the trained model, we load an image and check the recognition result:
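A sketch of this prediction step, continuing with the model trained above (the file name `test/1.jpg` and the 128x128 input size are assumptions):

```python
# Predict a single image with the trained model.
import numpy as np
from tensorflow.keras.preprocessing import image

img = image.load_img("test/1.jpg", target_size=(128, 128))
x = image.img_to_array(img) / 255.0            # scale pixel values to [0, 1]
x = np.expand_dims(x, axis=0)                  # add the batch dimension

proba = model.predict(x)[0]                    # [P(cat), P(dog)] from the softmax output
print("dog" if proba[1] > proba[0] else "cat", proba)
```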
Trend Analysis of Machine Learning
Serious research and development of Machine Learning arguably began in the 1980s. Using the AMiner platform, we ran a statistical analysis of recent Machine Learning papers and generated the following development trend chart:
[Figure: Development trend of Machine Learning research topics (AMiner)]
It can be seen that deep neural networks (Deep Neural Network), reinforcement learning (Reinforcement Learning), convolutional neural networks (Convolutional Neural Network), recurrent neural networks (Recurrent Neural Network), generative models (Generative Model), image classification (Image Classification), support vector machines (Support Vector Machine), transfer learning (Transfer Learning), active learning (Active Learning), and feature extraction (Feature Extraction) are hot research topics in Machine Learning.
The research heat of technologies related to deep learning, represented by deep neural networks and reinforcement learning, is rising rapidly and remains a research hotspot in recent years.
Finally, quoting a line from Han Yu’s “On Learning”:
“Excellence is achieved through diligence, and neglect leads to decline; success is achieved through thought, and destruction comes from carelessness.”
- END -
