Weak artificial intelligence has made significant breakthroughs in recent years and has quietly become an essential part of everyone’s life. Take our smartphones as an example and see how much artificial-intelligence magic is hidden inside them.
The image below shows some common applications installed on a typical iPhone. Many people might not guess that artificial intelligence is the core driving force behind many of them.
Figure 1: Related applications on the iPhone
Smart assistant applications like Apple Siri, Baidu Duer, and Microsoft Xiaoice are trying to revolutionize the way you communicate with your phone, turning it into a smart secretary. News applications rely on intelligent recommendation technology to push the content that suits you best; Meitu Xiuxiu automatically turns photos and videos into artistic creations;
shopping applications use smart logistics technology to help businesses distribute goods efficiently and safely, improving buyer satisfaction; Didi Chuxing helps drivers choose routes, and in the near future autonomous-driving technology will redefine smart travel. All of these developments rest mainly on one way of achieving artificial intelligence: machine learning.
Traditional machine learning algorithms include decision trees, clustering, Bayesian classification, support vector machines, EM, Adaboost, etc.
This article gives a common-sense introduction to these commonly used algorithms, with no complex theoretical derivations: mostly intuitive, visual explanations of what each algorithm is and how it is applied, plus a few minimal sketches to make the ideas concrete.
A decision tree classifies data based on features: each node poses a question that splits the data into two branches, and the questions continue down the tree. These questions are learned from the existing data; when new data comes in, it is routed by the questions in the tree to the appropriate leaf, which gives its class.
Figure 2: Schematic diagram of decision tree principle
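As a minimal code sketch of this idea (the toy features and labels below are invented purely for illustration), scikit-learn's DecisionTreeClassifier learns such questions from data and routes new points to a leaf:

```python
# Minimal decision-tree sketch; the toy data is invented for illustration.
from sklearn.tree import DecisionTreeClassifier

# Each row is [feature_1, feature_2]; y holds the known classes.
X = [[0, 0], [1, 0], [0, 1], [1, 1]]
y = [0, 0, 1, 1]

tree = DecisionTreeClassifier(max_depth=2)
tree.fit(X, y)  # the questions at each node are learned from the existing data

# New data is routed through the learned questions to a leaf and classified.
print(tree.predict([[0.9, 0.8]]))  # -> [1]
```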
Randomly select data from the source data to form several subsets:
Figure 3-1: Schematic diagram of random forest principle
The matrix S is the source data, with data points 1 to N; A, B, C are the features, and the last column holds the class label:
Generate M sub-matrices randomly from S:
These M subsets are used to train M decision trees. To classify new data, feed it into all M trees, collect the M classification results, and take the category that receives the most votes as the final prediction.
Figure 3-2: Random forest effect demonstration
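A minimal sketch of the same procedure with scikit-learn's RandomForestClassifier, where n_estimators plays the role of M (the toy data is again invented for illustration):

```python
# Random forest: M trees trained on random subsets, majority vote at prediction time.
from sklearn.ensemble import RandomForestClassifier

X = [[0, 0], [1, 0], [0, 1], [1, 1], [0.2, 0.1], [0.8, 0.9]]
y = [0, 0, 1, 1, 0, 1]

forest = RandomForestClassifier(n_estimators=10, random_state=0)  # M = 10 trees
forest.fit(X, y)

# Each tree votes; the class with the most votes is the final prediction.
print(forest.predict([[0.7, 0.8]]))
```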
When the prediction target is a probability, the output must lie between 0 and 1. A simple linear model cannot guarantee this: once the input leaves a certain range, the output also leaves the required interval.
Figure 4-1: Linear model diagram
In this case, a model shaped like the one below works better:
Figure 4-2
So how do we obtain such a model?
This model needs to satisfy two conditions: its output must be greater than or equal to 0 and less than or equal to 1. For the first condition we could use an absolute value or a square; here we use the exponential function, which is always positive. For the second condition we use division: with the value itself as the numerator and the value plus 1 as the denominator, the ratio is always less than 1.
Figure 4-3
After some transformations, we get the logistic regression model:
Figure 4-4
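Written out in formulas (one way to express the construction above), the linear score is first exponentiated so it stays positive, then divided by itself plus one so it stays below 1, which gives the familiar sigmoid:

$$ z = w^{T}x + b, \qquad p = \frac{e^{z}}{e^{z} + 1} = \frac{1}{1 + e^{-z}}, \qquad 0 < p < 1 $$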
By fitting the model to the source data, we obtain the corresponding coefficients:
Figure 4-5
Finally, we obtain the logistic regression curve:
Figure 4-6: LR model curve graph
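A minimal fitting sketch with scikit-learn's LogisticRegression; the single-feature toy data is invented for illustration, and the fitted coefficients play the role of the ones read off from the source data above:

```python
# Logistic-regression sketch; toy data invented for illustration.
from sklearn.linear_model import LogisticRegression

# One feature; label 1 becomes more likely as the feature value grows.
X = [[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]]
y = [0, 0, 0, 1, 1, 1]

model = LogisticRegression()
model.fit(X, y)

print(model.coef_, model.intercept_)   # the learned coefficients
print(model.predict_proba([[2.0]]))    # probabilities in [0, 1], following the S-curve
```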
To separate two categories, we want to obtain a hyperplane, and the optimal hyperplane is the one that maximizes the margin to the two categories, where the margin is the distance from the hyperplane to the closest point. As shown in the figure, Z2 > Z1, so the green hyperplane is better.
Figure 5: Classification problem schematic
Represent this hyperplane as a linear equation: points of one class above the line give a value greater than or equal to 1, and points of the other class give a value less than or equal to -1:
The distance from a point to the plane is calculated based on the formula in the figure:
So the expression for the total margin is as follows. The goal is to maximize this margin, which amounts to minimizing its denominator, turning the task into an optimization problem:
For example, given three points, find the optimal hyperplane; define the weight vector as (2, 3) - (1, 1):
This gives a weight vector of the form (a, 2a). Substitute the two points into the hyperplane equation: (2, 3) should give the value 1 and (1, 1) the value -1. Solving for a and the intercept w0 yields the expression of the hyperplane.
Once a is found, substituting it back into (a, 2a) gives the weight vector, and the points lying on the margin are the support vectors; substituting a and w0 into the hyperplane equation gives the support vector machine.
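Working the arithmetic through for the two points above (assuming the usual setup in which the hyperplane is w·x + w0 = 0 and the two margin values are +1 and -1):

$$ (2, 3):\; 2a + 3\cdot 2a + w_0 = 8a + w_0 = 1, \qquad (1, 1):\; a + 2a + w_0 = 3a + w_0 = -1 $$

Subtracting gives 5a = 2, so a = 2/5 and w0 = -11/5, and the hyperplane is (2/5)x1 + (4/5)x2 - 11/5 = 0, or equivalently x1 + 2x2 - 5.5 = 0.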
For example, in an NLP application: given a piece of text, return its sentiment classification, i.e., whether the attitude it expresses is positive or negative:
Figure 6-1: Problem case
To solve this problem, we look at only some of the words:
This text will be represented only by some words and their counts:
The original question is: given a sentence, which category does it belong to? Through Bayes’ rule, this is transformed into an easier question:
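In symbols, Bayes’ rule reads:

$$ P(\text{class} \mid \text{sentence}) = \frac{P(\text{sentence} \mid \text{class}) \, P(\text{class})}{P(\text{sentence})} $$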
The problem becomes: what is the probability of this sentence appearing in this category? And don’t forget the other two probabilities in the formula. For example, the probability of the word “love” appearing in positive texts is 0.1, while in negative texts it is 0.001.
Figure 6-2: NB algorithm result demonstration
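A minimal sketch of this word-count representation combined with Bayes’ rule, using scikit-learn's CountVectorizer and MultinomialNB (the tiny corpus and its labels are invented for illustration):

```python
# Naive-Bayes sentiment sketch; toy corpus invented for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["I love this movie", "great and fun", "I hate this movie", "boring and awful"]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

# Represent each text only by some words and their counts, as described above.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

model = MultinomialNB()
model.fit(X, labels)

# P(class | sentence) is computed via Bayes' rule from per-class word probabilities.
print(model.predict(vectorizer.transform(["I love fun movies"])))
```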
When given new data, determine which category it belongs to based on the majority class among the k nearest points.
Example: to distinguish between “cats” and “dogs”, using “claws” and “sound” as two features, the circles and triangles are known classifications; what category does this “star” represent?
Figure 7-1: Problem case
When k = 3, the three nearest points are the ones joined by the lines; since circles are in the majority, the star belongs to the cat category.
Figure 7-2: Algorithm step demonstration
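A minimal sketch of the same decision with scikit-learn's KNeighborsClassifier and k = 3; the numeric “claws” and “sound” values are invented for illustration:

```python
# k-nearest-neighbours sketch (k = 3) for the cats-vs-dogs example.
from sklearn.neighbors import KNeighborsClassifier

X = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1],   # circles (cats)
     [3.0, 3.0], [3.2, 2.8]]               # triangles (dogs)
y = ["cat", "cat", "cat", "dog", "dog"]

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)

# The "star" is assigned the majority class among its 3 nearest neighbours.
print(knn.predict([[1.1, 1.0]]))  # -> ['cat']
```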
Suppose a set of data needs to be divided into three clusters; in the figure, pink values are high and yellow values are low. Initially, we pick the simplest values 3, 2, and 1 as the initial center of each cluster. Each remaining point is then assigned, by its distance to these three initial values, to the cluster whose center is nearest.
Figure 8-1: Problem case
After this assignment, compute the mean of each cluster and use it as the new center point for the next round:
Figure 8-2
After several rounds, if the groupings no longer change, we can stop:
Figure 8-3: Algorithm result demonstration
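A minimal NumPy sketch of these two alternating steps; the one-dimensional toy data is invented, while the initial centers 3, 2, and 1 follow the example above:

```python
# K-means sketch: assign each point to the nearest center, recompute the centers
# as cluster means, and repeat until the grouping no longer changes.
import numpy as np

data = np.array([1.0, 1.2, 1.9, 2.1, 2.8, 3.1, 3.3])
centers = np.array([3.0, 2.0, 1.0])  # the simplest initial values, as in the example

for _ in range(100):
    # Assignment step: each point joins the cluster whose center is nearest.
    labels = np.argmin(np.abs(data[:, None] - centers[None, :]), axis=1)
    # Update step: each center moves to the mean of its cluster.
    new_centers = np.array([data[labels == k].mean() for k in range(len(centers))])
    if np.allclose(new_centers, centers):  # groupings no longer change -> stop
        break
    centers = new_centers

print(labels, centers)
```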
Adaboost is one of the boosting methods. Boosting combines several classifiers that individually perform poorly into a classifier that performs well.
In the figure below, the two decision trees on the left and right do not perform well on their own, but when the same data is fed into both and the two results are considered together, the prediction becomes more credible.
Figure 9-1: Algorithm principle demonstration
As an example of Adaboost in handwriting recognition, many features can be captured from the canvas, such as the direction of the starting stroke and the distance between the starting point and the end point.
Figure 9-2
During training, the weight of each feature is learned. For example, the digits 2 and 3 begin with very similar strokes, so this feature contributes little to telling them apart and its weight will be small.
Figure 9-3
This alpha angle, however, is highly discriminative, so its weight will be larger; the final prediction is a weighted combination of all these features.
Figure 9-4
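A minimal sketch with scikit-learn's AdaBoostClassifier, whose default weak learners are one-level decision trees; the toy data is invented for illustration:

```python
# AdaBoost sketch: weak learners are combined, each weighted by how well it performs.
from sklearn.ensemble import AdaBoostClassifier

X = [[0, 0], [1, 0], [0, 1], [1, 1], [0.3, 0.2], [0.8, 0.7]]
y = [0, 0, 1, 1, 0, 1]

model = AdaBoostClassifier(n_estimators=10)
model.fit(X, y)

# The final prediction is a weighted vote over the weak classifiers,
# analogous to weighting the handwriting features above.
print(model.predict([[0.9, 0.8]]))
```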
Neural networks are suitable when an input may fall into at least two categories. A neural network consists of several layers of neurons and the connections between them: the first layer is the input layer and the last is the output layer, and both the hidden layers and the output layer have their own classifiers.
Figure 10-1: Neural network structure
The input is fed into the network and activated; the computed scores are passed on and activate the next layer in turn. Finally, the scores on the output-layer nodes represent the score for each category; in the example below, the classification result is class 1. The same input is passed to different nodes, and the results differ because each node has its own weights and bias. This left-to-right computation is called forward propagation.
Figure 10-2: Algorithm result demonstration
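A minimal NumPy sketch of forward propagation through one hidden layer; the weights, biases, and input values are invented for illustration:

```python
# Forward propagation through a tiny fully connected network.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, 0.9])                      # input layer

# Hidden layer: every node has its own weights and bias, so outputs differ.
W1 = np.array([[0.2, 0.8], [0.4, 0.1], [0.6, 0.3]])
b1 = np.array([0.1, -0.2, 0.05])
h = sigmoid(W1 @ x + b1)

# Output layer: one score per class; the largest score gives the predicted class.
W2 = np.array([[0.3, 0.7, 0.2], [0.5, 0.1, 0.9]])
b2 = np.array([0.0, 0.1])
scores = W2 @ h + b2

print(scores, "-> class", int(np.argmax(scores)))
```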
Markov chains consist of states and transitions between them. For example, from the sentence ‘the quick brown fox jumps over the lazy dog’ we can derive a Markov chain.
Steps: first assign each word as a state, then calculate the transition probabilities between states.
Figure 11-1: Markov principle diagram
These are the probabilities calculated from a single sentence. When you gather statistics over a large amount of text, you obtain a larger state-transition matrix, for example the words that can follow ‘the’ and their corresponding probabilities.
Figure 11-2: Algorithm result demonstration
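A minimal sketch that counts the transition probabilities directly from the example sentence above:

```python
# Markov-chain sketch: each word is a state; transitions are counted from the sentence.
from collections import Counter, defaultdict

sentence = "the quick brown fox jumps over the lazy dog".split()

counts = defaultdict(Counter)
for current_word, next_word in zip(sentence, sentence[1:]):
    counts[current_word][next_word] += 1

# Normalise the counts into transition probabilities between states.
transitions = {
    word: {nxt: c / sum(nexts.values()) for nxt, c in nexts.items()}
    for word, nexts in counts.items()
}

print(transitions["the"])  # {'quick': 0.5, 'lazy': 0.5}
```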
The ten machine learning algorithms introduced above are pioneers of the development of artificial intelligence. Even today, they are still widely used in data mining and in artificial-intelligence problems with small samples.