The Lazy Algorithm – KNN


This article introduces one of the most basic, and also the "laziest," algorithms in machine learning: KNN (k-nearest neighbors). Do you know why it is called the laziest?

01|Algorithm Introduction:

KNN is short for k-nearest neighbors, i.e., the K closest points. The algorithm is commonly used for classification problems. Its principle: to classify a point A, first find the K training points closest to A, then see which category the majority of those K points belong to, and assign A to that category.

This is actually consistent with how we evaluate people in life. If you want to know what kind of person someone is, find the K people closest to them (in terms of relationship), see what kind of people those K individuals are, and you can judge what kind of person they are.

02|Three Elements of the Algorithm:

Based on this principle, the algorithm can be broken into three parts: the first is choosing the value of K, i.e., how many neighbors to look at; the second is distance measurement, i.e., how to find the K closest points; the third is the classification rule, i.e., the standard used to decide which category the point belongs to.

1. Selection of K value

The choice of K significantly affects the result of the KNN algorithm. A simple example: as shown in the figure, should the green circle be assigned to the red triangles or the blue squares? If K = 3, red triangles make up 2/3 of the neighbors, so the green circle is assigned to the red triangle class. If K = 5, blue squares make up 3/5 of the neighbors, so the green circle is assigned to the blue square class.

[Figure 2.1: Classification of the green circle for K = 3 vs. K = 5. Source: the Internet]

As you can see, the choice of K directly affects the result. If K is chosen too large, the prediction is effectively made using training instances from a much larger neighborhood. More data might sound like it should be more accurate, but it is not: to gather more neighbors you must reach farther away, and those farther points are less similar to the point being classified, which hurts prediction accuracy.

Again, using the analogy of judging what kind of person someone is, if you choose a large K value, such as the entire class, and predict based on the situation of everyone in that class, it is obviously inaccurate.

If K is too small, the prediction depends on only a few nearest points; if those happen to be noise or outliers, the prediction will also be thrown off.

Too large is not good, and too small is not good either, so what should we do? The simplest and most effective approach is to try. In the previous article we mentioned that cross-validation is one method of model selection; we can also use cross-validation to select K.
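For example, a minimal sketch of picking K by cross-validation, here using scikit-learn's KNeighborsClassifier and the iris dataset (both are assumptions used only for illustration; the movie example later in this article is implemented by hand):

```python
# A minimal sketch: choose K by 5-fold cross-validation.
# scikit-learn and the iris dataset are assumptions used only for illustration.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)   # any labeled dataset works here

best_k, best_score = None, 0.0
for k in range(1, 21):              # candidate K values to try
    score = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    if score > best_score:
        best_k, best_score = k, score

print(f"best K = {best_k}, cross-validated accuracy = {best_score:.3f}")
```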

2. Distance Measurement

When we evaluate how close two people are, we do not have a quantifiable measure; we just use words such as best friend (childhood friend), roommate, or classmate to describe the closeness.

However, in statistical learning, we have a quantifiable measure when we evaluate the closeness of two entities, and here we use Euclidean distance.

Euclidean distance, also known as Euclidean metric, refers to the actual distance between two points in m-dimensional space.

  • The Euclidean distance between two points a(x1,y1) and b(x2,y2) in a two-dimensional plane:

d(a, b) = √( (x1 − x2)² + (y1 − y2)² )

  • The Euclidean distance between two points a(x1,y1,z1) and b(x2,y2,z2) in three-dimensional space:

d(a, b) = √( (x1 − x2)² + (y1 − y2)² + (z1 − z2)² )

  • The Euclidean distance between two n-dimensional vectors a(x11,x12,…,x1n) and b(x21,x22,…,x2n):

d(a, b) = √( (x11 − x21)² + (x12 − x22)² + … + (x1n − x2n)² )

It can also be expressed in vector operation form:

d(a, b) = √( (a − b)(a − b)ᵀ )

Of course, we can also use other distances to measure the closeness of two entities, such as Manhattan distance (doesn’t that sound impressive?), for more details click: https://wenku.baidu.com/view/ebde5d0e763231126edb1113.html
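To make that concrete, a quick sketch of both distances with numpy (the sample vectors are arbitrary):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

# Euclidean distance: square the coordinate differences, sum them, take the square root
euclidean = np.sqrt(np.sum((a - b) ** 2))   # 5.0

# Manhattan distance, for comparison: sum of absolute coordinate differences
manhattan = np.sum(np.abs(a - b))           # 7.0

print(euclidean, manhattan)
```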

3. Determining Classification Rules:

Currently, we use the majority-voting rule: the category shared by the majority of the K nearest points is taken as the category of the point to be predicted.
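A minimal sketch of majority voting over the K nearest labels, in the dict.get / dict.items / operator.itemgetter style listed at the end of this article (the label list here is a made-up example):

```python
import operator

# Labels of the K nearest neighbors (hypothetical example with K = 3)
k_nearest_labels = ["action", "romance", "romance"]

vote_count = {}
for label in k_nearest_labels:
    vote_count[label] = vote_count.get(label, 0) + 1   # count votes per label

# Sort (label, count) pairs by count, descending; the first entry wins
ranked = sorted(vote_count.items(), key=operator.itemgetter(1), reverse=True)
predicted = ranked[0][0]   # "romance"
```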

04|Algorithm Steps:

  1. Collect data: Find the text data to be trained.

  2. Prepare data: Use Python to parse the text file.

  3. Analyze data: Perform some statistical analysis on the data to have a basic understanding.

  4. Train the algorithm: KNN does not have this step, which is why it is called the laziest algorithm.

  5. Test the algorithm: Use cross-validation to test the algorithm with the provided data.

  6. Use the algorithm: Once testing shows a satisfactory accuracy, apply the algorithm directly in practice.

05|Classifying Unknown Movies Using Python:

1. Background:

Assume that romantic movies and action movies can be distinguished by the number of fight scenes and the number of kisses. Below are some movies with their categories and their corresponding kiss and fight counts (the training dataset).

[Table 6-1: Training movies with their fight-scene counts, kiss counts, and categories. Source: the Internet]

Now there is a movie A, known to have 18 fight scenes and 90 kisses, and we need to use the KNN algorithm to predict which category this movie belongs to.

2. Prepare data

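A minimal sketch of this step, assuming a small hypothetical pandas DataFrame in the spirit of Table 6-1 (the fight and kiss counts below are illustrative, not the article's actual numbers):

```python
import pandas as pd

# Hypothetical training data: fight-scene count, kiss count, and category
df = pd.DataFrame({
    "fights": [3, 2, 1, 101, 99, 98],
    "kisses": [104, 100, 81, 10, 5, 2],
    "label":  ["romance", "romance", "romance", "action", "action", "action"],
})

# Convert the feature columns to a numpy array (df.values; older code used df.as_matrix())
dataset = df[["fights", "kisses"]].values
labels = df["label"].tolist()
```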

3. Analyze data

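A sketch of a simple scatter plot for this step, continuing from the df built in the preparation sketch (SimHei is only one common way to avoid the garbled-Chinese-label problem mentioned in the conclusion):

```python
import matplotlib.pyplot as plt

# Avoid garbled Chinese labels in the plot; SimHei is one commonly available Chinese font
plt.rcParams["font.sans-serif"] = ["SimHei"]
plt.rcParams["axes.unicode_minus"] = False

# df is the DataFrame from the preparation sketch above
colors = ["red" if lbl == "romance" else "blue" for lbl in df["label"]]  # list comprehension
plt.scatter(df["fights"], df["kisses"], c=colors)
plt.xlabel("fight scenes")
plt.ylabel("kisses")
plt.title("Training movies")
plt.show()
```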

4. Test the algorithm

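A minimal sketch of a hand-rolled classifier, reusing dataset and labels from the preparation sketch and the numpy functions listed in the conclusion (the function name classify0 is an assumption; inX is the query point):

```python
import operator
import numpy as np

def classify0(inX, dataset, labels, k):
    """Return the majority label among the k training points nearest to inX."""
    diff = np.tile(inX, (dataset.shape[0], 1)) - dataset   # repeat inX and subtract each row
    distances = np.sqrt(np.sum(diff ** 2, axis=1))         # Euclidean distance to every row
    nearest = distances.argsort()[:k]                      # indices of the k closest rows

    vote_count = {}
    for i in nearest:
        vote_count[labels[i]] = vote_count.get(labels[i], 0) + 1
    ranked = sorted(vote_count.items(), key=operator.itemgetter(1), reverse=True)
    return ranked[0][0]

# Movie A: 18 fight scenes and 90 kisses
print(classify0(np.array([18, 90]), dataset, labels, k=3))   # "romance" with the hypothetical data
```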

After testing, it is concluded that if a movie contains 18 fight scenes and 90 kisses, it can be determined that the movie is a romantic film.

5. Apply the algorithm:

By modifying the value of inX, you can directly determine the type of the movie.
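For instance, reusing classify0 and the hypothetical data from the sketches above:

```python
# Change inX to classify a different movie, e.g. one with 95 fight scenes and 4 kisses
inX = np.array([95, 4])
print(classify0(inX, dataset, labels, k=3))   # "action" with the hypothetical data
```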

06|In Conclusion:

Some knowledge points involved in the Python implementation process above:

  1. Converting a pandas DataFrame to a numpy array: df.values (the older df.as_matrix() is deprecated)

  2. Fixing garbled Chinese characters in matplotlib plots

  3. List comprehension

  4. np.tile() function

  5. np.sum() function

  6. np.argsort() function

  7. dict.get() method

  8. dict.items() method

  9. operator.itemgetter() function
