This article is reprinted from the public account Xinkou Dushuo
KNN is a fundamental data classification algorithm in machine learning. Here, I will explain its utility through a story.
Let’s start with a joke.
During a matchmaking event, a beautiful girl asked the young man: Do you have a three-bedroom apartment?
The young man: No.
The girl then asked: Do you have a BMW 5?
The young man: No.
The girl was very displeased: Then forget it.
The matchmaking ended. The young man was very dejected and went home to tell his parents about it.
His father frowned: A car is negotiable; we could sell our Bentley to buy several BMWs.
But we shouldn’t downgrade our mansion to a three-bedroom apartment for this reason.
In this joke, the girl’s matchmaking algorithm is flawed; it is too crude, too simplistic, and fails to quantify the matchmaking standards, which are overly generalized.
Instead of setting a standard of three bedrooms, she should consider the size of the apartment in square meters.
Similarly, the car standard should not be based on brand but rather on the price of the vehicle.
This leads to missing a great romantic opportunity.
If the girl had used the KNN algorithm to select her matchmaking partners, she definitely would not have made such mistakes.
Let’s use the KNN algorithm to help the girl find her ideal partner.
Typically, matchmaking partners for a girl can be categorized into three types: A (very satisfied, eager to get a marriage certificate tomorrow),
B (generally okay, can spend some time together to see),
C (dissatisfied, does not want to contact again).
The criteria for judging the type of matchmaking partner are 2 factors: apartment size and vehicle price.
Below are some data on other girls’ boyfriends, which we will use as references (also known as sample data):
Boyfriend’s Name |
Apartment Size |
Vehicle Price |
Matchmaking Category |
Wang Sicong |
1000000 square meters (the properties in Wanda belong to his family) |
500 million |
A (very satisfied) |
Wu Yifan |
5000 square meters |
200 million |
A (very satisfied) |
Li Youcai |
120 square meters |
60 million |
B (can consider) |
Gao Henfu |
200 square meters |
50 million |
B (can consider) |
Wang Xiaoming |
60 square meters |
10 million |
B (can consider) |
Liu Meiqian |
12 square meters (rented) |
0.2 million (electric vehicle) |
C (pass) |
Now there is a matchmaking partner, Zhao Daping, with the following situation: apartment 100 square meters, car 30 million,
What type of matchmaking partner should he be?
Let’s calculate the distance (also known as difference) between him and the above sample data.
Here we will use Manhattan distance (what is Manhattan distance, Baidu it)
Boyfriend’s Name |
Zhao Daping’s Difference |
Wang Sicong |
|1000000-100|+|500-30|=999900+470=1000370(huge difference) |
Wu Yifan |
|5000-100|+|200-30|=4900+170=5070(not small distance) |
Li Youcai |
|120-100|+|60-30|=20+30=50(very close) |
Gao Henfu |
|200-100|+|50-30|=100+20=120(quite close) |
Wang Xiaoming |
|60-100|+|10-30|=40+20=60(quite close) |
Liu Meiqian |
|12-100|+|0.2-30|=88+29.8=117.8 |
Now let’s sort the above data from smallest to largest:
Boyfriend’s Name |
Zhao Daping’s Distance (from smallest to largest) |
Li Youcai |
50 |
Wang Xiaoming |
60 |
Liu Meiqian |
117.8 |
Gao Henfu |
120 |
Wu Yifan |
5070 |
Wang Sicong |
1000370 |
Here we take the top 4 boys (of course, we can also take 3 or 5 ; taking fewer is inaccurate, taking more is complex, usually not exceeding 20):
Li Youcai (B type), Wang Xiaoming(B type), Liu Meiqian (C type), Gao Henfu (B type).
Among these top 4 boys closest to Zhao Daping,B type is the most (3/4, which is 75%),
C type is less(1/4, which is 25%).
Thus, we take the type with the highest proportion and consider Zhao Daping to be of typeB, can consider.
This is theKNN algorithm (K-Nearest Neighbor).
Some students may ask: Why go through all this trouble? The girl’s previous standard was simple, but yours is too complicated, with sample data and sorting. Can’t we just set ranges? For example, those with apartments larger than 500 square meters are very satisfied, those from 500 to 90 square meters can be considered, and those less than 90 square meters are PASS, isn’t that simple?
However, reality is more complex; many things lack definite boundaries. If you set 90 as a boundary, what about someone with 89? According to the range division, this would be PASS, but in reality, 90 and 89 are quite similar. There’s no way to set ranges.
—————END—————
Friends who like this article are welcome to follow the public account Programmer Xiao Hui for more exciting content