Understanding Nearest Neighbor Classification

Do you know how proteins, vegetables, and fruits are classified? In life, we find that things that are neither crunchy nor sweet are proteins, crunchy but not sweet are vegetables, and fruits are often sweet, possibly crunchy or not. Based on this life experience (birds of a feather flock together), do you know whether a tomato is a fruit or a vegetable? First, let’s look at the following set of data.

Food	Sweetness	Crunchiness	Food Type
Grapes	8	5	Fruit
Green Beans	3	7	Vegetable
Nuts	3	6	Protein
Oranges	7	3	Fruit

Now, if we know that the sweetness of a tomato is 6 and its crunchiness is 4, if we plot this data on a 2D plane with sweetness on the x-axis and crunchiness on the y-axis, we can easily calculate the straight-line distance between the tomato and the other four foods. For example, the distance between the tomato and the green beans is:

d(TomatoGreen Beans)=√(6−3)2+(4−7)2=4.2d(TomatoGreen Beans)=(6−3)2+(4−7)2=4.2

Based on the above algorithm, we calculated the distances between the tomato and grapes, green beans, nuts, and oranges, which are 2.2, 4.2, 3.6, and 1.4 respectively.

We find that the distance between the tomato and the orange is the shortest, so we conclude that the tomato is a fruit. Here we actually only selected the nearest “neighbor”, which is k=1, a 1NN classification.

If we use the KNN algorithm with k=3, it will vote among the three nearest neighbors, namely the orange, grape, and nut. Since there are two votes for fruit (2/3 of the votes), the tomato is classified as a fruit again.

So how do we choose the appropriate k?

If we choose a very large k, it will reduce the influence of noisy data on the model, or reduce the model’s fluctuations caused by noise, but it may cause the classifier to ignore subtle but important patterns. For example, if k is chosen as the total number of observations, and if the proportion of fruits in the original data is very high, the voting results may easily lead to the tomato being classified as a fruit.
If we choose a single nearest neighbor, this can cause noisy data or outliers to excessively influence the model, leading to instability.

Clearly, the best k value should be some value between the two extremes. In practice, k is generally chosen as the square root of the sample size. For example, if there are 15 samples, the square root of 15 is 3.87, and we can set k=4. Another method is to test multiple k values based on various test datasets and choose the one that provides the best classification performance. Unless the noise is very large, a large dataset can make the choice of k less important, as there are enough neighbors participating in the “vote”.

Requirements for KNN Algorithm Data

If we add another feature to the dataset, such as the spiciness of the food, with spiciness values ranging from 0 to over 1 million, while sweetness and crunchiness values range from 1 to 10, the scale difference causes spiciness to have a much greater impact on the distance function than sweetness and crunchiness. If we do not adjust the data, we can foresee that the distance measurement will be greatly influenced by spiciness, while the effects of crunchiness and sweetness can be almost ignored.

The solution is to standardize the original data so that each feature’s values fall within the range of 0 to 1, or to make each feature comparable in magnitude. There are two common methods:

Min-max normalization. Each value of feature X is subtracted by its minimum value and divided by the range of feature X. Here we directly create a function for easy future calls.
```
normalize_mm = function(x){  return ((x-min(x)) / (max(x)-min(x)))}
```
Z-score normalization. Each value of feature X is subtracted by the mean of feature X and then divided by the standard deviation of feature X. The function is created as follows:
```
normalize_z = function(x){  return ((x-mean(x)) / sd(x))}
```

Note: To calculate distances for non-numeric data, the original data must first be converted into numeric data. A typical solution is to use dummy variable encoding. For example, 1 represents male and 0 represents female.

Step 1: Collecting Data

The data in this article comes from the Wisconsin Breast Cancer dataset, which contains 699 samples and 11 variables. It can be found in the UCI Machine Learning Database (http://archive.ics.uci.edu/ml/index.php)

# Get the original data
loc = "https://archive.ics.uci.edu/ml/machine-learning-databases/"
ds = "breast-cancer-wisconsin/breast-cancer-wisconsin.data"
url = paste(loc, ds, sep="")
breast = read.table(url, sep=",", header=FALSE, na.strings="?")
# Add variable names to the original data
names(breast) = c("ID", "clumpThickness", "sizeUniformity","shapeUniformity",                  "maginalAdhesion","singleEpithelialCellSize", "bareNuclei",                  "blandChromatin", "normalNucleoli", "mitosis", "class")
# Set factor variables, where class is coded as 2 for benign and 4 for malignant
breast$class = factor(breast$class, levels=c(2,4),                  labels=c("benign", "malignant"))

Variable descriptions in the dataset:

ID: Sample ID number
clumpThickness: Thickness of the mass
sizeUniformity: Uniformity of cell size
shapeUniformity: Uniformity of cell shape
maginalAdhesion: Marginal adhesion
singleEpithelialCellSize: Size of a single epithelial cell
bareNuclei: Bare nuclei
blandChromatin: Bland chromatin
normalNucleoli: Normal nucleoli
mitosis: Mitosis
class: Class. Benign indicates benign, malignant indicates malignant.

Step 2: Exploring and Preparing Data

# Clean the data
library(tidyverse)
breast %>%   select(-ID) %>%   # Remove ID column, which is irrelevant to the model  na.omit() ->df  # The proportion of missing values is very small, so we directly delete it
# Standardize the data except for the class column
df_n = as.data.frame(lapply(df[1:9], normalize_mm))
# Create training and validation datasets
set.seed(1234)  # Set random seed for reproducibility
train = sample(nrow(df), 0.7*nrow(df))  # Randomly sample 70% of the original data for training
df_train = df_n[train,]  # df_train is the training dataset
df_validate = df_n[-train,]  # df_validate is the validation dataset
df_train_labels = df[train, 10] # Diagnostic results for the training dataset
df_validate_labels = df[-train, 10] # Diagnostic results for the validation dataset
# Preliminary statistics on training and validation datasets
df_train %>%  cbind(df_train_labels) %>%   rename(class = df_train_labels) %>%   mutate(value = 1) %>%  group_by(class) %>%  summarise(total = sum(value)) ->train_stat
df_validate %>%  cbind(df_validate_labels) %>%  rename(class = df_validate_labels) %>%   mutate(value = 1) %>%  group_by(class) %>%  summarise(total = sum(value)) ->validate_stat

The first variable ID is not included in the data analysis, and the last variable class is the output variable.

For each sample, the other nine variables are cell features related to the diagnosis of malignant tumors, and no single variable can serve as a criterion for distinguishing benign from malignant. The goal of modeling is to find some combination of the nine cell features to accurately predict malignant tumors.The data is extracted from the UCI database, missing values are removed, and the training set and validation set are randomly divided, with 478 samples in the training set (70%), including 317 benign samples and 161 malignant samples; the validation set contains 205 samples (30%), including 127 benign and 78 malignant.

Step 3: Training the Model Based on Data

library(class)
knn.pred = knn(train = df_train,               test = df_validate,               cl = df_train_labels,               k = 22)    # The parameter is set to 22 because there are 478 training samples, and the square root is 22

Step 4: Evaluating Model Performance

library(gmodels)
CrossTable(x = df_validate_labels,           y = knn.pred,           dnn = c("Actual", "Predicted"),           prop.chisq = FALSE)

## ##  ##    Cell Contents## |-------------------------|## |                       N |## |           N / Row Total |## |           N / Col Total |## |         N / Table Total |## |-------------------------|## ##  ## Total Observations in Table:  205 ## ##  ##              | Predicted ##       Actual |    benign | malignant | Row Total | ## -------------|-----------|-----------|-----------|##       benign |       125 |         2 |       127 | ##              |     0.984 |     0.016 |     0.620 | ##              |     0.940 |     0.028 |           | ##              |     0.610 |     0.010 |           | ## -------------|-----------|-----------|-----------|##    malignant |         8 |        70 |        78 | ##              |     0.103 |     0.897 |     0.380 | ##              |     0.060 |     0.972 |           | ##              |     0.039 |     0.341 |           | ## -------------|-----------|-----------|-----------|## Column Total |       133 |        72 |       205 | ##              |     0.649 |     0.351 |           | ## -------------|-----------|-----------|-----------|## ##

The top left corner represents true negatives, while the bottom right corner represents true positives. The prediction accuracy is (125+70)/205*100%=95.12%. We also find that in the bottom left corner, there are 8 samples that are actually malignant but were incorrectly classified as benign by KNN, which are false negatives; in the top right corner, there are 2 samples that are actually benign but were incorrectly classified as malignant by KNN, which are false positives. However, the prediction accuracy is still quite high, and the model is satisfactory.

Step 5: Improving Model Performance

Here we can try two simple changes: one is to consider using z-score normalization during data standardization, and the other is to try several different k values. It is important to note that excessively pursuing prediction accuracy may lead to overfitting, increasing the likelihood of fitting noise, thereby weakening generalization ability.

In determining the k value, the caret package can also play a significant role.

library(caret)
set.seed(1234) # Set random seed for reproducibility
grid = expand.grid(.k = seq(2, 20, by = 1))
control = trainControl(method = "cv")
df_validate %>%  cbind(df_validate_labels) %>%  rename(class = df_validate_labels) ->train
knn.train = train(class~.,                  data = train,                   method = "knn",                  trControl = control,                  tuneGrid = grid)
knn.train

## k-Nearest Neighbors ## ## 205 samples##   9 predictor##   2 classes: 'benign', 'malignant' ## ## No pre-processing## Resampling: Cross-Validated (10 fold) ## Summary of sample sizes: 185, 184, 185, 184, 185, 184, ... ## Resampling results across tuning parameters:## ##   k   Accuracy   Kappa    ##    2  0.9561905  0.9049269##    3  0.9564286  0.9067712##    4  0.9466667  0.8857997##    5  0.9611905  0.9170926##    6  0.9561905  0.9064544##    7  0.9511905  0.8953774##    8  0.9514286  0.8951847##    9  0.9609524  0.9163366##   10  0.9609524  0.9163366##   11  0.9511905  0.8933637##   12  0.9511905  0.8933637##   13  0.9511905  0.8933637##   14  0.9461905  0.8817695##   15  0.9461905  0.8817695##   16  0.9511905  0.8933637##   17  0.9414286  0.8714247##   18  0.9461905  0.8817695##   19  0.9464286  0.8830189##   20  0.9414286  0.8699629## ## Accuracy was used to select the optimal model using the largest value.## The final value used for the model was k = 5.

The report shows that when k=5, the Kappa statistic (used to measure the consistency of classification between two classifiers on observed values) is the highest.

Next, we will retrain the model using k=5:

knn.pred_new = knn(train = df_train,               test = df_validate,               cl = df_train_labels,               k = 5)
CrossTable(x = df_validate_labels,           y = knn.pred_new,           dnn = c("Actual", "Predicted"),           prop.chisq = FALSE)

## ##  ##    Cell Contents## |-------------------------|## |                       N |## |           N / Row Total |## |           N / Col Total |## |         N / Table Total |## |-------------------------|## ##  ## Total Observations in Table:  205 ## ##  ##              | Predicted ##       Actual |    benign | malignant | Row Total | ## -------------|-----------|-----------|-----------|##       benign |       123 |         4 |       127 | ##              |     0.969 |     0.031 |     0.620 | ##              |     0.969 |     0.051 |           | ##              |     0.600 |     0.020 |           | ## -------------|-----------|-----------|-----------|##    malignant |         4 |        74 |        78 | ##              |     0.051 |     0.949 |     0.380 | ##              |     0.031 |     0.949 |           | ##              |     0.020 |     0.361 |           | ## -------------|-----------|-----------|-----------|## Column Total |       127 |        78 |       205 | ##              |     0.620 |     0.380 |           | ## -------------|-----------|-----------|-----------|## ##

We compare the two results and find that the false negatives have decreased by 4, true positives have increased by 2, but overall prediction accuracy has improved to (123+74)/205*100%=96.10%, an increase of nearly 1 percentage point.

The author also tried using z-score normalization on the original data, determining the optimal k value to be 11 based on the Kappa statistic, resulting in true negatives of 123, true positives of 73, false negatives of 5, false positives of 4, and accuracy of (123+73)/205*100%=95.61%, which is also a slight improvement.

Finally, it should be noted that there are other methods to weight distances, and the kknn package provides 10 different weighting methods that can be tried if interested.

Appreciate the Author

——————————————

Previous Highlights:

In these days when academia is nationalized, I miss Einstein a bit.
“The ‘Cryptocurrency Circle’s Jia Yueting’ at a sky-high price collides with Buffett, Wang Xiaochuan responds from a distance: Fraud!”
Why Huawei? Five aspects regarding Huawei’s blockade.

Implementing K-Nearest Neighbors Algorithm in R

Table of Contents