The K-Nearest Neighbors (KNN) algorithm is a simple, intuitive supervised learning algorithm widely used for classification and regression tasks. It classifies a sample by taking a majority vote among the labels of its K closest training points. This article will guide you step by step through implementing KNN with the scikit-learn library in Python, practicing on the classic Iris dataset. Let’s explore how to classify Iris flowers using the KNN algorithm!
1. Preparation
First, we need to install the necessary libraries. If you haven’t installed scikit-learn and matplotlib yet, you can do so with the following command:
pip install scikit-learn matplotlib
Next, we import the required libraries:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
2. Load Data
We will use the classic Iris dataset. This dataset contains measurements of three different types of Iris flowers (Setosa, Versicolor, and Virginica), with 50 samples of each type, and each sample has four features: sepal length, sepal width, petal length, and petal width.
iris = datasets.load_iris()
X = iris.data
y = iris.target
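If you want to confirm this structure yourself, the loaded dataset object also exposes the feature and class names; a quick, optional sanity check:
print(iris.feature_names)  # the four measurement columns
print(iris.target_names)   # ['setosa' 'versicolor' 'virginica']
print(X.shape, y.shape)    # (150, 4) and (150,)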
3. Data Preprocessing
KNN is a distance-based algorithm, so features with larger numeric ranges would dominate the distance calculation; standardizing the data usually helps the model perform better. Note that the scaler should be fit on the training data only, otherwise information from the test set would leak into preprocessing. We therefore create the scaler here and apply it right after splitting the data in the next step.
scaler = StandardScaler()
4. Split Dataset
We will split the data into training and testing sets to evaluate the model’s performance, then fit the scaler on the training set and apply the same transformation to the test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
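The Iris classes are perfectly balanced, but as a general habit you can pass stratify=y so that both splits keep the same class proportions. An optional variant of the split line above:
# Optional: stratified split keeps the class proportions identical in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)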
5. Build KNN Model
Now, we will build a KNN classifier. To find the best K value, we can try multiple K values and see how they perform.
k_values = list(range(1, 31))
accuracies = []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    accuracies.append(accuracy_score(y_test, y_pred))
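If you are curious which training points drive a given prediction, the classifier’s kneighbors method returns the distances and indices of the K closest neighbors. A small illustrative sketch for the first test sample, using K=5:
knn5 = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
distances, indices = knn5.kneighbors(X_test[:1])
print(distances)         # distances to the 5 nearest training points
print(y_train[indices])  # their labels; the majority label becomes the prediction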
6. Select Best K Value
We can plot the model’s accuracy for each value of K and look for the best one.
plt.figure(figsize=(10, 6))
plt.plot(k_values, accuracies, marker='o')
plt.title('Accuracy vs. Number of Neighbors (K)')
plt.xlabel('Number of Neighbors (K)')
plt.ylabel('Accuracy')
plt.grid(True)
plt.show()
7. Model Evaluation
Using the best K value found above (np.argmax picks the first K with the highest accuracy in case of ties), we will retrain the model and evaluate its performance.
best_k = k_values[np.argmax(accuracies)]
knn_best = KNeighborsClassifier(n_neighbors=best_k)
knn_best.fit(X_train, y_train)
y_pred_best = knn_best.predict(X_test)
print(f"Best K value: {best_k}")
print("Classification Report:")
print(classification_report(y_test, y_pred_best))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_best))
print("Accuracy Score:")
print(accuracy_score(y_test, y_pred_best))
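Finally, to classify a brand-new flower, remember to transform it with the same fitted scaler before calling predict. A sketch with hypothetical measurements (these happen to match a typical Setosa):
# Hypothetical measurements: sepal length, sepal width, petal length, petal width (cm)
new_flower = np.array([[5.1, 3.5, 1.4, 0.2]])
new_flower_scaled = scaler.transform(new_flower)
prediction = knn_best.predict(new_flower_scaled)
print(iris.target_names[prediction[0]])  # expected: 'setosa'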
8. Conclusion
Through this hands-on tutorial, we have not only learned how to implement the KNN algorithm using the scikit-learn library, but also walked through a complete classification task on the Iris dataset. We hope this tutorial helps you better understand and apply the KNN algorithm!
Tips:
- When selecting the K value, watch out for overfitting and underfitting: a smaller K value may lead to overfitting, while a larger K value may lead to underfitting.
- Using cross-validation can help select the K value more robustly, as shown in the sketch below.
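Following the second tip, here is a minimal sketch of choosing K with 5-fold cross-validation on the (already scaled) training set, using scikit-learn’s cross_val_score:
from sklearn.model_selection import cross_val_score

cv_scores = []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy across 5 folds of the training set
    cv_scores.append(cross_val_score(knn, X_train, y_train, cv=5).mean())
best_k_cv = k_values[int(np.argmax(cv_scores))]
print(f"Best K by 5-fold cross-validation: {best_k_cv}")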
If you enjoyed this tutorial, please follow our public account for more exciting content on machine learning!
We hope this practical tutorial is helpful to you! If you have any questions or want to know more details, feel free to leave a comment for discussion.