k Nearest Neighbor (kNN) (Step by Step)

I’ve created these step-by-step machine learning algorithm implementations in Python for everyone who is new to the field and might be confused by the different steps.

Despite being one of the older Machine Learning Algorithms, kNN remains a very interesting and special one.

Although it has performance limitations, its intuitiveness is a big reason why it’s still one of the most common algorithms to start working on a dataset with.

It can also do both classification and regression, which is a huge bonus.

You can read about the invention history of k Nearest Neighbors or check out the series on the optimization parameters of kNN algorithms.

This Machine Learning tutorial walks through a simple kNN implementation and lays out an easy-to-follow, easy-to-digest, step-by-step explanation.

Estimated Time: 10 mins

Skill Level: Advanced

Course Provider: HolyPython.com

I’ve split the k-Nearest Neighbors implementation into two categories here: the preparation phase and the actual machine learning work.

Down the page I’ve also grouped the steps in a different way, so that similar steps sit together.

  1. Import the relevant Python libraries
  2. Import the data
  3. Read / clean / adjust the data (if needed)
  4. Create a train / test split
  5. Create the kNN model object
  6. Fit the model
  7. Predict
  8. Evaluate the accuracy
Let’s read more about each individual step and what’s achieved with each of them:

1 Import Libraries

pandas is useful for constructing dataframes, and scikit-learn is the ultimate library for simple machine learning operations and practicing machine learning.

2 Import the Data

We need a nice dataset that’s sensible to analyze with machine learning techniques, particularly k Nearest Neighbors in this case. Scikit-learn ships with some cool sample data as well.

3 Read the Data

Reading data is simple, but there can be important points such as: dealing with columns, headers, titles, constructing data frames, etc.

4 Split the Data

Even splitting data is made easy with scikit-learn; for this operation we will use the train_test_split function from sklearn.model_selection.

5 Create the Model

Machine Learning models can be created with a very simple and straightforward process using scikit-learn. In this case we will create a kNN classifier object from the KNeighborsClassifier class in scikit-learn’s neighbors module.

6 Fit the Model

Machine Learning models are generally fit to training data. This is the part where the training of the model takes place, and we will do the same for the k Nearest Neighbors model.

7 Predict

Once the model is ready, predictions can be made on the test part of the data. Furthermore, I enjoy predicting foreign values that are not in the initial dataset just to observe the outcomes the model creates. The .predict method is used for predictions.

8 Evaluation

Finally, scikit-learn’s metrics module is very useful for testing the accuracy of the model’s predictions. This part could be done manually as well, but the metrics module brings lots of functionality and simplicity to the table.

1- Importing the libraries (pandas and sklearn libraries)

First, the library imports:

  • pandas is imported for data frames
  • train_test_split from sklearn.model_selection makes splitting data for train and test purposes very easy and proper (imported as tts below)
  • sklearn.neighbors provides KNeighborsClassifier, the actual model for kNN classification (imported as knn below)
  • the datasets module of sklearn has great datasets, making it easy to experiment with AI & Machine Learning
  • metrics is great for evaluating the results we’ll get from k Nearest Neighbors
###Importing Libraries
import pandas as pd
from sklearn.model_selection import train_test_split as tts
from sklearn.neighbors import KNeighborsClassifier as knn
from sklearn import metrics
from sklearn import datasets

2- Importing the data (iris dataset)

It’s time to find some data to work with. For simplicity, I suggest using the datasets module that comes pre-included with scikit-learn. These datasets are great for practice and everything is already taken care of, so there won’t be complications such as missing values or invalid characters while you’re learning.

Let’s import the iris dataset which is perfect and readily available for everyone:

###Importing Dataset
iris = datasets.load_iris()

3- Reading the data (scikit-learn datasets and pandas DataFrame)

Data prep is an important step in Machine Learning.

The pandas DataFrame class is used to construct a data frame. Data frames are very useful when working with large datasets that have labeled columns.

A dataset usually provides a few common things: features, the values in each row that might be affecting the outcomes, and an outcome (also called a label or target), which may take one of several possible classes (multiclass).

We work on and with features to train and test a model that can predict the outcomes in the future when we only have the features. That’s why training/test data has outcomes included, but future predictions won’t necessarily have an outcome readily available.
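
For instance, in the iris dataset we’ll use here, the features are four flower measurements and the outcome is a numeric species code. A quick optional sketch to inspect them (assuming the iris object loaded in step 2):

###Inspecting features and outcomes
print(iris.data.shape)       # (150, 4): 150 rows, 4 features each
print(iris.feature_names)    # names of the four measurement features
print(iris.target.shape)     # (150,): one outcome per row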

###Constructing Data Frame (sl/sw = sepal length/width, pl/pw = petal length/width)
data = pd.DataFrame({"sl": iris.data[:,0], "sw": iris.data[:,1],
                     "pl": iris.data[:,2], "pw": iris.data[:,3], 'species': iris.target})

It’s a good idea to print the data frame we’ve just constructed to get a visual idea of its structure, data and values.
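
For example, pandas’ head method prints the first few rows (a small optional sketch):

###Previewing the Data Frame
print(data.head())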

4- Splitting the data (the train_test_split function)

This is another standard Machine Learning step.

We need to split data to get: 

  • training feature(s) and label(s) 
  • test feature(s) and label(s)
It’s a rather simple step thanks to scikit-learn’s train_test_split function (imported above as tts).

  • I named the variables X_tr, y_tr for training and X_ts, y_ts for testing. This is up to your taste or your circumstances.
  • X_tr, X_ts will be assigned a part of the features.
  • y_tr, y_ts will be assigned a part of the outcomes.
  • The split ratio can be set using the test_size parameter. This is an important parameter and something you should experiment with to get a better understanding; 1/3 or 30% are usually reasonable ratios.
  • The model then works on X_tr and y_tr for training.
  • Then we will test it on X_ts and y_ts to see how successful the model is.
###Splitting train/test data
X=data[['sl','sw','pl','pw']]
y=data["species"]
X_tr, X_ts, y_tr, y_ts = tts(X,y, test_size=30/100)
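
If you’d like the same split on every run while experimenting, train_test_split also accepts a random_state parameter; the 42 below is an arbitrary choice:

###Optional: reproducible split with a fixed random seed
X_tr, X_ts, y_tr, y_ts = tts(X, y, test_size=30/100, random_state=42)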

5- Creating the model (neighbors.KNeighborsClassifier)

Now we can create a kNN Classifier object and put machine learning to work using the training data.

This is also the step for hyperparameter optimization and tuning. For instance, n_neighbors is probably the most significant optimization parameter for kNN algorithms, and you can see it set to 5 neighbors in this example.

###Creating kNN Classifier Model
KNN = knn(n_neighbors=5)
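
If you’d like to experiment with n_neighbors yourself, here is a minimal optional sketch that compares a few values using scikit-learn’s cross_val_score (an extra import not used elsewhere in this tutorial):

###Optional: comparing a few n_neighbors values with cross-validation
from sklearn.model_selection import cross_val_score
for k in [1, 3, 5, 7, 9]:
    scores = cross_val_score(knn(n_neighbors=k), X, y, cv=5)
    print(k, round(scores.mean(), 3))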

6- Fitting the model (Training with features (X) and outcomes (y))

Training is done using the .fit method, with the training features and outcomes as arguments.

###Training the Model
KNN.fit(X_tr, y_tr)

7- Making predictions (.predict method)

Here comes the first prediction. It’s done on the test data, so we can compare the predictions with the actual outcome values and see the accuracy.

###Making Predictions
y_pr = KNN.predict(X_ts)

8- Evaluating results (scikit-learn metrics module)

Evaluation, validation and cross validation with other models are all important steps of training a Machine Learning Model.

It’s a piece of cake with the metrics module sklearn offers.

###Evaluating Prediction Accuracy
print("Accuracy:",metrics.accuracy_score(y_ts, y_pr))

Bonus: Predicting foreign data

And how to predict foreign data values with our trained model:

###Making Prediction with Foreign Data
print(KNN.predict([[1, 5, 3.5, 6]]))
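
The prediction comes back as a numeric class code; if you’d like the species name instead, the iris dataset’s target_names array can translate it (a small optional sketch):

###Optional: translating the numeric prediction to a species name
pred = KNN.predict([[1, 5, 3.5, 6]])
print(iris.target_names[pred][0])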

You can see the full code in one piece on this page: k Nearest Neighbor Simple Implementation