I’ve created these step-by-step machine learning algorithm implementations in Python for everyone who is new to the field and might be confused by the different steps.
kNN is not a new algorithm (its roots go back to the 1950s), but it is a very interesting and special Machine Learning algorithm.
Although it has performance limitations, it is highly intuitive, which is why it’s still one of the most common algorithms to start working on a dataset with.
It can also do both classification and regression, which is a real bonus.
You can read about the history of k Nearest Neighbor or check out the series on the optimization parameters of kNN algorithms.
This Machine Learning tutorial walks through a simple kNN implementation and tries to lay out a very simple, easy-to-follow, step-by-step explanation.
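Since the rest of this tutorial only covers classification, here is a minimal sketch of the regression side for comparison. It assumes scikit-learn’s `KNeighborsRegressor` and, purely for illustration, treats the petal width column of the iris data as a continuous target predicted from the other three measurements. That choice of target and the `random_state` value are my assumptions, not part of the tutorial.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

iris = datasets.load_iris()
X = iris.data[:, :3]   # sepal length, sepal width, petal length
y = iris.data[:, 3]    # petal width, used here as a continuous target (illustrative choice)

# Fixed random_state only so the sketch is repeatable
X_tr, X_ts, y_tr, y_ts = train_test_split(X, y, test_size=0.3, random_state=0)

# Each prediction is the average target value of the 5 nearest training rows
reg = KNeighborsRegressor(n_neighbors=5).fit(X_tr, y_tr)

r2 = reg.score(X_ts, y_ts)   # R^2 on the held-out rows
print("R^2 on the test split:", r2)
```

The workflow is the same as for the classifier; only the estimator and the kind of target change.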
10 mins
Advanced
Provided by HolyPython.com
I’ve split the k-Nearest Neighbor implementation into 2 categories here:
(Red font signifies the actual machine learning work and black font signifies the preparation phase.)
Further down the page I’ve also color coded the steps in a different way to group similar steps together.
pandas is useful for constructing data frames, and scikit-learn is a go-to library for simple machine learning operations and for practicing machine learning.
Reading data is simple, but there can be important points to handle, such as dealing with columns, headers and titles, and constructing data frames.
Machine Learning models can be created with a very simple and straightforward process using scikit-learn. In this case we will create a kNN classifier object from the KNeighborsClassifier class of the sklearn.neighbors module.
Once the model is ready, predictions can be made on the test part of the data. Furthermore, I enjoy predicting foreign values that are not in the initial dataset just to observe the outcomes the model creates. The .predict method is used for predictions.
We need a nice dataset that’s sensible to analyze with machine learning techniques, particularly k Nearest Neighbor in this case. Scikit-learn ships with some handy sample datasets as well.
Even splitting data is made easy with scikit-learn. For this operation we will use the train_test_split function from the sklearn.model_selection module.
Machine Learning models are generally fit on training data. This is the part where training of the model takes place, and we will do the same for the k-Nearest Neighbor model.
Finally, scikit-learn’s metrics module is very useful for testing the accuracy of the model’s predictions. This part could be done manually as well, but the metrics module brings lots of functionality and simplicity to the table.
First the import part for libraries:
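###Importing Libraries
###pandas is needed for pd.DataFrame; tts is the short alias used when splitting the data
import pandas as pd
from sklearn.model_selection import train_test_split as tts
from sklearn.neighbors import KNeighborsClassifier as knn
from sklearn import metrics
from sklearn import datasets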
It’s time to find some data to work with. For simplicity, I suggest using the datasets module that comes with scikit-learn. Its datasets are great for practice and everything is already taken care of, so there won’t be complications such as missing values or invalid characters while you’re learning.
Let’s import the iris dataset, which is a good fit and readily available for everyone:
###Importing Dataset
iris = datasets.load_iris()
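Before building anything on top of it, it can help to look at what `load_iris()` actually returns. The snippet below is a small sketch that only inspects attributes of the returned object; nothing in it is required for the rest of the tutorial.

```python
from sklearn import datasets

iris = datasets.load_iris()

# The returned object is a dictionary-like Bunch with arrays and metadata
print(iris.data.shape)      # (number of rows, number of features)
print(iris.feature_names)   # names of the four measurement columns
print(iris.target_names)    # species names; iris.target holds integer codes into this array
print(iris.target[:5])      # integer labels of the first five rows
```

The integer codes in `iris.target` are what ends up in the species column of the data frame below.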
Data prep is an important step in Machine Learning.
The pandas DataFrame class is used to construct a data frame. Data frames are very useful when working with large datasets that have differently named columns.
Data usually provides a few common things: feature(s), which are values about each row that might affect the outcome, and an outcome (also called a label or value), or sometimes multiple outcomes (multi-output).
We work on and with features to train and test a model that can predict the outcomes in future when we only have the features. That’s why training/test data has outcomes included but future predictions won’t necessarily have an outcome readily available.
###Constructing Data Frame
data = pd.DataFrame({"sl":iris.data[:,0], "sw":iris.data[:,1], "pl":iris.data[:,2], "pw":iris.data[:,3], 'species': iris.target})
It’s a good idea to print the data frame that we’ve just constructed to have a visual idea about the structure, data and values.
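A minimal sketch of that inspection step is shown here. It rebuilds the same data frame so that it runs on its own; `head`, `shape` and `value_counts` are standard pandas calls.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
data = pd.DataFrame({
    "sl": iris.data[:, 0],   # sepal length
    "sw": iris.data[:, 1],   # sepal width
    "pl": iris.data[:, 2],   # petal length
    "pw": iris.data[:, 3],   # petal width
    "species": iris.target,  # integer class label
})

print(data.head())                      # first five rows
print(data.shape)                       # (rows, columns)
print(data["species"].value_counts())   # number of rows per class
```

Checking the class counts at this point tells you whether the dataset is balanced, which matters when you read the accuracy score later.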
This is another standard Machine Learning step.
We need to split the data to get a training set for fitting the model and a test set for evaluating it.
###Splitting train/test data
X=data[['sl','sw','pl','pw']]
y=data["species"]
X_tr, X_ts, y_tr, y_ts = tts(X,y, test_size=30/100)
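The split above is random, so the accuracy you get will vary from run to run. If you want repeatable results, one option is sketched below. It assumes the `random_state` and `stratify` arguments of `train_test_split`; the seed value 42 is arbitrary, and stratifying is my suggestion rather than something the tutorial requires.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)

# random_state fixes the shuffle; stratify=y keeps class proportions equal in both parts
X_tr, X_ts, y_tr, y_ts = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(X_tr.shape, X_ts.shape)   # sizes of the training and test parts
print(np.bincount(y_ts))        # number of test rows per class
```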
Now we can create a kNN Classifier object and put machine learning to work using the training data.
This is also the step for hyperparameter optimization and tuning. For instance, n_neighbors is probably the most significant optimization parameter for kNN algorithms, and you can see it being set to 5 in this example.
###Creating kNN Classifier Model
KNN = knn(n_neighbors=5)
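The value 5 is a reasonable starting point, not a rule. One common way to compare candidate values is cross-validation. The sketch below assumes scikit-learn’s `cross_val_score` with 5 folds; the list of candidate k values is an arbitrary choice for illustration.

```python
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = datasets.load_iris(return_X_y=True)

scores_by_k = {}
for k in (1, 3, 5, 7, 9, 11):   # candidate values, chosen arbitrarily
    model = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 5 cross-validation folds
    scores_by_k[k] = cross_val_score(model, X, y, cv=5).mean()
    print(f"k={k:2d}  mean accuracy={scores_by_k[k]:.3f}")

best_k = max(scores_by_k, key=scores_by_k.get)
print("Best k among the candidates:", best_k)
```

On a dataset this small the differences between neighboring k values tend to be slight, so treat the winner as a hint rather than a final answer.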
Training using the .fit method.
###Training the Model
KNN.fit(X_tr, y_tr)
Here comes the first prediction. It’s done on the test data. So, we can compare with actual outcome values and see the accuracy.
###Making Predictions
y_pr = KNN.predict(X_ts)
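For intuition about what a kNN prediction involves, here is a minimal from-scratch sketch using NumPy. It is not scikit-learn’s implementation: the function name `knn_predict`, the toy data and the use of plain Euclidean distance with a simple majority vote are all my assumptions for illustration.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=5):
    """Predict the label of x_new by majority vote among its k nearest training rows."""
    # Euclidean distance from x_new to every training row
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Count the labels of those neighbors and return the most frequent one
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Toy data: two well-separated clusters (hypothetical values)
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, np.array([1.1, 1.0]), k=3))   # point near the first cluster
print(knn_predict(X_train, y_train, np.array([5.1, 5.0]), k=3))   # point near the second cluster
```

The sketch shows why kNN can be slow at prediction time on large datasets: every new point is compared against all training rows.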
Evaluation, validation and comparison against other models are all important steps in building a Machine Learning model.
It’s a piece of cake with the metrics module that sklearn offers.
###Evaluating Prediction Accuracy
print("Accuracy:",metrics.accuracy_score(y_ts, y_pr))
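A single accuracy number hides which classes get confused with each other. As a possible next step, the sketch below uses `confusion_matrix` and `classification_report` from the same metrics module. It retrains a model so that it runs on its own; the `random_state` value and the stratified split are my assumptions.

```python
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X_tr, X_ts, y_tr, y_ts = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0, stratify=iris.target
)

model = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
y_pr = model.predict(X_ts)

# Rows are actual classes, columns are predicted classes
cm = metrics.confusion_matrix(y_ts, y_pr)
print(cm)

# Per-class precision, recall and F1 score
print(metrics.classification_report(y_ts, y_pr, target_names=iris.target_names))
```

The diagonal of the confusion matrix counts correct predictions, so its sum divided by the total equals the accuracy score.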
And here is how to make a prediction for foreign data values with our trained model:
###Making Prediction with Foreign Data
###Values must be in the same order as the training columns: sl, sw, pl, pw
print(KNN.predict([[1, 5, 3.5, 6]]))
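The prediction comes back as an integer code. The sketch below shows one way to turn it into a species name and to see how the neighbors voted, using `predict_proba`. Wrapping the new values in a data frame with the same column names is my suggestion, since recent scikit-learn versions may warn when a model fit on a data frame receives a plain list. For brevity the sketch fits on the full dataset rather than on the training split.

```python
import pandas as pd
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
cols = ["sl", "sw", "pl", "pw"]
X = pd.DataFrame(iris.data, columns=cols)

model = KNeighborsClassifier(n_neighbors=5).fit(X, iris.target)

# Same foreign values as above, with matching column names
new_flower = pd.DataFrame([[1, 5, 3.5, 6]], columns=cols)

code = model.predict(new_flower)[0]      # integer class code
print(code, iris.target_names[code])     # code and the matching species name

proba = model.predict_proba(new_flower)  # share of the 5 neighbors voting for each class
print(proba)
```

Keep in mind that kNN always returns some class, even for values far outside the range seen in training, so a prediction for unusual input like this deserves extra caution.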
You can see the full code in one piece on this page: k Nearest Neighbor Simple Implementation