k-nearest Neighbor

Pros & Cons

k Nearest Neighbor Pros


1- Simplicity

kNN is probably the simplest Machine Learning algorithm, and it might also be the easiest to understand. In a sense it's even simpler than Naive Bayes, because Naive Bayes still comes with a mathematical formula.

So, if you're totally new to technical fields, or if your audience requires a very simple explanation, kNN might be the perfect place to start.

2- Non-parametric

Non-parametric means kNN doesn't make assumptions regarding the dataset. If you don't know much about the dataset initially, this feature can be a lifesaver.

Also, as new values are added, kNN adjusts automatically based on the n_neighbors parameter you have provided.
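As a minimal sketch (assuming scikit-learn, whose KNeighborsClassifier exposes the n_neighbors parameter mentioned above; the toy points are made up for illustration):

```python
from sklearn.neighbors import KNeighborsClassifier

# Two obvious clusters; kNN needs no assumptions about their distribution.
X = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
y = [0, 0, 0, 1, 1, 1]

# n_neighbors is the main knob: how many nearby points get a vote.
clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(clf.predict([[1.5, 1.5], [8.5, 8.5]]))  # -> [0 1]
```

Adding more training points and refitting is all it takes to "update" the model, since kNN simply memorizes the data.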

Making no assumptions can mean discovering hidden relations in your data, which can lead to a whole new perspective or surprising results. That's usually good, depending on the surprise.

You can refer to this page to read more about k-Nearest Neighbor optimization parameters.

3- Great Sidekick

Due to its comprehensible nature, many people love to use kNN as a warm-up tool. It's perfect to test the waters with or make a simple prediction.

k Nearest Neighbor can also be used to create input for another machine learning algorithm, or to process the results of one.

Finally, kNN's uniqueness offers great value as a cross-check. It's a model that's sensitive to outliers and complex features, which makes it a great candidate to challenge the output of other machine learning algorithms such as Naive Bayes or Support Vector Machines.
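A hedged sketch of the "input to another algorithm" idea, using scikit-learn (the dataset and model choices below are illustrative, not prescribed by the article): kNN's predicted probabilities can be appended as an extra feature for a second model, a simple form of stacking.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data just for demonstration
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)

# Append kNN's class-1 probability as an extra column for the next model
X_tr2 = np.column_stack([X_tr, knn.predict_proba(X_tr)[:, 1]])
X_te2 = np.column_stack([X_te, knn.predict_proba(X_te)[:, 1]])

stacked = LogisticRegression().fit(X_tr2, y_tr)
print(round(stacked.score(X_te2, y_te), 2))
```

In practice you would generate the kNN feature with out-of-fold predictions to avoid leakage; this sketch keeps things minimal.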

4- Very Sensitive

If you want to explore features with complex relations, or if your data has outliers that you'd like to keep in consideration, kNN can do a great job in this sense.

Especially with the comfort and simplicity of adjusting the n_neighbors parameter, everything becomes intuitive and practical.

5- Versatility

kNN is a great tool for classification but it can be used for regression as well.

Paired with its other strengths such as intuitiveness, simplicity, practicality and accuracy, it's definitely handy to be able to use kNN for regression every now and then.
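A minimal regression sketch, assuming scikit-learn's KNeighborsRegressor (the toy data is made up): the prediction is simply the mean of the k nearest targets.

```python
from sklearn.neighbors import KNeighborsRegressor

# Toy 1-D data where the target equals the input
X = [[0], [1], [2], [3], [4], [5]]
y = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]

reg = KNeighborsRegressor(n_neighbors=2).fit(X, y)
# The two nearest training points to 2.5 are x=2 and x=3,
# so the prediction is their mean target:
print(reg.predict([[2.5]]))  # -> [2.5]
```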

6- Non-Linear Performance

Another versatile trait of k Nearest Neighbor is how well it performs in non-linear situations. Even when other non-linear options are available (SVM with non-linear kernels comes to mind), kNN's simplicity makes it a straightforward option to try first.

No wonder it's common to see professionals apply kNN first to get a sense or different view of data.
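As a quick sketch of that non-linear ability (the dataset choice is illustrative): scikit-learn's two-moons data has a curved class boundary that defeats a linear model, yet plain kNN separates it well.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Two interleaving half-circles: a classic non-linear boundary
X, y = make_moons(n_samples=400, noise=0.15, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=7).fit(X_tr, y_tr)
print(round(knn.score(X_te, y_te), 2))  # typically well above 0.9
```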



k Nearest Neighbor Cons


1- Costly Computation

Unfortunately, k Nearest Neighbor is a hungry machine learning algorithm, since it has to calculate the distance to every other point in the dataset for every single prediction.

This doesn't mean it's completely unusable; it just falls out of favor and becomes impractical when you enter the world of big data or similar applications. Something to keep in mind with this otherwise likable algorithm.
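The cost can be sketched directly with NumPy (array sizes scaled down from the numbers discussed later in the article, so the sketch runs quickly): predicting a single point with brute-force search means touching every training row.

```python
import numpy as np

# Scaled-down illustration: 50k rows, 100 features
rng = np.random.default_rng(0)
X_train = rng.random((50_000, 100))
query = rng.random(100)

# One query already touches all 50,000 x 100 values:
# a subtraction and a square per feature, a sum and a root per row.
dists = np.sqrt(((X_train - query) ** 2).sum(axis=1))
nearest = np.argsort(dists)[:5]  # indices of the 5 nearest neighbors
print(nearest)
```

Tree- or graph-based neighbor search can reduce this in low dimensions, but the memorize-everything nature of kNN remains.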

2- RAM Monster

It's not just the CPU that takes a hit with k Nearest Neighbor; RAM also gets occupied while this little monster works. kNN stores the entire training set in RAM, and while you might not notice it with small implementations, try working on a large dataset and memory quickly becomes a bottleneck.

3- Significant Parameters

Although kNN has few parameters to tune, this can trick the analyst. The k parameter (the number of neighbors) and the parameter controlling how distance is calculated can make a huge difference in the outcomes.

Luckily, it's extremely easy and straightforward to play with these parameters and experiment with the way they affect the results. The real risk is not being aware of the fact that they will make a huge impact.
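A tiny sketch of that impact, assuming scikit-learn's KNeighborsClassifier (the toy points are made up for illustration): the very same query point can get a different label just by changing k.

```python
from sklearn.neighbors import KNeighborsClassifier

# A class-1 point sits close to the query, but class 0 dominates
# slightly farther away, so the choice of k flips the label.
X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5], [2, 2]]
y = [0, 0, 0, 1, 1, 1, 1]

for k in (1, 3, 5):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    print(k, clf.predict([[2.5, 2.5]])[0])  # label flips between k=1 and k=3
```

The `metric` and `weights` parameters can shift results in a similar way, so it pays to experiment with all of them.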

4- Small Dimensions Only

If you want to work on datasets with many features, this can be problematic with kNN.

Let's say you have 1 million rows with 100 features. With a 30/70 test/training split, classifying a single test point means computing its distance to each of the 700,000 training rows, and each of those distances takes 100 subtractions, 100 squares, and 1 square root. That gives an idea of why kNN gets difficult with large datasets and high feature counts.

5- Equal Treatment

Equal treatment is almost always good, but here it can be a drawback.

Since kNN is non-parametric and it doesn't make any assumptions, this means all the attributes will be treated as equally important for the results.

This is simply not always appropriate, and if you need to navigate around noise in your data, kNN may not be suitable.

6- Handling Missing Values

kNN can't handle data with missing values unless you apply a process called imputation. This means missing values in your data will be filled in with substitute values such as averages, ones, zeros, etc.

This can be a tedious extra task and it can also introduce wrong bias to the data.

Luckily, there are readily available tools to impute data in a practical way, such as scikit-learn's sklearn.impute.KNNImputer, and dealing with missing data is usually just a reality of Data Science.
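A minimal sketch of that imputer in action (scikit-learn's sklearn.impute.KNNImputer, with made-up toy data): each missing value is filled using the average of the nearest complete rows.

```python
import numpy as np
from sklearn.impute import KNNImputer

# One missing value in the first column
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [np.nan, 6.0],
              [8.0, 8.0]])

imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled)  # the NaN is replaced by the mean of its 2 nearest rows
```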


k Nearest Neighbor Pros & Cons Summary

Why k Nearest Neighbor?

So intuitive and easy to comprehend.

kNN is a go-to Machine Learning algorithm for many people, not because it's extremely competent but because it's so practical. It's like the person you sometimes favor simply because they're family.

Everybody can understand or explain how kNN works in a couple of minutes and the results it gives are usually surprisingly accurate.

Its obvious shortcomings are that it takes up computation resources and that it isn't suitable for too many features or very large datasets. These might cause problems in some industrial applications, but for many cases kNN will do just fine.

Very Intuitive

When you discover the way kNN functions, everything makes sense and its logic is easy to follow. Not that other machine learning algorithms are that hard to understand, but kNN is readily understandable regardless of one's background.


Great for discovering hidden patterns or working with unstructured data


It might not always be the most accurate algorithm, but kNN is usually quite accurate.

Very Sensitive

You get to include outliers and anomalies in your analysis.

Large Data Problems

kNN will struggle with large datasets, especially if the data has high dimensionality.

Very Sensitive

Although this is rather an advantage, it can be a problem if you'd like to take outliers and noisy data into consideration. If you're looking for an insensitive algorithm in this sense you might want to look into Naive Bayes Classifier.

Missing Data

kNN doesn't handle missing data as well as Naive Bayes does. If you have too much missing data in your dataset, this can be a significant problem for kNN.