Post WWII Momentum

Genius of European Mathematicians


Independent Work in 1950s

K-Means is probably the most popular unsupervised algorithm used for clustering tasks today.

But how was K-Means invented? What’s the history behind this intuitive and useful clustering algorithm? We compiled a few facts about who invented the K-Means algorithm and when.

K-Means was originally discovered by Polish mathematician Hugo Steinhaus in 1956. He introduced his algorithmic clustering approach in the French-language paper “Sur la Division des Corps Matériels en Parties”.

In 1967, James MacQueen introduced the term K-Means in his article “Some Methods for Classification and Analysis of Multivariate Observations”, published by the University of California. Because MacQueen was the first person to use the name K-Means, he is sometimes also credited as the founder of the K-Means algorithm.

It’s more accurate to say that the algorithm was found independently by multiple researchers sometime in the 1950s, but the published record points to Hugo Steinhaus as the earliest documented source of the K-Means algorithm.

Cool fact: Hugo Steinhaus also made major contributions to functional analysis, including the Banach-Steinhaus theorem. He also played a key role in the reconstruction of mathematics in Poland after WWII.

Regardless of its original roots, K-Means has been studied extensively and, like most algorithms, has spawned a number of variants.

Continued Contributions

David Arthur and Sergei Vassilvitskii introduced the k-means++ seeding method in 2007. According to their paper, it improves both speed and accuracy, often dramatically.
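To make the k-means++ idea concrete, here is a minimal sketch of its seeding step in plain Python. It picks the first center uniformly at random, then picks each further center with probability proportional to its squared distance from the nearest center already chosen. The function names `sq_dist` and `kmeans_pp_seeds` and the fixed random seed are our own illustrative choices, not taken from the paper, and the sketch assumes the data contains at least k distinct points.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_seeds(points, k, rng=None):
    """Pick k initial centers using k-means++ style seeding.

    Assumes `points` holds at least k distinct points; otherwise every
    weight can become zero and random.choices raises ValueError.
    """
    rng = rng or random.Random(0)  # fixed seed only to keep the sketch repeatable
    # First center: chosen uniformly at random from the data.
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Weight of each point: squared distance to its nearest chosen center.
        weights = [min(sq_dist(p, c) for c in centers) for p in points]
        # Next center: sampled with probability proportional to that weight,
        # so far-away points are favored and already-chosen points get weight 0.
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
seeds = kmeans_pp_seeds(points, 2)
print(seeds)
```

The returned seeds are then handed to the usual K-Means iterations as starting centroids; spreading them out this way is what tends to reduce the number of iterations and the risk of a poor final clustering.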

“Clustering Data Streams” by Sudipto Guha, Adam Meyerson, Nina Mishra, Rajeev Motwani, and Liadan O’Callaghan, a Stanford paper from 2003, was a further contribution that extensively studied related algorithms such as K-Medoid, K-Median, K-Center and K-box.

K-Means uses centroids to cluster data
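As a rough illustration of what "using centroids" means, the sketch below alternates the two classic steps: assign every point to its nearest centroid, then move each centroid to the mean of its assigned points. It is a simplified plain-Python version under our own assumptions (the names `kmeans` and `sq_dist`, the iteration cap, and keeping an empty cluster's centroid in place are illustrative choices), not a reference implementation.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, centroids, iterations=10):
    """Run up to `iterations` rounds of assignment and centroid update."""
    centroids = [tuple(c) for c in centroids]
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members)))
            else:
                # Assumption: an empty cluster simply keeps its old centroid.
                new_centroids.append(old)
        if new_centroids == centroids:
            break  # centroids stopped moving, so the assignment is stable
        centroids = new_centroids
    return centroids, clusters


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centroids, clusters = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(centroids)  # [(0.0, 1.0), (10.0, 11.0)]
```

In this toy example the two starting centroids each capture one pair of nearby points, and after one update they settle at the midpoint of each pair.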

Summary

K-Means is a very useful clustering algorithm and one of the most popular. It’s relatively fast and scales well, so it can also be used to cluster large datasets.

In this tutorial we have learned about the history of the K-Means algorithm. If you are curious about other supervised and unsupervised machine learning algorithms, you can check our main ML page.