K-Means Optimization & Parameters

K-Means Optimization Parameters Explained

These are the most commonly adjusted K-Means algorithm parameters. Let's see how they can be useful in tuning and optimization.

n_clusters: (default: 8) By far the most commonly adjusted K-Means parameter, n_clusters defines the number of clusters that will be generated.

You can arrive at a suitable value by trial and error: run K-Means a few times, starting from the default value (8) or from an estimate based on the raw dataset, and adjust the number of clusters until the results match what you need.

n_clusters might take some expertise to get right, and that can mean machine learning expertise as well as domain expertise. In either case, K-Means can still provide useful insight even if you don't get n_clusters exactly right.
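The trial-and-error approach described above can be sketched as a small loop that fits K-Means for several candidate values and compares the resulting inertia. This is only an illustration: the synthetic dataset from make_blobs, the range of candidate values, and the random_state settings are assumptions, not part of the original text.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 well-separated groups (assumed here purely for illustration).
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.6, random_state=42)

# Fit K-Means for several candidate cluster counts and record the inertia of each fit.
inertias = {}
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = km.inertia_

for k, value in inertias.items():
    print(f"n_clusters={k}: inertia={value:.1f}")
```

Inertia generally keeps falling as n_clusters grows, so the value to look for is the point where the improvement levels off rather than the lowest number.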

n_init: (default: 10; newer scikit-learn releases use 'auto' as the default) Another significant parameter, n_init defines how many times K-Means is run with different starting centroids. The run with the lowest inertia is kept as the final result.

Initialization of centroids is an important concept for the K-Means algorithm, because a poor starting point can lead to a suboptimal clustering. If the results look unstable or suboptimal, this value can be increased to make more attempts at finding good starting centroids. The value 10 usually produces good results without sacrificing too much computational efficiency.
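As a rough illustration of why several initialization attempts help, the sketch below compares a single randomly initialized run with the best of 20 runs. The dataset, the use of init="random", and the random_state values are assumptions chosen for the example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 6 groups (an assumption made for illustration).
X, _ = make_blobs(n_samples=600, centers=6, cluster_std=1.0, random_state=7)

# One attempt with random starting centroids.
single = KMeans(n_clusters=6, init="random", n_init=1, random_state=0).fit(X)

# Twenty attempts; scikit-learn keeps the run with the lowest inertia.
multi = KMeans(n_clusters=6, init="random", n_init=20, random_state=0).fit(X)

print(f"n_init=1  inertia: {single.inertia_:.1f}")
print(f"n_init=20 inertia: {multi.inertia_:.1f}")
```

The multi-start fit should end with an inertia no higher than the single run, at the cost of roughly 20 times the fitting work.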

max_iter: (default: 300) Also significant, max_iter is the maximum number of iterations the K-Means algorithm will perform in a single run before returning its results. In most cases 300 is more than needed, so the default includes a safety margin.

You will commonly see stable clusters and center points form well before 100 or even 50 iterations. But since K-Means is usually a fast clustering algorithm, it doesn't hurt much to keep a value like 300 for the maximum iterations to be on the safe side. You can check how well the iterations worked by inspecting the fitted model's inertia_ attribute, and the n_iter_ attribute shows how many iterations were actually run.

The max_iter value directly affects the runtime and scalability of K-Means, since its time complexity is commonly given as O(N*K*T), where N is the number of samples, K is the number of clusters and T is the number of iterations.

If scalability is a real concern and the task involves big data, decreasing max_iter can make K-Means run faster. Conversely, if the dataset needs more iterations to converge, max_iter can be raised to a higher value such as 500.
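To see how max_iter interacts with the iterations actually used, a fitted model's n_iter_ and inertia_ attributes can be compared across a few settings. The sketch below assumes a synthetic dataset and arbitrary max_iter values (1, 5 and 300); a single random initialization is used so that every fit starts from the same centroids.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data (an assumption made for illustration).
X, _ = make_blobs(n_samples=1000, centers=5, cluster_std=1.5, random_state=0)

# Maps each max_iter setting to (iterations actually run, final inertia).
results = {}
for max_iter in (1, 5, 300):
    km = KMeans(
        n_clusters=5, init="random", n_init=1,
        max_iter=max_iter, random_state=0,
    ).fit(X)
    results[max_iter] = (km.n_iter_, km.inertia_)

for max_iter, (n_iter, inertia) in results.items():
    print(f"max_iter={max_iter}: ran {n_iter} iterations, inertia={inertia:.1f}")
```

If n_iter_ stays well below max_iter, the limit is not the bottleneck and lowering it will not change the result.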

Examples:

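from sklearn.cluster import KMeans

KM = KMeans(n_clusters=4)  # generate 4 clusters
KM = KMeans(n_init=20)     # 20 runs with different starting centroids
KM = KMeans(max_iter=500)  # allow up to 500 iterations per run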

More parameters

More K-Means Optimization Parameters for fine tuning

Beyond the parameters above, the following can be used for finer tuning, to avoid performance and size inefficiencies as well as suboptimal algorithm results:

  • init
  • tol
  • verbose

init

(default: k-means++)

The init parameter defines the initialization algorithm for cluster centroids in K-Means implementations. k-means++ is a smart initialization algorithm that picks starting centroids that tend to lie far apart from each other, which usually speeds up convergence and improves the quality of the final clusters.

Alternatively, random can be assigned to the init parameter, which picks n_clusters observations from the data at random as the initial centroids. Finally, an array of shape (n_clusters, n_features) can be passed to init to define the starting centroid coordinates precisely in advanced K-Means implementations.
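The three initialization options can be sketched side by side as follows. The dataset and the way the explicit starting centroids are chosen (one sample taken from each generated group) are hypothetical choices made only to keep the example self-contained.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D data with 3 groups (an assumption made for illustration).
X, y = make_blobs(n_samples=300, centers=3, random_state=1)

# Option 1: k-means++ initialization.
km_pp = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=1).fit(X)

# Option 2: starting centroids drawn at random from the samples.
km_random = KMeans(n_clusters=3, init="random", n_init=10, random_state=1).fit(X)

# Option 3: explicit starting centroids, shape (n_clusters, n_features).
# Here one sample per generated group is used as a hypothetical starting guess.
start_centers = np.array([X[y == label][0] for label in range(3)])
km_fixed = KMeans(n_clusters=3, init=start_centers, n_init=1).fit(X)

for name, model in [("k-means++", km_pp), ("random", km_random), ("array", km_fixed)]:
    print(f"{name}: inertia={model.inertia_:.1f}, iterations={model.n_iter_}")
```

When an array is passed, n_init=1 is used because repeated runs from the same fixed starting centroids would give the same result.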

tol

(default: 1e-4)

The tol parameter sets the tolerance used to declare convergence, which is checked after each iteration.

In scikit-learn, tol is a relative tolerance on how far the cluster centers move between two consecutive iterations. If that movement falls below the tolerance level defined by tol, the K-Means model is considered converged and stops; otherwise iterations continue (up to max_iter) in search of a lower inertia.

Note: inertia is the sum of squared distances between each sample and its closest cluster center (centroid), so a lower inertia indicates tighter clusters.
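A loose and a tight tolerance can be compared by looking at how many iterations each fit runs before stopping. The sketch below is an illustration under assumptions: the synthetic dataset and the two tol values (1e-1 and 1e-6) are arbitrary, and both fits start from the same random centroids.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic, fairly overlapping data (an assumption made for illustration).
X, _ = make_blobs(n_samples=2000, centers=5, cluster_std=2.5, random_state=0)

# Same starting centroids for both fits; only the tolerance differs.
loose = KMeans(n_clusters=5, init="random", n_init=1, tol=1e-1, random_state=0).fit(X)
tight = KMeans(n_clusters=5, init="random", n_init=1, tol=1e-6, random_state=0).fit(X)

print(f"tol=1e-1: {loose.n_iter_} iterations, inertia={loose.inertia_:.1f}")
print(f"tol=1e-6: {tight.n_iter_} iterations, inertia={tight.inertia_:.1f}")
```

A larger tol lets the fit stop earlier, trading some precision in the final centroids for speed.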

verbose

(default: 0)

The verbose parameter controls how much information a machine learning algorithm prints while it is working. It takes an integer value and is 0 by default, meaning the K-Means model won't print anything about the process while it works.

Setting it to a positive value such as 1 makes the model print progress messages during fitting, which can be useful when debugging or monitoring the model's work.
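A minimal sketch of turning on progress output is shown below. The dataset is synthetic and assumed for illustration, and the exact messages printed depend on the installed scikit-learn version.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Small synthetic dataset (an assumption made for illustration).
X, _ = make_blobs(n_samples=200, centers=3, random_state=3)

# verbose=1 makes the fit print progress messages to standard output.
km = KMeans(n_clusters=3, n_init=1, verbose=1, random_state=3).fit(X)
print("iterations run:", km.n_iter_)
```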

Official scikit-learn documentation: sklearn.cluster.KMeans.