K-Means Optimization & Parameters
K-Means Optimization Parameters Explained
These are the most commonly adjusted K-Means parameters. Let’s see how they can be useful in tuning and optimization.
n_clusters: (default: 8) By far the most commonly adjusted K-Means parameter, n_clusters defines the number of clusters that will be generated.
You can settle on a value through trial and error: run K-Means a few times, starting from the default (8) or from an estimate based on the raw dataset, and adjust the cluster count until the results look right.
n_clusters might take some expertise to get right and this expertise might mean machine learning expertise as well as domain expertise. In either case, K-Means will provide useful insight even if you don’t get n_clusters exactly right.
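One common way to structure that trial and error is to fit K-Means for a range of cluster counts and compare the resulting inertia. The sketch below is only an illustration: the synthetic dataset from make_blobs, the range of 1 to 8 clusters and the fixed random_state are assumptions, not part of the original text.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 well-separated groups (an assumption for illustration).
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

# Fit K-Means for several candidate cluster counts and record the inertia.
inertias = {}
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_

for k, value in inertias.items():
    print(f"n_clusters={k}: inertia={value:.1f}")
```

Inertia generally keeps falling as n_clusters grows, so the value to look for is the point where the drop flattens out rather than the minimum itself.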
n_init: (default: 10; 'auto' in scikit-learn 1.4 and later) Another significant parameter, n_init defines how many times the algorithm is run with different initial centroids. The run with the lowest inertia is kept as the final result.
Initialization of centroids is an important concept for the K-Means algorithm, because a poor starting position can leave the model stuck in a suboptimal solution. If results vary between runs or look poor, this value can be increased so that more initializations are tried. A value of 10 usually produces good results without compromising too much computational efficiency.
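To see the effect of n_init, a single initialization can be compared with the best of several. This is a minimal sketch under assumptions: the dataset is synthetic, and init="random" is chosen here only because it makes single runs more sensitive to their starting centroids than the default k-means++ initialization.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 6 groups (an assumption for illustration).
X, _ = make_blobs(n_samples=500, centers=6, cluster_std=1.0, random_state=1)

# One initialization versus the best of twenty, both from the same seed.
single = KMeans(n_clusters=6, init="random", n_init=1, random_state=0).fit(X)
multi = KMeans(n_clusters=6, init="random", n_init=20, random_state=0).fit(X)

print(f"n_init=1:  inertia={single.inertia_:.1f}")
print(f"n_init=20: inertia={multi.inertia_:.1f}")
```

The run with more initializations typically reports an inertia that is equal to or lower than the single run, at roughly n_init times the fitting cost.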
max_iter: (default: 300) Also significant, max_iter is the maximum number of iterations the K-Means algorithm will make in a single run before returning its results. The default of 300 is more than needed in most cases, so it includes a safety margin.
Commonly you will observe stable clusters and center points forming well before 100 or even 50 iterations. But since K-Means is usually a fast clustering algorithm, it doesn’t hurt much to leave the maximum at 300 to be on the safe side. You can check how well a fitted model did through its inertia_ attribute, and how many iterations it actually ran through its n_iter_ attribute.
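As a quick check of how much of the iteration budget is actually used, the fitted model's n_iter_ and inertia_ attributes can be inspected. The dataset and parameter values below are illustrative assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 well-separated groups (an assumption for illustration).
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

km = KMeans(n_clusters=4, n_init=10, max_iter=300, random_state=0).fit(X)

# n_iter_ is the number of iterations used by the best run; inertia_ is the
# sum of squared distances from each sample to its closest centroid.
print(f"iterations used: {km.n_iter_} (limit: {km.max_iter})")
print(f"inertia: {km.inertia_:.1f}")
```

If n_iter_ is far below max_iter, the limit is not what ends the run, and lowering it will have little effect on runtime.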
The max_iter value directly affects the runtime and scalability of K-Means implementations, since K-Means time complexity is commonly given as O(N*K*T), where N is the number of samples, K is the number of clusters and T is the number of iterations.
If scalability is a real concern and the task involves big data, max_iter can be decreased to make K-Means run faster. Similarly, if the dataset requires more iterations to converge, max_iter can be increased to a higher value such as 500.
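To make the O(N*K*T) relationship concrete, the worst-case amount of work can be estimated with simple arithmetic. The helper below is hypothetical and deliberately rough: it ignores the number of features and n_init, and the sample and cluster counts are made-up figures.

```python
def estimated_distance_computations(n_samples, n_clusters, n_iterations):
    """Rough N * K * T count; ignores feature dimension and n_init."""
    return n_samples * n_clusters * n_iterations

# Worst case for one million samples and 8 clusters at two iteration limits.
full = estimated_distance_computations(1_000_000, 8, 300)
capped = estimated_distance_computations(1_000_000, 8, 50)

print(full)           # 2400000000
print(capped)         # 400000000
print(full / capped)  # 6.0
```

These figures are upper bounds, because a run stops early once it converges. Lowering max_iter only saves time on runs that would otherwise go past the new limit.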
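from sklearn.cluster import KMeans
KM = KMeans(n_clusters=4)  # generate 4 clusters
KM = KMeans(n_init=20)  # try 20 centroid initializations
KM = KMeans(max_iter=500)  # allow up to 500 iterations per run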
More K-Means Optimization Parameters for fine tuning
Beyond the basics, these parameters can be used for further fine tuning, to avoid performance and size inefficiencies as well as suboptimal results: