In this issue, I will show you how to use K-means clustering correctly. The article is analyzed and described from a practical perspective; after reading it, I hope you gain something.
The first lesson in clustering algorithms is often K-means, because it is simple and efficient. This article focuses on a few points beginners need to pay attention to when using K-means clustering.
1. Input data generally requires scaling, such as standardization or normalization. The reason is simple: K-means is based on a distance measure, so if the scales of different variables differ too much, a few large-scale variables will dominate the distance and drown out the rest, as the sketch below illustrates.
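A minimal sketch of scaling before clustering with sklearn's StandardScaler; the toy data and parameter values here are made up for illustration:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Toy data: the second column's scale is roughly 1000x the first, so without
# scaling it would dominate the Euclidean distance
X = np.array([[1.0, 2000.0],
              [2.0, 3000.0],
              [1.5, 1000.0],
              [8.0, 9000.0]])

X_scaled = StandardScaler().fit_transform(X)  # zero mean, unit variance per column
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)
print(labels)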
2. If the input data mixes variable types, some numerical and some categorical, special treatment is required. Method 1 converts categorical variables to numerical ones, but the drawback is that one-hot encoding can greatly inflate the data dimensionality, while label encoding imposes a spurious order on the categories. Method 2 handles numerical and categorical variables separately and combines the results; for details, see Python implementations such as K-modes and K-prototypes [1].
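A hedged sketch of the second approach using the kmodes package [1] (pip install kmodes); the toy data and parameters here are illustrative, not from the original article:

import numpy as np
from kmodes.kprototypes import KPrototypes

# Columns: age (numerical), income (numerical), city (categorical)
X = np.array([[25, 3000, 'beijing'],
              [47, 9000, 'shanghai'],
              [31, 4500, 'beijing'],
              [52, 8800, 'shenzhen']], dtype=object)

kp = KPrototypes(n_clusters=2, init='Cao', random_state=0)
# `categorical` lists the indices of the categorical columns
labels = kp.fit_predict(X, categorical=[2])
print(labels)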
3. The output is not fixed and may vary from run to run. First of all, realize that K-means is stochastic: the path from initialization to convergence often differs between runs. One idea is to pin the randomness, for example by setting random_state in sklearn to a fixed value, as the sketch below shows. The other view is that if your K-means results vary greatly, for example the number of points per cluster changes drastically across runs, then K-means does not fit your data, and you should not try to force the results to be stable [2].
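A minimal sketch of pinning the randomness in sklearn; the dataset and parameter values are made up for illustration:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=5, random_state=0)
# With a fixed random_state, repeated runs on the same data give identical labels
km = KMeans(n_clusters=5, n_init=10, random_state=42)
print(km.fit_predict(X)[:10])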
I personally prefer the latter view. K-means is easy to understand, but its performance is only average; if multiple runs give unstable results, I do not recommend using it. I ran a simple experiment, clustering the same data five times with K-means:
import numpy as np
from sklearn.cluster import MiniBatchKMeans

km = MiniBatchKMeans(n_clusters=5)
for i in range(5):
    labels = km.fit_predict(seg_df_norm)
    # proportion of points in each cluster, as a percentage
    label_dist = np.bincount(labels) / seg_df_norm.shape[0] * 100
    print(label_dist)
The proportions of points in each cluster, printed below, vary greatly across the runs. Keep in mind that the cluster order can be random, but the proportions themselves should be relatively stable; so K-means does not fit this dataset.
[ 24.6071   5.414   25.4877  26.7451  17.7461]
[ 54.3728  19.0836   0.1314  26.3133   0.0988]
[ 12.9951  52.5879   4.6576  15.6268  14.1325]
[ 19.4527  44.2054   7.5121  24.9078   3.9221]
[ 21.3046  49.9233   2.1886  15.2255  11.358 ]
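Since cluster IDs are arbitrary, a fairer version of this check sorts each run's proportions before comparing them. A sketch of that check, with make_blobs standing in for the article's seg_df_norm (which is not shown in the post):

import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=2000, centers=5, random_state=1)
props = []
for i in range(5):
    labels = MiniBatchKMeans(n_clusters=5, n_init=3).fit_predict(X)
    p = np.bincount(labels, minlength=5) / len(labels) * 100
    props.append(np.sort(p))  # sort so arbitrary label order does not matter

# Per-position standard deviation across runs; large values mean unstable clusters
print(np.array(props).std(axis=0))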
4. Running time can often be improved by choosing the right library. Most current K-means implementations use k-means++ initialization and are reasonably fast, but when the data volume is very large, variants such as MiniBatchKMeans [3] can still help: millions of points can often be clustered in seconds. sklearn's implementation is recommended; a rough timing sketch follows.
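An illustrative timing sketch comparing the two on one million synthetic points; the data sizes are assumptions and actual runtimes depend on hardware:

import time
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1_000_000, n_features=10, centers=5, random_state=0)

for Algo in (KMeans, MiniBatchKMeans):
    start = time.perf_counter()
    Algo(n_clusters=5, n_init=3, random_state=0).fit(X)
    print(Algo.__name__, f"{time.perf_counter() - start:.1f}s")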
5. Effectiveness is limited on high-dimensional data. Algorithms based on distance metrics generally share this problem: in high-dimensional space the meaning of distance changes, and not every dimension is informative. In such cases K-means results tend to be poor, while subspace clustering methods may work better.
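One common mitigation (a simpler cousin of true subspace clustering) is to reduce dimensionality first, for example with PCA, and cluster in the reduced space. A minimal sketch with made-up dimensions:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 200))  # illustrative 200-dimensional data

# Project to 10 dimensions, then cluster in the reduced space
pipe = make_pipeline(PCA(n_components=10),
                     KMeans(n_clusters=5, n_init=10, random_state=0))
labels = pipe.fit_predict(X)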
6. There is a trade-off between efficiency and clustering quality. When the data volume rises beyond a certain point, say more than 100,000 rows, many algorithms simply become unusable. I recently read an interesting article comparing how different algorithms scale with data volume [4]; on the author's dataset, only K-means and HDBSCAN remained practical once the data exceeded a certain size.
The author of [4] also provides a comparison figure. In his experiments, once the data exceeded 100,000 points, most algorithms took long enough that they had to run overnight. In my own experiments, I also found that many algorithms (Spectral Clustering, Agglomerative Clustering, etc.) run into memory errors:
File "C:\Users\xyz\AppData\Local\Continuum\anaconda3\lib\site-packages\scipy\spatial\distance.py", line 1652, in pdist dm = np.empty((m * (m - 1)) // 2, dtype=np.double)MemoryError
It is therefore easy to see that the biggest advantages of K-means are that it runs fast, handles large amounts of data, and is easy to understand. Its shortcomings are equally obvious: clustering quality is limited, and it may not be the best option in high dimensions.
A rougher rule of thumb: when the data volume is small, try other algorithms first; when the data volume is large, try HDBSCAN; fall back on K-means only when the data is large and neither its dimensionality nor its size can be reduced.
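A hedged sketch of the HDBSCAN option, using the hdbscan package (pip install hdbscan; scikit-learn 1.3+ also ships sklearn.cluster.HDBSCAN). The min_cluster_size value is a guess that should be tuned per dataset:

import hdbscan
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=10_000, centers=5, random_state=0)
clusterer = hdbscan.HDBSCAN(min_cluster_size=50)
labels = clusterer.fit_predict(X)  # label -1 marks points treated as noise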
One important warning sign: if the results of multiple K-means runs vary greatly, there is a high probability that K-means does not fit the current data, and the results should be treated with caution.
That is all I have to share on how to use K-means clustering correctly. If you have similar doubts, you may find the above analysis a useful reference.