How to use non-hierarchical clustering (k-means)

This article introduces non-hierarchical clustering with k-means. Working through real cases often raises practical difficulties, so this article walks through how to handle them. I hope you read carefully and come away with something useful!
Non-hierarchical clustering
Non-hierarchical clustering groups a set of objects into a preset number of groups so that objects within each group are as similar to one another as possible; the number of groups is fixed before the analysis. The method therefore requires a preset structure: assuming there are k clusters, all objects are first divided arbitrarily into k groups, and the assignment is then refined through repeated reassignment iterations until an optimal grouping is reached, as sketched below.
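As a rough illustration of this idea, here is a minimal sketch in R on hypothetical toy data (the data, the variable names, and the fixed iteration count are assumptions for illustration, not part of the original analysis): start from an arbitrary division into k groups, then alternate between recomputing group means and reassigning each object to its nearest mean.

# A minimal sketch of iterative reassignment on hypothetical toy data;
# assumes no group empties out during the iterations.
set.seed(1)
x = matrix(rnorm(60 * 2), ncol = 2)          # 60 objects, 2 variables
k = 3
grp = sample(1:k, nrow(x), replace = TRUE)   # arbitrary initial division
for (it in 1:20) {
    # the group means serve as the current cluster centers
    centers = t(sapply(1:k, function(g) colMeans(x[grp == g, , drop = FALSE])))
    # squared Euclidean distance of every object to every center
    d2 = sapply(1:k, function(g) rowSums(sweep(x, 2, centers[g, ])^2))
    # reassign each object to its nearest center
    grp = max.col(-d2)
}
table(grp)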
k-means partitioning
The k-means algorithm is an iterative linear clustering algorithm. Given the number of clusters, it randomly selects that many objects as the initial cluster centers, assigns every object to its nearest center until all objects have been divided, and then computes the value of the objective function for the current assignment:

$$J = \sum_{i=1}^{N} \sum_{k=1}^{K} r_{ik}\,\lVert x_i - u_k \rVert^2$$

where N is the total number of objects, K is the given number of clusters, r_{ik} equals 1 when sample x_i is assigned to cluster k and 0 otherwise, and u_k is the coordinate of cluster center k. After the first iteration, the mean of the coordinates of each cluster is taken as the next cluster center, which is where the name k-means comes from. The formula is the total within-group sum of squares across all clusters, so it reflects the within-group variance: the smaller this sum, the better the partition. k-means therefore iterates the above procedure to minimize the total within-group variance; the whole process builds a classification by identifying regions where objects occur at high density.
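As a sanity check on this formula, here is a minimal sketch in R on hypothetical random data (the data and parameter values are assumptions for illustration): the objective J recomputed by hand matches the total within-cluster sum of squares that kmeans() reports as tot.withinss.

# Hypothetical toy data, for illustration only: verify that the objective J
# equals the total within-cluster sum of squares returned by kmeans().
set.seed(123)
x = matrix(rnorm(100 * 2), ncol = 2)      # 100 objects, 2 variables
km = kmeans(x, centers = 3, nstart = 25)
# squared distance of each object to the center of its own cluster
J = sum((x - km$centers[km$cluster, ])^2)
all.equal(J, km$tot.withinss)             # should be TRUE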
Next we compare hierarchical clustering of the objects based on Euclidean distance with k-means clustering on the same dataset, with the number of clusters set to 5:

# read data
data = read.table(file = "otu_table.txt", header = TRUE, check.names = FALSE)
rownames(data) = data[, 1]
data = as.matrix(data[, -1])

# normalize the species data of each sample to its column total
# (i.e., convert to relative abundance, in percent)
library(vegan)
data = decostand(data, MARGIN = 2, "total") * 100
otu = t(data)

# hierarchical clustering
otu_dist = vegdist(otu, method = "euclidean", diag = TRUE, upper = TRUE, p = 2)
hcl = hclust(otu_dist, method = "ward.D")
cluster1 = cutree(hcl, 5)

# k-means clustering; centers is the preset number of clusters,
# nstart is the number of random initial configurations to try
kms = kmeans(otu, centers = 5, nstart = 100)
cluster2 = kms$cluster

# cross-tabulate the two cluster assignments
table(cluster1, cluster2)
The results are as follows:

As can be seen, the two clusterings do not agree: only two of the five clusters are exactly the same. In general, k-means is not well suited to clustering raw data that contains many zero values. Because k-means works directly on the raw data with Euclidean distance only, using another distance measure (Bray-Curtis, etc.) requires computing the distance matrix from the raw data, running a PCoA on it, and then performing k-means clustering on the extracted principal coordinates.
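A minimal sketch of that workflow, reusing the otu matrix from above (the number of retained principal coordinate axes is an arbitrary choice here, not from the original analysis):

# Bray-Curtis distance -> PCoA -> k-means on the principal coordinates
otu_bray = vegdist(otu, method = "bray")
pcoa = cmdscale(otu_bray, k = 10, eig = TRUE)   # PCoA via classical MDS; k = 10 axes is an assumption
kms_bray = kmeans(pcoa$points, centers = 5, nstart = 100)
kms_bray$cluster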
The kmeans function returns only one preset number of clusters at a time, but in general we do not know in advance how many clusters work best. The cascadeKM function of the vegan package can run the cluster analysis for a whole range of preset group numbers in one call:

# screen for the best number of clusters
multikms = cascadeKM(otu, inf.gr = 2, sup.gr = 22, iter = 100, criterion = "ssi")
plot(multikms, sortg = TRUE)
In the example above the number of clusters ranges from 2 to 22, and sortg = TRUE reorders the samples according to the clustering results. "ssi" is the simple structure index, which evaluates the quality of a clustering result; in general, the more clusters and the more complex the structure, the higher the SSI. Alternatively you can choose "calinski", the Calinski-Harabasz index, which generally decreases as the number of clusters grows; the larger its value, the better the clustering. The results of the analysis are as follows:
We used data from 66 samples. The color diagram on the left shows which group each object (i.e., each sample) belongs to at each level of the classification; the number of distinct colors in a row is the number of clusters at that level. On the right are the criterion statistics at the end of each classification (i.e., at the end of the iterations), represented here by SSI values. Note that, unlike hierarchical clustering, non-hierarchical clustering runs independently at each clustering level.
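The cascadeKM result can also be queried programmatically; a minimal sketch, assuming the $results row name and the "... groups" column labels that vegan uses for this object (verify the exact labels against your vegan version):

# criterion values for every number of clusters; the rows hold the SSE and
# the chosen criterion ("ssi" here) -- treat the exact labels as assumptions
multikms$results
which.max(multikms$results["ssi", ])   # column with the highest SSI

# extract the sample-to-group assignment for the chosen level, k = 5
groups_k5 = multikms$partition[, "5 groups"]
table(groups_k5)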
Generally, we hope to obtain a sufficiently large SSI value with a reasonably small number of clusters. From the results we can see that k = 5 is an ideal clustering result. This concludes the introduction to "how to use non-hierarchical clustering k-means". Thank you for reading; if you want to learn more related knowledge, keep an eye on this site, where more high-quality practical articles will follow!