An introduction to the clustering method of R language

2025-05-03 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/02 Report--

This article introduces clustering methods in the R language: hierarchical clustering, dynamic (k-means) clustering, and DBSCAN.

I. Hierarchical clustering

1) Distance and similarity coefficients

R uses dist(x, method = "euclidean", diag = FALSE, upper = FALSE, p = 2) to calculate distances, where x is a sample matrix or data frame and method indicates which distance to calculate. The values of method are:

"euclidean": Euclidean distance (sum the squared differences, then take the square root).

"maximum": Chebyshev distance.

"manhattan": absolute (Manhattan) distance.

"canberra": Lance (Canberra) distance.

"minkowski": Minkowski distance; specify the value of p when using it.

"binary": distance for qualitative (binary) variables.

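Qualitative variable distance: among m items, let m0 be the number of 0:0 pairs, m1 the number of 1:1 pairs, and m2 the number of unmatched pairs (0:1 or 1:0). The distance is the share of unmatched pairs among the pairs where at least one value is 1: distance = m2 / (m1 + m2). The 0:0 pairs are ignored.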

When diag is TRUE, the distances on the diagonal are printed. When upper is TRUE, the values in the upper triangle of the matrix are printed as well.
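As a rough illustration of the method values, the sketch below applies dist to a two-row matrix whose distances are easy to check by hand. The matrices x and b and the result names are made up for this example and do not come from the original article.

```r
# Two points in the plane; the coordinate differences are 3 and 4.
x <- rbind(a = c(0, 0), b = c(3, 4))

d_euc <- dist(x, method = "euclidean")         # sqrt(3^2 + 4^2)
d_max <- dist(x, method = "maximum")           # max(3, 4)
d_man <- dist(x, method = "manhattan")         # 3 + 4
d_min <- dist(x, method = "minkowski", p = 3)  # (3^3 + 4^3)^(1/3)

# Binary distance: one 1:1 pair (m1 = 1) and two unmatched pairs (m2 = 2).
b <- rbind(c(1, 1, 0, 0), c(1, 0, 1, 0))
d_bin <- dist(b, method = "binary")            # m2 / (m1 + m2)

# diag and upper only change how the result is printed.
print(dist(x, diag = TRUE, upper = TRUE))
```

A dist object stores only the lower triangle, so as.matrix(d_euc) is the usual way to get a full square matrix when one is needed.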

R uses scale(x, center = TRUE, scale = TRUE) to center and standardize a data matrix.

For example, to center only: scale(x, scale = FALSE)
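A minimal sketch of the two uses of scale, on a small made-up matrix (the values and variable names are assumptions chosen only for illustration):

```r
x <- matrix(c(1, 2, 3, 10, 20, 60), ncol = 2)

# Centering only: subtract each column's mean.
centered <- scale(x, center = TRUE, scale = FALSE)

# Centering and standardizing: also divide by each column's standard deviation.
standardized <- scale(x)

print(colMeans(centered))          # column means after centering
print(apply(standardized, 2, sd))  # column standard deviations after standardizing
```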

R uses sweep(x, MARGIN, STATS, FUN = "-", ...) to operate on a matrix. A MARGIN of 1 means the operation is applied along rows, and 2 means along columns. STATS is the operand, for example a vector with one value per column. FUN is the operation, and the default is subtraction. sweep can be used, for example, to apply a range standardization to a matrix x: subtract each column's mean, then divide each column by its range.
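The range standardization can be sketched with two calls to sweep. The matrix below is made up for illustration, and the names center, R and x_star are assumptions of this example.

```r
x <- matrix(c(1, 2, 3, 10, 20, 60), ncol = 2)

# Subtract the column means (MARGIN = 2 works along columns; FUN defaults to "-").
center <- sweep(x, 2, apply(x, 2, mean))

# Range of each column: maximum minus minimum.
R <- apply(x, 2, max) - apply(x, 2, min)

# Divide each centered column by its range.
x_star <- sweep(center, 2, R, "/")
print(x_star)
```

After this transformation every column has mean 0 and a range (maximum minus minimum) of 1.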

Sometimes we classify variables instead of samples. In that case we calculate a similarity coefficient between variables rather than a distance. The commonly used ones are the cosine of the angle between variables and the correlation coefficient.

In R, the cosine of the angle between two vectors x and y can be computed directly as sum(x * y) / sqrt(sum(x^2) * sum(y^2)).

The correlation coefficient is computed with the cor function.
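Both similarity coefficients can be sketched as follows. The vectors are made up, and the matrix form (normalizing the columns and then using crossprod) is one possible way to get all pairwise cosines at once, not necessarily the approach the original article used.

```r
x <- c(1, 2, 3)
y <- c(2, 4, 7)

# Cosine of the angle between x and y.
cos_xy <- sum(x * y) / sqrt(sum(x^2) * sum(y^2))

# For a matrix, the cosines between all pairs of columns at once:
# divide each column by its Euclidean length, then take cross-products.
m <- cbind(x, y)
m_unit <- sweep(m, 2, sqrt(colSums(m^2)), "/")
C <- crossprod(m_unit)

# Pearson correlation matrix between the columns.
r <- cor(m)

print(C)
print(r)
```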

2) Hierarchical clustering method

Hierarchical clustering first calculates the distances between the samples and merges the nearest points into one class. It then calculates the distances between classes and merges the nearest classes into a larger class, and keeps merging until everything forms a single class. The methods for calculating the distance between classes include the shortest distance method, the longest distance method, the middle distance method, the class average method, and so on. For example, the shortest distance method defines the distance between two classes as the minimum distance between their samples.

R uses hclust(d, method = "complete", members = NULL) for hierarchical clustering,

where d is the distance matrix.

method specifies how classes are merged, and includes:

"single": shortest distance method

"complete": longest distance method

"median": middle distance method

"mcquitty": McQuitty similarity method

"average": class average method

"centroid": centroid method

"ward.D" and "ward.D2": Ward's sum of squared deviations method (called "ward" in R versions before 3.1.0)

You can then use rect.hclust(tree, k = NULL, which = NULL, x = NULL, h = NULL, border = 2, cluster = NULL) to mark the classes on the dendrogram. tree is the object returned by hclust. k is the number of classes and h is the threshold on the distance (height) between classes; specify exactly one of the two. border is the color used to draw the rectangles around the classes.
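The whole workflow can be sketched on the built-in iris data. The choice of dataset, the standardization step, and k = 3 are assumptions of this example. cutree is not mentioned in the text above; it is added here because it returns the class labels that rect.hclust only draws.

```r
# Standardize the four numeric iris columns, then compute Euclidean distances.
x <- scale(iris[, 1:4])
d <- dist(x, method = "euclidean")

# Complete linkage (the longest distance method) is the default.
hc <- hclust(d, method = "complete")

# Draw the dendrogram and box three classes.
plot(hc, labels = FALSE)
rect.hclust(hc, k = 3, border = 2)

# cutree returns the class label of each sample for the same k.
groups <- cutree(hc, k = 3)
print(table(groups))
```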

II. Dynamic clustering: k-means

In hierarchical clustering, a sample is never reassigned once it has been merged into a class. Also, the larger the dataset, the more memory the method takes up.

Dynamic clustering first picks a few points as initial centers and gathers the surrounding points around them. It then computes the center of gravity (mean) of each class, takes those means as the new centers, and repeats the process until the classification converges.

In R, kmeans(x, centers, iter.max = 10, nstart = 1, algorithm = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen")) is mainly used for this kind of clustering. centers is either the number of classes or a set of initial class centers. iter.max is the maximum number of iterations. nstart is the number of random sets of initial centers to try when centers is a number. algorithm selects the algorithm; the default is the first one, "Hartigan-Wong".

K-means clustering analysis of the iris data, using kmeans (part of the stats package that ships with R)

Back up the dataset as newiris, set the column newiris$Species to NULL, and use this dataset as the test dataset

Run k-means clustering analysis on the dataset newiris, and save the clustering results in kc. In the kmeans function, set the number of clusters to generate to 3
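These two steps can be sketched as below. The call to set.seed and its value are additions of this example, used only so that the random initial centers are reproducible.

```r
# Copy iris and drop the species labels so only the measurements remain.
newiris <- iris
newiris$Species <- NULL

# Arbitrary seed (not from the original text) for reproducible initial centers.
set.seed(1)

# Ask for 3 clusters and keep the result in kc.
kc <- kmeans(newiris, centers = 3)
print(kc)
```

Because the initial centers are random, the cluster numbering and, occasionally, the clusters themselves can differ between runs; a larger nstart makes the result more stable.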

The printed result has the following parts:

Cluster means: the mean of each column (variable) within each cluster

Clustering vector: the cluster to which each row belongs (1 for the first cluster, 2 for the second cluster, 3 for the third cluster)

Within cluster sum of squares by cluster: the sum of squared distances within each cluster

(between_SS / total_SS = 88.4%): the between-cluster sum of squares accounts for 88.4% of the total sum of squares, that is to say, most of the variation lies between clusters and the clusters are well separated.

Available components: the components of the object returned by the kmeans function

"cluster" is an integer vector giving the cluster to which each record belongs.

"centers" is a matrix giving the center (mean) of each variable in each cluster.

"totss" is the total sum of squares.

"withinss" is the within-cluster sum of squares, one value per cluster.

"tot.withinss" is the total within-cluster sum of squares, that is, the sum of withinss.

"betweenss" is the between-cluster sum of squares.

"size" is the number of members in each cluster.

Create a contingency table and count how many flowers of each species fall into each of the three clusters.
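A minimal sketch of the contingency table, repeating the clustering so that the block runs on its own (the seed and the name tab are assumptions of this example):

```r
newiris <- iris[, 1:4]
set.seed(1)                        # arbitrary seed, for reproducibility only
kc <- kmeans(newiris, centers = 3)

# Rows are the true species, columns are the cluster numbers.
tab <- table(iris$Species, kc$cluster)
print(tab)
```

A species that is spread over more than one column has been split between clusters.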

Draw a scatter plot of the final clustering results. The data are the columns "Sepal.Length" and "Sepal.Width" of the dataset, and the colors are the default colors numbered 1, 2 and 3, one per cluster.

Mark the center of each cluster on the plot.
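The plot and the center markers can be sketched as follows. The clustering is repeated so that the block is self-contained, and the seed, symbol (pch = 8) and size (cex = 2) are illustrative choices.

```r
newiris <- iris[, 1:4]
set.seed(1)                        # arbitrary seed, for reproducibility only
kc <- kmeans(newiris, centers = 3)

# Scatter plot of two of the four variables, colored by cluster number.
plot(newiris[c("Sepal.Length", "Sepal.Width")], col = kc$cluster)

# Overlay the cluster centers as large stars in the matching colors.
points(kc$centers[, c("Sepal.Length", "Sepal.Width")],
       col = 1:3, pch = 8, cex = 2)
```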

III. DBSCAN

Dynamic clustering tends to produce roughly round or oval clusters. Density-based algorithms address this problem. The idea is to fix a distance radius and a minimum number of points within that radius, and then connect all points that are reachable from one another and assign them to the same class. In R, the parameters described next match the dbscan function in the fpc package: dbscan(data, eps, MinPts, scale, method, seeds, showplot, countmode).

Here eps is the distance radius and MinPts is the minimum number of points. scale indicates whether the data are standardized first. method has three values, raw, dist and hybrid: raw means the data are raw data and no distance matrix is computed; dist means the data are a distance matrix; hybrid means the data are raw data, but parts of the distance matrix are computed. showplot controls plotting: 0 draws nothing, 1 and 2 both draw plots. countmode can be given a vector to report the progress of the calculation. The iris data are a convenient dataset to try it on.
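A hedged sketch on the iris data, assuming the fpc package is installed (install.packages("fpc")). The values eps = 1.5 and MinPts = 5 are illustrative choices, not tuned settings, and the requireNamespace guard is only there so that the block does not fail when fpc is missing.

```r
# Use only the four numeric measurement columns.
newiris <- iris[1:4]

if (requireNamespace("fpc", quietly = TRUE)) {
  # scale = TRUE standardizes the data first; method = "raw" avoids
  # building a full distance matrix.
  model <- fpc::dbscan(newiris, eps = 1.5, MinPts = 5,
                       scale = TRUE, method = "raw")

  # Cluster 0 collects the points treated as noise.
  print(table(model$cluster, iris$Species))
}
```

Smaller eps or larger MinPts values make the algorithm stricter, which usually yields more noise points and more, smaller clusters.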

This concludes the introduction to clustering methods in the R language.
