
What is the hierarchical clustering method in R?


This article introduces hierarchical clustering in R. The explanation is detailed but easy to follow, and the method itself is simple and quick to apply, so it should serve as a useful reference. I hope you gain something from reading it; let's take a look.

A Survey of clustering algorithms

Cluster analysis (clustering) divides a group of objects into different clusters according to their characteristics, so that objects in the same cluster are, in some sense, more similar to one another than objects in different clusters.

Because there is no single, precise definition of a "cluster", clustering algorithms based on different models have been developed. The most widely used families are the following:

Connectivity-based clustering, i.e. the hierarchical clustering discussed in this article. Its core idea is to cluster objects according to the distances between them: two objects that are close to each other are more likely to belong to the same cluster than two objects that are far apart.

Centroid-based clustering: in this family of algorithms, each cluster maintains a center point (centroid), which does not necessarily belong to the data set. When the number of clusters k is specified in advance, the algorithm solves an optimization problem whose objective is the sum of squared distances between every object and the centroid of the cluster it is assigned to, and whose optimization variables are the centroids and the assignment of each object to a cluster. This problem has been proved NP-hard, but iterative algorithms can find approximate solutions; k-means is the best known of these (a minimal R sketch follows this list).

Distribution-based clustering: these algorithms assume the data are sampled from a mixture of probability distributions, so objects that are likely to come from the same component distribution are grouped into the same cluster. The most commonly used example is Gaussian mixture model (GMM) clustering.

Density-based clustering: these algorithms treat high-density regions as clusters, separated by low-density regions, and regard points in the low-density regions as noise. Commonly used density-based algorithms are DBSCAN and OPTICS.
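As a minimal illustration of the centroid-based family, the sketch below runs base R's kmeans() on a small synthetic data set; the data and the choice of k = 2 are assumptions made purely for illustration, not part of the original article.

# Synthetic 2-D data: two loose groups (illustrative assumption)
set.seed(1)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 3), ncol = 2))

# Centroid-based clustering with base R's kmeans();
# k = 2 is chosen only because the toy data has two groups
fit <- kmeans(x, centers = 2, nstart = 10)

fit$centers       # the two centroids
fit$cluster       # cluster assignment of each point
fit$tot.withinss  # sum of squared distances to the assigned centroids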

A Summary of hierarchical clustering

Hierarchical clustering (hierarchical clustering) organizes the data set into clusters layer by layer, with the clusters at each layer generated from the results of the previous layer. Hierarchical clustering algorithms are generally divided into two categories (an R sketch of both strategies follows the list):

Agglomerative hierarchical clustering, also known as bottom-up hierarchical clustering: each object starts as its own cluster, and at each step the two closest clusters (according to some criterion) are merged into a new cluster, and so on, until all objects belong to a single cluster. This article focuses mainly on this kind of algorithm.

Divisive hierarchical clustering, also known as top-down hierarchical clustering: at the beginning all objects belong to one cluster, and at each step a cluster is split into several clusters according to some criterion, and so on, until every object is its own cluster.
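As a rough sketch of the two strategies in R: base R's hclust() performs agglomerative clustering, and diana() from the cluster package performs divisive clustering. The data, the use of complete linkage, and the reliance on the cluster package are assumptions for illustration only; the article itself names no R functions.

set.seed(2)
x <- matrix(rnorm(20), ncol = 2)   # 10 illustrative points in 2-D
d <- dist(x)                       # pairwise Euclidean distances

# Agglomerative (bottom-up): base R
hc_agg <- hclust(d, method = "complete")

# Divisive (top-down): assumes the 'cluster' package is available
library(cluster)
hc_div <- diana(d)

par(mfrow = c(1, 2))
plot(hc_agg, main = "Agglomerative")
plot(as.hclust(hc_div), main = "Divisive")   # convert to an hclust tree for plotting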

(The original article includes a figure here that intuitively shows the idea of hierarchical clustering and the similarities and differences between these two strategies.)

It should also be pointed out that hierarchical clustering is a greedy algorithm, because each merge or split is based on a locally optimal choice.

Tree diagram

A tree diagram (dendrogram) can be used to represent the result of hierarchical clustering directly. The figure in the original article shows a dendrogram for five points, where the vertical axis (height) represents the distance between clusters (the criteria for measuring this distance can vary; see the definitions later in this article). Reading the dendrogram: x1 and x2 are the closest pair (distance 1), so they are merged into the cluster (x1, x2); in the dendrogram, the nodes x1 and x2 are connected first as children of a new node (x1, x2), and the height of that node is set to 1. Next, the two closest clusters among the remaining four clusters (x1, x2), x3, x4 and x5 are selected. The distance between x4 and x5 is the smallest (2), so x4 and x5 are merged into the cluster (x4, x5); correspondingly, the nodes x4 and x5 become children of a new node (x4, x5) whose height is set to 2, and so on. The dendrogram is built in this way until only a single cluster (((x1, x2), x3), (x4, x5)) remains.

It is intuitively clear that if we want a clustering in which the distance between any two clusters is no greater than h, we only need to cut the dendrogram with a horizontal line at height h, and the corresponding clustering result is obtained. For example, in the dendrogram described above, cutting at h = 2.5 yields three clusters: (x1, x2), x3 and (x4, x5).
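The five-point example above can be reproduced approximately in R with hclust(), plot() and cutree(); the coordinates below are assumptions chosen so that the first merge heights are 1 and 2 as in the description, not values taken from the original article.

# Five illustrative 1-D points (assumed coordinates)
x <- matrix(c(0, 1, 4.5, 9, 11), ncol = 1,
            dimnames = list(paste0("x", 1:5), NULL))

# Single-link agglomerative clustering on Euclidean distances
hc <- hclust(dist(x), method = "single")

plot(hc)             # draws the dendrogram; merge heights are on the vertical axis
cutree(hc, h = 2.5)  # cut at height 2.5 -> clusters {x1,x2}, {x3}, {x4,x5}
cutree(hc, k = 3)    # equivalently, ask for exactly 3 clusters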

Measurement of distance between objects

There are many ways to measure the distance between two objects. For numerical data, the commonly used distance measures are the Euclidean distance, the Manhattan distance, the Chebyshev distance, the Minkowski distance and so on. For two objects in d-dimensional space, x = [x_1, x_2, \ldots, x_d]^T and y = [y_1, y_2, \ldots, y_d]^T, the distances under the different criteria are given in the following table:

Euclidean distance: d(x, y) = \left[\sum_{j=1}^{d} (x_j - y_j)^2\right]^{1/2} = \left[(x - y)^T (x - y)\right]^{1/2}
Manhattan distance: d(x, y) = \sum_{j=1}^{d} |x_j - y_j|
Chebyshev distance: d(x, y) = \max_{1 \le j \le d} |x_j - y_j|
Minkowski distance: d(x, y) = \left[\sum_{j=1}^{d} |x_j - y_j|^p\right]^{1/p}, \quad p \ge 1

The Minkowski distance is the L_p norm of x - y (p ≥ 1); the Manhattan, Euclidean and Chebyshev distances correspond to the cases p = 1, p = 2 and p = ∞, respectively.
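In R these distances are all available through the base dist() function; the two example vectors below are assumptions for illustration.

x <- c(1, 2, 3)
y <- c(4, 0, 3)
m <- rbind(x, y)

dist(m, method = "euclidean")         # sqrt((1-4)^2 + (2-0)^2 + 0^2) = sqrt(13)
dist(m, method = "manhattan")         # |1-4| + |2-0| + 0 = 5
dist(m, method = "maximum")           # Chebyshev: max(3, 2, 0) = 3
dist(m, method = "minkowski", p = 3)  # (3^3 + 2^3 + 0^3)^(1/3)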

Another commonly used distance is the Mahalanobis distance, which is defined as follows:

d_{mah}(x, y) = \sqrt{(x - y)^T \Sigma^{-1} (x - y)}

where \Sigma is the covariance matrix of the data set. To define the covariance matrix, first fix some notation: suppose the data set is X = (x_1, x_2, \ldots, x_n) \in \mathbb{R}^{d \times n}, where x_i \in \mathbb{R}^d is the i-th sample point, each sample point has dimension d, and the total number of sample points is n. Further assume that the mean of the sample points, m_x = \frac{1}{n}\sum_{i=1}^{n} x_i, is the zero vector (if it is not, we can always subtract the mean from each sample to obtain a data set that satisfies this). Then the covariance matrix \Sigma \in \mathbb{R}^{d \times d} can be defined as

\Sigma = \frac{1}{n} X X^T

The Mahalanobis distance differs from the Minkowski distances: the latter measure the absolute distance between two objects and are unaffected by the overall distribution of the data set, whereas the Mahalanobis distance measures how different two objects are relative to the whole data set. The Mahalanobis distance between two objects with a large absolute distance in a widely scattered data set may well be smaller than that between two objects with a small absolute distance in a densely distributed data set. In more detail, the Mahalanobis distance can be computed as follows: first perform principal component analysis on the data to extract uncorrelated features, then project the objects onto these features to obtain new representations, and finally compute a weighted Euclidean distance between these new representations, where the weight of each feature is inversely proportional to that feature's variance in the data set. Because of these decorrelation and normalization steps, the Mahalanobis distance is invariant under any invertible linear transformation of the data.
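Base R provides mahalanobis(), which returns the squared Mahalanobis distance. A small sketch on assumed data (note that cov() uses the unbiased 1/(n-1) estimate rather than the 1/n convention above, which differs only by a constant factor):

set.seed(3)
X <- matrix(rnorm(200), ncol = 2)  # 100 illustrative points in R^2
S <- cov(X)                        # sample covariance matrix of the data set

x <- X[1, ]
y <- X[2, ]

# mahalanobis() returns the *squared* distance, so take the square root
d_mah <- sqrt(mahalanobis(x, center = y, cov = S))
d_euc <- sqrt(sum((x - y)^2))      # Euclidean distance, for comparison
c(mahalanobis = d_mah, euclidean = d_euc)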

Distance measurement between clusters

In addition to measuring the distance between objects, a hierarchical clustering algorithm also needs to measure the distance between clusters. Common between-cluster measures are the single-link method, the complete-link method, the UPGMA (Unweighted Pair Group Method using arithmetic Averages) method, the WPGMA (Weighted Pair Group Method using arithmetic Averages) method, the centroid method (also known as UPGMC, Unweighted Pair Group Method using Centroids), the median method (also known as WPGMC, Weighted Pair Group Method using Centroids) and the Ward method. The first four methods are graph-based, because in these methods a cluster is represented by its sample points or by sub-clusters (the distances between these sample points or sub-clusters are recorded and can be thought of as the edges of a graph); the last three methods are geometry-based (so the distance between objects is usually the Euclidean distance), because they all use a center point to represent a cluster. Assuming C_i and C_j are two clusters, the distances between C_i and C_j under the single-link, complete-link and UPGMA methods are given in the following table (WPGMA is discussed below):

Single-link: D(C_i, C_j) = \min_{x \in C_i,\, y \in C_j} d(x, y)
Complete-link: D(C_i, C_j) = \max_{x \in C_i,\, y \in C_j} d(x, y)
UPGMA: D(C_i, C_j) = \frac{1}{|C_i|\,|C_j|} \sum_{x \in C_i} \sum_{y \in C_j} d(x, y)

Single-link defines the distance between two clusters as the distance between their closest pair of objects. As a result, a chaining effect can occur during clustering: long, chain-shaped clusters may be formed. Complete-link, on the other hand, defines the distance between two clusters as the distance between their farthest pair of objects. This avoids the chaining effect, but it is very sensitive to outliers (noise points that do not fit the overall distribution of the data) and can easily produce unreasonable clusterings. UPGMA is a compromise between single-link and complete-link: it defines the distance between two clusters as the average distance between pairs of objects drawn from the two clusters. WPGMA instead computes a weighted average of these pairwise distances; the purpose of the weighting is to give the two clusters equal influence on the distance, regardless of their sizes. (The explicit formula is not given here because, when the hierarchical clustering algorithm actually runs, the distance between the newly merged cluster and each remaining cluster is not recomputed from the sample points but is updated from the existing cluster distances via the Lance-Williams formula, given in the next section.) An R sketch comparing these linkage choices follows.
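The article names no R functions, but in base R's hclust() these graph-based linkages correspond, following the hclust() documentation's naming, to method = "single", "complete", "average" (UPGMA) and "mcquitty" (WPGMA). A small comparison on assumed data:

set.seed(4)
x <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
           matrix(rnorm(20, mean = 4), ncol = 2))  # illustrative data
d <- dist(x)                                       # Euclidean object distances

hc_single   <- hclust(d, method = "single")    # single-link
hc_complete <- hclust(d, method = "complete")  # complete-link
hc_upgma    <- hclust(d, method = "average")   # UPGMA
hc_wpgma    <- hclust(d, method = "mcquitty")  # WPGMA

# Compare the 2-cluster cuts produced by two of the linkages
table(cutree(hc_single, k = 2), cutree(hc_complete, k = 2))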

The centroid (UPGMC) method computes a centroid for each cluster, and the distance between two clusters is the distance between the corresponding centroids. It can be computed as follows:

D_{UPGMC}(C_i, C_j) = \frac{1}{|C_i|\,|C_j|} \sum_{x \in C_i,\, y \in C_j} d^2(x, y) - \frac{1}{2|C_i|^2} \sum_{x, x' \in C_i} d^2(x, x') - \frac{1}{2|C_j|^2} \sum_{y, y' \in C_j} d^2(y, y') = \|\mu_i - \mu_j\|^2

where \mu_i = \frac{1}{|C_i|} \sum_{x \in C_i} x is the centroid of C_i and d is the Euclidean distance.
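For completeness, hclust() also accepts method = "centroid" (UPGMC) and method = "median" (WPGMC). A commonly cited caveat, assumed in the sketch below, is that these two methods should be given squared Euclidean dissimilarities; the data are again an illustrative assumption.

set.seed(5)
x <- matrix(rnorm(30), ncol = 2)  # 15 illustrative points
d2 <- dist(x)^2                   # squared Euclidean dissimilarities (see caveat above)

hc_centroid <- hclust(d2, method = "centroid")  # UPGMC
hc_median   <- hclust(d2, method = "median")    # WPGMC / median method

plot(hc_centroid)  # centroid linkage can produce inversions (non-monotone merge heights)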
