2025-01-18 Update From: SLTechnology News & Howtos
Shulou (Shulou.com) 06/02 Report
Many newcomers are unsure how to use Python for systematic cluster analysis. To help with that, this article explains the method in detail; readers who need it can follow along and, hopefully, come away with something useful.
In machine learning we often need to cluster data. Clustering, put simply, groups similar sample points together: points with high similarity end up in the same class, so the samples are divided into several categories. There are many approaches to cluster analysis, such as the decomposition method, the addition method, clustering of ordered samples, fuzzy clustering and systematic clustering. This article introduces the systematic clustering method and shows how to carry out a systematic cluster analysis with Python.
First, the definition. The systematic clustering method, also known as hierarchical clustering, is the most commonly used cluster analysis method. The basic steps are as follows: suppose there are n samples. First treat the n samples as n classes, one sample per class; then merge the two classes whose properties are closest into a new class, giving n−1 classes; then find the two closest classes among those and merge them, giving n−2 classes; and so on. Eventually all samples are merged into a single class. Drawing this process as a diagram gives the cluster dendrogram, from which the number of categories is determined. The approximate process is shown in figure 1.
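The merging procedure described above can be sketched naively. The following is an illustrative toy implementation (using single linkage and made-up sample points, not the library routine used later in this article): start with one class per sample, repeatedly merge the two closest classes, and record each merge.

```python
import numpy as np

def naive_agglomerative(points):
    """Merge the two closest clusters step by step (single linkage),
    recording each merge until one cluster remains.
    Educational sketch, not an efficient implementation."""
    clusters = [[i] for i in range(len(points))]  # start: one class per sample
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: distance of the closest pair of members
                d = min(np.linalg.norm(points[p] - points[q])
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        clusters = ([c for k, c in enumerate(clusters) if k not in (i, j)]
                    + [clusters[i] + clusters[j]])
    return merges

# Four made-up points: two tight pairs far apart.
pts = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
for left, right, dist in naive_agglomerative(pts):
    print(left, right, round(dist, 3))
```

With n = 4 points there are exactly n − 1 = 3 merges: the two tight pairs merge first, then the two resulting classes merge last.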
Figure 1. Schematic diagram of systematic cluster analysis
Before samples can be classified we must determine how similar they are. How is that similarity determined? The usual approach is to compute the distance between sample points and then group them according to that distance. Even when classifying by distance there are several choices, such as the shortest distance method, the longest distance method, the center of gravity method, the class average method and Ward's method. Here is a brief introduction to each.
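As a quick illustration of the distance step, SciPy's pdist computes all pairwise Euclidean distances at once (the three sample points below are made up for the example):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Three illustrative 2-D sample points (hypothetical values).
points = np.array([[0.0, 0.0], [3.0, 4.0], [0.0, 5.0]])

# pdist returns the condensed pairwise Euclidean distance vector;
# squareform expands it into a symmetric n x n distance matrix.
d = squareform(pdist(points))
print(d)
```

The matrix is symmetric with zeros on the diagonal; entry (i, j) is the Euclidean distance between points i and j.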
1. Shortest distance method
The shortest distance method takes, as the distance between two classes, the distance between their two closest sample points, as shown in figure 2: points 3 and 7 are the closest pair between class G1 and class G2. The calculation formula is shown in figure 4.
Figure 2. Schematic diagram of the shortest distance method
2. Longest distance method
The longest distance method takes, as the distance between two classes, the distance between their two farthest sample points, as shown in figure 3: points 1 and 6 are the farthest pair between class G1 and class G2. The calculation formula is shown in figure 4.
Figure 3. Schematic diagram of the longest distance method
3. Center of gravity method
From a physical point of view it is reasonable to represent a class by its center of gravity, i.e. the mean of the class's samples, and to take the distance between classes as the distance between their centers of gravity. Using Euclidean distance, suppose that at some step G1 and G2 are merged into G3, containing n1, n2 and n3 samples respectively with n3 = n1 + n2, and that their centers of gravity are X1, X2 and X3. Then X3 = (n1·X1 + n2·X2) / n3. The calculation formula of the center of gravity method is shown in figure 4.
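The centroid formula can be checked numerically. A minimal sketch with two made-up clusters, verifying that the weighted combination (n1·X1 + n2·X2) / n3 equals the centroid of the merged cluster:

```python
import numpy as np

# Hypothetical clusters G1 (n1 = 3 points) and G2 (n2 = 2 points).
G1 = np.array([[1.0, 2.0], [3.0, 0.0], [2.0, 4.0]])
G2 = np.array([[10.0, 10.0], [12.0, 8.0]])
n1, n2 = len(G1), len(G2)
n3 = n1 + n2

X1, X2 = G1.mean(axis=0), G2.mean(axis=0)  # centroids of G1 and G2
X3 = (n1 * X1 + n2 * X2) / n3              # weighted centroid formula

# The formula must agree with the centroid of the merged cluster.
merged = np.concatenate((G1, G2)).mean(axis=0)
print(X3, merged)
```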
4. Class average method
As the name implies, the class average method takes the average of all pairwise distances between the two classes. The calculation formula is shown in figure 4.
Figure 4. Common methods of distance calculation
5. Sum of squared deviations method
The sum of squared deviations method, also called Ward's method, draws its idea from the analysis of variance: if the classes are divided correctly, the sum of squared deviations within each class should be small, while the sum of squared deviations between classes should be large. The calculation formula is shown in figure 4.
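The first three pairwise-distance definitions above can be compared on a tiny made-up example using SciPy's cdist (the center of gravity and Ward methods need cluster statistics rather than just the pairwise distance table, so they are omitted from this sketch):

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two small hypothetical clusters on a line.
G1 = np.array([[0.0, 0.0], [1.0, 0.0]])
G2 = np.array([[4.0, 0.0], [6.0, 0.0]])

D = cdist(G1, G2)            # all pairwise distances between the clusters
d_single = D.min()           # shortest distance method: closest pair
d_complete = D.max()         # longest distance method: farthest pair
d_average = D.mean()         # class average method: mean of all pairs
print(d_single, d_complete, d_average)
```

For these points the closest pair is 3 apart, the farthest 6, and the average of all four pairwise distances is 4.5, so the three criteria can rank the same pair of classes quite differently.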
Having covered the basics of systematic clustering, let us demonstrate its use with Python code.
First, import the required libraries.
import numpy as np
from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
Next, generate the dataset. The dataset used here is randomly generated and deliberately small: 15 data points in total, split into two clusters, one with 7 points and the other with 8. Keeping the number of points small makes the data distribution and the resulting classification easy to see in the plots later. The code is as follows.
state = np.random.RandomState(99)  # set the random state
a = state.multivariate_normal([10, 10], [[1, 3], [3, 11]], size=7)  # multivariate normal samples
b = state.multivariate_normal([-10, -10], [[1, 3], [3, 11]], size=8)
data = np.concatenate((a, b))  # splice the two clusters together
A fixed random state is set here to make the experiment repeatable. That state is then used to generate two variables, a and b, the data clusters mentioned earlier: a has 7 data points and b has 8. Both are multivariate normal samples, where the mean vector of a is [10, 10], the mean vector of b is [-10, -10], and the covariance matrix of both is [[1, 3], [3, 11]]. Note that a covariance matrix must be positive definite or positive semidefinite. Finally a and b are concatenated into the variable data.
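The positive-definiteness requirement mentioned above can be checked numerically. A minimal sketch for the covariance matrix used in this example: a symmetric matrix is positive definite exactly when all its eigenvalues are positive (equivalently, when a Cholesky factorization exists).

```python
import numpy as np

cov = np.array([[1.0, 3.0], [3.0, 11.0]])  # covariance matrix from the example

# Eigenvalues of a symmetric matrix; all must be > 0 for positive definiteness.
eigvals = np.linalg.eigvalsh(cov)
print(eigvals)

np.linalg.cholesky(cov)  # raises LinAlgError if not positive definite
```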
The next step is to plot the distribution of the data points. The code is as follows.
fig, ax = plt.subplots(figsize=(8, 8))  # set the figure size
ax.set_aspect('equal')  # make the two axes use the same scale
plt.scatter(data[:, 0], data[:, 1])
plt.ylim([-30, 30])  # set the y-axis range
plt.xlim([-30, 30])
plt.show()
The code here is straightforward, so only one line deserves comment: ax.set_aspect('equal'). By default matplotlib scales the x and y axes differently, so segments of equal length in data units need not look equal on screen; setting the aspect ratio to equal makes the plot better proportioned and more faithful. The resulting plot is shown in figure 5. Two data clusters are clearly visible: the upper one concentrated near [10, 10] and the lower one near [-10, -10], exactly as configured. From the figure it is obvious that this dataset splits into two categories, the upper cluster as one class and the lower as the other, but we still need the algorithm to confirm it.
Figure 5. Data distribution map used
Then comes the data processing; the code is as follows.
z = linkage(data, "average")  # use the "average" algorithm, i.e. the class average method
There is only one line of data-processing code, simple as it looks, but this is where the difficulty lies. First, look at the contents of z, shown in figure 6.
Figure 6. Clustering calculation result
Many people are baffled the first time they see this result, but the logic is simple once you know SciPy's conventions. Each row of z describes one merge and has four numbers. Take the first row: 11, 13, 0.14740505 and 2. The first two numbers refer to "classes". At the start every point is its own class, so 11 and 13 are two classes; the third number, 0.14740505, is the distance between them. These two points are merged into one class, which therefore contains 2 points, and that count is the fourth number. The new class is numbered 15: our original dataset has 15 points, numbered class 0 through class 14 in order, and since Python counts from 0, class 15 is the 16th class.

In the second row of z the first two numbers are 2 and 5, meaning the original class 2 and class 5 are merged at distance 0.3131184 into a class containing 2 points, just like the first row. In the third row the first two numbers are 10 and 15: class 15 is the new class created by the first merge and contains points 11 and 13. The distance between class 15 and class 10 is 0.39165998, the average of the distances from points 11 and 13 to point 10, because the linkage method used here is "average". Classes 10, 11 and 13 merge into a new class containing 3 points, hence the fourth number 3. The remaining rows of z follow the same rule. In the last row, classes 26 and 27 are merged into a class containing all 15 points; every point now belongs to one class and the algorithm terminates.
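If flat cluster labels are wanted rather than the merge table, SciPy's fcluster can cut the hierarchy at a chosen number of clusters. A sketch that regenerates the same dataset and asks for exactly two clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Recreate the article's dataset with the same random state.
state = np.random.RandomState(99)
a = state.multivariate_normal([10, 10], [[1, 3], [3, 11]], size=7)
b = state.multivariate_normal([-10, -10], [[1, 3], [3, 11]], size=8)
data = np.concatenate((a, b))

z = linkage(data, "average")

# Cut the tree into exactly two flat clusters; labels start at 1.
labels = fcluster(z, t=2, criterion="maxclust")
print(labels)
```

The first 7 points (cluster a) should all receive one label and the last 8 points (cluster b) the other.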
Then comes the plotting; the code is as follows, and the result is shown in figure 7.
fig, ax = plt.subplots(figsize=(8, 8))
dendrogram(z, leaf_font_size=14)  # draw the dendrogram
plt.title("Hierarchical Clustering Dendrogram")
plt.xlabel("Cluster label")
plt.ylabel("Distance")
plt.axhline(y=10)  # draw a horizontal classification line
plt.show()
Figure 7. Clustering result graph
As the figure shows, the 15 points fall into two categories: the points joined by the green lines on the left form one class, the seven points 0 through 6, and the points joined by the red lines on the right form the second class, the eight points 7 through 14. The division is exactly right and matches the way the data were generated.
The systematic clustering algorithm is simple and practical, and it is currently the most widely used clustering method, but it has shortcomings when handling large amounts of data, so in that case it is best combined with other algorithms. Users should also choose a distance calculation method appropriate to their data. This article used the class average method; because the dataset is very simple, the other distance methods give the same result here. On larger datasets, however, different distance methods may produce different final results, so choose according to your situation.
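The claim that the different linkage criteria agree on this easy dataset can be verified directly by comparing the flat partitions they produce. A sketch (the numeric labels fcluster assigns are arbitrary, so they are normalized before comparison):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Recreate the article's dataset with the same random state.
state = np.random.RandomState(99)
a = state.multivariate_normal([10, 10], [[1, 3], [3, 11]], size=7)
b = state.multivariate_normal([-10, -10], [[1, 3], [3, 11]], size=8)
data = np.concatenate((a, b))

partitions = {}
for method in ["single", "complete", "average", "centroid", "ward"]:
    z = linkage(data, method)
    labels = fcluster(z, t=2, criterion="maxclust")
    # Normalize so the first point's cluster is always called 1.
    partitions[method] = tuple(1 if lab == labels[0] else 2 for lab in labels)
print(partitions)
```

Because the two clusters are far apart relative to their spread, every method should produce the same two-way split.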