How to group data automatically with the K-means algorithm


This article explains how the K-means algorithm groups data automatically. It is quite practical, and hopefully you will get something out of it after reading. Let's take a look.

The K-means algorithm introduced below is an unsupervised learning algorithm.

In contrast to classification algorithms, unsupervised learning algorithms of this kind are also called clustering algorithms: there is only feature data and no target data, so the algorithm "learns" from the data on its own and gathers similar data into the same category.

1. The K-means algorithm

The name K-means breaks down as follows:

K: indicates that the algorithm divides the data into K different groups.

Mean: indicates that the center point of each group is the mean (average) of all the values in that group.

The K-means algorithm divides an unclassified data set into K classes. Which class a data point belongs to is determined by its similarity to each group's center point: the point is assigned to the group whose center point it is most similar to.

For how to calculate the similarity between things, you can refer to the article "how computers understand the relevance of things".
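As a small illustration of "most similar means closest", here is a sketch that assigns one point to the nearest of several center points using Euclidean distance (the point and center values are made up for this example):

import math

point = (12, 15)                                        # a hypothetical data point
centers = [(11.25, 11.3), (75.9, 80.5), (50.85, 53.6)]  # hypothetical group centers

# The point belongs to the group whose center is closest to it
distances = [math.dist(point, c) for c in centers]
print(distances.index(min(distances)))  # 0: the first center is the most similar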

The general steps to use the K-means algorithm are:

1. Determine the value of K:

For the choice of K, you can analyze the data and estimate how many categories it should be divided into.

If the exact value cannot be estimated, you can try several K values and take the one with the best partitioning effect as the final choice.

2. Select K center points: generally, the first K center points are chosen randomly.

3. Divide all the data in the data set into different categories according to their similarity to the center points.

4. Recalculate the position of each category's center point as the average of the data in that category.

5. Repeat steps 3 and 4 in a loop until the positions of the center points barely change; the classification is then complete (a minimal sketch of this loop follows below).
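To make these steps concrete, here is a minimal from-scratch sketch of the same loop. It uses NumPy and function and variable names of our own choosing, so it is an illustration rather than a production implementation, and it assumes no cluster ever becomes empty:

import numpy as np

def simple_kmeans(data, k, max_iter=300, seed=0):
    # A minimal sketch of the loop above; assumes no cluster ever becomes empty.
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    # Step 2: randomly pick K data points as the initial center points
    centers = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign each point to its most similar (closest) center
        distances = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: move each center to the mean of the points assigned to it
        new_centers = np.array([data[labels == i].mean(axis=0) for i in range(k)])
        # Step 5: stop once the centers barely move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

Calling simple_kmeans(train_data, 3) on the data prepared later in this article should reproduce a grouping similar to the one found by sklearn below.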

2. The clustering process of the K-means algorithm

Let's follow the clustering of a set of two-dimensional data points to see how the K-means algorithm clusters.

First, there are some discrete data points, as shown in the following figure:

We use the K-means algorithm to cluster these data points. First, randomly select two points as the center points of the two classes, a red x and a blue x:

Calculate the distance from every data point to the two center points; points closer to the red x are colored red, and points closer to the blue x are colored blue:

Recalculate the positions of the two center points and move them to their new locations:

Recalculate the distance from every data point to the red x and the blue x; again, points closer to the red x are colored red, and points closer to the blue x are colored blue:

The positions of the two center points are calculated again, and the center points move to their new locations:

When the positions of the center points barely change anymore, the clustering ends.

The above process is the clustering process of K-means algorithm.

3. Implementation of the K-means algorithm

K-means is a clustering algorithm. The cluster module in the sklearn library implements a series of clustering algorithms, including K-means.

Take a look at the prototype of the KMeans class:

KMeans(n_clusters=8, init='k-means++', n_init=10, max_iter=300, tol=0.0001, precompute_distances='deprecated', verbose=0, random_state=None, copy_x=True, algorithm='auto')

You can see that the KMeans class has many parameters. Here are some of the more important parameters:

n_clusters: the K value. You can try several K values and choose the one with the best clustering effect as the final K value (a small sketch for comparing several values follows this parameter list).

init: how to select the initial center points:

init='k-means++': speeds up convergence; this is the default and generally the better choice.

init='random': randomly select the center points.

You can also provide a custom way of choosing them, which is not covered here.

n_init: the number of times the center points are initialized. The default is 10. If the value of K is relatively large, you can increase n_init appropriately.

algorithm: the K-means implementation to use. There are three options: auto, full, and elkan.

The default is auto, and full or elkan is automatically selected according to the characteristics of the data.

max_iter: the maximum number of iterations of the algorithm. The default is 300.

If the clustering has difficulty converging, setting a maximum number of iterations ensures the algorithm finishes in a reasonable time.
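As mentioned under n_clusters, when the right K is not obvious you can fit the model with several candidate values and compare them. One common comparison uses the fitted model's inertia_ attribute (the sum of squared distances from each point to its closest center). A small sketch, assuming train_data is the data set prepared in the next section:

from sklearn.cluster import KMeans

# Fit the model with several candidate K values and compare how compact each result is
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10).fit(train_data)
    # inertia_ drops as K grows; a sharp drop followed by a flat tail (the "elbow")
    # usually marks a reasonable K
    print(k, model.inertia_)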

Let's cluster some points in two-dimensional coordinates to see how to use the K-means algorithm.

4. Prepare the data points

The following are three classes of randomly generated coordinate points, 20 points per class, with each class's coordinates in a different range:

Class A points: Ax holds the x-coordinates of the class A points, and Ay holds their y-coordinates. Both coordinates are in the range (0, 20].

Class B points: Bx holds the x-coordinates of the class B points, and By holds their y-coordinates. Both coordinates are in the range (40, 60).

Class C points: Cx holds the x-coordinates of the class C points, and Cy holds their y-coordinates. Both coordinates are in the range (70, 90).

Ax = [20, 6, 14, 13, 8, 19, 20, 14, 2, 11, 2, 15, 19, 4, 4, 11, 13, 4, 15, 11]
Ay = [14, 19, 17, 16, 3, 7, 9, 18, 20, 3, 4, 12, 9, 17, 14, 1, 18, 17, 3, 5]
Bx = [53, 50, 46, 52, 57, 42, 47, 55, 56, 57, 56, 50, 46, 46, 44, 44, 58, 54, 47, 57]
By = [60, 57, 57, 53, 54, 45, 54, 57, 49, 53, 42, 59, 54, 53, 50, 50, 58, 58, 58, 51]
Cx = [77, 75, 71, 87, 74, 70, 74, 85, 71, 75, 72, 82, 81, 70, 72, 71, 88, 71, 72, 80]
Cy = [85, 77, 82, 87, 71, 71, 77, 88, 81, 73, 80, 72, 90, 77, 89, 88, 83, 77, 90, 72]
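The lists above are the ones used in the rest of the article. As an aside, points like these could be generated with NumPy's random number generator; a sketch (the exact values will of course differ from the lists above):

import numpy as np

rng = np.random.default_rng()
# 20 integer points per class, each class confined to its own coordinate range;
# .tolist() keeps them as plain Python lists like the ones above
Ax = rng.integers(1, 21, size=20).tolist()   # class A x-coordinates, range (0, 20]
Ay = rng.integers(1, 21, size=20).tolist()
Bx = rng.integers(41, 60, size=20).tolist()  # class B, range (40, 60)
By = rng.integers(41, 60, size=20).tolist()
Cx = rng.integers(71, 90, size=20).tolist()  # class C, range (70, 90)
Cy = rng.integers(71, 90, size=20).tolist()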

We can use Matplotlib to draw these points in two-dimensional coordinates, as follows:

import matplotlib.pyplot as plt

plt.scatter(Ax + Bx + Cx, Ay + By + Cy, marker='o')
plt.show()

As the figure below shows, the distribution ranges of these three classes of points are clear at a glance.

For information about how to use Matplotlib to draw, you can refer to the article "how to use Python for data visualization."

5. Cluster the data

The following uses the K-means algorithm to cluster the data points.

Create a K-means model object:

from sklearn.cluster import KMeans

# Set K to 3; the other parameters use their default values
kmeans = KMeans(n_clusters=3)

Prepare the data: three categories, 60 coordinate points in total:

train_data = [
    # the first 20 are A points
    [20, 14], [6, 19], [14, 17], [13, 16], [8, 3],
    [19, 7], [20, 9], [14, 18], [2, 20], [11, 3],
    [2, 4], [15, 12], [19, 9], [4, 17], [4, 14],
    [11, 1], [13, 18], [4, 17], [15, 3], [11, 5],
    # the middle 20 are B points
    [53, 60], [50, 57], [46, 57], [52, 53], [57, 54],
    [42, 45], [47, 54], [55, 57], [56, 49], [57, 53],
    [56, 42], [50, 59], [46, 54], [46, 53], [44, 50],
    [44, 50], [58, 58], [54, 58], [47, 58], [57, 51],
    # the last 20 are C points
    [77, 85], [75, 77], [71, 82], [87, 87], [74, 71],
    [70, 71], [74, 77], [85, 88], [71, 81], [75, 73],
    [72, 80], [82, 72], [81, 90], [70, 77], [72, 89],
    [71, 88], [88, 83], [71, 77], [72, 90], [80, 72],
]

Fit the model:

kmeans.fit(train_data)

Cluster the data:

predict_data = kmeans.predict(train_data)

Check the clustering results, where 0, 1 and 2 represent the different categories:

>>> print(predict_data)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]

Looking at the final clustering result predict_data, the first, middle, and last 20 data points are each assigned to a different class, which matches our expectations and shows that the K-means clustering result is very good.

Because the distribution boundaries of the two-dimensional coordinate points in this example are very clear, the final clustering result is excellent.

We can view the number of iterations through the n_iter_ property:

>>> kmeans.n_iter_
2

View the center point coordinates of each class through the cluster_centers_ property:

>>> kmeans.cluster_centers_
array([[11.25, 11.3 ],
       [75.9 , 80.5 ],
       [50.85, 53.6 ]])

Drawing these three center points on the same axes gives the following:
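To draw the clustered points together with the three center points, a sketch along these lines would work, reusing train_data, predict_data, and kmeans from above (the marker and color choices are our own):

import matplotlib.pyplot as plt

xs = [p[0] for p in train_data]
ys = [p[1] for p in train_data]
# Color each data point by the cluster K-means assigned to it
plt.scatter(xs, ys, c=predict_data, marker='o')
# Mark the three center points with red x markers
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', marker='x')
plt.show()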

That is how the K-means algorithm groups data automatically. Hopefully some of the points covered here will be useful in your daily work.
