2025-01-16 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/01 Report--
How do you implement the k-means clustering algorithm in Python? This article walks through the algorithm step by step and provides a working implementation, in the hope of helping readers who want to solve this problem find a simple, practical approach.
Step 1. Randomly generate centroids
Because this is an unsupervised learning algorithm, we start with a pile of points scattered over a two-dimensional plane and then pick two centroids at random. The goal of the algorithm is to split these points into two groups according to their own coordinate features, so we choose two centroids and let the points separate into two piles, one around each centroid, as shown in the following figure:
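Step 1 can be sketched as follows. The point count, coordinate range, and random seed below are illustrative choices, not values from the article:

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the example is reproducible

# Generate a "pile" of 2-D points, like the scattered points in the figure.
points = rng.uniform(low=0.0, high=10.0, size=(200, 2))

# Randomly pick two of the points to serve as the initial centroids.
initial_indices = rng.choice(len(points), size=2, replace=False)
centroids = points[initial_indices]

print(points.shape)     # (200, 2)
print(centroids.shape)  # (2, 2)
```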
Step 2. Classify the points by distance
The red and blue dots represent the randomly chosen centroids. Since we want to split the points into two piles so that each pile is as close as possible to its own centroid, we first compute the distance from every point to each centroid. If a point is closer to the red centroid than to the blue one, we assign it to the red centroid, and vice versa, as shown in the figure:
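The assignment step can be sketched with a small toy example; the six points and two centroids below are made up for illustration:

```python
import numpy as np

# Toy data: six points and two centroids (the "red" and "blue" dots).
points = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                   [8.0, 8.0], [8.5, 9.0], [9.0, 8.0]])
centroids = np.array([[1.0, 1.5],   # "red" centroid
                      [8.5, 8.5]])  # "blue" centroid

# Distance from every point to every centroid: shape (6 points, 2 centroids).
distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)

# Each point joins the cluster of its nearest centroid.
labels = distances.argmin(axis=1)
print(labels)  # → [0 0 0 1 1 1]
```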
Step 3. Average the points in each cluster and update the centroid positions
In this step, we average the x and y values of the points in each cluster. The resulting mean point (x̄, ȳ) becomes the position of the new centroid, as shown in the figure:
We can see that the positions of the centroids have changed.
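Continuing the toy example above, the update step averages each cluster's coordinates (the points and labels here are illustrative):

```python
import numpy as np

points = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                   [8.0, 8.0], [8.5, 9.0], [9.0, 8.0]])
labels = np.array([0, 0, 0, 1, 1, 1])  # assignments from the previous step

# The new centroid of each cluster is the mean of that cluster's x and y values.
new_centroids = np.array([points[labels == k].mean(axis=0) for k in range(2)])
print(new_centroids)  # cluster 0 moves to (1.5, 1.33...), cluster 1 to (8.5, 8.33...)
```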
Step 4. Repeat steps 2 and 3
We repeat the second and third steps, assigning each point to its nearest centroid and then updating the centroid positions, until we reach an upper limit on the number of iterations (for example, 10000), or until after n iterations the centroid positions no longer change between two consecutive rounds, as shown in the following figure:
At this point, we have divided the points into two categories based purely on their own features, without any supervision.
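Steps 1 through 4 can be combined into one minimal loop. This is a sketch under the same assumptions as the toy examples above (random initial centroids chosen from the data, stopping when the centroids stop moving or the iteration cap is hit):

```python
import numpy as np

def kmeans(points, k=2, max_iter=10000, seed=0):
    """A minimal k-means loop following steps 1-4 above."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k of the points at random as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points.
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop early once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated blobs of 50 points each, for demonstration.
rng = np.random.default_rng(42)
points = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
                    rng.normal(5.0, 0.5, size=(50, 2))])
labels, centroids = kmeans(points, k=2)
```

With well-separated blobs like these, the loop converges long before the iteration cap, and each blob ends up in its own cluster.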
Step 5. How do we cluster points that are described by multiple features? First we introduce a concept, the Euclidean distance, whose definition is easy to understand:
The Euclidean distance d(xi, xj) is obtained by subtracting the two points feature by feature, squaring each difference, summing the squares over all dimensions, and then taking the square root: d(xi, xj) = sqrt(Σd (xi,d − xj,d)²). This is very easy to understand.
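The definition translates directly into code; the function name here is our own choice for illustration:

```python
import math

def euclidean_distance(xi, xj):
    """d(xi, xj): subtract the features dimension by dimension,
    square the differences, sum them, then take the square root."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))

# Works for any number of features, e.g. the classic 3-4-5 triangle in 2-D:
print(euclidean_distance([0, 0], [3, 4]))  # → 5.0
```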
We can also understand the k-means algorithm another way: clustering is achieved by minimizing the total squared distance between each point and the mean of its own cluster, i.e. by minimizing the within-cluster variance, as shown in the following figure:
Got it!
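This "minimize the variance" view can be checked numerically. The four points and the helper function below are made up for illustration; a sensible split of the points has a much smaller within-cluster sum of squares than a scrambled one:

```python
import numpy as np

# Four points: two near the origin, two near (8.5, 8).
points = np.array([[1.0, 1.0], [2.0, 1.0], [8.0, 8.0], [9.0, 8.0]])

def wcss(labels):
    """Within-cluster sum of squares: the total squared distance of every
    point to the mean (centroid) of its own cluster."""
    centroids = np.array([points[labels == k].mean(axis=0) for k in range(2)])
    return float(((points - centroids[labels]) ** 2).sum())

good = np.array([0, 0, 1, 1])  # the natural split
bad = np.array([0, 1, 0, 1])   # a deliberately scrambled split
print(wcss(good) < wcss(bad))  # → True
```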
Step 6. Code implementation
We now implement the k-means algorithm in Python. First we generate a dataset with the make_blobs function from sklearn.datasets and receive it with the two variables X and y. X holds the data itself, and y records which cluster each sample actually belongs to. Of course, real data would not tell us which sample belongs to which category; we would only have X. The labels are available here because make_blobs returns both values, so we unpack both rather than just X. The code is as follows:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

plt.figure(figsize=(12, 12))  # set the canvas size to 12 x 12
# 1600 samples; random_state fixes the random seed so the run is reproducible.
# By default make_blobs generates three centers, i.e. three clusters.
X, y = make_blobs(n_samples=1600, random_state=170)

y_pred = KMeans(n_clusters=3, random_state=170).fit_predict(X)
plt.subplot(221)  # the first of four panes
plt.scatter(X[:, 0], X[:, 1], c=y_pred)  # dimensions 0 and 1, colored by the predicted cluster
plt.title("The result of the Kmeans")

plt.subplot(222)  # the second of four panes
plt.scatter(X[:, 0], X[:, 1], c=y)  # the same data, colored by the true labels
plt.title("The Real result of the Kmeans")

# Stretch the data with a linear transformation. The matrix values were
# garbled in the original post; the ones below follow the well-known
# scikit-learn example this snippet appears to be based on.
array = np.array([[0.60834549, -0.63667341], [-0.40887718, 0.85253229]])
lashen = np.dot(X, array)
y_pred = KMeans(n_clusters=3, random_state=170).fit_predict(lashen)
plt.subplot(223)  # the third of four panes
plt.scatter(lashen[:, 0], lashen[:, 1], c=y_pred)  # the transformed data, colored by the predicted cluster
plt.title("The result of the transformed data")
When we plot with the scatter function, we write the code according to the shape of our data. The X dataset here has 1600 rows, one per sample, because we asked for 1600 samples, and each sample has only two features, i.e. its coordinates on the x and y axes, so X is a two-dimensional ndarray (the ndarray type from numpy). We can print it to inspect its contents, as shown in the following figure:
We can also see that y is an ndarray. Because make_blobs generates three clusters (three "centers") by default, y can only take three values. Plotting with matplotlib, we draw the classification results; the running result of the code above is as follows:
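The shapes described above can be verified directly; this assumes scikit-learn is installed:

```python
from sklearn.datasets import make_blobs

# Same call as in the article's code.
X, y = make_blobs(n_samples=1600, random_state=170)

print(X.shape)         # → (1600, 2): one row per sample, two features each
print(sorted(set(y)))  # → [0, 1, 2]: make_blobs defaults to three centers
```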
That concludes the walkthrough of how to implement the k-means clustering algorithm in Python. I hope the content above is of some help. If you still have questions, you can follow the industry information channel to learn more.
© 2024 shulou.com SLNews company. All rights reserved.