This article introduces the principles of the k-means and k-means++ algorithms and shows how to implement them in Python. The methods described here are simple and easy to follow; I hope they help resolve any doubts about these two algorithms.
Preface
K-means is an unsupervised clustering algorithm that is relatively simple to implement. K-means++ can be understood as an enhanced version of k-means that is friendlier than plain k-means when initializing the cluster centers.
K-means principle
The implementation steps of k-means are as follows:
Randomly select k points from the samples as the initial cluster centers.
For each sample point, compute its distance to the k cluster centers and assign it to the nearest center, until every sample point is assigned (clustering into k classes).
Compute the mean of each cluster, and take the k means as the new centers of their respective clusters.
Repeat steps 2 and 3 until the centers stop moving, or move by less than a threshold (a minimal sketch of one iteration follows this list).
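Before the full step-by-step implementation later in this article, here is a minimal vectorized sketch of one iteration of steps 2 and 3 (assignment and update), assuming the same five-point toy dataset used in that implementation and k = 2:

import numpy as np

X = np.array([[0, 2], [0, 0], [1, 0], [5, 0], [5, 2]], dtype=float)
centers = X[np.random.choice(len(X), size=2, replace=False)]  # step 1: two random centers

# Step 2: assign every sample to its nearest center (Euclidean distance)
dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
labels = dists.argmin(axis=1)

# Step 3: move each center to the mean of the samples assigned to it
# (this sketch assumes neither cluster ends up empty)
centers = np.array([X[labels == j].mean(axis=0) for j in range(2)])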
Advantages:
The principle is simple and easy to implement.
It converges quickly, and the clustering results are generally good.
Disadvantages:
The initial centers are chosen at random, so a poor starting configuration may be picked.
K-means++ principle
K-means++ is an enhanced version of k-means: it selects initial cluster centers that are as spread out as possible, which effectively reduces the number of iterations and speeds up the computation. The implementation steps are as follows (a small numeric sketch follows this list):
Randomly select one point from the samples as the first cluster center.
For each sample point, compute the distance D(x) to its nearest selected cluster center: the larger D(x) is, the higher the probability that the point is selected as the next cluster center.
Select the next cluster center by the roulette-wheel method (the larger D(x) is, the higher the probability of the point being selected).
Repeat steps 2 and 3 until k cluster centers have been selected.
Once the k centers are selected, run the standard k-means algorithm.
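As a small numeric sketch of steps 2 and 3, assuming the same toy dataset used in the code section below: compute D(x) for every sample, turn the distances into selection probabilities, and pick the next center with the roulette wheel. (This article weights by D(x) itself; the canonical k-means++ weights by D(x) squared.)

import numpy as np

X = np.array([[0, 2], [0, 0], [1, 0], [5, 0], [5, 2]], dtype=float)
center = X[np.random.choice(len(X))]      # step 1: one random center

D = np.linalg.norm(X - center, axis=1)    # step 2: D(x) for every sample
P = D / D.sum()                           # selection probability, proportional to D(x)

# Step 3: roulette-wheel selection; the current center has D(x) = 0,
# so its interval is empty and it can never be re-selected
next_i = np.searchsorted(np.cumsum(P), np.random.random(), side='right')
next_i = min(next_i, len(X) - 1)          # guard against floating-point round-off at the top end
print(next_i, X[next_i])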
It has to be pointed out that some literature says the point farthest from the already-selected cluster centers is chosen as the next center, which is not quite accurate. More precisely, the farthest point has the highest probability of being selected as the next center, but it is not guaranteed to be selected, because always taking the farthest point is not necessarily good (consider unusual data, for example a point that lies far away from every cluster).
In general, the initialization step always involves some randomness, because the data themselves are arbitrary.
Although computing the initial centers takes extra time, k-means itself then converges in fewer iterations, so the algorithm actually reduces overall computation time.
The key point is how the roulette-wheel method selects the next cluster center. We use an example to illustrate how k-means++ chooses the initial cluster centers.
Suppose the dataset contains 8 samples, numbered 1 through 8.
Using step 1 of k-means++, we first select point 6 as the first cluster center, and then proceed to step 2: compute the distance D(x) from each sample point to the selected cluster center, summarized as follows:
D(x): the distance from each sample point to the selected cluster center (here, the first cluster center).
P(x): the probability of each sample being selected as the next cluster center.
Sum: the cumulative sum of the probabilities P(x), used by the roulette-wheel method to select the second cluster center.
Then execute step 3 of k-means++: use the roulette-wheel method to select the next cluster center. Generate a random number between 0 and 1 and determine which interval it falls into; the point whose interval contains the random number is selected as the second cluster center.
In this example, the interval of point 1 is [0, 0.2), the interval of point 2 is [0.2, 0.525), and the interval of point 4 is [0.65, 0.9).
It can be seen intuitively from the cumulative sums that the probabilities of points 1, 2, 3 and 4 together total 0.9, and these four points happen to be the four farthest from the first cluster center (point 6). The second cluster center will therefore most likely fall on one of these four points, and point 2 has the highest probability of being selected, as the sketch below illustrates.
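A minimal sketch of this interval lookup, assuming the cumulative sums from the example: the first four boundaries (0.2, 0.525, 0.65, 0.9) come from the text, while the remaining values are assumptions chosen so that points 5, 7 and 8 share the leftover 0.1 and point 6 (the current center, with D(x) = 0) gets an empty interval:

import numpy as np

# The "Sum" column: cumulative probabilities for points 1-8.
# Values after 0.9 are illustrative assumptions (P5 = 0.05, P6 = 0, P7 = 0.03, P8 = 0.02).
cum = np.array([0.2, 0.525, 0.65, 0.9, 0.95, 0.95, 0.98, 1.0])

r = np.random.random()                             # random number in [0, 1)
point = np.searchsorted(cum, r, side='right') + 1  # 1-based point number
print("next center: point", point)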
Code implementation of k-means and k-means++
The centers selected here are stored as sample feature vectors (not sample indices), which makes the distance computations convenient; the clusters themselves store the indices of the samples assigned to each center.
K-means implementation

import random
import numpy as np

# Define the Euclidean distance
def get_distance(x1, x2):
    return np.sqrt(np.sum(np.square(x1 - x2)))

# Define the center initialization function; centers are stored as sample features
def center_init(k, X):
    n_samples, n_features = X.shape
    centers = np.zeros((k, n_features))
    selected_centers_index = []
    for i in range(k):
        # Each iteration randomly selects one class center, making sure no center repeats
        sel_index = random.choice(list(set(range(n_samples)) - set(selected_centers_index)))
        centers[i] = X[sel_index]
        selected_centers_index.append(sel_index)
    return centers

# Determine which center a sample point is closest to and return that center's index
## e.g. with three centers the return value is 0, 1 or 2
def closest_center(sample, centers):
    closest_i = 0
    closest_dist = float('inf')
    for i, c in enumerate(centers):
        # Pick the center with the smallest Euclidean distance as this sample's class
        distance = get_distance(sample, c)
        if distance < closest_dist:
            closest_i = i
            closest_dist = distance
    return closest_i

# Define the cluster-building step
# Each cluster stores sample indices, i.e. we cluster the indices, which is convenient to work with
def create_clusters(centers, k, X):
    clusters = [[] for _ in range(k)]
    for sample_i, sample in enumerate(X):
        # Assign the sample to the nearest class
        center_i = closest_center(sample, centers)
        # Store the sample's index
        clusters[center_i].append(sample_i)
    return clusters

# Compute the new centers from the previous clustering result
def calculate_new_centers(clusters, k, X):
    n_samples, n_features = X.shape
    centers = np.zeros((k, n_features))
    # The new center of each class is the mean of its current samples
    for i, cluster in enumerate(clusters):
        # cluster holds the indices of the samples in this class
        new_center = np.mean(X[cluster], axis=0)  # column-wise mean
        centers[i] = new_center
    return centers

# Get the cluster label of each sample
def get_cluster_labels(clusters, X):
    y_pred = np.zeros(np.shape(X)[0])
    for cluster_i, cluster in enumerate(clusters):
        for sample_i in cluster:
            y_pred[sample_i] = cluster_i
            # print('Sample {} assigned to class {}'.format(sample_i, cluster_i))
    return y_pred

# Assemble the k-means procedure from the steps above
def Mykmeans(X, k, max_iterations, init):
    # 1. Initialize the centers
    if init == 'kmeans':
        centers = center_init(k, X)
    else:
        centers = get_kmeansplus_centers(k, X)
    # Iterate to convergence
    for _ in range(max_iterations):
        # 2. Cluster according to the current centers
        clusters = create_clusters(centers, k, X)
        # Save the current centers
        pre_centers = centers
        # 3. Compute the new centers from the clustering result
        centers = calculate_new_centers(clusters, k, X)
        # 4. Convergence condition: the centers no longer change
        diff = centers - pre_centers
        # The centers have not moved, so stop updating
        if np.all(diff == 0):
            break
    # Return the final cluster labels
    return get_cluster_labels(clusters, X)

# Test run
X = np.array([[0, 2], [0, 0], [1, 0], [5, 0], [5, 2]])
# Cluster into 2 classes with at most 10 iterations
labels = Mykmeans(X, k=2, max_iterations=10, init='kmeans')
# Print the class label of each sample
print("Final clustering result", labels)
## Output: [1. 1. 1. 0. 0.]

# Verify with sklearn
from sklearn.cluster import KMeans
X = np.array([[0, 2], [0, 0], [1, 0], [5, 0], [5, 2]])
kmeans = KMeans(n_clusters=2, init='random').fit(X)
# Because the centers are random, the result may differ between runs
print(kmeans.labels_)

k-means++ implementation

## Get the k-means++ centers
def get_kmeansplus_centers(k, X):
    n_samples, n_features = X.shape
    init_one_center_i = np.random.choice(range(n_samples))
    centers = []
    centers.append(X[init_one_center_i])
    dists = [0 for _ in range(n_samples)]
    for _ in range(k - 1):
        total = 0
        for sample_i, sample in enumerate(X):
            # Distance from the sample to its nearest already-selected center
            closest_i = closest_center(sample, centers)
            d = get_distance(centers[closest_i], sample)
            dists[sample_i] = d
            total += d
        total = total * np.random.random()
        for sample_i, d in enumerate(dists):
            # Roulette-wheel selection of the next cluster center
            total -= d
            if total > 0:
                continue
            # Select the new center point
            centers.append(X[sample_i])
            break
    return centers

X = np.array([[0, 2], [0, 0], [1, 0], [5, 0], [5, 2]])
# Cluster into 2 classes with at most 10 iterations
labels = Mykmeans(X, k=2, max_iterations=10, init='kmeans++')
print("Final clustering result", labels)
## Output: [1. 1. 1. 0. 0.]

# Verify with sklearn
X = np.array([[0, 2], [0, 0], [1, 0], [5, 0], [5, 2]])
kmeans = KMeans(n_clusters=2, init='k-means++').fit(X)
print(kmeans.labels_)

This concludes the study of the principles of k-means and k-means++ and their implementation in Python. Pairing the theory with the code above is the best way to learn, so go and try it yourself!