How to Implement the K-Means Clustering Algorithm in Python

This article explains how to implement the K-Means clustering algorithm in Python. The material is practical, so it is shared here as a reference; follow along with the examples below.
1 Overview

1.1 Unsupervised learning
In typical supervised learning we have a labeled training set, and the goal is to find a decision boundary that separates the positive samples from the negative samples: given a series of labels, we need to fit a hypothesis function to them. In unsupervised learning, by contrast, the data carry no labels at all. We have a series of points but no labels, so the training set can be written simply as {x(1), x(2), ..., x(m)}, with no label information attached to any point. In other words, in unsupervised learning we feed a series of unlabeled training examples to an algorithm and ask it to find some internal structure in the data for us. The points in such a data set may, for example, fall into two separate groups of points, called clusters; an algorithm that finds these groups is called a clustering algorithm.
This is the first unsupervised learning algorithm we introduce. Later we will also mention other kinds of unsupervised learning algorithms, which can find other types of structure or other patterns for us, not just clusters. We start with clustering and introduce the other algorithms afterwards. So what are clustering algorithms generally used for?
Market segmentation: you may store a lot of customer information in a database and want to divide the customers into different groups, so that you can sell products or provide services better suited to each type of customer. Social network analysis: many researchers study groups of people in social networks such as Facebook or Google+, using information like whom you contact or email most often to find socially close groups, so you might use a clustering algorithm to find close friends within a network. Organizing computer clusters and managing data centers: if you know which computers in a data center often work together, you can reallocate resources and rearrange the network so that the data center and its data communication are optimized. Finally, clustering algorithms have even been used to study how galaxies form and, from that knowledge, to understand astronomical details. That is the clustering algorithm, the first unsupervised learning algorithm we introduce. Next, we turn to a specific clustering algorithm.
1.2 Clustering

1.3 The K-Means algorithm
2 The K-Means Algorithm

2.1 Introduction
K-Means is the most popular clustering algorithm. It accepts an unlabeled data set and clusters the data into different groups.
Steps:

Set initial values for the centers of the K clusters.
Compute the distance from each sample to each of the K centers and assign the sample to the nearest one.
Update the center of each cluster to the mean of the samples assigned to it.
Repeat the steps above until a termination condition is reached (number of iterations, minimum squared error, or rate of change of the cluster centers); these conditions map onto scikit-learn parameters, as sketched below.
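In scikit-learn, the termination conditions correspond to parameters of the KMeans constructor: max_iter bounds the number of iterations per run, and tol is the tolerance on the change of the cluster centers below which a run stops early. A minimal sketch (the values shown are, to my knowledge, the library defaults):

    from sklearn.cluster import KMeans

    # stop after at most 300 iterations, or earlier once the centers
    # move less than tol between consecutive iterations
    km = KMeans(n_clusters=4, max_iter=300, tol=1e-4)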
(Figures omitted: an example of clustering, and an illustration of the K-Means clustering algorithm.)
The pseudo code of the K-means algorithm is as follows:
Repeat {
    for i = 1 to m
        c(i) := index (from 1 to K) of the cluster centroid closest to x(i)
    for k = 1 to K
        μk := average (mean) of the points assigned to cluster k
}
The algorithm alternates between two steps. The first for loop is the assignment step: for each sample i, compute which cluster it should belong to. The second for loop moves the cluster centers: for each cluster k, recompute the centroid of that cluster. A from-scratch sketch of this loop follows.
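As a concrete illustration of the pseudocode, here is a minimal from-scratch sketch in NumPy. It is illustrative only and omits details a production implementation needs (empty-cluster handling, k-means++ initialization):

    import numpy as np

    def kmeans(X, K, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        # initialize the centroids by picking K distinct samples at random
        centroids = X[rng.choice(len(X), size=K, replace=False)]
        for _ in range(n_iter):
            # assignment step: index of the closest centroid for each sample
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # update step: move each centroid to the mean of its assigned points
            new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
            if np.allclose(new_centroids, centroids):  # converged
                break
            centroids = new_centroids
        return centroids, labels

    X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]], dtype=float)
    centroids, labels = kmeans(X, K=2)
    print(centroids)  # roughly [[1., 2.], [10., 2.]] (cluster order depends on the seed)
    print(labels)

The scikit-learn version of the same toy example follows.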
from sklearn.cluster import KMeans  # import the sklearn.cluster.KMeans class
import numpy as np

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
kmCluster = KMeans(n_clusters=2).fit(X)  # build the model and cluster, with K=2
print("Cluster center coordinates:", kmCluster.cluster_centers_)  # coordinates of each cluster center
print("Classification result:", kmCluster.labels_)  # classification result of the sample set
print("Prediction:", kmCluster.predict([[0, 0], [12, 3]]))  # assign new points to clusters based on the fitted model

Output:

Cluster center coordinates: [[10.  2.]
 [ 1.  2.]]
Classification result: [1 1 1 0 0 0]
Prediction: [1 0]

Process finished with exit code 0
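K-Means starts from randomly chosen initial centers, so the cluster order and labels can differ between runs. As a side note (standard scikit-learn parameters, not used in the original snippet): passing random_state fixes the seed, and n_init sets how many random restarts are tried, keeping the best result:

    kmCluster = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)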
2.2 An improved algorithm for large sample sets: Mini Batch K-Means

When the sample set is huge, for example more than 100,000 samples with more than 100 feature variables, the speed and memory consumption of the K-Means algorithm become problems. For large sample sets, scikit-learn provides an improved algorithm, Mini Batch K-Means, which does not use all of the sample data at once but repeatedly draws a small batch of samples for each K-Means update and iterates. Although the clustering quality of Mini Batch K-Means degrades slightly, it greatly improves running speed and reduces the memory footprint.
from sklearn.cluster import MiniBatchKMeans  # import the sklearn.cluster.MiniBatchKMeans class
import numpy as np

X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 5],
              [0, 1], [2, 2], [3, 2], [5, 5], [1, -1]])
# fit on the whole data, drawing mini-batches of 6 samples
mbkmCluster = MiniBatchKMeans(n_clusters=3, batch_size=6).fit(X)
print("Coordinates of the cluster centers:", mbkmCluster.cluster_centers_)  # coordinates of each cluster center
print("Classification result of the sample set:", mbkmCluster.labels_)  # classification result of the sample set
print("Which cluster the sample belongs to:", mbkmCluster.predict([[0, 0]]))  # assign a new point based on the fitted model

Output (abridged; Mini Batch K-Means is stochastic, so the exact values vary between runs):

Coordinates of the cluster centers:
 [[ 2.55932203 ...]
  [ 0.75862069 -0.20689655]
  [ 4.20588235  1.76271186]]
Classification result of the sample set: [0 2 2 0 0 2 1 ...]
Which cluster the sample belongs to: [1]

Process finished with exit code 0

2.3 Visualization: k-means++ initialization

from sklearn.cluster import kmeans_plusplus
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

# Generate sample data
n_samples = 4000
n_components = 4

X, y_true = make_blobs(n_samples=n_samples, centers=n_components,
                       cluster_std=0.60, random_state=0)
X = X[:, ::-1]

# Calculate seeds from k-means++
centers_init, indices = kmeans_plusplus(X, n_clusters=4, random_state=0)

# Plot init seeds alongside the sample data
plt.figure(1)
colors = ["#4EACC5", "#FF9C34", "#4E9A06", "m"]

for k, col in enumerate(colors):
    cluster_data = y_true == k
    plt.scatter(X[cluster_data, 0], X[cluster_data, 1], c=col, marker=".", s=10)

plt.scatter(centers_init[:, 0], centers_init[:, 1], c="b", s=50)
plt.title("K-Means++ Initialization")
plt.xticks([])
plt.yticks([])
plt.show()
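A brief note on why k-means++ seeding is used (background, not covered in the original text): rather than choosing all initial centers uniformly at random, k-means++ picks each new center with probability proportional to its squared distance from the centers already chosen. This spreads the seeds apart and typically makes K-Means converge faster and to better solutions; it is the default initialization in scikit-learn, equivalent to KMeans(n_clusters=4, init="k-means++").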
3 Case 1

3.1 Code

# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans, MiniBatchKMeans

def main():
    # Read the data file
    file = pd.read_excel('K-means.xlsx', header=0)  # first row is the header row
    file = file.dropna()  # drop rows with missing values
    # print(file.dtypes)  # check the data type of each column of the dataframe
    # print(file.shape)   # check the number of rows and columns of the dataframe
    print(file.head())

    # Data preparation
    z_scaler = lambda x: (x - np.mean(x)) / np.std(x)  # data standardization (z-score) function
    dfScaler = file[['D1', 'D2', 'D3', 'D4', 'D5', 'D6', 'D7', 'D8', 'D9', 'D10']].apply(z_scaler)  # standardize the data
    dfData = pd.concat([file[['region']], dfScaler], axis=1)  # column-level merge
    df = dfData.loc[:, ['D1', 'D2', 'D3', 'D4', 'D5', 'D6', 'D7', 'D8', 'D9', 'D10']]  # clustering analysis based on all 10 features
    # df = dfData.loc[:, [...]]  # alternatively, cluster on six features selected after dimensionality reduction
    X = np.array(df)  # prepare the data for the sklearn.cluster.KMeans model
    print("Shape of cluster data:", X.shape)

    # KMeans cluster analysis (sklearn.cluster.KMeans)
    nCluster = 4
    kmCluster = KMeans(n_clusters=nCluster).fit(X)  # build the model and cluster, with 4 cluster centers
    print("Cluster centers:\n", kmCluster.cluster_centers_)  # coordinates of each cluster center
    print("Cluster results:\n", kmCluster.labels_)  # classification result of the sample set

    # Organize the clustering results
    listName = dfData['region'].tolist()  # convert the 'region' column of dfData to a list
    dictCluster = dict(zip(listName, kmCluster.labels_))  # pair each region with its cluster label in a dictionary
    listCluster = [[] for k in range(nCluster)]
    for v in range(0, len(dictCluster)):
        k = list(dictCluster.values())[v]  # the cluster label of the v-th region
        listCluster[k].append(list(dictCluster.keys())[v])  # add the v-th region to cluster k
    print("\nCluster analysis result (divided into {} categories):".format(nCluster))
    for k in range(nCluster):  # print the classification result of the sample set
        print("Category {}: {}".format(k, listCluster[k]))
    return

if __name__ == '__main__':
    main()

3.2 Results
  Area       D1   D2   D3    D4   D5   D6     D7    D8    D9    D10
0 Beijing   5.96  310  461  1557  931  319  44.36  2615  2.20  13631
1 Shanghai  3.39  234  308  1035  498  161  35.02  3052  0.90  12665
2 Tianjin   2.35  157  229   713  295  109  38.40  3031  0.86   9385
3 Shaanxi   1.35   81  111   364  150   58  30.45  2699  1.22   7881
4 Liaoning  1.50   88  128   421  144   58  34.30  2808  0.54   7733
Shape of cluster data: (30, 10)
Cluster centers:
 [[-3.04626787e-01 -2.89307971e-01 -2.90845727e-01 -2.88480032e-01
   -2.85445404e-01 -2.85283077e-01 -6.22770669e-02  1.12938023e-03
   -2.71308432e-01 -3.03408599e-01]
  [ 4.44318512e+00  3.97251590e+00  4.16079449e+00  4.20994153e+00
    4.61768098e+00  4.65296699e+00  2.45321197e+00  4.02147595e-01
    4.22779099e+00  2.44672575e+00]
  [ 1.52987871e+00  2.10479182e+00  1.97836141e+00  1.92037518e+00
    1.54974999e+00  1.50344182e+00  1.13526879e+00  1.13595799e+00
    8.39397483e-01  1.38149832e+00]
  [ 4.17353928e-01 -6.60092295e-01 -5.55528420e-01 -5.50211065e-01
   -2.95600461e-01 -2.42490616e-01 -3.10454580e+00 -2.70342746e+00
    1.14743326e+00  2.67890118e+00]]
Cluster results:
 [1 2 2 0 0 0 3 0 0 0 ...]
Cluster analysis result (divided into 4 categories):
Category 0: ['Shaanxi', 'Liaoning', 'Jilin', 'Heilongjiang', 'Hubei', 'Jiangsu', 'Guangdong', 'Sichuan', 'Shandong', 'Gansu', 'Hunan', 'Zhejiang', 'Xinjiang', 'Fujian', 'Shanxi', 'Hebei', 'Anhui', 'Yunnan', 'Jiangxi', 'Hainan', 'Inner Mongolia', 'Henan', 'Guangxi', 'Ningxia', 'Guizhou', 'Qinghai']
Category 1: ['Beijing']
Category 2: ['Shanghai', 'Tianjin']
Category 3: ['Xizang']
Process finished with exit code 0
4 Case 2

4.1 Data
(1) Data introduction:
The data set contains eight variables describing the average annual per-capita consumption expenditure of urban households in 31 Chinese provinces in 1999. The eight variables are: food; clothing; household equipment, goods and services; health care; transportation and communications; entertainment, education and cultural services; housing; and miscellaneous goods and services. Based on these data, the 31 provinces are clustered.

(2) Experimental purpose:
Through clustering, understand the consumption levels of the provinces in 1999.

(Table omitted: the average annual per-capita consumption expenditure of urban households in the 31 provinces in 1999.)
4.2 Code

# *========== 1. Create a project and import the sklearn-related packages ==========*
import numpy as np
from sklearn.cluster import KMeans

# *========== 2. Load the data, create a K-Means instance, train it, and get the labels ==========*
def loadData(filePath):
    fr = open(filePath, 'r+')  # 'r+': open a text file for reading and writing
    lines = fr.readlines()  # .readlines() reads the whole file at once (like .read()); reading line by line with .readline() is much slower
    retData = []      # retData: stores the consumption data of each city
    retCityName = []  # retCityName: stores the city names
    for line in lines:
        items = line.strip().split(",")
        retCityName.append(items[0])
        retData.append([float(items[i]) for i in range(1, len(items))])
    return retData, retCityName  # return the consumption data and the city names

def main():
    data, cityName = loadData('city.txt')  # 1. read the data with loadData
    km = KMeans(n_clusters=4)              # 2. create a KMeans instance
    label = km.fit_predict(data)           # 3. call fit_predict() to cluster
    expenses = np.sum(km.cluster_centers_, axis=1)
    # print(expenses)
    CityCluster = [[], [], [], []]  # group the cities into the four clusters
    for i in range(len(cityName)):
        CityCluster[label[i]].append(cityName[i])
    for i in range(len(CityCluster)):  # print the average expense of each cluster and its cities
        print("Expenses:%.2f" % expenses[i])
        print(CityCluster[i])

if __name__ == '__main__':
    main()

# *========== 3. Output the labels and inspect the result ==========*
# Cities are grouped into n_clusters groups by consumption level; cities with similar consumption fall into the same group.
# expenses: the sum of the values of each cluster center, i.e. the average consumption level of that cluster.
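For reference, loadData expects city.txt to be a comma-separated text file with one city per line: the city name followed by the eight expenditure figures. A hypothetical line with illustrative values (an assumption about the file layout, not the actual data set):

    Beijing,2959.19,730.79,749.41,513.34,467.87,1141.82,478.42,457.64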
4.3 Results
(1) Grouped into two categories: km = KMeans(n_clusters=2)
(2) Grouped into three categories: km = KMeans(n_clusters=3)
(3) Grouped into four categories: km = KMeans(n_clusters=4)
From the results we can see that provinces and cities with similar consumption levels are grouped together; for example, 'Beijing', 'Shanghai' and 'Guangdong', which have the highest consumption, fall into the same category. When grouped into four categories, the different consumption levels can be seen clearly.
4.4 Extension and improvement
When computing the similarity between two samples, scikit-learn's K-Means uses Euclidean distance by default. Although many other measures exist, such as cosine similarity and Mahalanobis distance, KMeans exposes no parameter for choosing the distance function.
(1) If you want to customize how the distance is computed, you can modify the library source code.
(2) The scipy.spatial.distance.cdist method is recommended; see the sketch after the parameter list below.
Usage: scipy.spatial.distance.cdist(A, B, metric='cosine')
Important parameters:
A: vector A
B: vector B
metric: the method used to compute the distance between A and B; changing this parameter changes the distance function that is called.
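As a minimal sketch of that idea (my own illustration, not code from the original article): the assignment step of K-Means can be rewritten around cdist so that any metric cdist supports, such as 'cosine', drives the clustering. Note that the update step below still uses the plain mean, which is only a heuristic once the metric is not Euclidean.

    import numpy as np
    from scipy.spatial.distance import cdist

    def kmeans_custom_metric(X, K, metric='cosine', n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        # initialize the centroids by picking K distinct samples at random
        centroids = X[rng.choice(len(X), size=K, replace=False)]
        for _ in range(n_iter):
            # assignment step: (n_samples, K) distance matrix under the chosen metric
            labels = cdist(X, centroids, metric=metric).argmin(axis=1)
            # update step: plain mean (a heuristic for non-Euclidean metrics)
            centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        return centroids, labels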
Thank you for reading! This article on how to implement the K-Means clustering algorithm in Python ends here. I hope the content above has been helpful and that you have learned something from it; if you think the article is good, please share it so that more people can see it.