This article introduces how to use the K-means algorithm to cluster two-dimensional data. The content is quite detailed; interested readers can refer to it, and I hope you find it helpful.
Cluster analysis starts from a set of elements D, each of which has n observed attributes. Based on these attributes, some algorithm partitions D into K subsets such that the similarity between elements within each subset is as high as possible, while the similarity between elements of different subsets is as low as possible. Cluster analysis is an unsupervised learning method: neither the categories nor, in general, even the number of categories is known before clustering. Clustering is now widely used in statistics, biology, database technology, marketing, and other fields.
There are many clustering algorithms, such as K-means clustering, K-medoids clustering, density-based clustering, hierarchical clustering, expectation-maximization clustering, and so on. Here we focus on the K-means algorithm. Its basic idea is to take K points in the space as initial cluster centers, assign each object to the center closest to it, and then iteratively update the value of each cluster center until the best clustering result is obtained. The K-means algorithm is simple to implement, fast, easy to understand, and gives good clustering results, so it is recognized as one of the classical data mining methods.
As an example, we design a K-means clustering method for a common kind of two-dimensional data set and use it to cluster 80 two-dimensional data points. A Python implementation of the K-means algorithm follows.
The 80 two-dimensional samples are stored in a text file named testSet.txt. After data preprocessing and simple analysis, we know that the data set contains four categories, so we set the number of clusters K to 4.
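The original testSet.txt is not reproduced here, so as a stand-in the following sketch writes a comparable tab-delimited file of 80 points around four hypothetical centers (the seed and the center coordinates are assumptions, not the original data):

import numpy as np

np.random.seed(0)  # reproducible stand-in data, not the original set
centers = np.array([[3.0, 3.0], [-3.0, 3.0], [-3.0, -3.0], [3.0, -3.0]])
points = np.vstack([c + np.random.randn(20, 2) for c in centers])  # 80 x 2
np.savetxt('testSet.txt', points, delimiter='\t')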
First import the necessary modules:
import kmeans
import numpy as np
import matplotlib.pyplot as plt
from math import sqrt
(1) Load the data set from a file
Construct the data matrix by reading the data from the text file line by line, forming a matrix for subsequent use.
dataSet = []
fileIn = open('testSet.txt')
for line in fileIn.readlines():
    lineArr = line.strip().split('\t')
    dataSet.append([float(lineArr[0]), float(lineArr[1])])
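If the file is clean and strictly tab-delimited, the same list can be built in one line with NumPy (a sketch, assuming no malformed rows in the file):

dataSet = np.loadtxt('testSet.txt', delimiter='\t').tolist()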
(2) Call the kmeans algorithm to cluster the data
Use the following commands to call the kmeans module designed here to cluster the data.
dataSet = np.mat(dataSet)
k = 4
centroids, clusterAssment = kmeans.kmeanss(dataSet, k)
The kmeans module mainly contains the following functions.
Distance measurement function. Euclidean distance is used here and is computed as follows:
def eucDistance(vec1, vec2):
    # Euclidean distance between two row vectors
    return sqrt(np.sum(np.power(vec2 - vec1, 2)))
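The same distance can be obtained with NumPy's vector norm, which is equivalent for the row vectors used here:

def eucDistance(vec1, vec2):
    # equivalent: the 2-norm of the difference vector
    return np.linalg.norm(vec2 - vec1)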
Initial cluster center selection. K data points are randomly selected from the data set as the initial cluster centers.
def initCentroids(dataSet, k):
    numSamples, dim = dataSet.shape
    centroids = np.mat(np.zeros((k, dim)))
    for i in range(k):
        # pick a random sample as an initial cluster center
        index = int(np.random.uniform(0, numSamples))
        centroids[i, :] = dataSet[index, :]
    return centroids
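Because np.random.uniform can draw the same index twice, two initial centers may coincide. A variant that samples without replacement avoids this (an alternative sketch, not part of the original module):

def initCentroids(dataSet, k):
    # sample k distinct row indices so no two initial centers coincide
    index = np.random.choice(dataSet.shape[0], k, replace=False)
    return dataSet[index, :]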
K-means clustering algorithm. The algorithm creates k centroids, assigns each point to the nearest centroid, and then recalculates the centroids. This process repeats until the cluster assignments of the data points no longer change.
def kmeanss(dataSet, k):
    numSamples = dataSet.shape[0]
    # first column: assigned cluster index; second column: squared distance
    clusterAssement = np.mat(np.zeros((numSamples, 2)))
    clusterChanged = True
    ## step 1: init centroids
    centroids = initCentroids(dataSet, k)
    while clusterChanged:
        clusterChanged = False
        for i in range(numSamples):
            minDist = 100000.0
            minIndex = 0
            ## step 2: find the closest centroid
            for j in range(k):
                distance = eucDistance(centroids[j, :], dataSet[i, :])
                if distance < minDist:
                    minDist = distance
                    minIndex = j
            ## step 3: update the point's cluster assignment
            if clusterAssement[i, 0] != minIndex:
                clusterChanged = True
            clusterAssement[i, :] = minIndex, minDist ** 2
        ## step 4: update centroids
        for j in range(k):
            pointsInCluster = dataSet[np.nonzero(clusterAssement[:, 0].A == j)[0]]
            centroids[j, :] = np.mean(pointsInCluster, axis=0)
    print('Congratulations, cluster completed')
    return centroids, clusterAssement
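For reference, the same clustering can be cross-checked with scikit-learn's KMeans (assuming scikit-learn is installed; this is not part of the module above):

from sklearn.cluster import KMeans

model = KMeans(n_clusters=4, n_init=10).fit(np.asarray(dataSet))
print(model.cluster_centers_)  # should be close to the centroids above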
Display of the clustering results. Data in different clusters are shown with different colors and symbols, and the final cluster centers are drawn at the same time.
def showCluster(dataSet, k, centroids, clusterAssement):
    numSamples, dim = dataSet.shape
    mark = ['or', 'ob', 'og', 'ok']  # point styles for up to four clusters
    # draw each sample with the marker of its assigned cluster
    for i in range(numSamples):
        plt.plot(dataSet[i, 0], dataSet[i, 1], mark[int(clusterAssement[i, 0])])
    # draw the final cluster centers
    plt.plot(centroids[:, 0], centroids[:, 1], '+k', markersize=12)
    plt.show()
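Finally, plot the result of the earlier kmeans.kmeanss call:

showCluster(dataSet, k, centroids, clusterAssment)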