This article introduces how to use the K-means algorithm to cluster two-dimensional data. The content is quite detailed; interested readers can refer to it, and I hope you find it helpful.
Cluster analysis starts from a set of elements D, each of which has n observed attributes. Based on these attributes, some algorithm partitions D into K subsets such that the similarity between elements within each subset is as high as possible, while the similarity between elements of different subsets is as low as possible. Cluster analysis is an unsupervised learning method: neither the categories nor, in general, even the number of categories is known before clustering. Clustering is now widely used in statistics, biology, database technology, marketing, and other fields.
There are many clustering algorithms, such as K-means clustering, K-medoids clustering, density-based clustering, hierarchical clustering, expectation-maximization clustering, and so on. Here we focus on the K-means algorithm. Its basic idea is to take K points in the space as initial cluster centers, assign each object to the center closest to it, and then iteratively update the value of each cluster center until the best clustering result is obtained. The K-means algorithm is simple to implement, fast, easy to understand, and gives good clustering results, so it is recognized as one of the classical data mining methods.
As an example, we design a K-means clustering method for a common kind of two-dimensional data set and use it to cluster 80 two-dimensional data points. A Python implementation of the K-means algorithm follows.
The 80 two-dimensional samples are stored in a text file named testSet.txt. After data preprocessing and simple analysis, we know that the data set contains four categories, so we set the number of clusters K to 4.
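The original testSet.txt is not reproduced here, so as a stand-in the following sketch writes a comparable tab-delimited file of 80 points around four hypothetical centers (the seed and the center coordinates are assumptions, not the original data):

import numpy as np

np.random.seed(0)  # reproducible stand-in data, not the original set
centers = np.array([[3.0, 3.0], [-3.0, 3.0], [-3.0, -3.0], [3.0, -3.0]])
points = np.vstack([c + np.random.randn(20, 2) for c in centers])  # 80 x 2
np.savetxt('testSet.txt', points, delimiter='\t')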
First import the necessary modules:
import kmeans
import numpy as np
import matplotlib.pyplot as plt
from math import sqrt
(1) Load the data set from a file
Construct the data matrix by reading the data from the text file line by line, forming a matrix for subsequent use.
dataSet = []
fileIn = open('testSet.txt')
for line in fileIn.readlines():
    lineArr = line.strip().split('\t')
    dataSet.append([float(lineArr[0]), float(lineArr[1])])
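If the file is clean and strictly tab-delimited, the same list can be built in one line with NumPy (a sketch, assuming no malformed rows in the file):

dataSet = np.loadtxt('testSet.txt', delimiter='\t').tolist()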
(2) Call the kmeans algorithm to cluster the data
Use the following commands to call the kmeans module designed here to cluster the data.
dataSet = np.mat(dataSet)
k = 4
centroids, clusterAssment = kmeans.kmeanss(dataSet, k)
The kmeans module mainly contains the following functions.
Distance measurement function. Euclidean distance is used here and is computed as follows:
def eucDistance(vec1, vec2):
    # Euclidean distance between two row vectors
    return sqrt(np.sum(np.power(vec2 - vec1, 2)))
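The same distance can be obtained with NumPy's vector norm, which is equivalent for the row vectors used here:

def eucDistance(vec1, vec2):
    # equivalent: the 2-norm of the difference vector
    return np.linalg.norm(vec2 - vec1)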
Initial cluster center selection. K data points are randomly selected from the data set as the initial cluster centers.
def initCentroids(dataSet, k):
    numSamples, dim = dataSet.shape
    centroids = np.mat(np.zeros((k, dim)))
    for i in range(k):
        # pick a random sample as an initial cluster center
        index = int(np.random.uniform(0, numSamples))
        centroids[i, :] = dataSet[index, :]
    return centroids
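Because np.random.uniform can draw the same index twice, two initial centers may coincide. A variant that samples without replacement avoids this (an alternative sketch, not part of the original module):

def initCentroids(dataSet, k):
    # sample k distinct row indices so no two initial centers coincide
    index = np.random.choice(dataSet.shape[0], k, replace=False)
    return dataSet[index, :]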
K-means clustering algorithm. The algorithm creates k centroids, assigns each point to the nearest centroid, and then recalculates the centroids. This process repeats until the cluster assignments of the data points no longer change.
def kmeanss(dataSet, k):
    numSamples = dataSet.shape[0]
    # first column: assigned cluster index; second column: squared distance
    clusterAssement = np.mat(np.zeros((numSamples, 2)))
    clusterChanged = True
    ## step 1: init centroids
    centroids = initCentroids(dataSet, k)
    while clusterChanged:
        clusterChanged = False
        for i in range(numSamples):
            minDist = 100000.0
            minIndex = 0
            ## step 2: find the closest centroid
            for j in range(k):
                distance = eucDistance(centroids[j, :], dataSet[i, :])
                if distance < minDist:
                    minDist = distance
                    minIndex = j
            ## step 3: update the point's cluster assignment
            if clusterAssement[i, 0] != minIndex:
                clusterChanged = True
            clusterAssement[i, :] = minIndex, minDist ** 2
        ## step 4: update centroids
        for j in range(k):
            pointsInCluster = dataSet[np.nonzero(clusterAssement[:, 0].A == j)[0]]
            centroids[j, :] = np.mean(pointsInCluster, axis=0)
    print('Congratulations, cluster completed')
    return centroids, clusterAssement
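For reference, the same clustering can be cross-checked with scikit-learn's KMeans (assuming scikit-learn is installed; this is not part of the module above):

from sklearn.cluster import KMeans

model = KMeans(n_clusters=4, n_init=10).fit(np.asarray(dataSet))
print(model.cluster_centers_)  # should be close to the centroids above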
Display of the clustering results. Data in different clusters are shown with different colors and symbols, and the final cluster centers are drawn at the same time.
def showCluster(dataSet, k, centroids, clusterAssement):
    numSamples, dim = dataSet.shape
    mark = ['or', 'ob', 'og', 'ok']  # point styles for up to four clusters
    # draw each sample with the marker of its assigned cluster
    for i in range(numSamples):
        plt.plot(dataSet[i, 0], dataSet[i, 1], mark[int(clusterAssement[i, 0])])
    # draw the final cluster centers
    plt.plot(centroids[:, 0], centroids[:, 1], '+k', markersize=12)
    plt.show()
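Finally, plot the result of the earlier kmeans.kmeanss call:

showCluster(dataSet, k, centroids, clusterAssment)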