How to analyze K-means Clustering

2026-03-19 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)05/31 Report--

This article explains in detail how to analyze K-means clustering, in the hope of giving readers who face this problem a simpler and more workable approach.

One: algorithm

K-means is one of the most commonly used clustering algorithms in machine learning, and also one of the most basic. Clustering is a form of unsupervised learning. The algorithm consists of two steps:

1. Given the number of clusters K, pick K center points and assign every other point to the center it is closest to.
2. Recompute the center of each cluster as the mean of the points assigned to it, reassign the points, and iterate until the centers stop moving.
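The two steps can be sketched in plain Scala, without Spark, to make the loop concrete. This is a minimal illustration and not how MLlib implements it: the names `KMeansSketch`, `sqDist`, `nearest`, `mean`, and `run` are made up for this example, the initial centers are supplied by the caller, and the `maxIter` cap and the 1e-12 movement threshold are assumptions.

```scala
object KMeansSketch {
  type Point = Array[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the center nearest to p.
  def nearest(p: Point, centers: Seq[Point]): Int =
    centers.indices.minBy(i => sqDist(p, centers(i)))

  // Coordinate-wise mean of a non-empty group of points.
  def mean(points: Seq[Point]): Point = {
    val dim = points.head.length
    Array.tabulate(dim)(j => points.map(_(j)).sum / points.size)
  }

  // Step 1 assigns each point to its nearest center; step 2 moves each
  // center to the mean of its points. Repeat until no center moves.
  def run(points: Seq[Point], initial: Seq[Point], maxIter: Int = 100): Seq[Point] = {
    var centers = initial
    var changed = true
    var iter = 0
    while (changed && iter < maxIter) {
      val groups = points.groupBy(p => nearest(p, centers))
      val updated = centers.indices.map { i =>
        // A center that lost all its points is left where it was.
        groups.get(i).map(mean).getOrElse(centers(i))
      }
      changed = updated.zip(centers).exists { case (u, c) => sqDist(u, c) > 1e-12 }
      centers = updated
      iter += 1
    }
    centers
  }
}
```

For the four points (0, 0), (0, 2), (10, 0), (10, 2) with the first and third used as initial centers, `KMeansSketch.run` settles on the centers (0, 1) and (10, 1) after two passes.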

Two: Spark MLlib

Spark MLlib provides an implementation of the K-means algorithm.

1. Data source

The data comes from the KDD Cup website and has to be downloaded from the Internet. On that site, find data -> kddcup.data.zip and download it.

The format of each row of data is as follows:

0,tcp,http,SF,215,45076,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,0,0,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,normal.

The last field is the label; all the others are features. The labels may not be complete: they mark only the anomalies that were already known, whereas k-means can also surface anomalies that were not known in advance.

2. Read data

import org.apache.spark.mllib.linalg.Vectors

val rawDataPath = "Your kddcup.data.txt Path"
val rawData = sc.textFile(rawDataPath)
val labelsAndData = rawData.flatMap { line =>
  val buffer = line.split(',').toBuffer
  if (buffer.length == 42) {
    buffer.remove(1, 3)                          // drop the 2nd, 3rd and 4th columns
    val label = buffer.remove(buffer.length - 1) // the last column is the label
    val vector = Vectors.dense(buffer.map(_.toDouble).toArray)
    Some((label, vector))
  } else {
    None                                         // skip rows without exactly 42 fields
  }
}

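The code drops the 2nd, 3rd and 4th columns, which hold non-numeric values (tcp, http and SF in the sample row above), and splits off the last column as the label. The remaining 38 numeric columns become the feature vector.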
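The column handling can be checked on a single row without a Spark context. The following is a minimal plain-Scala sketch of the same parsing logic; `ParseRow` and `parse` are illustrative names, and it returns a plain `Array[Double]` where the Spark code builds an MLlib vector.

```scala
object ParseRow {
  // Returns (label, numeric features) for a well-formed 42-field row, else None.
  def parse(line: String): Option[(String, Array[Double])] = {
    val buffer = line.split(',').toBuffer
    if (buffer.length == 42) {
      buffer.remove(1, 3)                          // drop columns 2-4 (0-based index 1, count 3)
      val label = buffer.remove(buffer.length - 1) // the last column is the label
      Some((label, buffer.map(_.toDouble).toArray))
    } else {
      None
    }
  }
}
```

Applied to the sample row shown earlier, `ParseRow.parse` yields the label `normal.` and 38 features that begin 0.0, 215.0, 45076.0.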

3. K-Means algorithm

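import org.apache.spark.mllib.clustering.KMeans

val data = labelsAndData.values.cache() // assumed definition: feature vectors only
val kmeans = new KMeans()
kmeans.setK(k)            // number of clusters; the default K is 2
kmeans.setRuns(10)        // runs from different starting centers (Spark 1.x; deprecated later)
kmeans.setEpsilon(1.0e-6) // convergence threshold on center movement; smaller means more iterations
val model = kmeans.run(data)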

Use the generated model to assign each point to a cluster, then count how many points of each label fall into each cluster:

val clusterLabelCount = labelsAndData.map { case (label, datum) =>
  val cluster = model.predict(datum)
  (cluster, label)
}.countByValue

// Print one line per (cluster, label) pair with its count
clusterLabelCount.toSeq.sorted.foreach { case ((cluster, label), count) =>
  println(f"$cluster%1s$label%18s$count%8s")
}

4. How to choose K

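Choosing K is a real question. In general, the larger K is, the smaller the distances from points to their centers, so the clustering looks better by that measure. Imagine the extreme case in which every point is its own cluster: every distance is zero, yet the clustering tells us nothing.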
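The effect of a growing K can be seen on a toy example. The sketch below is plain Scala rather than Spark, uses hand-picked centers instead of a trained model, and takes the mean distance to the nearest center as the score, all of which are assumptions made for illustration; `ScoreVsK` and `meanDistToNearest` are hypothetical names.

```scala
object ScoreVsK {
  type Point = Array[Double]

  // Euclidean distance between two points of equal dimension.
  def distance(a: Point, b: Point): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Mean distance from each point to whichever center is nearest to it.
  def meanDistToNearest(points: Seq[Point], centers: Seq[Point]): Double =
    points.map(p => centers.map(c => distance(p, c)).min).sum / points.size
}
```

For the one-dimensional points 0, 2, 10 and 12, a single center at 6 gives a score of 5.0, two centers at 1 and 11 give 1.0, and four centers placed on the points themselves give 0.0. The score keeps falling as K grows, even though the last arrangement groups nothing.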

To evaluate the clustering, we can use the distance from each point to the center of its cluster:

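import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Euclidean distance between two vectors
def distance(a: Vector, b: Vector) =
  math.sqrt(a.toArray.zip(b.toArray).map(p => p._1 - p._2).map(d => d * d).sum)

// Distance from a point to the center of the cluster it is assigned to
def distToCentroid(datum: Vector, model: KMeansModel) = {
  val cluster = model.predict(datum)
  val centroid = model.clusterCenters(cluster)
  distance(centroid, datum)
}

// Mean distance to centroid for a model trained with the given k
def clusteringScore(data: RDD[Vector], k: Int) = {
  val kmeans = new KMeans()
  kmeans.setK(k)
  kmeans.setRuns(10)
  kmeans.setEpsilon(1.0e-6)
  val model = kmeans.run(data)
  data.map(datum => distToCentroid(datum, model)).mean()
}

// Print the score for K = 30, 40, ..., 150
(30 to 150 by 10).map(k => clusteringScore(data, k)).foreach(println)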

With this score in hand, we can examine in turn how each value of K affects the clustering.
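One possible way to read the printed scores, offered here as a suggestion rather than as part of the original procedure, is to pair each score with its K and look at how much the score falls from one K to the next. The helper below is a plain-Scala sketch; `ElbowSketch` and `relativeDrops` are hypothetical names, and the rule "a small drop suggests diminishing returns" is a heuristic assumption.

```scala
object ElbowSketch {
  // For consecutive (k, score) pairs, the fraction by which the score fell
  // when moving up to the larger k. Pairs are sorted by k first.
  def relativeDrops(scores: Seq[(Int, Double)]): List[(Int, Double)] =
    scores.sortBy(_._1).sliding(2).collect {
      case Seq((_, prev), (k, cur)) if prev > 0 => (k, (prev - cur) / prev)
    }.toList
}
```

With made-up scores chosen purely to show the shape of the output, `ElbowSketch.relativeDrops(Seq((10, 8.0), (20, 4.0), (30, 3.0)))` returns `List((20, 0.5), (30, 0.25))`: the score halves between K = 10 and K = 20, then falls by only a quarter at K = 30.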

