2025-09-16 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)05/31 Report--
This article explains how to analyze data with K-means clustering. It covers the algorithm and a Spark MLlib implementation in detail, in the hope of giving readers who face this problem a simple and workable approach.
One: Algorithm
K-means is one of the most commonly used clustering algorithms in machine learning, and also one of the most basic. Clustering is a form of unsupervised learning. The algorithm has two steps:
1. Given the number of clusters K, choose K center points and assign every other point to the center it is closest to.
2. Recompute the center of each cluster as the mean of its points, reassign the points, and iterate until the centers stop moving.
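The two steps above can be sketched on a single machine in plain Scala. This is only an illustration under stated assumptions, not Spark MLlib's implementation: the names `KMeansSketch`, `closest`, `mean` and `run` are hypothetical, the caller supplies the K starting centers, and the loop stops once no center moves by more than `epsilon`.

```scala
// A toy, single-machine version of the two steps; not Spark MLlib's implementation.
object KMeansSketch {
  type Point = Array[Double]

  // Euclidean distance between two points of equal dimension.
  def distance(a: Point, b: Point): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Index of the center nearest to p.
  def closest(p: Point, centers: Seq[Point]): Int =
    centers.indices.minBy(i => distance(p, centers(i)))

  // Component-wise mean of a non-empty group of points.
  def mean(points: Seq[Point]): Point = {
    val dim = points.head.length
    Array.tabulate(dim)(d => points.map(_(d)).sum / points.size)
  }

  // initial holds the K starting centers (assumption: chosen by the caller).
  def run(points: Seq[Point], initial: Seq[Point],
          maxIterations: Int = 20, epsilon: Double = 1.0e-6): Seq[Point] = {
    var centers = initial
    var moved = Double.MaxValue
    var iteration = 0
    while (iteration < maxIterations && moved > epsilon) {
      // Step 1: assign every point to its nearest center.
      val groups = points.groupBy(p => closest(p, centers))
      // Step 2: move each center to the mean of its points (keep it if it has none).
      val updated = centers.indices.map(i => groups.get(i).map(mean).getOrElse(centers(i)))
      moved = centers.zip(updated).map { case (a, b) => distance(a, b) }.max
      centers = updated
      iteration += 1
    }
    centers
  }
}
```

For example, with the four points (0,0), (0,1), (10,10), (10,11) and starting centers (0,0) and (10,10), `KMeansSketch.run` settles on the centers (0, 0.5) and (10, 10.5) after two iterations.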
Two: Spark MLlib
Spark MLlib provides an implementation of the K-means algorithm.
1. Data source
The data comes from the KDD Cup website and consists of network connection records.
On the site, go to Data -> kddcup.data.zip and download it.
The format of each row of data is as follows:
0,tcp,http,SF,215,45076,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,0,0,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,normal.
The last field is the label; all the others are features. The labels may not be complete: they mark only the anomalies that are already known, whereas K-means can surface unknown ones.
2. Read data
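    import org.apache.spark.mllib.linalg.Vectors

    val rawDataPath = "Your kddcup.data.txt Path"
    val rawData = sc.textFile(rawDataPath)

    // Parse each line into (label, feature vector); skip malformed lines.
    val labelsAndData = rawData.flatMap { line =>
      val buffer = line.split(',').toBuffer
      if (buffer.length == 42) {
        buffer.remove(1, 3)                           // drop the three non-numeric columns
        val label = buffer.remove(buffer.length - 1)  // the last column is the label
        val vector = Vectors.dense(buffer.map(_.toDouble).toArray)
        Some((label, vector))
      } else {
        None
      }
    }

    // Feature vectors only; this is the `data` the later steps train on.
    val data = labelsAndData.values.cache()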
This drops the 2nd, 3rd and 4th columns, which are not numeric, and separates the last column (the label) from the features.
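To see what the column handling does to a single row without starting Spark, the same steps can be applied to one line in plain Scala. This is a minimal sketch: `ParseSketch` and `parse` are hypothetical names, and the result is a plain `Array[Double]` rather than an MLlib `Vector`.

```scala
object ParseSketch {
  // Returns Some((label, features)) for a well-formed 42-field row, None otherwise.
  def parse(line: String): Option[(String, Array[Double])] = {
    val fields = line.split(',').map(_.trim).toBuffer
    if (fields.length == 42) {
      fields.remove(1, 3)                           // drop columns 2-4 (e.g. tcp, http, SF)
      val label = fields.remove(fields.length - 1)  // the last column is the label
      Some((label, fields.map(_.toDouble).toArray))
    } else {
      None
    }
  }
}
```

Applied to the sample row shown earlier, `parse` returns the label `normal.` together with 38 numeric features, the first three being 0.0, 215.0 and 45076.0. A line with any other number of fields yields `None`.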
3. K-Means algorithm
    import org.apache.spark.mllib.clustering.KMeans

    val k = 2                  // example value; MLlib's default K is also 2
    val kmeans = new KMeans()
    kmeans.setK(k)
    kmeans.setRuns(10)         // number of runs from different starting centers (Spark 1.x API, deprecated in later releases)
    kmeans.setEpsilon(1.0e-6)  // convergence threshold on how far the centers move; the smaller it is, the longer the iteration runs
    val model = kmeans.run(data)
Use the generated model to assign each record to a cluster, then count the labels that fall into each cluster:
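    // Count how many records of each label land in each cluster.
    val clusterLabelCount = labelsAndData.map { case (label, datum) =>
      val cluster = model.predict(datum)
      (cluster, label)
    }.countByValue()

    // Print one line per (cluster, label) pair, sorted by cluster and then label.
    clusterLabelCount.toSeq.sorted.foreach { case ((cluster, label), count) =>
      println(f"$cluster%1s$label%18s$count%8s")
    }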
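The counting step can be tried on a small local collection to see what the printed table contains. This is a sketch under assumptions: `CountSketch`, `countPairs` and `report` are hypothetical names, and `groupBy` on a local `Seq` stands in for `countByValue` on an RDD.

```scala
object CountSketch {
  // Local stand-in for RDD.countByValue: how often each (cluster, label) pair occurs.
  def countPairs(pairs: Seq[(Int, String)]): Map[(Int, String), Long] =
    pairs.groupBy(identity).map { case (pair, group) => pair -> group.size.toLong }

  // One formatted line per pair, sorted by cluster and then label.
  def report(counts: Map[(Int, String), Long]): Seq[String] =
    counts.toSeq.sorted.map { case ((cluster, label), count) =>
      f"$cluster%1s$label%18s$count%8s"
    }
}
```

When reading such a table, a cluster dominated by a single label suggests that the clustering separates that label well, while a cluster that mixes many labels suggests K is too small to tell them apart.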
4. How to choose K
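Choosing K is a problem in its own right. In general, the larger K is, the closer each point lies to its center and the better the clustering appears. Taken to the extreme, if every point is its own cluster, the distance is zero, so a larger K is not automatically a better choice.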
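The extreme case can be checked with a few lines of plain Scala. This is a toy sketch rather than the Spark code: `ScoreSketch` and `meanDistance` are hypothetical names, and the centers are passed in by hand instead of being learned.

```scala
object ScoreSketch {
  // Euclidean distance between two points of equal dimension.
  def distance(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Mean distance from each point to its nearest center (lower looks "better").
  def meanDistance(points: Seq[Array[Double]], centers: Seq[Array[Double]]): Double =
    points.map(p => centers.map(c => distance(p, c)).min).sum / points.size
}
```

For the four points (0,0), (0,2), (10,0), (10,2), a single center at (5,1) gives a mean distance of about 5.1, the two centers (0,1) and (10,1) give 1.0, and using the four points themselves as centers gives 0.0. The score keeps falling as K grows even though two clusters already describe this data.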
To evaluate a clustering, we can measure the distance from each point to the center of its cluster:
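    import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    // Euclidean distance between two vectors.
    def distance(a: Vector, b: Vector) = {
      math.sqrt(a.toArray.zip(b.toArray).map(p => p._1 - p._2).map(d => d * d).sum)
    }

    // Distance from a point to the center of the cluster it is assigned to.
    def distToCentroid(datum: Vector, model: KMeansModel) = {
      val cluster = model.predict(datum)
      val centroid = model.clusterCenters(cluster)
      distance(centroid, datum)
    }

    // Score for a given K: the mean distance to the centroid over all points.
    def clusteringScore(data: RDD[Vector], k: Int) = {
      val kmeans = new KMeans()
      kmeans.setK(k)
      kmeans.setRuns(10)
      kmeans.setEpsilon(1.0e-6)
      val model = kmeans.run(data)
      data.map(datum => distToCentroid(datum, model)).mean()
    }

    // Print the score for K = 30, 40, ..., 150.
    (30 to 150 by 10).map(k => clusteringScore(data, k)).foreach(println)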
With this score, we can examine in turn how each value of K affects the clustering.
That covers how to analyze K-means clustering. I hope the content above is of some help. If you still have questions, you can follow the industry information channel to learn more.