2025-03-28 Update From: SLTechnology News&Howtos
This article introduces the principle of the KMeans algorithm and its Spark implementation. The content is quite detailed; interested readers can use it as a reference, and I hope it is helpful to you.
1. Introduction to the KMeans algorithm
K-Means is an unsupervised clustering algorithm. It is easy to implement and clusters well, so it is widely used.
The K-Means algorithm, also known as K-averages or K-means, is generally the first clustering algorithm one learns.
The K here is a constant that must be set in advance. The algorithm iteratively groups M unlabeled samples into K clusters.
Samples are usually grouped using the distance between them as the criterion.
Core idea: K-Means is an iterative cluster-analysis algorithm. It randomly selects K objects as the initial cluster centers, computes the distance between each object and each cluster center, and assigns each object to its nearest center. A center together with the objects assigned to it represents one cluster. After every assignment pass, each cluster's center is recalculated from the objects currently in that cluster. The process repeats until a termination condition is met. The termination condition can be that no (or only a minimal number of) objects are reassigned to a different cluster, no (or only a minimal number of) cluster centers change, or the sum of squared errors reaches a local minimum.
2. KMeans algorithm flow
2.1 Read the file, prepare the data, and preprocess it
2.2 Randomly pick K points from the dataset as the initial center points
2.3 Traverse the dataset, compute the distance from each point to each of the K centers, and assign each point to its nearest center
2.4 Compute new center points from the new assignment
2.5 Start the next iteration with the new center points (back to step 2.3)
Conditions for exiting the loop:
1. A specified number of iterations is reached
2. The center points hardly move any more (that is, the total distance the centers move is less than a given constant, such as 0.00001)
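The flow above can be sketched in plain Scala (no Spark). This is an illustrative sketch, not the article's implementation; the names (`assign`, `updateCenters`, `run`) and the empty-cluster handling are my own assumptions.

```scala
// Minimal K-Means sketch following steps 2.2-2.5 and both exit conditions
object KMeansSketch {
  type Point = Seq[Double]

  def distance(a: Point, b: Point): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Step 2.3: assign each sample to the index of its nearest center
  def assign(points: Seq[Point], centers: Seq[Point]): Seq[Int] =
    points.map(p => centers.indices.minBy(i => distance(p, centers(i))))

  // Step 2.4: recompute each center as the mean of its members
  def updateCenters(points: Seq[Point], labels: Seq[Int], k: Int, old: Seq[Point]): Seq[Point] =
    (0 until k).map { c =>
      val members = points.zip(labels).collect { case (p, l) if l == c => p }
      if (members.isEmpty) old(c) // keep an empty cluster's previous center
      else members.transpose.map(dim => dim.sum / members.size)
    }

  // Step 2.5: loop until the iteration cap (exit condition 1)
  // or until the centers stop moving (exit condition 2)
  def run(points: Seq[Point], k: Int, maxIter: Int = 40, tol: Double = 1e-5): Seq[Point] = {
    var centers: Seq[Point] = scala.util.Random.shuffle(points).take(k) // step 2.2
    var moved = Double.MaxValue
    var iter = 0
    while (iter < maxIter && moved > tol) {
      val labels = assign(points, centers)
      val next = updateCenters(points, labels, k, centers)
      moved = centers.zip(next).map { case (a, b) => distance(a, b) }.sum
      centers = next
      iter += 1
    }
    centers
  }
}
```

On two well-separated groups, e.g. {0, 1} and {10, 11} on a line, the loop settles on centers near 0.5 and 10.5 in a couple of iterations.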
3. Advantages and disadvantages of the KMeans algorithm
Choice of K: the value of k strongly influences the final result, yet it must be given in advance. Choosing an appropriate k requires prior knowledge; it is hard to estimate out of thin air, and a poor choice may lead to poor results.
Outliers: during each iteration, K-Means uses the mean of all points in a cluster as the new centroid (center point). If the cluster contains outliers, the mean deviates badly. For example, if a cluster contains the five values 2, 4, 6, 8, 100, the new centroid is 24, which is clearly far from most of the points. In this situation the median, 6, may work better than the mean; clustering with the median is called K-Medoids (K-median) clustering.
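The article's outlier example can be checked in two lines of Scala: with the values 2, 4, 6, 8, 100 the mean is dragged to 24 by the outlier, while the median stays at 6, near the bulk of the data.

```scala
// Mean vs. median on a cluster containing one outlier (the article's example)
val xs = Seq(2.0, 4.0, 6.0, 8.0, 100.0)
val mean = xs.sum / xs.size               // 24.0, pulled toward the outlier 100
val median = xs.sorted.apply(xs.size / 2) // 6.0, representative of most points
```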
Sensitivity to initial values: K-Means is sensitive to its initial values, and different initial values may lead to different cluster partitions. To avoid anomalous final results caused by this sensitivity, we can initialize several sets of initial centers, build a partition from each, and then keep the best one. From this observation several variants are derived: the bisecting K-Means algorithm, the K-Means++ algorithm, the K-Means|| algorithm, the Canopy algorithm, and so on.
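Of the variants listed above, K-Means++ addresses initialization directly: each new center is sampled with probability proportional to its squared distance from the nearest already-chosen center, which spreads the initial centers out. Below is a hedged sketch of that seeding step only (names like `kppInit` are my own; this is not code from the article).

```scala
import scala.collection.mutable.ArrayBuffer
import scala.util.Random

object KppSeed {
  // Squared Euclidean distance, the weight used by K-Means++ sampling
  def distSq(a: Seq[Double], b: Seq[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Pick k initial centers: the first uniformly at random, each subsequent one
  // with probability proportional to the squared distance to its nearest center
  def kppInit(points: Seq[Seq[Double]], k: Int, rnd: Random = new Random()): Seq[Seq[Double]] = {
    val centers = ArrayBuffer(points(rnd.nextInt(points.size)))
    while (centers.size < k) {
      val d2 = points.map(p => centers.map(c => distSq(p, c)).min) // nearest-center dist^2
      var r = rnd.nextDouble() * d2.sum
      val idx = d2.indexWhere { d => r -= d; r <= 0 } // weighted sampling by running sum
      centers += points(if (idx >= 0) idx else points.size - 1)
    }
    centers.toSeq
  }
}
```

The chosen centers are always actual data points, so a later K-Means run starts from well-spread, in-distribution seeds.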
Its simple implementation, good portability, and good scalability make it one of the most commonly used clustering algorithms.
4. Spark implementation of the KMeans algorithm
4.1 Data download and description
Link: https://pan.baidu.com/s/1FmFxSrPIynO3udernLU0yQ (extraction code: hell)
The iris dataset contains 150 records in three classes, 50 per class; each record has 4 features: sepal length, sepal width, petal length, and petal width.
Using these four features, we cluster the flowers with K set to 3 and then compare the clusters with the actual labels.
4.2 implementation
Instead of using the MLlib library, this uses a native Scala implementation.
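For contrast, the same clustering can be done with Spark's built-in MLlib `KMeans`, which the article deliberately avoids. The sketch below is an assumption on my part: the file path, column names, and seed are illustrative, and the CSV is assumed to have four numeric columns plus a label.

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

object MllibKMeansDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("MllibKMeansDemo").getOrCreate()

    // Assumed schema: four numeric feature columns followed by a label column
    val raw = spark.read.option("inferSchema", "true").csv("data/iris.csv")
      .toDF("sepalLength", "sepalWidth", "petalLength", "petalWidth", "label")

    // MLlib expects a single vector column, conventionally named "features"
    val features = new VectorAssembler()
      .setInputCols(Array("sepalLength", "sepalWidth", "petalLength", "petalWidth"))
      .setOutputCol("features")
      .transform(raw)

    val model = new KMeans().setK(3).setSeed(1234L).fit(features)
    model.transform(features).select("features", "prediction").show(10)

    spark.stop()
  }
}
```

The hand-rolled version below makes the iteration logic explicit, which is exactly what MLlib hides.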
package com.hoult.work

import org.apache.commons.lang3.math.NumberUtils
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

import scala.collection.mutable.ListBuffer
import scala.math.{pow, sqrt}
import scala.util.Random

object KmeansDemo {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .master("local[*]")
      .appName(this.getClass.getCanonicalName)
      .getOrCreate()
    val sc = spark.sparkContext

    val dataset = spark.read.textFile("data/iris.csv")
      .rdd
      .map(_.split(",").filter(NumberUtils.isNumber _).map(_.toDouble))
      .filter(!_.isEmpty)
      .map(_.toSeq)

    val res: RDD[(Seq[Double], Int)] = train(dataset, 3)

    res.sample(false, 0.1, 1234L)
      .map(tp => (tp._1.mkString(","), tp._2))
      .foreach(println)
  }

  // The parameters are the dataset, K, the maximum number of iterations, and the
  // cost-function change threshold; the last two have default values you can change
  def train(data: RDD[Seq[Double]], k: Int, maxIter: Int = 40, tol: Double = 1e-4) = {
    val sc: SparkContext = data.sparkContext
    var i = 0               // iteration count
    var cost = 0D           // initial cost function value
    var convergence = false // convergence flag: cost changed by less than tol

    // step 1: take k random samples as the initial cluster centers
    var initk: Array[(Seq[Double], Int)] =
      data.takeSample(false, k, Random.nextLong()).zip(Range(0, k))

    var res: RDD[(Seq[Double], Int)] = null

    while (i < maxIter && !convergence) {
      val bcCenters = sc.broadcast(initk)
      val centers: Array[(Seq[Double], Int)] = bcCenters.value

      // compute the distance from each sample point to each cluster center
      val clustered: RDD[(Int, (Double, Seq[Double], Int))] = data.mapPartitions(points => {
        val listBuffer = new ListBuffer[(Int, (Double, Seq[Double], Int))]()
        points.foreach { point =>
          // (nearest center id, (minimum distance, sample point, count 1)):
          // assign the sample to the nearest cluster center
          val cost: (Int, (Double, Seq[Double], Int)) = centers.map(ct => {
            ct._2 -> (getDistance(ct._1.toArray, point.toArray), point, 1)
          }).minBy(_._2._1)
          listBuffer.append(cost)
        }
        listBuffer.toIterator
      })

      // recompute the cluster centers
      val mpartition: Array[(Int, (Double, Seq[Double]))] = clustered
        .reduceByKey((a, b) => {
          val cost = a._1 + b._1                                   // accumulated cost function
          val count = a._3 + b._3                                  // sample count per cluster
          val newCenters = a._2.zip(b._2).map(tp => tp._1 + tp._2) // coordinate sums
          (cost, newCenters, count)
        })
        .map { case (clusterId, (costs, point, count)) =>
          clusterId -> (costs, point.map(_ / count)) // new cluster center = mean
        }
        .collect()

      val newCost = mpartition.map(_._2._1).sum     // new cost function value
      convergence = math.abs(newCost - cost) <= tol // check convergence
      cost = newCost
      initk = mpartition.map(tp => (tp._2._2, tp._1))

      // the clustering result: (sample point, cluster id)
      res = clustered.map(tp => (tp._2._2, tp._1))
      i += 1
    }
    // return the clustering result
    res
  }

  def getDistance(x: Array[Double], y: Array[Double]): Double = {
    sqrt(x.zip(y).map(z => pow(z._1 - z._2, 2)).sum)
  }
}
Complete code: https://github.com/hulichao/bigdata-spark/blob/master/src/main/scala/com/hoult/work/KmeansDemo.scala
Result Screenshot:
That is all on the principle of the KMeans algorithm and its Spark implementation. I hope the above content is helpful and lets you learn something new. If you think the article is good, you can share it for more people to see.