The principle of KNN algorithm and how to implement it with Spark

2025-03-28 Update From: SLTechnology News&Howtos

This article explains the principle of the KNN algorithm and walks through a simple Spark implementation of it.

1. Brief introduction to KNN (K-nearest neighbor algorithm)

First of all, KNN is a classification algorithm and a form of supervised machine learning: the samples in the training set are labeled with their categories. When a test object exactly matches a training object, it can be classified directly. But when there is no exact match and the training objects belong to several classes, how should the test object be assigned? KNN answers this by measuring the distance between feature values. The idea is that if most of the k most similar samples in the feature space (that is, the nearest neighbors in the feature space) belong to a certain category, then the sample also belongs to that category, where K is usually an integer less than 20. In the KNN algorithm, the selected neighbors are all objects that have already been correctly classified. When making the classification decision, the method determines the category of the sample to be classified only from the categories of the nearest sample or samples.
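One common choice for the distance mentioned above is the Euclidean distance, which is also what the getDistance function in the Spark code later in this article computes. For two samples x and y with n features each, it can be written as:

```latex
d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
```

Other distance measures can be used as well; the rest of the algorithm does not change.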

The core idea of the KNN algorithm is that if most of the K nearest samples in the feature space belong to a certain category, then the sample also belongs to that category and shares the characteristics of the samples in it. The class decision therefore depends only on a very small number of adjacent samples. Because KNN relies on a limited set of nearby samples rather than on discriminating class domains, it is better suited than other methods to sample sets whose class domains cross or overlap heavily.
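The idea above can be sketched in a few lines of plain Scala, without Spark. This is only an illustration: the names KnnSketch, Sample, distance and classify are made up for this example, Euclidean distance is assumed, and the training set is assumed to be non-empty.

```scala
// Minimal plain-Scala KNN sketch (no Spark). All names are illustrative.
object KnnSketch {
  final case class Sample(label: String, features: Array[Double])

  // Euclidean distance between two feature vectors of equal length.
  def distance(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.zip(b).map { case (p, q) => (p - q) * (p - q) }.sum)

  // Returns the label carried by most of the k training samples nearest to `query`.
  // Assumes `train` is non-empty; ties are broken arbitrarily.
  def classify(train: Seq[Sample], query: Array[Double], k: Int): String = {
    // Sort by distance to the query and keep the k nearest neighbors.
    val nearest = train.sortBy(s => distance(s.features, query)).take(k)
    // Count how many of those neighbors carry each label.
    val votes = nearest.groupBy(_.label).map { case (label, members) => label -> members.size }
    // Majority vote: the label with the highest count wins.
    votes.maxBy(_._2)._1
  }
}
```

For example, with two samples labeled "a" near the origin and three labeled "b" near (1, 1), calling KnnSketch.classify(train, Array(0.05, 0.0), 3) picks the two "a" neighbors plus one "b" neighbor, so the vote goes to "a".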

2. KNN algorithm flow

2.1 Prepare the data and preprocess it.

2.2 Calculate the distance from the test sample point (that is, the point to be classified) to every other sample point.

2.3 Sort the distances and select the K points with the smallest distances.

2.4 Compare the categories of those K points and, by majority vote, assign the test sample point to the category that accounts for the highest proportion of the K points.

3. Advantages and disadvantages of KNN algorithm

Advantages: easy to understand, easy to implement, no parameters to estimate, and no training phase.

Disadvantages: if one class in the data set has far more samples than the others, test samples tend to be assigned to that class, because they are more likely to be close to its points.
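The disadvantage above can be seen on a tiny example. The sketch below uses made-up one-dimensional points (not the Iris data) and hypothetical names: class "A" has two samples, class "B" has ten, and the query point sits right next to the "A" samples.

```scala
// Illustrates how a large class can outvote a small one as k grows.
// Plain Scala, made-up one-dimensional points; all names are illustrative.
object ImbalanceSketch {
  // Majority label among the k training points closest to `query`.
  def classify(train: Seq[(Double, String)], query: Double, k: Int): String = {
    val nearest = train.sortBy { case (x, _) => math.abs(x - query) }.take(k)
    val votes = nearest.groupBy(_._2).map { case (label, pts) => label -> pts.size }
    votes.maxBy(_._2)._1
  }

  // Two samples of class "A" near 0, ten samples of class "B" from 1.0 upward.
  val small: Seq[(Double, String)] = Seq(0.0 -> "A", 0.1 -> "A")
  val large: Seq[(Double, String)] = (0 until 10).map(i => (1.0 + i * 0.1) -> "B")
  val train: Seq[(Double, String)] = small ++ large

  val withK1: String = classify(train, 0.2, 1) // only the single nearest point votes
  val withK5: String = classify(train, 0.2, 5) // two "A" neighbors plus three "B" neighbors
}
```

With K = 1 the query is labeled "A", but with K = 5 the three "B" neighbors outvote the two "A" neighbors even though they are much farther away, so the result flips to "B".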

4. Spark implementation of KNN algorithm

4.1 Data download and explanation

Link: https://pan.baidu.com/s/1FmFxSrPIynO3udernLU0yQ extraction code: hell

The Iris data set contains 150 records in three classes, 50 records per class. Each record has 4 features: sepal length, sepal width, petal length, and petal width.

From these four features, we can predict which of the three species (iris-setosa, iris-versicolour, iris-virginica) an iris flower belongs to.
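As a rough illustration of what one record looks like in code, the sketch below parses a single comma-separated line into a typed record. The column layout (four numeric features followed by an optional species name) is an assumption made for this example; the downloaded file may be laid out differently, and the names IrisRecordSketch, IrisRecord and parse are hypothetical.

```scala
// Hypothetical record type and parser; the column order is an assumption.
object IrisRecordSketch {
  final case class IrisRecord(
      sepalLength: Double,
      sepalWidth: Double,
      petalLength: Double,
      petalWidth: Double,
      species: Option[String]) // None for an unlabeled (test) record

  // Parses "sepalLength,sepalWidth,petalLength,petalWidth[,species]".
  // Returns None when there are too few fields or a feature is not numeric.
  def parse(line: String): Option[IrisRecord] = {
    val fields = line.split(",").map(_.trim)
    if (fields.length < 4) None
    else scala.util.Try {
      val f = fields.take(4).map(_.toDouble)
      val species = if (fields.length > 4 && fields(4).nonEmpty) Some(fields(4)) else None
      IrisRecord(f(0), f(1), f(2), f(3), species)
    }.toOption
  }
}
```

A labeled line yields a record whose species is filled in, while a line with only the four measurements yields a record with species set to None, which is the kind of record the classifier has to predict.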

4.2 Implementation

    package com.hoult.work

    import org.apache.spark.rdd.RDD
    import org.apache.spark.{SparkConf, SparkContext}

    object KNNDemo {
      def main(args: Array[String]): Unit = {
        // 1. Initialize
        val conf = new SparkConf().setAppName("SimpleKnn").setMaster("local[*]")
        val sc = new SparkContext(conf)
        val K = 15

        // 2. Read the data and wrap each line in a LabelPoint
        val data: RDD[LabelPoint] = sc.textFile("data/lris.csv")
          .map(line => {
            val arr = line.split(",")
            if (arr.length == 6) {
              // rows with 6 fields carry the label in the last field
              LabelPoint(arr.last, arr.init.map(_.toDouble))
            } else {
              // any other row is treated as unlabeled test data
              println(arr.toBuffer)
              LabelPoint("", arr.map(_.toDouble))
            }
          })

        // 3. Split into sample (labeled) data and test (unlabeled) data
        val sampleData = data.filter(_.label != "")
        val testData = data.filter(_.label == "").map(_.point).collect()

        // 4. Find the distance between each test point and every sample point
        testData.foreach(elem => {
          val distance = sampleData.map(x => (getDistance(elem, x.point), x.label))
          // get the nearest K samples
          val minDistance = distance.sortBy(_._1).take(K)
          // take the labels of these K samples and keep the most frequent one
          val labels = minDistance.map(_._2)
            .groupBy(x => x)
            .mapValues(_.length)
            .toList
            .sortBy(_._2).reverse
            .take(1)
            .map(_._1)
          // print the test point followed by its predicted label
          println(s"${elem.toBuffer.mkString(",")}, ${labels.toBuffer.mkString(",")}")
        })

        sc.stop()
      }

      case class LabelPoint(label: String, point: Array[Double])

      import scala.math._

      // Euclidean distance between two feature vectors
      def getDistance(x: Array[Double], y: Array[Double]): Double = {
        sqrt(x.zip(y).map(z => pow(z._1 - z._2, 2)).sum)
      }
    }

The above covers the principle of the KNN algorithm and how it can be implemented with Spark.
