In addition to Weibo, there is also WeChat
Please pay attention
WeChat public account
Shulou
2025-03-28 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Internet Technology >
Share
Shulou(Shulou.com)06/01 Report--
This article shows you the principle of KNN algorithm and Spark implementation, the content is concise and easy to understand, absolutely can make your eyes bright, through the detailed introduction of this article, I hope you can get something.
1. Brief introduction of KNN-K nearest neighbor algorithm
First of all, KNN is a classification algorithm, supervised machine learning, label the category of the training set, when the test object and the training object exactly match, it can be classified, but the test object and the training object of multiple classes, how to match, the previous can distinguish whether the test object term a training object, but if it is multiple training object classes, then how to solve this problem So KNN,KNN is classified by measuring the distance between different eigenvalues. Its idea is that if most of the k most similar samples in the feature space (that is, the nearest neighbor in the feature space) belong to a certain category, then the sample also belongs to this category, where K is usually an integer less than 20. In the KNN algorithm, the selected neighbors are all objects that have been correctly classified. In the decision-making of classification, this method only determines the category of the sample to be divided according to the category of the nearest sample or samples.
The core idea of KNN algorithm is that if most of the K most adjacent samples in the feature space belong to a certain category, then the sample also belongs to this category and has the characteristics of samples in this category. In determining the classification decision, this method only determines the category of the sample to be divided according to the category of the nearest sample or samples. KNN method is only related to a very small number of adjacent samples in category decision-making. Because the KNN method mainly depends on the limited adjacent samples, rather than the method of discriminating the class domain, the KNN method is more suitable than other methods for the sample set with more cross or overlap of the class domain.
2.KNN algorithm flow 2.1 prepares the data and preprocesses the data. 2.2 calculate the distance from the test sample point (that is, the point to be sorted) to each other sample point. 2.3 sort each distance, and then select the K points with the smallest distance. 2.4 compare the categories to which the K points belong, and according to the principle that the minority is subordinate to the majority, classify the test sample points into the one that accounts for the highest proportion of the K points. 3. Advantages and disadvantages of KNN algorithm
Advantages: easy to understand, easy to implement, no need to estimate parameters, no training
Disadvantages: if a class in the dataset has a large amount of data, it is bound to cause the test set to run to this class more, because the probability of being closer to these points is also higher.
Spark implementation of 4.KNN algorithm 4.1 data download and explanation
Link: https://pan.baidu.com/s/1FmFxSrPIynO3udernLU0yQ extraction code: hell copy this content after opening Baidu network disk mobile phone App, the operation is more convenient.
Iris data set, which contains three types of 150 tone data, each containing 50 data, and each record contains 4 features: Calyx length, calyx width, petal length, petal width.
Through these four characteristics, we can predict which species of irises (iris-setosa, iris-versicolour, iris-virginica) belong to.
Implement package com.hoult.workimport org.apache.spark.rdd.RDDimport org.apache.spark. {SparkConf, SparkContext} object KNNDemo {def main (args: Array [String]): Unit = {/ / 1. Initialize val conf=new SparkConf () .setAppName ("SimpleKnn") .setMaster ("local [*]") val sc=new SparkContext (conf) val Kraft 15 / / 2. Read data and encapsulate data val data: RDD [LabelPoint] = sc.textFile ("data/lris.csv") .map (line = > {val arr = line.split (",") if (arr.length = = 6) {LabelPoint (arr.last, arr.init.map (_ .toDouble))} else {println (arr.toBuffer) LabelPoint ("") Arr.map (_ .toDouble)}}) / / 3. Filter out sample data and test data val sampleData=data.filter (_ .label! = ") val testData=data.filter (_ .label = =") .map (_ .point) .collect () / / 4. Find the distance between each test data and the sample data testData.foreach (elem= > {val distance=sampleData.map (x = > (getDistance (elem,x.point)) X.label)) / / get the nearest k samples val minDistance=distance.sortBy (_. _ 1) .take (K) / / take out the label of these k samples and get the label val labels=minDistance.map (_. _ 2) .groupBy (x = > x) .mapValues (_ .length) .toList .sortBy (_. _ 2) of the test data with the most label. ) .reverse .take (1) .map (_. _ 1) printf (s "${elem.toBuffer.mkString (") ")}, ${labels.toBuffer.mkString (", ")}) println ()}) sc.stop ()} case class LabelPoint (label:String,point:Array [Double]) import scala.math._ def getDistance (x: array [double]) Y:Array [Double]): Double= {sqrt (x.zip (y) .map (z = > pow (z.fu1murz.fu2)) .sum)}} the above is the principle of KNN algorithm and how Spark is implemented. Have you learned any knowledge or skills? If you want to learn more skills or enrich your knowledge reserve, you are welcome to follow the industry information channel.
Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.
Views: 0
*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.
Continue with the installation of the previous hadoop.First, install zookooper1. Decompress zookoope
"Every 5-10 years, there's a rare product, a really special, very unusual product that's the most un
© 2024 shulou.com SLNews company. All rights reserved.