What are the differences between knn and k-means

2025-03-28 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Internet Technology >


Shulou(Shulou.com)06/02 Report--

This article explains in detail the differences between knn and k-means. The editor finds it very practical and shares it with you as a reference; I hope you gain something from reading it.

The difference between knn and k-means: 1. k-means is a typical distance-based clustering algorithm that uses distance as the measure of similarity, that is, the closer two objects are, the more similar they are; 2. the knn algorithm has no explicit pre-training process: when the program runs, the data set is loaded into memory and classification can begin immediately.

The difference between knn and k-means:

1. The process and principle of k-means clustering algorithm

The k-means algorithm (k-means clustering algorithm) is a basic partition algorithm for which the number of clusters is known in advance. It is a typical distance-based clustering algorithm that uses distance as the measure of similarity: the closer two objects are, the more similar they are. Similarity is measured by Euclidean distance (intuitively, the straight-line distance between two points; the Euclidean distance simply standardizes this notion and extends it to N dimensions). The algorithm can handle large data sets efficiently. The clustering result partitions the data into k classes. According to how the clustering result is expressed, it can be divided into the hard k-means (HCM) algorithm, the fuzzy k-means (FCM) algorithm, and the probabilistic k-means (PCM) algorithm.
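The Euclidean distance described above extends directly to N dimensions; a minimal sketch in Python (the function name is ours):

```python
import math

def euclidean_distance(a, b):
    # Straight-line distance between two points, generalized to N dimensions.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# 2-D example: the familiar 3-4-5 right triangle.
print(euclidean_distance((0, 0), (3, 4)))  # 5.0
```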

1.1. Basic thought

Given a clustering objective function, the algorithm uses iterative updating, and each iteration moves in the direction that reduces the objective function. The final clustering result minimizes the objective function and achieves a good classification effect.

1.2 Principle

The standard k-means algorithm first randomly selects k points as the initial cluster centers, then computes the distance from each data object to each cluster center and assigns the object to the class of the nearest center; each adjusted class then computes a new cluster center. In each iteration, it checks whether every sample's current assignment is correct, and adjusts it if not. After all the data have been processed, the cluster centers are updated and the next iteration begins. If during an iteration no data object changes class, no adjustment occurs and no cluster center changes, which indicates that the clustering criterion function f has converged, and the algorithm ends.
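The iteration described above can be sketched as follows; this is a minimal illustration, not the article's own code (the function names, the convergence test, and the fixed random seed are our choices):

```python
import random

def dist2(a, b):
    # Squared Euclidean distance (sufficient for nearest-center comparisons).
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, k, max_iter=100, seed=0):
    """Minimal k-means over a list of coordinate tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # random initial centers
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: dist2(p, centers[c]))
            clusters[i].append(p)
        # Update step: recompute each center as the mean of its cluster.
        new_centers = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:           # converged: no center moved
            break
        centers = new_centers
    return centers, clusters
```

For example, `k_means([(0, 0), (0, 1), (10, 10), (10, 11)], k=2)` separates the two obvious groups and returns their means as centers.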

1.3 Algorithm flow chart

1.4 How to choose the initial points of the algorithm?

1) Select K points that are as far apart from each other as possible

First, randomly select one point as the center of the first initial cluster; then select the point farthest from it as the center of the second initial cluster; then select the point whose distance to the nearest of the first two centers is largest as the center of the third initial cluster; and so on, until K initial cluster centers have been selected.
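This farthest-first selection can be sketched as follows, assuming squared Euclidean distance over plain coordinate tuples (the names are ours):

```python
import random

def dist2(a, b):
    # Squared Euclidean distance (the square root is not needed for comparisons).
    return sum((x - y) ** 2 for x, y in zip(a, b))

def farthest_point_init(points, k):
    """Pick K initial centers that are spread as far apart as possible.

    The first center is random; each later center is the point whose
    distance to its nearest already-chosen center is largest.
    """
    centers = [random.choice(points)]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist2(p, c) for c in centers)))
    return centers
```

For collinear points such as `[(0, 0), (3, 0), (10, 0)]`, asking for all three centers returns every point, since each newly chosen center is the one farthest from those already picked.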

2) Use hierarchical clustering or the Canopy algorithm for an initial clustering, then use the centers of those clusters as the initial cluster centers for the K-Means algorithm.

1.5 How to select k in the algorithm?

When the assumed number of clusters is below the true number, adding another cluster improves the clustering index sharply; once the assumed number reaches or exceeds the true number, the index improves only slowly. This "elbow" in the cluster index is an important reference for choosing k.

The diameter of a cluster is the maximum distance between any two points in the cluster.

The radius of a cluster is the maximum distance from any point in the cluster to the cluster center.
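As an illustration of the "elbow" heuristic, one possible cluster index (our choice here) is the within-cluster sum of squares; the centers below are hand-computed cluster means for a toy data set:

```python
def wcss(clusters, centers):
    # Within-cluster sum of squares: total squared distance of each
    # point to its cluster center; smaller means tighter clusters.
    return sum(
        sum((x - y) ** 2 for x, y in zip(p, c))
        for cl, c in zip(clusters, centers)
        for p in cl
    )

# Two well-separated groups: splitting them (k=2) drops the cost sharply,
# so the "elbow" for this data is at k=2.
pts_a = [(0, 0), (0, 2), (2, 0)]
pts_b = [(10, 10), (10, 12), (12, 10)]
k1 = wcss([pts_a + pts_b], [(17/3, 17/3)])          # k=1: one global mean
k2 = wcss([pts_a, pts_b], [(2/3, 2/3), (32/3, 32/3)])  # k=2: per-group means
print(round(k1, 1), round(k2, 1))  # roughly 310.7 vs 10.7
```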

1.6 Advantages, disadvantages, and how to improve them?

It is simple and efficient, but because it uses random initialization it is not guaranteed to find the best clustering. It also requires a reasonable number of clusters to be specified in advance: that is, K must be initialized.

2. K-nearest neighbor classification algorithm (KNN)

2.1 Problem introduction

The idea of KNN: consider a data set of labeled points of two classes, blue squares and red triangles, plus a green circle that is the point we need to classify. If K = 3 and the three points nearest the green point are two red triangles and one blue square, these three points vote, so the green point is classified as a red triangle. If K = 5, the five nearest points are two red triangles and three blue squares; these five points vote, so the green point is classified as a blue square. In other words, a sample is assigned to a category if most of its k nearest neighbors in the feature space belong to that category. We can see that KNN is essentially a statistical method based on the data; in fact, many machine learning algorithms are based on statistics of the data.

2.2 The KNN algorithm

Introduction

KNN, or K-Nearest Neighbor, is a kind of memory-based learning, also called instance-based learning, which belongs to lazy learning. That is, there is no explicit pre-training process: when the program runs, the data set is loaded into memory and classification can begin without any training. KNN is also a supervised learning algorithm: it computes the distance between the feature values of the new data and those of the training data, then selects the K (K >= 1) nearest neighbors for classification (by voting) or regression. If K = 1, the new data is simply assigned to the class of its single nearest neighbor.

Steps

1) Calculate the distance between the test data and each training data point; the Euclidean distance formula can be used.

2) Sort by increasing distance.

3) Select the K points with the smallest distances (the value of k is chosen by the user).

4) Count the frequency of each category among those K points.

5) Return the most frequent category among the K points as the predicted classification of the test data.
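The five steps above can be sketched directly as a minimal voting classifier (the function name and the toy data are ours):

```python
from collections import Counter
import math

def knn_classify(train, query, k):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (point, label) pairs.
    """
    # Steps 1-2: distance to every training point, sorted ascending.
    by_distance = sorted(train, key=lambda pl: math.dist(pl[0], query))
    # Steps 3-4: keep the k nearest and tally their labels.
    votes = Counter(label for _, label in by_distance[:k])
    # Step 5: the majority label is the prediction.
    return votes.most_common(1)[0][0]

# The introduction's example: two red triangles and one blue square nearest.
train = [((1, 1), "red"), ((1, 2), "red"), ((2, 1), "blue"),
         ((5, 5), "blue"), ((6, 5), "blue")]
print(knn_classify(train, (1.5, 1.5), k=3))  # -> red
```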

Characteristics

Nonparametric statistical method: no parameters need to be estimated, but the value of K must be chosen. With K = 1, a sample to be classified is assigned to the class of its single nearest neighbor. With K = |X|, frequencies are counted over the entire training set, and every sample to be classified is assigned to the majority class. K must be chosen sensibly: too small is easily disturbed by noise, and too large increases the computational cost. Complexity of the algorithm: the curse of dimensionality; as the dimension increases, the number of training samples required grows sharply, so dimensionality reduction is generally applied first.

2.3 Advantages and disadvantages of the algorithm

Advantages: simple and effective

Disadvantages: large amount of computation; weak interpretability of the output; all training samples must be stored.

3. The difference between KNN and k-means

K-means is an unsupervised clustering algorithm: K denotes the number of clusters, and the algorithm has an explicit iterative training process that ends when the cluster centers converge. KNN is a supervised classification algorithm: K denotes the number of nearest neighbors, and there is no explicit training process; classification happens only when a new sample arrives. What they have in common is that both use distance (typically Euclidean) as the measure of similarity between samples.

That is all on the differences between knn and k-means. I hope the above content is of some help to you and lets you learn more. If you think the article is good, share it for more people to see.

Welcome to subscribe to "Shulou Technology Information" to get the latest news, interesting stories, and hot topics in the IT industry, and to keep up with the hottest Internet news, technology news, and IT industry trends.


