This article introduces how to implement the KNN (k-nearest neighbor) algorithm in Python. The content is fairly detailed, and interested readers can use it as a reference; I hope it is helpful to you.
K-NN is a basic classification and regression method. When it is used for classification, the idea of the algorithm is simple: the k nearest training instances are found by computing distances between feature vectors, and the category is predicted by a majority vote over those k instances. For regression, the prediction is the mean of the values of the k instances. From this we can see the three basic elements of k-NN: the choice of k, the distance metric, and the classification decision rule.
I. Algorithm analysis
Input: a training data set with categories, represented as:
T = {(x1, y1), (x2, y2), ..., (xN, yN)}
where xi is the feature vector of an instance and yi is its category. Output: the class y to which the instance x belongs.
(1) According to the given distance metric, find the k points in the training set T that are closest to x.
(2) Among these k points, determine the category y of x according to the classification decision rule (such as majority voting).
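In the usual formulation of k-NN (the original text omits the formula; the following is the standard majority-vote rule, not taken from the article), the decision can be written as:

y = arg max_cj Σ_{xi ∈ Nk(x)} I(yi = cj),  j = 1, 2, ..., K

where Nk(x) is the set of the k training points nearest to x and I is the indicator function, equal to 1 when yi = cj and 0 otherwise.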
II. Basic elements
Distance metric: the distance between two instances in feature space reflects how similar the two points are. The k-NN model usually uses Euclidean distance, but other distances can also be chosen, such as Manhattan distance, Chebyshev distance, and the more general Minkowski distance.
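As an illustration (not from the original article), here is a minimal Python sketch of these distance metrics, assuming NumPy is available:

import numpy as np

def minkowski(a, b, p=2):
    # Minkowski distance; p=1 gives Manhattan, p=2 gives Euclidean.
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

def chebyshev(a, b):
    # Chebyshev distance: the limit of the Minkowski distance as p grows.
    return np.max(np.abs(a - b))

a, b = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(minkowski(a, b, p=2))  # Euclidean: 5.0
print(minkowski(a, b, p=1))  # Manhattan: 7.0
print(chebyshev(a, b))       # Chebyshev: 4.0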
The choice of k: the choice of k has a great influence on the result of k-NN. Choosing a smaller k is equivalent to predicting with training examples in a smaller neighborhood; the prediction is then very sensitive to the nearest neighbor points, and if a neighbor happens to be noise, the prediction will be wrong. In other words, the smaller k is, the more complex the overall model is, and the more prone it is to overfitting. The special case k = 1 is called the nearest neighbor algorithm. Choosing a larger k is equivalent to predicting with training examples in a larger neighborhood; distant (dissimilar) training examples can then also affect the prediction, and increasing k means the overall model becomes simpler. When k = N, the model simply predicts that every input instance belongs to the most common class in the training set. Therefore, in practice k is usually given a small value, and cross-validation is used to select the optimal k.
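A minimal sketch of selecting k by cross-validation, assuming scikit-learn is available and using its bundled iris data set purely for illustration:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

scores = {}
for k in range(1, 16):
    clf = KNeighborsClassifier(n_neighbors=k)
    # Mean 5-fold cross-validation accuracy for this k.
    scores[k] = cross_val_score(clf, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])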
Classification decision rule: the classification decision rule in k-NN is usually majority voting, that is, the class of the input instance is determined by the majority class among its k nearest training instances.
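A minimal sketch of the majority-vote rule, assuming the labels of the k nearest neighbors have already been collected into a list (the function name majority_vote is illustrative, not from the article):

from collections import Counter

def majority_vote(neighbor_labels):
    # Return the most common label among the k nearest neighbors.
    return Counter(neighbor_labels).most_common(1)[0][0]

print(majority_vote(['a', 'b', 'a', 'a', 'c']))  # -> 'a'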
III. Implementation of the algorithm
Algorithm steps (a Python sketch follows the list):
Step 1: Initialize the distances of the K nearest samples to the maximum possible value.
Step 2: Compute the distance dist between the unknown sample and a training sample.
Step 3: Obtain the maximum distance maxdist among the current K nearest samples.
Step 4: If dist is less than maxdist, take that training sample as one of the K nearest neighbor samples.
Step 5: Repeat steps 2, 3 and 4 until the distances between the unknown sample and all training samples have been computed.
Step 6: Count the number of times each class label appears among the K nearest neighbor samples.
Step 7: Select the class label with the highest frequency as the class label of the unknown sample.
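A minimal Python sketch of the steps above, assuming NumPy, Euclidean distance, and a tiny illustrative data set; the function name knn_classify and the sample values are hypothetical, not from the article:

import numpy as np
from collections import Counter

def knn_classify(x, train_X, train_y, k=3):
    # Steps 1-5: compute the distance from the unknown sample x to every
    # training sample and keep the k smallest ones (equivalent to
    # maintaining maxdist explicitly as in the step list).
    dists = np.sqrt(np.sum((train_X - x) ** 2, axis=1))
    nearest = np.argsort(dists)[:k]
    # Steps 6-7: count each class label among the k neighbors and return
    # the most frequent one.
    labels = [train_y[i] for i in nearest]
    return Counter(labels).most_common(1)[0][0]

train_X = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
train_y = ['A', 'A', 'B', 'B']
print(knn_classify(np.array([0.1, 0.2]), train_X, train_y, k=3))  # -> 'B'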
IV. Algorithm optimization
When implementing k-NN, the main problem is how to search the training data efficiently, which matters especially when the feature space has many dimensions and the training set is large. The simplest implementation of k-NN is a linear scan, that is, computing the distance between the input instance and every training instance. When the training set is very large, this is very time-consuming and the method loses its practical value. To improve the efficiency of the k-nearest-neighbor search, a kd tree can be used; the kd tree is not discussed here and will be covered in a later article. In addition, when using k-NN we may select nearest neighbors that are quite far away and therefore have low similarity to the input instance. One way to compensate is to give each neighbor a weight according to its distance, and three methods for computing the weights are commonly used.
1. Inverse function
Take an inverse function of the distance as the weight; the simplest way is to use the reciprocal of the distance. There is a problem, however: when an instance is exactly the same as, or very close to, the query point, the weight becomes very large or even infinite. To avoid this, a small constant is added to the distance before taking the reciprocal.
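A minimal sketch of the inverse-distance weight; the added constant (1.0 here) is an arbitrary illustrative choice:

def inverse_weight(dist, const=1.0):
    # Adding const avoids an infinite weight when dist is 0 or very small.
    return 1.0 / (dist + const)

print(inverse_weight(0.0))  # 1.0
print(inverse_weight(4.0))  # 0.2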
2. Subtraction function
Subtract the distance from a constant: if the result is greater than 0, the weight is that result; otherwise the weight is 0. This method overcomes the problem of excessively large weights, but it also has a limitation: because positive weights are restricted to instances within a certain distance, we may not find enough items close enough to count as neighbors, so for some instances no prediction can be made.
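A minimal sketch of the subtraction weight; the constant (1.0 here) is an arbitrary illustrative choice:

def subtract_weight(dist, const=1.0):
    # Weight is const - dist when that is positive, otherwise 0.
    return const - dist if dist < const else 0.0

print(subtract_weight(0.3))  # 0.7
print(subtract_weight(2.0))  # 0.0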
3. Gaussian function
In this method the weight is given by a Gaussian function: the weight is 1 when the distance is 0, and it decreases as the distance increases. Unlike the subtraction function, the weight never falls all the way to 0, which overcomes the limitations of the first two functions; however, it is more complex to compute, so it does not run as fast as the first two.
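A minimal sketch of the Gaussian weight; sigma (1.0 here) controls how quickly the weight decays with distance and is an illustrative choice:

import math

def gaussian_weight(dist, sigma=1.0):
    # Weight is 1 at distance 0 and decays toward (but never reaches) 0.
    return math.exp(-dist ** 2 / (2 * sigma ** 2))

print(gaussian_weight(0.0))  # 1.0
print(gaussian_weight(2.0))  # ~0.135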
V. Advantages and disadvantages of the algorithm
K-NN can use complex functions for numerical prediction while remaining easy to understand. It is an online technique: new data can be added to the collection at any time without any retraining.
The main disadvantage of k-NN is that all of the training data must be kept in order to make a prediction. When faced with data sets of millions of samples, this creates problems in both space and time.
That covers how to implement the KNN nearest neighbor algorithm in Python. I hope the above content is helpful to you and that you can learn more from it. If you found the article useful, feel free to share it with others.