This article explains how to implement the KNN (k-nearest neighbor) algorithm in Python. The method introduced here is simple, fast, and practical. Let's walk through it step by step.
1. Overview of KNN
Simply put, the k-nearest neighbor algorithm classifies a sample by measuring the distance between different feature values.
Advantages: high accuracy, insensitive to outliers, no assumptions about the input data
Disadvantages: high computational complexity and high space complexity
Applicable data range: numerical and nominal
How it works: there is a sample data set, also known as the training sample set, and each entry in it carries a label; that is, we know the correspondence between each sample and the category it belongs to (the training set). After new data without a label is input, each feature of the new data is compared with the corresponding feature of every sample in the set, and the algorithm extracts the classification labels of the most similar samples (the nearest neighbors) for the new data (the test set). Generally, we only consider the top k most similar samples in the data set, which is where the k in k-nearest neighbors comes from (usually k is an integer not greater than 20). Finally, the label that appears most often among those k samples is taken as the classification of the new data.
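Before building this by hand, here is a minimal sketch of the same workflow using scikit-learn's KNeighborsClassifier (assuming scikit-learn is installed; it is not used elsewhere in this article, and the data mirrors the example built later):

# a minimal kNN sketch with scikit-learn
from sklearn.neighbors import KNeighborsClassifier

X_train = [[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]]  # labeled samples
y_train = ['A', 'A', 'B', 'B']                              # their class labels

clf = KNeighborsClassifier(n_neighbors=3)  # k = 3
clf.fit(X_train, y_train)
print(clf.predict([[1.0, 0.0]]))           # classify a new, unlabeled point -> ['B']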
2. Use Python to import data
Let's write a piece of code first.
from numpy import *    # import the numpy module
import operator        # import the operator module

def createDataSet():   # function that creates the data set
    # build an array to store the feature values
    group = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    # build a list to store the target labels (one per row of group)
    labels = ['A', 'A', 'B', 'B']
    return group, labels
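A quick way to check the function (a usage sketch, run in the same file or after importing it):

group, labels = createDataSet()
print(group)   # the 4 x 2 array of feature values
print(labels)  # ['A', 'A', 'B', 'B']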
Let's briefly introduce the numpy package here.
3. numpy.array()
The main object of NumPy is the homogeneous multidimensional array: a table of elements (usually numbers), all of the same type, indexed by a tuple of non-negative integers.
In NumPy, dimensions are called axes, and the number of axes is called the rank; note that this is not the same as rank in linear algebra. To compute the linear-algebra rank of a matrix in Python, we use the linalg.matrix_rank method of the numpy package.
The definition of rank in linear algebra: suppose matrix A has an r-th order minor D that is not equal to 0, and all of its (r+1)-th order minors (if they exist) are equal to 0; then D is called the highest-order nonzero minor of A, and the number r is called the rank of A, denoted R(A).
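A short sketch tying the two meanings of "rank" together (ndim gives the number of axes, numpy.linalg.matrix_rank gives the linear-algebra rank):

import numpy as np

a = np.array([[1, 2], [2, 4]])   # a 2-D array: all elements share one type
print(a.ndim)                    # 2 axes (the ndarray "rank" in NumPy's old sense)
print(a.shape)                   # (2, 2)
# the linear-algebra rank is different: the second row is 2x the first,
# so the highest-order nonzero minor is 1x1 and the rank is 1
print(np.linalg.matrix_rank(a))  # 1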
4. Implement the KNN classification algorithm

Following the steps of the algorithm, let's work through it one piece at a time.
First, prepare the four pieces of data the function needs:

inX: the input vector to classify
dataSet: the training sample set
labels: the label vector (its number of elements equals the number of rows of the matrix dataSet)
k: the number of nearest neighbors to select
First, calculate the distance between every point in the known-category data set and the current point. The complete code for this step:

# return the number of rows of the matrix
dataSetSize = dataSet.shape[0]
# keep the number of columns, repeat inX for dataSetSize rows, then subtract dataSet
diffMat = tile(inX, (dataSetSize, 1)) - dataSet
# square the differences element-wise
sqDiffMat = diffMat ** 2
# sum the elements of each row
sqDistances = sqDiffMat.sum(axis=1)
# take the square root
distances = sqDistances ** 0.5

Let's go through it line by line.
First line:

# return the number of rows of the matrix
dataSetSize = dataSet.shape[0]
# taking the data from the first step as an example, the answer is 4 (4 rows)
Second line:

inX = [1., 0.]
# tile(inX, (dataSetSize, 1)) keeps the number of columns and repeats inX for dataSetSize rows:
# [[1., 0.], [1., 0.], [1., 0.], [1., 0.]]
diffMat = tile(inX, (dataSetSize, 1)) - dataSet
# the two matrices have equal shapes, so they subtract element-wise:
# dataSet = [[1., 1.1], [1., 1.], [0., 0.], [0., 0.1]]
# diffMat = [[0., -1.1], [0., -1.], [1., 0.], [1., -0.1]]
Third line:

# square the differences element-wise
sqDiffMat = diffMat ** 2
Fourth line:

# sum the elements of each row of the matrix
# this collapses each row to a single value, giving one entry per row
sqDistances = sqDiffMat.sum(axis=1)
Fifth line:

# take the square root
distances = sqDistances ** 0.5
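As a hand check, running the five lines above on the sample data (with inX = [1., 0.] as in the tile example) gives:

from numpy import array, tile

dataSet = array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
inX = [1.0, 0.0]
dataSetSize = dataSet.shape[0]                   # 4
diffMat = tile(inX, (dataSetSize, 1)) - dataSet  # [[0., -1.1], [0., -1.], [1., 0.], [1., -0.1]]
sqDiffMat = diffMat ** 2
sqDistances = sqDiffMat.sum(axis=1)              # [1.21, 1.0, 1.0, 1.01]
distances = sqDistances ** 0.5                   # [1.1, 1.0, 1.0, ~1.005]
print(distances)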
Sort in order of increasing distance:

# argsort does not sort the array itself; it returns the indices
# that would sort it in ascending order
sortedDistIndicies = distances.argsort()
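For the distances just computed, argsort returns the sample indices from nearest to farthest (a small illustration):

from numpy import array

distances = array([1.1, 1.0, 1.0, 1.005])  # from the previous step
print(distances.argsort())                 # [1 2 3 0]: indices 1 and 2 are closest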
Select the k points with the smallest distance from the current point:

classCount = {}  # create a new dictionary
# count the class of each of the first k minimum-distance elements
for i in range(k):
    # voteIlabel is the label in labels at position sortedDistIndicies[i]
    voteIlabel = labels[sortedDistIndicies[i]]
    classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
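With k = 3 and labels = ['A', 'A', 'B', 'B'], the three nearest indices from the argsort example vote like this (a small illustration):

labels = ['A', 'A', 'B', 'B']
sortedDistIndicies = [1, 2, 3, 0]  # from the argsort step above
k = 3
classCount = {}
for i in range(k):
    voteIlabel = labels[sortedDistIndicies[i]]
    classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
print(classCount)  # {'A': 1, 'B': 2}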
Determine how often each category occurs among the first k points:

# sort the (category, count) pairs by count, descending
# (classCount.iteritems() is Python 2 only; items() works in Python 3)
sortedClassCount = sorted(classCount.items(),
                          key=operator.itemgetter(1), reverse=True)
# 1 "returns the category with the highest frequency of occurrence of the first k points as the prediction classification of the current point
Return sortedClassCount [0] [0]
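Putting all the pieces together, here is a complete, runnable sketch of the classifier (the name classify0 is assumed, since the article never names the function; Python 3 syntax):

import operator
from numpy import array, tile

def createDataSet():
    group = array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels

def classify0(inX, dataSet, labels, k):
    # distance of inX to every sample
    dataSetSize = dataSet.shape[0]
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    distances = ((diffMat ** 2).sum(axis=1)) ** 0.5
    # indices of samples sorted by increasing distance
    sortedDistIndicies = distances.argsort()
    # vote among the k nearest samples
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

group, labels = createDataSet()
print(classify0([1.0, 0.0], group, labels, 3))  # -> 'B'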
As a bonus, let's try a C++ version. Use it with care; it is for reference only!
#include <iostream>
#include <vector>
#include <map>
#include <cmath>
#include <algorithm>

int sum_vector(std::vector<int>& v) {
    int sum = 0;
    for (size_t i = 0; i < v.size(); ++i) {
        sum = v[i] + sum;
    }
    return sum;
}

int knn(int k) {
    using std::vector;
    // training set: four samples, all with label 1
    vector<vector<int>> x;
    vector<int> x_sample = {2, 3, 4};
    for (int i = 0; i < 4; ++i) {
        x.push_back(x_sample);
    }
    vector<int> y = {1, 1, 1, 1};
    int dataSetSize = x.size();

    // tile the test point into a matrix with dataSetSize rows
    vector<int> x_test = {4, 3, 4};
    vector<vector<int>> x_test_matrix;
    for (int i = 0; i < dataSetSize; ++i) {
        x_test_matrix.push_back(x_test);
    }

    // Euclidean distance from the test point to each sample,
    // stored together with the sample's index so the label survives sorting
    vector<std::pair<double, int>> v_total;
    for (int i = 0; i < dataSetSize; ++i) {
        for (size_t j = 0; j < x_test_matrix[i].size(); ++j) {
            x_test_matrix[i][j] = x_test_matrix[i][j] - x[i][j];
            x_test_matrix[i][j] = x_test_matrix[i][j] * x_test_matrix[i][j];
        }
        int sum_vec = sum_vector(x_test_matrix[i]);
        v_total.push_back(std::make_pair(std::sqrt(sum_vec), i));
    }
    std::sort(v_total.begin(), v_total.end());  // ascending by distance

    // vote among the k nearest samples (clamp k to the data set size)
    std::map<int, int> mp;
    for (int i = 0; i < k && i < dataSetSize; ++i) {
        int label = y[v_total[i].second];
        mp[label] += 1;
    }

    // return the label with the most votes
    int best_label = 0;
    int max_votes = 0;
    for (std::map<int, int>::iterator it = mp.begin(); it != mp.end(); ++it) {
        if (it->second > max_votes) {
            max_votes = it->second;
            best_label = it->first;
        }
    }
    return best_label;
}

int main() {
    int k = 12;
    int value = knn(k);
    std::cout << value << std::endl;
    return 0;
}
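One build note: the sketch above uses brace-initializer lists, so it needs a C++11 or later compiler; something like g++ -std=c++11 knn.cpp -o knn should work (the file name is only an assumption).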