Today I will talk about how to simulate the k-nearest neighbor (KNN) algorithm in Python. Many readers may not be familiar with it, so the following summary is provided in the hope that you take something useful away from this article.
The first program in this chapter computes the Lp (Minkowski) distance between two vectors, i.e. the Lp norm of their difference.
import math
# combinations(n, m) takes m items out of n without regard to order;
# permutations(n, m) takes them out in order. Neither is actually used below,
# the import is kept only as a reference from the original notes.
from itertools import combinations

def L(x, y, p=2):
    # Lp (Minkowski) distance between two vectors of equal length
    if len(x) == len(y) and len(x) > 1:
        total = 0
        for i in range(len(x)):                      # i is the component index
            total += math.pow(abs(x[i] - y[i]), p)   # pow: exponentiation, abs: absolute value
        return math.pow(total, 1 / p)
    else:
        return 0

x1 = [1, 1]
x2 = [5, 1]
x3 = [4, 4]
for i in range(1, 5):
    r = {"1-{}".format(c): L(x1, c, p=i) for c in [x2, x3]}
    print("current p: {}, current r: {}".format(i, r))
    print("current minimum:")
    print(min(zip(r.values(), r.keys())))
Output result:
current p: 1, current r: {'1-[5, 1]': 4.0, '1-[4, 4]': 6.0}
current minimum:
(4.0, '1-[5, 1]')
current p: 2, current r: {'1-[5, 1]': 4.0, '1-[4, 4]': 4.242640687119285}
current minimum:
(4.0, '1-[5, 1]')
current p: 3, current r: {'1-[5, 1]': 3.9999999999999996, '1-[4, 4]': 3.7797631496846193}
current minimum:
(3.7797631496846193, '1-[4, 4]')
current p: 4, current r: {'1-[5, 1]': 4.0, '1-[4, 4]': 3.5676213450081633}
current minimum:
(3.5676213450081633, '1-[4, 4]')

Note how the choice of p changes which point is "nearest": for p >= 3 the closest point to x1 switches from x2 to x3.
The second program simulates the k-nearest neighbor (KNN) algorithm.
Idea: traverse all training points, find the k nearest ones, and decide the class of the test point by majority vote. This brute-force approach becomes very time-consuming when the training set is large.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split  # used to split the data
from collections import Counter  # counter

class KNN:
    def __init__(self, X_train, y_train, n_neighbors=3, p=2):
        # n_neighbors: keep the 3 nearest points; p: distances are measured with the 2-norm
        self.n = n_neighbors
        self.p = p
        self.X_train = X_train
        self.y_train = y_train

    def predict(self, X):
        # first record the "distances" to the first n points; they are revised below
        knn_list = []
        for i in range(self.n):
            dist = np.linalg.norm(X - self.X_train[i], ord=self.p)
            knn_list.append((dist, self.y_train[i]))
        # then traverse the remaining points and check whether each distance is smaller
        # than the current largest one in the list; what is left at the end
        # is the n nearest points
        for i in range(self.n, len(self.X_train)):
            max_index = knn_list.index(max(knn_list, key=lambda x: x[0]))
            dist = np.linalg.norm(X - self.X_train[i], ord=self.p)
            if knn_list[max_index][0] > dist:
                knn_list[max_index] = (dist, self.y_train[i])
        # take the label (last element) of each of the n nearest points
        knn = [k[-1] for k in knn_list]
        # count them; the result of this step looks like {"1": count, "0": count}
        count_pairs = Counter(knn)
        # count_pairs.items() stores (label, count) pairs; sort by count, from small to large.
        # Sorting also handles the case of more than two classes.
        # [-1][0] takes the first field of the last pair, i.e. the most likely class.
        max_possible = sorted(count_pairs.items(), key=lambda x: x[1])[-1][0]
        return max_possible

    def score(self, X_test, y_test):
        right_count = 0
        for X, y in zip(X_test, y_test):
            label = self.predict(X)
            if label == y:
                right_count += 1
        return right_count / len(X_test)

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['label'] = iris.target
df.columns = ["sepal length", "sepal width", "petal length", "petal width", "label"]
data = np.array(df.iloc[:100, [0, 1, -1]])
X, y = data[:, :-1], data[:, -1]
# split the data into a training set and a test set; test set ratio 0.25
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
clf = KNN(X_train, y_train)
print("score:")
print(clf.score(X_test, y_test))
test_point = [5.0, 4.0]
print('test point class: {}'.format(clf.predict(test_point)))  # prints the predicted class, 1 or 0
plt.scatter(df[:50]['sepal length'], df[:50]['sepal width'], label="0")
plt.scatter(df[50:100]['sepal length'], df[50:100]['sepal width'], label="1")
plt.xlabel('sepal length')
plt.ylabel('sepal width')
plt.legend()
plt.show()
Result: the score is 1.0 most of the time and occasionally 0.96, depending on the random train/test split.
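As a sanity check, the hand-written classifier can be compared with scikit-learn's built-in implementation on the same split. This snippet is not part of the original listing; the use of KNeighborsClassifier here is our own addition, and it assumes X_train, X_test, y_train and y_test from the listing above are still in scope.

from sklearn.neighbors import KNeighborsClassifier

clf_sk = KNeighborsClassifier(n_neighbors=3, p=2)  # same k and distance order as the KNN class above
clf_sk.fit(X_train, y_train)
print("sklearn score:", clf_sk.score(X_test, y_test))  # should be close to clf.score(X_test, y_test)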
Implementing the k-nearest neighbor algorithm with a kd-tree
When the training set is very large, we construct a kd-tree to speed up the nearest-neighbor search.
The idea behind constructing the kd-tree: choose a coordinate axis (that is, select the features in turn) to split on, and take the median along that axis as the split point, so that the resulting kd-tree is balanced. Note that a balanced kd-tree does not necessarily give optimal search efficiency. Also note that "select the features in turn" means that after every feature has been used once, a region may still contain multiple points, so the procedure wraps around and splits on the first feature again. Hence the usual formula for the splitting feature is l = (j mod k) + 1, where j is the depth of the current node (the root has depth 0) and k is the number of features.
In this program, always starting the axis (that is, the feature) choice from 0 is not necessarily reasonable.
A more reasonable way to choose the splitting feature is to pick, at each node, the feature with the largest variance among the current data, because that lets the nearest-neighbor search narrow down faster. The intuition is like gradient descent on an elongated ellipse: descending along the long axis takes longer than along the short axis. Or, when walking down a mountain, a direction with large variance is like a steep slope and one with small variance is like a gentle slope; we naturally take the steeper one to get down faster. A minimal sketch of this variance-based choice is shown below.
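The helper below illustrates the variance-based choice. The name choose_split_dim and the sample numbers are illustrative assumptions, not part of the original program, which simply cycles through the dimensions.

import numpy as np

def choose_split_dim(points):
    # pick the splitting dimension as the feature with the largest variance
    # (illustration only; the kd-tree code below cycles dimensions with (split + 1) % k instead)
    points = np.asarray(points)
    return int(np.argmax(points.var(axis=0)))  # index of the most spread-out feature

# the second feature is far more spread out than the first, so we split on dimension 1
print(choose_split_dim([[1.0, -5.0], [1.1, 4.0], [0.9, 12.0]]))  # -> 1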
from math import sqrt
from collections import namedtuple

# define a namedtuple that stores the nearest point, the nearest distance
# and the number of nodes visited
result = namedtuple("Result_tuple", "nearest_point nearest_dist nodes_visited")

# the main data contained in each node of the kd-tree:
class KdNode(object):
    def __init__(self, dom_elt, split, left, right):
        self.dom_elt = dom_elt  # k-dimensional vector (a sample point in k-dimensional space)
        self.split = split      # integer (index of the dimension used for splitting)
        self.left = left        # kd-tree formed by the left subspace of the splitting hyperplane
        self.right = right      # kd-tree formed by the right subspace of the splitting hyperplane

class KdTree(object):
    def __init__(self, data):
        k = len(data[0])  # data dimension

        def CreateNode(split, data_set):  # split data_set along dimension `split` and create a KdNode
            if not data_set:  # the data set is empty
                return None
            # sort by the dimension to be split; the key argument is a function that takes
            # one element and returns the value to compare
            # (operator.itemgetter(split) would work just as well)
            data_set.sort(key=lambda x: x[split])
            split_pos = len(data_set) // 2  # // is integer division in Python
            median = data_set[split_pos]    # median split point
            split_next = (split + 1) % k    # cycle through the coordinates
            # recursively create the kd-tree
            return KdNode(median, split,
                          CreateNode(split_next, data_set[:split_pos]),      # create the left subtree
                          CreateNode(split_next, data_set[split_pos + 1:]))  # create the right subtree

        self.root = CreateNode(0, data)  # build the kd-tree starting from dimension 0; root node is returned

# preorder traversal of the kd-tree
def preorder(root):
    print(root.dom_elt)
    if root.left:  # node is not empty
        preorder(root.left)
    if root.right:
        preorder(root.right)

def find_nearest(tree, point):
    k = len(point)  # data dimension

    def travel(kd_node, target, max_dist):
        if kd_node is None:
            # float("inf") and float("-inf") represent positive and negative infinity in Python
            return result([0] * k, float("inf"), 0)

        nodes_visited = 1
        s = kd_node.split        # the dimension used for splitting
        pivot = kd_node.dom_elt  # the point that defines the splitting "axis"
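        # NOTE: the source article breaks off at this point ("if target[s] ...").
        # The rest of travel() below is a hedged reconstruction of the standard kd-tree
        # nearest-neighbor search (descend into the side containing the target first,
        # then prune with the distance to the splitting hyperplane); it is not
        # guaranteed to match the author's original code.
        if target[s] <= pivot[s]:
            nearer_node = kd_node.left    # the target lies in the left subspace
            further_node = kd_node.right  # remember the other side for possible backtracking
        else:
            nearer_node = kd_node.right
            further_node = kd_node.left

        temp1 = travel(nearer_node, target, max_dist)  # search the subtree containing the target

        nearest = temp1.nearest_point  # current best point
        dist = temp1.nearest_dist      # current best distance
        nodes_visited += temp1.nodes_visited

        if dist < max_dist:
            max_dist = dist  # the nearest point lies inside a hypersphere of this radius

        temp_dist = abs(pivot[s] - target[s])  # distance from the target to the splitting hyperplane
        if max_dist < temp_dist:
            # the hypersphere does not intersect the hyperplane: no need to check the other side
            return result(nearest, dist, nodes_visited)

        # check whether the splitting point itself is closer
        temp_dist = sqrt(sum((p1 - p2) ** 2 for p1, p2 in zip(pivot, target)))
        if temp_dist < dist:
            nearest = pivot
            dist = temp_dist
            max_dist = dist

        # check whether the other subtree contains an even closer point
        temp2 = travel(further_node, target, max_dist)
        nodes_visited += temp2.nodes_visited
        if temp2.nearest_dist < dist:
            nearest = temp2.nearest_point
            dist = temp2.nearest_dist

        return result(nearest, dist, nodes_visited)

    return travel(tree.root, point, float("inf"))  # start the recursion from the root

# Usage sketch (the sample points here are illustrative, not from the source):
sample_points = [[2, 3], [5, 4], [9, 6], [4, 7], [8, 1], [7, 2]]
kd = KdTree(sample_points)
preorder(kd.root)                  # print the tree in preorder
print(find_nearest(kd, [3, 4.5]))  # nearest point, its distance, and number of nodes visited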