This article is a detailed, example-driven look at string similarity in Python. The steps are laid out in order and the details are handled carefully; I hope it helps clear up any doubts you have about measuring string similarity in Python.
Python string similarity
Use the difflib module to compare the similarity of two strings or pieces of text.
First import the difflib module
import difflib
Example:
Str = 'Shanghai Central Building'
# The original example compared four slightly different Chinese names for the
# Shanghai Tower; the translation collapsed them into near-identical English
# strings, so the variants below are only placeholders.
S1 = 'Shanghai Building'
S2 = 'Shanghai Center Building'
S3 = 'Shanghai Central Tower'
print(difflib.SequenceMatcher(None, Str, S1).quick_ratio())
print(difflib.SequenceMatcher(None, Str, S2).quick_ratio())
print(difflib.SequenceMatcher(None, Str, S3).quick_ratio())

With the original strings this printed 0.5, 0.8 and 0.8333333333333334 respectively.

Python similarity assessment
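As a side note (not from the original article): quick_ratio() is only a fast upper bound on the real measure, ratio() gives the exact value, and difflib.get_close_matches() uses the same matcher to pull the best candidates out of a list. A minimal sketch with made-up strings:

import difflib

candidates = ['apple pie', 'apple tart', 'banana split']
query = 'apple pi'

# ratio() is the exact measure; quick_ratio() is a cheaper upper bound
for c in candidates:
    m = difflib.SequenceMatcher(None, query, c)
    print(c, round(m.ratio(), 3), round(m.quick_ratio(), 3))

# pick the closest candidates from the list (default cutoff is 0.6)
print(difflib.get_close_matches(query, candidates))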
"distance" is often used when assessing similarity:
1. Cosine distance

When calculating the similarity of images, I used cosine distance myself.
Wait, aren't we doing geometry here? How does the cosine of an angle come into this? Take it easy: in geometry, the cosine of the angle between two vectors measures how much their directions differ, and machine learning borrows this concept to measure the difference between sample vectors.
(1) For vectors A(x1, y1) and B(x2, y2) in two-dimensional space, the cosine of the angle between them is:

cos(θ) = (x1·x2 + y1·y2) / ( sqrt(x1² + y1²) · sqrt(x2² + y2²) )
(2) The cosine of the angle between two n-dimensional sample points a(x11, x12, ..., x1n) and b(x21, x22, ..., x2n).

Similarly, for two n-dimensional sample points a(x11, x12, ..., x1n) and b(x21, x22, ..., x2n), a quantity analogous to the angle cosine can be used to measure how similar they are. That is:

cos(θ) = ( x11·x21 + x12·x22 + ... + x1n·x2n ) / ( sqrt(x11² + x12² + ... + x1n²) · sqrt(x21² + x22² + ... + x2n²) )
The cosine of the angle ranges over [-1, 1]. The larger the cosine, the smaller the angle between the two vectors; the smaller the cosine, the larger the angle. When the two vectors point in the same direction, the cosine takes its maximum value of 1; when they point in exactly opposite directions, it takes its minimum value of -1.
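As a quick check of that range claim (my own example, not the article's), a small helper built directly from the formula above shows the three boundary cases:

import numpy as np

def cos_sim(a, b):
    # cosine of the angle between two vectors, straight from the formula
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cos_sim([2, 0], [5, 0]))   # same direction      -> 1.0
print(cos_sim([2, 0], [-5, 0]))  # opposite direction  -> -1.0
print(cos_sim([2, 0], [0, 5]))   # perpendicular       -> 0.0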
import numpy as np

# Cosine similarity (method 1): compute it directly with NumPy
def cosin_distance2(vector1, vector2):
    user_item_matric = np.vstack((vector1, vector2))
    sim = user_item_matric.dot(user_item_matric.T)
    norms = np.array([np.sqrt(np.diagonal(sim))])
    user_similarity = (sim / norms / norms.T)[0][1]
    return user_similarity

data = np.load("data/all_features.npy")
# sim = cosin_distance(data[22], data[828])
sim = cosin_distance2(data[22], data[828])
print(sim)

# Cosine similarity (method 2): sklearn's cosine_similarity
from sklearn.metrics.pairwise import cosine_similarity

a = np.array([1, 2, 8, 4, 6])
A1 = np.argsort(a)
user_tag_matric = np.vstack((a, A1))
user_similarity = cosine_similarity(user_tag_matric)
print(user_similarity[0][1])

# Cosine similarity (method 3): sklearn's pairwise_distances with metric='cosine'
from sklearn.metrics.pairwise import pairwise_distances

a = np.array([1, 2, 8, 4, 6])
A1 = np.argsort(a)
user_tag_matric = np.vstack((a, A1))
user_similarity = pairwise_distances(user_tag_matric, metric='cosine')
print(1 - user_similarity[0][1])
Note that the cosine "distance" computed by pairwise_distances is 1 - (cosine similarity).
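A small verification of that relationship (arbitrary vectors of my own, not the article's data):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, pairwise_distances

X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
sim = cosine_similarity(X)[0][1]
dist = pairwise_distances(X, metric='cosine')[0][1]
print(sim, 1 - dist)  # both values are the same: distance = 1 - similarity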
2. Euclidean distance
Euclidean distance is the easiest distance measure to understand; it comes directly from the formula for the distance between two points in Euclidean space.
# 1) Given two data points, calculate the Euclidean distance between them
import math

def get_distance(data1, data2):
    points = zip(data1, data2)
    diffs_squared_distance = [pow(a - b, 2) for (a, b) in points]
    return math.sqrt(sum(diffs_squared_distance))
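A short usage sketch of get_distance with made-up points (not part of the original); the result can be cross-checked against NumPy's norm of the difference:

import numpy as np

p1, p2 = [0, 0, 0], [3, 4, 12]
print(get_distance(p1, p2))                         # 13.0
print(np.linalg.norm(np.array(p1) - np.array(p2)))  # 13.0 as well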
3. Manhattan distance

How this distance is computed can be guessed from the name. Imagine you have to drive from one intersection in Manhattan to another. Is the driving distance the straight line between the two points? Obviously not, unless you can drive through buildings. The actual driving distance is the "Manhattan distance", which is where the name comes from; it is also known as the city block distance (CityBlock distance).
def Manhattan(vec1, vec2):
    # Manhattan_Distance
    npvec1, npvec2 = np.array(vec1), np.array(vec2)
    return np.abs(npvec1 - npvec2).sum()
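A quick usage check with made-up vectors (my own addition); SciPy's cityblock metric should agree:

from scipy.spatial.distance import cityblock

v1, v2 = [1, 2, 3], [4, 6, 3]
print(Manhattan(v1, v2))  # |1-4| + |2-6| + |3-3| = 7
print(cityblock(v1, v2))  # 7 as well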
4. Chebyshev distance

Have you ever played chess? The king can move to any of the eight adjacent squares in one step. So how many steps does the king need to go from square (x1, y1) to square (x2, y2)? Try it yourself: you will find that the minimum number of steps is always max(|x2 - x1|, |y2 - y1|). There is a distance measure defined in the same way, called the Chebyshev distance.
def Chebyshev(vec1, vec2):
    # Chebyshev_Distance
    npvec1, npvec2 = np.array(vec1), np.array(vec2)
    return max(np.abs(npvec1 - npvec2))
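Tying this back to the chessboard picture (my own example, not the article's): a king moving from square (1, 1) to square (4, 6) needs max(|4-1|, |6-1|) = 5 steps:

from scipy.spatial.distance import chebyshev

print(Chebyshev([1, 1], [4, 6]))  # 5
print(chebyshev([1, 1], [4, 6]))  # 5 as well, via SciPy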
5. Minkowski distance

The Minkowski distance is not a single distance but the definition of a whole family of distances, as the short check after the implementation below shows.
#!/usr/bin/env python
from math import *
from decimal import Decimal

def nth_root(value, n_root):
    root_value = 1 / float(n_root)
    return round(Decimal(value) ** Decimal(root_value), 3)

def minkowski_distance(x, y, p_value):
    return nth_root(sum(pow(abs(a - b), p_value) for a, b in zip(x, y)), p_value)

# The argument lists were garbled in the source; the first vector below follows
# the commonly circulated version of this snippet, the second vector and p = 3
# are as in the original.
print(minkowski_distance([0, 3, 4, 5], [7, 6, 3, -1], 3))
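To make the "family of distances" point concrete (this check is my own, not the article's): calling the function above with the same example vectors, p = 1 reproduces the Manhattan distance and p = 2 reproduces the Euclidean distance:

x, y = [0, 3, 4, 5], [7, 6, 3, -1]
print(minkowski_distance(x, y, 1))  # p=1 -> Manhattan distance: 17.000
print(minkowski_distance(x, y, 2))  # p=2 -> Euclidean distance: 9.747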
6. Standardized Euclidean distance

The standardized Euclidean distance is an improvement that addresses a shortcoming of the plain Euclidean distance. The idea: since the components of the data are distributed differently in each dimension, fine, let's first "standardize" each component to zero mean and unit variance, and only then measure the distance.
def Standardized_Euclidean(vec1, vec2, v):
    # Standardized Euclidean distance
    # http://blog.csdn.net/jinzhichaoshuiping/article/details/51019473
    from scipy import spatial
    npvec = np.array([np.array(vec1), np.array(vec2)])
    # Note: the original snippet passes V=None, so the v parameter is unused;
    # SciPy then estimates the per-component variances from npvec itself.
    return spatial.distance.pdist(npvec, 'seuclidean', V=None)

7. Mahalanobis distance

def Mahalanobis(vec1, vec2):
    # Mahalanobis_Distance
    npvec1, npvec2 = np.array(vec1), np.array(vec2)
    npvec = np.array([npvec1, npvec2])
    sub = npvec.T[0] - npvec.T[1]
    inv_sub = np.linalg.inv(np.cov(npvec1, npvec2))
    return math.sqrt(np.dot(inv_sub, sub).dot(sub.T))

8. Edit distance

def Edit_distance_str(str1, str2):
    # Levenshtein (edit) distance
    import Levenshtein
    edit_distance_distance = Levenshtein.distance(str1, str2)
    similarity = 1 - (edit_distance_distance / max(len(str1), len(str2)))
    return {'Distance': edit_distance_distance, 'Similarity': similarity}
In each of the distance functions above, the input is two arrays of the same dimension (the edit-distance function takes two strings instead).
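Finally, a small hedged usage sketch with values of my own (the Edit_distance_str call assumes the Levenshtein package used above is installed):

v1, v2 = [1.0, 2.0, 3.0], [4.0, 6.0, 3.0]
print(Manhattan(v1, v2))     # 7.0
print(Chebyshev(v1, v2))     # 4.0
print(get_distance(v1, v2))  # 5.0
print(Edit_distance_str('kitten', 'sitting'))
# roughly: {'Distance': 3, 'Similarity': 0.571...}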
That wraps up this case-style look at string similarity in Python. To really master these techniques you still need to practise and apply them yourself. For more related articles, keep an eye on the industry information channel.