
Case Analysis of string similarity in Python


In this article, the editor introduces "Case Analysis of String Similarity in Python" in detail. The content is detailed, the steps are clear, and the details are handled carefully; I hope this article helps you resolve your doubts.

Python string similarity

Using the difflib module to compare the similarity of two strings or texts

First import the difflib module

import difflib

Example:

Str = 'Shanghai Central Building'
S1 = 'Shanghai Central Building'
S2 = 'Shanghai Central Building'
S3 = 'Shanghai Central Building'
print(difflib.SequenceMatcher(None, Str, S1).quick_ratio())
print(difflib.SequenceMatcher(None, Str, S2).quick_ratio())
print(difflib.SequenceMatcher(None, Str, S3).quick_ratio())

Output:

0.5
0.8
0.8333333333333334
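In the original article Str, S1, S2 and S3 were four distinct strings (the English rendering above collapses them into the same text), which is what produces the three different ratios. As a runnable sketch with strings that actually differ, quick_ratio() is called the same way; the string values below are made up for illustration and are not from the original article:

import difflib

target = 'Shanghai Tower'
for candidate in ['Tower', 'Shanghai Center', 'Shanghai Towers']:
    # quick_ratio() is an inexpensive upper bound on ratio(); both lie in [0, 1]
    print(candidate, difflib.SequenceMatcher(None, target, candidate).quick_ratio())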

"distance" is often used when assessing similarity:

1. Cosine distance

When calculating the similarity of pictures, I have used cosine distance myself.

Are you kidding? We are not studying geometry, so why bring in the cosine of an angle? Ladies and gentlemen, take it easy. In geometry, the cosine of the angle between two vectors measures how much their directions differ, and machine learning borrows this concept to measure the difference between sample vectors.

(1) The cosine of the angle between vector A(x1, y1) and vector B(x2, y2) in two-dimensional space:

cos(θ) = (x1*x2 + y1*y2) / (sqrt(x1^2 + y1^2) * sqrt(x2^2 + y2^2))

(2) The cosine of the angle between two n-dimensional sample points a(x11, x12, …, x1n) and b(x21, x22, …, x2n)

Similarly, for two n-dimensional sample points a(x11, x12, …, x1n) and b(x21, x22, …, x2n), a quantity analogous to the cosine of the angle can be used to measure how similar they are.

That is:

cos(θ) = (x11*x21 + x12*x22 + … + x1n*x2n) / (sqrt(x11^2 + x12^2 + … + x1n^2) * sqrt(x21^2 + x22^2 + … + x2n^2))

The cosine of the included angle ranges over [-1, 1]. The larger the cosine, the smaller the angle between the two vectors; the smaller the cosine, the larger the angle between them. When the two vectors point in the same direction, the cosine takes its maximum value of 1; when they point in exactly opposite directions, it takes its minimum value of -1.
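A quick numeric check of the formula (the vectors here are chosen purely for illustration): for a = (1, 2) and b = (2, 4), which point in the same direction,

cos(θ) = (1*2 + 2*4) / (sqrt(1^2 + 2^2) * sqrt(2^2 + 4^2)) = 10 / (sqrt(5) * sqrt(20)) = 10 / 10 = 1

while for a = (1, 2) and b = (-1, -2), which point in opposite directions, the numerator becomes -10 and cos(θ) = -1.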

import numpy as np

# Cosine similarity (method 1)
def cosin_distance2(vector1, vector2):
    user_item_matric = np.vstack((vector1, vector2))
    sim = user_item_matric.dot(user_item_matric.T)
    norms = np.array([np.sqrt(np.diagonal(sim))])
    user_similarity = (sim / norms / norms.T)[0][1]
    return user_similarity

data = np.load("data/all_features.npy")
# sim = cosin_distance(data[22], data[828])
sim = cosin_distance2(data[22], data[828])
print(sim)

# Cosine similarity (method 2)
from sklearn.metrics.pairwise import cosine_similarity

a = np.array([1, 2, 8, 4, 6])
a1 = np.argsort(a)
user_tag_matric = np.vstack((a, a1))
user_similarity = cosine_similarity(user_tag_matric)
print(user_similarity[0][1])

# Cosine similarity (method 3)
from sklearn.metrics.pairwise import pairwise_distances

a = np.array([1, 2, 8, 4, 6])
a1 = np.argsort(a)
user_tag_matric = np.vstack((a, a1))
user_similarity = pairwise_distances(user_tag_matric, metric='cosine')
print(1 - user_similarity[0][1])

It is important to note that the cosine "distance" computed with pairwise_distances is 1 - (cosine similarity).
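A minimal check of this relationship (the two row vectors below are arbitrary examples):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, pairwise_distances

x = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
sim = cosine_similarity(x)[0][1]                      # cosine similarity of the two rows
dist = pairwise_distances(x, metric='cosine')[0][1]   # cosine distance of the two rows
print(sim, dist, sim + dist)                          # sim + dist is 1.0 up to floating-point rounding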

2. Euclidean distance

Euclidean distance is the most easily understood method of distance calculation, which is derived from the distance formula between two points in Euclidean space.

# 1) Given two data points, calculate the Euclidean distance between them
import math

def get_distance(data1, data2):
    points = zip(data1, data2)
    diffs_squared_distance = [pow(a - b, 2) for (a, b) in points]
    return math.sqrt(sum(diffs_squared_distance))
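A quick usage check (points chosen for illustration): the distance between (0, 0) and (3, 4) is the familiar 3-4-5 right triangle.

print(get_distance([0, 0], [3, 4]))  # 5.0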

3. Manhattan distance

You can guess how this distance is calculated from its name. Imagine you have to drive from one intersection to another in Manhattan. Is the driving distance the straight line between the two points? Obviously not, unless you can cut through the buildings. The actual driving distance is this "Manhattan distance", which is where the name comes from; it is also known as the City Block distance.

def Manhattan(vec1, vec2):
    npvec1, npvec2 = np.array(vec1), np.array(vec2)
    return np.abs(npvec1 - npvec2).sum()
# Manhattan_Distance
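For example (arbitrary vectors), the Manhattan distance between (1, 2, 3) and (4, 6, 8) is |1-4| + |2-6| + |3-8| = 12.

print(Manhattan([1, 2, 3], [4, 6, 8]))  # 12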

4. Chebyshev distance

Have you ever played chess? The king can move to any of the eight adjacent squares in one step. So how many steps does it take for the king to go from square (x1, y1) to square (x2, y2)? Try it yourself: you will find that the minimum number of steps is always max(|x2 - x1|, |y2 - y1|). There is a similar distance measure called the Chebyshev distance.

def Chebyshev(vec1, vec2):
    npvec1, npvec2 = np.array(vec1), np.array(vec2)
    return max(np.abs(npvec1 - npvec2))
# Chebyshev_Distance
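Continuing the chess example (coordinates chosen for illustration), from (1, 2) to (4, 7) the king needs max(|4 - 1|, |7 - 2|) = 5 steps.

print(Chebyshev([1, 2], [4, 7]))  # 5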

5. Minkowski distance

The Minkowski distance is not a single distance but a definition of a whole family of distances.
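For two n-dimensional points a and b and a parameter p, the Minkowski distance is:

d(a, b) = (|a1 - b1|^p + |a2 - b2|^p + … + |an - bn|^p)^(1/p)

With p = 1 it reduces to the Manhattan distance, with p = 2 to the Euclidean distance, and as p tends to infinity it approaches the Chebyshev distance.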

#!/usr/bin/env python
from math import *
from decimal import Decimal

def nth_root(value, n_root):
    root_value = 1 / float(n_root)
    return round(Decimal(value) ** Decimal(root_value), 3)

def minkowski_distance(x, y, p_value):
    return nth_root(sum(pow(abs(a - b), p_value) for a, b in zip(x, y)), p_value)

print(minkowski_distance([0, 3, 4, 5], [7, 6, 3, -1], 3))  # example vectors, p = 3

6. Standardized Euclidean distance

The standardized Euclidean distance is an improvement that addresses a shortcoming of the plain Euclidean distance. The idea: since the components of the data are distributed differently along each dimension, fine, let me first "standardize" each component to zero mean and unit variance.

def Standardized_Euclidean(vec1, vec2, v):
    from scipy import spatial
    npvec = np.array([np.array(vec1), np.array(vec2)])
    return spatial.distance.pdist(npvec, 'seuclidean', V=None)
# Standardized Euclidean distance
# http://blog.csdn.net/jinzhichaoshuiping/article/details/51019473

7. Mahalanobis distance

def Mahalanobis(vec1, vec2):
    npvec1, npvec2 = np.array(vec1), np.array(vec2)
    npvec = np.array([npvec1, npvec2])
    sub = npvec.T[0] - npvec.T[1]
    inv_sub = np.linalg.inv(np.cov(npvec1, npvec2))
    return math.sqrt(np.dot(inv_sub, sub).dot(sub.T))
# Mahalanobis_Distance

8. Edit distance

def Edit_distance_str(str1, str2):
    import Levenshtein
    edit_distance_distance = Levenshtein.distance(str1, str2)
    similarity = 1 - (edit_distance_distance / max(len(str1), len(str2)))
    return {'Distance': edit_distance_distance, 'Similarity': similarity}
# Levenshtein distance

The input to these distance functions is two arrays of the same dimension.
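As a small usage sketch for the edit-distance helper (it assumes the python-Levenshtein package is installed; the example strings are arbitrary):

print(Edit_distance_str("kitten", "sitting"))
# edit distance 3, similarity 1 - 3/7 ≈ 0.571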

That concludes "Case Analysis of String Similarity in Python". To really master these points, you still need to practice and apply them yourself. If you want to read more related articles, please follow the industry information channel.
