
How to Implement Fuzzy Matching with Python + FuzzyWuzzy


Today I will share some knowledge about how to implement fuzzy matching with Python and the FuzzyWuzzy library. The content is detailed and the logic is clear. Most people probably don't know much about this topic yet, so this article is shared for your reference. I hope you get something out of it. Let's take a look.

1. Preface

When processing data, you will inevitably run into scenarios like the following: the field you have collected is a simplified version, while the data you need to compare or merge against holds the full version (or sometimes the other way around).

One of the most common examples is geographic visualization: the data you collect yourself keeps only the abbreviated names, such as Beijing, Guangxi, Xinjiang and Xizang, while the field to be matched holds the full names, such as Beijing, Guangxi Zhuang Autonomous Region, Xinjiang Uygur Autonomous Region and Xizang Autonomous Region. We therefore need a quick and easy way to match the corresponding fields and write the results into a separate column, and that is exactly what the FuzzyWuzzy library can do.

2. Introduction to FuzzyWuzzy Library

FuzzyWuzzy is an easy-to-use fuzzy string matching toolkit. It measures the difference between two sequences based on the Levenshtein distance algorithm.

Levenshtein distance, also known as edit distance, is the minimum number of edit operations required to transform one string into the other. The permitted edit operations are replacing one character with another, inserting a character, and deleting a character. For example, turning "kitten" into "sitting" takes three edits (two substitutions and one insertion). In general, the smaller the edit distance, the more similar the two strings are.

The examples here use the Jupyter Notebook environment that ships with Anaconda, so the third-party library is installed from the Anaconda command line:

pip install -i https://pypi.tuna.tsinghua.edu.cn/simple FuzzyWuzzy

2.1 fuzz module

This module mainly provides four functions (methods): simple matching (Ratio), partial matching (Partial Ratio), ignore-order matching (Token Sort Ratio) and de-duplicated subset matching (Token Set Ratio).

Note: if you import this module directly, the system will show a warning. This is not an error and the program still runs (it just falls back to the default algorithm and executes more slowly). Following the prompt, you can install the python-Levenshtein library to speed up the computation.
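As a minimal sketch (using the same Tsinghua mirror as above, which is optional depending on your network), the speed-up install and the imports used throughout the rest of this article look like this:

# optional: install python-Levenshtein to silence the warning and speed up scoring
# pip install -i https://pypi.tuna.tsinghua.edu.cn/simple python-Levenshtein

from fuzzywuzzy import fuzz
from fuzzywuzzy import process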

2.1.1 simple matching (Ratio)

This is just a simple whole-string comparison; it is not very precise and is not used that often.

fuzz.ratio("河南省", "河南省")
>>> 100
fuzz.ratio("河南", "河南省")
>>> 80

2.1.2 non-exact match (Partial Ratio)

When higher precision is needed, try partial (non-exact) matching.

fuzz.partial_ratio("河南省", "河南省")
>>> 100
fuzz.partial_ratio("河南", "河南省")
>>> 100

2.1.3 ignore sequence matching (Token Sort Ratio)

The principle: the strings are split on spaces, all letters are lowercased, and other punctuation is ignored before the comparison.

Fuzz.ratio ("Xizang Autonomous region", "Autonomous region Xizang") > 50fuzz.ratio ('I love YOU','YOU LOVE I') > 30fuzz.token_sort_ratio ("Xizang Autonomous region", "Autonomous region Xizang") > 100fuzz.token_sort_ratio ('I love YOU','YOU LOVE I') > 100

2.1.4 de-duplicated subset matching (Token Set Ratio)

This is equivalent to de-duplicating the token sets before the comparison; it can be understood as token_sort_ratio plus set de-duplication. Pay attention to the last two results below. In all three of the following calls the second string is in reverse order, and the first string contains a duplicated token.

Fuzz.ratio ("Xizang Xizang Autonomous region", "Autonomous region Xizang") > 40fuzz.token_sort_ratio ("Xizang Xizang Autonomous region", "Autonomous region Xizang") > 80fuzz.token_set_ratio ("Xizang Xizang Autonomous region", "Autonomous region Xizang") > 100

All of these fuzz ratio functions ultimately return a number. If you want the best-matching string itself, you have to choose the function that suits your data, score every candidate, and then pull out the winner yourself. Seen this way the degree of match is nicely quantified, but extracting the matched result is inconvenient, which is why the process module exists.
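To make the inconvenience concrete, here is a minimal sketch (the candidate list is the same small example used later in this article) of pulling out the best match by hand with fuzz.ratio. The process module shown next does essentially this kind of scoring loop for you, with its own default scorer:

from fuzzywuzzy import fuzz

choices = ["河南省", "郑州市", "湖北省", "武汉市"]
query = "郑州"

# score every candidate, then keep the one with the highest score
scores = [(c, fuzz.ratio(query, c)) for c in choices]
best = max(scores, key=lambda pair: pair[1])
print(best)  # ('郑州市', 80)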

2.2 process module

This module handles situations where there is a limited set of candidate answers; it returns the fuzzily matched string together with its similarity score.

2.2.1 extract extracts multiple pieces of data

Similar to select in a web scraper, it returns a list that can contain several matching results.

choices = ["河南省", "郑州市", "湖北省", "武汉市"]
process.extract("郑州", choices, limit=2)
>>> [('郑州市', 90), ('河南省', 0)]
# the data returned by extract is a list; even with limit=1 it is still a list — note the difference from extractOne below

2.2.2 extractOne extracts a piece of data

If you only want the single best match, use extractOne. Note that it returns a tuple, and the best-scoring result is not necessarily the data we actually want. You can get a feel for this from the example below and from the two practical applications.

process.extractOne("郑州", choices)
>>> ('郑州市', 90)
process.extractOne("北京", choices)
>>> ('湖北省', 45)

3. Practical application

Here are two small practical examples: the first is fuzzy matching on a company-name field, and the second is fuzzy matching on a province/city field.

3.1 Fuzzy matching of company name field

Our own data and the data to be matched look as follows: the company-name field we collected is very short and is not the company's full name, so the two fields need to be merged.

The code is wrapped directly in a function, mainly so that it can easily be called again later. The parameters are explained in detail here, and a minimal example call is sketched after the parameter list below.

3.1.1 explanation of parameters:

① The first parameter, df_1, is the left-hand data to be merged (here, the data variable).

② The second parameter, df_2, is the right-hand data to be matched against (here, the company variable).

③ The third parameter, key1, is the name of the column to be processed in df_1 (here, the 'company name' field in data).

④ The fourth parameter, key2, is the name of the column to match against in df_2 (here, the 'company name' field in company).

⑤ The fifth parameter, threshold, sets the acceptance standard for the extracted matches. This is the improvement over the extractOne method: the best-scoring result is not necessarily what we need, so a threshold is used to decide. Here the value is 90, and a match is accepted only if its score is greater than or equal to 90.

⑥ The sixth parameter, limit, defaults to returning only the two best matching results.

⑦ Return value: a new DataFrame, i.e. df_1 with a 'matches' column added.
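To make the parameters concrete, here is a minimal sketch of the call. The sample rows are hypothetical stand-ins for the data and company tables described above, and fuzzy_merge is the function whose full code is given in section 4:

import pandas as pd

# hypothetical stand-ins for the two tables described above
data = pd.DataFrame({'company name': ['腾讯', '阿里', '百度']})
company = pd.DataFrame({'company name': ['腾讯科技有限公司', '阿里巴巴集团', '百度在线网络技术有限公司']})

result = fuzzy_merge(data, company, 'company name', 'company name', threshold=90, limit=2)
print(result)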

3.1.2 Core Code explanation

The first part of the code is as follows. It simply uses the process.extract method explained above, so the returned result m is a list of nested tuples in the style of [('郑州市', 90), ('河南省', 0)], and this is the format of the data first written into the 'matches' column.

Note: the first element of each tuple is the matched string, and the second is the numeric score that is compared against the threshold parameter.

s = df_2[key2].tolist()
m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))
df_1['matches'] = m

The core code of the second part is as follows. With the above in mind, the data type in the 'matches' column is now clear, and the next step is to extract from it. Two points need attention in this processing:

① extract the successfully matched string, and fill in an empty value for rows whose score is below the threshold of 90;

② finally, write the extracted data back into the 'matches' column.

m2 = df_1['matches'].apply(lambda x: [i[0] for i in x if i[1] >= threshold][0] if len([i[0] for i in x if i[1] >= threshold]) > 0 else '')
# once you understand what the data in the 'matches' column looks like after the first step, this line is not hard to follow
# refer to this format: [('郑州市', 90), ('河南省', 0)]
df_1['matches'] = m2
return df_1

3.2 Fuzzy matching of the province field

Our own data and the data to be matched were shown in the background introduction above, and the fuzzy matching function has already been encapsulated, so here we simply call the function above with the corresponding parameters. The code and the execution result are as follows:
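As a minimal sketch (the column name 'province' and the sample rows are hypothetical stand-ins for the screenshots; the abbreviated and full province names follow the example from the preface):

import pandas as pd

# abbreviated provinces collected by ourselves, and the full official names to match against
data = pd.DataFrame({'province': ['北京', '广西', '新疆', '西藏']})
province = pd.DataFrame({'province': ['北京市', '广西壮族自治区', '新疆维吾尔自治区', '西藏自治区']})

# fuzzy_merge is the function encapsulated above (full code in section 4)
result = fuzzy_merge(data, province, 'province', 'province', threshold=90, limit=2)
print(result)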

Once the data processing is done, the encapsulated function can be placed in a file named after a self-defined module, so that it can easily be imported by name later. You can refer to the practice of packaging commonly used custom functions into module methods that can be called directly.
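A minimal sketch of that idea, assuming a hypothetical module file called my_fuzzy_utils.py that holds the fuzzy_merge function from section 4:

# my_fuzzy_utils.py -- hypothetical module file containing fuzzy_merge

# in any later notebook or script, the function can then be imported by name:
from my_fuzzy_utils import fuzzy_merge

df = fuzzy_merge(data, company, 'company name', 'company name', threshold=90)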

4. Complete function code

# Fuzzy matching
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=2):
    """
    :param df_1: the left table to join
    :param df_2: the right table to join
    :param key1: key column of the left table
    :param key2: key column of the right table
    :param threshold: how close the matches should be to return a match, based on Levenshtein distance
    :param limit: the amount of matches that will get returned, these are sorted high to low
    :return: dataframe with both keys and matches
    """
    s = df_2[key2].tolist()
    m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))
    df_1['matches'] = m
    m2 = df_1['matches'].apply(lambda x: [i[0] for i in x if i[1] >= threshold][0] if len([i[0] for i in x if i[1] >= threshold]) > 0 else '')
    df_1['matches'] = m2
    return df_1

df = fuzzy_merge(data, company, 'company name', 'company name', threshold=90)

That is all the content of the article "How to Implement Fuzzy Matching with Python + FuzzyWuzzy". Thank you for reading! I believe you will gain a lot from it. New knowledge is published every day; if you want to learn more, please follow the industry information channel.
