This article introduces the text deduplication principle of simhash and walks through a simple, practical way to apply it in Python.
Duplicate pages and texts are everywhere on the Internet. Whether for search engines and crawlers, for anti-piracy and content tracking on news and novel sites, or for deduplicating and clustering social-media text, pages and documents need to be deduplicated and filtered. An efficient deduplication algorithm is essential: without one, a crawler does a great deal of useless work, timeliness cannot be guaranteed, and, more importantly, the user experience suffers. The industry has many text-fingerprint deduplication algorithms, such as the k-shingle algorithm, the simhash algorithm proposed by Google, the MinHash algorithm, and Baidu's top-k longest-sentence signature algorithm. The simhash algorithm and its Python application are introduced below.
1. The difference between simhash and traditional hash
Simhash is an algorithm Google uses for massive-scale text deduplication. It converts a document into a 64-bit fingerprint, which we will call a signature. To judge whether two documents are duplicates, we only need the Hamming distance between their signatures. Empirically, when the Hamming distance between the signatures of two documents is no more than 3, the two documents can be judged similar.
A traditional hash algorithm is only responsible for mapping the original content to a signature value as uniformly and randomly as possible; in principle it is equivalent to a pseudo-random number generator. If two traditional hash signatures are equal, the original contents are, with high probability, equal; if they are not equal, the signatures provide no information beyond that inequality, because even a one-byte difference in the input is likely to produce a completely different signature. A traditional hash therefore cannot measure the similarity of the original content in the signature dimension. Simhash, by contrast, is a locality-sensitive hash: the signatures it generates reflect, to some degree, the similarity of the original content.
The problem we are solving is text similarity: comparing whether two articles are alike, and the reduced-dimension hash signature exists for exactly this purpose. Even after an article is reduced to a string of 0s and 1s, its simhash can still be used to compute similarity, whereas a traditional hash cannot. We can test this with two strings differing by a single word: "your mother told you to come home for dinner" and "your mother told you to go home for dinner".
The result calculated by simhash is as follows:
1000010010101101111111100000101011010001001111100001001011001011
1000010010101101011111100000101011010001001111100001101010001011
The results calculated by a traditional hash are:
0001000001100110100111011011110
1010010001111111110010110011101
As you can see, for similar texts only a few bits of the simhash change, while an ordinary hash offers no such property. This is the appeal of locality-sensitive hashing.
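To reproduce this behaviour yourself, here is a minimal sketch, assuming the third-party simhash package (installed in section 3 below) and using md5 as a stand-in for the traditional hash; the two sentences are the English renderings of the example above:

import hashlib

from simhash import Simhash

a = "your mother told you to come home for dinner"
b = "your mother told you to go home for dinner"

# simhash: the fingerprints of similar texts differ in only a few bits
print(format(Simhash(a).value, "064b"))
print(format(Simhash(b).value, "064b"))

# traditional hash (md5 here): changing one word scrambles the whole digest
print(hashlib.md5(a.encode()).hexdigest())
print(hashlib.md5(b.encode()).hexdigest())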
2. Main steps of simhash implementation
Given a new text, the first step is word segmentation, because we need to pick the top-N words to represent the text. The segmented words carry different amounts of weight; the tf-idf values computed over the corresponding data set can serve as the weights, turning the segmentation output into a weighted word list.
Next, every word is hashed to a binary hash value. Each bit is then weighted by the word's weight, adding the weight where the bit is 1 and subtracting it where the bit is 0, and finally the weighted values are accumulated per bit position and binarized.
2.1 Word segmentation
Word segmentation turns the text into a feature vector of keywords. Most segmentation schemes keep only content words, that is, the words left after stop words are removed; users can choose according to their own needs. The result is a word sequence free of noise, and each word is then assigned a weight. For example:
Walker AI / focuses / on / the game / field / years of / AI technology / accumulation / one-stop / provision / text / pictures / audio and video / content / review / game AI / and / data / platform / services
At this point the text has only been segmented, but the words carry different amounts of information. For example, "Walker AI", "game", and "review" express the main meaning of the text far better than "focuses" or "services"; this is the notion of information entropy.
For this reason we also need to set weights on the feature words. Weight can be expressed by absolute term frequency, i.e. the number of times a keyword appears, but in practice a word that appears less often may carry more information. In short, a sensible weighting scheme must be chosen, otherwise the quality of the result suffers.
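As a concrete illustration, here is a small sketch using the jieba library's TF-IDF interface to produce weighted keywords. Jieba is a segmenter built for Chinese text and is applied to an English sentence here purely for illustration; the topK value is an arbitrary choice, not something from the original text:

import jieba.analyse

text = ("Walker AI focuses on the game field, with years of AI technology "
        "accumulation, providing one-stop text, picture, audio and video "
        "content review, game AI and data platform services")

# extract_tags returns the top-N terms with their tf-idf weights;
# jieba filters stop words using its built-in list
for word, weight in jieba.analyse.extract_tags(text, topK=10, withWeight=True):
    print(word, round(weight, 3))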
2.2 Hashing and weighting
The previous steps, segmentation and weight assignment, turned the text into a set of weighted content words. Suppose weights are represented by integers from 1 to 5, with 1 the lowest and 5 the highest:
Walker AI (5) focuses (2) on (1) the game (3) field (1) years of AI technology (4) accumulation (1) one-stop (2) provision (1) text (2) pictures (2) audio and video (2) content (1) review (2) game AI (4) and (1) data (3) platform (1) services (2)
Compute the binary hash value of each feature word; then, bit by bit, add the word's weight where the hash bit is 1 and subtract it where the bit is 0; accumulate these contributions across all words; finally binarize the accumulated result, where a positive sum becomes 1 and everything else becomes 0, yielding the signature.
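Putting 2.1 and 2.2 together, here is a from-scratch sketch of the signature computation. It assumes the weighted words are already available; the 64-bit width follows the description above, while the truncated md5 as the per-word hash and the helper names are my own choices for illustration:

import hashlib

def simhash64(weighted_tokens):
    # weighted_tokens: iterable of (word, weight) pairs
    v = [0] * 64
    for word, weight in weighted_tokens:
        # 64-bit hash of the word (md5 truncated, for illustration)
        h = int(hashlib.md5(word.encode()).hexdigest(), 16) & ((1 << 64) - 1)
        for i in range(64):
            # add the weight where the bit is 1, subtract it where the bit is 0
            v[i] += weight if (h >> i) & 1 else -weight
    # binarize: a positive sum becomes 1, everything else becomes 0
    fingerprint = 0
    for i in range(64):
        if v[i] > 0:
            fingerprint |= 1 << i
    return fingerprint

tokens = [("Walker AI", 5), ("game", 3), ("review", 2), ("service", 2)]
print(format(simhash64(tokens), "064b"))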
2.3 Hamming distance
In information theory, the Hamming distance between two equal-length strings is the number of positions at which the corresponding characters differ. In other words, it is the number of substitutions needed to transform one string into the other.
The Hamming weight of a string is its Hamming distance from the all-zero string of the same length, that is, the number of non-zero elements in the string: for a binary string, this is the number of 1s, so the Hamming weight of 11101 is 4.
For binary strings a and b, the Hamming distance equals the number of 1s in a XOR b.
The Hamming distance is named after Richard Wesley Hamming, who first introduced the concept in his foundational paper on error-detecting and error-correcting codes.
In communications, the Hamming distance counts the bits flipped in a fixed-length binary word, so it is also called the signal distance. Hamming weight analysis is applied in many fields, including information theory, coding theory, and cryptography. To compare two strings of different lengths, however, insertions and deletions matter as well as substitutions, and more complex measures such as the edit distance are usually used instead.
After engineering validation, Google concluded that when the Hamming distance between two 64-bit simhash values is greater than 3, the documents can be considered not similar. The duplicate-detection problem is thereby transformed into a Hamming-distance problem over two hash values.
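In code, the Hamming distance of two fingerprints is just the popcount of their XOR, and the threshold test follows directly. A minimal sketch, using the threshold of 3 quoted above:

def hamming_distance(a, b):
    # number of differing bits = number of 1s in a XOR b
    return bin(a ^ b).count("1")

def is_near_duplicate(a, b, threshold=3):
    return hamming_distance(a, b) <= threshold

print(hamming_distance(0b11101, 0b00101))   # 2
print(is_near_duplicate(0b11101, 0b00101))  # True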
3. Python implementation
There are several simhash implementations in the pip repositories; the simhash package is easy to use and can be installed with pip directly:
pip install simhash
Usage example:
from simhash import Simhash

def simhash_demo(text1, text2):
    """Compute the similarity of two texts."""
    a_simhash = Simhash(text1)
    b_simhash = Simhash(text2)
    # bin() includes the '0b' prefix, so this is a rough normalizer
    max_hashbit = max(len(bin(a_simhash.value)), len(bin(b_simhash.value)))
    # Hamming distance between the two fingerprints
    distance = a_simhash.distance(b_simhash)
    print(distance)
    similar = 1 - distance / max_hashbit
    return similar

if __name__ == '__main__':
    text1 = "Walker AI is focused on the game field, with years of AI technology accumulation, one-stop provision of text, picture, audio and video content review, game AI and data platform services"
    text2 = "Walker AI focuses on the game field, with years of AI technology accumulation, two-stop provision of text, picture, audio and video content review, game AI and data platform services"
    similar = simhash_demo(text1, text2)
    print(similar)

This concludes the study of the text deduplication principle of simhash. Pairing the theory with a hands-on run is the best way to learn, so give it a try!