Today I'd like to talk about SimHash and how it can be used to identify near-duplicate information. Many people may not know much about it, so I have summarized the following; I hope you get something out of this article.
With the information explosion, the Internet is filled with a large amount of near-duplicate content, so identifying it effectively is a meaningful problem. For the crawler system of a search engine, for example, indexing duplicate web pages is pointless and only wastes storage and computing resources; at the same time, showing users repetitive results is a poor experience. Possible causes of near-duplicate web pages include:
Mirror sites
Copied content
Embedded advertisements
Counter changes
Minor edits
A simplified crawler system architecture is shown in the following figure:
In fact, most traditional methods of comparing two texts segment them, map the segments into feature vectors, and then measure the distance between the vectors, e.g. by Euclidean distance, hamming distance or cosine similarity. This works well for pairwise comparison, but its biggest drawback is that it does not scale to massive data. Imagine a large search engine like Google, which holds billions of documents and crawls millions of new pages into its index every day: if every page to be indexed had to have its cosine similarity computed against every record already in the library, the amount of computation would be frightening.
Instead, consider generating a fingerprint for each web document through hashing. A traditional cryptographic hash such as MD5 is designed to spread outputs as uniformly as possible: even a slight change in the input changes the hash drastically. What we want is the opposite: an ideal hash function that produces the same or similar hashcodes for nearly identical inputs, so that the similarity of the hashcodes directly reflects the similarity of the inputs. Clearly, a traditional hash like MD5 cannot meet this need.
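To see this avalanche behavior concretely, here is a quick check you can run in irb (actual digest values omitted; Digest::MD5 is in Ruby's standard library):

require 'digest'

# A one-character difference ("the" vs "a") yields two digests that
# share essentially nothing, which is exactly what we do NOT want
# for similarity detection.
Digest::MD5.hexdigest('the cat sat on the mat')
Digest::MD5.hexdigest('the cat sat on a mat')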
Simhash is a kind of locality sensitive hash, first proposed by Moses Charikar in "Similarity estimation techniques from rounding algorithms". Google builds on this algorithm to detect duplicate web documents. Consider the following three strings:
The cat sat on the mat
The cat sat on a mat
We all scream for ice cream
Using a traditional hash may produce something like the following:
irb(main):001:0> p1 = 'the cat sat on the mat'
irb(main):002:0> p2 = 'the cat sat on a mat'
irb(main):003:0> p3 = 'we all scream for ice cream'
irb(main):004:0> p1.hash
=> 415542861
irb(main):005:0> p2.hash
=> 668720516
irb(main):006:0> p3.hash
=> 767429688
Using simhash should produce results similar to the following (the binary form of each fingerprint is shown beneath it):
irb(main):001:0> p1.simhash
=> 851459198
00110010110000000011110001111110
irb(main):002:0> p2.simhash
=> 847263864
00110010100000000011100001111000
irb(main):003:0> p3.simhash
=> 984968088
00111010101101010110101110011000
Hamming distance is defined as the number of positions at which two binary strings differ. For the three simhash results above, the pairwise hamming distances are (p1, p2) = 4, (p1, p3) = 16 and (p2, p3) = 12. This matches the actual similarity between the texts: p1 and p2 are far more similar to each other than either is to p3.
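Computing the hamming distance of two integer fingerprints is cheap: XOR them and count the set bits. A minimal Ruby sketch, checked against the p1/p2 values above:

# XOR leaves a 1 exactly where the two fingerprints differ;
# counting those 1s gives the hamming distance.
def hamming_distance(a, b)
  (a ^ b).to_s(2).count('1')
end

hamming_distance(851459198, 847263864)  # => 4, i.e. p1 vs p2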
How is this hash computed? Taking the three texts above as an example, the whole process breaks down into six steps:
1. Choose the number of bits for the simhash, weighing storage cost against the size of the data set; say, 32 bits.
2. Initialize a vector of 32 counters, one per bit, to 0.
3. Extract features from the original text, generally by some form of segmentation. For "the cat sat on the mat", two-character shingling gives: {"th", "he", "e ", " c", "ca", "at", "t ", " s", "sa", " o", "on", "n ", " t", " m", "ma"}.
4. Compute the hashcode of each feature with a conventional 32-bit hash function, for example "th".hash = -502157718, "he".hash = -369049682, ...
5. For each bit of each feature's hashcode: if the bit is 1, add 1 to the corresponding counter; otherwise subtract 1.
6. Finally, for each of the 32 counters: if it is greater than 0, set the corresponding bit of the simhash to 1, otherwise set it to 0.
The whole process can be seen in the following figure:
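In code form, here is a minimal Ruby sketch of these six steps, assuming 32-bit fingerprints and two-character shingles; md5_32 is a hypothetical stand-in for any conventional 32-bit hash (Ruby's own String#hash is seeded per process, so it is not reproducible across runs):

require 'digest'

# Step 4 helper: low 32 bits of MD5 as a deterministic stand-in hash.
def md5_32(str)
  Digest::MD5.hexdigest(str)[0, 8].to_i(16)
end

def simhash32(text)
  v = Array.new(32, 0)                                    # step 2: counters
  text.chars.each_cons(2).map(&:join).each do |feature|   # step 3: shingles
    h = md5_32(feature)                                   # step 4: hash
    32.times { |i| v[i] += h[i] == 1 ? 1 : -1 }           # step 5: +1 / -1
  end
  # Step 6: fingerprint bit is 1 wherever the counter is positive.
  (0...32).reduce(0) { |fp, i| v[i] > 0 ? fp | (1 << i) : fp }
end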
A commonly cited rule of thumb, used in Google's near-duplicate detection work, is that texts whose 64-bit simhashes are within hamming distance 3 can be regarded as near-duplicates. Of course, the exact threshold needs to be determined from the specific business and empirical data.
A simhash generated this way can be used to compare the similarity of two texts. The question is how to extend it to near-duplicate detection over massive data: given the 64-bit simhash of a query text, how do we find all records within hamming distance 3 in a huge sample library (say, more than a million entries)? Before introducing the index structure used for simhash, consider two naive approaches. The first is to enumerate every variant of the query's 64-bit code within 3 bits and look each one up, which requires more than 40,000 queries. Refer to the following figure:
The other is to pre-generate, for every sample simhash already in the database, all codes within 3 bits of it, which occupies more than 40,000 times the original space. Refer to the figure below:
Obviously, neither method is workable: the first fails on time complexity, the second on space. We need an approach whose time complexity beats the former and whose space complexity beats the latter.
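The "more than 40,000" figure in both estimates is simply the count of 64-bit codes within hamming distance 3 of a given code; a quick Ruby check:

# C(64,1) + C(64,2) + C(64,3): the number of distinct codes obtained
# by flipping 1, 2 or 3 of the 64 bits.
def choose(n, k)
  (0...k).reduce(1) { |acc, i| acc * (n - i) / (i + 1) }
end

(1..3).sum { |k| choose(64, k) }  # => 43744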
Suppose we are looking for matches within hamming distance 3. By the pigeonhole principle, if we divide the 64-bit string into 4 blocks, then for any two codes within distance 3, at least one block must match exactly, as shown in the following figure:
Since we cannot know in advance which block will be the matching one, we store multiple copies of the table: 4 copies, each holding the 64-bit simhash codes permuted so that a different 16-bit block leads. For each query code, we collect candidate records by exactly matching the leading 16 bits in each table, as shown in the following figure:
Let's summarize the essence of the above algorithm:
1. Divide the 64-bit fingerprint equally into four 16-bit blocks
2. Permute the 64 bits so that each block in turn becomes the leading 16 bits; the four permutations give four copies of the table
3. In each table, find candidates by exactly matching the leading 16 bits, then verify each candidate with the full hamming distance (see the sketch after this list)
4. If the sample database holds 2^34 (about 17 billion) fingerprints, then, assuming they are evenly distributed, each table returns about 2^(34-16) = 262,144 candidates, which greatly reduces the number of full hamming-distance computations
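A compact Ruby sketch of this index; the class and method names are illustrative, not from any particular library:

class SimhashIndex
  BLOCKS = 4
  WIDTH  = 64 / BLOCKS  # 16 bits per block

  def initialize
    # One hash table per block position, keyed by that 16-bit block.
    @tables = Array.new(BLOCKS) { Hash.new { |h, k| h[k] = [] } }
  end

  def block(code, i)
    (code >> (i * WIDTH)) & ((1 << WIDTH) - 1)
  end

  def insert(code)
    BLOCKS.times { |i| @tables[i][block(code, i)] << code }
  end

  # Gather candidates that share at least one block exactly, then
  # verify each with the full hamming distance.
  def near_duplicates(code, k = 3)
    candidates = BLOCKS.times.flat_map { |i| @tables[i][block(code, i)] }.uniq
    candidates.select { |c| (c ^ code).to_s(2).count('1') <= k }
  end
end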
This approach can be generalized to other configurations, but keep the trade-off in mind: the number of tables grows as the number of candidates returned per table shrinks, so you cannot have both time efficiency and space efficiency, as shown in the following figure:
In fact, this is essentially what Google does every day to decide whether a newly crawled page duplicates anything in its library of billions of pages. Beyond deduplication, simhash can also be used for clustering, file compression and so on.