This article is about how to use MinHash. It is a very practical technique, so I am sharing it here; I hope you come away with something useful. Without further ado, let's take a look.
In practice, similarity measurement and computation come up all the time: web page deduplication, deciding whether two posts are similar, measuring the similarity of items or users in a recommendation system, and so on. When the data volume is large, time and space complexity become a real problem. For example, to find similar posts we could cluster with k-means, but the resource cost is enormous. So this article recommends a different approach: MinHash + LSH (locality-sensitive hashing), where MinHash reduces dimensionality and LSH performs the approximate query. The focus below is on MinHash.
Before introducing MinHash, we first need a measure of similarity.
1. Measuring similarity
There are many ways to measure similarity; Euclidean distance is commonly used. Here we use the Jaccard similarity coefficient: J(A, B) = |A ∩ B| / |A ∪ B|.
The calculation is simple: the number of words documents A and B have in common, divided by the size of the union of A's and B's words. For example, if A = {a, b, c, d} and B = {c, d, e, f}, then the similarity coefficient is 2/6 ≈ 0.33.
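As a minimal sketch of this formula, here is the Jaccard coefficient in Python for two word sets (the example sets are the ones above):

```python
# Jaccard similarity: |A ∩ B| / |A ∪ B| for two sets of words.
def jaccard(a, b):
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

print(jaccard({"a", "b", "c", "d"}, {"c", "d", "e", "f"}))  # 2/6 ≈ 0.33
```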
2. MinHash
In the similarity measure above we worked with documents and words. Documents are usually represented as a doc-term matrix, and the actual spelling of each term has no effect on the final result, so we simply use a row number to stand for each term; row numbers and terms correspond one to one. For example:
S1, S2, S3 in the first row are the documents, and 0 1 2 3 4 in the first column are the row numbers standing for words. In the rest of the matrix, a 1 means the document contains that word and a 0 means it does not. With this set-up, let's see what MinHash does.
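As a concrete illustration, here is a small sketch of building such a binary doc-term matrix in Python. The documents S1, S2, S3 and their words are made up for this example; the article's original table is not reproduced here.

```python
# Hypothetical documents, each represented by its set of words.
docs = {
    "S1": {"a", "d"},
    "S2": {"b", "d"},
    "S3": {"a", "c", "e"},
}
vocab = sorted(set().union(*docs.values()))          # one row number per word
row_of = {word: i for i, word in enumerate(vocab)}   # word -> row number

# matrix[row][doc] = 1 if the document contains the word, else 0
matrix = [[1 if vocab[r] in words else 0 for words in docs.values()]
          for r in range(len(vocab))]
for r, row in enumerate(matrix):
    print(r, row)
```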
Randomly pick a new order for the rows. For example, the order above is 0 1 2 3 4; we might randomly reorder it to 1 2 3 4 0. Note that this is random: the goal is to keep the final result free of human bias. The permuted matrix looks like this:
We see that the row numbers stay the same; what changes is the content of the matrix. The first row containing a 1 is used to represent the document. For example, if the first 1 in S1's column sits at row number 2, we use 2 to represent document S1; similarly, 1 represents S2. We can then compute the similarity coefficient of S1 and S2: |{2} ∩ {1}| / |{2} ∪ {1}| = 0.
A proof of why this method is reasonable comes later; let's skip ahead for now. Representing a document by a single word is clearly too accidental, so the natural idea is to randomly generate multiple permutations and take out multiple words for comparison. Here the problem appears: in practice there may be millions of documents and tens of thousands of words, and repeatedly permuting such a huge matrix costs a lot of time and space. Is there another way? Yes. Motion is relative: before, we moved the matrix contents while keeping the row numbers fixed; now we keep the matrix fixed and only change the row numbers, which is far less computation.
So the problem turns into how to generate random row numbers. We can generate a row-number ordering with hash functions. The functions can be chosen freely; it is best if the hashed values are spread out uniformly. For example, (x + 1) mod 5 and (3x + 1) mod 5. We pick these two hash functions to generate row-number orderings, as in the table below.
We use two hash functions, h2 and h3, to generate two sequences of row numbers. The next step is the key one.
For example, to compute the value for document S1, traverse S1's column from row 0 to row 4:
1. Row 0 is 1. Looking at h2, the computed row number is h2(0) = 1, so assign 1 to h2. Continue traversing.
2. Row 1 is 0: don't care, skip.
3. Row 2 is 0: don't care, skip.
4. Row 3 is 1. Looking at h2, the computed row number is h2(3) = 4. 4 is greater than the current value of h2, so h2 stays unchanged. (If it were less than the current value of h2, we would assign it to h2.)
5. Row 4 is 0: don't care, skip.
After the traversal, the value of h2 is 1. As you can see, what we are actually doing is walking through the values in the matrix, skipping the zeros, and for each 1 looking at the row number generated by the hash function, keeping the smallest such value as the output of h2. h3 works the same way, and in the end we get a signature matrix such as the following.
At this point we can compute the similarity of S1 and S2 from their signatures: J = 0/3 = 0.
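Below is a short sketch of this hash-based computation in Python. The 0/1 matrix is hypothetical, since the article's original table is not reproduced here, but the two hash functions (x + 1) mod 5 and (3x + 1) mod 5 follow the text, and the loop skips zeros and keeps the smallest hashed row number, exactly as in the walk-through.

```python
matrix = [          # rows are word line numbers 0..4, columns are S1, S2, S3
    [1, 0, 1],      # row 0
    [0, 1, 1],      # row 1
    [0, 0, 0],      # row 2
    [1, 0, 1],      # row 3
    [0, 1, 0],      # row 4
]
hash_funcs = [lambda x: (x + 1) % 5, lambda x: (3 * x + 1) % 5]  # h2, h3

n_docs = len(matrix[0])
INF = float("inf")
# sig[i][d] = smallest hash_funcs[i](row) over the rows where document d has a 1
sig = [[INF] * n_docs for _ in hash_funcs]
for row, values in enumerate(matrix):
    hashes = [h(row) for h in hash_funcs]
    for d, v in enumerate(values):
        if v == 1:                       # skip zeros, as in the walk-through
            for i, hv in enumerate(hashes):
                if hv < sig[i][d]:
                    sig[i][d] = hv

print(sig)
# Estimated similarity of S1 and S2 = fraction of signature rows that agree
agree = sum(sig[i][0] == sig[i][1] for i in range(len(hash_funcs)))
print(agree / len(hash_funcs))
```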
3. Why MinHash's approach is reasonable
Claim: for a random row permutation, the probability that the MinHash values of two sets are equal is equal to the Jaccard similarity of the two sets.
Proof sketch:
Consider two sets A and B. Each row of the matrix is in one of three states:
X: A and B are both 1, i.e. both sets contain this word.
Y: exactly one of A and B is 1, i.e. one set contains this word and the other does not.
Z: A and B are both 0, i.e. neither set contains this word.
Let x, y, z be the numbers of rows of types X, Y, Z. The Jaccard coefficient is x/(x+y), since type-Z rows contribute to neither the intersection nor the union. Now look at MinHash: the MinHash values of A and B are equal exactly when the first non-Z row in the random order is of type X. Since the permutation is random, the probability of encountering an X row before a Y row is x/(x+y), which is exactly the value of the Jaccard coefficient.
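A quick way to convince yourself of this claim is to simulate it: draw many random permutations and check how often the MinHash values of two sets coincide. The sets A and B below are arbitrary examples.

```python
import random

A = {0, 1, 3, 5, 8}
B = {1, 2, 3, 8, 9}
universe = list(range(10))
jaccard = len(A & B) / len(A | B)

trials, equal = 20000, 0
for _ in range(trials):
    perm = universe[:]
    random.shuffle(perm)                       # random row permutation
    rank = {row: i for i, row in enumerate(perm)}
    minhash_a = min(rank[r] for r in A)        # first row of A in the new order
    minhash_b = min(rank[r] for r in B)
    equal += (minhash_a == minhash_b)

print(jaccard, equal / trials)   # the two numbers should be close
```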
4. How to query for similar documents
With the method above we have compressed each document: using, say, 30 hash functions, a document is compressed into a signature of 30 values. The next question is how to query.
One idea is to build an inverted index, where the term is a single word and the doclist is the documents containing that word.
Another idea is not to build an inverted index over single words, but a joint inverted index over groups of values, like this:
A document is expressed as 30 signature values as above; the 30 values are split into m buckets of r values each, so m*r = 30. The inverted index is then built with the r values (or a hash code of the r values) as the term, and the documents whose bucket contains those r values as the doclist.
The question now is how to choose m and r: which values work well?
Suppose two documents have similarity p. Then the probability that they agree on any given signature position is also p, so the probability that they are identical within one bucket is p^r, the probability that they differ in that bucket is 1 - p^r, and the probability that they differ in all m buckets is (1 - p^r)^m. So the probability that at least one bucket matches is 1 - (1 - p^r)^m, and we can choose m and r according to the similarity p we want to catch.
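As a rough illustration, the sketch below evaluates 1 - (1 - p^r)^m for a few ways of splitting a 30-value signature; the specific (r, m) pairs are just examples.

```python
# Probability that two documents with similarity p share at least one bucket.
def candidate_prob(p, r, m):
    return 1 - (1 - p ** r) ** m

for r, m in [(2, 15), (3, 10), (5, 6)]:        # r * m = 30 in every case
    probs = [round(candidate_prob(p / 10, r, m), 3) for p in range(1, 10)]
    print(f"r={r}, m={m}: {probs}")
```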
The final inverted index looks like this:
Bucket 1-->doc1, doc2, doc3, doc4
Bucket 2-->doc2, doc5, doc9, doc10
Once the index is built, retrieving a new document works the same way: run it through the whole pipeline to get its buckets, say bucket 1 and bucket 3, then query with those buckets to get the documents that are candidates for being similar. To make sure they really are similar, you can also compute the actual similarity between each candidate document and the query document.
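A minimal sketch of this bucket-based index and query might look like the following. The band size r, the number of buckets m, and the assumption that each document already has a fixed-length MinHash signature are illustrative choices, not the article's exact implementation.

```python
from collections import defaultdict

R, M = 5, 6                                    # r values per bucket, m buckets

def band_keys(sig):
    """Yield one hashable key per bucket of a length R*M signature."""
    for b in range(M):
        yield b, tuple(sig[b * R:(b + 1) * R])

def build_index(signatures):
    """signatures: dict of doc_id -> MinHash signature of length R * M."""
    index = defaultdict(set)
    for doc_id, sig in signatures.items():
        for key in band_keys(sig):
            index[key].add(doc_id)
    return index

def query(index, sig):
    """Return candidate documents sharing at least one bucket with `sig`."""
    candidates = set()
    for key in band_keys(sig):
        candidates |= index.get(key, set())
    return candidates   # then verify candidates with the real similarity
```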
That is how MinHash should be used. Much of this comes up in everyday work, and I hope you have learned something from this article.