How to use ClickHouse to quickly judge the similarity between two sets

2025-07-11 Update. From: SLTechnology News&Howtos, Shulou (Shulou.com), 06/02 report.

This article explains how to use ClickHouse to quickly judge the similarity of two sets, in this case two pieces of text. The method introduced here is simple, fast and practical.

In business applications we often need to check for duplicates: for example, given a text string, determine whether any existing document is similar to it.

There are many ways to do this. One efficient way is to use SimHash to reduce each text to a hash value (a fixed-length bit fingerprint), and then use the Hamming distance between two hash values to compare their similarity.

SimHash is a locality-sensitive hashing algorithm: similar inputs produce hash values that differ in only a few bits. This makes it especially suitable for scenarios with massive data.
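To make the idea concrete, here is a minimal Python sketch of a SimHash over character n-grams. It is only an illustration under stated assumptions: the name `toy_simhash`, the 3-character n-grams, the 64-bit width and the use of BLAKE2b as the per-n-gram hash are all choices made for this example. It is not ClickHouse's implementation, so its values will not match those of the ClickHouse functions.

```python
import hashlib


def toy_simhash(text: str, ngram: int = 3, bits: int = 64) -> int:
    """Toy SimHash over character n-grams (hypothetical helper, not ClickHouse's code)."""
    weights = [0] * bits
    # Slide a window of `ngram` characters over the text (at least one window).
    for i in range(max(len(text) - ngram + 1, 1)):
        gram = text[i:i + ngram]
        digest = hashlib.blake2b(gram.encode("utf-8"), digest_size=bits // 8).digest()
        value = int.from_bytes(digest, "big")
        # Each n-gram votes +1 or -1 on every bit position.
        for bit in range(bits):
            weights[bit] += 1 if (value >> bit) & 1 else -1
    # A bit of the fingerprint is set when the votes for it are positive.
    fingerprint = 0
    for bit, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << bit
    return fingerprint


original = ("the traditional hash algorithm is only responsible for mapping the "
            "original content to a signature value as evenly and randomly as possible, "
            "which is equivalent to a pseudo-random number generation algorithm in principle.")
edited = ("the traditional hash algorithm is only responsible for mapping the "
          "original content to one as evenly and randomly as possible, "
          "which is equivalent to a pseudo-random number generation algorithm in principle.")
unrelated = ("SimHash itself is a locally sensitive hash algorithm, and the Hash signature "
             "generated by it can represent the similarity of the original content to a certain extent.")

base = toy_simhash(original)
for label, other in (("same", original), ("edited", edited), ("unrelated", unrelated)):
    # Number of differing bits between the two fingerprints.
    distance = bin(base ^ toy_simhash(other)).count("1")
    print(label, distance)
```

Because every n-gram only nudges the per-bit vote counts, changing a few words flips few bits, while an unrelated text should differ in many more.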

It just so happens that functions for SimHash and Hamming distance (along with MinHash variants) are already built into ClickHouse. The relevant PR is here:

https://github.com/ClickHouse/ClickHouse/pull/7649

Next, let's try it on an example.

Prepare four text strings and use the ngramSimHash function to calculate their hash values:

SELECT
    -- The string literals are translated, so the hash values you get may differ from the output shown.
    -- sh2 and sh3 hash the same text.
    ngramSimHash('the traditional hash algorithm is only responsible for mapping the original content to a signature value as evenly and randomly as possible, which is equivalent to a pseudo-random number generation algorithm in principle.') AS sh2,
    ngramSimHash('the traditional hash algorithm is only responsible for mapping the original content to a signature value as evenly and randomly as possible, which is equivalent to a pseudo-random number generation algorithm in principle.') AS sh3,
    -- sh4 drops a few words from that text.
    ngramSimHash('the traditional hash algorithm is only responsible for mapping the original content to one as evenly and randomly as possible, which is equivalent to a pseudo-random number generation algorithm in principle.') AS sh4,
    -- sh5 is an unrelated sentence.
    ngramSimHash('SimHash itself is a locally sensitive hash algorithm, and the Hash signature generated by it can represent the similarity of the original content to a certain extent.') AS sh5

Query id: 7cf4a1d1-266f-4638-a75c-88ab1d93dbdf

┌──────sh2─┬──────sh3─┬──────sh4─┬───────sh5─┐
│ 20645847 │ 20645847 │ 54200087 │ 957490773 │
└──────────┴──────────┴──────────┴───────────┘

1 rows in set. Elapsed: 0.004 sec.

The hash values show that sh2 and sh3 are identical texts, while sh4 and sh5 both differ from sh2. The raw hash values alone, however, do not tell us how similar the texts are. For that we need the Hamming distance, the number of bit positions in which two hash values differ.

Use the bitHammingDistance function to calculate the Hamming distance between the hash values:

SELECT
    bitHammingDistance(sh2, sh3) AS `1and2`,
    bitHammingDistance(sh2, sh4) AS `1and3`,
    bitHammingDistance(sh2, sh5) AS `1and4`
FROM
(
    SELECT
        ngramSimHash('the traditional hash algorithm is only responsible for mapping the original content to a signature value as evenly and randomly as possible, which is equivalent to a pseudo-random number generation algorithm in principle.') AS sh2,
        ngramSimHash('the traditional hash algorithm is only responsible for mapping the original content to a signature value as evenly and randomly as possible, which is equivalent to a pseudo-random number generation algorithm in principle.') AS sh3,
        ngramSimHash('the traditional hash algorithm is only responsible for mapping the original content to one as evenly and randomly as possible, which is equivalent to a pseudo-random number generation algorithm in principle.') AS sh4,
        ngramSimHash('SimHash itself is a locally sensitive hash algorithm, and the Hash signature generated by it can represent the similarity of the original content to a certain extent.') AS sh5
)

Query id: c5b24238-cf85-4eb0-a77c-0b82a888a439

┌─1and2─┬─1and3─┬─1and4─┐
│     0 │     3 │    10 │
└───────┴───────┴───────┘

1 rows in set. Elapsed: 0.004 sec.
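These distances can be checked by hand from the hash values printed earlier: the Hamming distance of two integers is the number of 1 bits in their XOR. The Python sketch below recomputes it for the four values from the first result table. The helper name `bit_hamming_distance` is made up for this example and only mirrors what the ClickHouse function is described as doing.

```python
def bit_hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two non-negative integers differ."""
    return bin(a ^ b).count("1")


# Hash values copied from the ngramSimHash result table.
sh2, sh3, sh4, sh5 = 20645847, 20645847, 54200087, 957490773

print(bit_hamming_distance(sh2, sh3))  # 0
print(bit_hamming_distance(sh2, sh4))  # 3
print(bit_hamming_distance(sh2, sh5))  # 10

# The XOR of sh2 and sh4 has exactly three bits set.
print(hex(sh2 ^ sh4))  # 0x20000c0
```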

From the results, we can see:

The Hamming distance between sh2 and sh3 is 0, so there is no difference between them.

The distance between sh2 and sh4 is 3. As a rule of thumb, two texts whose distance is 3 or less are very similar.

The distance between sh2 and sh5 is 10, much greater than 3, so they are different.
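The rule of thumb can be turned into a simple near-duplicate filter. The sketch below is a hypothetical Python helper, not part of ClickHouse. It assumes that "very similar" means a distance of 3 or less, since the example treats a distance of 3 as similar, and the threshold should be tuned for real data.

```python
from itertools import combinations
from typing import Dict, List, Tuple

MAX_DISTANCE = 3  # assumed rule-of-thumb threshold; tune for your own data


def near_duplicate_pairs(fingerprints: Dict[str, int],
                         max_distance: int = MAX_DISTANCE) -> List[Tuple[str, str, int]]:
    """Return every pair of fingerprints whose Hamming distance is within the threshold."""
    pairs = []
    for (name_a, fp_a), (name_b, fp_b) in combinations(fingerprints.items(), 2):
        distance = bin(fp_a ^ fp_b).count("1")  # differing bits
        if distance <= max_distance:
            pairs.append((name_a, name_b, distance))
    return pairs


# Hash values from the example's ngramSimHash output.
fingerprints = {"sh2": 20645847, "sh3": 20645847, "sh4": 54200087, "sh5": 957490773}
for pair in near_duplicate_pairs(fingerprints):
    print(pair)
# ('sh2', 'sh3', 0)
# ('sh2', 'sh4', 3)
# ('sh3', 'sh4', 3)
```

Comparing every pair is quadratic in the number of texts, so this form only suits small batches. It is meant to show the threshold logic, not a scalable search.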
