
Case Analysis of URL De-duplication Methods


This article mainly introduces several methods for URL de-duplication with worked examples. Many people have doubts about how to judge whether a URL has already been seen, so the following sections collect simple, easy-to-use approaches and analyze each one in turn. I hope it helps answer those doubts; please follow along and study!

The general idea of URL de-duplication

Regardless of the business scenario and the amount of data, any of the following schemes can be used to judge whether a URL is a duplicate:

Use a Java Set collection and judge by the result of add() (if the add succeeds, the URL is not a duplicate).

Use a Redis Set and judge by the result of the add operation in the same way.

Store all URLs in a database, then check for an existing row with a SQL query.

Set a unique index on the URL column in the database and judge by whether the insert succeeds.

Use Guava's Bloom filter to judge whether a URL has been seen.

Use Redis's Bloom filter to judge whether a URL has been seen.

The specific implementation of each of these schemes is described below.

URL de-duplication implementations

1. Using Java's Set collection

A Set collection is inherently duplicate-free: it can only hold distinct values, and adding an element that is already present fails. We can therefore judge whether a URL is a duplicate by the return value of add(). The implementation code is as follows:

import java.util.HashSet;
import java.util.Set;

public class URLRepeat {
    // URLs to be de-duplicated
    public static final String[] URLS = {"www.apigo.cn", "www.baidu.com", "www.apigo.cn"};

    public static void main(String[] args) {
        Set<String> set = new HashSet<>();
        for (int i = 0; i < URLS.length; i++) {
            String url = URLS[i];
            boolean result = set.add(url);
            if (!result) {
                // duplicate URL
                System.out.println("URL already exists: " + url);
            }
        }
    }
}

The execution result of the program is:

URL already exists: www.apigo.cn

From this result we can see that a Set collection can indeed be used to detect duplicate URLs.

2. Using Redis's Set collection

The idea of using Redis's Set is the same as using Java's Set: both rely on the fact that a Set cannot hold duplicates. Let's first use Redis's command-line client, redis-cli, to demonstrate URL duplicate detection.
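The original screenshot of the redis-cli session is not included here; a minimal sketch of what it looks like (the set key urlset is an arbitrary name chosen for this example) is:

127.0.0.1:6379> sadd urlset www.apigo.cn
(integer) 1
127.0.0.1:6379> sadd urlset www.baidu.com
(integer) 1
127.0.0.1:6379> sadd urlset www.apigo.cn
(integer) 0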

As the results show, when the add succeeds (the result is 1) the URL is not a duplicate; when it fails (the result is 0), the URL already exists in the set.

Now let's implement Redis Set de-duplication in code:

// URLs to be de-duplicated
public static final String[] URLS = {"www.apigo.cn", "www.baidu.com", "www.apigo.cn"};

@Autowired
RedisTemplate redisTemplate;

@RequestMapping("/url")
public void urlRepeat() {
    for (int i = 0; i < URLS.length; i++) {
        String url = URLS[i];
        // add returns the number of elements actually added; 0 means the URL was already in the set
        Long result = redisTemplate.opsForSet().add("urlrepeat", url);
        if (result != null && result == 0) {
            // duplicate URL
            System.out.println("URL already exists: " + url);
        }
    }
}

The result of running the program is as follows:

URL already exists: www.apigo.cn

In the above code we use RedisTemplate from Spring Data. To use the RedisTemplate object in a Spring Boot project, we first need to add the spring-boot-starter-data-redis dependency. The Maven configuration is as follows:

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-data-redis</artifactId>
</dependency>

Then configure the Redis connection information in the project's application.properties:

spring.redis.host=127.0.0.1
spring.redis.port=6379
# spring.redis.password=123456  # Redis server password; configure this only if your Redis requires a password

After these two steps, the RedisTemplate object can be used to operate Redis in the Spring Boot project.

3. Database deduplication

We can also use a database to judge whether a URL is a duplicate. First, let's design a table to store the URLs.

The corresponding SQL for this table is as follows:

/* Table: urlinfo */
create table urlinfo (
   id    int not null auto_increment,
   url   varchar(1000),
   ctime date,
   del   boolean,
   primary key (id)
);

/* Index: Index_url */
create index Index_url on urlinfo (url);

Here id is the auto-increment primary key, and the url column is indexed so that queries on it are fast.

First, let's add two rows of test data to the table.

Then we query for the URL with a SQL statement.
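The original screenshots of these statements are not included; a sketch of what they might look like, using the urlinfo table defined above and illustrative values, is:

-- insert two rows of test data (values are illustrative)
insert into urlinfo (url, ctime, del) values ('www.apigo.cn', curdate(), false);
insert into urlinfo (url, ctime, del) values ('www.baidu.com', curdate(), false);

-- check whether a URL already exists
select count(1) from urlinfo where url = 'www.apigo.cn';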

If the count is greater than 0, the URL already exists; otherwise it is not a duplicate.

4. Unique index de-duplication

We can also use a database unique index to prevent duplicate URLs, which is very similar in spirit to the Set-collection approach above.

First we create a unique index on the url column, then insert URL rows. If the insert succeeds, the URL is not a duplicate; if it fails with a duplicate-key error, the URL already exists.

The SQL implementation for creating a unique index is as follows:

create unique index Index_url on urlinfo (url);

5. Using Guava's Bloom filter

The Bloom filter was proposed by Burton Bloom in 1970. It is essentially a very long binary vector together with a series of random mapping (hash) functions, and it is used to test whether an element is in a set. Its advantage is that its space efficiency and query time are far better than those of general-purpose algorithms; its disadvantages are a certain false-positive rate and the difficulty of deleting elements.

The core of a Bloom filter is a very large bit array and several hash functions. Assume the bit array has length m and there are k hash functions.

Here is an example of the workflow. Suppose the set contains three elements {x, y, z} and there are 3 hash functions. First the bit array is initialized, with every bit set to 0. Each element of the set is then passed through the three hash functions in turn; each hash value corresponds to a position in the bit array, and that position is set to 1. To query whether an element w is in the set, w is hashed in the same way to three positions in the array. If any of those three positions is not 1, the element is definitely not in the set. Conversely, if all three positions are 1, the element may be in the set, but this is not certain. For example, suppose an element maps to positions 4, 5 and 6 and all three bits are 1: those bits may each have been set by different elements, so the element looks present even though it was never added. This overlap is the reason the false-positive rate exists.
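For reference (this formula does not appear in the original article), the standard approximation for the false-positive probability after inserting n elements into an m-bit filter with k hash functions is:

p \approx \left(1 - e^{-kn/m}\right)^{k}, \qquad k_{\text{opt}} \approx \frac{m}{n}\ln 2

so choosing a large enough bit array and a suitable number of hash functions keeps the misjudgment rate acceptably low.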

We can use Google's Guava library to work with a Bloom filter. First add the Guava dependency to pom.xml:

<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>28.2-jre</version>
</dependency>

The implementation code for judging duplicate URLs:

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.Charset;

public class URLRepeat {
    // URLs to be de-duplicated
    public static final String[] URLS = {"www.apigo.cn", "www.baidu.com", "www.apigo.cn"};

    public static void main(String[] args) {
        // create a Bloom filter
        BloomFilter<String> filter = BloomFilter.create(
                Funnels.stringFunnel(Charset.defaultCharset()),
                10,     // expected number of elements to be processed
                0.01);  // expected false-positive probability (value assumed; not legible in the original)
        for (int i = 0; i < URLS.length; i++) {
            String url = URLS[i];
            if (filter.mightContain(url)) {
                // probably a duplicate URL
                System.out.println("URL already exists: " + url);
            } else {
                // store the URL in the Bloom filter
                filter.put(url);
            }
        }
    }
}

The result of running the program is as follows:

URL already exists: www.apigo.cn

6. Using Redis's Bloom filter

Besides Guava's Bloom filter, we can also use Redis's Bloom filter for URL duplicate detection. Before using it, make sure the Redis server version is 4.0 or later (the Bloom filter module is only supported from that version on) and that the Bloom filter module has been loaded.

Taking Docker as an example, let's install and enable the Redis Bloom filter: first download the Bloom filter module, then enable it when the Redis service starts.
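The exact commands are not shown in the original; one common way to do this, assuming the redislabs/rebloom Docker image (which bundles Redis with the RedisBloom module; newer setups often use redis-stack instead), is:

docker pull redislabs/rebloom:latest
docker run -d --name redis-bloom -p 6379:6379 redislabs/rebloom:latest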

Using the Bloom filter: once the module is enabled, we can first use the redis-cli client to do URL duplicate detection with the Bloom filter. The commands look like this:
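The original screenshot is not included; a minimal sketch (the key urlkey is an arbitrary name) looks like this, where bf.add returns 1 when the element is newly added and 0 when it probably already exists:

127.0.0.1:6379> bf.add urlkey www.apigo.cn
(integer) 1
127.0.0.1:6379> bf.add urlkey www.baidu.com
(integer) 1
127.0.0.1:6379> bf.add urlkey www.apigo.cn
(integer) 0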

Redis has only a few Bloom filter commands, mainly the following:

- bf.add: add an element

- bf.exists: determine whether an element exists

- bf.madd: add multiple elements

- bf.mexists: determine whether multiple elements exist

- bf.reserve: set the accuracy (error rate and capacity) of the Bloom filter

Let's use the code to demonstrate the use of the Redis Bloom filter:

import redis.clients.jedis.Jedis;
import utils.JedisUtils;
import java.util.Arrays;

public class BloomExample {
    // Bloom filter key
    private static final String _KEY = "URLREPEAT_KEY";
    // URLs to be de-duplicated
    public static final String[] URLS = {"www.apigo.cn", "www.baidu.com", "www.apigo.cn"};

    public static void main(String[] args) {
        Jedis jedis = JedisUtils.getJedis();
        for (int i = 0; i < URLS.length; i++) {
            String url = URLS[i];
            boolean exists = bfExists(jedis, _KEY, url);
            if (exists) {
                // duplicate URL
                System.out.println("URL already exists: " + url);
            } else {
                bfAdd(jedis, _KEY, url);
            }
        }
    }

    /**
     * Add an element
     * @param jedis Redis client
     * @param key   key
     * @param value value
     * @return boolean
     */
    public static boolean bfAdd(Jedis jedis, String key, String value) {
        String luaStr = "return redis.call('bf.add', KEYS[1], KEYS[2])";
        Object result = jedis.eval(luaStr, Arrays.asList(key, value), Arrays.asList());
        if (result.equals(1L)) {
            return true;
        }
        return false;
    }

    /**
     * Query whether an element exists
     * @param jedis Redis client
     * @param key   key
     * @param value value
     * @return boolean
     */
    public static boolean bfExists(Jedis jedis, String key, String value) {
        String luaStr = "return redis.call('bf.exists', KEYS[1], KEYS[2])";
        Object result = jedis.eval(luaStr, Arrays.asList(key, value), Arrays.asList());
        if (result.equals(1L)) {
            return true;
        }
        return false;
    }
}

The result of running the program is as follows:

URL already exists: www.apigo.cn

At this point, this case analysis of URL de-duplication methods is complete. I hope it has resolved your doubts. Combining theory with practice is the best way to learn, so go and try it out! If you want to keep learning more related knowledge, please continue to follow the site for more practical articles.
