What are the methods of URL deduplication


2025-07-13 Update From: SLTechnology News&Howtos shulou

Shulou(Shulou.com)06/01 Report--

This article summarizes the common ways to deduplicate URLs, with the steps and sample code for each, so that it can serve as a practical reference.

URL deduplication comes up often in daily work and in interviews.

Well-known Internet companies such as Alibaba, NetEase Cloud, Youku, and Zuoyebang have all asked similar interview questions, and problems close to URL deduplication, such as IP blacklist/whitelist checks, appear regularly in our work. So let's walk through the URL deduplication problem.

URL deduplication approaches

Setting aside the business scenario and the amount of data, we can use any of the following schemes to check whether a URL is a duplicate:

1. Use Java's Set collection and judge by the result of the add (a successful add means the URL is not a duplicate).
2. Use a Set in Redis and judge by the result of the add.
3. Store all URLs in a database, then use a SQL statement to check whether a duplicate URL exists.
4. Set a unique index on the URL column in the database and judge by the result of the insert.
5. Use Guava's Bloom filter to detect duplicate URLs.
6. Use Redis's Bloom filter to detect duplicate URLs.

The specific implementation of the above scheme is as follows.

URL deduplication implementations

1. Deduplication using Java's Set collection

A Set does not allow duplicates by design: it only stores distinct elements, and adding a value that is already present fails (add returns false). We can therefore tell whether a URL is a duplicate from the result of adding it to the Set. The implementation code is as follows:

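import java.util.HashSet;
import java.util.Set;

public class URLRepeat {
    // URLs to deduplicate
    public static final String[] URLS = {
            "www.apigo.cn",
            "www.baidu.com",
            "www.apigo.cn"
    };

    public static void main(String[] args) {
        Set<String> set = new HashSet<>();
        for (int i = 0; i < URLS.length; i++) {
            String url = URLS[i];
            // add() returns false when the element is already in the set
            boolean result = set.add(url);
            if (!result) {
                // duplicate URL
                System.out.println("URL already exists: " + url);
            }
        }
    }
}

The output of the program is:

URL already exists: www.apigo.cn

As the result shows, a Set collection can be used to detect duplicate URLs.

2. Deduplication using a Redis Set

The idea behind using a Redis Set is the same as with Java's Set: both rely on the fact that a Set cannot hold duplicates. Let's first use the Redis client redis-cli to try out URL duplicate detection.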

When the add succeeds (SADD returns 1), the URL is not a duplicate; when the add fails (the result is 0), the URL already exists.

Now let's implement Redis Set deduplication in code. The implementation code is as follows:

// URLs to deduplicate
public static final String[] URLS = {
        "www.apigo.cn",
        "www.baidu.com",
        "www.apigo.cn"
};

@Autowired
RedisTemplate redisTemplate;

@RequestMapping("/url")
public void urlRepeat() {
    for (int i = 0; i < URLS.length; i++) {
        String url = URLS[i];
        // add() returns the number of elements actually added to the set
        Long result = redisTemplate.opsForSet().add("urlrepeat", url);
        if (result != null && result == 0) {
            // duplicate URL
            System.out.println("URL already exists: " + url);
        }
    }
}

The output of the above program is:

URL already exists: www.apigo.cn

The code above relies on RedisTemplate from Spring Data. To use the RedisTemplate object in a Spring Boot project, we first need to add the spring-boot-starter-data-redis dependency, configured as follows:

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-data-redis</artifactId>
</dependency>

Then configure the Redis connection in the project by adding the following to application.properties:

spring.redis.host=127.0.0.1
spring.redis.port=6379
# Redis server password; configure this if a password is set
#spring.redis.password=123456

After these two steps, we can use the RedisTemplate object to operate Redis in a Spring Boot project.

3. Database deduplication

We can also use a database to detect duplicate URLs. First, let's design a table to store the URLs.

The SQL for this table is as follows:

/* Table: urlinfo */
create table urlinfo
(
   id    int not null auto_increment,
   url   varchar(1000),
   ctime date,
   del   boolean,
   primary key (id)
);

/* Index: Index_url */
create index Index_url on urlinfo
(
   url
);

Here id is the auto-increment primary key, and the url column is indexed to speed up queries.

Let's first add two rows of test data to the table, then use a SQL statement to count the rows whose url matches the URL being checked.

If the result is greater than 0, the URL is already stored and is a duplicate; otherwise it is not a duplicate.

4. Unique index de-duplication

We can also use a unique index in the database to prevent duplicate URLs, which is very similar to the earlier Set-based idea.

First we set a unique index on the url column, then insert the URL data. If the insert succeeds, the URL is not a duplicate; if it fails, the URL is a duplicate.

The SQL for creating the unique index is as follows:

create unique index Index_url on urlinfo
(
   url
);

5. Deduplication using Guava's Bloom filter

The Bloom filter was proposed by Bloom in 1970. It consists of a very long binary vector and a series of random mapping functions, and it is used to test whether an element is in a set. Its advantage is that its space efficiency and query time are far better than those of general-purpose algorithms; its disadvantages are a certain false-positive rate and the difficulty of deleting elements.

The core implementation of the Bloom filter is a very large bit array and several hash functions, assuming that the length of the bit array is m and the number of hash functions is k.
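To make m and k concrete, here is a minimal sketch of the standard textbook sizing formulas: m = -n ln p / (ln 2)^2 for the bit-array length, k = (m / n) ln 2 for the number of hash functions, and p ≈ (1 - e^(-kn/m))^k for the resulting false-positive rate, where n is the expected number of elements and p the target false-positive rate. The class and method names are hypothetical and chosen only for illustration; this is not necessarily how any particular library computes its sizes.

```java
final class BloomSizing {

    /** Bit-array length m for n expected elements and target false-positive rate p. */
    static long optimalBits(long n, double p) {
        return (long) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    /** Number of hash functions k for n expected elements and m bits (at least 1). */
    static int optimalHashes(long n, long m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    /** Approximate false-positive rate once n elements have been inserted. */
    static double falsePositiveRate(long n, long m, int k) {
        return Math.pow(1 - Math.exp(-(double) k * n / m), k);
    }

    public static void main(String[] args) {
        long n = 10;     // expected number of elements
        double p = 0.01; // target false-positive rate
        long m = optimalBits(n, p);
        int k = optimalHashes(n, m);
        System.out.println("m = " + m + ", k = " + k
                + ", estimated rate = " + falsePositiveRate(n, m, k));
    }
}
```

For 10 expected elements and a 1% target rate, these formulas give m = 96 bits and k = 7 hash functions, which shows how little memory a Bloom filter needs compared with storing the URLs themselves.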

As an example of the operation flow: suppose the set contains three elements {x, y, z} and there are 3 hash functions. First, initialize the bit array by setting every bit to 0. Then, for each element in the set, apply the three hash functions in turn; each one produces a hash value that corresponds to a position in the bit array, and that position is set to 1. To check whether an element w is in the set, map w to three positions in the same way. If any of the three positions is not 1, the element is definitely not in the set. Conversely, if all three positions are 1, the element may be in the set. Note that this cannot prove the element is in the set; there is a certain false-positive rate. For example, suppose an element maps to positions 4, 5, and 6. All three positions are 1, but they were set by different elements, so even though this element is not in the set, every position it maps to is 1. This is why false positives occur.
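The flow described above can be sketched in plain Java with a java.util.BitSet. This is an illustrative toy, not Guava's implementation: the class name SimpleBloomFilter is hypothetical, and the k positions are derived from two base hashes (Arrays.hashCode and FNV-1a) by double hashing, which is an assumption made to keep the example short.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.BitSet;

final class SimpleBloomFilter {
    private final BitSet bits; // the bit array, all zeros initially
    private final int m;       // length of the bit array
    private final int k;       // number of hash functions

    SimpleBloomFilter(int m, int k) {
        this.bits = new BitSet(m);
        this.m = m;
        this.k = k;
    }

    /** Position produced by the i-th hash function, always in [0, m). */
    private int position(String value, int i) {
        byte[] data = value.getBytes(StandardCharsets.UTF_8);
        int h1 = Arrays.hashCode(data);
        int h2 = fnv1a(data);
        return Math.floorMod(h1 + i * h2, m);
    }

    /** 32-bit FNV-1a hash, used here as the second base hash. */
    private static int fnv1a(byte[] data) {
        int hash = 0x811C9DC5;
        for (byte b : data) {
            hash ^= (b & 0xff);
            hash *= 0x01000193;
        }
        return hash;
    }

    /** Sets the k positions of the value to 1. */
    void put(String value) {
        for (int i = 0; i < k; i++) {
            bits.set(position(value, i));
        }
    }

    /** Returns false if the value is definitely absent, true if it may be present. */
    boolean mightContain(String value) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(position(value, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        SimpleBloomFilter filter = new SimpleBloomFilter(96, 7);
        String[] urls = {"www.apigo.cn", "www.baidu.com", "www.apigo.cn"};
        for (String url : urls) {
            if (filter.mightContain(url)) {
                System.out.println("URL probably already exists: " + url);
            } else {
                filter.put(url);
            }
        }
    }
}
```

Because put only ever sets bits, an element that was added is always reported as possibly present, so there are no false negatives. A false positive arises when other elements happen to have set all k positions of an element that was never added.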

We can use the Guava library provided by Google to work with a Bloom filter. First add a reference to Guava in pom.xml, as follows:

<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>28.2-jre</version>
</dependency>

The URL duplicate-detection code is as follows:

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

import java.nio.charset.Charset;

public class URLRepeat {
    // URLs to deduplicate
    public static final String[] URLS = {
            "www.apigo.cn",
            "www.baidu.com",
            "www.apigo.cn"
    };

    public static void main(String[] args) {
        // create a Bloom filter
        BloomFilter<String> filter = BloomFilter.create(
                Funnels.stringFunnel(Charset.defaultCharset()),
                10,    // number of elements expected to be processed
                0.01); // expected false-positive probability
        for (int i = 0; i < URLS.length; i++) {
            String url = URLS[i];
            if (filter.mightContain(url)) {
                // duplicate URL (subject to the false-positive rate)
                System.out.println("URL already exists: " + url);
            } else {
                // store the URL in the Bloom filter
                filter.put(url);
            }
        }
    }
}

The output of the above program is:

URL already exists: www.apigo.cn

6. Deduplication using the Redis Bloom filter

Besides Guava's Bloom filter, we can also use Redis's Bloom filter to detect duplicate URLs. Before using it, we need to make sure that the Redis server version is 4.0 or later (the Bloom filter is only supported from that version on) and that the Redis Bloom filter feature is enabled.

Taking Docker as an example, installing and enabling the Redis Bloom filter works like this: first download the Redis Bloom filter, then enable it when the Redis service is restarted.

Using the Bloom filter

Once the Bloom filter is enabled, we can first use the Redis client redis-cli to try out Bloom filter URL duplicate detection.

Redis has only a few Bloom filter commands. The main ones are:

bf.add: adds an element;
bf.exists: checks whether an element exists;
bf.madd: adds multiple elements;
bf.mexists: checks whether multiple elements exist;
bf.reserve: sets the accuracy of the Bloom filter (its error rate and capacity).

Let's use code to demonstrate the Redis Bloom filter:

import redis.clients.jedis.Jedis;
import utils.JedisUtils; // the author's own helper class, not shown in this article

import java.util.Arrays;

public class BloomExample {
    // Bloom filter key
    private static final String _KEY = "URLREPEAT_KEY";

    // URLs to deduplicate
    public static final String[] URLS = {
            "www.apigo.cn",
            "www.baidu.com",
            "www.apigo.cn"
    };

    public static void main(String[] args) {
        Jedis jedis = JedisUtils.getJedis();
        for (int i = 0; i < URLS.length; i++) {
            String url = URLS[i];
            boolean exists = bfExists(jedis, _KEY, url);
            if (exists) {
                // duplicate URL
                System.out.println("URL already exists: " + url);
            } else {
                bfAdd(jedis, _KEY, url);
            }
        }
    }

    /**
     * Adds an element.
     * @param jedis Redis client
     * @param key   key
     * @param value value
     * @return boolean
     */
    public static boolean bfAdd(Jedis jedis, String key, String value) {
        // the filter key is passed as KEYS[1], the element as ARGV[1]
        String luaStr = "return redis.call('bf.add', KEYS[1], ARGV[1])";
        Object result = jedis.eval(luaStr, Arrays.asList(key),
                Arrays.asList(value));
        return result.equals(1L);
    }

    /**
     * Checks whether an element exists.
     * @param jedis Redis client
     * @param key   key
     * @param value value
     * @return boolean
     */
    public static boolean bfExists(Jedis jedis, String key, String value) {
        String luaStr = "return redis.call('bf.exists', KEYS[1], ARGV[1])";
        Object result = jedis.eval(luaStr, Arrays.asList(key),
                Arrays.asList(value));
        return result.equals(1L);
    }
}

The output of the above program is:

URL already exists: www.apigo.cn

That covers the ways to deduplicate URLs. I hope this article has been helpful.
