

How to Implement URL Deduplication in a Python Crawler

2025-03-26 Update From: SLTechnology News & Howtos

This article introduces how to implement URL deduplication in a Python crawler. Many people run into this question in daily work, so the editor has collected material and organized it into simple, easy-to-use methods, in the hope of answering how a Python crawler deduplicates URLs. Follow along below to study.

I. Preface

Strategies for URL deduplication in a Python crawler, and how to implement them.

II. Introduction to URL Deduplication and Strategies

1. URL deduplication

Taken literally, URL deduplication means removing duplicate URLs. In a crawler, skipping URLs that have already been crawled avoids repeated crawling, which would otherwise hurt crawler efficiency and produce redundant data.

2. URL deduplication strategies

On the surface, a URL deduplication strategy is simply a way of eliminating duplicate URLs. There are five common strategies, as follows:

1. Save visited URLs to a database.

2. Save visited URLs to a set in memory, which allows URL lookups at O(1) cost (roughly 10^8 URLs x 50 characters x 2 bytes / 1024 / 1024 / 1024, or about 9 GB of memory; the calculation appears below).

3. Hash each URL with MD5 or a similar method before saving it to a set.

4. Use a bitmap: map each visited URL to a single bit through a hash function.

5. Use a Bloom filter: an improvement on the bitmap that uses multiple hash functions to reduce collisions.

III. Looking at the code, learning as you type: URL deduplication strategies

1. Save visited URLs to a database (beginner)

It is the easiest to implement, but the least efficient.

The core idea is to store every URL crawled from a page in a database. To avoid duplicates, before each insert you query whether the current URL already exists in the database (that is, whether it has already been crawled). If it does, skip it; otherwise, save the current URL and continue with the next one, until the end. A minimal sketch follows.
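The article does not name a particular database, so the sketch below assumes SQLite (via Python's built-in sqlite3 module); the table and function names are for illustration only.

# Strategy 1 sketch: check the database before storing each url (SQLite assumed).
import sqlite3

conn = sqlite3.connect('crawled.db')
conn.execute('CREATE TABLE IF NOT EXISTS visited (url TEXT PRIMARY KEY)')

def is_new_url(url):
    # Query first: has this url already been crawled?
    cur = conn.execute('SELECT 1 FROM visited WHERE url = ?', (url,))
    if cur.fetchone():
        return False           # already stored, skip it
    conn.execute('INSERT INTO visited (url) VALUES (?)', (url,))
    conn.commit()
    return True                # new url, crawl it

print(is_new_url('https://example.com/page1'))  # True the first time
print(is_new_url('https://example.com/page1'))  # False the second time

As the article says, this is the simplest strategy but also the slowest, because every new URL costs a round trip to the database.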

2. Save visited URLs to a set in memory

Saving visited URLs to a set lets you check a URL at O(1) cost. Lookups are simple and fast, with essentially no query overhead, but as more and more URLs are stored, the set takes up more and more memory. A minimal sketch follows.
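Here is a minimal sketch of the set-based approach; the function name is just for illustration.

# Strategy 2 sketch: keep visited urls in an in-memory set.
visited = set()

def is_new_url(url):
    if url in visited:         # set membership test is O(1) on average
        return False
    visited.add(url)
    return True

print(is_new_url('https://example.com/a'))  # True
print(is_new_url('https://example.com/a'))  # False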

A quick calculation: suppose there are 100 million URLs, each 50 characters long on average. Python encodes strings as Unicode, so each character takes 2 bytes.

Formula: 10^8 x 50 characters x 2 bytes / 1024 / 1024 / 1024 ≈ 9 GB

With 200 million URLs that rises to about 18 GB of memory, which is not particularly practical; this approach only suits small crawlers.

3. Hash each URL to a fixed length with MD5 (or similar) before saving it to a set

A quick calculation: MD5 turns a URL into a 128-bit (16-byte) digest, whereas in method 2 a URL conservatively takes about 50 characters x 2 bytes = 100 bytes.

Space saving of MD5: (100 - 16) / 100 = 84% (compared with method 2).

(The URL deduplication in the Scrapy framework uses a similar approach.)

From Wikipedia, on the MD5 algorithm:

MD5 overview
Designer: Ronald Rivest
First published: April 1992
Series: MD, MD2, MD3, MD4, MD5
Digest length: 128 bits
Structure: Merkle-Damgård construction

The MD5 Message-Digest Algorithm is a widely used cryptographic hash function that produces a 128-bit (16-byte) hash value, used to verify that a message has been transmitted completely and consistently. MD5 was designed by the American cryptographer Ronald Linn Rivest and published in 1992 to replace the MD4 algorithm. The algorithm is specified in RFC 1321.

The basic principle of a hashing algorithm is to transform a piece of data (such as a segment of text) into another value of fixed length.

An example of MD5 usage, using the hashlib module in Python 3:

import hashlib

# The message to be hashed
str01 = 'This is your md5 password!'

# Create an md5 object
md5_obj = hashlib.md5()

# The string must be encoded before hashing. Python 3 strings are Unicode,
# so convert to utf-8 first, otherwise you get:
# TypeError: Unicode-objects must be encoded before hashing
md5_obj.update(str01.encode(encoding='utf-8'))

print("XksA's original words: " + str01)
print('After MD5 hashing: ' + md5_obj.hexdigest())

# Result:
# XksA's original words: This is your md5 password!
# After MD5 hashing: 0a5f76e7b0f352e47fed559f904c9159
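Putting the pieces together, here is a minimal sketch of strategy 3: store the fixed 16-byte MD5 digest of each URL in a set instead of the URL itself. The function name is an assumption for illustration.

# Strategy 3 sketch: deduplicate on MD5 digests rather than raw urls.
import hashlib

visited_hashes = set()

def is_new_url(url):
    digest = hashlib.md5(url.encode('utf-8')).digest()  # fixed 16-byte digest
    if digest in visited_hashes:
        return False
    visited_hashes.add(digest)
    return True

print(is_new_url('https://example.com/a'))  # True
print(is_new_url('https://example.com/a'))  # False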

4. Use a bitmap: map each visited URL to a single bit through a hash function

Implementation principle: a hash function maps each URL to a position in a bit array, and each position occupies only one bit.

Compared with method 3, where one URL occupies 128 bits, the bitmap method increases the space saving by more than a hundredfold.

Space saving of the bitmap method:
(128 - 1) / 128 = 99.2% (compared with method 3)
(100 x 8 - 1) / (100 x 8) = 99.88% (compared with method 2)

(Disadvantage: prone to hash collisions.)

A toy sketch follows.
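In the sketch below, the bit-array size and the use of MD5 as the hash function are assumptions for illustration; as noted above, a single hash function makes collisions possible.

# Strategy 4 sketch: map each url to one bit in a bit array via a hash function.
import hashlib

NUM_BITS = 2 ** 27                      # 2**27 bits = a 16 MB bit array
bit_array = bytearray(NUM_BITS // 8)

def _bit_position(url):
    digest = hashlib.md5(url.encode('utf-8')).digest()
    return int.from_bytes(digest, 'big') % NUM_BITS

def is_new_url(url):
    pos = _bit_position(url)
    byte_index, bit_mask = pos // 8, 1 << (pos % 8)
    if bit_array[byte_index] & bit_mask:
        return False                    # bit already set: seen before (or a collision)
    bit_array[byte_index] |= bit_mask
    return True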

From Wikipedia, on hash functions:

A hash function (also known as a hashing algorithm) is any method that creates a small digital "fingerprint" from any kind of data. The hash function compresses a message or piece of data into a digest, shrinking the data and fixing its format. The function scrambles the data and recreates a fingerprint, called a hash value (hash values, hash codes, hash sums, or hashes). A hash value is usually represented by a short string of random-looking letters and digits. A good hash function rarely produces collisions within its input domain. In hash tables and data processing, failing to suppress collisions when distinguishing data makes database records harder to find.

5. Bloom filter: improve the bitmap with multiple hash functions to reduce collisions

From Wikipedia, on the Bloom filter:

Basic overview
To determine whether an element belongs to a set, the usual approach is to store all of the set's elements and then compare against them. Linked lists, trees, hash tables and other data structures all follow this idea. But as the number of elements in the set grows, more and more storage space is needed and retrieval becomes slower and slower; the retrieval time complexity of those three structures is O(n), O(log n), and O(n/k) respectively.

Principle
The principle of the Bloom filter is that when an element is added to the set, K hash functions map the element to K points in a bit array and set them to 1. To look an element up, just check whether all of those points are 1: if any of them is 0, the element is definitely not in the set; if all of them are 1, the element is probably in the set. This is the basic idea of the Bloom filter.

Advantages and disadvantages
A Bloom filter can be used to test whether an element is a member of a set.
Its advantage is that its space efficiency and query time far exceed those of ordinary algorithms.
Its disadvantages are a certain false-positive rate and the difficulty of deleting elements.

A toy sketch follows.
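Before looking at a full implementation, here is a toy sketch of the idea just described: K hash functions set K bits per element, and a lookup answers "definitely absent" or "probably present". The array size and the salted-MD5 "hash functions" are assumptions for illustration only.

# Bloom filter sketch: k salted hashes set k bits per url.
import hashlib

NUM_BITS = 2 ** 24
NUM_HASHES = 5
bits = bytearray(NUM_BITS // 8)

def _positions(url):
    for seed in range(NUM_HASHES):
        digest = hashlib.md5(f'{seed}:{url}'.encode('utf-8')).digest()
        yield int.from_bytes(digest, 'big') % NUM_BITS

def add(url):
    for pos in _positions(url):
        bits[pos // 8] |= 1 << (pos % 8)

def might_contain(url):
    # If any of the k bits is 0, the url was definitely never added.
    return all(bits[pos // 8] & (1 << (pos % 8)) for pos in _positions(url))

add('https://example.com/a')
print(might_contain('https://example.com/a'))   # True
print(might_contain('https://example.com/b'))   # almost certainly False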

An introduction to Bloom filters can also be found here: https://blog.csdn.net/preyta/article/details/72804148

The underlying implementation of Bloomfilter:

Source address: https://github.com/preytaren/fastbloom/blob/master/fastbloom/bloomfilter.py

import math
import logging
import functools

import pyhash

from bitset import MmapBitSet
from hash_tools import hashes


class BloomFilter(object):
    """
    A bloom filter implementation
    which use Murmur hash and Spooky hash
    """
    def __init__(self, capacity, error_rate=0.0001, fname=None,
                 h2=pyhash.murmur3_x64_128(), h3=pyhash.spooky_128()):
        """
        :param capacity: size of possible input elements
        :param error_rate: posi
        :param fname:
        :param h2:
        :param h3:
        """
        # calculate m & k
        self.capacity = capacity
        self.error_rate = error_rate
        self.num_of_bits, self.num_of_hashes = self._adjust_param(4096 * 8,
                                                                  error_rate)
        self._fname = fname
        self._data_store = MmapBitSet(self.num_of_bits)
        self._size = len(self._data_store)
        self._hashes = functools.partial(hashes, h2=h2, h3=h3, number=self.num_of_hashes)

    def _adjust_param(self, bits_size, expected_error_rate):
        """
        adjust k & m through 4 steps:
        1. Choose a ballpark value for n
        2. Choose a value for m
        3. Calculate the optimal value of k
        4. Calculate the error rate for our chosen values of n, m, and k.
           If it's unacceptable, return to step 2 and change m;
           otherwise we're done.
        In every loop, m = m * 2.
        :param bits_size:
        :param expected_error_rate:
        :return:
        """
        n, estimated_m, estimated_k, error_rate = self.capacity, int(bits_size / 2), None, 1
        weight, e = math.log(2), math.exp(1)
        while error_rate > expected_error_rate:
            estimated_m *= 2
            estimated_k = int((float(estimated_m) / n) * weight) + 1
            error_rate = (1 - math.exp(-(estimated_k * n) / estimated_m)) ** estimated_k
            logging.info(estimated_m, estimated_k, error_rate)
        return estimated_m, estimated_k

    def add(self, msg):
        """
        add a string to bloomfilter
        :param msg:
        :return:
        """
        if not isinstance(msg, str):
            msg = str(msg)
        positions = []
        for _hash_value in self._hashes(msg):
            positions.append(_hash_value % self.num_of_bits)
        for pos in sorted(positions):
            self._data_store.set(int(pos))

    @staticmethod
    def open(self, fname):
        with open(fname) as fp:
            raise NotImplementedError

    def __str__(self):
        """
        output bitset directly
        :return:
        """
        pass

    def __contains__(self, msg):
        if not isinstance(msg, str):
            msg = str(msg)
        positions = []
        for _hash_value in self._hashes(msg):
            positions.append(_hash_value % self.num_of_bits)
        for position in sorted(positions):
            if not self._data_store.test(position):
                return False
        return True

    def __len__(self):
        return self._size

At this point, the study of how a Python crawler implements URL deduplication is over. I hope it has resolved your doubts. Pairing theory with practice is the best way to learn, so go and try it out! If you want to keep learning more related knowledge, please continue to follow this site; the editor will keep working hard to bring you more practical articles!
