This article introduces how a Python crawler can deduplicate urls. Many people run into this question in daily crawler work, so the material below collects simple, practical approaches and walks through them one by one. I hope it helps answer the question of how url deduplication is done in a Python crawler. Let's get started!
I. Preface
The strategies for url deduplication in a Python crawler, and how to implement them.
II. Introduction to url deduplication and its strategies
1. url deduplication
Taken literally, url deduplication means removing repeated urls. In a crawler, skipping urls that have already been crawled avoids repeated crawling, which would otherwise both hurt the crawler's efficiency and produce redundant data.
2. url deduplication strategies
A url deduplication strategy is simply a way of eliminating duplicate urls. There are five common strategies, as follows:
1. Save visited urls to a database.
2. Save visited urls to a set, which answers "have we crawled this url?" in O(1) time, but at roughly 10^8 urls x 50 characters x 2 bytes / 1024 / 1024 / 1024 ≈ 9 GB, memory becomes the bottleneck (see the calculation in section III).
3. Save urls to a set after hashing them with MD5 or a similar function.
4. Use the bitmap method: map each visited url to a single bit via a hash function.
5. Use a Bloom filter: an improvement on bitmap that uses multiple hash functions to reduce collisions.
III. Look at the code and learn by typing along
1. Save visited urls to a database (beginner)
This is the easiest strategy to implement, but the least efficient.
The core idea is to store every url crawled from a page in a database. To avoid duplicates, before each insert you query whether the current url already exists in the database (that is, whether it has already been crawled). If it exists, skip it; otherwise save the current url and continue with the next one, until the end.
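As a minimal sketch of this idea using Python's built-in sqlite3 module (the table layout and helper names here are illustrative, not from the original article):

import sqlite3

conn = sqlite3.connect('crawled.db')
conn.execute('CREATE TABLE IF NOT EXISTS visited (url TEXT PRIMARY KEY)')

def already_crawled(url):
    # one database query per check: correct, but slow
    return conn.execute('SELECT 1 FROM visited WHERE url = ?', (url,)).fetchone() is not None

def mark_crawled(url):
    conn.execute('INSERT OR IGNORE INTO visited (url) VALUES (?)', (url,))
    conn.commit()

for url in ['http://a.com/1', 'http://a.com/2', 'http://a.com/1']:
    if already_crawled(url):
        continue  # skip urls that were already crawled
    # ... fetch and parse the page here ...
    mark_crawled(url)

The round trip to the database on every check is what makes this the slowest of the five strategies.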
2. Save visited urls to an in-memory set
Saving visited urls to a set lets you query a url in O(1) time. Adding and looking up urls is fast and convenient, with essentially no query cost, but as more and more urls are stored, the set occupies more and more memory.
A simple calculation: suppose there are 100 million urls, with an average length of 50 characters. Python strings are Unicode-encoded, so assume each character takes 16 bits, i.e. 2 bytes.
Calculation formula: 10^8 urls x 50 characters x 2 bytes / 1024 / 1024 / 1024 ≈ 9 GB
With 200 million urls this grows to roughly 18 GB of memory, which is far from convenient; this strategy only suits small crawlers.
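A minimal sketch of the in-memory set approach (the url list is a placeholder):

visited = set()

def should_crawl(url):
    # O(1) membership test, at the cost of keeping every url in memory
    if url in visited:
        return False
    visited.add(url)
    return True

for url in ['http://a.com/1', 'http://a.com/2', 'http://a.com/1']:
    if should_crawl(url):
        pass  # fetch and parse the page here
# only the first two urls are crawled; the repeated third one is skipped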
3. Save urls to a set after hashing them with MD5 (each url is reduced to a fixed length)
A simple calculation: MD5 converts a url into a 128-bit (16-byte) digest. In method 2, a url was conservatively estimated to take 50 characters x 2 bytes = 100 bytes.
Calculation formula: by comparison, the space saving rate of MD5 is (100 - 16) / 100 = 84% (compared with method 2).
(The url deduplication in the Scrapy framework uses a similar method.)
See the MD5 entry in Wikipedia:

MD5 overview
Designer: Ronald Rivest
First published: April 1992
Series: MD2, MD4, MD5
Digest length: 128 bits
Structure: Merkle-Damgård construction

The MD5 message-digest algorithm (MD5 Message-Digest Algorithm) is a widely used cryptographic hash function that produces a 128-bit (16-byte) hash value, used to ensure that information has been transmitted completely and consistently. MD5 was designed by American cryptographer Ronald Linn Rivest and published in 1992 to replace the MD4 algorithm. The algorithm is specified in RFC 1321.
The basic principle of a hashing algorithm is to transform data (such as a piece of text) into another, fixed-length value.
An example of MD5 usage, using the hashlib module for an md5 operation in Python 3:

import hashlib

# the information to be hashed
str01 = 'This is your md5 password!'

# create an md5 object
md5_obj = hashlib.md5()

# encode() is required before MD5 hashing: Python 3 strings are Unicode by default
# and must be converted to bytes (e.g. utf-8) first, otherwise you get:
# TypeError: Unicode-objects must be encoded before hashing
md5_obj.update(str01.encode(encoding='utf-8'))

print('The original string is: ' + str01)
print('After MD5: ' + md5_obj.hexdigest())

# result:
# The original string is: This is your md5 password!
# After MD5: 0a5f76e7b0f352e47fed559f904c9159
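Building on that, a minimal sketch of strategy 3: store the 16-byte binary digest of each url in a set instead of the url itself (the helper name is illustrative):

import hashlib

visited_digests = set()

def should_crawl(url):
    # 16 bytes per entry instead of ~100 bytes for the raw url string
    digest = hashlib.md5(url.encode('utf-8')).digest()
    if digest in visited_digests:
        return False
    visited_digests.add(digest)
    return True

print(should_crawl('http://a.com/1'))  # True: first visit
print(should_crawl('http://a.com/1'))  # False: duplicate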
4. Use the bitmap method to map each visited url to a single bit via a hash function
Implementation principle: a hash function maps each url to a position in a bit array, and each position occupies only one bit. Compared with method 3, where one url occupies 128 bits, this increases the space saving by roughly two orders of magnitude.
Calculation formula: by comparison, the space saving rate of the bitmap method is:
(128 - 1) / 128 = 99.2% (compared with method 3)
(100 x 8 - 1) / (100 x 8) = 99.88% (compared with method 2)
(Disadvantage: prone to hash collisions.)
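A minimal sketch of the bitmap idea, using a plain bytearray as the bit array; the bitmap size and the use of Python's built-in hash() are illustrative assumptions, not from the original article:

NUM_BITS = 2 ** 27                 # illustrative size: 2^27 bits = 16 MiB bitmap
bitmap = bytearray(NUM_BITS // 8)

def bit_position(url):
    # map the url to a single position in [0, NUM_BITS); collisions are possible
    return hash(url) % NUM_BITS

def seen(url):
    pos = bit_position(url)
    return bool(bitmap[pos // 8] & (1 << (pos % 8)))

def mark(url):
    pos = bit_position(url)
    bitmap[pos // 8] |= 1 << (pos % 8)

mark('http://a.com/1')
print(seen('http://a.com/1'))   # True
print(seen('http://a.com/2'))   # False, unless its hash collides

Note that Python's hash() for strings is randomized between interpreter runs, so this bitmap is only valid within a single process; real implementations use explicit, stable hash functions.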
See the Hash function entry in Wikipedia:

Hash function:
A hash function (English: Hash function), also known as a hashing algorithm, is a method of creating a small digital "fingerprint" from any kind of data. The hash function compresses a message or data into a digest, reducing the amount of data and giving it a fixed format. The function scrambles the data and recreates a fingerprint called a hash value (hash values, hash codes, hash sums, or hashes). A hash value is usually represented by a short string of random-looking letters and digits. A good hash function rarely produces collisions in its input domain. In hash tables and data processing, collisions that are not suppressed to distinguish data make database records harder to find.
5. Bloom filter: improve the bitmap with multiple hash functions to reduce collisions
See the Bloomfilter entry in Wikipedia:
Basic overview
To determine whether an element is in a set, the general idea is to store all the elements of the set and then compare against them. Linked lists, trees, and hash tables (Hash table) all follow this way of thinking. But as the number of elements in the set grows, the storage required grows with it, and retrieval gets slower and slower; the retrieval time complexities of those three structures are, respectively: O(n), O(log n), O(n/k).
Overview of the principle
The principle of the Bloom filter is that when an element is added to the set, k hash functions map the element to k points in a bit array, and those points are set to 1. When looking an element up, we just check whether all of those points are 1: if any of them is 0, the element is definitely not in the set; if all of them are 1, the element is probably in the set. This is the basic idea of the Bloom filter.
Advantages and disadvantages
A Bloom filter can be used to test whether an element is in a set.
Its advantage is that both its space efficiency and its query time far exceed those of the general algorithms.
Its disadvantages are a certain false positive rate and the difficulty of deleting elements.
An introduction to Bloomfilter can also be found here: https://blog.csdn.net/preyta/article/details/72804148
The underlying implementation of Bloomfilter:
Source address: https://github.com/preytaren/fastbloom/blob/master/fastbloom/bloomfilter.py
import math
import logging
import functools

import pyhash

from bitset import MmapBitSet
from hash_tools import hashes


class BloomFilter(object):
    """
    A bloom filter implementation
    which use Murmur hash and Spooky hash
    """
    def __init__(self, capacity, error_rate=0.0001, fname=None,
                 h2=pyhash.murmur3_x64_128(), h3=pyhash.spooky_128()):
        """
        :param capacity: size of possible input elements
        :param error_rate: posi
        :param fname:
        :param h2:
        :param h3:
        """
        # calculate m & k
        self.capacity = capacity
        self.error_rate = error_rate
        self.num_of_bits, self.num_of_hashes = self._adjust_param(4096 * 8,
                                                                  error_rate)
        self._fname = fname
        self._data_store = MmapBitSet(self.num_of_bits)
        self._size = len(self._data_store)
        self._hashes = functools.partial(hashes, h2=h2, h3=h3, number=self.num_of_hashes)

    def _adjust_param(self, bits_size, expected_error_rate):
        """
        adjust k & m through 4 steps:
        1. Choose a ballpark value for n
        2. Choose a value for m
        3. Calculate the optimal value of k
        4. Calculate the error rate for our chosen values of n, m, and k.
           If it's unacceptable, return to step 2 and change m;
           otherwise we're done.
        in every loop, m = m * 2
        :param bits_size:
        :param expected_error_rate:
        :return:
        """
        n, estimated_m, estimated_k, error_rate = self.capacity, int(bits_size / 2), None, 1
        weight, e = math.log(2), math.exp(1)
        while error_rate > expected_error_rate:
            # double m until the predicted false positive rate is acceptable
            estimated_m *= 2
            estimated_k = int((float(estimated_m) / n) * weight) + 1
            error_rate = (1 - math.exp(-(estimated_k * n) / estimated_m)) ** estimated_k
            logging.info('%s %s %s', estimated_m, estimated_k, error_rate)
        return estimated_m, estimated_k

    def add(self, msg):
        """
        add a string to bloomfilter
        :param msg:
        :return:
        """
        if not isinstance(msg, str):
            msg = str(msg)
        positions = []
        for _hash_value in self._hashes(msg):
            # map each hash value onto a position in the bit array
            positions.append(_hash_value % self.num_of_bits)
        for pos in sorted(positions):
            self._data_store.set(int(pos))

    @staticmethod
    def open(self, fname):
        with open(fname) as fp:
            raise NotImplementedError

    def __str__(self):
        """
        output bitset directly
        :return:
        """
        pass

    def __contains__(self, msg):
        if not isinstance(msg, str):
            msg = str(msg)
        positions = []
        for _hash_value in self._hashes(msg):
            positions.append(_hash_value % self.num_of_bits)
        for position in sorted(positions):
            # if any mapped bit is unset, the element is definitely absent
            if not self._data_store.test(position):
                return False
        return True

    def __len__(self):
        return self._size
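For completeness, a hedged usage sketch of the class above; it assumes the bitset and hash_tools modules from the fastbloom repository are importable, since they are not part of the standard library:

bf = BloomFilter(capacity=10 ** 6, error_rate=0.0001)
bf.add('http://a.com/1')
print('http://a.com/1' in bf)  # True: an added url is always found
print('http://a.com/2' in bf)  # False with high probability (small false positive rate)
print(len(bf))                 # size of the underlying bit set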
At this point, the study of how a Python crawler deduplicates urls is over. I hope it has resolved your doubts; pairing theory with practice is the best way to learn, so go and try it out! If you want to keep learning more related knowledge, please keep following the site, where more practical articles will continue to appear.