Shulou (Shulou.com), 06/02 report. Updated 2025-01-17. From: SLTechnology News&Howtos.
Today we will look at how to implement a URL pool (URL Pool), something many people may not be familiar with. The following summary should help you understand it and take something useful away.
For larger crawlers, URL management is a core issue. If URLs are not managed well, pages may be downloaded repeatedly or missed altogether. Here we design a URL pool to manage URLs.
This URL pool follows the producer-consumer model:
[Figure: producer-consumer flow chart]
Following the same pattern, the URLPool looks like this:
[Figure: the URLPool designed for the web crawler]
We design the interface of the URL pool from the point of view of its user. It should provide the following functions:
Add URLs to the pool
Take URLs from the pool to download
Manage URL status internally in the pool
A URL can be in one of the following states:
Downloaded successfully
Failed to download many times, no need to try again
Being downloaded
Failed to download, to be tried again
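The four states can be written down as a small enumeration. This is only an illustrative sketch: the names `UrlState` and `is_permanent` are assumptions and do not appear in the pool's code, which tracks these states implicitly through its dictionaries and its database.

```python
from enum import Enum


class UrlState(Enum):
    """Hypothetical names for the four URL states listed above."""
    SUCCESS = "downloaded successfully"
    FAILED = "failed too many times, give up"
    PENDING = "being downloaded"
    RETRY = "failed, will be tried again"


def is_permanent(state: UrlState) -> bool:
    # SUCCESS and FAILED are final; PENDING and RETRY can still change.
    return state in (UrlState.SUCCESS, UrlState.FAILED)


print([s.name for s in UrlState if is_permanent(s)])
```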
The first two states are permanent: a URL that has been downloaded successfully is never downloaded again, and a URL that has failed after many attempts is never tried again. These states must be stored permanently so that the record survives a crawler restart. There are several ways to store them so that finished URLs are not downloaded again:
Write them directly to a text file. This makes it awkward to check whether a given URL is already recorded.
Write them to a relational database such as MySQL. Lookups are convenient, but the speed is relatively slow.
Use a key-value database. Both lookup and speed meet the requirements, so this is a good choice!
Our URL pool uses LevelDB as the permanent storage for URL state. LevelDB is an open source key-value database from Google that is very fast and compresses data automatically. We first use it to implement UrlDB as the permanent storage database.
Implementation of UrlDB

    import leveldb


    class UrlDB:
        '''Use LevelDB to store URLs that have been done (succeeded or failed)
        '''
        status_failure = b'0'
        status_success = b'1'

        def __init__(self, db_name):
            self.name = db_name + '.urldb'
            self.db = leveldb.LevelDB(self.name)

        def load_from_db(self, status):
            # Return every stored URL whose status equals `status`.
            urls = []
            for url, _status in self.db.RangeIter():
                if status == _status:
                    urls.append(url)
            return urls

        def set_success(self, url):
            # Keys are stored as bytes, so encode str URLs first.
            if isinstance(url, str):
                url = url.encode('utf8')
            try:
                self.db.Put(url, self.status_success)
                s = True
            except Exception:
                s = False
            return s

        def set_failure(self, url):
            if isinstance(url, str):
                url = url.encode('utf8')
            try:
                self.db.Put(url, self.status_failure)
                s = True
            except Exception:
                s = False
            return s

        def has(self, url):
            # Return the stored status if the URL is known, otherwise False.
            if isinstance(url, str):
                url = url.encode('utf8')
            try:
                attr = self.db.Get(url)
                return attr
            except Exception:
                pass
            return False
UrlDB is used by UrlPool, which relies on three of its methods:
has(url) checks whether a URL is already stored
set_success(url) stores the URL with the status "success"
set_failure(url) stores the URL with the status "failure"
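To experiment with these three methods on a machine where the leveldb package is not installed, an in-memory stand-in with the same method names can help. The sketch below is an assumption for illustration and not part of the original code: `MemoryUrlDB` is a hypothetical name, and because it keeps everything in a dict it loses its data on exit, so it cannot replace the permanent storage.

```python
class MemoryUrlDB:
    """Hypothetical dict-backed stand-in with the same methods as UrlDB."""
    status_failure = b'0'
    status_success = b'1'

    def __init__(self):
        self.db = {}

    @staticmethod
    def _key(url):
        # Keys are bytes, mirroring how UrlDB encodes str URLs.
        return url.encode('utf8') if isinstance(url, str) else url

    def set_success(self, url):
        self.db[self._key(url)] = self.status_success
        return True

    def set_failure(self, url):
        self.db[self._key(url)] = self.status_failure
        return True

    def has(self, url):
        # The stored status if the URL is known, otherwise False.
        return self.db.get(self._key(url), False)


db = MemoryUrlDB()
db.set_success('http://example.com/a')
db.set_failure('http://example.com/b')
print(db.has('http://example.com/a'))  # b'1'
print(db.has('http://example.com/c'))  # False
```

Note that has() returns the stored status rather than True, so a caller can tell a success from a permanent failure if it needs to.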
Implementation of UrlPool
The two temporary states, "being downloaded" and "failed, to be tried again", can be kept in memory. We put them into the UrlPool class for management, and with that we can implement the URL pool:
    # Author: veelion

    import pickle
    import leveldb
    import time
    import urllib.parse as urlparse


    class UrlPool:
        '''URL Pool for crawler to manage URLs
        '''

        def __init__(self, pool_name):
            self.name = pool_name
            self.db = UrlDB(pool_name)

            self.pool = {}  # host: set([urls]), URLs waiting to be downloaded
            self.pending = {}  # url: pended_time, URLs popped but not yet updated (being downloaded)
            self.failure = {}  # url: times, number of failures per URL
            self.failure_threshold = 3
            self.pending_threshold = 60  # maximum pending time in seconds; download again after it expires
            self.in_mem_count = 0
            self.max_hosts = ['', 0]  # [host, url_count]: the host with the most URLs in the pool
            self.hub_pool = {}  # {url: last_query_time}
            self.hub_refresh_span = 0
            self.load_cache()

        def load_cache(self):
            path = self.name + '.pkl'
            try:
                with open(path, 'rb') as f:
                    self.pool = pickle.load(f)
                # .items() is required to iterate over (host, urls) pairs.
                cc = [len(v) for k, v in self.pool.items()]
                self.in_mem_count = sum(cc)
                print('saved pool loaded! urls:', sum(cc))
            except Exception:
                pass

        def set_hubs(self, urls, hub_refresh_span):
            self.hub_refresh_span = hub_refresh_span
            self.hub_pool = {}
            for url in urls:
                self.hub_pool[url] = 0

        def set_status(self, url, status_code):
            if url in self.pending:
                self.pending.pop(url)

            if status_code == 200:
                self.db.set_success(url)
                return
            if status_code == 404:
                self.db.set_failure(url)
                return
            # Any other status counts as one failed attempt.
            self.failure[url] = self.failure.get(url, 0) + 1
            if self.failure[url] > self.failure_threshold:
                self.db.set_failure(url)
                self.failure.pop(url)
            else:
                self.add(url)  # back into the pool for another attempt

        def push_to_pool(self, url):
            host = urlparse.urlparse(url).netloc
            if not host or '.' not in host:
                print('try to push_to_pool with bad url:', url, ', len of url:', len(url))
                return False
            if host in self.pool:
                if url in self.pool[host]:
                    return True
                self.pool[host].add(url)
                if len(self.pool[host]) > self.max_hosts[1]:
                    self.max_hosts[1] = len(self.pool[host])
                    self.max_hosts[0] = host
            else:
                self.pool[host] = set([url])
            self.in_mem_count += 1
            return True

        def add(self, url, always=False):
            if always:
                return self.push_to_pool(url)
            pended_time = self.pending.get(url, 0)
            if time.time() - pended_time < self.pending_threshold:
                print('being downloading:', url)
                return
            if self.db.has(url):
                return
            if pended_time:
                self.pending.pop(url)
            return self.push_to_pool(url)

        def addmany(self, urls, always=False):
            if isinstance(urls, str):
                print('urls is a str !!!!', urls)
                self.add(urls, always)
            else:
                for url in urls:
                    self.add(url, always)

        def pop(self, count, hubpercent=50):
            print('\n\tmax of host:', self.max_hosts)

            # Popped URLs are of two types: hub = 1, normal = 0
            url_attr_url = 0
            url_attr_hub = 1
            # 1. Take the hubs first, to make sure we get the latest URLs in them.
            hubs = {}
            hub_count = count * hubpercent // 100
            for hub in self.hub_pool:
                if len(hubs) >= hub_count:
                    break
                span = time.time() - self.hub_pool[hub]
                if span < self.hub_refresh_span:
                    continue
                hubs[hub] = url_attr_hub  # 1 means hub-url
                self.hub_pool[hub] = time.time()

            # 2. Then take the normal URLs.
            # If one host has too many URLs, take 3 (delta) of its URLs at a time.
            if self.max_hosts[1] * 10 > self.in_mem_count:
                delta = 3
                print('\tset delta:', delta, ', max of host:', self.max_hosts)
            else:
                delta = 1
            left_count = count - len(hubs)
            urls = {}
            for host in self.pool:
                if len(urls) >= left_count:
                    break
                if not self.pool[host]:
                    continue
                is_max_host = self.max_hosts[0] == host
                take = delta if is_max_host else 1
                # Record every popped URL so that none of them is lost.
                while take > 0 and self.pool[host] and len(urls) < left_count:
                    url = self.pool[host].pop()
                    if is_max_host:
                        self.max_hosts[1] -= 1
                    urls[url] = url_attr_url
                    self.pending[url] = time.time()
                    take -= 1
            self.in_mem_count -= len(urls)
            print('To pop:%s, hubs: %s, urls: %s, hosts:%s' % (count, len(hubs), len(urls), len(self.pool)))
            urls.update(hubs)
            return urls

        def size(self):
            return self.in_mem_count

        def empty(self):
            return self.in_mem_count == 0

        def __del__(self):
            # Python does not guarantee that __del__ runs at interpreter exit.
            path = self.name + '.pkl'
            try:
                with open(path, 'wb') as f:
                    pickle.dump(self.pool, f)
                print('self.pool saved!')
            except Exception:
                pass
The implementation of UrlPool is a little complicated, so let me break it down piece by piece.
Using UrlPool
Let's look at its main members and what they are for:
self.db is an instance of UrlDB, used to store the permanent states of URLs.
self.pool stores the URLs waiting to be downloaded. It is a dict whose key is the host of the URL and whose value is a set holding all the URLs of that host.
self.pending manages the URLs that are being downloaded. It is a dict whose key is the URL and whose value is the timestamp at which the URL was popped. A URL's download begins when pop() returns it and ends when set_status() is called for it. If a URL is added to the pool while it is pending and it has been pending for longer than pending_threshold, it is put back into the pool to wait for download. Otherwise it is not added for the time being.
self.failure is a dict whose key is the URL and whose value is the number of failed attempts. Once the count exceeds failure_threshold, the URL is permanently recorded as failed and is not tried again.
self.hub_pool is a dict used to store hub pages. Its key is the hub URL and its value is the time the hub page was last fetched.
These members make up the data structure of the URL pool, which is operated through the following member methods:
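The pending rule can be isolated into a few lines. The following is a minimal sketch rather than the pool's own code: `can_requeue` is a hypothetical helper, and the `now` parameter is added only so that the example does not depend on the real clock.

```python
import time

PENDING_THRESHOLD = 60  # seconds, the same value as self.pending_threshold


def can_requeue(pending, url, now=None):
    """True if url is not pending, or has been pending for too long."""
    now = time.time() if now is None else now
    pended_time = pending.get(url, 0)  # 0 means "never popped"
    return now - pended_time >= PENDING_THRESHOLD


pending = {'http://example.com/a': 1000.0}
print(can_requeue(pending, 'http://example.com/a', now=1030.0))  # False
print(can_requeue(pending, 'http://example.com/a', now=1061.0))  # True
print(can_requeue(pending, 'http://example.com/b', now=1061.0))  # True
```

Using 0 as the default timestamp means an unknown URL always looks expired, which is why a single comparison covers both cases.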
1. load_cache() and dump_cache() cache the URL pool
load_cache() is called in __init__(). When the pool is created, it tries to load the URL pool that was cached at the last exit.
dump_cache() is called in __del__(). It caches the in-memory URL pool to disk before the pool is destroyed (for example, when the crawler exits unexpectedly). In the code above, this dump logic is written inline in __del__().
The pickle module is used here. It is a tool for serializing in-memory data to disk.
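The round trip through pickle looks like this. It is a small sketch with made-up data: the file name `demo.pkl` and the temporary directory are assumptions chosen so that the example does not touch a real cache file.

```python
import os
import pickle
import tempfile

# A pool in the same shape as self.pool: {host: set(urls)}
pool = {'example.com': {'http://example.com/a', 'http://example.com/b'}}

path = os.path.join(tempfile.mkdtemp(), 'demo.pkl')  # hypothetical cache file
with open(path, 'wb') as f:
    pickle.dump(pool, f)       # memory -> disk

with open(path, 'rb') as f:
    restored = pickle.load(f)  # disk -> memory

print(restored == pool)  # True
```

Only load pickle files that your own program wrote, because unpickling untrusted data can execute arbitrary code.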
2. set_hubs() sets the hub URLs
A hub page is a page like Baidu News: the whole page is full of news headlines and links. It aggregates the news we actually want, and it is updated constantly so that the latest news keeps appearing on it. We call such pages hub pages, and their URLs hub URLs. Adding a large number of hub URLs to a news crawler helps it find and fetch the latest news in time.
The method passes a list of hub URLs to the URL pool. When the crawler takes URLs from the pool, the hub URLs are handed out according to the refresh interval (self.hub_refresh_span).
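The interval check can be sketched on its own. The helper below is hypothetical (`due_hubs` is not a method of the pool) and takes `now` as a parameter so the behaviour can be shown without waiting; it mirrors the idea of skipping hubs that were handed out less than `refresh_span` seconds ago.

```python
import time


def due_hubs(hub_pool, refresh_span, limit, now=None):
    """Pick up to `limit` hub URLs last handed out at least refresh_span ago."""
    now = time.time() if now is None else now
    picked = []
    for hub, last_time in hub_pool.items():
        if len(picked) >= limit:
            break
        if now - last_time < refresh_span:
            continue             # handed out recently, skip it
        hub_pool[hub] = now      # remember when it was handed out
        picked.append(hub)
    return picked


hubs = {'http://news.example.com/': 0, 'http://portal.example.com/': 0}
print(len(due_hubs(hubs, refresh_span=300, limit=10, now=1000.0)))  # 2
print(len(due_hubs(hubs, refresh_span=300, limit=10, now=1100.0)))  # 0
print(len(due_hubs(hubs, refresh_span=300, limit=10, now=1300.0)))  # 2
```

Initialising every timestamp to 0, as set_hubs() does, makes all hubs due on the first call.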
3. add(), addmany() and push_to_pool() put URLs into the pool
When a URL is put into the pool, first check whether it exists in the in-memory self.pending, that is, whether it is being downloaded. If it is being downloaded, it does not enter the pool. If it is not being downloaded, or its download has timed out, proceed to the next step.
Then check whether the URL already exists in the persistent LevelDB. If it does, it has either been downloaded successfully or has failed for good, so it is neither downloaded again nor put into the pool. If it does not, proceed to the next step.
Finally, the URL is put into self.pool through push_to_pool(). The storage rule is to group URLs by host, so that URLs of the same host are stored together. When URLs are taken out, one URL is taken from each host, which ensures that the URLs in a batch point to different servers. The purpose is to keep the request pressure on the target servers as low as possible. Strive to be a server-friendly crawler O(∩_∩)O
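Grouping by host takes only a few lines with urllib.parse. This is a simplified sketch, not the pool's method: `group_by_host` is a hypothetical function that builds the same `{host: set(urls)}` shape as self.pool and skips malformed URLs the way push_to_pool() does.

```python
from urllib.parse import urlparse


def group_by_host(urls):
    """Build a {host: set(urls)} mapping in the same shape as self.pool."""
    pool = {}
    for url in urls:
        host = urlparse(url).netloc
        if not host or '.' not in host:
            continue  # no usable host, so the URL is rejected
        pool.setdefault(host, set()).add(url)
    return pool


pool = group_by_host([
    'http://a.example.com/1',
    'http://a.example.com/2',
    'http://b.example.com/1',
    'not-a-url',
])
print(sorted(pool))                # ['a.example.com', 'b.example.com']
print(len(pool['a.example.com']))  # 2
```

Because each value is a set, adding the same URL twice has no effect, which gives duplicate removal inside the pool for free.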
4. pop() takes URLs out of the pool
Through this method, the crawler gets a batch of URLs from the pool to download. URLs are taken out in two steps:
The first step takes URLs from self.hub_pool. It iterates over hub_pool and checks whether the time since each hub URL was last popped exceeds the hub page refresh interval (self.hub_refresh_span), which decides whether that hub URL should be popped.
The second step takes URLs from self.pool. The principle for popping, introduced above with push_to_pool(), is that the URLs in each batch point to different servers. Thanks to the data structure of self.pool, it is easy to follow this principle: just iterate over self.pool by host (the keys of self.pool).
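The "one URL per host" rule of the second step can be sketched as below. This is a deliberately simplified illustration: `pop_batch` is a hypothetical function, and it leaves out the hub step, the pending bookkeeping and the extra URLs taken from the busiest host.

```python
def pop_batch(pool, count):
    """Take at most one URL per host until `count` URLs are collected."""
    batch = []
    for host, urls in pool.items():
        if len(batch) >= count:
            break
        if not urls:
            continue  # nothing left for this host
        batch.append(urls.pop())
    return batch


pool = {
    'a.example.com': {'http://a.example.com/1', 'http://a.example.com/2'},
    'b.example.com': {'http://b.example.com/1'},
    'c.example.com': set(),
}
batch = pop_batch(pool, count=5)
print(len(batch))  # 2: one URL from each non-empty host
```

A second call would return the remaining URL of a.example.com, so consecutive requests to the same server are spread across batches.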
5. set_status() sets the status of a URL in the pool
Its parameter status_code is the status code of the HTTP response. The crawler calls it to set the URL's status after downloading the URL.
First, the URL is removed from self.pending: its download has finished, so it is no longer pending.
Then the URL's state is set according to status_code. 200 and 404 are recorded directly as permanent states. For any other status, the failure count is recorded and the URL re-enters the pool for a later download attempt.
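The decision made for each finished download can be summarised as a pure function. This is a sketch under stated assumptions: `next_action` and its action labels are hypothetical names used only here, and the threshold is the same value as self.failure_threshold.

```python
FAILURE_THRESHOLD = 3  # the same value as self.failure_threshold


def next_action(status_code, failures):
    """Return (action, new_failure_count) for one finished download."""
    if status_code == 200:
        return 'store_success', failures
    if status_code == 404:
        return 'store_failure', failures
    failures += 1  # any other status counts as one failed attempt
    if failures > FAILURE_THRESHOLD:
        return 'store_failure', failures
    return 'retry', failures


print(next_action(200, 0))  # ('store_success', 0)
print(next_action(404, 0))  # ('store_failure', 0)
print(next_action(500, 0))  # ('retry', 1)
print(next_action(500, 3))  # ('store_failure', 4)
```

With a threshold of 3, a URL is given up on its fourth failed attempt.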
With the member variables and methods above, we have taken the URL pool (UrlPool) apart thoroughly. Fellow programmers, feel free to keep it for yourselves: it makes URL management easy when you write crawlers in the future, and since the implementation is a single .py file, it is easy to add to any project.
Crawler knowledge points
1. URL management
The goal of URL management is: never fetch a page twice, never miss a page.
2. The pickle module
Saving in-memory data to disk and loading it back into memory is a necessary step when stopping and starting many programs. pickle is the module that moves data between memory and disk.
3. The leveldb module
A classic and powerful on-disk key-value database, well suited to storing url-status pairs.
4. urllib.parse
The module for parsing URLs. It should be the first module that comes to mind when dealing with URLs.