How to implement a better network request function in large-scale asynchronous news crawler

This article explains in detail how to implement a better network request function for a large-scale asynchronous news crawler. The editor finds it very practical and shares it here as a reference; I hope you get something out of it.

Implementing the downloader

import requests
import cchardet
import traceback


def downloader(url, timeout=10, headers=None, debug=False, binary=False):
    _headers = {
        'User-Agent': ('Mozilla/5.0 (compatible; MSIE 9.0; '
                       'Windows NT 6.1; Win64; x64; Trident/5.0)'),
    }
    redirected_url = url
    if headers:
        _headers = headers
    try:
        r = requests.get(url, headers=_headers, timeout=timeout)
        if binary:
            html = r.content
        else:
            encoding = cchardet.detect(r.content)['encoding']
            html = r.content.decode(encoding)
        status = r.status_code
        redirected_url = r.url
    except Exception:
        if debug:
            traceback.print_exc()
        msg = 'failed download: {}'.format(url)
        print(msg)
        html = b'' if binary else ''
        status = 0
    return status, html, redirected_url


if __name__ == '__main__':
    url = 'http://news.baidu.com/'
    s, html, redirected_url = downloader(url)
    print(s, len(html))

This downloader() function has a built-in default User-Agent that impersonates an IE9 browser, and it accepts a caller-supplied headers and timeout. It uses cchardet to handle encoding detection, and the returned data consists of:

Status code: set to 0 if an exception occurs

Content: a str is returned by default, but when the URL points to binary content such as images you should pass binary=True when calling (see the sketch after this list)

Redirected URL: some URLs are redirected, and the URL of the final page is carried in the response object
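For binary resources, a minimal usage sketch might look like the following (the image URL and filename are placeholders, not from the original article):

status, content, final_url = downloader('http://example.com/logo.png', binary=True)
if status == 200:
    # binary=True makes content a bytes object, so write the file in binary mode
    with open('logo.png', 'wb') as f:
        f.write(content)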

Cleaning news URLs

Let's first take a look at these two news URLs:

http://xinwen.eastday.com/a/n181106070849091.html?qid=news.baidu.com

http://news.ifeng.com/a/20181106/60146589_0.shtml?_zbs_baidu_news

Both URLs above carry a question mark. They come from the Baidu News homepage, and the part after the question mark tells the target server that this visit arrived through a Baidu News link, i.e., traffic brought by Baidu. The two sites express this differently, one with qid=news.baidu.com and the other with _zbs_baidu_news, probably because each target server requires its own format for the traffic-statistics program running in its backend.

If we strip the question mark and everything after it, we find that the URLs still point to exactly the same news pages as before.

From the perspective of string comparison, the URL with the question mark and the URL without it are two different URLs, yet they point to exactly the same news page, which shows that the parameters after the question mark have no effect on the response content.

After a great deal of practice crawling news, we have discovered the following pattern:

News sites do a lot of SEO and make their news URLs static, basically ending with .html, .htm, .shtml, etc.; any request parameters appended after that make no difference.

However, there are still some news sites that serve news pages dynamically through an id parameter.

So when crawling news, we should take advantage of this pattern to avoid fetching the same page repeatedly. To that end, we implement a function that cleans URLs.

import urllib.parse as urlparse  # urlparse.urlparse / urlparse.urlunparse are used below

g_bin_postfix = set([
    'exe', 'doc', 'docx', 'xls', 'xlsx', 'ppt', 'pptx', 'pdf',
    'jpg', 'png', 'bmp', 'jpeg', 'gif',
    'zip', 'rar', 'tar', 'bz2', '7z', 'gz',
    'flv', 'mp4', 'avi', 'wmv', 'mkv', 'apk',
])

g_news_postfix = ['.html?', '.htm?', '.shtml?', '.shtm?']


def clean_url(url):
    # 1. whether it is a legitimate http url
    if not url.startswith('http'):
        return ''
    # 2. remove the parameters after a static url
    for np in g_news_postfix:
        p = url.find(np)
        if p > -1:
            p = url.find('?')
            url = url[:p]
            return url
    # 3. do not download binary links
    up = urlparse.urlparse(url)
    path = up.path
    if not path:
        path = '/'
    postfix = path.split('.')[-1].lower()
    if postfix in g_bin_postfix:
        return ''
    # 4. remove the parameters that identify the source of the traffic
    # badquery = ['spm', 'utm_source', 'utm_source', 'utm_medium', 'utm_campaign']
    good_queries = []
    for query in up.query.split('&'):
        qv = query.split('=')
        if qv[0].startswith('spm') or qv[0].startswith('utm_'):
            continue
        if len(qv) == 1:
            continue
        good_queries.append(query)
    query = '&'.join(good_queries)
    url = urlparse.urlunparse((
        up.scheme,
        up.netloc,
        path,
        up.params,
        query,
        '',  # the crawler does not care about the fragment
    ))
    return url

The URL-cleaning steps are all described in the code comments and fall into two kinds of operations (a usage sketch follows the list):

Determine whether the URL is legal or worth downloading; if not, return an empty string directly.

Remove unnecessary parameters, including all parameters after a static URL.
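Here is a short usage sketch that applies clean_url() to the two example URLs from earlier; since both end with a static postfix followed by '?', rule 2 should strip everything from the question mark onward:

for u in ['http://xinwen.eastday.com/a/n181106070849091.html?qid=news.baidu.com',
          'http://news.ifeng.com/a/20181106/60146589_0.shtml?_zbs_baidu_news']:
    print(clean_url(u))
# Expected output:
# http://xinwen.eastday.com/a/n181106070849091.html
# http://news.ifeng.com/a/20181106/60146589_0.shtml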

Knowledge points of web crawlers

1. URL cleaning

Cleaning the URL before the network request starts avoids repeated downloads and invalid downloads (binary content), saving server and network overhead.
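As a small illustration (the seen set and the fetch_once() helper are assumptions for this sketch, not part of the original code), the cleaned URL can serve as the de-duplication key:

seen = set()  # cleaned URLs that have already been fetched

def fetch_once(raw_url):
    url = clean_url(raw_url)
    if not url:          # illegal URL or binary link: skip it
        return None
    if url in seen:      # already downloaded: avoid repeated grabbing
        return None
    seen.add(url)
    return downloader(url)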

2. Cchardet module

This module is a faster drop-in replacement for chardet with the same interface, used to detect the encoding of a byte string. Because it is implemented in C and C++, it is very fast, which makes it well suited for detecting web page encodings in a crawler.

Remember: don't trust the encoding reported by requests; it is safer to detect it yourself. In the previous section we gave an example of requests getting the encoding wrong; if you have forgotten, you can review it.
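A minimal sketch of doing the detection yourself (cchardet.detect() takes the raw bytes and returns a dict with 'encoding' and 'confidence'; the utf-8 fallback is an extra safety assumption, not part of the original code):

import cchardet
import requests

r = requests.get('http://news.baidu.com/')
detected = cchardet.detect(r.content)
# r.encoding comes from the HTTP headers and may be wrong or missing;
# detection from the raw bytes is usually more reliable.
encoding = detected['encoding'] or 'utf-8'
html = r.content.decode(encoding, errors='replace')
print(r.encoding, detected['encoding'])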

3. Traceback module

While a crawler is running it will hit all kinds of exceptions, some of them unexpected, and we cannot know in advance where they will occur. We need try to catch exceptions so the program does not crash, but we also need to see what the caught exception actually is so we can improve the crawler. This is where the traceback module comes in.

For example, in the downloader() function we use try to catch the exception raised by get(), but an exception may also come from cchardet.detect(). Printing the exception with traceback.print_exc() helps us find such problems.
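A minimal sketch of this pattern, catching everything in one try block and only printing the full traceback when a debug flag is on (fetch() is an illustrative helper, not from the original article):

import traceback
import cchardet
import requests

def fetch(url, debug=False):
    try:
        r = requests.get(url, timeout=10)
        encoding = cchardet.detect(r.content)['encoding']
        return r.content.decode(encoding)
    except Exception:
        # The failure may come from the request, from cchardet.detect(),
        # or from decode(); print_exc() shows exactly which line raised.
        if debug:
            traceback.print_exc()
        return ''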

This is the end of the article on how to implement a better network request function in a large-scale asynchronous news crawler. I hope the content above is helpful and that you have learned something from it; if you think the article is good, please share it so more people can see it.
