How do you carry out a simple Baidu News crawl with Python? Many beginners are not very clear about this, so to help you solve the problem, this article explains it in detail. Anyone who needs it can learn from it, and I hope you gain something from reading.
The goal of this hands-on example is to build a large-scale asynchronous news crawler, but we will build the Python crawler up step by step, from simple to complex.
All the code in this tutorial is written for Python 3.6; Python 2 is not considered. You are strongly advised to use Python 3.
To grab news, we must first have a news source, that is, a target website. There are thousands of domestic news websites, large and small, from national to local, from general-interest to vertical industry sites, and Baidu News (news.baidu.com) includes more than 2,000 of them. So let's start with Baidu News.
Open the home page of Baidu News: news.baidu.com
We can see that this is a news aggregation page, which lists a lot of news headlines and their original links. As shown in the figure:
Our goal is to extract the links to those news stories from this page and download them. The process is fairly simple:
Simple flow chart of news crawler
Following this simple process, let's first implement a simple version of the code:
#!/usr/bin/env python3
# Author: veelion

import re
import time
import requests
import tldextract


def save_to_db(url, html):
    # Save the web page to the database; temporarily replaced with a print
    print('%s : %s' % (url, len(html)))


def crawl():
    # 1. Download the Baidu News hub page
    hub_url = 'http://news.baidu.com/'
    res = requests.get(hub_url)
    html = res.text

    # 2. Extract news links
    # 2.1 Extract all links with 'href'
    links = re.findall(r'href=[\'"]?(.*?)[\'"\s]', html)
    print('find links:', len(links))

    news_links = []
    # 2.2 Filter out non-news links
    for link in links:
        if not link.startswith('http'):
            continue
        tld = tldextract.extract(link)
        if tld.domain == 'baidu':
            continue
        news_links.append(link)
    print('find news links:', len(news_links))

    # 3. Download the news pages and save them to the database
    for link in news_links:
        html = requests.get(link).text
        save_to_db(link, html)
    print('works!')


def main():
    while 1:
        crawl()
        time.sleep(300)


if __name__ == '__main__':
    main()
A brief explanation of the above code:
1. Use requests to download the Baidu News home page.
2. Use a regular expression to extract the href attributes of the a tags, that is, the links in the page, and then pick out the news links on the assumption that all external (non-Baidu) links are news links.
3. Download each news link found and save it to the database; the save-to-database function is temporarily replaced by printing some basic information.
4. Repeat steps 1-3 every 300 seconds to pick up newly published news.
The above code works, but it only just works; there is plenty to criticize, so let's go through its weak points and improve this crawler.
1. Add exception handling
When writing crawlers, especially code that makes network requests, there must be exception handling. Whether the target server is up and whether the network connection is smooth at that moment are beyond the crawler's control, so exceptions must be handled when making network requests. It is also best to set a timeout for each request so that we don't spend too long on any single one. A timeout may be caused by the server failing to respond, or it may be a temporary network problem, so for timeout exceptions we should wait a while and then retry. A sketch of this idea follows.
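To make this concrete, here is a minimal sketch (the helper name downloader, the retry count, and the wait time are my own choices, not part of the original code) of a download function with a timeout, exception handling, and a simple retry:

import time
import requests

def downloader(url, timeout=10, retries=3, wait=5):
    # Try to download a URL; on a timeout or connection error,
    # wait a while and try again, up to `retries` times.
    for attempt in range(retries):
        try:
            res = requests.get(url, timeout=timeout)
            return res.status_code, res.text
        except (requests.Timeout, requests.ConnectionError) as e:
            print('download failed (attempt %d): %s, %s' % (attempt + 1, url, e))
            time.sleep(wait)
    return 0, ''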
2. Handle the status codes returned by the server, such as 404, 500, etc.
The status code returned by the server is important because it determines what our crawler should do next. The common status codes that need handling are listed below, with a small sketch after the list:
301, the URL has been permanently moved to another URL; later requests should go to the redirected URL.
404, the page basically no longer exists; don't try it again later.
500, there is an internal server error, which may be temporary; we should retry later.
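As a rough illustration of how these status codes might be dispatched (my own structuring, not the author's final code; the crawler built in later sections may handle this differently):

import requests

def fetch(url, timeout=10):
    # Do not follow redirects automatically, so that 301/302 can be recorded.
    try:
        res = requests.get(url, timeout=timeout, allow_redirects=False)
    except requests.RequestException:
        return 0, '', url
    new_url = res.headers.get('location', url)
    return res.status_code, res.text, new_url

status, html, new_url = fetch('http://news.baidu.com/')
if status in (301, 302):
    # moved: remember new_url and request it instead next time
    pass
elif status == 404:
    # the page is gone: mark the URL as invalid and do not retry
    pass
elif status >= 500:
    # internal server error, possibly temporary: retry later
    pass
else:
    # 200 and the like: save the html
    pass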
3. Manage the status of URLs
Record the URLs that failed so that they can be retried later. For URLs that timed out, we need to crawl them again later, so we need to record the various states of every URL, listed below (a minimal sketch follows the list):
Has been downloaded successfully
Download failed many times, no need to download again.
Downloading
If the download fails, try again.
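A very small sketch of how these states could be recorded (a plain dict here; the state names and the limit of 3 failures are illustrative, and later installments may use a dedicated URL pool instead):

# Possible states of a URL during crawling (names are illustrative)
SUCCESS = 'success'   # downloaded successfully, never download again
FAILED = 'failed'     # failed too many times, give up
PENDING = 'pending'   # queued or currently downloading
RETRY = 'retry'       # failed once (e.g. timeout), try again later

url_status = {}       # url -> state
failure_count = {}    # url -> number of failures so far

def record_failure(url, max_failures=3):
    failure_count[url] = failure_count.get(url, 0) + 1
    if failure_count[url] >= max_failures:
        url_status[url] = FAILED
    else:
        url_status[url] = RETRY

def record_success(url):
    url_status[url] = SUCCESS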
With these kinds of handling added to the network requests, the crawler becomes much more robust and will not easily crash, which would otherwise create a lot of extra work for later operation and maintenance.
In the next section we will improve the code for the three weak points above one by one. Stay tuned for the next installment.
Python crawler knowledge points
In this section we used several Python modules; their roles in the crawler are as follows:
1. The requests module
It is used to make HTTP network requests and download the content of a URL. It is easier to use than urllib.request, which comes with Python. GET and POST requests are at your fingertips:
import requests
res = requests.get(url, timeout=5, headers=my_headers)
res2 = requests.post(url, data=post_data, timeout=5, headers=my_headers)
The get() and post() functions accept many optional parameters; here they are used to set a timeout and custom headers. For more parameters, please see the requests documentation.
Whether you call get() or post(), requests returns a Response object, and the downloaded content is obtained through it:
res.content is the downloaded binary content; its type is bytes.
res.text is the str obtained by decoding that binary content.
requests first tries to find the encoding in the response headers, otherwise it guesses the encoding with chardet, stores the result in res.encoding, and finally uses it to decode the binary content into a str.
Experience: res.text sometimes gets the encoding of Chinese pages wrong; it is more reliable to detect the encoding yourself with cchardet (chardet implemented in C). Here is an example:
In [1]: import requests
In [2]: r = requests.get('http://epaper.sxrb.com/')
In [3]: r.encoding
Out[3]: 'ISO-8859-1'
In [4]: import chardet
In [5]: chardet.detect(r.content)
Out[5]: {'confidence': 0.99, 'encoding': 'utf-8', 'language': ''}
The above is demonstrated in the ipython interactive interpreter (ipython is highly recommended; it is much better than Python's built-in interpreter). The URL opened is the digital edition of Shanxi Daily; checking the page source by hand shows it is utf-8, and chardet also judges it to be utf-8. But requests itself judged the encoding to be ISO-8859-1, so the Chinese in the returned text would be garbled.
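If you prefer to decode the bytes yourself, one way (shown here only as an illustration; cchardet is installed with pip install cchardet) is:

import requests
import cchardet

res = requests.get('http://epaper.sxrb.com/')
encoding = cchardet.detect(res.content)['encoding']    # e.g. 'utf-8'
html = res.content.decode(encoding, errors='replace')  # decode the bytes to str manually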
Another nice feature of requests is Session, which behaves partly like a browser and saves cookies. Later, crawlers that need to log in or depend on cookies can be implemented with a session.
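A minimal sketch of using a Session (the login URL and form fields below are placeholders, not a real site):

import requests

session = requests.Session()
# Cookies set by the server during login are stored inside the session
session.post('http://example.com/login',
             data={'username': 'me', 'password': 'secret'},
             timeout=5)
# Later requests made through the same session send the saved cookies automatically
res = session.get('http://example.com/my-page', timeout=5)
print(res.status_code)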
2. The re module
Regular expressions are mainly used to extract content from HTML, such as the link extraction in this example. For more complex HTML extraction, lxml is recommended.
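For comparison, extracting the same href attributes with lxml could look like this (a sketch, assuming lxml is installed with pip install lxml):

import requests
import lxml.html

html = requests.get('http://news.baidu.com/', timeout=5).text
doc = lxml.html.fromstring(html)
# XPath collects the href attribute of every <a> tag
links = doc.xpath('//a/@href')
print('find links:', len(links))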
3. The tldextract module
This is a third-party module that needs to be installed with pip install tldextract. Its name stands for Top Level Domain extract, i.e. top-level domain extraction. We talked about the structure of a URL earlier: in http://news.baidu.com/ the host is news.baidu.com, which is a subdomain of the registered domain baidu.com, and com is the top-level domain (TLD). Extracting it gives this result:
In [6]: import tldextract
In [7]: tldextract.extract('http://news.baidu.com/')
Out[7]: ExtractResult(subdomain='news', domain='baidu', suffix='com')
The return structure consists of three parts: subdomain, domain, and suffix
4. The time module
Time is a concept we use all the time in programs, for example pausing for a while inside a loop or getting the current timestamp. The time module provides these time-related functions. There is also another time-related module, datetime, which can be chosen depending on the situation.
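A few typical calls, for reference:

import time
import datetime

print(time.time())              # current timestamp in seconds, e.g. 1528001234.56
time.sleep(2)                   # pause for 2 seconds
print(datetime.datetime.now())  # current local date and time as a datetime object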
Keep these modules in mind; they will serve you well when writing crawlers in the future.