This article presents four ways to parse web pages in a Python crawler. They are concise and easy to understand, and I hope the detailed walkthrough gives you something to take away.
Writing crawler tools in Python is common nowadays; plenty of people want a program that picks up some data from the Internet for data analysis or other uses.
We know that the principle of a crawler is simple: download the content of the target URL and hold it in memory. At that point the content is just a pile of HTML, and you then parse that HTML and extract the data you want, following your own approach. So today we will talk about four ways to parse the HTML content of web pages in Python; each has its own advantages and suits different situations.
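Before getting into specifics, here is a minimal sketch of that download-then-parse flow; the URL and the pattern are placeholders of mine, not from the article:

import re
import requests

# Step 1: download. The body of the target URL arrives as one big HTML string.
html = requests.get("https://example.com").text

# Step 2: parse and extract. Pull out the pieces you want; here, the page title.
titles = re.findall(r"<title>(.*?)</title>", html, re.S)
print(titles)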
First of all, we needed a website to work with, and Douban came to mind. Well, after all, it is a website built with Python, so let's use it for the demonstration.
We found the home page of Douban's Python crawler group.
Using the browser's developer tools, we can inspect the HTML and locate the content we want: the titles and links of the posts in the discussion group.
The inspection shows that everything we want sits inside one region of the HTML code, so we just need a way to pull the content out of that region.
Now let's start writing the code.
1: Regular expressions
Regular expressions are usually used to retrieve and replace texts that conform to a certain pattern, so we can use this principle to extract the information we want.
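As a toy illustration of that retrieve-and-replace principle (my own example, not from the article):

import re

text = 'Post: Hello <b>world</b>'
# Retrieve: find every piece of text the pattern captures.
print(re.findall(r'<b>(.*?)</b>', text))   # ['world']
# Replace: strip everything the pattern matches.
print(re.sub(r'</?b>', '', text))          # Post: Hello world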
Refer to the following code.
In the code below, you need to specify the request headers by hand so that the request pretends to come from a browser; otherwise Douban returns an HTTP 418 error because the request looks abnormal.
We then make the request directly with the get method of the requests library. After obtaining the content we also need to convert its encoding, which is due to Douban's page rendering; normally you could use the requests response content directly.
Python code that mimics a browser request and fetches the page content:
import re
import requests

url = 'https://www.douban.com/group/491607/'
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:71.0) Gecko/20100101 Firefox/71.0"}
# Decode by hand because of Douban's page encoding.
response = requests.get(url=url, headers=headers).content.decode('utf-8')
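An equivalent way to handle the encoding (a variant of mine, not the article's) is to set it on the response object and let requests do the decoding:

response = requests.get(url=url, headers=headers)
response.encoding = 'utf-8'   # override the encoding requests guessed
html = response.text          # same result as decoding .content by hand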
The drawback of regular expressions is that they are cumbersome to write and hard to read, but their matching efficiency is very high. Still, now that there are so many off-the-shelf HTML content parsing libraries, I personally do not recommend matching content with hand-written regular expressions; it is time-consuming and laborious.
The main parsing code (note: the HTML tag literals inside these regular expressions were stripped when this page was archived, and the last line was cut off; the tags below, including the table class, are a reconstruction based on Douban's group-page markup and may not match it exactly):

re_div = r'<table\s+class="olt">[\W|\w]+</table>'  # the region that holds the post list
pattern = re.compile(re_div)
content = re.findall(pattern, str(response))
re_link = r'<a .*?>(.*?)</a>'  # capture each link's text, i.e. the post title
mm = re.findall(re_link, str(content), re.S | re.M)
urls = re.findall(r'href="(.*?)"', str(content), re.I | re.S | re.M)  # capture the link targets
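To eyeball the result, you can print the pairs (a usage sketch under the same assumptions as the reconstruction above):

for title, link in zip(mm, urls):
    print(title, link)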