
How to Analyze the Four Selectors of Python Web Crawlers: Regular Expressions, BS4, Xpath, and CSS


How do you analyze the four selectors of Python web crawlers: regular expressions, BS4, Xpath, and CSS? I believe many inexperienced readers are unsure what to do about it, so this article summarizes each approach and its trade-offs. I hope that after reading it you will be able to solve this problem.

Today, the editor gives you a summary of these four selectors, so that you can gain a deeper understanding of and familiarity with Python selectors.

I. Regular expressions

Regular expressions provide us with a shortcut for grabbing data. Although a regular expression can adapt more easily to future changes, it is difficult to construct and hard to read. When crawling JD.com, the regular expression used is shown in the figure below:

(Figure: using regular expressions to accurately collect the target information)

In addition, as we all know, web pages change frequently. Even minor layout changes can break previously written regular expressions, and they are not easy to debug. When there are many matches, extracting the target information with regular expressions also slows the program down and consumes more memory.
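As a minimal sketch (the URL and the tag pattern below are hypothetical, not the JD.com markup from the figure), extracting fields with a regular expression looks like this:

import re
import urllib.request

# Hypothetical product page; the <em class="title"> pattern is an assumption.
html = urllib.request.urlopen('https://example.com/products').read().decode('utf-8')

# Non-greedy (.*?) keeps each match inside a single pair of tags;
# re.S lets the dot also match newlines inside the tag body.
titles = re.findall(r'<em class="title">(.*?)</em>', html, re.S)

for title in titles:
    print(title.strip())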

II. BeautifulSoup

BeautifulSoup is a very popular Python module. It parses web pages and provides a convenient interface for locating content. The module can be installed with 'pip install beautifulsoup4'.

(Figure: using BeautifulSoup to extract the target information)

The first step in using BeautifulSoup is to parse the downloaded HTML content into a soup document. Since most web pages are not well-formed HTML, BeautifulSoup has to determine their actual format. It correctly handles missing quotation marks and unclosed tags, and adds <html> and <body> tags where needed to form a complete HTML document. We usually use the find() and find_all() methods to locate the elements we need. If you want to know all of BeautifulSoup's methods and parameters, you can consult the official BeautifulSoup documentation. Although BeautifulSoup code is a little more verbose than a regular expression, it is easier to construct and understand.
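As a minimal sketch (the HTML snippet is made up for illustration), find() and find_all() work like this:

from bs4 import BeautifulSoup

# Deliberately sloppy HTML: an unquoted attribute value and an unclosed <li>.
broken_html = '<ul class=country><li>Area</li><li>Population</ul>'

# The built-in parser normalizes the markup into a navigable soup document.
soup = BeautifulSoup(broken_html, 'html.parser')

first_item = soup.find('li')       # first match only
all_items = soup.find_all('li')    # every match, as a list

print(first_item.get_text())                  # Area
print([li.get_text() for li in all_items])    # ['Area', 'Population']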

III. Lxml

The lxml module is written in C. It parses faster than BeautifulSoup, but its installation process is more involved, so I won't go into it here. XPath uses path expressions to select nodes in an XML document; nodes are selected by following a path or a series of steps.

Xpath

The first step in using the lxml module, as with BeautifulSoup, is to parse potentially invalid HTML into a consistent format. Although lxml correctly handles missing quotation marks around attribute values and closes unclosed tags, the module does not add extra <html> and <body> tags.

Browser developer tools make it easy to copy an Xpath expression for a given element. However, the Xpath expressions obtained this way generally cannot be used in a program as-is, and they are too long to read. So you usually have to write Xpath expressions yourself.
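As a minimal sketch with a hand-written Xpath expression (the page markup is made up for illustration):

from lxml import html

# Made-up page; note the unclosed second <li>, which lxml repairs.
page = '''
<html><body>
  <ul class="country">
    <li>Area</li>
    <li>Population
  </ul>
</body></html>
'''

tree = html.fromstring(page)

# Hand-written Xpath: the text of every <li> under a <ul> with class "country".
names = tree.xpath('//ul[@class="country"]/li/text()')
print([n.strip() for n in names])   # ['Area', 'Population']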

IV. CSS

A CSS selector represents a pattern used to select elements. BeautifulSoup integrates CSS selector syntax with its own easy-to-use API. For developers who are familiar with CSS selector syntax, this makes CSS selectors a very convenient option when writing web crawlers.

CSS selector

Here are some examples of commonly used selectors (a runnable sketch follows the list).

Select all tags: *

Select the <a> tag: a

Select all elements with class="link": .link

Select the <a> tag with class="link": a.link

Select the <a> tag with id="home": a#home

Select all <span> tags whose parent element is an <a> tag: a > span

Select all <span> tags inside an <a> tag: a span

Select all <a> tags whose title attribute is "Home": a[title=Home]
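As a minimal sketch of several of these selectors through BeautifulSoup's select() method (the markup is made up for illustration):

from bs4 import BeautifulSoup

# Hypothetical markup exercising the selectors listed above.
doc = '''
<a id="home" class="link" title="Home" href="/">Home <span>icon</span></a>
<a href="/about">About</a>
'''

soup = BeautifulSoup(doc, 'html.parser')

print(soup.select('a.link'))          # <a> tags with class="link"
print(soup.select('a#home'))          # the <a> tag with id="home"
print(soup.select('a > span'))        # <span> children of an <a> tag
print(soup.select('a[title=Home]'))   # <a> tags whose title is "Home"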

V. Performance comparison

Both lxml and the regular expression module are written in C, while BeautifulSoup is written in pure Python. The following table summarizes the advantages and disadvantages of each scraping method:

Scraping method       | Performance | Ease of use | Installation
Regular expressions   | Fast        | Hard        | Simple (built-in module)
BeautifulSoup         | Slow        | Easy        | Simple (pure Python)
Lxml                  | Fast        | Easy        | Relatively difficult

It is important to note that, in its internal implementation, lxml actually converts CSS selectors into equivalent Xpath selectors.
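You can see this translation directly with the cssselect package that lxml relies on for CSS selector support (assuming it is installed via 'pip install cssselect'):

from cssselect import GenericTranslator

# cssselect turns a CSS selector into the Xpath expression lxml will run,
# e.g. descendant-or-self::a[@class and contains(concat(' ',
# normalize-space(@class), ' '), ' link ')]
print(GenericTranslator().css_to_xpath('a.link'))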

If your crawler's bottleneck is downloading web pages rather than extracting data, then using a slower method such as BeautifulSoup is not a problem. If you only need to grab a small amount of data and want to avoid extra dependencies, regular expressions may be more appropriate. In general, however, lxml is the best choice for extracting data because it is fast and robust, while regular expressions and BeautifulSoup are only useful in certain scenarios.
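A rough way to check this trade-off yourself is a small timeit comparison. The sketch below generates a synthetic page and times the three approaches; absolute numbers will vary by machine:

import re
import timeit
from bs4 import BeautifulSoup
from lxml import html

# Synthetic page with 1,000 list items.
page = '<ul>' + ''.join(f'<li>item {i}</li>' for i in range(1000)) + '</ul>'

def with_regex():
    return re.findall(r'<li>(.*?)</li>', page)

def with_bs4():
    soup = BeautifulSoup(page, 'html.parser')
    return [li.get_text() for li in soup.find_all('li')]

def with_lxml():
    return html.fromstring(page).xpath('//li/text()')

for fn in (with_regex, with_bs4, with_lxml):
    print(fn.__name__, round(timeit.timeit(fn, number=20), 3), 'seconds')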

After reading the above, do you know how to analyze the four selectors of Python web crawlers: regular expressions, BS4, Xpath, and CSS? If you want to learn more skills or learn more about related topics, you are welcome to follow the industry information channel. Thank you for reading!
