How to use beautifulsoup4 Library

Updated: 2025-01-19. From: SLTechnology News&Howtos

Shulou(Shulou.com)06/03 Report--

This article explains how to use the beautifulsoup4 library: what it is for, the attributes it exposes on a parsed page and on individual tags, and how to search a document with find_all() and find().

Using the beautifulsoup4 library

After you fetch an HTML page with the requests library and have it as a string, the next step is to parse the page and extract the useful information, which requires a library that understands HTML and XML. The beautifulsoup4 library, also called Beautiful Soup or bs4, parses and processes HTML and XML. Note that the package to install is beautifulsoup4, not the older package named BeautifulSoup, which is the previous major version (Beautiful Soup 3). Its main advantage is that it builds a parse tree from the HTML or XML syntax, which makes it convenient to navigate the document and extract its contents. The library is object-oriented: put simply, it treats each page as an object, and you work with it through calls of the form <a>.<b>(), where <a> is the object and <b>() is one of its methods (that is, a handling function).
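As a minimal sketch of the object-and-method pattern described above, the following parses a small HTML string and calls one method on the resulting object. The HTML string is made up for illustration (in practice it might be the text of a requests response), and "html.parser" is chosen here only because it ships with Python and needs no extra install.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for this example.
html = "<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>"

# Build the parse tree; "html.parser" is the parser from the standard library.
soup = BeautifulSoup(html, "html.parser")

# The <a>.<b>() form: call a method on the page object.
# get_text() concatenates all text found in the document.
page_text = soup.get_text()

print(type(soup).__name__)
print(page_text)
```

Other parsers such as lxml can be passed in place of "html.parser" if they are installed.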

Some commonly used attributes of a BeautifulSoup object are as follows:

head: the <head> content of the HTML page

title: the title of the HTML page, marked by the <title> tag inside <head>

body: the <body> content of the HTML page

p: the first <p> element in the HTML page

strings: all strings rendered on the page, that is, the text content of the tags

stripped_strings: all non-whitespace strings rendered on the page

Attributes such as head, title, body and p share their names with HTML tags. Any tag name can be used this way, so there are many more than the ones listed here.
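The attributes listed above can be tried on a small document. This is an illustrative sketch: the HTML string and the variable names are invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical page used only for this example.
html = """<html><head><title>Demo page</title></head>
<body><p>First paragraph</p><p>Second <b>bold</b> paragraph</p></body></html>"""
soup = BeautifulSoup(html, "html.parser")

# Tag-name attributes return the first matching tag in the document.
title_text = soup.title.string      # text of <title>, which sits inside <head>
first_p_text = soup.p.string        # text of the first <p> only
body_name = soup.body.name          # the <body> tag itself

# strings yields every text node, including whitespace-only ones;
# stripped_strings skips those and strips surrounding whitespace.
all_strings = list(soup.strings)
clean_strings = list(soup.stripped_strings)

print(title_text)
print(first_p_text)
print(clean_strings)
```

Both strings and stripped_strings are generators, so they are wrapped in list() here to inspect the results.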

Common attributes of Tag objects:

name: string, the name of the tag, such as div

attrs: dictionary containing all the attributes of the tag as written in the original page, such as href

contents: list of all direct children of the tag (child tags and text)

string: string, the text enclosed by the tag, that is, the actual text on the web page. The value of the string attribute follows these rules:

(1) If the tag contains no other tags, string returns its text content.

(2) If the tag contains exactly one child tag and nothing else, string returns the text of that innermost tag.

(3) If the tag contains more than one child (for example several tags, or text mixed with tags), string returns None, which is not the same as an empty string.
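The Tag attributes and the three cases for string can be checked with a short sketch. The HTML fragment and the variable names below are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one link, one <p> with a single child tag,
# and one <p> that mixes text with a child tag.
html = (
    '<div id="box" class="note">'
    '<a href="https://example.com">link text</a>'
    '<p><b>only child</b></p>'
    '<p>two <i>children</i></p>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

div = soup.div
# contents lists the direct children of the <div>, in document order.
link, p_single, p_mixed = div.contents

print(div.name)     # tag name
print(link.attrs)   # attribute dictionary of the <a> tag
print(div.attrs)    # class is multi-valued, so its value is a list

print(link.string)       # case 1: no child tags, returns the text
print(p_single.string)   # case 2: exactly one child tag, returns its text
print(p_mixed.string)    # case 3: more than one child, returns None
```

When string is None but the text is still needed, get_text() returns the concatenated text of all descendants.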

Two BeautifulSoup methods traverse the HTML document and return the tags that match given conditions. The first is find_all():

BeautifulSoup.find_all(name, attrs, recursive, string, limit)

Function: finds the tags that match the parameters and returns them as a list. The parameters are as follows:

name: search by tag name, given as a string, such as div or li.

attrs: search by tag attribute values. The attribute names and values are listed as a Python dictionary (written much like JSON).

recursive: sets the search depth. Use recursive=False to look only at the direct children of the current tag; by default all descendants are searched.

string: search the text content of tags by keyword, passed as string=.

limit: the maximum number of results to return. By default, all results are returned.

Put simply, find_all() searches by tag name, tag attributes, and text content, and returns a list of matching tags. Matching a fragment of a string requires the regular expression library re, which is part of the Python standard library and is available with import re. For example, re.compile('jquery') matches any string that contains the fragment 'jquery'. When searching by tag attribute, the attribute names and values are written as a dictionary, for example {'src': re.compile('jquery')}, where the value in each key-value pair can be either a string or a regular expression.
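The following sketch exercises each find_all() parameter described above, including the 'jquery' fragment search with re.compile. The page content, file paths, and variable names are invented for the example.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical page used only for this example.
html = """<html><head>
<script src="/static/jquery.min.js"></script>
<script src="/static/app.js"></script>
</head><body>
<div class="item">jquery basics</div>
<div class="item">python basics</div>
<div class="other"><div class="item">nested item</div></div>
</body></html>"""
soup = BeautifulSoup(html, "html.parser")

# name: every <div> in the document, at any depth.
all_divs = soup.find_all("div")

# attrs: a dictionary of attribute names and required values.
items = soup.find_all("div", attrs={"class": "item"})

# attrs with a regular expression as the value: src contains "jquery".
jquery_scripts = soup.find_all("script", attrs={"src": re.compile("jquery")})

# string: match text content; with no tag name this returns the strings.
jquery_text = soup.find_all(string=re.compile("jquery"))

# limit: stop after the first two matches.
first_two = soup.find_all("div", limit=2)

# recursive=False: only direct children of <body>, so the nested <div> is skipped.
direct_divs = soup.body.find_all("div", recursive=False)

print(len(all_divs), len(items), len(direct_divs))
print(jquery_scripts[0]["src"])
print(jquery_text)
```

Passing a plain string as an attribute value requires an exact match, while a compiled regular expression matches any value it finds a hit in.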

In addition to find_all(), the BeautifulSoup class provides a find() method. The only difference is that find_all() returns every match while find() returns only the first one. Because find_all() may return several results, it returns a list; find() returns a single Tag object, or None if nothing matches.

BeautifulSoup.find(name, attrs, recursive, string)

Function: finds the first tag that matches the parameters and returns it as a single Tag object, or None if there is no match.

Parameters: the same as for find_all(), except that there is no limit parameter.
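A short comparison of the two methods, as a sketch on a made-up list: find() gives back one tag (or None), while find_all() always gives back a list.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment used only for this example.
html = "<ul><li>one</li><li>two</li></ul>"
soup = BeautifulSoup(html, "html.parser")

first_li = soup.find("li")      # a single Tag: the first <li>
all_li = soup.find_all("li")    # a list of every <li>
missing = soup.find("table")    # no match, so the result is None

print(first_li.string)
print([li.string for li in all_li])
print(missing)
```

Because find() can return None, it is worth checking the result before reading attributes from it.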
