Network Security Internet Technology Development Database Servers Mobile Phone Android Software Apple Software Computer Software News IT Information

In addition to Weibo, there is also WeChat

Please pay attention

WeChat public account

Shulou

How the python crawler uses the BeautifulSoup library

2025-04-04 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Development >

Share

Shulou(Shulou.com)06/02 Report--

This article introduces the knowledge of "how python crawler uses BeautifulSoup library". In the operation of actual cases, many people will encounter such a dilemma, so let the editor lead you to learn how to deal with these situations. I hope you can read it carefully and be able to achieve something!

The basic elements of the BeautiSoup class and the basic methods of traversing HTML with bs4

1 the basic elements of the BeautifulSoup class

There are five basic elements:

Tag: tag, the most basic information organization unit.

Name: tag name,.. name: returns a string

Attributes: attribute of the tag,.. attrs: returns a dictionary type

NavigableString: the string in the tag,.. string: returns a string

Comment: the comment information in the tag will be obtained in.. string.

For example, there is now an HTML document:

The demo python introduces several python courses.

Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:

Basic Python

And

Advanced Python.

This is not a Comment.

We apply this document with the basic elements of the BS class:

First make a pot of soup: > soup = BeautifulSoup (html, "html.parser")

Use. Get all the information about a tag:

Such as: > soup.p

Use .name to get the name of the tag:

Such as: > p.name

Another important use of the Tag attribute is that when we traverse the child nodes within a tag, there may be nodes that are not tags, such as'\ n'. In this case, we can use the bs4.element.Tag attribute provided by the bs4 library to identify.

The isinstance () method is used. For example: isinstance (soup.a, bs4.element.Tag)

Use .attrs to get the attributes of this tag:

Such as > soup.a.attrs

You can obtain the specific attributes of the a tag: > soup.a ['href']

Use .string to get the string information in this tag:

Such as: > soup.p.string:

We see that .string can cross the tag and get the string within the tag directly.

However: when the outer tag contains multiple parallel inner tags, it will not work:

For example: which is a good http://m.zzzy120.com/ for Zhengzhou abortion Hospital?

There are several parallel a tags in this p tag, which are returned as None directly with p.string.

Finally, talk about the Comment attribute

You can use > soup.b.string to get annotated information:

The two types are different, so you can use the isinstance (, bs4.element.Comment) of the bs4 library to filter the comment information or get the comment information.

2, traversing HTML with BeautifulSoup

1, traverse the following link:

3 attributes:

1. Contents: a list of child nodes, with all child nodes in the list (including all\ n)

2. Children: the iteration type of the child node, similar to .contents, used to loop through the son node.

3, .descendants: contains all descendant nodes. Like children, it can only be used for iterations.

2, uplink traversal:

2 attributes:

Parent: the parent tag of the node

2recover.ancestor: the iteration type of the ancestor of the node, which is used to cycle through the ancestor node (all ancestors).

3, parallel traversal:

4 parallel traversal attributes: (need to occur between nodes under the same parent node.)

NextNode sibling: returns the next parallel node tag in HTML text order.

2. Previousroomsibling: returns. Gets or sets the last parallel node label of the

3.nexttransisiblings: iteration type that returns all parallel node tags in order.

4. Previousfantsibings: iterative type.

3Formenting output and coding of BeautifulSoup:

Prettify () function: can add newline characters to HTML tags and text

1. All the information in the world can be marked in the form of tags in 3.

XML format: similar to HTML, label form.

JSON: there are types of key values for key:value, such as:

"name": "Beijing" -; a key-value pair

"name": ["hello", "hello"] -: a key corresponds to multiple values of 1

"name": {"subkey": "subvalue"} -: a key-value pair corresponds to a key as a value

YAML: untyped key-value pair: key: value

Name: Beijing

Name:

-newName: Beijing

-newName: Shanghai (tied)

Name:

Newname: hello (nested)

Comparison:

XML tags: extensible, but tedious. XML is often used for information exchange and transmission on Internet.

JSON uses "": information is typed, suitable for program processing (js), and cleaner than XML. (cannot be annotated) used for information communication between mobile application clouds and nodes. It is used to process the interface of the program.

YAML uses indentation: no type of information, the highest proportion of text information, good readability. Configuration files of all kinds of systems are easy to read with comments.

2, the way of extracting information

1. Completely parse the marked form of the information, and then extract the key information.

2, ignore the form of tags and search for key information directly.

3. Fusion method.

For example: extract all URL tags in HTML.

Ideas):

1. Search all tags

2. Parse the tag format and extract the linked content after href.

3DFINDING all () method

The most commonly used search method

.find _ all (name, attrs, recursive, string, * * kwargs)

Returns a list type that stores the results of the search.

How to use it:

Name,attrs, recursive, string, etc. can be specified separately.

Label signature, attribute (or dictionary form of key-value pair), default is to search for descendant nodes, string.

(..) Equivalent to .find _ all (..)

Soup (..) Equivalent to soup.find_all (..)

The .find () search returns only one result, a string result.

That's all for "how python crawlers use the BeautifulSoup library". Thank you for reading. If you want to know more about the industry, you can follow the website, the editor will output more high-quality practical articles for you!

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

Views: 0

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.

Share To

Development

Wechat

© 2024 shulou.com SLNews company. All rights reserved.

12
Report