
How to use the Python beautifulsoup4 module


This article explains how to use the Python beautifulsoup4 module. The approach introduced here is simple, fast, and practical, so let's walk through it step by step.

I. BeautifulSoup4 basics

BeautifulSoup4 is a Python parsing library used mainly to parse HTML and XML; in a crawler's toolkit it is used mostly for HTML.

The library installation command is as follows:

pip install beautifulsoup4

When parsing data, BeautifulSoup relies on an underlying parser. The common parsers and their advantages are as follows (a short usage sketch follows the list):

html.parser: Python's built-in standard library parser; good fault tolerance

lxml: fast, with good fault tolerance

html5lib: the most fault tolerant; parses pages the same way a browser does.
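The parser is selected by the second argument of the BeautifulSoup constructor. A minimal sketch (not from the original text); note that lxml and html5lib have to be installed separately with pip:

from bs4 import BeautifulSoup

markup = "<p>hello</p>"
print(BeautifulSoup(markup, "html.parser").p)  # built-in parser, no extra install needed
print(BeautifulSoup(markup, "lxml").p)         # requires: pip install lxml
print(BeautifulSoup(markup, "html5lib").p)     # requires: pip install html5lib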

Next, a custom HTML snippet is used to demonstrate the basic usage of the beautifulsoup4 library: BeautifulSoup performs simple operations on it, such as instantiating the BS object and outputting page tags. The test code is as follows:

from bs4 import BeautifulSoup

text_str = """
<html>
  <head><title>Testing the bs4 module script</title></head>
  <body>
    <p>Demonstrate with a custom HTML code</p>
    <p>Demonstrate with 2 custom HTML codes</p>
  </body>
</html>
"""
# Instantiate the BeautifulSoup object from the string above;
# a file can be parsed the same way: soup = BeautifulSoup(open('test.html'))
soup = BeautifulSoup(text_str, "html.parser")
print(soup)        # the whole document
print(soup.title)  # output the page title tag
print(soup.head)   # output the page head tag
print(soup.p)      # the first p tag is returned by default

We can access page tags directly through the BeautifulSoup object, but there is a catch: accessing a tag through the BS object only returns the first occurrence. In the code above, only one p tag is obtained. If you want to get more content, keep reading.

At this point, we need to understand the four built-in objects in BeautifulSoup:

BeautifulSoup: the basic object, representing the whole HTML document; it can be treated as a special Tag object

Tag: the tag object. Tags are nodes in a web page, such as title, head, and p.

NavigableString: tag internal string

Comment: the comment object. It rarely comes up in crawler scenarios.

The following code shows the scenarios where these objects appear; pay attention to the comments in the code:

from bs4 import BeautifulSoup

text_str = """
<html>
  <head><title>Testing the bs4 module script</title></head>
  <body>
    <p>Demonstrate with a custom HTML code</p>
    <p>Demonstrate with 2 custom HTML codes</p>
  </body>
</html>
"""
# Instantiate the BeautifulSoup object; a file can be parsed the same way:
# soup = BeautifulSoup(open('test.html'))
soup = BeautifulSoup(text_str, "html.parser")
print(soup)
print(type(soup))               # <class 'bs4.BeautifulSoup'>
# Output the page title tag
print(soup.title)
print(type(soup.title))         # <class 'bs4.element.Tag'>
print(type(soup.title.string))  # <class 'bs4.element.NavigableString'>
# Output the page head tag
print(soup.head)

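The sample HTML above contains no comments, so the Comment object does not appear there. A minimal sketch of where it shows up (the markup here is made up for illustration):

from bs4 import BeautifulSoup, Comment

comment_soup = BeautifulSoup("<p><!-- this is a comment --></p>", "html.parser")
print(comment_soup.p.string)                       # this is a comment
print(type(comment_soup.p.string))                 # <class 'bs4.element.Comment'>
print(isinstance(comment_soup.p.string, Comment))  # True
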
For Tag objects, there are two important properties: name and attrs.

from bs4 import BeautifulSoup

text_str = """
<html>
  <head><title>Testing the bs4 module script</title></head>
  <body>
    <p>Demonstrate with a custom HTML code</p>
    <p>Demonstrate with 2 custom HTML codes</p>
    <a href="https://www.csdn.net">CSDN website</a>
  </body>
</html>
"""
soup = BeautifulSoup(text_str, "html.parser")
print(soup.name)         # [document]
print(soup.title.name)   # title, the tag's name
print(soup.html.body.a)  # reach a lower-level tag by going down level by level
print(soup.body.a)       # html, as the special root tag, can be omitted
print(soup.p.a)          # None, because there is no a tag inside the p tag
print(soup.a.attrs)      # get the tag's attributes as a dictionary

The code above demonstrates the name and attrs attributes. attrs returns a dictionary, so individual values can be read by key.

To get a tag's attribute value, BeautifulSoup also offers the following methods:

print(soup.a["href"])
print(soup.a.get("href"))

Getting the NavigableString object: after getting a page tag, the next step is to get the text inside it, which the following code does.

print(soup.a.string)

In addition, you can also use the text attribute and the get_text() method to get the tag content.

print(soup.a.string)
print(soup.a.text)
print(soup.a.get_text())
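One difference worth keeping in mind (a small sketch with made-up markup, not from the article's sample): string returns None when a tag contains more than one child node, while text and get_text() concatenate all descendant text.

from bs4 import BeautifulSoup

demo = BeautifulSoup("<div>Hello <b>world</b></div>", "html.parser")
print(demo.div.string)      # None, because the div has more than one child node
print(demo.div.get_text())  # Hello world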

You can also get all the text within the tag, using strings and stripped_strings.

print(list(soup.body.strings))           # includes whitespace and newlines
print(list(soup.body.stripped_strings))  # whitespace and newlines removed

Traversing the document tree (extended tag and node selectors)

Direct child nodes

The direct children of a Tag object can be obtained using the contents and children attributes.

from bs4 import BeautifulSoup

text_str = """
<html>
  <head><title>Testing the bs4 module script</title></head>
  <body>
    <div id="content">
      <h2>Eraser's crawler course <span>is the best</span></h2>
      <p>Demonstrate with a custom HTML code</p>
      <p>Demonstrate with 2 custom HTML codes</p>
      <a href="https://www.csdn.net">CSDN website</a>
    </div>
    <ul class="nav li">
      <li>Homepage</li>
      <li>Blog</li>
      <li>Column course</li>
    </ul>
  </body>
</html>
"""
soup = BeautifulSoup(text_str, "html.parser")
# The contents attribute returns the node's direct children as a list
print(soup.div.contents)
# The children attribute also returns the direct children, but as a generator
print(soup.div.children)

Note that both attributes return only direct children; deeper descendants, such as the span tag inside the h2 tag, are not returned as separate items.

If you want all descendant tags, use the descendants attribute. It returns a generator, and every tag, as well as the text inside it, is yielded as a separate item.

print(list(soup.div.descendants))

Accessing other nodes (just be aware of these and look them up when needed)

parent and parents: the direct parent node and all ancestor nodes

next_sibling, next_siblings, previous_sibling, previous_siblings: the next sibling node, all following siblings, the previous sibling node, and all preceding siblings. Since a newline character is also a node, watch out for whitespace nodes when using these attributes.

next_element, next_elements, previous_element, previous_elements: the next or previous node(s) in document order. Note that they are not restricted to one level of the tree; they walk every node. For example, in the code above the next element of the div node is the h2, while the next sibling of the div node is the ul. A short sketch of these attributes follows below.
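None of these attributes have an example in the original text; here is a minimal sketch, assuming soup was built from the same sample HTML as in the previous snippet (exact results depend on whitespace in the markup):

h2 = soup.h2
print(h2.parent.name)                    # div, the direct parent
print([p.name for p in h2.parents])      # ['div', 'body', 'html', '[document]']
print(repr(soup.div.next_sibling))       # often a whitespace text node rather than a tag
print(soup.div.find_next_sibling("ul"))  # the ul tag, skipping whitespace nodes
print(soup.h2.next_element)              # the text inside the h2: next_element walks document order and descends into children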

Functions for searching the document tree

The first function to learn is find_all(). Its prototype is as follows:

find_all(name, attrs, recursive, text, limit=None, **kwargs)

name: the tag name to match. For example, find_all('p') finds all p tags. Tag name strings, regular expressions, and lists are all accepted.

attrs: attributes to match, passed as a dictionary, for example attrs={'class': 'nav'}. The result is a list of Tag objects.

Examples of the use of the above two parameters are as follows:

import re

print(soup.find_all('li'))                    # get all li tags
print(soup.find_all(attrs={'class': 'nav'}))  # pass attributes through the attrs parameter
print(soup.find_all(re.compile("p")))         # pass a regular expression; in testing the results are not ideal
print(soup.find_all(['a', 'p']))              # pass a list

recursive: when find_all() is called, BeautifulSoup searches all descendants of the current tag. To search only the direct children, pass recursive=False. The test code is as follows:

print(soup.body.div.find_all(['a', 'p'], recursive=False))  # pass a list; only the div's direct children are searched

text: matches text strings in the document. Like the name parameter, it accepts strings, regular expressions, and lists.

print(soup.find_all(text='Homepage'))                          # ['Homepage']
print(soup.find_all(text=re.compile("^Homepage")))             # ['Homepage']
print(soup.find_all(text=["Homepage", re.compile("course")]))  # ["Eraser's crawler course ", 'Homepage', 'Column course']

limit: limits the number of results returned

**kwargs: a keyword argument whose name is not one of the built-in search parameters is treated as a tag attribute to search on. To search by the class attribute you must write class_, because class is a Python reserved word. Matching succeeds as long as one of the tag's CSS class names matches; to match several class names at once, list them in the same order as they appear in the tag.

print(soup.find_all(class_='nav'))     # one matching CSS class name is enough
print(soup.find_all(class_='nav li'))  # multiple class names must appear in the same order as in the tag

Also note that some attributes cannot be used as keyword arguments in a search, such as the data-* attributes in HTML5; these have to be matched through the attrs parameter.
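A short sketch of limit together with the attrs workaround for data-* attributes (the data-id attribute below is hypothetical; the sample HTML above does not contain one):

print(soup.find_all("li", limit=2))                # at most two results are returned
print(soup.find_all(attrs={"data-id": "qrcode"}))  # data-* attributes must go through attrs; returns [] here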

Other methods whose usage is essentially the same as find_all() are listed below:

find(): prototype find(name, attrs, recursive, text, **kwargs); returns a single matching element

find_parents(), find_parent(): prototype find_parent(self, name=None, attrs={}, **kwargs); return the parent node(s) of the current node

find_next_siblings(), find_next_sibling(): prototype find_next_sibling(self, name=None, attrs={}, text=None, **kwargs); return the following sibling node(s) of the current node

find_previous_siblings(), find_previous_sibling(): same as above, but return the preceding sibling node(s) of the current node

find_all_next(), find_next(), find_all_previous(), find_previous(): prototype find_all_next(self, name=None, attrs={}, text=None, limit=None, **kwargs); search the nodes that come after (or, for the previous variants, before) the current node in document order. A few usage sketches follow below.
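A few usage sketches against the same sample document as above (find() returns a single element or None; the others mirror the find_all() parameters):

print(soup.find('li'))                         # only the first li, or None if nothing matches
print(soup.find('span').find_parent('h2'))     # the h2 that encloses the span
print(soup.find('h2').find_next_sibling('p'))  # the first p that follows the h2 at the same level
print(soup.find('a').find_all_next('li'))      # every li that appears after the a in document order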

CSS selectors

The knowledge in this section overlaps a bit with pyquery. The core is the select() method, and the returned data is a list; a companion method, select_one(), is sketched after the examples below.

Find by tag name: soup.select("title")

Find by class name: soup.select(".nav")

Find by id: soup.select("#content")

Combined lookup: soup.select("div#content")

Find by attribute: soup.select("div[id='content']"), soup.select("a[href]")
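select() always returns a list; bs4 also provides select_one(), which returns only the first match or None. A quick sketch, assuming the same sample document:

print(soup.select("li"))      # a list of every matching tag
print(soup.select_one("li"))  # only the first matching tag, or None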

There are also some techniques that can be used when looking through attributes, such as:

^=: matches nodes whose attribute value starts with the given characters:

print(soup.select('ul[class^="na"]'))

*=: matches nodes whose attribute value contains the given characters:

print(soup.select('ul[class*="li"]'))

II. Crawler case

After mastering the basics of BeautifulSoup, writing a crawler case is very simple. The target website collected this time hosts a large number of artistic QR codes, which designers can use as a reference.

The following code applies the tag retrieval and attribute retrieval features of the BeautifulSoup module. The complete code is as follows:

from bs4 import BeautifulSoup
import requests
import logging

logging.basicConfig(level=logging.NOTSET)


def get_html(url, headers) -> None:
    res = None
    try:
        res = requests.get(url=url, headers=headers, timeout=3)
    except Exception as e:
        logging.debug("Exception while collecting: %s", e)
    if res is not None:
        html_str = res.text
        soup = BeautifulSoup(html_str, "html.parser")
        imgs = soup.find_all(attrs={'class': 'lazy'})
        print("The amount of data obtained is", len(imgs))
        datas = []
        for item in imgs:
            name = item.get('alt')
            src = item["src"]
            logging.info(f"{name} {src}")
            # collect the (name, src) pairs
            datas.append((name, src))
        save(datas, headers)


def save(datas, headers) -> None:
    if datas is not None:
        for item in datas:
            res = None
            try:
                # fetch the image itself
                res = requests.get(url=item[1], headers=headers, timeout=5)
            except Exception as e:
                logging.debug(e)
            if res is not None:
                img_data = res.content
                with open("./imgs/{}.jpg".format(item[0]), "wb+") as f:
                    f.write(img_data)
    else:
        return None


if __name__ == '__main__':
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36"}
    url_format = "http://www.9thws.com/#p{}"
    urls = [url_format.format(i) for i in range(1, 2)]
    get_html(urls[0], headers)

The test output is produced by the logging module, and the effect is shown in the figure below. Only one page of data was collected in the test; to widen the collection range, simply modify the page-number rule in the main function. While writing the code, it turned out that the site's data request is actually a POST returning JSON, so this case only serves as an introductory BeautifulSoup example.

At this point, I believe you have a deeper understanding of how to use the Python beautifulsoup4 module. You might as well try it out in practice, and keep learning!
