This article introduces how to locate web page elements with a Python crawler using Beautiful Soup. It covers installing the module, the objects it builds from HTML, and the find() and find_all() lookup methods. If you have had doubts about element positioning in Python crawlers, follow along and try the examples.
Official site: www.crummy.com/software/BeautifulSoup/
Beautiful Soup is well known and widely used in the Python crawler community. It is a Python parsing library that converts HTML markup into a tree of Python objects, from which we can then extract data.
Installing the module is simple:
pip install bs4 -i <any-domestic-mirror-url>
Whenever you install a module in the future, try to use a domestic mirror; downloads are faster and more stable.
The package is named bs4, which needs special attention when you install it.
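As a quick sanity check after installation, the import below should work (a minimal sketch; html.parser is Python's built-in parser, so no extra dependency is needed for this check):

# the package installs and imports as bs4; the class we use is BeautifulSoup
from bs4 import BeautifulSoup

# parse a tiny fragment to confirm the installation works
print(BeautifulSoup('<p>hello</p>', 'html.parser').p.string)  # hello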
The basic usage is as follows:

import requests
from bs4 import BeautifulSoup

def ret_html():
    """Get the HTML text of the target page."""
    res = requests.get('https://www.crummy.com/software/BeautifulSoup/', timeout=3)
    return res.text

if __name__ == '__main__':
    html_str = ret_html()
    soup = BeautifulSoup(html_str, 'lxml')
    print(soup)
Note the import line: the class to import from the bs4 module is BeautifulSoup. When instantiating the soup object, two arguments are passed to the BeautifulSoup constructor: the string to be parsed and the name of the parser. The official documentation recommends lxml because of its parsing speed.
The output of the above code looks like an ordinary HTML file.
We can also call the soup object's prettify() method to format the HTML tags, so the code looks tidy when you save it to an external file.
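For example, a minimal sketch of saving the formatted markup (output.html is an assumed file name):

html_str = ret_html()
soup = BeautifulSoup(html_str, 'lxml')
with open('output.html', 'w', encoding='utf-8') as f:
    f.write(soup.prettify())  # one tag per line, neatly indented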
Objects of the BeautifulSoup module
The BeautifulSoup class parses HTML text into a tree of Python objects. Four object types matter most: Tag, NavigableString, BeautifulSoup, and Comment. We introduce them one by one below.
BeautifulSoup object
The object itself represents the entire HTML page, and when the object is instantiated, incomplete HTML code is automatically completed by the parser.
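A minimal sketch of that auto-completion (the unclosed fragment below is an assumption for illustration):

# lxml fills in the missing html, body and closing p tags
print(BeautifulSoup('<p>hello', 'lxml'))
# <html><body><p>hello</p></body></html>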
html_str = ret_html()
soup = BeautifulSoup(html_str, 'lxml')
print(type(soup))  # <class 'bs4.BeautifulSoup'>

Tag object
Tag means tag: a Tag object corresponds to one tag, that is, one element of the page. For example, to get the h2 tag object of the bs4 official site, the code is as follows:
if __name__ == '__main__':
    html_str = ret_html()
    soup = BeautifulSoup(html_str, 'lxml')
    # print(soup.prettify())  # format the HTML
    print(soup.h2)
What you get is the h2 tag from the web page:
<h2>Beautiful Soup</h2>
With Python's built-in type() function, you can check its type:
print(soup.h2)
print(type(soup.h2))
Instead of a string, you get a Tag object.
<h2>Beautiful Soup</h2>
<class 'bs4.element.Tag'>
Since it is a Tag object, it has some specific properties:
Get tag name
print(soup.h2)
print(type(soup.h2))
print(soup.h2.name)  # get the tag name
Get the attribute value of the tag through the Tag object
print(soup.img)         # get the first img tag of the page
print(soup.img['src'])  # get the value of the element's src attribute
Get all the attributes of the tag through the attrs attribute
print(soup.img)        # get the first img tag of the page
print(soup.img.attrs)  # get all attributes of the element, returned as a dictionary
All the output of the above code is shown below; you can pick any tag to practise with.
<h2>Beautiful Soup</h2>
<class 'bs4.element.Tag'>
h2
<img align="right" src="10.1.jpg" width="250"/>
10.1.jpg
{'align': 'right', 'src': '10.1.jpg', 'width': '250'}

NavigableString object
The NavigableString object holds the text content inside a tag, such as a p tag. For a tag like <p>I'm an eraser</p>, the extracted text is:
I'm an eraser
Getting this object is also very easy: use the string property of the Tag object.
nav_obj = soup.h2.string
print(type(nav_obj))
The output is as follows:
<class 'bs4.element.NavigableString'>
If the target tag is an empty (self-closing) tag, string returns None.
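A minimal sketch of both cases (the fragment is an assumption for illustration):

snippet = BeautifulSoup('<p>hello</p><br/>', 'lxml')
print(snippet.p.string)  # hello
print(snippet.br.string)  # None, because br is an empty tag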
Besides the string property, you can also use the text property and the get_text() method to get the tag content.
print(soup.h2.text)
print(soup.p.get_text())
print(soup.p.get_text('&'))
Here text returns the text of all child tags merged into one string. get_text() does the same, but it accepts a separator, such as the & symbol in the code above, and you can also pass the strip=True parameter to strip whitespace.
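For example, a minimal sketch of the separator and strip=True together (the p fragment is an assumption):

p = BeautifulSoup('<p> a <b> b </b></p>', 'lxml').p
print(p.get_text('&'))              # ' a & b '
print(p.get_text('&', strip=True))  # 'a&b'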
Comment object
The Comment object is a special type of NavigableString that represents the content of an HTML comment.
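A minimal sketch of how a Comment shows up (the commented fragment is an assumption for illustration):

from bs4 import BeautifulSoup, Comment

c_soup = BeautifulSoup('<p><!-- a hidden note --></p>', 'lxml')
comment = c_soup.p.string
print(comment)                       #  a hidden note
print(type(comment))                 # <class 'bs4.element.Comment'>
print(isinstance(comment, Comment))  # True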
BeautifulSoup objects and Tag objects support tag lookup methods, as shown below.
The find() method and the find_all() method
You can find a specified object in the page by calling the find() method on a BeautifulSoup object or a Tag object.
The syntax format of this method is as follows:
obj.find(name, attrs, recursive, text, **kwargs)
The method returns the first element found, or None if nothing matches. The parameters are described as follows:
name: the tag name
attrs: tag attributes
recursive: whether to search all descendants (True by default; demonstrated in the sketch after this list)
text: the tag's text content
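A minimal sketch of the recursive and text parameters (the fragment is an assumption for illustration):

nav = BeautifulSoup('<div><span><a>inner</a></span></div>', 'lxml')
print(nav.div.find('a'))                   # <a>inner</a>
print(nav.div.find('a', recursive=False))  # None: a is not a direct child of div
print(nav.find('a', text='inner'))         # match by the tag's text content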
For example, we continue to look for the a tag in the page requested above, with the code as follows:
html_str = ret_html()
soup = BeautifulSoup(html_str, 'lxml')
print(soup.find('a'))
You can also use the attrs parameter to find it. The code is as follows:
html_str = ret_html()
soup = BeautifulSoup(html_str, 'lxml')
# print(soup.find('a'))
print(soup.find(attrs={'class': 'cta'}))
The find() method also provides some special keyword parameters for direct lookups. For example, you can use id=xxx to find tags whose id attribute matches, and class_=xxx (note the trailing underscore, since class is a reserved word in Python) to find tags by class.
print(soup.find(class_='cta'))
Paired with the find() method is the find_all() method; as the name suggests, it returns all matching tags. The syntax format is as follows:
obj.find_all(name, attrs, recursive, text, limit)
The point to highlight is the limit parameter, which caps the number of matches returned. The find() method can be thought of as find_all() with limit=1, which makes it easy to understand.
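A minimal sketch of limit in action (run against the same soup as above):

links = soup.find_all('a', limit=3)  # return at most 3 matches
print(len(links))
for a in links:
    print(a.get('href'))  # may print None for an a tag without href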
This concludes our study of element positioning for Python crawler web pages. I hope it has resolved your doubts. Pairing theory with practice is the best way to learn, so go and try it yourself!