
Exception capture and tag filtering in a Python crawler

2025-02-23 Update From: SLTechnology News&Howtos


Shulou (Shulou.com) 06/01

This article covers exception capture and tag filtering in a Python crawler: wrapping the urlopen and BeautifulSoup calls in try/except, extracting text with get_text(), and locating tags with find() and find_all().

Add exception capture to make problems easier to diagnose

    import ssl
    import urllib.request
    from urllib.error import HTTPError, URLError

    from bs4 import BeautifulSoup


    def get_data(url):
        headers = {
            "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                          "AppleWebKit/537.36 (KHTML, like Gecko) "
                          "Chrome/90.0.4430.93 Safari/537.36"
        }
        # Turn off HTTPS certificate verification for urlopen
        ssl._create_default_https_context = ssl._create_unverified_context

        # urlopen gets two exception traps:
        # 1. HTTPError: the server answered with an error status
        #    (the page is missing or the server hit an error)
        # 2. URLError: the URL is written wrong or the server cannot be reached
        try:
            url_obj = urllib.request.Request(url, headers=headers)
            response = urllib.request.urlopen(url_obj)
            html = response.read().decode('utf8')
        except (HTTPError, URLError):
            raise

        # BeautifulSoup also gets an exception trap: it returns None when a
        # tag does not exist, and calling anything on that None raises
        # AttributeError: 'NoneType' object has no attribute 'xxx'
        try:
            bs = BeautifulSoup(html, "html.parser")
            results = bs.body
        except AttributeError:
            return None
        return results


    if __name__ == '__main__':
        print(get_data("https://movie.douban.com/chart"))
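The two urlopen failure modes can also be handled separately. The following is a minimal sketch rather than the article's own code: fetch_html, its opener parameter, and the two fake openers are hypothetical names introduced here so the error handling can be tried without a network connection. HTTPError is a subclass of URLError, so it has to be caught first.

```python
import urllib.request
from urllib.error import HTTPError, URLError


def fetch_html(url, opener=urllib.request.urlopen):
    """Return (html, error); exactly one of the two is None.

    `opener` is a hypothetical injection point so the function can be
    exercised with a stand-in instead of a real network call.
    """
    try:
        response = opener(url)
        return response.read().decode("utf8"), None
    except HTTPError as e:
        # The server answered, but with an error status such as 404 or 500.
        return None, f"HTTP error {e.code}"
    except URLError as e:
        # No usable answer at all: wrong host name, refused connection, etc.
        return None, f"URL error: {e.reason}"


def fake_404(url):
    # Stand-in opener that fails the way a missing page would.
    raise HTTPError(url, 404, "Not Found", None, None)


def fake_unreachable(url):
    # Stand-in opener that fails the way an unknown host would.
    raise URLError("host not found")


print(fetch_html("https://example.invalid/page", opener=fake_404))
print(fetch_html("https://example.invalid/page", opener=fake_unreachable))
```

Returning the error as a value, instead of re-raising it, lets a crawler log the failure and move on to the next URL; whether that is preferable depends on how the caller wants to treat failed pages.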

Parsing HTML for cleaner data display

get_text(): get the text content of a tag

    # The code that opens the URL is the same as above, so it is omitted here.
    html = response.read().decode('utf8')
    bs = BeautifulSoup(html, "html.parser")
    data = bs.find('span', {'class': 'pl'})
    print(f'Number of film reviews: {data}')
    print(f'Number of film reviews: {data.get_text()}')

The results after running are shown as follows:

    Number of film reviews: <span class="pl">(38054 people)</span>
    Number of film reviews: (38054 people)
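The difference between printing a tag and printing its text can be reproduced offline. The sketch below assumes a made-up HTML fragment in place of the downloaded page, and text_of is a hypothetical helper added here to show one way of guarding against the None that find() returns when nothing matches.

```python
from bs4 import BeautifulSoup

# Made-up HTML fragment standing in for a downloaded page.
html = '<div><span class="pl">(38054 reviews)</span></div>'
bs = BeautifulSoup(html, "html.parser")


def text_of(tag):
    """Return the tag's text, or None when the tag was not found."""
    return tag.get_text(strip=True) if tag is not None else None


found = bs.find("span", {"class": "pl"})
missing = bs.find("span", {"class": "no-such-class"})

print(found)             # the whole tag, markup included
print(text_of(found))    # only the text inside the tag
print(text_of(missing))  # None instead of an AttributeError
```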

The find() method filters HTML tags and returns the single tag you need

Internally, find() is a wrapper around find_all(): it calls find_all() with the limit parameter set to 1 and returns the single matching tag, or None if nothing matches.

1. name: the tag name to match

2. attrs: a dictionary of attribute names and values, for example {"class": "indent"}

3. recursive: Boolean; when True (the default) all descendant tags are searched, when False only direct children are searched

4. text: matches against the text content of the tag (in Beautiful Soup 4.4.0 and later this argument is named string; text is the older name)
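The four parameters above can be tried on a small document. This is an illustrative sketch: the HTML and the variable names are invented for the example, and it uses the string keyword, which assumes Beautiful Soup 4.4.0 or later.

```python
from bs4 import BeautifulSoup

# Made-up HTML used only for this illustration.
html = """
<div id="outer">
  <p class="indent">first</p>
  <section><p class="indent">nested</p><a href="/chart">chart</a></section>
</div>
"""
bs = BeautifulSoup(html, "html.parser")
outer = bs.find("div")

# name: match on the tag name
by_name = bs.find("a")

# attrs: match on attribute values; the first match in document order wins
by_attrs = bs.find("p", attrs={"class": "indent"})

# recursive: True (default) searches every descendant of `outer`,
# False searches only its direct children
deep = outer.find("a")
shallow = outer.find("a", recursive=False)

# string (called text in older releases): match on the tag's text
by_string = bs.find("p", string="nested")

# find() gives the same tag as the first item of find_all(..., limit=1)
same = bs.find_all("p", limit=1)[0] is bs.find("p")

print(by_name, by_attrs, deep, shallow, by_string, same, sep="\n")
```

Here shallow is None because the a tag sits inside section rather than directly under the div.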

The find_all() method filters HTML tags and returns the group of tags you need

It is used the same way as find(), with one additional parameter, limit, which caps the number of results returned.
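A short sketch of how limit changes the result of find_all(), using a made-up list as input. Note that find_all() returns an empty list when nothing matches, whereas find() returns None.

```python
from bs4 import BeautifulSoup

# Made-up HTML used only for this illustration.
html = "<ul><li>one</li><li>two</li><li>three</li></ul>"
bs = BeautifulSoup(html, "html.parser")

all_items = bs.find_all("li")           # every match
first_two = bs.find_all("li", limit=2)  # stop after two matches
none_found = bs.find_all("table")       # no match: empty list, not None

print([li.get_text() for li in all_items])
print([li.get_text() for li in first_two])
print(len(none_found))
```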

A few details worth noting:

    # The following two forms do the same thing: find tags whose id attribute is "text".
    bs.find_all(id="text")
    bs.find_all('', {"id": "text"})

    # For class you cannot write class="xxx", because class is a reserved
    # keyword in Python. Use class_ or the attrs dictionary instead.
    bs.find_all(class_="text")
    bs.find_all('', {"class": "text"})

This article ends here. Thank you for reading! You should now have a working understanding of exception capture and tag filtering in a Python crawler.
