This article introduces exception capture and tag filtering for Python crawlers. The content is detailed yet easy to follow, and the steps are simple and quick, so it should serve as a practical reference. I believe you will gain something after reading it; let's take a look.
Add exception capture to make problems easier to track down:

import ssl
import urllib.request
from bs4 import BeautifulSoup
from urllib.error import HTTPError, URLError


def get_data(url):
    headers = {
        "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                      "AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/90.0.4430.93 Safari/537.36"
    }
    # Skip certificate verification so https pages open without a local CA bundle.
    ssl._create_default_https_context = ssl._create_unverified_context
    # urlopen needs two exception traps:
    # 1. If the page has an error or the server does not exist, an HTTPError is raised.
    # 2. If the URL is malformed or the link cannot be opened, a URLError is raised.
    try:
        url_obj = urllib.request.Request(url, headers=headers)
        response = urllib.request.urlopen(url_obj)
        html = response.read().decode('utf8')
    except (HTTPError, URLError) as e:
        raise e
    # BeautifulSoup also needs an exception catch: when a tag does not actually
    # exist, BeautifulSoup returns None, and calling an attribute on that None
    # raises AttributeError: 'NoneType' object has no attribute 'xxx'.
    try:
        bs = BeautifulSoup(html, "html.parser")
        results = bs.body
    except AttributeError:
        return None
    return results


if __name__ == '__main__':
    print(get_data("https://movie.douban.com/chart"))
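With both traps in place, network failures surface as explicit exceptions instead of cryptic crashes deep in the parser. A minimal sketch of how a caller might handle them, reusing the imports and get_data() defined above (the unreachable host below is a made-up placeholder):

try:
    result = get_data("https://no-such-host.example/chart")
except (HTTPError, URLError) as e:
    # get_data() re-raises network errors, so the caller decides how to recover.
    print(f"request failed: {e}")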
Parsing the HTML to display the data more effectively
get_text(): gets the text content of a tag
# The code for opening the URL is the same as above, so it is omitted here.
html = response.read().decode('utf8')
bs = BeautifulSoup(html, "html.parser")
data = bs.find('span', {'class': 'pl'})
print(f'film reviews: {data}')
print(f'film reviews: {data.get_text()}')
The output after running looks like this:

film reviews: (38054 people)
film reviews: (38054 people)
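Note that get_text() is exactly where the AttributeError from the previous section can bite: if the page has no matching tag, find() returns None. A small defensive sketch, reusing the bs object from above:

data = bs.find('span', {'class': 'pl'})
if data is not None:
    print(f'film reviews: {data.get_text()}')
else:
    print('tag not found on this page')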
The find() method filters HTML tags to find the single tag you need.
Internally, find() simply wraps find_all(), passing limit=1 so that a single tag is returned. It accepts the following filter parameters (see the sketch after this list):
1.name: the tag name to match, e.g. 'div'
2.attrs: a dictionary of attributes and attribute values, e.g. {"class": "indent"}
3.recursive: a boolean; when True (the default), descendant tags are searched recursively
4.text: matches against the text content of the tag
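A minimal sketch of the four parameters in use, assuming the bs object built from the Douban page above (the attribute and text values are only illustrative):

bs.find('div')                          # name: match by tag name
bs.find('div', {'class': 'indent'})     # attrs: match by attribute values
bs.find('span', recursive=False)        # recursive=False: search direct children only
bs.find(text='(38054 people)')          # text: match by text content (returns the string, not a tag)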
The find_all() method filters HTML tags to find the group of tags you need.
Usage is the same as find(), except for the extra limit parameter, which caps how many results are returned, as sketched below.
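For example, a minimal sketch (again assuming the bs object parsed above):

links = bs.find_all('a', limit=3)   # returns at most the first three <a> tags
print(len(links))                   # 3, or fewer if the page has fewer links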
One small point that deserves attention:
# The following two ways of writing are functionally identical:
# both query tags whose id attribute value is "text".
bs.find_all(id="text")
bs.find_all('', {"id": "text"})

If you filter by class, however, you cannot write class="xxx", because class is a Python keyword. Use the trailing underscore instead:

bs.find_all(class_="text")
bs.find_all('', {"class": "text"})

This article on exception capture and tag filtering in Python crawlers ends here. Thank you for reading! I believe you now have a working understanding of the topic.