How to parse HTML in Python3

Updated: 2025-03-28. From: SLTechnology News&Howtos

Shulou (Shulou.com), 05/31

Many readers are unsure how to parse HTML in Python3, so this article walks through four approaches: lxml, BeautifulSoup, SGMLParser, and HTMLParser. Each one is applied to the same task, extracting the reviews from a page on 360kan.com, so the steps can be compared directly.

Helper function: it fetches the HTML and serves as the entry point for parsing.

    import gzip
    import io
    import urllib.request

    # Ported to Python 3 names: urllib.request and io.BytesIO replace urllib2 and StringIO.
    # bs4_paraser and lxml_parser are defined in the sections below; define them before this function.

    # The parser is passed in as an argument so it can be swapped easily later.
    def get_html(url, paraser=bs4_paraser):
        headers = {
            'Accept': '*/*',
            'Accept-Encoding': 'gzip, deflate, sdch',
            'Accept-Language': 'zh-CN,zh;q=0.8',
            'Host': 'www.360kan.com',
            'Proxy-Connection': 'keep-alive',
            'User-Agent': 'Mozilla/5.0 (Windows NT WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
        }
        request = urllib.request.Request(url, headers=headers)
        response = urllib.request.urlopen(request)
        if response.status == 200:
            # The body is assumed to be gzip-compressed, as requested in Accept-Encoding.
            data = io.BytesIO(response.read())
            gzipper = gzip.GzipFile(fileobj=data)
            data = gzipper.read().decode('utf-8')
            # For offline testing, pass the contents of a saved copy of the page instead.
            value = paraser(data)
            return value
        else:
            pass

    value = get_html('http://www.360kan.com/m/haPkY0osd0r5UB.html', paraser=lxml_parser)
    for row in value:
        print(row)
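The helper above always gunzips the response body. The following is a minimal Python 3 sketch of a more defensive variant, assuming the server may ignore Accept-Encoding and reply uncompressed: it decompresses only when the Content-Encoding header says gzip. The name decode_body and its parameters are hypothetical and are not part of the original code.

```python
import gzip
import io


def decode_body(raw, content_encoding=None, charset='utf-8'):
    """Return the response body as text, gunzipping only when needed.

    `raw` is the bytes returned by response.read(); `content_encoding` is the
    value of the Content-Encoding response header (None if absent).
    Hypothetical helper for illustration only.
    """
    if content_encoding and 'gzip' in content_encoding.lower():
        raw = gzip.GzipFile(fileobj=io.BytesIO(raw)).read()
    return raw.decode(charset)


# Usage with an in-memory payload instead of a real HTTP response.
page = '<div class="item">hello</div>'
compressed = gzip.compress(page.encode('utf-8'))
print(decode_body(compressed, 'gzip') == page)
print(decode_body(page.encode('utf-8')) == page)
```

Inside get_html this could be called as decode_body(response.read(), response.headers.get('Content-Encoding')).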

1. Parsing with lxml

    import re
    from lxml import etree

    def lxml_parser(page):
        data = []
        doc = etree.HTML(page)
        all_div = doc.xpath('//div[@class="yingping-list-wrap"]')
        for row in all_div:
            # get every review, i.e. each item
            # (same as find_all('div', attrs={'class': 'item'}) in BeautifulSoup)
            all_div_item = row.xpath('.//div[@class="item"]')
            for r in all_div_item:
                value = {}
                # get the title section of the review
                title = r.xpath('./div[@class="g-clear title-wrap"][1]')
                value['title'] = title[0].xpath('./a/text()')[0]
                value['title_href'] = title[0].xpath('./a/@href')[0]
                score_text = title[0].xpath('./div/span/span/@style')[0]
                score_text = re.search(r'\d+', score_text).group()
                value['score'] = int(score_text) / 20
                # time
                value['time'] = title[0].xpath('./div/span[@class="time"]/text()')[0]
                # how many people like it
                people_text = title[0].xpath('./div[@class="num"]/span/text()')[0]
                value['people'] = int(re.search(r'\d+', people_text).group())
                data.append(value)
        return data
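Both the score and the like count are pulled out of surrounding text with a regular expression that grabs the first run of digits. Here is a small sketch of that step on its own. The article does not show the page markup, so the sample strings (a style attribute of the form width:80% and a like count in parentheses) are assumptions, and first_int is a hypothetical helper name.

```python
import re


def first_int(text):
    """Return the first run of digits in `text` as an int, or None if there is none."""
    match = re.search(r'\d+', text)
    return int(match.group()) if match else None


# Sample strings are assumptions; the real attribute values are not shown in the article.
style = 'width:80%'
print(first_int(style) / 20)   # same arithmetic as value['score'] in the parser
print(first_int('(12)'))
print(first_int('no digits here'))
```

Unlike calling .group() directly on the result of re.search, this variant returns None instead of raising AttributeError when the text contains no digits, which may be easier to handle when a review has no score.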

2. Using BeautifulSoup. There is not much to explain here; documentation is easy to find online.

    import re
    from bs4 import BeautifulSoup

    def bs4_paraser(html):
        all_value = []
        value = {}
        soup = BeautifulSoup(html, 'html.parser')
        # get the reviews section
        all_div = soup.find_all('div', attrs={'class': 'yingping-list-wrap'}, limit=1)
        for row in all_div:
            # get each review, i.e. each item
            all_div_item = row.find_all('div', attrs={'class': 'item'})
            for r in all_div_item:
                # get the title part of the review
                title = r.find_all('div', attrs={'class': 'g-clear title-wrap'}, limit=1)
                if title is not None and len(title) > 0:
                    value['title'] = title[0].a.string
                    value['title_href'] = title[0].a['href']
                    score_text = title[0].div.span.span['style']
                    score_text = re.search(r'\d+', score_text).group()
                    value['score'] = int(score_text) / 20
                    # time
                    value['time'] = title[0].div.find_all('span', attrs={'class': 'time'})[0].string
                    # how many people like it
                    people_text = title[0].find_all('div', attrs={'class': 'num'})[0].span.string
                    value['people'] = int(re.search(r'\d+', people_text).group())
                    # print r
                    all_value.append(value)
                    value = {}
        return all_value
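bs4_paraser resets value to a new dict after each append. A short sketch of why that reset matters: appending the same dict object repeatedly leaves every list entry pointing at the last review. The function names and data here are made up for illustration.

```python
def collect_shared(titles):
    """Buggy variant: one dict object is reused for every row."""
    rows, value = [], {}
    for t in titles:
        value['title'] = t
        rows.append(value)   # appends a reference to the same dict each time
    return rows


def collect_fresh(titles):
    """Variant with the reset: a new dict is started after each append."""
    rows, value = [], {}
    for t in titles:
        value['title'] = t
        rows.append(value)
        value = {}           # start a new dict for the next row
    return rows


print(collect_shared(['a', 'b']))   # -> [{'title': 'b'}, {'title': 'b'}]
print(collect_fresh(['a', 'b']))    # -> [{'title': 'a'}, {'title': 'b'}]
```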

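3. Using SGMLParser. Parsing is driven by start-tag and end-tag handlers, so the flow is fairly explicit, but it is more tedious to write, and this scraping scenario is not a great fit for it. Note that SGMLParser comes from the sgmllib module, which was removed in Python 3, so this listing runs only under Python 2.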

    import re
    from sgmllib import SGMLParser  # Python 2 only

    class CommentParaser(SGMLParser):
        def __init__(self):
            SGMLParser.__init__(self)
            self.__start_div_yingping = False
            self.__start_div_item = False
            self.__start_div_gclear = False
            self.__start_div_ratingwrap = False
            # initialised here so end_div can test it before any "num" div is seen
            self.__start_div_num = False
            # a
            self.__start_a = False
            # span state
            self.__span_state = 0
            # data
            self.__value = {}
            self.data = []

        def start_div(self, attrs):
            for k, v in attrs:
                if k == 'class' and v == 'yingping-list-wrap':
                    self.__start_div_yingping = True
                elif k == 'class' and v == 'item':
                    self.__start_div_item = True
                elif k == 'class' and v == 'g-clear title-wrap':
                    self.__start_div_gclear = True
                elif k == 'class' and v == 'rating-wrap g-clear':
                    self.__start_div_ratingwrap = True
                elif k == 'class' and v == 'num':
                    self.__start_div_num = True

        def end_div(self):
            # close the innermost open div first, then work outwards
            if self.__start_div_yingping:
                if self.__start_div_item:
                    if self.__start_div_gclear:
                        if self.__start_div_num or self.__start_div_ratingwrap:
                            if self.__start_div_num:
                                self.__start_div_num = False
                            if self.__start_div_ratingwrap:
                                self.__start_div_ratingwrap = False
                        else:
                            self.__start_div_gclear = False
                    else:
                        # end of one review item: store it and start a new dict
                        self.data.append(self.__value)
                        self.__value = {}
                        self.__start_div_item = False
                else:
                    self.__start_div_yingping = False

        def start_a(self, attrs):
            if self.__start_div_yingping and self.__start_div_item and self.__start_div_gclear:
                self.__start_a = True
                for k, v in attrs:
                    if k == 'href':
                        self.__value['href'] = v

        def end_a(self):
            if self.__start_div_yingping and self.__start_div_item and self.__start_div_gclear and self.__start_a:
                self.__start_a = False

        def start_span(self, attrs):
            if self.__start_div_yingping and self.__start_div_item and self.__start_div_gclear:
                if self.__start_div_ratingwrap:
                    if self.__span_state != 1:
                        for k, v in attrs:
                            if k == 'class' and v == 'rating':
                                self.__span_state = 1
                            elif k == 'class' and v == 'time':
                                self.__span_state = 2
                    else:
                        for k, v in attrs:
                            if k == 'style':
                                score_text = re.search(r'\d+', v).group()
                                self.__value['score'] = int(score_text) / 20
                                self.__span_state = 3
                elif self.__start_div_num:
                    self.__span_state = 4

        def end_span(self):
            self.__span_state = 0

        def handle_data(self, data):
            if self.__start_a:
                self.__value['title'] = data
            elif self.__span_state == 2:
                self.__value['time'] = data
            elif self.__span_state == 4:
                score_text = re.search(r'\d+', data).group()
                self.__value['people'] = int(score_text)

    def sgl_parser(html):
        parser = CommentParaser()
        parser.feed(html)
        return parser.data

4. Using HTMLParser. It works on the same principle as the third approach; only the method names differ, and the handler logic carries over almost unchanged.
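Before the full class, here is a minimal sketch of the event-driven idea using Python 3's html.parser: the parser calls handle_starttag, handle_endtag, and handle_data as it streams through the document, and the subclass keeps just enough state to know where it is. This sketch is not the article's code: it tracks nesting with a counter rather than one boolean flag per div, and the class name ItemLinkParser and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class ItemLinkParser(HTMLParser):
    """Collect the text and href of links inside <div class="item"> blocks."""

    def __init__(self):
        super().__init__()
        self.depth_in_item = 0   # 0 = outside an item div; >0 = div nesting depth inside it
        self.in_link = False
        self.current = {}
        self.data = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'div':
            if self.depth_in_item:
                self.depth_in_item += 1
            elif attrs.get('class') == 'item':
                self.depth_in_item = 1
        elif tag == 'a' and self.depth_in_item:
            self.in_link = True
            self.current['href'] = attrs.get('href')

    def handle_endtag(self, tag):
        if tag == 'div' and self.depth_in_item:
            self.depth_in_item -= 1
            if self.depth_in_item == 0:
                # the item div has closed: store the record and start a new one
                self.data.append(self.current)
                self.current = {}
        elif tag == 'a':
            self.in_link = False

    def handle_data(self, data):
        if self.in_link:
            self.current['title'] = data


# Hypothetical markup, loosely modelled on the class names used in this article.
sample = (
    '<div class="item"><div class="title-wrap"><a href="/r/1">Good film</a></div></div>'
    '<div class="item"><div class="title-wrap"><a href="/r/2">Too long</a></div></div>'
)
parser = ItemLinkParser()
parser.feed(sample)
print(parser.data)
```

A depth counter keeps the end-tag handler short, at the cost of not distinguishing which inner div just closed; the flag-per-div style used in this article makes that distinction explicit.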

    import re
    from html.parser import HTMLParser  # Python 3 location (Python 2: HTMLParser.HTMLParser)

    class CommentHTMLParser(HTMLParser):
        def __init__(self):
            HTMLParser.__init__(self)
            self.__start_div_yingping = False
            self.__start_div_item = False
            self.__start_div_gclear = False
            self.__start_div_ratingwrap = False
            # initialised here so handle_endtag can test it before any "num" div is seen
            self.__start_div_num = False
            # a
            self.__start_a = False
            # span state
            self.__span_state = 0
            # data
            self.__value = {}
            self.data = []

        def handle_starttag(self, tag, attrs):
            if tag == 'div':
                for k, v in attrs:
                    if k == 'class' and v == 'yingping-list-wrap':
                        self.__start_div_yingping = True
                    elif k == 'class' and v == 'item':
                        self.__start_div_item = True
                    elif k == 'class' and v == 'g-clear title-wrap':
                        self.__start_div_gclear = True
                    elif k == 'class' and v == 'rating-wrap g-clear':
                        self.__start_div_ratingwrap = True
                    elif k == 'class' and v == 'num':
                        self.__start_div_num = True
            elif tag == 'a':
                if self.__start_div_yingping and self.__start_div_item and self.__start_div_gclear:
                    self.__start_a = True
                    for k, v in attrs:
                        if k == 'href':
                            self.__value['href'] = v
            elif tag == 'span':
                if self.__start_div_yingping and self.__start_div_item and self.__start_div_gclear:
                    if self.__start_div_ratingwrap:
                        if self.__span_state != 1:
                            for k, v in attrs:
                                if k == 'class' and v == 'rating':
                                    self.__span_state = 1
                                elif k == 'class' and v == 'time':
                                    self.__span_state = 2
                        else:
                            for k, v in attrs:
                                if k == 'style':
                                    score_text = re.search(r'\d+', v).group()
                                    self.__value['score'] = int(score_text) / 20
                                    self.__span_state = 3
                    elif self.__start_div_num:
                        self.__span_state = 4

        def handle_endtag(self, tag):
            if tag == 'div':
                # close the innermost open div first, then work outwards
                if self.__start_div_yingping:
                    if self.__start_div_item:
                        if self.__start_div_gclear:
                            if self.__start_div_num or self.__start_div_ratingwrap:
                                if self.__start_div_num:
                                    self.__start_div_num = False
                                if self.__start_div_ratingwrap:
                                    self.__start_div_ratingwrap = False
                            else:
                                self.__start_div_gclear = False
                        else:
                            # end of one review item: store it and start a new dict
                            self.data.append(self.__value)
                            self.__value = {}
                            self.__start_div_item = False
                    else:
                        self.__start_div_yingping = False
            elif tag == 'a':
                if self.__start_div_yingping and self.__start_div_item and self.__start_div_gclear and self.__start_a:
                    self.__start_a = False
            elif tag == 'span':
                self.__span_state = 0

        def handle_data(self, data):
            if self.__start_a:
                self.__value['title'] = data
            elif self.__span_state == 2:
                self.__value['time'] = data
            elif self.__span_state == 4:
                score_text = re.search(r'\d+', data).group()
                self.__value['people'] = int(score_text)

    def html_parser(html):
        parser = CommentHTMLParser()
        parser.feed(html)
        return parser.data

That covers the four approaches to parsing HTML in Python3. I hope the examples shared here are helpful.
