
Example Analysis based on xpath Selector, PyQuery and regular expression format cleaning tool

2025-02-21 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/02 Report--

This article walks through a sample analysis of HTML format-cleaning tools built on xpath selectors, PyQuery, and regular expressions. The approach is quite practical, so it is shared here for reference; follow along for a closer look.

1. Use xpath to remove unwanted tag elements and tags with no content

import re  # used by the empty-tag check below

from loguru import logger
from lxml import etree


def xpath_clean(self, text: str, xpath_dict: dict) -> str:
    """xpath removes unnecessary elements
    :param text: html_content
    :param xpath_dict: xpaths of the targets to clear
    :return: html_content as a string
    """
    remove_by_xpath = xpath_dict if xpath_dict else dict()
    # Items that are always cleared (except in extreme cases)
    remove_by_xpath.update({
        '_remove_2': '//iframe',
        '_remove_4': '//button',
        '_remove_5': '//form',
        '_remove_6': '//input',
        '_remove_7': '//select',
        '_remove_8': '//option',
        '_remove_9': '//textarea',
        '_remove_10': '//figure',
        '_remove_11': '//figcaption',
        '_remove_12': '//frame',
        '_remove_13': '//video',
        '_remove_14': '//script',
        '_remove_15': '//style',
    })
    parser = etree.HTMLParser(remove_blank_text=True, remove_comments=True)
    selector = etree.HTML(text, parser=parser)
    # General deletion: drop every unwanted tag
    for xpath in remove_by_xpath.values():
        for bad in selector.xpath(xpath):
            bad_string = etree.tostring(bad, encoding='utf-8', pretty_print=True).decode()
            logger.debug(f"clean article content: {bad_string}")
            bad.getparent().remove(bad)
    skip_tip = "name()='img' or name()='tr' or " \
               "name()='th' or name()='tbody' or " \
               "name()='thead' or name()='table'"
    # Check every remaining tag: if it has no content, delete it directly
    for p in selector.xpath(f"//*[not({skip_tip})]"):
        # Skip tags that contain table/img descendants or carry real text
        if p.xpath(f".//*[{skip_tip}]") or \
                bool(re.sub(r'\s', '', p.xpath('string(.)'))):
            continue
        bad_p = etree.tostring(p, encoding='utf-8', pretty_print=True).decode()
        logger.debug(f"clean p tag: {bad_p}")
        p.getparent().remove(p)
    return etree.tostring(selector, encoding='utf-8', pretty_print=True).decode()
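To see the xpath step in isolation, here is a minimal, self-contained sketch of the same idea. The tag list is abbreviated to a few entries, and the function name `strip_bad_tags` is chosen for illustration; it is not part of the class above.

```python
from lxml import etree

def strip_bad_tags(html: str, xpaths=("//script", "//style", "//iframe")) -> str:
    # Parse with comments and blank text stripped, as xpath_clean does
    parser = etree.HTMLParser(remove_blank_text=True, remove_comments=True)
    root = etree.HTML(html, parser=parser)
    # For each xpath, detach every matching element from its parent
    for xp in xpaths:
        for bad in root.xpath(xp):
            bad.getparent().remove(bad)
    return etree.tostring(root, encoding="unicode")

cleaned = strip_bad_tags("<div><p>keep me</p><script>alert(1)</script></div>")
# The <script> element is gone; the <p> survives
```

The key detail is that lxml elements are removed through their parent (`bad.getparent().remove(bad)`); there is no direct "delete" on the element itself.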

2. Use pyquery to clean up tag attributes, and return both the processed source code and the plain text

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from loguru import logger
from pyquery import PyQuery as pq


def pyquery_clean(self, text, url, pq_dict) -> object:
    """pyquery does the necessary processing
    :param text:
    :param url:
    :param pq_dict:
    :return:
    """
    # Dictionary of pq expressions to delete
    remove_by_pq = pq_dict if pq_dict else dict()
    # Tag attribute whitelist
    attr_white_list = ['rowspan', 'colspan']
    # Attribute keys that may carry an image link
    img_key_list = ['src', 'data-echo', 'data-src', 'data-original']
    # Build the pyquery object
    dom = pq(text)
    # Delete useless tags
    for bad_tag in remove_by_pq.values():
        for bad in dom(bad_tag):
            bad_string = pq(bad).html()
            logger.debug(f"clean article content: {bad_string}")
        dom.remove(bad_tag)
    # Process every attribute of every tag
    for tag in dom('*'):
        for key, value in tag.attrib.items():
            # Skip logic: keep the rowspan and colspan attributes of tables
            if key in attr_white_list:
                continue
            # Handle image links: complete any incomplete url, then replace it
            if key in img_key_list:
                img_url = self.absolute_url(url, value)
                pq(tag).remove_attr(key)
                pq(tag).attr('src', img_url)
                pq(tag).attr('alt', '')
            # Leave the alt attribute of the img tag empty
            elif key == 'alt':
                pq(tag).attr(key, '')
            # Delete all remaining attributes
            else:
                pq(tag).remove_attr(key)
    return dom.text(), dom.html()
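The image-link completion above leans on the class's `absolute_url` helper, which also appears in the full class at the end of the article. Its logic can be sketched with the standard library alone:

```python
from urllib.parse import urlsplit, urljoin

def absolute_url(baseurl: str, url: str) -> str:
    # If the url already carries a scheme (http/https), keep it as-is;
    # otherwise join it onto the base url of the page
    return url if urlsplit(url).scheme else urljoin(baseurl, url)

rel = absolute_url("https://example.com/a/", "img/pic.png")
abs_ = absolute_url("https://example.com/a/", "https://cdn.example.com/pic.png")
```

A relative path is joined onto the base url, while an already-absolute url passes through unchanged, so every `src` written back by `pyquery_clean` ends up fully qualified.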

3. Use regular expressions to clean up spaces and newlines

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re


def regular_clean(self, str1: str, str2: str):
    """regular expression processing of the data format
    :param str1: content
    :param str2: html_content
    :return: the processed results
    """
    def new_line(text):
        # NOTE: the literal HTML tags inside these patterns were lost when
        # this article was published; they are restored here to the evident
        # intent: normalize <br> variants, strip decorative inline tags,
        # and emit one line per <p> block.
        text = re.sub(r'<br\s*/?>', '<br>', text)
        text = re.sub(r'</?a.*?>|</?em.*?>|</?strong.*?>|'
                      r'</?span.*?>|</?b>|</?blockquote>', '', text)
        text = re.sub(r'\n', '', text)
        text = re.sub(r'<br>', '<p>', text)
        text = text.replace('<p>', '<p>\n').replace('</p>', '')
        return text

    str1, str2 = self.clean_blank(str1), self.clean_blank(str2)  # TODO handle blank-line issues
    # TODO html_content processing:
    #   1. remove redundant unusable tags and tags that affect data display
    #   2. handle and replace newline characters
    str2 = new_line(text=str2)
    return str1, str2
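Before `new_line` runs, both strings pass through the class's `clean_blank` helper. A standalone sketch of that whitespace pass is below; note it is an illustrative variant that collapses runs of spaces to a single space rather than deleting them outright.

```python
import re

def clean_blank(text: str) -> str:
    # Drop full-width spaces and non-breaking spaces, then collapse
    # runs of spaces/tabs and runs of blank lines
    text = text.replace('\u3000', '').replace('\xa0', '')
    text = re.sub(r'[ \t]{2,}', ' ', text)
    text = re.sub(r'\n{2,}', '\n', text)
    return text.strip('\n').strip()

print(clean_blank("hello\u3000world\n\n\nnext   line"))
```

Collapsing consecutive blank lines to one keeps paragraph boundaries intact while removing the vertical padding that scraped pages tend to accumulate.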

Finally, here is the complete class that encapsulates each of the methods above:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
author: szhan
date: 2020-08-17
summary: clean up html_content and get a pure data format
"""
import re

from loguru import logger
from lxml import etree
from pyquery import PyQuery as pq
from urllib.parse import urlsplit, urljoin


class CleanArticle:

    def __init__(self, text: str, url: str = '', xpath_dict: dict = None, pq_dict: dict = None):
        self.text = text
        self.url = url
        self.xpath_dict = xpath_dict or dict()
        self.pq_dict = pq_dict or dict()

    @staticmethod
    def absolute_url(baseurl: str, url: str) -> str:
        """complete the url
        :param baseurl: scheme url
        :param url: target url
        :return: complete url
        """
        target_url = url if urlsplit(url).scheme else urljoin(baseurl, url)
        return target_url

    @staticmethod
    def clean_blank(text):
        """blank handling
        :param text:
        :return:
        """
        text = text.replace('\u3000', '').replace('\xa0', '')
        text = re.sub(r'\s{2,}', '', text)
        text = re.sub(r'\n{2,}', '\n', text)
        text = text.strip('\n').strip()
        return text

    def run(self):
        """
        :return: processed content, html_content
        """
        if (not bool(self.text)) or (not isinstance(self.text, str)):
            raise ValueError('html_content has a bad type value')
        # Step 1: use xpath to remove spaces, comments, and tags such as
        # iframe, button, form, script, style, video, etc.
        text = self.xpath_clean(self.text, self.xpath_dict)
        # Step 2: use pyquery to deal with specific details
        str1, str2 = self.pyquery_clean(text, self.url, self.pq_dict)
        # Final regular-expression processing
        content, html_content = self.regular_clean(str1, str2)
        return content, html_content

    def xpath_clean(self, text: str, xpath_dict: dict) -> str:
        """xpath removes unnecessary elements
        :param text: html_content
        :param xpath_dict: xpaths of the targets to clear
        :return: html_content as a string
        """
        remove_by_xpath = xpath_dict if xpath_dict else dict()
        # Items that are always cleared (except in extreme cases)
        remove_by_xpath.update({
            '_remove_2': '//iframe',
            '_remove_4': '//button',
            '_remove_5': '//form',
            '_remove_6': '//input',
            '_remove_7': '//select',
            '_remove_8': '//option',
            '_remove_9': '//textarea',
            '_remove_10': '//figure',
            '_remove_11': '//figcaption',
            '_remove_12': '//frame',
            '_remove_13': '//video',
            '_remove_14': '//script',
            '_remove_15': '//style',
        })
        parser = etree.HTMLParser(remove_blank_text=True, remove_comments=True)
        selector = etree.HTML(text, parser=parser)
        # General deletion: drop every unwanted tag
        for xpath in remove_by_xpath.values():
            for bad in selector.xpath(xpath):
                bad_string = etree.tostring(bad, encoding='utf-8', pretty_print=True).decode()
                logger.debug(f"clean article content: {bad_string}")
                bad.getparent().remove(bad)
        skip_tip = "name()='img' or name()='tr' or " \
                   "name()='th' or name()='tbody' or " \
                   "name()='thead' or name()='table'"
        # Check every remaining tag: if it has no content, delete it directly
        for p in selector.xpath(f"//*[not({skip_tip})]"):
            # Skip tags that contain table/img descendants or carry real text
            if p.xpath(f".//*[{skip_tip}]") or \
                    bool(re.sub(r'\s', '', p.xpath('string(.)'))):
                continue
            bad_p = etree.tostring(p, encoding='utf-8', pretty_print=True).decode()
            logger.debug(f"clean p tag: {bad_p}")
            p.getparent().remove(p)
        return etree.tostring(selector, encoding='utf-8', pretty_print=True).decode()

    def pyquery_clean(self, text, url, pq_dict) -> object:
        """pyquery does the necessary processing
        :param text:
        :param url:
        :param pq_dict:
        :return:
        """
        # Dictionary of pq expressions to delete
        remove_by_pq = pq_dict if pq_dict else dict()
        # Tag attribute whitelist
        attr_white_list = ['rowspan', 'colspan']
        # Attribute keys that may carry an image link
        img_key_list = ['src', 'data-echo', 'data-src', 'data-original']
        # Build the pyquery object
        dom = pq(text)
        # Delete useless tags
        for bad_tag in remove_by_pq.values():
            for bad in dom(bad_tag):
                bad_string = pq(bad).html()
                logger.debug(f"clean article content: {bad_string}")
            dom.remove(bad_tag)
        # Process every attribute of every tag
        for tag in dom('*'):
            for key, value in tag.attrib.items():
                # Skip logic: keep the rowspan and colspan attributes of tables
                if key in attr_white_list:
                    continue
                # Handle image links: complete any incomplete url, then replace it
                if key in img_key_list:
                    img_url = self.absolute_url(url, value)
                    pq(tag).remove_attr(key)
                    pq(tag).attr('src', img_url)
                    pq(tag).attr('alt', '')
                # Leave the alt attribute of the img tag empty
                elif key == 'alt':
                    pq(tag).attr(key, '')
                # Delete all remaining attributes
                else:
                    pq(tag).remove_attr(key)
        return dom.text(), dom.html()

    def regular_clean(self, str1: str, str2: str):
        """regular expression processing of the data format
        :param str1: content
        :param str2: html_content
        :return: the processed results
        """
        def new_line(text):
            # NOTE: the literal HTML tags inside these patterns were lost when
            # this article was published; they are restored here to the evident
            # intent: normalize <br> variants, strip decorative inline tags,
            # and emit one line per <p> block.
            text = re.sub(r'<br\s*/?>', '<br>', text)
            text = re.sub(r'</?a.*?>|</?em.*?>|</?strong.*?>|'
                          r'</?span.*?>|</?b>|</?blockquote>', '', text)
            text = re.sub(r'\n', '', text)
            text = re.sub(r'<br>', '<p>', text)
            text = text.replace('<p>', '<p>\n').replace('</p>', '')
            return text

        str1, str2 = self.clean_blank(str1), self.clean_blank(str2)  # TODO handle blank-line issues
        # TODO html_content processing:
        #   1. remove redundant unusable tags and tags that affect data display
        #   2. handle and replace newline characters
        str2 = new_line(text=str2)
        return str1, str2


if __name__ == '__main__':
    with open('html_content.html', 'r', encoding='utf-8') as f:
        lines = f.readlines()
        html = ''
        for line in lines:
            html += line
    ca = CleanArticle(text=html)
    _, html_content = ca.run()
    print(html_content)

Thank you for reading! That concludes this article on format-cleaning tools based on xpath selectors, PyQuery, and regular expressions. I hope the content above is of some help to you and lets you learn a bit more; if you found the article worthwhile, please share it so more people can see it!
