How Python uses regular expressions to remove HTML tags and extract text

2025-01-18 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/02 Report--

This article shows how Python can use regular expressions to remove HTML tags and extract the text. I hope you find it useful.

A regular expression is a special sequence of characters that uses a dedicated pattern syntax to match or find other strings or sets of strings. Regular expressions are widely used in the UNIX world.
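As a small illustration of pattern matching with Python's standard `re` module, the sketch below compiles one pattern and uses it to find and replace text. The sample string and variable names are made up for this example.

```python
import re

# A pattern is an ordinary string written in regex syntax.
# \d+ means "one or more digits".
pattern = re.compile(r'\d+')

# Illustrative input, not taken from the article.
text = 'Order 66 shipped 3 items on day 12'

# findall returns every non-overlapping match as a list of strings.
numbers = pattern.findall(text)

# search returns the first match (or None); group() is the matched text.
first = pattern.search(text).group()

# sub replaces every match with the given replacement string.
masked = pattern.sub('#', text)

print(numbers)
print(first)
print(masked)
```

The same three calls (`compile`, `search`, `sub`) are the building blocks of the tag-stripping approach this article describes.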

The following code uses regular expressions to remove HTML tags and extract the text:

    # -*- coding: utf-8 -*-
    import re


    ## Filter tags in HTML.
    # Remove tags and similar markup from an HTML string.
    # @param htmlstr HTML string.
    def filter_tags(htmlstr):
        # Filter CDATA first.
        re_cdata = re.compile(r'//<!\[CDATA\[[^>]*//\]\]>', re.I)  # match CDATA
        # .*? with re.S lets script/style bodies contain '<' and newlines.
        re_script = re.compile(r'<\s*script[^>]*>.*?<\s*/\s*script\s*>', re.I | re.S)  # script
        re_style = re.compile(r'<\s*style[^>]*>.*?<\s*/\s*style\s*>', re.I | re.S)  # style
        re_br = re.compile(r'<br\s*?/?>', re.I)  # line break
        re_h = re.compile(r'</?\w+[^>]*>')  # HTML tag
        re_comment = re.compile(r'<!--.*?-->', re.S)  # HTML comment
        s = re_cdata.sub('', htmlstr)  # remove CDATA
        s = re_script.sub('', s)  # remove script
        s = re_style.sub('', s)  # remove style
        s = re_br.sub('\n', s)  # convert br to newline
        s = re_h.sub('', s)  # remove HTML tags
        s = re_comment.sub('', s)  # remove HTML comments
        # Remove extra blank lines.
        blank_line = re.compile(r'\n+')
        s = blank_line.sub('\n', s)
        s = replaceCharEntity(s)  # replace character entities
        return s


    ## Replace commonly used HTML character entities.
    # Replace special character entities in HTML with normal characters.
    # You can add new entries to CHAR_ENTITIES to handle more entities.
    # @param htmlstr HTML string.
    def replaceCharEntity(htmlstr):
        CHAR_ENTITIES = {'nbsp': ' ', '160': ' ',
                         'lt': '<', '60': '<',
                         'gt': '>', '62': '>',
                         'amp': '&', '38': '&',
                         'quot': '"', '34': '"'}
        re_charEntity = re.compile(r'&#?(?P<name>\w+);')

        def _replace(match):
            # The key is the entity without '&' and ';', e.g. '&gt;' gives 'gt'.
            # Unknown entities are replaced with an empty string.
            return CHAR_ENTITIES.get(match.group('name'), '')

        # A single pass avoids decoding the output of an earlier replacement again.
        return re_charEntity.sub(_replace, htmlstr)


    def replace(s, re_exp, repl_string):
        return re_exp.sub(repl_string, s)


    if __name__ == '__main__':
        html = ''  # the HTML string to be extracted
        text = filter_tags(html)
        print(text)

After reading this article, you should have a working understanding of how Python uses regular expressions to remove HTML tags and extract text.
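To try the approach on a concrete input, here is a condensed, self-contained sketch of the same steps: drop script and style blocks, drop comments, turn `<br>` into newlines, strip the remaining tags, and decode entities. The function name `strip_tags_sketch` and the sample HTML are hypothetical, and the standard library's `html.unescape` is swapped in for a hand-written entity table. Regex-based stripping is only a rough fit for well-behaved markup, so treat this as an illustration rather than a robust HTML parser.

```python
import html
import re

# Hypothetical names; a condensed sketch of the steps described above.
_RE_SCRIPT_STYLE = re.compile(
    r'<\s*(script|style)\b[^>]*>.*?<\s*/\s*\1\s*>', re.I | re.S)
_RE_COMMENT = re.compile(r'<!--.*?-->', re.S)
_RE_BR = re.compile(r'<br\s*/?>', re.I)
_RE_TAG = re.compile(r'</?\w+[^>]*>')
_RE_NEWLINES = re.compile(r'\n+')


def strip_tags_sketch(htmlstr):
    """Return the text of htmlstr with tags removed (assumes simple markup)."""
    s = _RE_SCRIPT_STYLE.sub('', htmlstr)  # drop script/style blocks with their bodies
    s = _RE_COMMENT.sub('', s)             # drop HTML comments
    s = _RE_BR.sub('\n', s)                # convert <br> to a newline
    s = _RE_TAG.sub('', s)                 # drop the remaining tags
    s = _RE_NEWLINES.sub('\n', s)          # collapse runs of newlines
    # Decode entities last, so a decoded '<' is never mistaken for a tag.
    return html.unescape(s).strip()


# Made-up sample input for illustration.
sample = ('<html><head><style>p { color: red; }</style>'
          '<script>if (a < b) { alert("x"); }</script></head>'
          '<body><!-- note --><p>Tom &amp; Jerry<br/>1 &lt; 2</p></body></html>')

print(strip_tags_sketch(sample))
# prints:
# Tom & Jerry
# 1 < 2
```

Decoding entities after the tags are stripped is a deliberate ordering choice: text such as `1 &lt; 2` would otherwise become `1 < 2` early and risk being misread as the start of a tag.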
