How to fix incomplete replacement with the regular expression function re.sub

2025-03-02 Update From: SLTechnology News&Howtos

This article explains why re.sub can appear to replace only some of the matches in a string, and how to fix it.

Title: Incomplete replacement with re.sub and its root cause

Date: 2018-08-27 21:48:22

Tags: ["Python", "regular expression"]

Category: ["Python"]

Problem description

The problem started with a regular-expression replacement. To extract the text from a piece of HTML code and remove all HTML tags and attributes, you can write a Python function:

    import re

    def remove_tag(html):
        text = re.sub('<.*?>', '', html, re.S)
        return text

This code uses re.sub, the regular-expression replacement function. Its first argument is the regular expression describing what to replace; because HTML tags are wrapped in angle brackets, the pattern <.*?> matches every opening tag and every closing tag.

The second parameter indicates what the matched content will be replaced with. Since I need to extract the body, I just need to replace all the HTML tags with empty strings. The third parameter is the text that needs to be replaced, in this case the HTML source code segment.

As for re.S, I talked about its usage in an article four years ago: https://www.jb51.net/article/146384.htm

Now test it with a piece of HTML code:

    import re

    def remove_tag(html):
        text = re.sub('<.*?>', '', html, re.S)
        return text

    # Sample HTML; its markup is not preserved in this copy, only the text.
    source_1 = '''today's protagonist is kingname, let's give it up for welcome!'''
    text = remove_tag(source_1)
    print(text)

The result, shown in the following figure, is as expected.

Next, test HTML that contains newline characters:

    import re

    def remove_tag(html):
        text = re.sub('<.*?>', '', html, re.S)
        return text

    # Sample HTML with newlines; its markup and line breaks are not preserved in this copy.
    source_2 = '''today's protagonist is kingname, let's give it up for welcome!'''
    text = remove_tag(source_2)
    print(text)

The result, shown in the following figure, is again exactly as expected.

In testing, this function extracted the text from HTML snippets in the vast majority of cases. But there are exceptions.
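The two tests can be reproduced with a small, self-contained sketch. The HTML strings here are hypothetical stand-ins, not the article's original samples, and the `<.*?>` pattern is assumed from the description of matching anything wrapped in angle brackets.

```python
import re

def remove_tag(html):
    # The call is kept exactly as in the article: re.S is the fourth positional argument.
    return re.sub('<.*?>', '', html, re.S)

# Hypothetical sample inputs: one on a single line, one with newlines between tags.
source_1 = '<div class="content">Today\'s protagonist is <a href="#">kingname</a>, welcome!</div>'
source_2 = '<div class="content">\nToday\'s protagonist is\n<a href="#">kingname</a>, welcome!\n</div>'

text_1 = remove_tag(source_1)
text_2 = remove_tag(source_2)
print(text_1)
print(text_2)
```

Both samples contain only four tags, so every tag is removed and nothing looks wrong yet.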

Exception case

There is a long HTML code snippet that reads as follows:

Meet kingname

< img '>

Gentle # Qingnan # right here... Where's my little Marquis?

The result is shown in the following figure: the last two HTML tags are not replaced.

At first I thought the problem was caused by spaces or quotation marks in HTML, so I simplified the HTML code:

Meet kingname

Gentle # Qingnan # right here... Where's my little Marquis?

The problem still exists, as shown in the following figure.

Even more surprising, if you delete the first tag, one fewer tag is left over in the replacement result, as shown in the following figure.

In fact, this is not specific to the first tag: deleting any one of the earlier tags leaves one fewer tag in the result. If you delete two or more of the earlier tags, the result is correct.
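The behaviour described above can be reproduced without the original HTML. The following is a minimal sketch that assumes the `<.*?>` pattern and uses a hypothetical helper, `make_html`, to generate snippets with a known number of tags.

```python
import re

def remove_tag(html):
    # Same call as in the article: re.S is passed positionally.
    return re.sub('<.*?>', '', html, re.S)

def make_html(pairs):
    # Hypothetical helper: builds `pairs` <p>...</p> elements, i.e. 2 * pairs tags.
    return ''.join(f'<p>item{i}</p>' for i in range(pairs))

def leftover_tags(html):
    # Tags that are still present after remove_tag has run.
    return re.findall('<.*?>', remove_tag(html))

html_18 = make_html(9)            # 18 tags
html_17 = html_18[len('<p>'):]    # the same snippet with its first tag deleted
html_16 = make_html(8)            # 16 tags

for html in (html_18, html_17, html_16):
    total = len(re.findall('<.*?>', html))
    print(total, 'tags in,', len(leftover_tags(html)), 'left over')
```

With 18, 17, and 16 tags in the input, the sketch leaves 2, 1, and 0 tags behind, which matches the pattern observed in the article.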

Answering the question

The root cause of this seemingly strange problem lies in the fourth parameter of re.sub. As can be seen from the function prototype:

    def sub(pattern, repl, string, count=0, flags=0)

The fourth parameter is count, the maximum number of substitutions. To use re.S, it must be passed as the fifth parameter, flags. With that change to the remove_tag function, the result is correct:

    def remove_tag(html):
        text = re.sub('<.*?>', '', html, flags=re.S)
        return text
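One way to make this mistake harder to repeat, offered here as a sketch rather than as the author's method, is to compile the pattern once so that the flag is attached to the pattern and never appears in the substitution call. `TAG_RE` is a hypothetical name, and the `<.*?>` pattern is assumed.

```python
import re

# The flag belongs to the compiled pattern, so it cannot be mistaken for `count`.
TAG_RE = re.compile(r'<.*?>', flags=re.S)

def remove_tag(html):
    return TAG_RE.sub('', html)

# Hypothetical input with 40 tags, well past the 16 that triggered the bug.
html = ''.join(f'<p>item{i}</p>' for i in range(20))
print(remove_tag(html))
```

Passing `flags=` by keyword, as in the corrected function, works equally well; the compiled form simply removes the argument that could be misplaced.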

So the question is: with re.S sitting in the count position, why did the code not raise an error? Is re.S a number? In fact, if you print it, you will find that re.S can be used as a number:

    >>> import re
    >>> print(int(re.S))
    16
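A short check, written as a sketch with a made-up input string, shows why no error is raised: re.S is an integer-valued flag, so using it as count behaves exactly like writing 16.

```python
import re

# re.S is an enum member that is also an int.
is_int = isinstance(re.S, int)
equals_16 = (re.S == 16)

# Using re.S as count (what the positional call did) limits the substitutions to 16.
as_flag_count = re.sub('a', '', 'a' * 20, count=re.S)
as_literal_count = re.sub('a', '', 'a' * 20, count=16)

print(is_int, equals_16)
print(as_flag_count, as_literal_count)
```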

Now go back and count the tags in the problematic HTML: the two left-over tags happen to be the 17th and 18th. Because the re.S passed as count is treated as 16, Python replaces the first 16 tags with empty strings and leaves the last two.
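When a replacement looks incomplete, counting the matches and the substitutions actually performed is a quick diagnostic. This sketch uses `re.findall` and `re.subn` on a hypothetical 18-tag snippet (not the article's HTML) and again assumes the `<.*?>` pattern.

```python
import re

# Hypothetical 18-tag snippet standing in for the problematic HTML.
html = ''.join(f'<span>w{i}</span>' for i in range(9))

tags = re.findall('<.*?>', html)
print(len(tags))            # how many tags the pattern can match
print(tags[int(re.S):])     # the tags beyond the first 16

# re.subn returns the new string together with the number of substitutions made.
result, replaced = re.subn('<.*?>', '', html, count=re.S)
print(replaced)
print(result)
```

A substitution count that is lower than the match count points directly at a limit on the number of replacements.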

So far, the cause of the problem has been clarified.

There are several reasons why this problem was not detected early:

1. The HTML being processed is a code snippet, and in most cases it contains fewer than 16 HTML tags, so the problem stays hidden.
2. re.S is an object, but it is also a number, and count happens to accept a number. In many programming languages, constants are numbers that are given meaningful uppercase names.
3. re.S matters when a newline falls inside a tag, not when newlines fall between tags. The tags in the tested snippets were all of the second kind, so adding re.S or leaving it out had the same effect on them.
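The last point can be illustrated with two hypothetical snippets, one with a newline inside a tag and one with newlines between tags. This is a sketch that assumes the `<.*?>` pattern; without re.S, `.` does not match a newline.

```python
import re

# Hypothetical samples, not taken from the article's test data.
newline_inside_tag = '<div class="123"\n>text</div>'
newline_between_tags = '<div class="123">\ntext\n</div>'

for html in (newline_inside_tag, newline_between_tags):
    without_flag = re.sub('<.*?>', '', html)
    with_flag = re.sub('<.*?>', '', html, flags=re.S)
    # repr() makes the newlines visible in the output.
    print(repr(without_flag), repr(with_flag))
```

Only the first sample gives different results with and without the flag, which is why tests made of the second kind of snippet could not reveal that the flag was never applied.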

Addendum: the following introduces the regular-expression replacement function re.sub().

The re.sub() replacement function

re.sub is a regular-expression function that replaces text matching a pattern, which makes it more powerful than the ordinary string method replace(). A simple literal replacement can be done with replace():

    def main():
        text = '123, world!'
        text1 = text.replace('123', 'Hello')
        print(text1)

    if __name__ == '__main__':
        main()

    # Hello, world!

With the re.sub() function, you can match any number and replace it:

    import re

    def main():
        content = 'abc124hello46goodbye67shit'
        list1 = re.findall(r'\d+', content)
        print(list1)
        mylist = list(map(int, list1))
        print(mylist)
        print(sum(mylist))
        print(re.sub(r'\d+[hg]', 'foo1', content))
        print()
        print(re.sub(r'\d+', '456654', content))

    if __name__ == '__main__':
        main()

    # ['124', '46', '67']
    # [124, 46, 67]
    # 237
    # abcfoo1ellofoo1oodbye67shit
    #
    # abc456654hello456654goodbye456654shit
