Network Security Internet Technology Development Database Servers Mobile Phone Android Software Apple Software Computer Software News IT Information

In addition to Weibo, there is also WeChat

Please pay attention

WeChat public account

Shulou

Example Analysis of balance Group in regular expression

2025-02-23 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Internet Technology >

Share

Shulou(Shulou.com)06/02 Report--

This article mainly introduces the example analysis of the balance group in the regular expression, which is very detailed and has a certain reference value. Friends who are interested must read it!

Introduction to balance groups in general regular tutorials

If you want to match a nested hierarchy, you have to use a balance group. For example, how do you capture the longest angle brackets in a string like "xx yy"?

The following syntax constructs need to be used here:

(?) Name the captured content group and press it into the stack

(?) The capture named group, which was last pushed into the stack, is popped from the stack. If the stack is empty, the match of this packet fails.

(? (group) yes | no) if there is a capture named group on the stack, continue to match the expression of the yes part, otherwise continue to match the no part

(?) Sequential negative look around, because there is no suffix expression, the attempt to match always fails

If you are not a programmer (or you are a programmer who is not familiar with the concept of stack), you can understand the above three grammars like this: the first is to write a "group" on the blackboard (or write another), the second is to erase a "group" from the blackboard, the third is to see if there is still a "group" written on the blackboard, and if so, continue to match the yes section, otherwise match the no section.

What we need to do is to write a "group" on the blackboard every time we encounter a left parenthesis, erase it every time we encounter a closing parenthesis, and finally see if there is any on the blackboard-if so, it proves that there are more left parentheses than right parentheses, then the match should fail (I used the syntax (? 'group') to see more clearly):

< #最外层的左括号 [^]* #最外层的左括号后面的不是括号的内容 ( ( (?'Open']* #匹配左括号后面的不是括号的内容 )+ ( (?'-Open'>

) # if you encounter the closing parenthesis, erase a "Open" [^] * # matches the contents of the closing parenthesis that is not followed by parentheses) +) * (? (Open) (?)) # determine whether there is still a "Open" on the blackboard before the right parenthesis of the outermost layer is encountered; if so, the match fails > # the outermost right parenthesis

Why am I writing this article?

After reading the introduction above, do you understand? Before I understand the principle of regular expression matching, the above introduction to the balance group seems to be incomprehensible, and can only be remembered as a template, not flexibly used. Therefore, access to a large number of information on regular aspects, here especially thanks to lxcnn's technical documentation and "proficient regular expressions" this book, so that I have a more in-depth, more systematic understanding of regular expressions, so, on the basis of them, I will combine their own learning experience to do a summary, as a learning note archive, in addition, if you can solve your doubts, is also a happy thing.

I will not analyze the above code for the time being. I will first explain the concepts and knowledge related to the balance group.

The following expression matching test tool is: Expresso, which is also available for download on this site.

The concept and function of equilibrium group

The balance group, hence the meaning of the name, balance is symmetry, which mainly combines several regular grammar rules to provide matching for the nested structures that appear in pairs. Balance group has two definitions: narrow sense and broad sense. Narrow sense balance group refers to Expression grammar, while generalized balance group is not a fixed grammar rule, but a comprehensive application of several grammar rules. What we usually call balance group usually refers to generalized balance group. In this paper, unless otherwise specified, the abbreviation "equilibrium group" refers to the generalized equilibrium group.

Matching principle of balance group

The matching principle of the balance group can be explained by the stack, first give an example, and then explain according to the example.

Source string: a + (b* (crapd)) / eqif-(g / (hmuri)) * j

Regular expression: (?\ () | (?) | [^ ()]) * ((Open) (?))\)

Requirements description: match the content in () that appears in pairs

Output: (b* (celecd)) and (g / (hmuri))

I write the above regular expression code separately and comment it so that it looks hierarchical and convenient

\ (# ordinary character "((# grouping construction), used to define the scope of the quantifier" * "modifier (?) # name the capture group and encounter the opening parenthesis" Open "count plus 1 | # branch structure (?) # narrow balance group Encounter closing brackets "Open" count minus 1 | # branch structure [^ ()] + # any other characters other than parentheses) * # substrings appear 0 or more times (? (Open) (?)) # to determine whether there is a "Open". If there is a mismatch, nothing matches\) # General closed parentheses

For a nested structure, both the opening and ending tags are determined. for this example, the beginning is "(" and the end is ")", then the next step is to examine the middle structure, which can be divided into three categories, one is "("), the other is any character except these two characters.

Then the matching principle of the balance group is like this.

1. Find the first "(" as the beginning of the match. This is the first line above, which matches: a + (b* (celecd)) / efolf-(g / (hmuri)) * j (red display)

2. After step 1, every time a "(" is matched, an Open capture group is added to the stack, and the count is added by 1.

3. After step 1, each ")" matches the nearest Open capture group on the stack, and the count is minus 1.

That is to say, the first line above is regular "\ (") matches: a + (b* (celecd)) / efolf-(g / (hmuri)) * j (red display)

Then, match to the "(" in front of c, at this time, count plus 1; continue to match, match to the ")" after d, and calculate minus 1. Note: if the count in the stack is 0, the regular will continue to match forward, but if the match reaches ")", for example, in this example, d) (parentheses shown in red)-- the engine now hands control to (? (Open) (?!)), determines whether there is a 0 in the stack, and if it is 0, executes the matching "no" branch, because there is no "no" branch in this conditional judgment structure. So do nothing and give control to the next ")"

The regular expression "\)" matches the following), namely b) (parentheses shown in red)

4. The following (? (Open) (?)) It is used to ensure that the Open capture group count in the stack is 0, that is, "(" and ")" is paired.

5. The last ")" as the end of the match

Matching process

Match the first "(" first, and then keep matching until you hand over control to (? (Open) (?!)) when one of the following two situations occurs:

A) the Open count in the stack is 0, and ")" is encountered again at this time.

B) match to string Terminator

At this time, the control is handed over to (? (Open) (?!)) to determine whether the Open has a match. Since the count is 0 and there is no match, the "no" branch is matched. Because there is no "no" branch in the conditional judgment structure, nothing is done and the control is handed over to the next "\)".

If the situation encountered above is a), then "\)" can match the next ")", and the match is successful.

If the situation encountered above is b), backtracking is performed until the "\)" match succeeds, otherwise the entire expression matching failure is reported.

Because of the narrow balance group "(? Expression)" structure in .NET, you can dynamically count the captured groups in the stack, match to a start tag, enter the stack, count plus 1, match to an end tag, exit the stack, count minus 1, and finally determine whether there is an Open in the stack. Some indicate that the start and end tags do not match, mismatch, backtracking or report matching failure. If not, it means that the start and end tag pairs appear, and the matching of subsequent subexpressions continues.

Need to correct "(?)" To explain, it belongs to a sequential negative look, and the complete syntax is "(?! Expression)". Since the "Expression" does not exist here, it means that this is not a location, so attempts to match always fail. The purpose is to report a match failure when an Open mismatch occurs.

Here's an example:

Snhamef

The above is part of the HTML code. Now our problem is to extract the tag and delete it. In the past, our usual method is to get it directly, such as [\ s] +?, but the problem is that what we extract is not what we want, but

Snhame

The reason is also very simple, it matches the tag closest to him, but it doesn't know that the tag is not its-_ -, is it? The reason for the symbol, we removed it to make him unlimited greed, but now the problem is even bigger, it matches all the messy things.

Snhamef

This result is not what we want either. Then I'll use the "balance group" to solve it.

] * > ((?) * >) + | (?) | [\ s\ S]) *? ((mm) (?))

The result of the match is

Snhamef

That's exactly what we want.

Notice that I started to write like this.

] * > ((?) * >) + | (?) | [\ s\ S]) * (? (mm) (?!))

The result of the match is

Snhamef

A question.

The following code is just a discussion of the problem

Text content: efolf (- (g / (hmeri)) * j

Regular expression:

\ ((?) | (?) |. ) * ((mm) (?!))\)

The matching result is: (- (g / (hmuri)

The above is all the content of the article "sample Analysis of balance groups in regular expressions". Thank you for reading! Hope to share the content to help you, more related knowledge, welcome to follow the industry information channel!

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

Views: 0

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.

Share To

Internet Technology

Wechat

© 2024 shulou.com SLNews company. All rights reserved.

12
Report