Network Security Internet Technology Development Database Servers Mobile Phone Android Software Apple Software Computer Software News IT Information

In addition to Weibo, there is also WeChat

Please pay attention

WeChat public account

Shulou

What is the use of greedy and non-greedy patterns in regular expressions

2025-02-28 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Internet Technology >

Share

Shulou(Shulou.com)06/02 Report--

This article mainly shows you "what is the use of greedy and non-greedy patterns in regular expressions". The content is simple and clear, hoping to help you solve your doubts. Let me lead you to study and learn the article "what is the use of greed and non-greed patterns in regular expressions".

1 Overview

Greedy and non-greedy patterns affect the matching behavior of sub-expressions modified by quantifiers. Greedy patterns match as much as possible on the premise of successful matching of the whole expression, but not greedy patterns on the premise of successful matching of the whole expression. as few matches as possible. Non-greedy mode is only supported by some NFA engines.

Quantifiers that belong to the greedy pattern, also known as matching priority quantifiers, include:

"{mdirection n}", "{m,}", "?", "*" and "+".

In some languages that use the NFA engine, add "?" after matching priority quantifiers to become non-greedy quantifiers, also known as ignore priority quantifiers, including:

"{m ·m} n}?", "{m,}?", "?", "*?" And "+?"

From the perspective of regular grammar, subexpressions modified by matching precedence quantifiers use greedy patterns, such as "(Expression) +", while subexpressions modified by ignored precedence quantifiers use non-greedy patterns, such as "(Expression) +?".

For greedy mode, the name of all kinds of documents is basically the same, but for non-greedy mode, some are called lazy mode or lazy mode, and some are called grudging mode. In fact, it doesn't matter what it is, as long as you master the principle and usage, you can use it freely. Individuals are accustomed to using the terms greedy and non-greedy, so this term will be used in this article.

2 principle of greedy and non-greedy pattern matching

Greedy and non-greedy patterns can be understood from the perspectives of application and principle, but if you want to really master them, you still have to understand them from the matching principle.

First from the application point of view, the answer is "what is greedy and non-greedy mode?"

2.1 Analysis of greedy and non-greedy models from the perspective of application

2.1.1 what are greedy and non-greedy models

Let's look at an example first.

For example:

Source string: aatest1bbtest2cc

Regular expression 1:. *

Match result 1: test1bbtest2

Regular expression 2:. *?

Match result 2: test1 (here refers to the first match result, so does not include test2)

Based on the above example, let's analyze what greedy and non-greedy patterns are in terms of matching behavior.

The regular expression one uses the greedy pattern, which can make the whole expression match successfully when matching to the first "", but because the greedy pattern is used, you still have to try to match to the right to see if there are longer substrings that can be successfully matched. After matching to the second "", there is no substring to the right that can be successfully matched, and the matching ends, and the matching result is "test1bbtest2". Of course, the actual matching process is not like this, the later matching principle will be described in detail.

Only from the perspective of application, we can think that greedy pattern is to match as many matches as possible on the premise of successful matching of the whole expression, that is, the so-called "greed." in popular terms, you can pick up as much as you want when you see what you want. unless there's no more.

The second regular expression uses the non-greedy pattern, which makes the whole expression match successfully when the first "" is matched. Because the non-greedy pattern is used, the matching ends and no longer tries to the right, and the matching result is "test1".

Only from the perspective of application, we can think that non-greedy mode is to match as few matches as possible on the premise of successful matching of the whole expression, that is, the so-called "non-greed". Just find what you want and pick it up, and it doesn't matter whether you don't pick it up or not.

2.1.2 description of prerequisites

In the above analysis of greedy and non-greedy patterns from an application perspective, one of the prerequisites mentioned all the time is "the whole expression matches successfully". Why should this premise be emphasized? let's take a look at the following example.

Regular expression 3:. * bb

Match result 3: test1bb

Decorate "." It is still the matching priority quantifier "*", so here is the greedy pattern, and the preceding ". *" can still match to "test1bbtest2", but since the latter "bb" does not match, the ". *" must give up the matched "bbtest2" to make the whole expression match successfully. At this point, the result of the whole expression matching is "test1bb", and the matching content of ". *" is "test1". It can be seen that under the premise of "the whole expression matching is successful", the greedy pattern really affects the matching behavior of the subexpression. if the whole expression matching fails, the greedy pattern will only affect the matching process, and the influence on the matching result is out of the question.

The same problem exists with the non-greedy model, let's take a look at the following example.

Regular expression four:. *? cc

Match result 4: test1bbtest2cc

The non-greedy mode is adopted here, and the previous ". *?" It is still matched to "test1". At this time, the later "cc" cannot match successfully and requires ". *?" You must continue to try to match to the right until the matching content is "test1bbtest2" before the following "cc" matches successfully. The whole expression matches successfully, and the matching content is "test1bbtest2cc", where ". *?" The matching content is "test1bbtest2". It can be seen that under the premise of "the whole expression matching is successful", the non-greedy pattern really affects the matching behavior of the subexpression. if the whole expression matching fails, the non-greedy pattern can not affect the matching behavior of the subexpression.

2.1.3 greed or non-greed-- the Choice of Application

Through the analysis of the application point of view, we have basically understood the characteristics of greedy and non-greedy mode, so in practical application, whether to choose greedy mode or non-greedy mode should be determined according to the demand.

For some simple requirements, such as the source character is "aatest1bb", then getting the div tag and using both greedy and non-greedy modes can achieve the desired results, which may not matter.

But for the example in 2.1.1, in practical applications, we only need to get one paired div tag at a time, that is, the content matched by the non-greedy pattern, and the content matched by the greedy pattern is usually not what we need.

Then why does the greedy pattern still exist? it is difficult to give a satisfactory answer from the point of view of application, which requires the analysis of greedy and non-greedy patterns from the perspective of matching principle.

2.2 Analysis of greedy and non-greedy patterns from the point of view of matching principle

If we really want to understand what is greedy mode, what is non-greedy mode, under what circumstances to use it, and how efficient they are, we should not only analyze them from the perspective of application, but also fully understand the matching principle of greedy and non-greedy patterns.

2.2.1 starting from the basic matching principle

NFA engine basic matching principle reference: regular basis-- NFA engine matching principle.

This paper mainly introduces the matching principles involved in greedy and non-greedy patterns. Let's take a look at the simple matching process of greedy pattern.

Source string: "Regex"

Regular expression: ". *"

Figure 2-1

Note: in order to see the matching process clearly, the gap above is larger, and the actual source string is "Regex", which is the same below.

Let's take a look at the matching process. First of all, the control is obtained by the first "", and the position 0 bit is matched. If the match is successful, the control is given to ". *".

After ".*" takes control, because "*" is a matching priority quantifier, in the case of matchable but mismatched, priority is to try to match. Try to match from "R" at position 1, match successfully, continue to match to the right, match "e" at position 2, match successfully, continue to match to the right until "" at the end of the match, the match is successful, because at this time has matched to the end of the string, so ".*" ends the match and gives control to the last "of the regular expression.

After "has acquired control, because it is already at the end of the string, the match fails, and the state that can be traced back is looked forward. The control is handed over to". * ", which gives up a character, that is, the"at the end of the string, and then gives the control to the"at the end of the regular expression."matches the"at the end of the string.

At this point, the entire regular expression matches successfully, where the content of the ". *" match is "Regex", and a backtracking is performed during the matching process.

Next, take a look at the simple matching process of non-greedy patterns.

Source string: "Regex"

Regular expression: ". *?"

Figure 2-2

Take a look at the matching process of non-greedy patterns. First of all, the control is obtained by the first "", and the position 0 bit is matched. If the match is successful, the control is handed over to ". *?".

". *?" After gaining control, due to "*?" Is to ignore the priority quantifier, in the case of matching but not matching, first try to mismatch, because "*" is equivalent to "{0,}", so in the case of ignoring priority, you can not match anything. Try to ignore the match from position 1, that is, do not match anything, and give control to the last "of the regular expression.

"" after gaining control, try to match from position 1, match the "R" at position 1, the match fails, look forward for the status that can be traced back, and give the control to ". *?" by ". *?" Eat a character, match the "R" at position 1, and give control to the last "of the regular expression.

After gaining control, try to match from position 2, match the "e" at position 1, fail to match, look forward for a state that can be traced back, and repeat the above process until the ". *?" Match to "x", and then give control to the last "of the regular expression.

After gaining control, try to match from position 6, and match the last "" of the string by "". The match is successful.

At this point, the whole regular expression matches successfully, where ". *?" The content of the match is "Regex", and the matching process is traced back five times.

2.2.2 greed or non-greed-- the Choice of matching efficiency

Through the analysis of the matching principle, we can see that when the matching is successful, the greedy pattern has less backtracking, while the backtracking process requires the transfer of control, giving up the matched content or matching unmatched content, and try to match again, which reduces the matching efficiency to a great extent, so greedy mode has an advantage in matching efficiency compared with non-greedy mode.

But the example in 2.2.1 is just a simple application, and when readers see it, will there be any doubt that greedy patterns are necessarily more efficient than non-greedy pattern matching? The answer is no.

For example:

Requirements: get the substrings of two "", which can no longer contain "".

Regular expression one: ". *"

Regular expression 2: ". *?"

Case 1: when greedy patterns match more unwanted content, there may be more backtracking than non-greedy patterns. For example, the source string is "The word" Regex "means regular expression.".

Case 2: the greedy model cannot meet the demand. For example, the source string is "The phrase" regular expression "is called" Regex "for short."

For case 1, the greedy pattern adopted by regular expression 1, ". *" will match until the end of the string, and the control will be handed to the last "". After the match is not successful, backtracking will be carried out. Because the content of multiple matches "means regular expression." far exceeds the content that needs to be matched, the matching efficiency of using regular expression for a while will be lower than that of non-greedy pattern using regular expression 2.

For case 2, the regular expression matches to "" regular expression "is called" Regex ", and even the requirements are not met, so there is no question of matching efficiency.

The above two situations are common, so do we have to use the non-greedy mode in order to meet the demand and take into account efficiency? Of course not. According to the actual situation, changing the sub-expression modified by matching priority quantifiers can not only meet the demand, but also improve the matching efficiency.

Source string: "Regex"

Give the regular expression three: "[^"] * "

Take a look at the matching process of regular expression three.

Figure 2-3

First of all, the control is obtained by the first "", which matches the "" of the position 0. If the match is successful, the control is handed over to "[^"] * ".

After "[^"] * "takes control, because" * "is a matching priority quantifier, it is a priority to try to match if it can be matched but not matched. Try to match from the "R" at position 1, match successfully, continue to match to the right, match the "e" at position 2, match to the right, continue to match to the right, until the match reaches "x", and then match the ending "". When the match fails, the control is handed over to the last "of the regular expression.

After "" takes control, matches the "" at the end of the string, and the match is successful.

At this point, the whole regular expression matches successfully, where the matching content of "[^"] * "is" Regex ", and there is no backtracking during the matching process.

The quantifier-modified subexpression is changed from a wide range of "." to the exclusive character group "[^"] ", which still uses greedy mode, which perfectly solves the problem of demand and efficiency. Of course, since there is no backtracking in this matching process, there is no need to record the backtracking state, so you can use solidified grouping to further optimize the rules.

Give the regular expression four: "(? > [^"] *) "

Solidified grouping is not supported in all languages, such as .NET, but not in Java, but a simpler priority quantifier can be used in Java instead of "[^"] * + ".

(3) greedy or non-greedy mode-- on matching efficiency

Generally speaking, greedy and non-greedy patterns, if the quantifier modifies the same sub-expressions, such as ". *" and ". *?", their application scenarios are usually different, so they are generally not comparable in efficiency.

For subexpressions modified by quantifiers to meet the requirements, such as changing ". *" to "[^"] * ", because the modified subexpressions are different, they do not have direct comparability. However, when the same subexpressions can meet the requirements, such as" [^ "] *" and "[^"] *? ", the matching efficiency of greedy patterns is usually higher.

At the same time, there is also a fact that the non-greedy mode can be realized, and the greedy pattern modified by the quantifier can be realized, while some of the optimization effects that can be achieved by the greedy pattern may not be realized by the non-greedy mode.

Greedy mode also has the advantage that when matching fails, greedy mode can report failures more quickly, thus improving the efficiency of matching. The following is a comprehensive examination of the matching efficiency of greedy and non-greedy patterns.

3.1 efficiency improvement-Evolution process

After understanding the basic principles of the matching of greedy and non-greedy patterns, let's take a fresh look at the evolution of regular efficiency improvement.

Requirements: get the substrings of two "", which can no longer contain "".

Source string: The phrase "regular expression" is called "Regex" for short.

Regular expression one: ". *"

The content of a regular expression match is "" regular expression "is called" Regex "", which does not meet the requirements.

Put forward the regular expression 2: ". *?"

First of all, the control is obtained, and the match starts with the position 0 bit, until the position 11 matches successfully, and the control is handed over to the ". *?". The matching process is the same as the matching process of the non-greedy pattern in 2.2.1. " . *? "the content of the match is" Regex ", and four backtracking is performed during the matching process.

How to eliminate the loss of matching efficiency caused by backtracking is to use a smaller range of sub-expressions and adopt greedy patterns to propose regular expression three: "[^"] * "

First of all, "acquires control, and tries to match from position 0 to position 11, and the control is handed over to" [^ "] *". The matching process is the same as the non-greedy pattern matching process in Section 2.2.2. "[^"] * "matches the content as" Regex ", and there is no backtracking during the matching process.

3.2 efficiency gains-faster reporting of failures

The above discussion is about the evolution process of successful matching, and for a regular expression, if the matching failure is reported as quickly as possible, it will also improve the matching efficiency, which is perhaps the easiest thing to ignore in the process of designing regular expressions. In the case of a large amount of source string data or complex regular expressions, whether the matching failure can be reported quickly will have a direct impact on the matching efficiency.

Next, we will build a regular expression that fails to match and analyze the matching process.

In the analysis of the following matching process, the source string is: The phrase "regular expression" is called "Regex" for short.

3.2.1 Analysis of failure process of non-greedy pattern matching

Figure 3-1

Build a regular expression that matches a non-greedy pattern that fails: ". *?" @

Due to the existence of the last "@", the regular expression must have failed in the end, so take a look at the matching process.

First of all, the control is obtained by "", and the match is attempted at position 0, and the match fails until the match at A marked in the figure is successful, and the control is handed over to ". *?".

". *?" After gaining control, try to match from the position after A, because it is a non-greedy mode, first ignore the match, hand over the control to "", and record the backtracking status. "" After gaining control, try to match from the position after A, match the character "r" fails, find the status that can be traced back, and hand over the control to ". *?" by ". *?" Matches the character "r". Repeat the above process until. *? Matches the preceding character "n" at B, "" matches the character at B "", and gives control to "@". The "@" matches the next space "", the match fails, the status that can be traced back is found, and the control is handed over to ". *?" by ". *?" Match spaces. Continue to repeat the above matching process until the ". *?" Match to the end of the string and hand over control to "". Because it is already the end of the string, the match fails, reporting that the entire expression failed to match at position 11, and a round of matching attempts ended.

The regular engine drive drives the regular forward to move on to the next round of attempts. The subsequent matching process is basically similar to that of the first round of attempts, as shown in figure 3-1.

From the matching process, we can see that almost every step of the matching failure process of non-greedy patterns is accompanied by the backtracking process, which has a great impact on the matching efficiency.

3.2.2 Analysis of the failure process of greedy pattern matching-- large-scale subexpression

Figure 3-2

PS: the above diagram of the analysis process refers to the illustration of the relevant chapters in the book "mastering regular expressions."

Build a regular expression that matches a greedy pattern that fails: ". *" @

Among them, the sub-expression modified by quantifier is the "." with a wide range of matching. Due to the existence of the final "@", this regular expression must fail in the end. Take a look at the matching process.

First of all, the control is obtained by "", and the match is attempted at position 0, and the match fails until the match at A marked in the figure is successful, and the control is handed over to ". *".

". *" after gaining control, try to match from the position after A, because it is a greedy pattern, optimize the attempt to match, match all the way to the end of the string, and give the control to "." After obtaining control, because it is already the end of the string, the match fails, finds the status that can be traced back, and gives the control to ". *" to give up the matched character ".". Repeat the above process until the following "" matches the following character "" at C, and hand over control to "@". The "@" matches the space "" at the next D, the match fails, the status that can be traced back is found, the control is given to ". *", and the matched text is given out by ". *". Continue to repeat the above matching process until all matched text is given to I by ". *" and control is handed over to "." The match failed, and since there is no state to be traced back, the entire expression is reported to have failed to match at position 11, and a round of matching attempts ends.

The regular engine drive drives the regular forward to move on to the next round of attempts. The subsequent matching process is basically similar to that of the first round of attempts, as shown in figure 3-2.

From the matching process, we can see that the matching failure process of greedy patterns with large-scale sub-expressions is no different from that of non-greedy patterns, and the final number of backtracking is basically the same as that of non-greedy patterns. it still has a great impact on matching efficiency.

3.2.3 Analysis of the failure process of greedy pattern matching-- improved sub-expression

Figure 3-3

Build a regular expression that matches a greedy pattern that fails: "[^"] * "@

Among them, the sub-expression modified by quantifier is changed to match the excluded character group "[^"] with a small range. Due to the existence of the final "@", this regular expression must fail in the end. Take a look at the matching process.

First of all, the control is obtained by "", and the match is attempted at position 0, and the match fails until the match at A marked in the figure is successful, and the control is handed over to "[^"] * ".

"[^"] * "after gaining control, try to match from the position after A, because it is greedy mode, try to match first, match all the way to B, and hand over the control to". "" match the next character "", match successfully, hand over the control to "@". The "@" matches the next space "", the match fails, the status that can be traced back is found, the control is given to "[^"] * ", and the matched text is given out by" [^ "] *". Continue to repeat the above matching process until "[^"] * "gives all the matched text to C and gives control to". The match failed, and since there was no state to be traced back, it was reported that the entire expression failed to match at position 11, and a round of matching attempts ended.

The regular engine drive drives the regular forward to move on to the next round of attempts. The subsequent matching process is basically similar to that of the first round of attempts, as shown in figure 3-3.

From the matching process, we can see that the matching failure process of greedy patterns using excluded character groups, on the whole, greatly reduces the number of backtracking per round, which can effectively improve the matching efficiency.

3.2.4 Analysis of the failure process of greedy pattern matching-solidified grouping

As can be seen from the analysis in Section 3.2.3, because the exclusive character group is used in "[^"] * ", the character matched between An and B in figure 3-3 must not be the character". So the backtracking process between B and C is redundant, that is, the traceable state in between can not be recorded at all. Solidified grouping can be used in .NET, and priority quantifiers can be used in Java to achieve this effect.

Figure 3-4

First of all, the control is obtained by "", and the match is attempted at position 0, and the match fails until the match at A marked in the figure is successful, and the control is handed over to "(? > [^"] *) ".

"(? > [^"] *) "after gaining control, try to match from the position after A, because it is greedy mode, try to match first, always match to B, hand over the control to", during this matching process, do not record any status available for backtracking. "" match the next character "", match successfully, hand over the control to "@". The "@" matches the next space "", the match fails, and the state that can be traced back is found. Since there is no state to be traced back, the whole expression is reported to have failed to match at position 11, and a round of matching attempts ends.

The regular engine drive drives the regular forward to move on to the next round of attempts. The subsequent matching process is basically similar to that of the first round of attempts, as shown in figure 3-4.

From the matching process, we can see that the matching failure process of greedy pattern using solidified grouping does not involve backtracking, which can maximize the matching efficiency.

3.3 conversion from non-greedy mode to greedy mode

When using sub-expressions with a wide range of matching, the contents of greedy patterns and non-greedy patterns will be different, but by optimizing sub-expressions, non-greedy patterns can achieve matching, greedy patterns can be achieved.

For example, in practical applications, match the content of the img tag.

For example:

Requirement: get the address of the picture in the img tag, and fix it to "" after src=

Source string:

Regular expression one:

In the matching result, the content of capture group 1 is the image address. As you can see, the non-greedy mode is used in this example, and according to the analysis of the above section, the latter two non-greedy modes can use exclusive character sets to convert the non-greedy mode into the greedy mode.

Regular expression 2: "between attributes, the character" > "may also appear, but that is an extreme case and will not be discussed here.

The last two non-greedy patterns can be converted into greedy patterns by excluding character groups to improve matching efficiency, while the non-greedy patterns before "src=" cannot use exclusive character groups because they want to exclude a character sequence "src=" rather than a single one or more characters. Of course, there is no way, you can use sequential look to achieve this effect.

Regular expression 3: # start marking ""

(? > # grouping construction to limit the scope of the quantifier "*" modification

] * > (?) # name the capture group, encounter the start tag, enter the stack, add 1 to the Open count

| # Branch structure

(?) # narrow balance group, encounter end mark, out of stack, Open count minus 1

| # Branch structure

(?)

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

Views: 0

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.

Share To

Internet Technology

Wechat

© 2024 shulou.com SLNews company. All rights reserved.

12
Report