In addition to Weibo, there is also WeChat
Please pay attention
WeChat public account
Shulou
2025-04-04 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Internet Technology >
Share
Shulou(Shulou.com)06/02 Report--
This article focuses on "A detailed introduction to regular expressions of java recursion". Interested friends may wish to take a look. The method introduced in this paper is simple, fast and practical. Let's let the editor take you to learn "A detailed introduction to regular expressions of java recursion".
In general, recursive regular expressions are used to match any nested hierarchical structure or left-right symmetric structure. For example, match:
() hello (world) good (boy) bye)
Hello world hello world
Abc.def.ghij...stu.vwx.yzabcdcba123454321
Recursive regularization is a more flexible part of regular expressions, in other words, it may be difficult. The following regular expression is a very widespread recursive regular example circulating on the Internet. It is used to match parentheses nested any number of times, and there can be other characters inside the parentheses, such as (a (bc) de), (abc (bc (def) c) de).
# uses the x modifier to ignore the white space symbol /\ (((? > [^ ()] +) | (\ g)) *\) / x in the regular expression
You don't seem to understand this? In fact, even if I know the way of regular recursion, it is still difficult to understand (at least, I have analyzed it for a long time).
The reason why it is difficult to understand is probably because the solidified grouping used here belongs to a technical way of writing in multi-selected branches | and the quantifier * is used outside the grouping, which is too difficult to understand.
It is precisely because this example is circulating all over the Internet that I have been discouraged from learning recursive regularities many times. I am not going to explain the meaning of this recursive regular here, because it is "too academic" or "too pretending to be xyx", while the general recursive regular can be written simply but can achieve the goal.
How to write an easy-to-understand version of recursive regularization and understand the matching way of recursive regularization is the goal of this paper. Later, I introduced a simpler and easier-to-understand version that can also implement this recursive matching requirement.
In order to clearly explain the recursive regularity, this paper will gradually go deep into all aspects of the recursive regularity in a step-by-step way. Therefore, the space may be a little larger, with a large amount of space devoted to explaining and analyzing how recursive regularities match recursively.
Note:
In this paper, Ruby's regular expression is used to introduce recursive regularities, but it can also be used in other languages that support recursive regularities. For example, Perl, PHP, Python (the native re does not provide it, but the third-party library regex provides recursive regularity), and so on.
Understand backreferences\ N and\ g
First of all, the usage of recursive regular expression is introduced step by step through the use of reverse reference of regular expression.
The regular expression (abc | def) and\ 1xyz can match the string "abc and abcxyz" or "def and defxyz", but not "abc and defxyz" or def and abcxyz. This is because, when referencing, a backreference can only refer to the result after the previous packet capture was successful.
Reg = / (abc | def) and\ 1xyz/reg = ~ "abc and abcxyz" # = > 0reg = ~ "def and defxyz" # = > 0reg = ~ "def and abcxyz" # = > nilreg = ~ "abc and defxyz" # = > nil
However, if you use\ g instead of\ 1, you can match the strings in these four cases (using (? 1) in Perl corresponds to\ g here):
Reg = / (abc | def) and\ gxyz/reg = ~ "abc and abcxyz" # = > 0reg = ~ "def and defxyz" # = > 0reg = ~ "def and abcxyz" # = > 0reg = ~ "abc and defxyz" # = > 0
The difference between\ g and\ 1 is that:\ 1 when backreferencing, it refers to the result value captured by the packet, while\ g is not backreferenced, but directly reperforms the matching operation of capturing the packet with the index number 1. Equivalent to / (abc | def) and (abc | def) xyz/.
So,\ 1 is equivalent to inserting the result of a packet capture with index number 1 at the referenced position, and\ g is equivalent to inserting a packet capture expression with index number 1 here so that it can match this part of the grouping expression again.
If the packet capture expression is regarded as the definition of the function, then the start of the match means that the function is called for packet capture. The reverse reference\ N inserts the return value of the function at the reference location, and\ g means that the function is called here again to match.
The name of\ g can be a numeric grouping index, a name index of a named capture, or 0 for the entire regular expression itself.
/ (abc | def) and\ gxyz// (? abc | def) and\ gxyz// (abc | def) and\ gxyz/ # error regular, later analysis = begin# Perl, Python (regex, not re), PHP corresponding mode:\ g-> (? r) or (? 0)\ g-> (? n)\ g-> (? P > name) or (? & name) = end
The first two are easy to understand, but the third is less understandable when using\ g. Move on.
A preliminary study of Recursive regularities: what Recursive regularities match
\ g represents the regular expression itself, so this is equivalent to a recursive regular expression, or if you do the first round of regular expression replacement, it is equivalent to:
/ (abc | def) and (abc | def) and\ gxyzxyz/
Of course, the\ g is replaced with a regular expression only to help understand, but it doesn't really directly replace the definition of the regular expression. Just like when a function is called, it will not be replaced with the code in the function definition at the place where the function is called and then executed, the function definition can be reused many times.
In any case, it is not difficult to find the possibility of infinite recursion here, because the regular expression after one round of replacement contains\ g again, and it can do the second round of replacement and the third round of replacement again.
So, for the recursive regular expression / (abc | def) and\ gxyz/, what kind of string can it match? This is the most important thing to care about when understanding regular recursion.
You can think of the above\ g as a placeholder. First of all, it can match a string in the format "abc and _ xyz" or def and _ xyz, where I use _ for the\ g placeholder. Recursive round, it can match "abc and def and _ xyzxyz", here will continue recursion, will be endless. So leave the question of what string the regular matches first, and then analyze it later.
In fact, / (abc | def) and\ gxyz/ is the wrong regular expression, which prompts us that recursion has no end:
/ (abc | def) and\ gxyz/#= > SyntaxError: never ending recursion
Therefore, the use of recursive regularization must ensure that recursion has an end point.
Guarantee the end point of regular recursion
How to guarantee the end point of recursive regularization? Just give the\ g part a qualifier, such as:
\ g + # error regular\ g {3} # error regular\ g {, 3} # error regular\ g * # correct regular\ g? # correct regular\ g {0} # correct regular pat |\ g # correct regular (\ g) * # correct regular (\ g)? # correct regular.
\ g + means at least one round of recursion, but this is already wrong, because when recursive multiple times, the placeholder and its quantifier + will always remain in the result of the last round, resulting in infinite recursion. Similarly,\ g {3}, which means strict recursion three times, is also wrong, because the\ g {3} placeholder and its quantifier {3} are still retained after the third recursion, which will also be infinitely recursive.
So, only\ g * and\ g? And\ g {0} and pat |\ g, which can express recursion 0 times in the sense of quantifier number choice, is the correct regular expression syntax, because no matter how many times of recursion, the quantifier of the last placeholder can be 0 times, thus reaching the end of recursion, that is, stopping recursion.
So, modify the previous regular expression, if you use? Quantifier modifier\ g:
/ (abc | def) and\ g?xyz/
Re-explore Recursive regularities: what Recursive regularities match
Going back to the previous question, what kind of string can this correct recursive regular expression / (abc | def) and\ g?xyz/ match now?
According to the previous analysis, the pattern of the string it can match is similar to abc and _? xyz or def and _? xyz.
What if the quantifier? 0 times, then the recursive regular match is "abc and xyz" or "def and xyz":
Reg = / (abc | def) and\ g?xyz/reg = ~ "abc and xyz" # = > 0reg = ~ "def and xyz" # = > 0
What if the quantifier? Once, then the regular pattern after the recursive round is abc and abc and _? xyzxyz, and any replacement of "abc" with "def" satisfies the condition. Then there is the question of how to choose the number of quantifiers.
What if the quantifier here? Take 0 times, that is, the overall recursion round from the beginning to the present. Then the recursive regular match is:
Reg = / (abc | def) and\ g?xyz/reg = ~ "abc and abc and xyzxyz" # = > 0reg = ~ "abc and def and xyzxyz" # = > 0reg = ~ "def and def and xyzxyz" # = > 0reg = ~ "def and abc and xyzxyz" # = > 0
What if the quantifier after a round of recursion? Why don't you take it again? Then the next round of recursion will still have the problem of choosing the number of quantifiers.
At this point, you should understand the basic matching of recursive regularities. However, the return of\ g used here is very basic, and the following will continue to go further step by step.
Deep recursion (1):\ g in parentheses
In the previous recursive example, the\ g part of the expression that can represent recursion is placed outside the group. in this case, only the form\ g can be considered recursive, and if it is\ g or\ g, it is not recursive, at best, it is a call to an expression.
However, when you need to use recursive regularities to solve problems, recursive expressions tend to be inside the group rather than outside the group. Therefore, the recursion explained above is actually very rare. Therefore, to use recursive regularity, we have to continue to explore in depth.
First, look at a very simple recursive regular expression within a group:
/ (abc\ g?xyz) + /
In this expression, a packet capture is performed, which first matches the abc characters, and then uses the expression\ g? Pay attention to this? It can't be less, of course? It can also be replaced with other quantifiers explained earlier), followed by the matching character xyz. Because of\ g here? It is placed inside the packet capture corresponding to index 1, so a recursive regular expression is formed.
The question is, what kind of string does this regular expression match? To learn a recursive regular expression, you must be able to analyze what type of string it can match.
Still,\ g is represented as a placeholder, then the string pattern matched by the recursive regular expression is: "abc_?xyz" * N, which means repeated N times, because there is a + sign outside the parenthesis grouping of this expression.
What if the quantifier? If you choose 0 times, that is, without recursion, the string "abcxyz" * N is matched:
/ (abc\ g?xyz) + / = ~ "abcxyz" # = > 0 / (abc\ g?xyz) + / = ~ "abcxyzabcxyz" # = > 0 / (abc\ g?xyz) + / = ~ "abcxyzabcxyzabcxyz" # = > 0 / (abc\ g?xyz) + / = ~ "abcxyz" * 10 = > 0
What if the quantifier? Select once, then after a round of recursion, the matching string pattern is: "abcabc_?xyzxyz" * N. Again? If you choose the number of quantifiers, if you choose 0 times, the matching string is "abcabcxyzxyz" * N:
/ (abc\ g?xyz) + / = ~ "abcabcxyzxyz" # = > 0 / (abc\ g?xyz) + / = ~ "abcabcxyzxyzabcabcxyzxyz" # = > 0 / (abc\ g?xyz) + / = ~ "abcabcxyzxyz" * 3percent = > 0
Continue to analyze another round of recursion. Suppose this is? Select the quantifier once, then the second round of recursion, the matching string pattern is: "abcabcabc_?xyzxyzxyz" * N.
At this point, it should not be difficult to guess the pattern of the string that matches the recursive regular expression / (abc\ g?xyz) + /:
After summarizing "abcxyz" * N "abcabcxyzxyz" * N "abcabcabcxyzxyzxyz" * N#, it matches the following general patterns: n and N are both greater than or equal to 1 ("abc" * n + "xyz" * n) * N
Focus on the recursive regular expression / (abc\ g?xyz) + /, from which you can directly guess what type of string to match?
Instead of looking at quantifiers + or other possible quantifiers, focus on grouping capture. The packet capture matches abc_?xyz. If you want to do recursive N rounds, then each round is an abc_?xyz pattern. Replace it directly into the regular to observe: abc (abc_?xyz) * xyz, where (abc_?xyz) * indicates that this part is repeated 0 or N times. Of course, this part of the replacement is not the standard rule, only to help understand the concept of different places mixed together, I do not think it will cause ambiguity to your understanding.
It is not difficult to understand this way. Of course, this recursive regular is relatively simple, if you put the above\ g? If you change it to\ gwatch, it will look a little more complicated. So what kind of string does it match?
In the same way of analysis, / (abc\ g*xyz) + / is regarded as the structure of "abc_*xyz" * N, and then the value of * is assumed to be 3 times, so the recursive result looks similar to:
"abc (abc_*xyz) (abc_*xyz) (abc_*xyz) xyz" * N
The quantifier * can be selected in each of the parentheses above, but in order to reach the end of recursion, the * in each recursion must take a value 0 times to end the recursion.
So, if each of the three parentheses is now selected 0 times, the matching string pattern is similar to:
"abc (abcxyz) (abcxyz) (abcxyz) xyz" * N# is equivalent to: n and N are both greater than or equal to 1 ("abc" + "abcxyz" * n + "xyz") * N
For example:
/ (abc\ g*xyz) + / = ~ ("abc" + "abcxyz" * 1 + "xyz") * 1percent = > 0 / (abc\ g*xyz) + / = ~ ("abc" + "abcxyz" * 1 + "xyz") * 2percent = > 0 / (abc\ g*xyz) + / = ~ ("abc" + "abcxyz" * 4 + "xyz") * 2percent = > 0
If the * in the first parenthesis in the above three parentheses is valued once, and the * in the last two parentheses is valued 0 times, then after recursion again, the matching string pattern is similar to:
"abc (abc (abc_*xyz) xyz) (abcxyz) (abcxyz) xyz" * N
Yes, we have to choose the number of quantifiers again. If * is taken 0 times this time, the recursive matching will be terminated, and the string pattern of the match will be:
"abc (abc (abcxyz) xyz) (abcxyz) (abcxyz) xyz" * N
So if * is not selected according to the number of times above, what is the matching string pattern?
Without an answer, the only accurate answer is to return to the meaning of the regular expression: the string pattern it matches is (abc\ g*xyz) +.
In-depth recursion (2): write recursive regularities (getting started)
It has always been based on a given recursive regular expression to analyze what strings can be matched, which is helpful to understand recursive regularities. But what we want to know more is how to write recursive regular expressions based on strings.
In general, to use recursive regularities to match, you often have to match something nested, and if you don't match nested content, you probably won't think of using recursive regularities. Here, suppose you also want to match nested things.
Let's start with simple nesting. For example, how to match infinitely nested empty parentheses (), (()), (()), that is, "(" * n + ")" * n?
Analyze it. If it is not recursive, it matches a pair of parentheses (), so the two parenthesis characters must be in the group, that is, (\). (if you use\ g for recursion, you don't have to be in the group, but don't consider this here.)
According to the previous analysis of which string the recursive regular expression matches, using placeholders instead of recursive, the string pattern of nested parentheses to match looks like this: (_). So the recursive expression\ g should be between\ (and\), that is, (\ (\ g\)).
There is also a missing quantifier to ensure the end of recursion. So what kind of quantifier do you use?
Using\ g * is certainly no problem, as long as * only selects quantifiers once for each recursion, and the final recursion is 0 times for the last round of recursion, then the matching pattern is ((_ *)), (), and so on. This is exactly in line with nested matching.
/ (\ (\ g*\)) / = ~ "(" * 1 + ")" * 1percent = > 0 / (\ (\ g*\)) / = ~ "(" * 3 + ")" * 3percent = > 0 / (\ (\ g*\)) / = ~ "(" * 10 + ")" * 10percent = > 0
If you look at the recursive regularities written by others, you will often add a * quantifier after grouping, that is, (\ (\ g*)) *. For the nesting of this pattern, this * is actually superfluous. In order to match successfully, this quantifier must only be selected 0 or once. If you select more than once, the matching string pattern becomes "(_ *)" * N, and the more standard representation is ("(" * n + ")" * n) * N. Of course, as mentioned earlier, there are countless other matching possibilities.
So, here I don't add quantifiers like * or + at the end of the group. We should continue the discussion just now.
Use\ g? Is this quantifier all right? Of course, the above analysis of\ g* means that when the * number of choices for each round of recursion is 1 or 0, it can match infinitely nested parentheses. For\ g? Of course it's okay to say it, because? It can also represent 0 or 1 time.
/ (\ (\ g?\)) / = ~ "(" * 1 + ")" * 1percent = > 0 / (\ (\ g?\)) / = ~ "(" * 3 + ")" * 3percent = > 0 / (\ (\ g?\)) / = ~ "(" * 10 + ")" * 10percent = > 0
Both recursive regular expressions meet the requirements and can match infinitely nested parentheses.
The following is the named capture version:
/ (?\ (\ g?\)) / = ~ "(" * 3 + ")" * 3 "= > 0
You can also use\ g as a nested expression directly, and you can even remove the grouping:
/ (?\ (\ g?\)) / = ~ "(" * 3 + ")" * 3percent = > "remove the grouping and directly recurse this kind of itself / (\ g?\) / = ~" ("* 3 +") "* 3percent = > 0
In this way, it seems that it is not difficult to write recursive regularities. In fact, the simple recursive regularity of nested patterns is not difficult, as long as you understand the meaning of recursion, you can basically write it. Look at another example.
In-depth recursion (3): write recursive regularities (advanced)
Suppose the string pattern to match is: (abc (d (xy) e) fgh), where the characters in each parenthesis are arbitrary in length. This seems to be the example given at the beginning of this article.
This recursion is actually very simple to write:
# for readability The x modifier ignores the white space symbol in the expression /\ ([^ ()] *\ g* [^ ()] *\) / x# match: reg = /\ ([^ ()] *\ g * [^ ()] *\) / xreg = ~ "(abc (d (xy) e) fgh)" # = > 0reg = ~ "(abc (d (xy)" # = > 0reg = ~ "(() e) ) "# = > 0reg = ~" () "# = > 0
Where\ ([^ ()] * and [^ ()] *\) are heads and tails, using\ g to infinitely nest headers and tails. The logic is actually very simple.
Compared with the version /\ ((? > [^ ()] +) | (\ g)) * / x circulated on the Internet, the writing given here should be much easier to understand.
Going back to extending the recursive matching requirement, what if the string you need to match is the pattern ab (abc (d (xy) e) fgh) df? Another question is, what's the difference between this string pattern and (abc (d (xy) e) fgh)?
If you compare (abc (d (xy) e) fgh) with left and right parentheses, it happens to be logarithmic: (abc (d (xy) e) fgh) (here separated by a space, paired from inside to outside). But when ab (abc (d (xy) e) fgh) df is paired in left and right parentheses, you get ab (abc (d (xy) e) fgh) df, which obviously has a layer of content xy that cannot be paired.
In order to write recursive expressions that are divided into pairs, forget about the layer of xy that cannot be paired. Then the corresponding recursive regular expression is:
/ [^ ()] *\ (\ g*\) [^ ()] * / x
Where [^ ()] * (is the head,\) [^ ()] * is the tail, and the middle uses\ g* to achieve infinite nesting of head-to-tail pairs.
Then consider the part of unpaired xy that comes out in the middle. In fact, it doesn't matter if you put this part directly on the left or right side of\ g*. For example:
# to the left of\ g* / [^ ()] *\ ([^ ()] *\ g*\) [^ ()] * / x# to the right of\ g* / [^ ()] *\ (\ g* [^ ()] *\) [^ ()] * / x
Yes, it's that simple and rude to write recursive regular expressions.
It's just that the reality is not so good, it puts the extra unmatched parts on the left or right side of the recursive expression, but sometimes that doesn't work.
A more general way to solve the problem of unpaired content is to use a branch structure that chooses one of the two, that is, | used in conjunction with recursive expressions, see the next section.
In-depth recursion (4): recursive combination selects one of the two branches
To deal with the extra data that cannot be paired above, you can choose one of the two structures | rewrite it in a more general way as follows:
/ [^ ()] *\ (\ g*\) [^ ()] * |. / x
Perform a matching test:
Reg = / [^ ()] *\ (\ g*\) [^ ()] * |. / xreg = ~ "ab (abc (d (xy) e) fgh) df" # = > 0
When the recursive regular expression combines the function provided by | to choose one of the two branches, the left or right side (opposite to\ g) can be used to provide these "orphan" data.
For example, in the above example, when recursion finds that this part of xy is redundant, it will not be possible to match this extra data from the other branch of one of the two.
But choosing one of the two branches brings a new problem: as long as there is no match, you can go to the other branch to match. If the branch on the right is a.., this is equivalent to one more universal box, and everything can be matched from here.
But what if the extra characters that cannot be matched are the necessary characters such as the closing parenthesis or the left parenthesis? Without any parenthesis, it is no longer a pair of nested structures, but it matches successfully by choosing one of the two branches.
How to solve this problem? First, it is necessary to ensure that the other branch is not omnipotent. Second, the whole structure needs to be anchored. For example:
/\ A ([^ ()] *\ (\ g*\) [^ ()] * | [^ ()])\ ZUnix
Notice that the parentheses are grouped above, so\ g changes to\ g, because the anchor does not need to be included in recursion.
Of course, the other branch selected from one of the two branches in the above example uses a single character match [^ ()], which causes the branch to be selected multiple times if there are multiple consecutive extra characters. To reduce the number of matching tests, you can write it directly as [^ ()] *.
/\ A ([^ ()] *\ (\ g*\) [^ ()] * | [^ ()] *)\ ZUnix
However, this may lead to a lot of backtracking when the match fails, resulting in a sharp decline in performance. For example, the following failed match:
Reg = /\ A ([^ ()] *\ (\ g*\) [^ ()] * | [^ ()] *)\ Z/x# matching failure performance degradation (st=Time.now); (reg = ~ "ab (abc (d (xy) e) fghdf"); (Time.now-st) # = > 1.7730072 (st=Time.now); (reg = ~ "ab (abc (d (xy) e) fghdffds") (Time.now-st) # = > 47.585805 match does not affect (st=Time.now); (reg = ~ "ab (abc (d (xy) e) fgh) df"); (Time.now-st) # = > 5.9e-06
From the results, it is found that for such a short string, the first match failure takes 1.8 seconds, and the second string is even more exaggerated, with only three more characters, and the time taken soars to 47 seconds.
There are many solutions, and two are provided here: one is to move the * sign directly outside the packet, which is not equivalent, but it does not affect the final matching result; the other is to use the multi-selected branch to solidify the grouping or take precedence mode.
Reg1 = /\ A ([^ ()] *\ (\ g*\) [^ ()] * | [^ ()]) *\ Z/xreg2 = /\ A ([^ ()] *\) [^ ()] * | (? > [^ ()] *))\ Z/x# match (st=Time.now); (reg1 = ~ "ab (abc (d (xy) e) fgh) df") (Time.now-st) # = > 6.1e-06 (st=Time.now); (reg2 = ~ "ab (abc (d (xy) e) fgh) df"); (Time.now-st) # = > 5.8e-06# matching failure (st=Time.now); (reg1 = ~ "ab (abc (d (xy) e) fghdf"); (Time.now-st) # = > 8.46e-05 (st=Time.now) (reg2 = ~ "ab (abc (d (xy) e) fghdf"); (Time.now-st) # = > 0.0004223
Deep recursion (5): beware of packet capture in recursion
Before introducing the example, verify the conclusion.
In the process of recursion, there may also be expressions for packet capture, so the value of the relevant variable set by the recursive regularization is the corresponding state of the last packet capture. For example:
Reg = / (abc | def) and\ g?xyz/# recursive round only reg = ~ "abc and def and xyzxyz" # = > match $~ indicates all the strings matched this time $~ # = > # # $1 indicates the content corresponding to the first packet capture $1 # = > "def"
As you can see from the above results, during recursion, the last round of recursion (in this example, the first round of recursion) sets some variables for regular matching, which overrides the results of the recursive settings that precede it.
Let's look at another example. Now there is a need to match palindromes of any length (palindrome), such as 1234321, abcba, good, abccba, good, good, 123321. This example can only be implemented using one of the two branches.
Here's a brief analysis of how to implement this requirement through recursive regularization.
Suppose the string to be matched is abcdcba, and the extra character d is removed first, then the string to be matched is abccba, which is also a string pattern we want to match. First of all, the left and right pairing parts must be exactly the same data, this recursive regularity is actually easy to implement, described by placeholders, the approximate pattern is: (.) _ *\ 1. Replace it with a recursive regular expression:
/ (.)\ g*\ 1Zero x
Then consider the extra character and put it directly in the other branch of one of the two branches: because you choose one of the two branches, the\ g here can guarantee the end point of the recursion without quantifier modification.
/ (.)\ g\ 1 |. / x
Finally, add position anchoring.
/\ A ((.)\ g\ 2 |.)\ ZUnix
There seems to be no problem, go to test the match:
/\ A ((.)\ g\ 2 |.)\ Z abcba x = ~ "abcba" # = > nil
As a result, it was not as successful as I thought.
However, there is nothing wrong with the logic of this regular expression. For example, use grep-P (using PCRE) to perform equivalent regularities to match palindromes.
$grep-P "^ ((.) (? 1)\ 2 |.) $"
Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.
Views: 0
*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.
Continue with the installation of the previous hadoop.First, install zookooper1. Decompress zookoope
"Every 5-10 years, there's a rare product, a really special, very unusual product that's the most un
© 2024 shulou.com SLNews company. All rights reserved.