How to master regular expressions 04/16 Update SLTechnology News&Howtos

How to master regular expressions

2025-04-16 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Development >

Shulou(Shulou.com)06/02 Report--

This article mainly explains "how to master regular expressions". The content of the explanation is simple and clear, and it is easy to learn and understand. let's follow the editor's train of thought to study and learn "how to master regular expressions".

1. Regular common rule 1.1 character matching

Character description\ escape character\ d [0-9]. It means it's a number. \ d [^ 0-9]. Represents any character except a number. \ w [0-9aMuzAMuZZ]. Represents numbers, uppercase and lowercase letters, and underscores. \ W [^ 0-9amurzAmurZZ]. Non-word characters. \ s [\ t\ v\ n\ r\ f]. Represents a space character, including spaces, horizontal tabs, vertical tabs, line feeds, carriage returns, and page feeds. \ S [^\ t\ v\ n\ r\ f]. Non-blank character. [^\ n\ r]. A wildcard that represents almost any character. Except for line feeds, carriage returns, line delimiters, and segment delimiters. \ uxxxx looks for Unicode characters specified in the hexadecimal number xxxx. \ f matches a feed character (Utt000C). \ nmatches a newline character (Ubun000A). \ r matches a carriage return (Udest000D). \ t matches a horizontal tab (Ubun0009). \ v matches a vertical tab (Utt000B). \ 0 matches NULL (Ubun0000) characters, do not follow this with other decimals, because\ 0 is an octal escape sequence. [\ b] matches a backspace (Ubun0008). Don't be confused with\ b. ) [abc] any of a, b, or c [^ abc] not a, b, or c [a murg] character between a & g

1.2 position matching

The character description\ b is the word boundary, specifically the position between\ w and\ W, including the position between\ w and ^, as well as the position between\ w and. Specifically, it is the position between and, and, and. \ B is the opposite of\ b, not a word boundary. For example, in all the positions in the string, deduct\ b, and the rest are\ B. The position where the ^ abc$ string begins and ends

1.3 groups

Character description (abc) capture group, capture group\ nbackreference to group # n, grouping reference, referring to the matching content of the nth capture group, where n is a positive integer (?: abc) non-capturing group, non-capture group

1.4 advance assertion

The character indicates a (? = b) positive lookahead, an assertion that a matches a (?! B) negative lookahead only before b, and a negative assertion that a matches only if it does not precede b.

1.5 subsequent assertions

Character description (? 1.6 quantifiers and branches

The character indicates that between one 0 or morea+1 or morea?0 or 1a {5} exactly fivea {2,} two or morea {1 two or morea 3} between one & threea+?a {2,}? match as few as possible, lazy matching means as few matches as possible.

The following are all inert matches: {mdjinn}? {m,}? +? *?

1.7 Branch

Character description ab | cdmatch ab or cd, which matches the substring 'ab'' or 'cd''

1.8 modifier

The character indicates that I performs a case-insensitive match. G performs a global match (looking for all matches instead of stopping after the first match is found). M performs multi-line matching. U turn on "Unicode mode" to correctly handle Unicode characters greater than\ uFFFF. That is, the four-byte UTF-16 encoding is handled correctly. S allows. Matches the newline character. The yy modifier acts like the g modifier, which is also a global match, and the latter match starts from the next location where the previous match was successful. The difference is that the g modifier only needs to have a match in the remaining position, while the y modifier ensures that the match must start in the first remaining position, which is the meaning of "adhesion"

two。 Operator precedence

Operator description\ escape character (), (?:), (? =), [] parentheses and square brackets *, +,?, {n}, {n,}, {n food m} qualifier ^, $,\ any metacharacter, any character anchor and sequence (that is, position and order) | replace, "or" the operator character takes precedence over the replacement operator so that "m | food" matches "m" or "food". To match "mood" or "food", use parentheses to create a subexpression that produces "(m | f) ood".

3. Regular backtracking 3.1 what is a backtracking algorithm

Here is a partial explanation from Wikipedia:

Backtracking is a general computer algorithm that is used to find all (or some) solutions to some computing problems, especially constraint satisfaction problems, to build candidate solutions step by step. and immediately abandon the candidate ("backtracking") to complete the effective solution when the candidate is impossible.

Backtracking is usually implemented with the simplest recursive method, and two situations may occur after repeating the above steps over and over again:

Find a correct answer that may exist

Declare that there is no answer to the question after trying all possible step-by-step methods

In the worst case, the backtracking method will lead to a computation with exponential time complexity.

3.2 what is regular backtracking

Regular engines can be divided into two categories: one is DFA (Deterministic finite automaton deterministic finite automaton), the other is NFA (NFA Non-deterministic finite automaton nondeterministic finite automaton). NFA is slower and more complex to implement than DFA, but it has much more powerful features than DFA, such as support for backreferences. The regular engines of languages such as javaScript, java, php, python, and c # are all NFA, and backtracking is used in the implementation of the NFA regular engine.

3.2.1 regularities without backtracking

To take a common example on the Internet, the regular expression / ab {1je 3} c abbc', g matches the text 'abbc',. We will analyze the matching process through RegexBuddy, and the use of RegexBuddy is introduced in a subsequent chapter.

As shown in the figure above, let's break down the matching process step by step:

The regular engine matches a first.

Regular engines match b as much as possible (greedily).

The regular engine matches c to complete the match.

In the meantime, the matching process went smoothly and there was no accident (backtracking).

3.2.2 regularities with regular backtracking

Let's change the above rule to / ab {1 ~ 3} c _ ab g to / ab {1 ~ 3} bc/g, and then check the analysis results through RegexBuddy.

Let's break down the matching process step by step:

The regular engine matches a first.

The regular engine matches b in b {1jue 3} as much as possible (greedily).

Regular engine to match b, found no b, too bad! Go back!

To return to the step of b {1pm 3}, you can't be so greedy that there is less b to match.

Regular engine to match b.

The regular engine matches c and completes the match.

The above is a simple backtracking process.

3.3 several common forms of regular backtracking

From the above example of regular backtracking, we can see that the process of regular backtracking is a process of trial and error, which is the essence of backtracking algorithm. Backtracking will increase the matching steps, which is bound to affect the performance of text matching. therefore, in order to improve the matching performance of regular expressions, it is very important to understand the scene (form) of backtracking.

3.3.1 greedy quantifier

In the NFA regular engine, quantifiers are greedy by default. When the quantifier shown in the following table is used in the regular expression, the regular engine will initially match the text that satisfies the quantifier as greedily as possible. When there is no match, there will be backtracking, trial and error until failure or success.

Quantifier indicates axiom 0 or morea+1 or morea?0 or 1a {5} exactly fivea {2,} two or morea {1pr 3} between one & three

When multiple greedy quantifiers exist next to each other and conflict with each other, the principle of "first come, first served" is adhered to, as follows:

Let string = "12345"; let regex = / (\ d {1pr 3}) /; console.log (string.match (regex)); / / = > ["12345", "123i", "45", index: 0, input: "12345"] 3.3.2 inert quantifiers

Greed is an important cause of backtracking, so can we avoid backtracking if we try to match the text in a lazy way? The answer is no.

Let's go back to the original example, / ab {1pm 3} c _ hand g to match abbc. Next, let's change the rule to / ab {1jue 3}? C ab g to match the abbc, and match the text in a lazy way. The RegexBuddy steps are shown below:

The regular engine matches a first.

The regular engine matches b in b {1jue 3} as little as possible (lazy).

Regular engine to match c, crap! How there is a b block, can not match c ah! Go back!

To return to the step of b {1pm 3}, you can't be so lazy and match more b.

Regular engine to match c again, too bad! Why is there a b block? it can't match c! Go back!

To return to the step of b {1pm 3}, you can't be so lazy and match one more b.

Regular engine to match c, the match is successful, very good!

It is a good rule that backtracking does not occur, because backtracking occurs after lazy quantifiers are used for lazy matching. Therefore, inert quantifiers can not be used blindly, the key is to look at the scene.

3.3.3 grouping

The matching rule of branches is: match one by one according to the order of branches, and when the previous branches meet the requirements, discard the following branches.

Take a simple branch, Chestnut, and use regular expressions to match / abcde | abc/g text abcd, or view the steps via RegexBuddy:

Regular engines match a.

Regular engine matches b.

Regular engines match c.

Regular engines match d.

Regular engine matches e, crap! The next one is not e, go back!

The previous branch can't get through, switch branches, and the second branch regular engine matches a.

The second branch regular engine matches b.

The second branch regular engine matches c, and the match is successful!

From this, we can see that the process of grouping matching is also a process of trial and error, and backtracking may occur in the middle.

4. Regular analysis and debugging

RegexBuddy is a very powerful tool for learning, analyzing, and debugging regular expressions. RegexBuddy supports more than a dozen mainstream programming languages, such as C++, Java, JavaScript, Python and so on. Through RegexBuddy, you can see the step-by-step process of creating regularities. Combined with the test text, you can see the regular step-by-step matching process, which is of great help to understand regular backtracking and further optimize regularization.

4.1 install analysis and debugging tools

RegexBuddy can be downloaded and obtained from the official website of RegexBuddy.

After downloading, click install step by step.

4.2 introduction to the tool interface

The following figure shows the panels and related functions of the RegexBuddy interface.

4.3 create a regular

For ease of use, you can set the layout to Side by Side Layout in the layout settings.

Enter your regular regex1 in the regular input area, look at the Create panel, you will find that the panel shows the regular creation process (or matching rules), enter your test text in the Test panel area, the part that meets the regex1 matching rules will be highlighted, as shown in the following figure.

4.4 use the Debug function of RegexBuddy

Select the test text, click debug can enter RegexBuddy debug mode, I think this is the most powerful place of RegexBuddy, because it allows you to clearly know your input regular test text matching process, how many steps have been performed, where backtracking occurred, where optimization is needed, you can see at a glance.

4.5 use the Library function of RegexBuddy

Many common regularities are built into RegexBuddy's regular library, and many regular expressions needed in the daily coding process can be found in this regular library.

4.6 more tool recommendations

Regular Visualization-regexper

Regular Visualization-regulex

Regular on-line debugging

5. Regular performance optimization

Regular is a very useful tool, if used properly, with the help of God, you can save a lot of code. When used improperly, it is buried everywhere. Therefore, the focus of this chapter is to summarize how to write a high-performance regular expression.

5.1 avoid quantifier nesting

Let's take a simple example to compare:

Let's use the regular expression / a*b/ to match the string aaaaa. Take a look at the following figure of RegexBuddy execution:

Let's modify the above rule to / (a *) * b / to match the string aaaaa, and then take a look at the execution result of RegexBuddy:

The above two basic execution steps of the rule can be simply considered as:

Greedy matching

Backtracking

Until a match is found to fail.

But surprisingly, the first regular process from the beginning to the failure of the match is only 14 steps. The second rule has as many as 128 steps. It is conceivable that nested quantifiers will greatly increase the execution of regularization. Because there are two layers of backtracking, the process of increasing the execution step is like the process of increasing the complexity of the algorithm from O (n) to O (n ^ 2).

Therefore, in the face of quantifier nesting, we need to make appropriate transformations to eliminate these nesting:

(a *) * (a +) * (a *) + a * (a +) + av 5.2 use non-capture groups

Parentheses in the NFA regular engine serve two main purposes:

Mainstream function to increase the operation priority of the contents in parentheses

Reverse reference

Reverse referencing is a powerful feature, and the powerful cost is performance consumption. So, when we don't need to use parenthesis reverse references, we should try to use non-capture groups, that is:

/ / capture group and non-capture group () = > (?:) 5.3 branch optimization

Branches are also an important reason for regular backtracking, so we also need to make necessary optimizations for regular branches.

5.3.1 reduce the number of branches

First, you need to reduce the number of branches. For example, when matching http and https, many regulars like to write:

/ ^ http | https/

In fact, the above can be optimized to:

/ ^ https?/

This reduces unnecessary branch backtracking.

5.3.2 shrink the content within the branch

It is also necessary to narrow down the content in the branch. For example, if we need to match this and that, we might write:

/ this | that/

But the above can actually be optimized to

/ th (?: is | at) /

Some people may think that there is no difference between the above. Let's match that with the above two regular expressions.

We will find that the first regular execution step is two steps more than the first regular, because the first regular backtracking path is longer than the second regular backtracking path, resulting in longer execution steps.

5.4 Anchor point optimization

Use anchor points as much as possible when you can. Most regular engines do some extra analysis during the compile phase to determine whether there are characters or strings necessary for a successful match. Anchor matching like ^ and $can give more optimization information to regular engines.

For example, the regular expression hello (hi)? $can only start with the penultimate character at the end of the string during the matching process, so the regular engine can analyze the jump to that position, skipping many possible characters in the target string and greatly improving the matching speed.

Thank you for your reading. the above is the content of "how to master regular expressions". After the study of this article, I believe you have a deeper understanding of how to master regular expressions, and the specific use needs to be verified in practice. Here is, the editor will push for you more related knowledge points of the article, welcome to follow!

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.