2025-01-19 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/01 Report--
In this article, the editor shares how to optimize regular expressions. Most readers are probably not very familiar with the topic, so this article is offered for reference; I hope you learn a lot from it. Let's get started!
Let's begin with a bit of regular-expression background that you may not know; it will be useful for the optimizations that follow.
The familiar grep (global regular expression print) can be regarded as the origin of today's regular expressions: the concept was first proposed by neuroscientists, then formalized by mathematicians, spread through early IBM software, and eventually became the stand-alone tool grep. The regular expressions we use now are divided by POSIX (Portable Operating System Interface) into two flavors: BREs (basic regular expressions) and EREs (extended regular expressions); a POSIX-conforming program must support one of the two. The two flavors have different features that need to be understood. For more information, see the section comparing the common regular expression flavors in chapter 3 of the shell script learning guide [1].
Regex matching engines fall into two main categories: DFA (deterministic finite automaton) and NFA (non-deterministic finite automaton). The theory behind them comes from compiler construction; look it up on a wiki if you are interested. Today there are three kinds of regex engines:
1. DFA engines run in linear time because they do not need to backtrack (and therefore never test the same character twice). A DFA engine also guarantees that it matches the longest possible string. However, because a DFA engine contains only a finite number of states, it cannot match patterns with backreferences, and because it does not construct an explicit expansion, it cannot capture subexpressions.
2. Traditional NFA engines run a so-called "greedy" backtracking matching algorithm, testing all possible expansions of the regular expression in a specific order and accepting the first match. Because a traditional NFA constructs a specific expansion of the regular expression for a successful match, it can capture subexpression matches and backreferences. However, because a traditional NFA backtracks, it can visit exactly the same state multiple times (when the state is reached over different paths). As a result, in the worst case it can be extremely slow. And because a traditional NFA accepts the first match it finds, it can also leave other (possibly longer) matches undiscovered.
3. POSIX NFA engines are like traditional NFA engines, except that they keep backtracking until they can guarantee that the longest possible match has been found. As a result, a POSIX NFA engine is slower than a traditional NFA engine, and when using a POSIX NFA you cannot make it prefer a shorter match over a longer one by changing the order of the backtracking search.
From the engine behavior we can summarize two universal rules:
1. The leftmost match wins.
2. The standard quantifiers (*, +, ?, {n,m}) are greedy: they try to match first.
Here are some simple examples of optimization:
For example, take '.*[0-9][0-9]' matching the string "abcd12efghijklmnopqrstuvw". The match proceeds like this: '.*' first matches the entire line, but the two trailing digits then have nothing to match, so '.*' gives back one character, 'w'. Still no match, so it gives back 'v', and the loop continues until it reaches '2', where one digit matches but the second still cannot, so it keeps giving back until '1'. Once we understand this behavior, we should avoid such situations as far as possible: if we only need the string to contain two digits, match the two digits directly instead of writing a wildcard like '.*'. Studying optimization is really studying the underlying implementation, because optimizing means achieving the desired effect while following the grain of the tool's implementation; if you do not understand the internals of the tool you use, you cannot know which constructs are appropriate and efficient.
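To see the difference concretely, here is a small sketch using Python's re module (a backtracking, traditional-NFA-style engine); the string and patterns come from the example above:

```python
import re

s = "abcd12efghijklmnopqrstuvw"

# Wasteful: .* first matches the whole line, then gives characters
# back one at a time ('w', 'v', ...) until the two digits fit.
wasteful = re.search(r".*([0-9][0-9])", s)

# Direct: the engine just looks for the first two-digit position.
direct = re.search(r"[0-9][0-9]", s)

assert wasteful.group(1) == direct.group(0) == "12"
```

Both patterns find "12", but the first one finds it only after a long backtracking walk from the end of the string.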
From the introduction to DFA and NFA above, we know the difference between them. Put simply, an NFA is expression-directed and a DFA is text-directed. In general, a DFA engine examines each character of the target string only once, while an NFA backtracks, so a text-directed DFA is faster than an expression-directed NFA.
Thinking it over, you may notice that avoiding NFA backtracking as much as possible is the central problem in optimizing for NFA engines. There is also the question of greedy versus lazy (ignore-priority) matching, which likewise needs attention. Greediness deserves a brief explanation, since it appeared in the universal rules above: when the engine reaches a position where it must choose between attempting a match and skipping it, a greedy quantifier attempts the match first, while a lazy quantifier skips it first. The following examples show how these two behaviors work and why they matter for optimization.
Using 'ab?c' to match "abc", the program flow looks like this:
First 'a' matches. When the engine reaches 'b', the '?' makes it consider whether to match 'b'. By default the quantifier is greedy, so the engine tries the match first and stores the other choice as an alternative (backtracking) state. 'ab' matches, then 'c' matches successfully, so the program ends and the alternative state is discarded.
If we use 'ab?c' to match "ac" instead, when the engine reaches 'b' it first tries to match 'b' and finds that it cannot. It then backtracks, returning to the state saved after matching 'a', and continues by matching 'c', finishing successfully. This backtracking process is easy to understand for anyone who has studied algorithms: like the last-in-first-out behavior of a stack, it always lets the engine return conveniently to the most recent valid state.
Now for lazy (ignore-priority) matching. If we use 'ab??c' to match "ac", the engine matches 'a' and then reaches 'b??'. Because the quantifier is lazy, the engine gives up quantifier priority and skips the match of 'b', trying 'c' first. The match ends successfully without any of the earlier backtracking.
Finally, look at an unsuccessful match: let 'ab?x' match "abc". The engine matches 'a', then tries 'b', then tries 'x' against 'c' and fails; after backtracking (skipping 'b') it fails again. It then moves the starting position forward and tries from 'b', then from 'c', both unsuccessfully, and finally reports that there is no match.
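The four walkthroughs above can be verified in any backtracking engine; a minimal sketch using Python's re:

```python
import re

# Greedy b?: try b first; "abc" matches without backtracking.
assert re.fullmatch(r"ab?c", "abc") is not None

# On "ac" the engine tries b, fails, backtracks, and still matches.
assert re.fullmatch(r"ab?c", "ac") is not None

# Lazy b??: skip b first, so "ac" matches with no backtracking at all.
assert re.fullmatch(r"ab??c", "ac") is not None

# Failed match: ab?x finds no match in "abc" from any start position.
assert re.search(r"ab?x", "abc") is None
```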
In general, you need to avoid backtracking and decide where your expressions should use lazy rather than greedy matching. The examples above are very simple, but avoiding backtracking can reduce the time complexity of a program from quadratic O(n²) to linear O(n), which is the ideal case.
The backtracking of '*' and '+' is similar to the process above. For example, 'x*' can be viewed as 'x?x?x?x?…', or nested as 'x(x(x…)?)?'. Imagine this nesting going many levels deep when suddenly an 'x' fails to match and the engine has to backtrack level by level. And with an expression like '.*[0-9]', the '.*' first matches the entire string and then backtracks step by step, testing each returned character for a digit, when it could simply have matched a digit directly. So, as mentioned above, do not write a wildcard like '.*' unless it is necessary.
In addition, DFA engines do not support lazy matching; only greedy matching is available. The greediness of '.*' is easy to overlook, and careless use produces results we may not want. For example, suppose we want to match the content inside parentheses with '\((.*)\)', and the target string is "abcd(aaaa)efg(ggg)h". By the nature of '.*', it runs from the first '(' to the end of the line and then gives characters back as needed, so the final match is "(aaaa)efg(ggg)", which is not what we expect. The regular expression we actually need is '\(([^()]*)\)'. This kind of error is especially easy to make with HTML tags when matching and replacing, where a greedy wildcard can swallow several elements at once; and a fix that excludes only the closing character, like '\(([^)]*)\)', can still match the wrong span, so think the problem through.
Next, solidified (atomic) grouping, written '(?>expression)'. An atomic group is no different from an ordinary group, except that once the expression inside it has matched, the result is locked in and any alternative states created inside it are abandoned. Consider an example: '\w+:' matching "helloworld". We can see at a glance that it cannot match, because the string contains no colon. To set up the comparison with atomic matching, here is the process: '\w+' first matches to the end of the string, then the ':' is tried; obviously there is nothing for it to match, so '\w+' must give back a character for ':' to try against 'd', which fails, and the next character does not work either. It still fails after backtracking all the way to 'h', and only then reports failure. Using the atomic group '(?>\w+):' to match "helloworld", the process is: '\w+' matches to the end of the line, the colon then fails to match, and failure is reported immediately.
We know that '\w' cannot match symbols: if '\w' could match a character, that character is certainly not a colon, so there is no need to keep the alternative states generated by '\w+' just so the match can backtrack. Atomic grouping eliminates exactly these candidate states. If you want to try it, first make sure your tool supports atomic groups.
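Both points, the greedy parentheses trap and the effect of making '\w+' atomic, can be sketched in Python's re. Note that re accepts '(?>…)' only from Python 3.11 onward; the portable emulation below (a capturing lookahead plus a backreference, which keeps no internal alternative states) is an assumption about the engine, not something from the original article:

```python
import re

# Greedy trap: .* runs to the last ')' in the line before giving back.
s = "abcd(aaaa)efg(ggg)h"
greedy = re.search(r"\((.*)\)", s)
precise = re.search(r"\(([^()]*)\)", s)
assert greedy.group(1) == "aaaa)efg(ggg"
assert precise.group(1) == "aaaa"

# Plain \w+: fails on "helloworld" only after giving back its
# match one character at a time.
assert re.search(r"\w+:", "helloworld") is None

# Emulated atomic group: the lookahead captures \w+ greedily and
# \1 re-consumes exactly that text, leaving nothing to give back.
atomic = re.compile(r"(?=(\w+))\1:")
assert atomic.search("helloworld") is None
assert atomic.search("hello:world").group(0) == "hello:"
```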
There is also the possessive quantifier family: '?+', '*+', '++', '{m,n}+'. These quantifiers match greedily, but unlike ordinary greedy quantifiers, what they match is never given back; that is, the alternative states generated while the quantifier matches are discarded.
Alternation looks like 'a|b|c'. A traditional NFA tries the branches in order, and each branch exhausts all of its alternative states before the next branch is tried. This ordered matching suggests a small optimization: put the branch most likely to succeed as far forward as possible.
Much of what has been said above concerns the NFA, and most regex optimization work is done for NFA engines. Both DFA and NFA convert the regular expression, at compile time, into a form suited to their own algorithm; the DFA needs more memory and compiles more slowly, but during the actual match the DFA is faster than the NFA, and a badly written expression can leave an NFA in the awkward position of barely being able to finish the match. The reason the NFA is still mainstream is that it provides features the DFA cannot, such as the various matching modes mentioned above.
NFA and DFA can also coexist. Some tools contain both engines so as to be as efficient as a DFA and as expressive as an NFA. GNU grep and awk, for example, use an efficient DFA engine for ordinary matching tasks, try the DFA first for complex tasks, and switch to the NFA engine when the DFA's features cannot meet the need.
The above has introduced, somewhat loosely, the DFA and NFA regex engines and some examples of regex optimization. We have also seen that there are few optimization strategies for DFA engines: write the expression precisely and keep the number of match attempts as small as possible. For NFA engines there is much more room for optimization, but first you need to determine whether your tool is based on a traditional NFA or a POSIX NFA; some problems affect only one of the two engines.
Besides avoiding backtracking, also avoid exponential blow-up. Take the expression '([^/]+)*': every time a character is matched, the engine must decide whether it belongs to the inner '+' or to a new iteration of the outer '*'. Matching a string of length 10 that ultimately fails requires 1023 backtracks beyond the first attempt, and the count doubles with every character. If the string grows to 20 characters there are over a million possibilities, often taking seconds; at 30, over a billion, taking hours; past 40, you could wait more than a year. This actually gives us a way to identify which engine the tool we use is based on:
1. If an expression gives a result quickly even when it does not match, it may be a DFA, though only "may be".
2. If it gives a result quickly when the match succeeds, it is a traditional NFA.
3. If it is always slow, it is a POSIX NFA.
The first rule only says "may", because a highly optimized NFA can also give results quickly.
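The exponential blow-up is easy to reproduce; a cautious sketch in Python's re, kept to 18 characters so the failure still returns quickly on a typical machine:

```python
import re
import time

# ([^/]+)* followed by a '/' that never appears: every way of
# splitting the run of 'a's between + and * is tried before failing.
pattern = re.compile(r"^([^/]+)*/")

start = time.perf_counter()
result = pattern.match("a" * 18)
elapsed = time.perf_counter() - start

assert result is None  # the match fails, slowly
print(f"failed after roughly 2**17 split attempts in {elapsed:.3f}s")
# Each extra character roughly doubles the time; try 25 or 28
# characters to feel the growth, but be ready to interrupt.
```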
Another point: the backtracking cost of alternation is high. Compare 'a|b|c|d|e|f' with the character class '[a-f]': the character class needs only one simple test per position, while the alternation leaves up to six alternative states to backtrack through at every position.
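A rough comparison, assuming Python's re (timings vary by machine, so treat the printed numbers as illustrative only):

```python
import re
import timeit

text = "z" * 1000 + "f"

alt = re.compile(r"a|b|c|d|e|f")   # six branches per position
cls = re.compile(r"[a-f]")         # one class test per position

# Both find the same match...
assert alt.search(text).group(0) == cls.search(text).group(0) == "f"

# ...but the class version does far less work at each position.
t_alt = timeit.timeit(lambda: alt.search(text), number=1000)
t_cls = timeit.timeit(lambda: cls.search(text), number=1000)
print(f"alternation: {t_alt:.4f}s  class: {t_cls:.4f}s")
```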
Modern regex compilers perform many optimizations you may not know about, but applying the common-sense optimizations below yourself is always worthwhile, because you cannot be sure your tool optimizes a given case:
1. If your regex tool supports it, use non-capturing parentheses '(?:expression)' when you do not need to refer back to the parenthesized text.
2. If parentheses are not necessary, do not add them.
3. Do not abuse character classes: instead of '[.]', simply write '\.'.
4. Use the anchors '^' and '$'; they speed up locating the match.
5. Extract the mandatory part of a quantified expression: write 'x+' as 'xx*', and 'a{2,4}' as 'aaa{0,2}'.
6. Extract the shared leading characters of an alternation: change 'the|this' to 'th(?:e|is)' (if your regex engine does not support non-capturing groups, use 'th(e|is)'). Anchors in particular must stand on their own, because many regex compilers make special optimizations based on them: change '^123|^abc' to '^(?:123|abc)'. Likewise keep '$' as independent as possible.
7. Distribute an expression that follows an alternation into the alternation itself, so that when any branch matches, the following part can be tried without leaving the alternation, and failure is detected sooner. Use this optimization with caution.
8. Choosing between lazy and greedy matching depends on the situation. If you are not sure, use greedy matching, which is faster than lazy matching.
9. Splitting a large regular expression into several small ones helps efficiency a great deal.
10. Simulate an anchor: use an appropriate lookahead to pre-check the possible starting position. For example, when matching the twelve month names, first test whether the initial character can match: '(?=[JFMASOND])(?:Jan|Feb|…|Dec)'. Apply this according to the actual situation; sometimes the lookaround costs more than it saves.
11. In many cases, atomic grouping and possessive quantifiers can greatly improve speed.
12. Avoid near-endless matches such as '(this|that)*'. The '(…+)*' pattern mentioned above is similar.
13. If a simple preliminary match can greatly shorten the target string, run several regex passes in stages; this is very effective in practice.
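Several of the tips above can be sketched together in Python's re (the patterns here are illustrative examples chosen by the editor, not prescriptions):

```python
import re

# Tips 1 and 6: factor the shared prefix and use a non-capturing group.
plain = re.compile(r"the|this")
factored = re.compile(r"th(?:e|is)")
for word in ("the", "this"):
    assert plain.fullmatch(word) and factored.fullmatch(word)

# Tip 6: keep the anchor independent so the compiler can exploit it.
anchored = re.compile(r"^(?:123|abc)")
assert anchored.match("abcdef") is not None
assert anchored.match("xabc") is None

# Tip 10: a lookahead pre-checks the first letter of a month name
# (abbreviated here to four months for brevity).
month = re.compile(r"(?=[JFMASOND])(?:Jan|Feb|Mar|Dec)")
assert month.search("due in Dec").group(0) == "Dec"
```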
That is the whole of "how to optimize regular expressions". Thank you for reading! I hope the content shared here has helped your understanding; if you want to learn more, welcome to follow the industry information channel!