What should be paid attention to when using regular expressions 07/03 Update SLTechnology News&Howtos

What should be paid attention to when using regular expressions

2025-07-03 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Internet Technology >

Shulou(Shulou.com)06/02 Report--

This article will explain in detail what you need to pay attention to when using regular expressions. The editor thinks it is very practical, so I share it for you as a reference. I hope you can get something after reading this article.

A few days ago, an online project monitoring information suddenly reported an exception. After going to the machine to check the usage of related resources, it was found that the CPU utilization was close to 100%. Through the threading Dump tool that comes with Java, we have exported the stack information of the problem.

We can see that all the stacks point to a method called validateUrl, and there are more than 100 error messages in the stack. By examining the code, we know that the main function of this method is to verify that URL is legal.

It's strange how a regular expression can lead to high CPU utilization. In order to figure out the recurrence problem, we extracted the key code and did a simple unit test.

Public static void main (String [] args) {String badRegex = "^ ([hH] [tT] {2} [pP]: / / | [hH] [tT] {2} [pP] [sS]: / /) (([A-Za-z0-9mura] +).) + ([A-Za-z0-9mura ~\ /]) + $"; String bugUrl = "http://www.fapiao.com/dddp-web/pdf/download?request=6e7JGxxxxx4ILd-kExxxxxxxqJ4-CHLmqVnenXC692m74H38sdfdsazxcUmfcOH2fAfY1Vw__%5EDadIfJgiEf";" If (bugUrl.matches (badRegex)) {System.out.println ("matchmakers!");} else {System.out.println ("no matchmakers!");}}

When we run the above example, we can see through the resource monitor that the CPU utilization of a process called java has soared directly to 91.4%.

Seeing here, we can basically infer that this regular expression is the killer that led to the high utilization of CPU!

So we focus on that regular expression:

^ ([hH] [tT] {2} [pP]: / / | [hH] [tT] {2} [pP] [sS]: / /) (([A-Za-z0-9muri ~] +).) + ([A-Za-z0-9mura ~\ /]) + $

This regular expression looks fine and can be divided into three parts:

The first part matches http and https protocols, and the second part matches www. Characters, the third part matches many characters. I looked at this expression in a daze for a long time, but I didn't see that there was no big problem.

In fact, the key reason for the high utilization of CPU here is that the engine implementation of Java regular expressions is NFA automata, which will backtracking during character matching. Once backtracking occurs, it takes a long time, either a few minutes or hours, depending on the number and complexity of backtracking.

See here, perhaps everyone is not very clear what is backtracking, but also a little confused. It doesn't matter, let's start little by little with the principle of regular expressions.

Regular expression engine

Regular expression is a very convenient matching symbol, but to implement such a complex and powerful matching syntax, there must be a set of algorithms to implement, and the thing that implements this algorithm is called regular expression engine. To put it simply, there are two ways to implement a regular expression engine: DFA automata (Deterministic Final Automata deterministic finite automata) and NFA automata (Non deterministic Finite Automaton uncertain finite automata).

For these two automata, they have their own differences, and it is not intended to delve into their principles. To put it simply, the time complexity of DFA automata is linear and more stable, but its function is limited. The time complexity of NFA is unstable, sometimes very good, sometimes not so good, depending on the regular expression you write. But the advantage is that NFA is more powerful, so languages including Java,. NET, Perl, Python, Ruby, PHP and so on all use NFA to implement their regular expressions.

So how exactly does the NFA automaton match? Let's use the following characters and expressions to illustrate.

Text= "Today is a nice day." regex= "day"

It is important to remember that NFA matches based on regular expressions. That is, NFA automatically reads one character of the regular expression and matches it with the target string. If the match succeeds, it will replace the next character of the regular expression, otherwise it will continue to compare with the next character of the target string. Maybe you don't quite understand, it's okay, let's use the above example to analyze it step by step.

First, get the first match of the regular expression: d. So compare it with the character of the string, the first character of the string is T, do not match, change to the next one. The second one is o, and it doesn't match, so change it to the next one. The third is d, which matches, so read the second character of the regular expression: a.

Read the second match of the regular expression: a. That continues to be compared with the fourth character an of the string and matches again. Then read the third character of the regular expression: y.

Read the third match of the regular expression: y. That continues to be compared with the fifth character y of the string and matches again. Try to read the next character of the regular expression and find that it is gone, so the match ends.

The above matching process is the matching process of NFA automata, but the actual matching process will be much more complex than this, but its principle is unchanged.

Traceback of NFA Automata

Now that you understand how NFA does string matching, we can talk about the main point of this article: backtracking. In order to better explain backtracking, we also use the following example to explain.

Text= "abbc" regex= "ab {1jcr 3} c"

The purpose of the above example is relatively simple, matching strings that start with an and end with c, with 1-3 b characters in the middle. The process of parsing it by NFA is as follows:

First, the first match an of the regular expression is compared with the first character an of the string. So the second character of the regular expression is read.

Read the second match of the regular expression, b {1prit 3}, and compare it with the second character b of the string. But because b {1jue 3} represents 1-3 b strings, and the greedy nature of NFA automata (that is, to match as many as possible), instead of reading the match of the next regular expression, b {1jue 3} is still compared with the third character b of the string to find that it still matches. So we continue to use b {1p3} to compare with the fourth character c of the string and find that it doesn't match. Backtracking occurs at this point. What is the operation of backtracking?

After the backtracking occurs, the fourth character c of the string we have read will be spit out and the pointer will return to the position of the third string. After that, the program reads the next operator c of the regular expression, reads the next character c of the current pointer and compares it to find a match. So read the next operator, but it's over here.

Let's go back to the previous regular expression that verifies URL:

^ ([hH] [tT] {2} [pP]: / / | [hH] [tT] {2} [pP] [sS]: / /) (([A-Za-z0-9muri ~] +).) + ([A-Za-z0-9mura ~\ /]) + $

The URL with the problem is:

Http://www.fapiao.com/dzfp-web/pdf/download?request=6e7JGm38jfjghVrv4ILd-kEn64HcUX4qL4a4qJ4-CHLmqVnenXC692m74H5oxkjgdsYazxcUmfcOH2fAfY1Vw__%5EDadIfJgiEf

We divide this regular expression into three parts:

The first part: verification protocol. ^ ([hH] [tT] {2} [pP]: / / | [hH] [tT] {2} [pP] [sS]: / /).

The second part: verify the domain name. ([A-Za-z0-9] +) +) +.

The third part: check the parameters. ([A-Za-z0-9 ~\]) + $.

We can find that there is no problem with the regular expression verification protocol http://, but when verifying www.fapiao.com, it uses xxxx. To check in this way. So in fact, the matching process goes like this:

Match to www. Match to fapiao.

When you match to com/dzfp-web/pdf/download?request=6e7JGm38jf., you will find that because of greedy matching, the program will always read the following string to match, and finally find that there is no period, so it goes back one by one.

This is the first problem with this regular expression.

Another problem is that in the third part of the regular expression, we find that the URL in question has an underscore (_) and a percent sign (%), but not in the regular expression corresponding to the third part. This will cause you to find a mismatch after matching a long string of characters in front of you, and then go back.

This is the second problem with this regular expression.

Solution

After you understand that backtracking is the cause of the problem, you will find that if I add an underscore and a percent sign to the third part, the program will be normal.

Public static void main (String [] args) {String badRegex = "^ ([hH] [tT] {2} [pP]: / / | [hH] [tT] {2} [pP] [sS]: / /) (([A-Za-z0-9mura] +).) + ([A-Za-z0-9mura /%\\]) + $"; String bugUrl = "http://www.fapiao.com/dddp-web/pdf/download?request=6e7JGxxxxx4ILd-kExxxxxxxqJ4-CHLmqVnenXC692m74H38sdfdsazxcUmfcOH2fAfY1Vw__%5EDadIfJgiEf";" If (bugUrl.matches (badRegex)) {System.out.println ("matchmakers!");} else {System.out.println ("no matchmakers!");}}

Run the above program and it will print out matchmakers immediately.

But this is not enough, if there are other URL containing messy characters in the future, can we modify it again? It must be unrealistic!

In fact, there are three patterns in regular expressions: greed, laziness, and monopolization.

In terms of quantity matching, there are +? * {min,max} four twice. If they are used alone, then they are greedy patterns.

What if you add one more after them? Symbol, then the original greedy pattern becomes lazy, that is, as few matches as possible. But the laziness pattern can still be traced back. For example, the following example:

Text= "abbc" regex= "ab {1 Magne3}? C"

The first operator an of the regular expression matches the first character an of the string and matches successfully. So the second operator of the regular expression, b {1pr 3}? Matches the second character b of the string, and the match is successful. Because of the minimum matching principle, the third operator c of the regular expression is matched with the third character b of the string, and a mismatch is found. So go back and take the second operator of the regular expression, b {1pr 3}? Matches the third character b of the string, and the match is successful. Then take the third operator c of the regular expression to match the fourth character c of the string, and the match is successful. So it's over.

If you add an extra + symbol after them, the original greedy pattern becomes exclusive, that is, matching as much as possible, but not backtracking.

Therefore, if the problem is to be solved completely, it is necessary to ensure that backtracking does not occur while ensuring functionality. I put an extra + sign after the second part of the regular expression that verifies URL above, which looks like this:

^ ([hH] [tT] {2} [pP]:\ / / | [hH] [tT] {2} [pP] [sS]:\ / /) (([A-Za-z0-9 murmura] +).) +-- > (a + sign is added here) ([A-Za-z0-9 murals / /]) + $

After that, there will be no problem running the original program.

Finally, recommend a website that can check if there is a problem when the regular expression you write matches the corresponding string.

Online regex tester and debugger: PHP, PCRE, Python, Golang and JavaScript

For example, the URL that I have a problem with in this article will use the site to check and prompt: catastrophic backgracking (catastrophic backtracking).

When you click "regex debugger" in the lower left corner, it will tell you how many steps have been checked, and it will list all the steps and indicate where the backtracking occurred.

The regular expression in this article stopped automatically after a 110,000-step attempt. This shows that there is a problem with this regular expression and needs to be improved.

But when I test with the regular expression we modified, the following regular expression.

^ ([hH] [tT] {2} [pP]:\ / / | [hH] [tT] {2} [pP] [sS]:\ / /) (([A-Za-z0-9mura] +).) + ([A-Za-z0-9mura ~\ /]) + $

The tooltip completed the check in 58 steps.

This is the end of this article on "what to pay attention to when using regular expressions". I hope the above content can be of some help to you, so that you can learn more knowledge. If you think the article is good, please share it for more people to see.

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.