
How to view the AST of a regular expression

2025-04-18 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

This article shows how to view the AST of a regular expression and how to use that AST to understand regular expression syntax.

Regular expressions are a staple of string processing. They make matching, extracting, and replacing text very convenient.

However, regular expressions are hard to learn. Concepts such as greedy and non-greedy matching, or capturing and non-capturing subgroups, are difficult not only for beginners but also for many people who have been working for several years.

So how can you learn regular expressions better and master them quickly?

Conceptually, a regular expression engine parses the pattern string into an AST and then uses that AST to match the target string.

All the information in the pattern string is stored in the AST after parsing. AST stands for abstract syntax tree: a tree organized according to the syntactic structure of the pattern. The structure of the AST therefore shows which syntax regular expressions support.
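To make the parse-then-match idea concrete, here is a minimal toy sketch. It is not how a real JavaScript engine or astexplorer.net works: `parsePattern`, `charMatches`, and `matchAst` are hypothetical helpers written for illustration, the node shape is an assumption, and the toy supports only literal characters and `\d`.

```javascript
// Toy parser: turns a pattern string into a small tree.
// Supports only literal characters and the \d metacharacter.
// The node shape ({ type, kind, value }) is an illustrative assumption.
function parsePattern(pattern) {
  const nodes = [];
  for (let i = 0; i < pattern.length; i++) {
    if (pattern[i] === "\\" && pattern[i + 1] === "d") {
      nodes.push({ type: "Char", kind: "meta", value: "\\d" });
      i++; // skip the 'd' that belongs to the escape
    } else {
      nodes.push({ type: "Char", kind: "simple", value: pattern[i] });
    }
  }
  return { type: "Alternative", expressions: nodes };
}

// Decide whether one Char node accepts one input character.
function charMatches(node, ch) {
  if (node.kind === "meta") return ch >= "0" && ch <= "9";
  return ch === node.value;
}

// Try every start position; return the first match or null.
function matchAst(ast, input) {
  const nodes = ast.expressions;
  for (let start = 0; start + nodes.length <= input.length; start++) {
    if (nodes.every((node, k) => charMatches(node, input[start + k]))) {
      return { match: input.slice(start, start + nodes.length), index: start };
    }
  }
  return null;
}

const ast = parsePattern("a\\d");
console.log(JSON.stringify(ast, null, 2));
console.log(matchAst(ast, "xxa7")); // { match: 'a7', index: 2 }
```

The parser records whether each character is simple or meta, and the matcher only consults the tree, never the original pattern string.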

How do you view the AST of a regular expression?

You can view it visually on the website astexplorer.net. Switch the parser language to RegExp, and the site displays the AST of the regular expression you enter.

As mentioned earlier, an AST is a tree organized by syntax, so its structure makes each piece of syntax easy to understand.

So let's learn the syntax from the perspective of the AST:

/abc/

Let's start with something simple. /abc/ is a regular expression that matches the string 'abc'. Its AST contains three Char nodes with the values a, b, and c, each of type simple. Matching then traverses the AST and matches the three characters one after another.

Testing it with the exec API returns a result in which element 0 is the matched string, index is the starting position of the match, and input is the input string.
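The exec test can be reproduced with a short sketch like the following. The input string "xxabcxx" is an assumption chosen for illustration.

```javascript
// exec returns a match array with extra properties, or null if nothing matches.
const result = /abc/.exec("xxabcxx");

console.log(result[0]);    // abc      (the matched string)
console.log(result.index); // 2        (where the match starts)
console.log(result.input); // xxabcxx  (the original input)
```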

Let's try some special characters:

/\d\d\d/

/\d\d\d/ matches three digits. \d is a metacharacter (meta char), a character with special meaning in regular expressions.

The AST shows that these are also Char nodes, but of type meta.

The \d metacharacter matches any digit.
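As a quick check of the \d behavior, a sketch with made-up input strings:

```javascript
const threeDigits = /\d\d\d/;

// Three consecutive digits are present, so the pattern matches.
const hit = threeDigits.exec("abc123def");
console.log(hit[0], hit.index); // 123 3

// Only two consecutive digits, so there is no match.
const miss = threeDigits.exec("ab12cd");
console.log(miss); // null
```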

With the AST, you can tell at a glance which characters are meta chars and which are simple chars.

/[abc]/

Regular expressions support specifying a set of characters with []. The pattern matches any one of the characters in the set.

In the AST, the characters are wrapped in a CharacterClass node. A character class matches any one of the characters it contains.

A test confirms this behavior.
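The character class can be tried out with a sketch like this one. The input strings are illustrative assumptions.

```javascript
const anyOfAbc = /[abc]/;

// The class matches a single character from the set, at the first position possible.
console.log(anyOfAbc.exec("xxbxx")[0]); // b
console.log(anyOfAbc.exec("cab")[0]);   // c

// No character from the set appears, so there is no match.
console.log(anyOfAbc.exec("xyz"));      // null
```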

/a{1,3}/

Regular expressions support specifying how many times a character is repeated, in the form {from,to}.

For example, /b{1,3}/ means the character b repeated 1 to 3 times, and /[abc]{1,3}/ means the character class a/b/c repeated 1 to 3 times.

In the AST, this syntax is called Repetition. It has a quantifier attribute that represents the quantifier. Here the quantifier type is range, from 1 to 3.
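The two range examples can be checked with a small sketch. The input strings are assumptions made for illustration.

```javascript
// b repeated 1 to 3 times: at most three of the four b's are consumed.
const bRange = /b{1,3}/.exec("abbbbc");
console.log(bRange[0], bRange.index); // bbb 1

// Any of a/b/c repeated 1 to 3 times.
const classRange = /[abc]{1,3}/.exec("xcabax");
console.log(classRange[0], classRange.index); // cab 1
```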

Regular expressions also support shorthand quantifiers: + for 1 or more times, * for 0 or more times, and ? for 0 or 1 time.

Each of them is a different type of quantifier.
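A sketch of the three shorthand quantifiers, applied to the character a after a literal b. The patterns and inputs are illustrative assumptions.

```javascript
// + requires at least one a.
console.log(/ba+/.exec("baaa")[0]); // baaa
console.log(/ba+/.exec("b"));       // null

// * allows zero or more a's.
console.log(/ba*/.exec("b")[0]);    // b
console.log(/ba*/.exec("baaa")[0]); // baaa

// ? allows zero or one a.
console.log(/ba?/.exec("b")[0]);    // b
console.log(/ba?/.exec("baaa")[0]); // ba
```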

You may ask what the greedy attribute means here.

The attribute indicates whether the Repetition matches greedily or non-greedily.

If you add a ? after the quantifier, greedy becomes false, which switches to non-greedy matching.

What do greedy and non-greedy mean?

Let's look at an example.

By default, a Repetition is greedy: it keeps matching as long as the condition is met, so the whole string acbac is matched here.

Add a ? after the quantifier to switch to non-greedy matching, and only the first character is matched.

That is the difference between greedy and non-greedy matching. The AST makes it clear that greediness is a property of repetition syntax: matching is greedy by default, and adding a ? after the quantifier switches it to non-greedy.
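The greedy and non-greedy behavior can be compared side by side. The pattern /[abc]+/ is an assumption here, since the original example pattern is not preserved. Only the string acbac comes from the text above.

```javascript
// Greedy (default): consume as many characters as the quantifier allows.
const greedy = /[abc]+/.exec("acbac");
console.log(greedy[0]); // acbac

// Non-greedy: a ? after the quantifier stops at the shortest possible match.
const lazy = /[abc]+?/.exec("acbac");
console.log(lazy[0]); // a
```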

/(aaa)bbb(ccc)/

Regular expressions support extracting part of the matched string as a subgroup with ().

In the AST, the corresponding node is called Group, and it has a capturing attribute that defaults to true.

What does that mean?

This is the syntax for subgroup capture.

If you don't want to capture a subgroup, write (?:aaa) instead, and capturing becomes false.

So what's the difference between capturing and non-capturing?

Let's try it.

It turns out that the capturing attribute of a Group indicates whether the content of the subgroup is extracted.

The AST shows that capturing is a property of subgroups. The default is capturing, which extracts the content of the subgroup. Writing ?: at the start of the group switches it to non-capturing, so the content is not extracted.
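A sketch of the difference, using the input string "aaabbbccc" as an illustrative assumption:

```javascript
// Both groups capture, so both appear after the full match.
const capturing = /(aaa)bbb(ccc)/.exec("aaabbbccc");
console.log([...capturing]); // [ 'aaabbbccc', 'aaa', 'ccc' ]

// The first group is non-capturing, so only ccc is extracted.
const nonCapturing = /(?:aaa)bbb(ccc)/.exec("aaabbbccc");
console.log([...nonCapturing]); // [ 'aaabbbccc', 'ccc' ]
```

Spreading the result into a plain array drops the index and input properties, which keeps the printed output short.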

Now that we are familiar with using the AST to understand regular expression syntax, let's look at something harder:

/bbb(?=ccc)/

Regular expressions support lookahead assertions with the syntax (?=xxx), which checks whether a given string follows the current position.

In the AST, this syntax is called Assertion and its type is lookahead: it looks ahead at what follows, but only the part before the assertion is matched.

What does that mean? Why would you write it? How does it differ from /bbb(ccc)/ and /bbb(?:ccc)/?

Let's try all three. The results show:

/bbb(ccc)/ matches ccc and extracts it as a subgroup, because subgroups are captured by default.

/bbb(?:ccc)/ matches ccc but does not extract it, because ?: makes the subgroup non-capturing.

/bbb(?=ccc)/ checks for ccc but does not extract a subgroup, so it is also non-capturing. The difference from ?: is that ccc does not appear in the match result at all.

This is the nature of a lookahead assertion: it requires that a certain string follows, the corresponding subgroup is not captured, and the asserted string does not appear in the match result.
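The three patterns can be compared with a sketch like the following. The input "bbbccc" is an assumption for illustration.

```javascript
const input = "bbbccc";

// Capturing group: ccc is part of the match and is extracted.
const withCapture = /bbb(ccc)/.exec(input);
console.log([...withCapture]); // [ 'bbbccc', 'ccc' ]

// Non-capturing group: ccc is part of the match but is not extracted.
const withoutCapture = /bbb(?:ccc)/.exec(input);
console.log([...withoutCapture]); // [ 'bbbccc' ]

// Lookahead: ccc must follow, but it is neither matched nor extracted.
const withLookahead = /bbb(?=ccc)/.exec(input);
console.log([...withLookahead]); // [ 'bbb' ]
```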

If the asserted string does not follow, there is no match.
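A one-line check of the failing case, with "bbbddd" as an assumed input:

```javascript
// bbb is present, but ccc does not follow it, so the lookahead fails.
const noFollow = /bbb(?=ccc)/.exec("bbbddd");
console.log(noFollow); // null
```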

/bbb(?!ccc)/

Change ?= to ?! and the meaning changes. In the AST it is still a lookahead assertion, but it now has a negative attribute set to true.

The original assertion requires that a certain string follows. Negated, it requires that the string does not follow.

The match results are exactly reversed: the pattern now matches only when the string does not follow. This is a negative lookahead assertion.
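A sketch of the reversed behavior, reusing the same two assumed inputs:

```javascript
const negative = /bbb(?!ccc)/;

// ccc does not follow, so the negative lookahead succeeds.
console.log(negative.exec("bbbddd")[0]); // bbb

// ccc follows, so the negative lookahead fails.
console.log(negative.exec("bbbccc"));    // null
```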

/ (?

That is all for "How to view the AST of a regular expression". Thank you for reading. I hope it helps you understand regular expressions better.
