
What are the characters and character sets in Perl regular expressions

2025-04-05 Update From: SLTechnology News&Howtos



This article explains in detail the characters and character sets used in Perl regular expressions. The material is practical, so it is shared here for reference; I hope you get something out of it.

⑴ metacharacter

The regular expression language consists of two basic kinds of characters: literal text characters and metacharacters. A literal character is an actual text character or space that must be matched exactly, while a metacharacter is a character (or group of characters) that stands for one or more other characters and makes fuzzy matching possible. The meanings of commonly used metacharacters and their expressions are shown in the following table:

The "meta" in metacharacter means roughly "wildcard" (although it is not the same system as the Linux shell wildcards). In Perl, the backslash \ is a special metacharacter. To match a metacharacter itself (rather than what it means in the regular expression), put a backslash before it: \., \*, and \\ match ., *, and \ in the text, respectively. In addition, the metacharacter ^ matches the beginning of the line, or takes the complement when it is the first character inside a character set [], and the metacharacter $ matches the end of the line.
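The escaping and anchoring rules above can be tried out with a short script. This is only a sketch: the helper name matches and the sample strings are made up for illustration and do not come from the article.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $text matches $pattern, 0 otherwise.
sub matches {
    my ($text, $pattern) = @_;
    return $text =~ $pattern ? 1 : 0;
}

# An unescaped dot matches any character except a newline ...
print matches('3x14', qr/3.14/),  "\n";   # 1
# ... while an escaped dot matches only a literal dot.
print matches('3x14', qr/3\.14/), "\n";   # 0
print matches('3.14', qr/3\.14/), "\n";   # 1

# A literal asterisk and a literal backslash.
print matches('a*b',  qr/a\*b/), "\n";    # 1
print matches('a\\b', qr/a\\b/), "\n";    # 1 (the text is a, backslash, b)

# ^ anchors the match at the beginning, $ at the end.
print matches('fred flintstone', qr/^fred/), "\n";   # 1
print matches('alfred',          qr/^fred/), "\n";   # 0
print matches('alfred',          qr/fred$/), "\n";   # 1

# quotemeta() adds the backslashes for you.
my $literal = quotemeta('1+1=2');
print matches('1+1=2', qr/^$literal$/), "\n";        # 1
```

The built-in quotemeta function is convenient when the text to be matched literally comes from a variable and may contain metacharacters.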

⑵ pattern grouping and capture

In Perl regular expressions, parentheses () are also special metacharacters; they are used to group parts of a pattern. Many metacharacters operate only on a single character, and after grouping they can operate on several characters at once, as follows:

fred+ # matches freddddd
(fred)+ # matches fredfredfred

The text matched inside the parentheses, that is, inside a pattern group, can also be referred to later in the pattern through a backreference. A pattern group that can be referenced this way is also called a capture group. A backreference is written as a backslash followed by the capture group number, as follows:

(.)\1 # matches any character followed by the same character, that is, two consecutive identical characters
y(....) d\1 # matches y followed by any four characters, then " d" followed by the same four characters, for example yabba dabba
y(.)(.)\2\1 # matches y followed by any two characters and then the same two characters in reverse order, a palindrome such as yabba
y((.)(.)\3\2) d\1 # nested structure; matches phrases in which y and d are each followed by the same four-character palindrome, such as yabba dabba

For numbering the capture groups of complex nested structures, Perl has a very simple rule: count the left parentheses from left to right. If a backreference has to be followed directly by a digit, the number becomes ambiguous (\111, for example, is read as an octal character escape rather than as \1 followed by 11). Starting from Perl 5.10, the backreference can be written in the form \g{n}, where the braces mark the end of the number, as follows:

(.)\g{1}11 # matches strings like aa11

In this format, you can also use relative positions for numbering:

(.)(.)\g{-1}11 # matches strings like xaa11

A relative backreference uses a negative number to refer to the capture groups on its left; -1 is the capture group closest to the reference on the left. This avoids having to change every number after adding parentheses and makes the pattern easier to maintain.
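The backreference forms discussed above can be checked with the following sketch. The helper captures and the table of sample strings are hypothetical; the helper relies on the fact that a match in list context returns the captured groups.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the captured groups, or an empty list
# when the pattern does not match.
sub captures {
    my ($text, $re) = @_;
    my @groups = $text =~ $re;   # list context yields ($1, $2, ...)
    return @groups;
}

# Each entry: [ description, pattern, sample text ].
my @cases = (
    [ 'same char twice',     qr/(.)\1/,              'bookkeeper'  ],
    [ 'four chars repeated', qr/y(....) d\1/,        'yabba dabba' ],
    [ 'two-char palindrome', qr/y(.)(.)\2\1/,        'yabba'       ],
    [ 'nested groups',       qr/y((.)(.)\3\2) d\1/,  'yabba dabba' ],
    [ 'braced reference',    qr/(.)\g{1}11/,         'aa11'        ],
    [ 'relative reference',  qr/(.)(.)\g{-1}11/,     'xaa11'       ],
);

for my $case (@cases) {
    my ($what, $re, $text) = @$case;
    my @groups = captures($text, $re);
    printf "%-20s %-12s => %s\n", $what, $text,
        @groups ? "captured (@groups)" : 'no match';
}

# Without braces, \111 is an octal escape (the letter I), so this
# pattern does not match aa11.
print 'aa11' =~ /(.)\111/ ? "match\n" : "no match\n";
```

In the nested case the groups are numbered by their left parentheses: group 1 is the whole palindrome abba, group 2 is a, and group 3 is b.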

In many cases we only want to add parentheses for grouping, without changing the numbers of all the backreferences. To do that, keep the grouping function of the parentheses but turn off capturing by writing ?: immediately after the left parenthesis, as follows:

y(?:(.)(.)\2\1) d(?:(.)(.)\4\3) # the outer parentheses only group; matches phrases with a structure like yabba deffe
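To see how ?: changes the numbering, the sketch below matches the same text with capturing and with non-capturing outer parentheses. The variable names are illustrative only.

```perl
use strict;
use warnings;

my $text = 'yabba deffe';

# Outer parentheses capture: they become groups 1 and 4, so the inner
# groups are numbered 2, 3 and 5, 6.
my @with_capture = $text =~ /y((.)(.)\3\2) d((.)(.)\6\5)/;

# Outer parentheses are (?:...): only the inner groups are numbered,
# as 1, 2 and 3, 4.
my @without_capture = $text =~ /y(?:(.)(.)\2\1) d(?:(.)(.)\4\3)/;

print scalar(@with_capture),    " groups: @with_capture\n";     # 6 groups: abba a b effe e f
print scalar(@without_capture), " groups: @without_capture\n";  # 4 groups: a b e f
```

Both patterns match the same text; only the group numbers, and therefore the backreferences, differ.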

⑶ character set

A character set (character class) is a set of possible characters written inside square brackets []; it matches any single character contained in the set. For example, [abcxyz] matches any one of a, b, c, x, y, z appearing in a string. A hyphen (-) can be used to express a range of contiguous characters, so the expression above can also be written as [a-cx-z]. If the hyphen itself should be a member of the set (rather than marking a range), escape it with a backslash. ASCII characters can be written as a backslash plus their octal code; for example, [\000-\177] matches all 128 seven-bit ASCII characters (codes 0 through 127). A caret ^ at the beginning of the set takes the complement; for example, [^0-9] matches any character other than a digit. For Unicode, in addition to matching by code point, as in \x{2668}, you can also match by Unicode property. Many characters have the whitespace property Space or the digit property Digit, for example, and the corresponding expressions are \p{Space} and \p{Digit}.

Character sets make regular expressions shorter, and common character sets have abbreviations of their own: for example, \d for [0-9] and \w for [a-zA-Z0-9_]. However, since Perl moved from the ASCII era to the Unicode era, these abbreviations have become broader: \d matches not only the ordinary digits 0 to 9 but also the digits of many other scripts. Starting with Perl 5.14, you can add the modifier a after the closing delimiter of the regular expression (see the next section for more information on delimiters and modifiers), and the abbreviations then match strictly by ASCII; for example, /\d/a is equivalent to /[0-9]/. Changing the lowercase letter of an abbreviation to uppercase gives its complement; for example, under ASCII rules \D stands for [^0-9]. Interestingly, [\d\D] matches any character at all, including the newline character, so it covers more than '.', which by default does not match a newline.
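The following sketch exercises the bracketed character sets described above. The helper matches and the chosen sample characters (for example U+3000 as a whitespace character and U+0663 as a digit) are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $text matches $pattern, 0 otherwise.
sub matches {
    my ($text, $pattern) = @_;
    return $text =~ $pattern ? 1 : 0;
}

print matches('x', qr/^[abcxyz]$/), "\n";   # 1
print matches('b', qr/^[a-cx-z]$/), "\n";   # 1: b is inside the range a-c
print matches('d', qr/^[a-cx-z]$/), "\n";   # 0: d is in neither range

# An escaped hyphen is a literal member of the set, not a range.
print matches('-', qr/^[a\-z]$/), "\n";     # 1
print matches('m', qr/^[a\-z]$/), "\n";     # 0

# A leading caret takes the complement.
print matches('7', qr/^[^0-9]$/), "\n";     # 0
print matches('q', qr/^[^0-9]$/), "\n";     # 1

# Octal escapes: count how many of the code points 0..255 are in the set.
my $ascii_count = grep { chr($_) =~ /^[\000-\177]$/ } 0 .. 255;
print "$ascii_count\n";                     # 128

# Unicode: by code point and by property.
print matches("\x{2668}", qr/^\x{2668}$/),  "\n";   # 1
print matches("\x{3000}", qr/^\p{Space}$/), "\n";   # 1: IDEOGRAPHIC SPACE
print matches("\x{0663}", qr/^\p{Digit}$/), "\n";   # 1: ARABIC-INDIC DIGIT THREE
```

The count of 128 reflects that the octal range \000 to \177 covers codes 0 through 127 inclusive.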
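The behavior of the abbreviations can be checked in the same way. This sketch assumes Perl 5.14 or later for the a modifier; the helper name and the sample digit U+0663 are illustrative choices.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $text matches $pattern, 0 otherwise.
sub matches {
    my ($text, $pattern) = @_;
    return $text =~ $pattern ? 1 : 0;
}

my $arabic_three = "\x{0663}";   # ARABIC-INDIC DIGIT THREE, a non-ASCII digit

print matches('7',           qr/^\d$/),  "\n";   # 1
print matches($arabic_three, qr/^\d$/),  "\n";   # 1: Unicode rules
print matches($arabic_three, qr/^\d$/a), "\n";   # 0: ASCII rules only

# \w also covers the underscore.
print matches('_', qr/^\w$/), "\n";              # 1

# The uppercase form is the complement.
print matches('x', qr/^\D$/), "\n";              # 1
print matches('5', qr/^\D$/), "\n";              # 0

# The dot skips newlines by default; [\d\D] does not.
print matches("a\nb", qr/a.b/),      "\n";       # 0
print matches("a\nb", qr/a[\d\D]b/), "\n";       # 1
```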

⑷ metacharacter priority

Like operators or functions, metacharacters of regular expressions have precedence issues. The metacharacter priority rules are as follows:

① At the top of the priority table are the parentheses (), which are used for grouping and backreferencing patterns. Anything inside the parentheses binds more tightly than anything outside them.

② The second level is the quantifiers: the asterisk (*), the plus sign (+), the question mark (?), and the quantifiers written with curly braces, such as {5,15}, {3,}, and {5}. They bind tightly to the element before them.

③ The third level is anchors and sequence. Anchors include the beginning ^, the end $, the word boundary \b, and the non-word boundary \B. Sequence (one element followed by another) is actually an operation, even though it uses no metacharacter.

④ The lowest priority belongs to the vertical bar |, which means "or". Because it has the lowest priority, it usually divides the pattern into several alternatives.
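The effect of these priority levels can be seen by comparing patterns with and without parentheses. The helper matches and the sample strings below are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $text matches $pattern, 0 otherwise.
sub matches {
    my ($text, $pattern) = @_;
    return $text =~ $pattern ? 1 : 0;
}

# Quantifiers bind tighter than sequence: fred+ means fre followed by d+.
print matches('freddd',   qr/^fred+$/),   "\n";   # 1
print matches('fredfred', qr/^fred+$/),   "\n";   # 0
print matches('fredfred', qr/^(fred)+$/), "\n";   # 1: parentheses bind first

# Alternation binds loosest: ^fred|barney$ means (^fred) or (barney$).
print matches('frederick',  qr/^fred|barney$/),   "\n";   # 1: starts with fred
print matches('big barney', qr/^fred|barney$/),   "\n";   # 1: ends with barney
print matches('frederick',  qr/^(fred|barney)$/), "\n";   # 0: whole text must be one name
print matches('barney',     qr/^(fred|barney)$/), "\n";   # 1
```

When a pattern with | does not behave as expected, adding parentheses around the alternatives is usually the first thing to try.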
