What is the use of regular expressions

2025-02-22 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/02 Report--

This article explains what regular expressions are used for and how to write them. It is a practical topic, and I hope you find something useful here.

The basic concept of regular expression

When we build a web page, we often need to validate form data such as account numbers and ID numbers. The most effective and most frequently used way to do this is with regular expressions. So what is a regular expression?

A regular expression (Regular Expression) is a pattern used to describe a set of string characteristics and to match a specific string. It has a wide range of applications, especially in string processing. Its common applications are as follows:

Validating a string: checking that a given string or substring has the specified characteristics, for example that it is a legal e-mail address or a legal HTTP address.

Finding a string: locating text that has the specified characteristics within a given text, which is more flexible than searching for a fixed string.

Replacing a string: finding text that has the specified characteristics and replacing it.

Extracting a string: pulling out the substrings of a given string that have the specified characteristics.
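As a rough illustration of these four uses, here is a minimal sketch with Python's re module. The e-mail pattern and the sample text are assumptions made up for this example, and the pattern is deliberately simplistic rather than a complete address validator.

```python
import re

text = "Contact: alice@example.com, bob@example.org"

# Simplified e-mail pattern, for illustration only.
EMAIL = r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"

# 1. Validate: does the whole string fit the pattern?
is_valid = re.fullmatch(EMAIL, "alice@example.com") is not None

# 2. Find: locate the first substring that fits the pattern.
first = re.search(EMAIL, text).group()

# 3. Replace: substitute every match with other text.
masked = re.sub(EMAIL, "<hidden>", text)

# 4. Extract: collect every matching substring.
found = re.findall(EMAIL, text)

print(is_valid, first, masked, found, sep="\n")
```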

Part one: tools for regular expressions

As the saying goes, to do good work you must first sharpen your tools. These are the main tools worth knowing:

At http://www.regexpal.com/ we can test regular expressions online.

http://regexr.com/ is the site I recommend more; it includes an example that we can test against directly.

Part two: metacharacters of regular expressions

Metacharacters are probably the term we hear most often with regular expressions. A metacharacter is a special character that matches either a position or one character from a set of characters. For example, . and \w are metacharacters.

Since metacharacters can match either positions or characters, we can divide them into metacharacters that match positions and metacharacters that match characters.

A. Metacharacters that match positions: ^, $, \b

The metacharacters that match positions are ^ (caret), $ (dollar sign) and \b. They match the beginning of a line, the end of a line, and the beginning or end of a word, respectively. All they match is a position.

1. ^ matches the start of a line

For example, ^zzw matches a "zzw" that sits at the start of a line (note: despite the added ^, it still matches only the string zzw, not the whole line). If zzw is not at the start of a line, it is not matched.

2. $ matches the end of a line

For example, zzw$ matches a "zzw" that sits at the end of a line (again, $ only matches a position, and that position has zero width; it does not match the whole line). If zzw is not at the end of a line, it is not matched.

Combining ^ and $, it is not hard to guess that ^zzw$ matches only a line that consists of exactly the string zzw.

^$ matches a blank line, that is, a line that contains no characters.
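The line anchors can be tried out with a short sketch like the one below. It assumes Python's re module, where ^ and $ work per line only when the re.MULTILINE flag is passed; the sample text is made up for illustration.

```python
import re

lines = "zzw is here\nhere is zzw\nzzw\n\nend"

# re.MULTILINE makes ^ and $ match at every line boundary,
# not only at the start and end of the whole string.
starts = [m.start() for m in re.finditer(r"^zzw", lines, re.MULTILINE)]
ends = [m.start() for m in re.finditer(r"zzw$", lines, re.MULTILINE)]
whole = re.findall(r"^zzw$", lines, re.MULTILINE)
blank = [m.start() for m in re.finditer(r"^$", lines, re.MULTILINE)]

# Each match covers only "zzw" (three characters), never the whole line.
first_span = re.search(r"^zzw", lines, re.MULTILINE).span()

print(starts, ends, whole, blank, first_span)
```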

3. \b matches the beginning or end of a word

For example, \bzzw matches a zzw that is preceded by a space, punctuation, or a line break (note: \b matches only a zero-width position, not the space, punctuation, or line break itself).

zzw\b matches a zzw that is followed by a space, punctuation, or a line break (again, \b matches a zero-width position).

So \bzzw\b matches a zzw that has a space, punctuation, or a line break both before and after it, that is, zzw as a whole word. The start and end of the text also count as word boundaries.
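A small sketch of the word-boundary behaviour, assuming Python's re module and a made-up sample string:

```python
import re

text = "zzw, zzwx xzzw (zzw)"

# Only the standalone occurrences are whole words.
whole_word = [m.start() for m in re.finditer(r"\bzzw\b", text)]

# \bzzw only requires a boundary in front, so "zzwx" also contributes a match.
prefix = re.findall(r"\bzzw", text)

# \b itself consumes nothing: the match is an empty, zero-width span.
boundary_span = re.search(r"\b", "zzw").span()

print(whole_word, prefix, boundary_span)
```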

B. Metacharacters that match characters: ., \w, \W, \s, \S, \d, \D

That is, there are seven metacharacters that match characters.

. (period) matches any character except the newline character.

\w matches word characters (not only letters, but also underscores, digits, and, in Unicode-aware engines, Chinese characters); \W matches any non-word character (the opposite of \w).

\s matches any whitespace character (such as spaces, tabs, newlines, and Chinese full-width spaces).

\S matches any non-whitespace character (the opposite of \s).

\d matches any digit.

\D matches any non-digit character (the opposite of \d).
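The character-matching metacharacters can be checked with a sketch like this one. It assumes Python 3 string patterns, which are Unicode-aware by default; other engines (JavaScript, for instance) treat \w as ASCII-only. The sample strings are invented for the example.

```python
import re

sample = "Tel: 555_0199\tok\n"

word_chars = re.findall(r"\w", sample)   # letters, digits, underscore
digits = re.findall(r"\d", sample)
spaces = re.findall(r"\s", sample)       # space, tab, newline
non_words = re.findall(r"\W", sample)

# . matches every character except the newline.
dot_chars = re.findall(r".", "a\nb")

# In Python 3, \w also matches Chinese characters and \s the full-width space.
chinese = re.findall(r"\w", "中文")
full_width_space = re.fullmatch(r"\s", "\u3000") is not None

print(len(word_chars), "".join(digits), spaces, non_words, dot_chars)
```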

Examples are as follows:

^.$ matches a line that consists of exactly one character, which can be any character other than a newline.

\ba\w\w\w\w\w\w\w\w\w\b matches a word that starts with the letter a followed by nine word characters. (Note: the a here is not a metacharacter but an ordinary character, which we call a string literal; a string literal matches exactly what it looks like.)

\b\w\w\w\d\d\d\d\D\b matches a word that starts with three word characters, followed by four digits, and ends with one non-digit character.
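The first two patterns described above can be tried as follows. This is a sketch assuming Python's re module; the test words are made up, and the nine repeated \w tokens are written out on purpose to mirror the description.

```python
import re

# A line holding exactly one non-newline character (per line, hence MULTILINE).
one_char_line = re.compile(r"^.$", re.MULTILINE)

# The literal letter a followed by nine word characters, as a whole word.
a_plus_nine = re.compile(r"\ba\w\w\w\w\w\w\w\w\w\b")

single_chars = one_char_line.findall("x\nyz\n!")
long_word = a_plus_nine.search("absolutely") is not None   # 10 letters
short_word = a_plus_nine.search("apple") is not None       # too short

print(single_chars, long_word, short_word)
```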

Part III: text matching of regular expressions

We will cover this part through character classes, character escapes, and negation.

A. Character classes

A character class is a "mini" language inside regular expressions, defined within [].

The simplest character class consists of [] and a few plain characters. For example, [aeiou] matches any one of the five letters a, e, i, o, u, and [0123456] matches any one of the seven digits 0 through 6. Likewise, <H[123456]> matches any one of the HTML tags <H1> through <H6>, and [bhc]at matches the strings bat, hat, and cat. In other words, a character class [] matches exactly one of the characters listed in it.

Writing out [0123456] is tedious, so we can use - (hyphen) to shorten it to [0-6]. It follows that [0-9] has the same effect as \d. [a-z] represents all lowercase letters, [A-Z] all uppercase letters, and [a-zA-Z] all uppercase and lowercase letters.

Note that - (hyphen) means "to" only when it sits between two characters inside the character class. In [-b]5, the - is not between two characters, so the pattern matches -5 or b5.

In addition, we know that ^ normally matches the start of a line, but when ^ is the first character inside a character class, it negates the class. For example, [^123] matches any character that is not the digit 1, 2, or 3, and [^-] matches any character that is not a -.

From this we can also see that metacharacters such as - and ^ usually need no escaping inside a character class: - is special only between two characters, and ^ only in the first position.

Commonly used classes include [^aeiou], which matches any character other than a lowercase vowel; [0-9a-zA-Z_], which matches any digit, letter (uppercase or lowercase), or underscore and is equivalent to \w for ASCII text; and [^0-9a-zA-Z_], which matches any character that is not a digit, letter, or underscore and is equivalent to \W for ASCII text.
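The character-class rules above can be sketched like this, assuming Python's re module and invented sample strings:

```python
import re

# One character from the class, followed by the literal "at".
bhc = re.findall(r"[bhc]at", "bat hat cat rat")

# A range inside the class.
low_digits = re.findall(r"[0-6]", "0123456789")

# A leading hyphen is a literal hyphen, not a range operator.
hyphen_or_b = re.findall(r"[-b]5", "-5 b5 a5")

# A leading ^ negates the class.
not_123 = re.findall(r"[^123]", "12345")

# Hand-written ASCII equivalent of \w.
word_runs = re.findall(r"[0-9a-zA-Z_]+", "user_01 & co")

print(bhc, low_digits, hyphen_or_b, not_123, word_runs)
```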

B. Character escapes

Characters such as $, ^, and . are metacharacters. If we want to match them as ordinary characters, and they are not inside a character class (such as [$]), we need to escape them with \ (backslash).

For example, we can use www\.yisu\.com to match www.yisu.com, and \* to match the * (wildcard) in a string. We can also match a backslash itself with \\.
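To see why the escaping matters, here is a minimal sketch assuming Python's re module. The strings wwwXyisuYcom and C:\temp\x are made-up test inputs; re.escape is the standard-library helper that adds the backslashes automatically.

```python
import re

# Unescaped dots match any character, so this "wrong" host is accepted.
loose = re.fullmatch(r"www.yisu.com", "wwwXyisuYcom") is not None

# Escaped dots match only a literal dot.
strict = re.fullmatch(r"www\.yisu\.com", "wwwXyisuYcom") is not None
exact = re.fullmatch(r"www\.yisu\.com", "www.yisu.com") is not None

# Literal * and literal backslash.
stars = re.findall(r"\*", "a*b*c")
backslashes = re.findall(r"\\", r"C:\temp\x")

# Let the library escape a literal string for us.
auto = re.escape("www.yisu.com")

print(loose, strict, exact, stars, len(backslashes), auto)
```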

C. Negation

We have already covered this, and I repeat it here only to draw attention to it: a ^ at the start of a character class negates the characters in that class. For example, a[^b] matches an a followed by one character that is not b. Another example is [^>], which matches any one character other than >.
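A quick sketch of negated classes, assuming Python's re module. The tag-matching pattern <[^>]+> is an illustrative assumption, not a pattern taken from the article, and it is far too simple for real HTML parsing.

```python
import re

# "a" followed by exactly one character that is not "b".
# The final "a" has no following character, so it does not match.
a_not_b = re.findall(r"a[^b]", "ab ac a1 a")

# "<", then one or more characters that are not ">", then ">".
tags = re.findall(r"<[^>]+>", "<b>bold</b>")

print(a_not_b, tags)
```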

Part IV: quantifiers for regular expressions

What is a quantifier (sometimes translated as qualifier)? In the earlier example, I used \ba\w\w\w\w\w\w\w\w\w\b to match words with nine word characters after the letter a, which is obviously tedious to write. It would be nice to write such repetitions in a compact form, and that is exactly what quantifiers are for. With a quantifier we can rewrite the pattern as \ba\w{9}\b. It's that simple! Let's look at the details.

{n} means repeat exactly n times; for example, \w{5} matches exactly five word characters.

{n,} means repeat at least n times; for example, \w{5,} matches at least 5 word characters, so 6 or 7 also match.

{n,m} means repeat at least n times and at most m times; for example, \w{5,10} matches at least 5 and at most 10 word characters.

* means repeat zero or more times and is equivalent to {0,}; for example, hu*t matches ht, hut, huut, or huuut.

+ means repeat one or more times and is equivalent to {1,}; for example, hu+t matches hut, huut, or huuut.

? means repeat 0 or 1 times and is equivalent to {0,1}; for example, colou?r matches color or colour.

Note that each quantifier above applies to the single item immediately in front of it.
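The quantifiers above can be exercised with a sketch like this one, assuming Python's re module. The word list is invented, and \b is added around the counted patterns so that each one is tested against whole words.

```python
import re

words = "hello hi world wonderful"

exactly_5 = re.findall(r"\b\w{5}\b", words)
at_least_5 = re.findall(r"\b\w{5,}\b", words)
from_2_to_5 = re.findall(r"\b\w{2,5}\b", words)

candidates = ["ht", "hut", "huut", "hat"]
star = [w for w in candidates if re.fullmatch(r"hu*t", w)]   # zero or more u
plus = [w for w in candidates if re.fullmatch(r"hu+t", w)]   # one or more u

# The ? applies only to the u directly in front of it.
optional_u = re.findall(r"colou?r", "color colour colouur")

print(exactly_5, at_least_5, from_2_to_5, star, plus, optional_u)
```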

But what if we add a ? after one of the quantifiers above? The result is called a lazy quantifier. Correspondingly, the plain forms above perform greedy matching.

{n}? is equivalent to {n}.

{n,}? repeats as few times as possible, but at least n times.

{n,m}? repeats between n and m times, using as few repetitions as possible.

*? repeats as few times as possible, possibly zero times.

+? repeats as few times as possible, but at least once.

?? uses zero repetitions if possible, otherwise one.

For example, for the string aabab, the greedy a.*b matches all of aabab, while the lazy a.*?b matches aab and then ab rather than the whole string.
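The aabab example can be reproduced directly. This sketch assumes Python's re module; the second pair of lines, using a digit string, is an extra illustration that is not from the article.

```python
import re

greedy = re.findall(r"a.*b", "aabab")    # takes as much as it can
lazy = re.findall(r"a.*?b", "aabab")     # stops at the first possible b

# The same contrast with a counted quantifier.
greedy_digits = re.search(r"\d{2,4}", "123456").group()
lazy_digits = re.search(r"\d{2,4}?", "123456").group()

print(greedy, lazy, greedy_digits, lazy_digits)
```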

Part V: the operation of the characters of regular expressions

Character operations include alternation, grouping, and backreferences, which I will describe one by one.

A. Alternation

What is alternation? It means that if one option does not match, another is tried in its place. For example, 0\d{3}-\d{7}|0\d{2}-\d{8} matches a phone number with a four-digit area code and a seven-digit local number, and it also matches one with a three-digit area code and an eight-digit local number. The | is the alternation operator. For example, [Jj]ack and Jack|jack have the same effect: both match Jack or jack. In other words, alternation with | is an OR operation.

In an ordinary OR operation, 0 OR 0 is 0, 0 OR 1 is 1, 1 OR 0 is 1, and 1 OR 1 is 1. The same holds in regular expressions: if neither alternative matches, there is no match; if one alternative matches, that one is matched; if the text contains strings matching both alternatives, both are matched.
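A short sketch of alternation, assuming Python's re module. The phone numbers are made-up test values that only follow the digit counts described above.

```python
import re

phone = re.compile(r"0\d{3}-\d{7}|0\d{2}-\d{8}")

four_plus_seven = phone.fullmatch("0123-4567890") is not None
three_plus_eight = phone.fullmatch("012-34567890") is not None
neither = phone.fullmatch("01-234567890") is not None

# A character class and an alternation that find the same names.
by_class = re.findall(r"[Jj]ack", "Jack and jack")
by_alternation = re.findall(r"Jack|jack", "Jack and jack")

print(four_plus_seven, three_plus_eight, neither, by_class == by_alternation)
```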

B. Grouping

Grouping is also a very important concept in regular expressions. It may look complex, but grouping simply uses "(" and ")", that is, opening and closing parentheses, to enclose certain characters so that they are treated as a whole.

For example, suppose we want to match abcabcabc. The pattern abc{3} matches abccc instead, which is not what we expect, so we group abc and write (abc){3} to match the string we want.

As another example, (\d{1,3}\.){3}\d{1,3} also uses grouping and can be used to match simple IP addresses.
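The effect of grouping on a quantifier, and the limits of the simple IP pattern, can be seen in this sketch. It assumes Python's re module, and the addresses are made-up test values.

```python
import re

# Without a group, {3} applies only to the c.
ungrouped = re.fullmatch(r"abc{3}", "abccc") is not None

# With a group, {3} applies to the whole abc.
grouped = re.fullmatch(r"(abc){3}", "abcabcabc") is not None

ip = re.compile(r"(\d{1,3}\.){3}\d{1,3}")
valid_ip = ip.fullmatch("192.168.1.1") is not None
too_short = ip.fullmatch("1.2.3") is not None

# The pattern checks only the shape, so out-of-range numbers still pass.
out_of_range = ip.fullmatch("999.999.999.999") is not None

print(ungrouped, grouped, valid_ip, too_short, out_of_range)
```

The last check is why the pattern is only suitable for "simple" IP matching: it does not verify that each number is at most 255.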

C. Backreferences

As shown above, we can group with (), and each group is automatically assigned a group number that stands for the text matched by that group.

The numbering rule is: groups are numbered from left to right by the position of their opening parenthesis, so the first group is number 1, the second group is number 2, and so on.

This is where backreferences come in handy. We can use one to refer back to the text captured by a group enclosed in (). How exactly? The rules are as follows:

\number: a backreference by group number. Note: this is the common way.

\k<name>: a backreference by a specified group name. Note: this way is supported by the .NET Framework.

The following examples use numbered backreferences:

Compare a pattern without a backreference, such as \b\w\w\b, with \b(\w)\1\b. The two do not match the same things: the first matches a word made up of any two word characters, while the second, because of the backreference, matches only a word made up of the same word character repeated twice.

The last example uses two groups. According to the numbering rule, the outer group \w{3}\d{2} is group 1 and the \d{2} nested inside it is group 2. Also note that a backreference must match the same text that its group captured: in www55www5566 the last two digits differ from the captured 55, so the string does not match.
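Numbered backreferences can be sketched as below, assuming Python's re module. The nested pattern (\w{3}(\d{2}))\1\2 is an assumption reconstructed from the description of the two groups, not a pattern quoted from the article.

```python
import re

sample = "aa ab 55 abc"

# Any two word characters forming a word.
any_two = re.findall(r"\b\w\w\b", sample)

# The same word character twice: \1 must repeat what group 1 captured.
same_two = [m.group() for m in re.finditer(r"\b(\w)\1\b", sample)]

# Group 1 is the outer group, group 2 the nested \d{2}
# (groups are numbered by their opening parenthesis).
nested = re.compile(r"(\w{3}(\d{2}))\1\2")
ok = nested.fullmatch("www55www5555")
bad = nested.fullmatch("www55www5566")

print(any_two, same_two, ok.groups(), bad)
```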

Backreferences with a specified name (that is, a custom name)

The second example above can be written with a custom-named backreference as \b(?<myName>\w)\k<myName>\b or \b(?'myName'\w)\k'myName'\b. I wanted to try an example, but every attempt reported an error; the two online sites mentioned above may not support this syntax.

Of course, if we want to treat a group as a whole without giving it a number, we can use the (?:expression) form.
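Named groups are engine-specific, which may explain the errors. As a point of comparison, Python's re module spells them (?P<name>...) for the group and (?P=name) for the backreference, instead of the .NET forms. A minimal sketch, reusing the group name myName:

```python
import re

# Python syntax: (?P<myName>...) defines the group, (?P=myName) refers back to it.
doubled = re.compile(r"\b(?P<myName>\w)(?P=myName)\b")

matches = [m.group() for m in doubled.finditer("aa ab 55")]
first_capture = doubled.search("aa ab 55").group("myName")

print(matches, first_capture)
```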
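The difference between a capturing and a non-capturing group shows up in the group list. A small sketch, assuming Python's re module and a made-up input string:

```python
import re

# (abc) is numbered as group 1, so (\d) becomes group 2.
capturing = re.search(r"(abc){2}(\d)", "abcabc7").groups()

# (?:abc) still groups for the quantifier but gets no number,
# so (\d) is the only captured group.
non_capturing = re.search(r"(?:abc){2}(\d)", "abcabc7").groups()

print(capturing, non_capturing)
```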

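In addition, the following groups are also commonly used:

(?=expression) matches a position that is followed by the string expression

(?!expression) matches a position that is not followed by the string expression

(?<=expression) matches a position that is preceded by the string expression

(?<!expression) matches a position that is not preceded by the string expression

(?>expression) is an atomic group: once expression has matched, the engine does not backtrack into it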
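The lookahead and lookbehind groups can be illustrated as follows. This is a sketch assuming Python's re module; the price strings are invented, and the atomic group is left out because it is only available in newer Python versions.

```python
import re

prices = "10 dollars 20 euros"

# Digits that are followed by " dollars" (the lookahead text is not consumed).
followed = re.findall(r"\d+(?= dollars)", prices)

# Whole numbers that are NOT followed by " dollars".
not_followed = re.findall(r"\b\d+\b(?! dollars)", prices)

amounts = "$30 and 40"

# Digits preceded by a dollar sign.
preceded = re.findall(r"(?<=\$)\d+", amounts)

# Numbers NOT preceded by a dollar sign.
not_preceded = re.findall(r"(?<!\$)\b\d+", amounts)

print(followed, not_followed, preceded, not_preceded)
```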

D. Zero-width assertions

The ^ and $ introduced earlier match positions that satisfy certain conditions. A condition of this kind is called an assertion, or a zero-width assertion.

The commonly used ones are:

^ matches the start of a line

$ matches the end of a line

\A the match must occur at the start of the string

\Z the match must occur at the end of the string, or before the newline \n at the end of the string

\z the match must occur at the end of the string

\G the match must occur at the point where the previous match ended

\b matches the start or end of a word

\B matches a position that is not the start or end of a word
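Support for these assertions varies by engine. The sketch below assumes Python's re module, where \Z matches only at the very end of the string (the behaviour listed for \z above) and \G is not available, so only \A, \Z, and \B are shown next to ^ and $.

```python
import re

text = "one\ntwo\n"

# ^ and $ work per line with MULTILINE; \A and \Z ignore the flag.
line_starts = re.findall(r"^\w+", text, re.MULTILINE)
string_start = re.findall(r"\A\w+", text, re.MULTILINE)
line_ends = re.findall(r"\w+$", text, re.MULTILINE)

# The string ends with "\n", so no word sits directly at the end of the string.
string_end = re.findall(r"\w+\Z", text)

# \B requires that there is NO word boundary at that position.
inside_word = re.findall(r"\Bzzw\B", "xzzwx zzw")

print(line_starts, string_start, line_ends, string_end, inside_word)
```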

The groups (?=expression) and (?!expression) mentioned earlier, along with the (?<=expression) and (?<!expression) forms, are also zero-width assertions. (?=expression) is called a zero-width positive lookahead assertion: it asserts that expression can be matched immediately after its own position. For example, \b\w+(?=ed\b) matches the first part of a word that ends with the string ed, such as reset in reseted.
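The reseted example can be checked directly. A minimal sketch, assuming Python's re module; the extra words walked and walk are added here for contrast.

```python
import re

stem = re.compile(r"\b\w+(?=ed\b)")

# Only words ending in "ed" produce a match, and the "ed" is not part of it.
stems = stem.findall("reseted walked walk")

# The span covers "reset" only: the lookahead checks "ed" without consuming it.
span = stem.search("reseted").span()

print(stems, span)
```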

Among them (?
