What are the main points of regular expressions

2025-07-15 Update, from: SLTechnology News&Howtos

Shulou(Shulou.com)05/31 Report--

This article introduces the main points of regular expressions in detail and is intended as a reference.

A regular expression is a powerful, convenient, and efficient text-processing tool.

Regular expressions achieve the desired function by combining metacharacters with ordinary characters.

Typical metacharacters:

^ caret, matches the position at the beginning of a line

$ dollar sign, matches the position at the end of a line

| vertical bar (pipe), alternation: matches either of the subexpressions it separates

(...) parentheses, limit the scope of the vertical bar (pipe). Plain grep does not support them in this form; use grep -E or egrep

[...] character class (character group), matches one of several listed characters (the construct [] lists the characters that are allowed to match at that position)

[^...] negated character class, matches any one character that is not listed

- hyphen, denotes a range; it is a metacharacter only when used inside a character class

. dot, shorthand for a character class that matches any character

\< backslash less-than, a word boundary that matches the beginning of a word

\> backslash greater-than, a word boundary that matches the end of a word (a word being a run of alphanumeric characters)

The same character can have different meanings in different positions, so pay attention to the distinction. Adding the option -i makes the match case-insensitive.

Case analysis:

cat looks for cat anywhere in a line of text, such as cat, catalog, scat, scatter

^cat looks for cat at the beginning of a line, such as catalog

cat$ looks for cat at the end of a line, such as scat

^cat$ matches lines that contain only cat

^$ matches an empty line with no characters at all (not even whitespace)

^ used alone matches the beginning of every line, so it is not meaningful by itself

\<cat\> matches the word cat

^(From|Subject): looks for a line that starts with From: or Subject:
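
The anchor cases above can be tried directly with grep. The following is a minimal sketch; the sample words and variable names are made up for illustration and do not come from the original article.

```shell
# Hypothetical sample lines, one word or phrase per line.
words=$(printf '%s\n' cat catalog scat scatter 'the cat sat')

# cat anywhere in the line: every sample line qualifies.
anywhere=$(printf '%s\n' "$words" | grep -c 'cat')
# ^cat: only lines that start with cat.
at_start=$(printf '%s\n' "$words" | grep '^cat')
# cat$: only lines that end with cat.
at_end=$(printf '%s\n' "$words" | grep 'cat$')
# ^cat$: only lines that consist of exactly cat.
only_cat=$(printf '%s\n' "$words" | grep '^cat$')
# ^$: only truly empty lines; a line holding a single space does not count.
empty=$(printf 'a\n\n \n' | grep -c '^$')

echo "lines containing cat: $anywhere"
echo "starting with cat: $at_start"
echo "ending with cat: $at_end"
echo "exactly cat: $only_cat"
echo "empty lines: $empty"
```

Note that 'the cat sat' contains cat but neither starts nor ends with it, so only the unanchored expression selects it.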

Character class: [123456] matches any one digit from 1 to 6

It can also be written as [1-6]; the - is a hyphen and denotes a range when used inside a character class.

[0-9] matches a digit

[a-z] matches a lowercase letter

[A-Z] matches an uppercase letter

[0-9A-Z_!.?] matches a digit, an uppercase letter, an underscore, an exclamation point, a dot, or a question mark

[0123456789abcdefABCDEF] can be written as [0-9a-fA-F] or [A-Fa-f0-9]. The order does not matter, and the class is suitable for handling hexadecimal digits.

[abc], [a-c], and (a|b|c) all mean the same thing, but not every alternation can be rewritten as a character class. Character classes are generally more efficient.
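
As a rough illustration of the character classes above, the sketch below filters hexadecimal strings and compares a class with the equivalent alternation. The sample inputs are invented, and setting LC_ALL=C is an assumption made here so that ranges such as a-f behave as plain ASCII ranges.

```shell
export LC_ALL=C   # assumption: C locale, so a-f and A-F are plain ASCII ranges

# Keep only lines made up entirely of hexadecimal digits.
hex=$(printf '%s\n' 1aF3 0x1G beef CAFE9 xyz | grep -E '^[0-9a-fA-F]+$')

# [abc] and (a|b|c) select the same lines.
with_class=$(printf '%s\n' a b c d | grep -c '[abc]')
with_alt=$(printf '%s\n' a b c d | grep -E -c '(a|b|c)')

echo "$hex"
echo "class=$with_class alternation=$with_alt"
```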

Negated character class: [^...] matches any one character that is not listed. The ^ at the beginning of the class means negation, and the characters that follow are the ones you do not want to match.

n[^d] matches a line in which an n is followed by a character other than d

[^d] matches a line that contains at least one character other than d; that is, it excludes lines consisting only of the letter d

A negated character class means "match a character that is not listed", not "do not match what is listed".
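
The difference between "match a character that is not listed" and "do not match what is listed" shows up when nothing follows the character in question. A small sketch with invented sample words:

```shell
# n[^d] needs a real character after the n, so a line-final n does not match.
after_n=$(printf '%s\n' and ant can snow | grep 'n[^d]')

# [^d] needs at least one character that is not d.
not_only_d=$(printf '%s\n' d dd dad | grep '[^d]')

echo "$after_n"
echo "$not_only_d"
```

Here can is rejected even though its n is not followed by d, because [^d] must still consume one character.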

How to match a file directory or path:

./([^/]+)/ matches ./ followed by one or more characters that are not / (as many as possible), followed by /

This regular expression turns out not to be rigorous enough,

because . matches any single character (not necessarily a literal dot; it could also be a letter or another character).

To match a literal ., escape it with a backslash (\.); adding the anchors ^ and $ tightens the match further.

Example:

[root@CentOS6 shell]# grep -E '\./([^/]+)/' dir.txt    # with the escape character \

./home/

./...yes../

./new/

./100/

./China/

./2015/

./2014-12-31/

[root@CentOS6 shell]# grep -E '^./([^/]+)/$' dir.txt    # with the line-beginning and line-end anchors ^ and $

./home/

./...yes../

./new/

./100/

./China/

./2015/

./2014-12-31/

[root@CentOS6 shell]# cat dir.txt

//

.//

./home/

./...yes../

./~/

./new/

./100/

./China/

./2015/

./2014-12-31/

/hometown/

Yesterday

@China

./

.//

Abcd//~/

..//../

[root@CentOS6 shell]#
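
To see why the unescaped dot is too loose, the sketch below compares the loose expression with an escaped and anchored one. The sample paths are hypothetical and are not the contents of dir.txt.

```shell
# Hypothetical sample paths, one per line.
paths=$(printf '%s\n' './home/' './2014-12-31/' 'Abcd/x/' './' '/hometown/')

# Unescaped dot: any character may stand before the first slash.
loose=$(printf '%s\n' "$paths" | grep -E './([^/]+)/')

# Escaped dot plus anchors: the line must be exactly ./name/
strict=$(printf '%s\n' "$paths" | grep -E '^\./([^/]+)/$')

echo "loose:"
echo "$loose"
echo "strict:"
echo "$strict"
```

The loose form also accepts Abcd/x/, because the d in front of the slash satisfies the unescaped dot.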

Matching any character with a dot: the metacharacter . (usually called dot or point) is shorthand for a character class that matches any character.

If you need a placeholder in an expression that matches any character, the dot is convenient.

For example:

The expression can be written with character classes, 03[-./]19[-./]76, to search for 03-19-76, 03.19.76, and 03/19/76.

You can also use dots in place of the character classes: 03.19.76

Relatively speaking, 03[-./]19[-./]76 is more precise, because 03.19.76 also matches other characters in the separator positions. Which one to use depends on the target text.

A dot inside a character class is not a metacharacter; it stands only for a literal dot.

Whether a hyphen inside a character class denotes a range depends on its position: it is a range only in the middle, not at the beginning or end (and not right after [^). For example, the - in [.-/] denotes a range, but the - in [^-/.] and [./-] does not.
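
The trade-off between the character class and the dot, and the effect of the hyphen's position, can be checked with a short sketch. The sample dates are invented, and LC_ALL=C is an assumption made here so that the range .-/ is interpreted by ASCII code.

```shell
export LC_ALL=C   # assumption: C locale, so the range .-/ follows ASCII order

# Hypothetical sample lines using different separators.
dates=$(printf '%s\n' 03-19-76 03.19.76 03/19/76 03x19y76 '03 19 76')

# Character class: only the separators -, . and / are accepted.
by_class=$(printf '%s\n' "$dates" | grep -c '03[-./]19[-./]76')

# Dot: any character is accepted in the separator positions.
by_dot=$(printf '%s\n' "$dates" | grep -c '03.19.76')

# In [.-/] the hyphen sits in the middle, so it denotes the range from . to /
# and no longer matches a literal hyphen.
as_range=$(printf '%s\n' 03-19-76 | grep -c '03[.-/]19' || true)

echo "class=$by_class dot=$by_dot range=$as_range"
```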

Alternation with the pipe |: the | is also a metacharacter, meaning "or". It combines different subexpressions into one overall expression that matches any of them; each subexpression is called an "alternative".

For example: gr[ea]y can be written as grey|gray, or as gr(a|e)y.

Be careful not to write gr[a|e]y, where the | is just an ordinary character because it is inside a character class.

An alternation can contain many characters, but it cannot extend beyond the enclosing parentheses.

A character class matches only a single character in the target text, whereas each alternative can itself be a complete regular expression that matches text of any length.

(first|1st) has the same meaning as (fir|1)st

(First|1st) [Ss]treet and (Fir|1)st [Ss]treet have the same meaning

Be careful when using the caret ^ and the dollar sign $ in an expression that contains alternation.

Analysis:

^From|Subject|Date: matches ^From, or Subject, or Date:

^(From|Subject|Date): matches ^From:, ^Subject:, or ^Date:, and is commonly used to extract information from e-mail files

The two expressions produce different matches.

If you want every alternative to be preceded by ^ and followed by :, you need to enclose the alternatives in parentheses.

Typical usage:

grep -E 'From:|Subject:' test.txt

grep -E '^(From|Subject|Date):' test.txt    # single or double quotation marks both work; -E enables extended regular expressions

egrep '^(From|Subject|Date):' test.txt
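
The effect of the parentheses on ^ and : can be seen by counting matches for both forms. This is only a sketch: the mail-like lines and the address are hypothetical.

```shell
# Hypothetical mail-like text; the last two lines are body text, not headers.
mail=$(printf '%s\n' \
  'From: alice@example.com' \
  'Subject: hello' \
  'Date: Mon' \
  'Note: Subject to change' \
  'Sent From my phone')

# Without parentheses, ^ binds only to From and : binds only to Date.
loose=$(printf '%s\n' "$mail" | grep -E -c '^From|Subject|Date:')

# With parentheses, ^ and : apply to every alternative.
strict=$(printf '%s\n' "$mail" | grep -E -c '^(From|Subject|Date):')

echo "loose=$loose strict=$strict"
```

The unparenthesized form also selects 'Note: Subject to change', because the bare alternative Subject may match anywhere in the line.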

The metacharacter sequences \< and \> (word boundaries): a problem often encountered with regular expressions is that the "word" you want to match also appears inside another word.

The word boundaries \< and \> match the beginning and end of a word. Note that < and > are not metacharacters by themselves; the sequence takes on a special meaning (a metacharacter sequence) only when they are combined with a backslash \. Not all versions of egrep support word boundaries.

Analysis:

\<cat\> means: match the beginning of a word, then the three letters c, a, t, then the end of a word. Put simply, it matches the word cat.

You can also use \<cat and cat\> to match words that begin with cat and words that end with cat, respectively.

egrep (equivalent to grep -E) recognizes two positions here: the start of a word and the end of a word.

To be exact, the beginning and end of a word are the beginning and end of a run of alphanumeric characters.
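
A minimal sketch of the word-boundary sequences follows. It assumes a grep that supports \< and \> (GNU grep does; as noted above, not every implementation does), and the sample lines are invented.

```shell
# Hypothetical sample lines.
text=$(printf '%s\n' 'a cat sat' catalog scat concatenate)

# \<cat\>: cat as a whole word.
whole=$(printf '%s\n' "$text" | grep -E '\<cat\>')
# \<cat: a word that begins with cat.
begins=$(printf '%s\n' "$text" | grep -E '\<cat')
# cat\>: a word that ends with cat.
ends=$(printf '%s\n' "$text" | grep -E 'cat\>')

echo "whole word: $whole"
echo "begins with cat: $begins"
echo "ends with cat: $ends"
```

concatenate is never selected, since its cat sits in the middle of a word.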

Optional elements and wildcards:

Optional elements

The question mark ? means optional. Placing it after a character means that the character is allowed at that position, but its presence is not required for a successful match. Which character is optional? The one immediately before the ?. This requires egrep or grep -E.

color and colour can both be matched by colou?r.

July and Jul can be matched by July? or by (July|Jul).

fourth, 4th, and 4 can be matched by fourth|4(th)?. When ? follows parentheses, it applies to the entire content of the parentheses.
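
The optional-element examples can be verified as follows. This is a sketch with invented inputs; the ^ and $ anchors are added here so that each line must match as a whole.

```shell
# u is optional.
colors=$(printf '%s\n' color colour colouur colr | grep -E '^colou?r$')

# y is optional.
months=$(printf '%s\n' July Jul Ju | grep -E '^July?$')

# The parenthesized th is optional as a unit.
ordinals=$(printf '%s\n' fourth 4th 4 4t four | grep -E '^(fourth|4(th)?)$')

echo "$colors"
echo "$months"
echo "$ordinals"
```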

The purposes of parentheses, and backreferences:

1. Limit the scope of alternation

2. Group several characters into one unit to which a question mark ? or another quantifier such as the asterisk * applies, for example four(th)? and (a)*

3. Backreferences: match again the same text that an earlier part of the expression matched

For example, when an expression contains both parentheses () and \1, the two work together to provide the backreference.

In tools that support backreferences, parentheses () "remember" the text matched by the subexpression they enclose, and the metacharacter sequence \1 matches that remembered text again, whatever it is. An expression can contain several pairs of parentheses; \1, \2, \3, and so on then stand for the text matched by the first, second, and third pairs. The pairs are numbered in the order in which their opening parentheses "(" appear from left to right.

For example, in ([a-z])([0-9])\1\2, \1 stands for the text matched by [a-z] and \2 stands for the text matched by [0-9].
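
The backreference example can be exercised with the sketch below. It assumes a grep that accepts backreferences together with -E (GNU grep does, although POSIX does not require it for extended expressions); the inputs are invented and LC_ALL=C is an assumption made for predictable ranges.

```shell
export LC_ALL=C   # assumption: C locale, so a-z is a plain ASCII range

# \1 must repeat the letter and \2 must repeat the digit captured earlier.
repeated=$(printf '%s\n' a1a1 x9x9 a1b2 a1a2 | grep -E '([a-z])([0-9])\1\2')

echo "$repeated"
```

a1b2 fails on \1 (b is not a) and a1a2 fails on \2 (2 is not 1).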

Escaping: . is itself a metacharacter that matches any character, including a space.

To match an actual dot in the text, use a backslash followed by a dot:

Aga\.att\.com

\. is called an "escaped period" or "escaped dot". This approach works for every metacharacter, but it does not apply inside a character class.

A backslash used in this way is called an "escape": the metacharacter it acts on loses its special meaning and becomes an ordinary character.

You can also use \([a-zA-Z]+\) to match a word in parentheses, such as (very). The backslashes before the opening and closing parentheses remove their special meaning, so the expression matches literal parentheses in the text.
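
A brief sketch of escaping in practice, with invented inputs: the unescaped dots accept any character, while the escaped ones accept only a literal dot, and escaped parentheses match literal parentheses under grep -E.

```shell
# Hypothetical sample host names.
hosts=$(printf '%s\n' Aga.att.com AgaXattYcom)

# Unescaped dots match any character.
loose=$(printf '%s\n' "$hosts" | grep -c 'Aga.att.com')
# Escaped dots match only a literal dot.
strict=$(printf '%s\n' "$hosts" | grep -c 'Aga\.att\.com')

# With -E, \( and \) are literal parentheses around a run of letters.
parens=$(printf '%s\n' 'it is (very) good' 'very good' '(123)' | grep -E '\([a-zA-Z]+\)')

echo "loose=$loose strict=$strict"
echo "$parens"
```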

Variable name:

Many programming languages have the concept of identifiers (identifier, such as variable names).

Identifiers contain only letters, digits, and underscores, and cannot start with a digit.

You can match identifiers with [a-zA-Z_][a-zA-Z_0-9]*:

The first character group matches the first character that may appear

The second (including the corresponding *) matches the remaining characters.
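
As a quick check of the identifier expression, the sketch below filters a few invented candidate names. The ^ and $ anchors are added here so that the whole line must be an identifier, and LC_ALL=C is an assumption made for predictable ranges.

```shell
export LC_ALL=C   # assumption: C locale, so a-z and A-Z are plain ASCII ranges

# First class: a letter or underscore; second class with *: any further letters, digits, underscores.
ids=$(printf '%s\n' total _tmp1 x 9lives my-var | grep -E '^[a-zA-Z_][a-zA-Z_0-9]*$')

echo "$ids"
```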

String in quotation marks:

The easiest way to match a string within quotation marks is to use the expression: "[^"]*"

The quotation marks at both ends match the quotation marks at the beginning and end of the string.

The text between these two quotation marks can include any character other than a double quotation mark.

Here [^"] matches any character except a double quotation mark, and * indicates that any number of such characters can appear between the two quotation marks.
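
To see which text the quoted-string expression picks out, the sketch below uses the -o option, which prints only the matched parts. Note that -o is an assumption about the grep in use: GNU and BSD grep provide it, but POSIX does not require it. The sample line is invented.

```shell
# Print each quoted string found in the input on its own line.
quoted=$(printf '%s\n' 'say "hello" to "world"' 'no quotes here' | grep -o '"[^"]*"')

echo "$quoted"
```

Because [^"] cannot cross a quotation mark, each match stops at the nearest closing quote instead of running on to the last one in the line.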

Dollar amount (which may include decimals):

\$[0-9]+(\.[0-9][0-9])? is one way to match a dollar amount.

It has three parts: \$, ...+, and (...)?

They match, respectively, a dollar sign, the digits before the decimal point (one or more digits), and a decimal point followed by two digits. The expression matches a dollar amount such as $100 or $100.10, but it does not cover an amount written with separators such as $1,100 as a whole. The decimal part is optional.

If you want to match a line that contains only the price and no other characters, add ^ and $ at the two ends of the expression, that is:

^\$[0-9]+(\.[0-9][0-9])?$ matches only a line that begins with a dollar sign and ends with a digit; the decimal part is still optional. The matching result differs from the unanchored form, because:

^ is the caret, which matches the position at the beginning of the line

$ is the dollar sign, which matches the position at the end of the line

Neither form matches $.49, because [0-9]+ requires at least one digit before the decimal point.
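
The difference between the unanchored and anchored dollar expressions can be sketched as follows; the sample prices are invented for illustration.

```shell
# Hypothetical sample lines.
prices=$(printf '%s\n' '$100' '$100.10' '$100.1' 'cost: $5.99 each' '$.49')

# Unanchored: any line containing a dollar sign followed by digits is selected.
unanchored=$(printf '%s\n' "$prices" | grep -E -c '\$[0-9]+(\.[0-9][0-9])?')

# Anchored: the whole line must be a price, with either no decimals or exactly two.
anchored=$(printf '%s\n' "$prices" | grep -E '^\$[0-9]+(\.[0-9][0-9])?$')

echo "unanchored count: $unanchored"
echo "$anchored"
```

$100.1 is selected by the unanchored form only because its leading $100 matches; $.49 is rejected by both forms.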

Web addresses: HTTP/HTML URLs

A Web URL can take many forms, so it is difficult to construct a regular expression that matches every form of URL.

However, matching only the most common URLs is relatively simple.

A common HTTP/HTML URL looks like this:

http://hostname/path.html (or .htm)

The rules for a hostname (such as www.yahoo.com) are complicated, but because whatever follows http:// is generally the hostname, it can be written simply as:

[-a-z0-9_.]+ (or perhaps [-a-z0-9_.:]+)

The path part varies more and needs to be written as:

[-a-z0-9%$]*

Note: a hyphen used as an ordinary character must be placed at the beginning of the character class; in the middle of a character class it is the metacharacter that denotes a range.

To sum up, it is:

egrep -i '\ ' files

A more simplified version:

egrep -i '\ ' files

Some incorrect results may still be matched; adjust the expression to the specific requirements.
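
The pieces described in this section can be combined along the following lines. This is only a sketch: the combined expression is assembled here from the hostname and path classes and the .html/.htm ending mentioned above, wrapped in word boundaries, and is an assumption rather than the article's exact command. The sample lines are invented, and \< and \> assume a grep that supports them (such as GNU grep).

```shell
export LC_ALL=C   # assumption: C locale, so a-z is a plain ASCII range

# Hypothetical sample lines.
urls=$(printf '%s\n' \
  'see http://www.yahoo.com/index.html now' \
  'HTTP://Example.COM/page.HTM' \
  'ftp://example.com/file.html' \
  'http://example.com/notes.txt')

# Assumed combination: http://, hostname class, slash, path class, then .htm or .html.
pattern='\<http://[-a-z0-9_.:]+/[-a-z0-9%$]*\.html?\>'

# -i makes the match case-insensitive, so HTTP:// and .HTM are accepted too.
hits=$(printf '%s\n' "$urls" | grep -E -i "$pattern")

echo "$hits"
```

The path class used here is deliberately narrow (it has no slash, for example), so URLs with subdirectories would need a wider class.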

