Example Analysis of Regular Expression Syntax in the UNIX/Linux Shell

2025-07-04 Update. From: SLTechnology News&Howtos

Shulou (Shulou.com), 06/02 report

This article walks through examples of regular expression syntax in the UNIX/Linux shell. It is meant to be easy to follow and well organized, and I hope it clears up your questions. Let's work through the examples together.

Almost every significant problem requires separating useful data from useless data. Learn how many UNIX command-line utilities use regular expressions to extract the essence.

Oddly enough, to this day I can still recite the Saturday-morning classic "Conjunction Junction." Whether that is a bad thing (too much TV) or a good thing (perhaps a harbinger of my current career) is open to debate. Either way, the little ditty delivered a basic lesson at a cheerful pace.

I have not yet come up with a "Conjunction Junction" for learning UNIX, but I will try to write such a song in the coming months. In the meantime, buoyed by happy memories, let's continue to conquer the command line in the old-school style of Schoolhouse Rock.

The class begins now. Spit out the gum in your mouth, go back to your seat, and take out a number two pencil. And you, Spicoli.

Imitation show

You can think of a UNIX command line as a sentence:

Executable commands, such as cat or ls, are verbs: actions.

The output of a command is a noun: the data to be viewed or used.

Shell operators, such as | (pipe) or > (redirect standard output), are conjunctions: they join sentences together.

For example, the command line ls -A | wc -l counts the entries in the current directory (ignoring the special entries . and ..) and contains two sentences. The first sentence, ls -A, is a verb phrase that lists the contents of the current directory; the second, wc -l, is another verb phrase that counts lines. The output of the first sentence becomes the input of the second, and a conjunction (the pipe) joins the two.
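The sentence structure can be tried safely in a scratch directory. The following is a minimal sketch; the directory (created by mktemp) and the file names are assumptions made for illustration.

```shell
# Create a scratch directory with three ordinary files and one hidden file.
tmpdir=$(mktemp -d)
touch "$tmpdir/a.txt" "$tmpdir/b.txt" "$tmpdir/c.txt" "$tmpdir/.hidden"

# Verb (ls -A), conjunction (|), verb (wc -l): count the entries,
# including dot files but not the special entries . and ..
count=$(cd "$tmpdir" && ls -A | wc -l)
echo "entries: $count"

# Clean up the scratch directory.
rm -rf "$tmpdir"
```

Because ls writes one name per line when its output goes to a pipe, the line count reported by wc -l equals the number of entries.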

Many of the command-line sentence patterns that you may have learned in this and other articles have this sentence structure.

However, without modifiers, command-line sentences are crude. Basic sentences get the job done, but they are not pretty. (My apologies to the high school English duo of Ms. Rad and Ms. Perlstein.) Solving more interesting problems requires adjectives.

Almost every significant problem requires separating useful data from useless data. Although the number and kinds of attributes vary, every scenario implicitly or explicitly describes the information it is looking for, by form or format, and processes it in some way to produce other information in another form.

On the command line, a regular expression acts as an adjective: a description or qualifier. Applied to output, a regular expression separates relevant data from irrelevant data.

Overview of punctuation

Let's look at an example problem.

The grep utility filters its input line by line, looking for matches. In its simplest use, grep prints the lines that contain text matching a pattern. grep can find a fixed sequence of characters and can even ignore case if you use the -i option.

So, assume that the file heroes.txt contains the following lines:

Catwoman

Batman

The Tick

Spider Man

Black Cat

Batgirl

Danger Girl

Wonder Woman

Luke Cage

The Punisher

Ant Man

Dead Girl

Aquaman

SCUD

Spider Woman

Blackbolt

Martian Manhunter

Command line:

grep -i man heroes.txt

Will generate:

Catwoman

Batman

Spider Man

Wonder Woman

Ant Man

Aquaman

Spider Woman

Martian Manhunter

grep scans every line of heroes.txt looking for the letter m, followed by a, followed by n. Apart from having to be adjacent, the letters can appear anywhere on a line, even inside a larger word. Ignoring case (the -i option), Catwoman, Batman, Spider Man, Wonder Woman, Ant Man, Aquaman, Spider Woman, and Martian Manhunter all contain the string man.

The grep utility has other built-in options to refine a search. For example, the -w option restricts matches to whole words, so grep -i -w man excludes Catwoman and Batman (for example).

The tool can also exclude matching lines rather than include them. Use the -v option to omit the lines that match. For example:

grep -v -i 'spider' heroes.txt

prints every line except those that contain the string spider:

Catwoman

Batman

The Tick

Black Cat

Batgirl

Danger Girl

Wonder Woman

Luke Cage

The Punisher

Ant Man

Dead Girl

Aquaman

SCUD

Blackbolt

Martian Manhunter
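As a rough sketch of the three options covered so far, the following script rebuilds the sample list in a scratch file and reports what each option selects. The scratch directory and the variable names are assumptions made for illustration; only grep's -i, -w, -v, and -c (count matching lines) options are used.

```shell
# Recreate the sample list in a scratch file (path chosen by mktemp).
tmpdir=$(mktemp -d)
heroes="$tmpdir/heroes.txt"
printf '%s\n' 'Catwoman' 'Batman' 'The Tick' 'Spider Man' 'Black Cat' \
  'Batgirl' 'Danger Girl' 'Wonder Woman' 'Luke Cage' 'The Punisher' \
  'Ant Man' 'Dead Girl' 'Aquaman' 'SCUD' 'Spider Woman' 'Blackbolt' \
  'Martian Manhunter' > "$heroes"

# -i: "man" anywhere on the line, in any letter case (-c prints the count).
substring_count=$(grep -i -c man "$heroes")

# -w: "man" only as a whole word.
whole_word=$(grep -i -w man "$heroes")

# -v: every line that does NOT contain "spider".
without_spider_count=$(grep -v -i -c 'spider' "$heroes")

echo "lines containing man: $substring_count"
echo "whole-word matches:"
printf '%s\n' "$whole_word"
echo "lines without spider: $without_spider_count"

rm -rf "$tmpdir"
```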

But what if you want only names that start with "Bat", or names that start with "bat", "Bat", "cat", or "Cat"? Or what if you want to know how many of these comic-book avengers have names that end with "man"? A simple string search like the three examples above cannot answer such questions, because it is insensitive to position.

Location, location, and options

Regular expressions can filter on specific positions, such as the beginning or end of a line and the beginning or end of a word. Regular expressions (commonly abbreviated regex) can also describe alternation (think "this" or "that"); repetition of fixed, variable, or indefinite length; ranges (for example, "any letter from a through m"); classes or kinds of characters ("printable characters" or "punctuation"); and more.

Table 1 shows some common regular expression operators. You can join the elements (and other operators) shown in Table 1 and combine them to build (very) complex regular expressions.

Table 1. Common regular expression operators

Operator          Purpose
. (period)        Matches any single character.
^ (caret)         Matches the empty string at the beginning of a line or string.
$ (dollar sign)   Matches the empty string at the end of a line.
A                 Matches the uppercase letter A.
a                 Matches the lowercase letter a.
\d                Matches any single digit.
\D                Matches any single non-digit character.
\w                Matches any single word character (a letter, a digit, or the underscore); a close relative of [[:alnum:]].
[A-E]             Matches any one of uppercase A, B, C, D, or E.
[^A-E]            Matches any character except A, B, C, D, and E.
X?                Matches zero or one occurrence of the capital letter X.
X*                Matches zero or more capital Xs.
X+                Matches one or more capital Xs.
X{n}              Matches exactly n capital Xs.
X{n,m}            Matches at least n and no more than m capital Xs. If m is omitted, the expression tries to match at least n Xs.
(abc|def)+        Matches a sequence of (at least one) abc or def; abc and def would both match.

Here are some examples of regular expressions that use grep as a search tool. Many other UNIX tools, including interactive editors vi and Emacs, stream editors sed and awk, and all modern programming languages support regular expressions. After you learn the syntax of regular expressions (which may be quite obscure), you can flexibly apply your expertise to different tools, programming languages, and operating systems.

Find names that start with "Bat"

To find a name that starts with "Bat", use:

grep -E '^Bat' heroes.txt

The -E option tells grep to treat the pattern as an extended regular expression. The ^ (caret) character matches the beginning of a line or string: an imaginary character that appears before the first character of every line or string. The letters B, a, and t are literals and match only those specific characters. So, the command grep -E '^Bat' heroes.txt prints:

Batman

Batgirl

Because many regex operators are also used by the shell (some for different purposes, some for similar ones), it is good practice to wrap every regex on the command line in single quotation marks to keep the shell from interpreting the regex operators. For example, * (asterisk) and $ (dollar sign) are regex operators that also have special meaning to your shell.

Find a name that ends with "man"

To find a name that ends with "man", use the regex man$, which matches the sequence m, a, n followed immediately by the end of the line (or string), as matched by the regex operator $.

Find blank lines

Building on ^ and $, you can use the regex ^$ to find blank lines (that is, lines that end immediately after they begin).
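The two anchors can be exercised together in a short script. This is a sketch only: the scratch file and the variable names are illustrative assumptions, and the blank-line sample is made-up input rather than part of the article's data.

```shell
# Recreate the sample list in a scratch file (path chosen by mktemp).
tmpdir=$(mktemp -d)
heroes="$tmpdir/heroes.txt"
printf '%s\n' 'Catwoman' 'Batman' 'The Tick' 'Spider Man' 'Black Cat' \
  'Batgirl' 'Danger Girl' 'Wonder Woman' 'Luke Cage' 'The Punisher' \
  'Ant Man' 'Dead Girl' 'Aquaman' 'SCUD' 'Spider Woman' 'Blackbolt' \
  'Martian Manhunter' > "$heroes"

# ^ anchors the pattern to the beginning of the line.
starts_with_bat=$(grep -E '^Bat' "$heroes")

# $ anchors the pattern to the end of the line. Without -i the match is
# case-sensitive, so "Spider Man" and "Ant Man" (capital M) do not count.
ends_with_man=$(grep -E -c 'man$' "$heroes")

# ^$ matches lines that end immediately after they begin (blank lines).
blank_lines=$(printf 'first\n\nsecond\n\n' | grep -E -c '^$')

printf '%s\n' "$starts_with_bat"
echo "names ending in man: $ends_with_man"
echo "blank lines: $blank_lines"

rm -rf "$tmpdir"
```

Note that every pattern is wrapped in single quotation marks so that the shell passes ^ and $ to grep untouched.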

Alternation or the set operator

To find names that start with "bat", "Bat", "cat", or "Cat", you can use either of two techniques. The first is alternation, which yields a match if any one of the alternative patterns matches. For example, the command:

grep -E '^(bat|Bat|cat|Cat)' heroes.txt

does the trick. The regex operator | (vertical bar) denotes alternation, so this|that matches either the string this or the string that. Hence, ^(bat|Bat|cat|Cat) means "the beginning of the line, followed immediately by one of bat, Bat, cat, or Cat." Of course, you can simplify the regex with grep -i, which ignores case, reducing the command to:

grep -i -E '^(bat|cat)' heroes.txt

The other way to match "bat", "Bat", "cat", or "Cat" is the [ ] (square brackets) set operator. If you place a list of characters in a set, any one of those characters can match. (You can think of a set as shorthand for alternation among single characters.)

For example, the command line:

grep -E '^[bcBC]at' heroes.txt

produces the same results as the command:

grep -E '^(bat|Bat|cat|Cat)' heroes.txt

Again, you can use -i to simplify the regex to ^[bc]at.
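To check that the two techniques really select the same lines, a small script can run each form and compare the results. This is a sketch under the assumption that the sample list is stored in a scratch file; the variable names are illustrative.

```shell
# Recreate the sample list in a scratch file (path chosen by mktemp).
tmpdir=$(mktemp -d)
heroes="$tmpdir/heroes.txt"
printf '%s\n' 'Catwoman' 'Batman' 'The Tick' 'Spider Man' 'Black Cat' \
  'Batgirl' 'Danger Girl' 'Wonder Woman' 'Luke Cage' 'The Punisher' \
  'Ant Man' 'Dead Girl' 'Aquaman' 'SCUD' 'Spider Woman' 'Blackbolt' \
  'Martian Manhunter' > "$heroes"

# Alternation: any one of the four spellings at the start of the line.
by_alternation=$(grep -E '^(bat|Bat|cat|Cat)' "$heroes")

# Set: one character out of b, c, B, C, followed by "at".
by_set=$(grep -E '^[bcBC]at' "$heroes")

# Case-insensitive set: -i lets the shorter set [bc] stand for both cases.
by_set_nocase=$(grep -i -E '^[bc]at' "$heroes")

if [ "$by_alternation" = "$by_set" ] && [ "$by_set" = "$by_set_nocase" ]; then
  echo "all three forms select the same lines:"
fi
printf '%s\n' "$by_alternation"

rm -rf "$tmpdir"
```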

You can also use the - (hyphen) operator to specify a range of characters within a set. For example, user names typically begin with a letter. To validate such a user name in a Web form submitted to your server, you might use a regex such as ^[A-Za-z]. This regex means "the beginning of the string, followed immediately by any uppercase letter (A-Z) or any lowercase letter (a-z)." Note that the shorter [A-z] is not equivalent to [A-Za-z]: in ASCII, the range A-z also covers the six punctuation characters that sit between Z and a ([, \, ], ^, _, and `).

You can also mix ranges and individual characters in a set. The regex [A-MXYZ] matches any one of uppercase A through M, X, Y, and Z.

And if you want the inverse of a set (that is, any character except those in the set), use the special form [^ ] and list the ranges or characters to exclude after the caret. Here is an example of an inverted set. To find every superhero whose name contains at, but not bat (which rules out the Dark Knight, Batman), type:

grep -i -E '[^b]at' heroes.txt

This command generates:

Catwoman

Black Cat
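Ranges and inverted sets can be tried out as follows. This is a minimal sketch: the scratch file, the single-letter test input, and the variable names are assumptions for illustration, and LC_ALL=C is set on the range test so that A-M is interpreted in plain ASCII order regardless of the current locale.

```shell
# Recreate the sample list in a scratch file (path chosen by mktemp).
tmpdir=$(mktemp -d)
heroes="$tmpdir/heroes.txt"
printf '%s\n' 'Catwoman' 'Batman' 'The Tick' 'Spider Man' 'Black Cat' \
  'Batgirl' 'Danger Girl' 'Wonder Woman' 'Luke Cage' 'The Punisher' \
  'Ant Man' 'Dead Girl' 'Aquaman' 'SCUD' 'Spider Woman' 'Blackbolt' \
  'Martian Manhunter' > "$heroes"

# Inverted set: "at" preceded by any character other than b (or B, with -i).
not_bat=$(grep -i -E '[^b]at' "$heroes")

# Mixed range and single characters: -x requires the whole line to match,
# so each one-letter line is tested against the set on its own.
in_set=$(printf '%s\n' A M N X Z | LC_ALL=C grep -E -x '[A-MXYZ]' | paste -s -d ' ' -)

printf '%s\n' "$not_bat"
echo "letters in [A-MXYZ]: $in_set"

rm -rf "$tmpdir"
```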

Because some sets are needed so often, shorthand notations stand in for long lists of characters. For example, the set [A-Za-z0-9_] is so common that it can be abbreviated \w. Likewise, the operator \W is shorthand for the set [^A-Za-z0-9_]. The POSIX notation [[:alnum:]] (and its inverse, [^[:alnum:]]) is a close relative of \w (and \W), except that it does not include the underscore.

By the way, \w and [[:alnum:]] are locale-specific, whereas [A-Za-z0-9_] means literally the letters A-Z and a-z, the digits 0-9, and the underscore. If you are developing an internationalized application, use the locale-specific forms to keep your code portable across locales.

Repeat with me: repeat, repeat, repeat

So far, you have seen literals, positions, and the two alternation operators. With those alone, you can match most patterns of predictable length. Returning to user names, you can ensure that every user name begins with a letter followed by exactly seven letters or digits with the regex:

[a-z][a-z0-9][a-z0-9][a-z0-9][a-z0-9][a-z0-9][a-z0-9][a-z0-9]

But that is clumsy. Worse, it matches only user names of exactly eight characters. It does not match names of three to eight characters, which are typically valid, too.

Regular expressions can also include repetition modifiers. A repetition modifier specifies a count, such as none, one, several, one or more, zero or one, five to ten, or exactly three. A repetition modifier must be attached to another pattern; on its own, it has no meaning.

For example, the regex:

^[A-Za-z][A-Za-z0-9]{2,7}$

implements the user name filter described earlier: a user name is a string that begins with a letter, followed by at least two but no more than seven letters or digits, followed by the end of the string.

The position anchors are essential here. Without both anchors, a user name of any length would be wrongly accepted. Why? Consider the regex:

^[A-Za-z][A-Za-z0-9]{2,7}

It asks whether the string begins with a letter followed by two to seven letters or digits, but it says nothing about what comes after. So the string samuelclemens meets the criteria, although it is obviously too long to be a valid user name. Similarly, omitting the start anchor ^, or both anchors, matches strings that merely end with something like munster1313 or merely contain such a string, respectively. If you must match a specific length, remember to anchor both the beginning and the end of the pattern.
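The effect of the anchors is easy to see by testing a few candidate names against each variant. In this sketch, check is a hypothetical helper defined only for the demonstration, and the candidate strings other than samuelclemens are made up.

```shell
# check REGEX CANDIDATE: hypothetical helper that reports whether the
# candidate string matches the extended regex (grep -q prints nothing and
# only sets the exit status).
check() {
  if printf '%s\n' "$2" | grep -E -q "$1"; then
    echo 'match'
  else
    echo 'no match'
  fi
}

anchored='^[A-Za-z][A-Za-z0-9]{2,7}$'
no_end_anchor='^[A-Za-z][A-Za-z0-9]{2,7}'
no_anchors='[A-Za-z][A-Za-z0-9]{2,7}'

check "$anchored" 'sam'                  # three characters: accepted
check "$anchored" 'samuelclemens'        # thirteen characters: rejected
check "$no_end_anchor" 'samuelclemens'   # accepted, because the tail is ignored
check "$anchored" '1313munster1313'      # starts with a digit: rejected
check "$no_anchors" '1313munster1313'    # accepted, because a valid run appears inside
```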

Here are some other examples:

You can use {2,} to find two or more repetitions. The regex ^G[o]{2,}gle matches Google, Gooogle, Goooogle, and so on.

The repetition modifiers ?, +, and * look for zero or one, one or more, and zero or more repetitions, respectively. (For example, you can think of ? as shorthand for {0,1}.)

The regex boys? matches boy or boys; the regex Goo?gle matches Gogle or Google.

The regex Goo+gle matches Google, Gooogle, Goooogle, and so on.

The regex Goo*gle matches Gogle, Google, Gooogle, and so on.
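The repetition modifiers can be compared side by side by running each regex over the same three spellings. This is a sketch; matches is a hypothetical helper, and grep's -x option (match the whole line) is used so that each spelling is accepted or rejected in its entirety.

```shell
# matches REGEX: hypothetical helper that lists which of three sample
# spellings the extended regex accepts, joined on one line.
matches() {
  printf '%s\n' Gogle Google Gooogle | grep -E -x "$1" | paste -s -d ' ' -
}

for regex in 'Goo?gle' 'Goo+gle' 'Goo*gle' 'Go{2,}gle'; do
  printf '%-10s %s\n' "$regex" "$(matches "$regex")"
done
```

Go{2,}gle is the same pattern as G[o]{2,}gle; a set that holds a single character is equivalent to the character itself.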

You can apply repetition modifiers to individual characters (as shown above) and to more complex combinations. Use the ( and ) parentheses, as in mathematics, to apply a modifier to a subexpression. Here is an example. Given the text file test.txt:

The rain in Spain falls mainly

On the the plain.

It was the best of of times

It was the worst of times.

The command grep -i -E '(\b(of|the)\W+){2,}' test.txt prints:

On the the plain.

It was the best of of times

The regex operator \b matches a word boundary, that is, the position between \W\w or \w\W. The regex means "a series of two or more whole words 'the' or 'of', each followed by non-word characters." You may ask why \W+ is needed: \b matches only the empty string at the beginning or end of a word. The regex must also consume the character (or characters) between the words, or it will not find a match.
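The doubled-word search can be reproduced with a scratch copy of test.txt. This sketch assumes GNU grep, where \b and \W are available as extensions to extended regular expressions; other grep implementations may not support them. The scratch path and variable names are illustrative.

```shell
# Write the four sample lines to a scratch file (path chosen by mktemp).
tmpdir=$(mktemp -d)
printf '%s\n' 'The rain in Spain falls mainly' 'On the the plain.' \
  'It was the best of of times' 'It was the worst of times.' \
  > "$tmpdir/test.txt"

# Two or more consecutive whole words "of" or "the", each followed by at
# least one non-word character (\b and \W assume GNU grep).
doubled=$(grep -i -E '(\b(of|the)\W+){2,}' "$tmpdir/test.txt")
doubled_count=$(printf '%s\n' "$doubled" | wc -l)

printf '%s\n' "$doubled"
echo "lines with a doubled word: $doubled_count"

rm -rf "$tmpdir"
```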

Capture what you should pay attention to

Finding text is a common problem, but more often than not you want to extract the text once it is found. In other words, you want to discard the dross and keep the essence.

Regular expressions extract information through capture. To isolate the text you want from the rest, enclose the pattern in parentheses. In fact, you have already used parentheses to group terms; by default, parentheses also capture automatically.

To see captures, switch to Perl. (The grep utility does not print captures, because its goal is to print the lines that contain a pattern.)

The following command:

perl -n -e '/^The\s+(.*)$/ && print "$1\n"' heroes.txt

Will print:

Tick

Punisher

The perl -e option runs a Perl program directly from the command line. The perl -n option runs the program once for every line of the input file. The regex portion of the command, the text between the slashes (/), says "match the beginning of the string, then the letters 'T', 'h', and 'e', then one or more whitespace characters (\s+), and then capture every character up to the end of the string."

Perl places captures in special variables, beginning with $1. The rest of the Perl program prints what was captured.

Each pair of parentheses, nested or not, is numbered by counting left parentheses from the left, and its capture is placed in the next special numbered variable. For example, applied to a copy of heroes.txt in which the hyphenated names are spelled Spider-Man, Ant-Man, and Spider-Woman, the command:

perl -n -e '/^(\w+)-(\w+)$/ && print "$1 $2\n"' heroes.txt

prints:

Spider Man

Ant Man

Spider Woman

Capturing the text you are interested in only scratches the surface. Once you can pinpoint text precisely, you can replace it with other text. Editors such as vi and Emacs combine pattern matching with substitution so that finding and replacing text is a single step. You can also use patterns, substitution, and sed to change text from the command line.
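Capture and substitution from the command line might look like the following. Note that this sketch uses sed rather than Perl: sed -E enables extended regular expressions, \1 and \2 refer to the first and second captures, and the POSIX classes [[:space:]] and [[:alpha:]] stand in for Perl's \s and letters. The scratch file, the shortened list, and the variable names are assumptions for illustration.

```shell
# A shortened version of the sample list in a scratch file.
tmpdir=$(mktemp -d)
heroes="$tmpdir/heroes.txt"
printf '%s\n' 'Catwoman' 'The Tick' 'Spider Man' 'The Punisher' \
  'Aquaman' > "$heroes"

# Capture everything after a leading "The" and print only the capture.
# -n suppresses normal output; the p flag prints lines where the
# substitution succeeded.
after_the=$(sed -E -n 's/^The[[:space:]]+(.*)$/\1/p' "$heroes")

# Two captures, numbered left to right, swapped in the replacement.
swapped=$(printf '%s\n' 'Spider Man' |
  sed -E 's/^([[:alpha:]]+) ([[:alpha:]]+)$/\2, \1/')

printf '%s\n' "$after_the"
echo "$swapped"

rm -rf "$tmpdir"
```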

Rich themes

Regular expressions are very powerful, with a large number and variety of operators. There is so much information and practical lore that what can be covered here is only a small sample.

Fortunately, there are three excellent sources of regular expression know-how:

If you have Perl on your system, see the Perl regular expressions man page (type perldoc perlre). It offers a fine introduction to regex and includes many useful examples. Many programming languages have adopted Perl-compatible regular expressions (PCRE), so what you read on this man page carries over directly to PHP, Python, Java, and Ruby, as well as to many other recent tools.

Mastering Regular Expressions (Third Edition), by Jeffrey Friedl, is considered the bible of regex usage. Meticulous, precise, clear, and pragmatic, the book explains how matching works, all the regex operators, greediness (limiting how many characters + and * match), and more. In addition, Friedl's book includes some astonishing regular expressions that correctly match fully qualified e-mail addresses and other strings defined by Request for Comments (RFC) documents.

Nathan Good's book Regular Expression Recipes provides practical solutions to many common data processing and filtering problems. If you need to extract a ZIP code, a phone number, or a quoted string, try Nathan's solutions.

There are many ways to use regular expressions on the command line. Almost every command that processes text supports some form of regular expression. Most shells also offer a syntax, loosely modeled on regular expressions, for matching file names (although the operators may behave differently).

For example, type ls [a-c] to find files named a, b, or c. Type ls [a-c]* to find all file names that start with a, b, or c. Here, the * does not modify [a-c] as it would in grep; in the shell, * is interpreted the way .* is in a regex. The ? operator works in the shell, too, but it is interpreted like ., that is, it matches any single character.
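The difference between shell patterns and regular expressions can be observed in a scratch directory. This is a sketch only; the directory and the file names are invented for the demonstration, and paste is used merely to join the output of ls onto one line.

```shell
# Create a scratch directory with a handful of empty files.
tmpdir=$(mktemp -d)
touch "$tmpdir/a" "$tmpdir/b" "$tmpdir/c" "$tmpdir/apple" \
  "$tmpdir/banana" "$tmpdir/dog"

# [a-c] alone matches exactly one character: files named a, b, or c.
single=$(cd "$tmpdir" && ls [a-c] | paste -s -d ' ' -)

# [a-c]* matches a, b, or c followed by anything (like [a-c].* in a regex).
prefixed=$(cd "$tmpdir" && ls [a-c]* | paste -s -d ' ' -)

# ? matches any single character (like . in a regex), so ??? matches
# every three-character name.
three_chars=$(cd "$tmpdir" && ls ??? | paste -s -d ' ' -)

echo "ls [a-c]  : $single"
echo "ls [a-c]* : $prefixed"
echo "ls ???    : $three_chars"

rm -rf "$tmpdir"
```

The patterns are deliberately left unquoted here, because it is the shell, not ls, that expands them into file names.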

Check the documentation for your favorite utility or shell to see which regex operators it supports and how its behavior may differ.

UNIX grep regular expression metacharacters

A regular expression is a text pattern made up of ordinary characters (such as the characters a through z) and special characters (called metacharacters). The pattern describes one or more strings to match when searching a body of text. The regular expression acts as a template, matching a character pattern against the string being searched.

\

Marks the next character as a special character, a literal character, a back reference, or an octal escape. For example, 'n' matches the character "n". '\n' matches a newline character. The sequence '\\' matches "\", and '\(' matches "(".

^

Matches the position at the start of the input string.

$

Matches the position at the end of the input string.

*

Matches the preceding subexpression zero or more times. For example, 'zo*' matches "z" as well as "zoo". * is equivalent to {0,}.

+

Matches the preceding subexpression one or more times. For example, 'zo+' matches "zo" and "zoo", but not "z". + is equivalent to {1,}.

?

Matches the preceding subexpression zero or one time. For example, 'do(es)?' matches "do" or "does". ? is equivalent to {0,1}.

{n}

n is a non-negative integer. Matches exactly n times. For example, 'o{2}' does not match the 'o' in "Bob", but it does match the two o's in "food".

{n,}

n is a non-negative integer. Matches at least n times. For example, 'o{2,}' does not match the 'o' in "Bob", but it matches all the o's in "foooood". 'o{1,}' is equivalent to 'o+'. 'o{0,}' is equivalent to 'o*'.

{n,m}

m and n are non-negative integers, where n <= m. Matches at least n and at most m times.
