Usage of grep, egrep, sed, awk, sort and uniq tools for regular expressions

2025-07-06 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/02 Report--

What is a regular expression?

Regular expressions are often abbreviated to regex, regexp, or RE in code. A regular expression uses a single string to describe and match a set of strings that conform to certain syntactic rules. Put simply, it is a way of matching strings with special symbols so that specific strings can be quickly found, deleted, or replaced.

A regular expression is a text pattern consisting of ordinary characters and metacharacters. The pattern describes one or more strings to match when searching text, acting as a template that is compared against the searched string. Ordinary characters include uppercase and lowercase letters, digits, punctuation, and other symbols, while metacharacters are characters with a special meaning in regular expressions. A metacharacter can specify how the character preceding it may occur in the target text.

Regular expressions are commonly used in scripts and text editors. Many text processors and programming languages support them, such as Perl and the common text-processing tools on Linux systems (grep, egrep, sed, awk). Their text-matching capabilities make it possible to process large amounts of text quickly and efficiently.

Basic regular expressions:

Regular expression syntax can be divided into basic regular expressions and extended regular expressions, which differ in strictness and functionality. Basic regular expressions are the most fundamental part of commonly used regular expressions. Among the common file-processing tools on Linux systems, grep and sed support basic regular expressions, while egrep and awk support extended regular expressions. To master basic regular expressions, you must first understand the meaning of the metacharacters they contain. Let's copy a configuration file of the httpd service to demonstrate.

[root@localhost]# cp /etc/httpd/conf/httpd.conf /opt/httpd.txt
[root@localhost]# cat /opt/httpd.txt
#
# This is the main Apache HTTP server configuration file.  It contains the
# configuration directives that give the server its instructions.
# See for detailed information.
# In particular, see
#
# for a discussion of each configuration directive.
...    // part of the content omitted

grep command

(1) Find specific characters

Finding a specific string is simple. For example, a command such as "grep -n 'do' httpd.txt" finds the lines of httpd.txt that contain the string "do". The "-n" option displays line numbers, and the "-i" option makes the match case-insensitive. After the command runs, the matching characters are highlighted in red.
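As a minimal sketch of the "-n" and "-i" options described above, the following script runs both on a small made-up sample. The sample text and the variable names are assumptions for illustration; the article itself works on /opt/httpd.txt.

```shell
#!/bin/sh
# Hypothetical sample text standing in for /opt/httpd.txt.
sample=$(mktemp)
printf '%s\n' \
    '# See the docs for details' \
    'DocumentRoot "/var/www/html"' \
    'Listen 80' > "$sample"

# -n prefixes every matching line with its line number.
case_sensitive=$(grep -n 'do' "$sample")

# -i also ignores case, so the "Do" in "DocumentRoot" matches as well.
case_insensitive=$(grep -in 'do' "$sample")

printf 'grep -n:\n%s\n' "$case_sensitive"
printf 'grep -in:\n%s\n' "$case_insensitive"

rm -f "$sample"
```

Only line 1 contains a lowercase "do", so the first search reports one line and the case-insensitive search reports two.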

Reverse selection, such as finding lines that do not contain the string "do", is done with the "-v" option of grep, combined here with "-n" as "-vn".

[root@localhost opt]# grep -vn 'do' httpd.txt
1:#
2:# This is the main Apache HTTP server configuration file.  It contains the
3:# configuration directives that give the server its instructions.
5:# In particular, see
7:# for a discussion of each configuration directive.
8:#
...    // part of the content omitted

(2) Use brackets "[]" to find a set of characters

The strings "shirt" and "short" both contain "sh" and "rt". The following command finds both of them. No matter how many characters are listed inside "[]", the brackets match exactly one character, so "[io]" matches "i" or "o".

[root@localhost opt]# tail -3 httpd.txt
short
shirt
shart
[root@localhost opt]# grep -n 'sh[io]rt' httpd.txt
354:short
355:shirt

To find the repeated character "oo", simply execute the following command.

[root@localhost opt]# head -5 httpd.txt
wood
woood
wooood
woooood
#
[root@localhost opt]# grep -n 'oo' httpd.txt
1:wood
2:woood
3:wooood
4:woooood

To find "oo" that is not preceded by "w", use the negated set "[^]". For example, "grep -n '[^w]oo' httpd.txt" looks for occurrences of "oo" in httpd.txt whose preceding character is not "w". The result still shows "woood", because in that string an "oo" is preceded by "o" rather than "w", which satisfies the condition; the same holds for "wooood" and "woooood".

If "oo" must not be preceded by a lowercase letter, use "grep -n '[^a-z]oo' httpd.txt", where "a-z" stands for the lowercase letters; uppercase letters are written "A-Z".

Lines that contain digits can be found with "grep -n '[0-9]' httpd.txt".
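The bracket expressions discussed in this part can be tried side by side in a short sketch. The sample words and variable names are made up for illustration, and LC_ALL=C is set on the assumption that ranges such as "a-z" should follow plain ASCII order regardless of the system locale.

```shell
#!/bin/sh
# Force ASCII ordering so that [a-z] means exactly the lowercase letters.
LC_ALL=C
export LC_ALL

# Hypothetical sample lines, not the article's httpd.txt.
sample=$(mktemp)
printf '%s\n' 'wood' 'woood' 'Foo1' 'short' 'shirt' 'shart' > "$sample"

# [io] matches exactly one character: either "i" or "o".
set_match=$(grep -n 'sh[io]rt' "$sample")

# [^w]oo: "oo" preceded by any single character other than "w".
not_after_w=$(grep -n '[^w]oo' "$sample")

# [^a-z]oo: "oo" preceded by a character that is not a lowercase letter.
not_after_lower=$(grep -n '[^a-z]oo' "$sample")

# [0-9]: lines containing at least one digit.
with_digit=$(grep -n '[0-9]' "$sample")

printf '%s\n' "$set_match" "$not_after_w" "$not_after_lower" "$with_digit"
rm -f "$sample"
```

In this sample "wood" is not matched by '[^w]oo' because its only "oo" follows "w", while "woood" is matched because its last two "o" characters follow another "o".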

(3) Find the beginning of a line "^" and the end of a line "$"

Basic regular expressions contain two anchor metacharacters: "^" (the beginning of the line) and "$" (the end of the line). A query for the string "the" returns many lines that contain it anywhere; to return only the lines that begin with "the", use the "^" metacharacter.

Lines that begin with a lowercase letter can be filtered with "^[a-z]", lines that begin with an uppercase letter with "^[A-Z]", and lines that do not begin with a letter with "^[^a-zA-Z]".

[root@localhost opt]# grep -n '^[a-z]' httpd.txt
1:wood
2:woood
3:wooood
4:woooood
358:short
359:shirt
360:shart
[root@localhost opt]# grep -n '^[A-Z]' httpd.txt
35:ServerRoot "/etc/httpd"
46:Listen 80
60:Include conf.modules.d/*.conf
70:User apache
71:Group apache

The "^" symbol acts differently inside and outside the bracket expression "[]": inside "[]" it negates the set, while outside it anchors the match to the beginning of the line. Conversely, use the "$" anchor to find lines that end with a particular character. For example, "grep -n '\.$' httpd.txt" queries lines that end with a period (.). Because the period is also a metacharacter in regular expressions (discussed later), the escape character "\" is needed to turn it into an ordinary character.

To query blank lines, execute "grep -n '^$' httpd.txt".
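The anchors covered in this part can be checked with a small sketch. The sample lines and variable names below are assumptions chosen for illustration.

```shell
#!/bin/sh
LC_ALL=C
export LC_ALL

# Hypothetical sample: line 2 is intentionally empty.
sample=$(mktemp)
printf '%s\n' \
    'the first line.' \
    '' \
    'Listen 80' \
    '# comment with the word' \
    'the end' > "$sample"

# ^the: lines that begin with "the".
starts_with_the=$(grep -n '^the' "$sample")

# \.$: lines that end with a literal period; the backslash removes the
# special meaning of ".".
ends_with_period=$(grep -n '\.$' "$sample")

# ^$: blank lines (nothing between the start and the end of the line).
blank=$(grep -n '^$' "$sample")

# ^[^a-zA-Z]: lines whose first character is not a letter. The outer ^
# anchors the match, the inner ^ negates the set.
non_letter_start=$(grep -n '^[^a-zA-Z]' "$sample")

printf '%s\n' "$starts_with_the" "$ends_with_period" "$blank" "$non_letter_start"
rm -f "$sample"
```

The blank line is not reported by '^[^a-zA-Z]', because that pattern still requires one character at the start of the line.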

(4) Find any character "." and the repeated character "*"

As mentioned earlier, the period (.) in a regular expression is also a metacharacter; it represents any single character. For example, "grep -n 'w..d' httpd.txt" finds four-character strings that begin with "w" and end with "d".

In the results, "wood" matches the pattern "w..d". To query "oo", "ooo", "ooooo", and so on, use the asterisk (*) metacharacter. Note, however, that "*" means zero or more repetitions of the single character before it. "o*" therefore matches zero "o" characters (the empty string) or one or more "o" characters. Because the empty string is allowed, "grep -n 'o*' httpd.txt" prints every line of the text. With "oo*", the first "o" must be present and the second may occur zero or more times, so every line containing "o", "oo", "ooo", and so on matches. By the same token, to query strings with two or more "o" characters, execute "grep -n 'ooo*' httpd.txt".

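Query strings that begin with "w" and end with "d" with at least one "o" in between, for example with "grep -n 'woo*d' httpd.txt".

Query strings that begin with "w" and end with "d" where the characters in between are optional, for example with "grep -n 'w.*d' httpd.txt".

Query lines that contain a number of any length, for example with "grep -n '[0-9][0-9]*' httpd.txt".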
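The behaviour of "." and "*" described in this part can be sketched on a handful of made-up words. The sample data and variable names are assumptions for illustration only.

```shell
#!/bin/sh
# Hypothetical sample lines.
sample=$(mktemp)
printf '%s\n' 'wd' 'wod' 'wood' 'woood' 'word 42' 'abc' > "$sample"

# w..d: "w", exactly two arbitrary characters, then "d".
four_chars=$(grep -n 'w..d' "$sample")

# o* also matches the empty string, so every line is selected.
all_lines=$(grep -c 'o*' "$sample")

# ooo*: two "o" characters followed by zero or more further "o".
two_or_more_o=$(grep -n 'ooo*' "$sample")

# woo*d: "w", at least one "o", then "d".
w_o_d=$(grep -n 'woo*d' "$sample")

# w.*d: "w" and "d" with anything (or nothing) in between.
w_any_d=$(grep -n 'w.*d' "$sample")

# [0-9][0-9]*: one digit followed by zero or more digits.
numbers=$(grep -n '[0-9][0-9]*' "$sample")

printf '%s\n' "$four_chars" "$all_lines" "$two_or_more_o" "$w_o_d" "$w_any_d" "$numbers"
rm -f "$sample"
```

Here grep -c is used for 'o*' only to count the selected lines, which shows that all six sample lines match.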

(5) Find a range of consecutive characters "{}"

In the examples above, "." and "*" were used to express zero to unlimited repetitions. What if you want to limit the number of repetitions to a range? For example, to find three to five consecutive "o" characters, use the interval braces "{}" of basic regular expressions. In a basic regular expression the braces must be escaped with "\", written "\{" and "\}", to act as the interval operator; unescaped braces are treated as ordinary characters. The "{}" interval is used as follows.

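Query strings of two consecutive "o" characters, for example with "grep -n 'o\{2\}' httpd.txt".

Query strings that begin with "w" and end with "d" with 2 to 5 "o" characters in between, for example with "grep -n 'wo\{2,5\}d' httpd.txt".

Query strings that begin with "w" and end with "d" with 2 or more "o" characters in between, for example with "grep -n 'wo\{2,\}d' httpd.txt".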
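The three interval forms can be compared in a brief sketch. The sample words (with one, two, three, and six "o" characters) and the variable names are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical sample: wod (1 o), wood (2), woood (3), wooooood (6).
sample=$(mktemp)
printf '%s\n' 'wod' 'wood' 'woood' 'wooooood' > "$sample"

# o\{2\}: exactly two consecutive "o"; longer runs contain such a pair too.
exactly_two=$(grep -n 'o\{2\}' "$sample")

# wo\{2,5\}d: "w", two to five "o", then "d". Six "o" is too many.
two_to_five=$(grep -n 'wo\{2,5\}d' "$sample")

# wo\{2,\}d: "w", at least two "o", then "d".
two_or_more=$(grep -n 'wo\{2,\}d' "$sample")

printf '%s\n' "$exactly_two" "$two_to_five" "$two_or_more"
rm -f "$sample"
```

Without the surrounding "w" and "d", 'o\{2\}' also selects the longer runs, since any run of three or more "o" contains two consecutive "o" characters.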

(6) Metacharacter summary

From the above simple examples, we can see that the metacharacters of common basic regular expressions mainly include the following:

Metacharacter    Function
^    Matches the beginning of the input string. Inside a bracket expression it instead means that the listed characters are excluded. To match the "^" character itself, use "\^".
$    Matches the end of the input string. If the Multiline property of the RegExp object is set, "$" also matches '\n' or '\r'. To match the "$" character itself, use "\$".
.    Matches any single character except "\r\n".
\    Marks the next character as a special character, a literal character, a backreference, or an octal escape. For example, 'n' matches the character "n", '\n' matches the newline character, the sequence '\\' matches "\", and '\(' matches "(".
*    Matches the previous subexpression zero or more times. To match the "*" character itself, use "\*".
[]    Character set. Matches any one of the characters contained. For example, "[abc]" can match the "a" in "plain".
[^]    Negated character set. Matches any character that is not contained. For example, "[^abc]" can match any of "p", "l", "i", "n" in "plain".
[n1-n2]    Character range. Matches any character in the specified range. For example, "[a-z]" can match any lowercase letter from "a" to "z". Note: the hyphen (-) denotes a range only when it is inside the bracket expression and between two characters; at the beginning of the bracket expression it denotes only the hyphen itself.
{n}    n is a non-negative integer. Matches exactly n times. For example, "o{2}" does not match the "o" in "Bob", but matches the two o's in "food".
{n,}    n is a non-negative integer. Matches at least n times. For example, "o{2,}" does not match the "o" in "Bob", but matches all the o's in "foooood". "o{1,}" is equivalent to "o+", and "o{0,}" is equivalent to "o*".
{n,m}    m and n are non-negative integers, where n ...

... $CONFIG
sed -i -e '/^local_enable/s/NO/YES/g' -e '/^write_enable/s/NO/YES/g' $CONFIG
grep "listen" $CONFIG || sed -i '$alisten=YES' $CONFIG
# Start the vsftpd service and set it to run automatically after boot
systemctl restart vsftpd
systemctl enable vsftpd
[root@localhost ~]# chmod +x local_only_ftp.sh

awk tool

In Linux/UNIX systems, awk is a powerful text-processing tool. It reads input text line by line, searches it according to the specified pattern, and formats, outputs, or filters the matching content, so quite complex text operations can be done without interaction. It is widely used in shell scripts to automate configuration tasks.

(1) Common usage of awk

Typically awk is used in the following form, where the single quotation marks and the curly braces "{}" enclose the processing action applied to the data. awk can process the target file directly or read its instructions from a script with the "-f" option.

awk option 'pattern or condition {edit instruction}' file1 file2 ...    // filter and output the file content that meets the condition
awk -f script-file file1 file2 ...    // call edit instructions from the script to filter and output content

As mentioned earlier, the sed command usually processes whole lines, while awk tends to split a line into multiple "fields" before processing it; by default the field delimiter is whitespace (spaces or tabs). The results of awk are printed with the print statement. In awk you can use the logical operators "&&" for "and", "||" for "or", and "!" for "not"; you can also perform simple arithmetic with +, -, *, /, %, and ^ for addition, subtraction, multiplication, division, remainder, and exponentiation, respectively.
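A short sketch of the arithmetic and logical operators mentioned above follows. The numbers and variable names are arbitrary choices for illustration.

```shell
#!/bin/sh
# Arithmetic in a BEGIN block: no input file is needed.
arith=$(awk 'BEGIN { print 7 + 2, 7 - 2, 7 * 2, 7 / 2, 7 % 2, 7 ^ 2 }')

# Logical operators in a pattern: keep values that are between 1 and 4
# (exclusive), or that are NOT less than 6.
picked=$(printf '%s\n' 1 2 3 4 5 6 |
    awk '($1 > 1 && $1 < 4) || !($1 < 6) { print $1 }')

printf '%s\n' "$arith"     # prints: 9 5 14 3.5 1 49
printf '%s\n' "$picked"    # prints 2, 3 and 6 on separate lines
```

The pattern in front of the braces decides which records reach the print action, so the operators can be combined freely to filter input.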

/etc/passwd is a typical formatted file on Linux systems, with fields separated by ":". Most log files on Linux systems are formatted files as well, and extracting information from them is part of daily operations work. To list the user name, user ID, and group ID columns of /etc/passwd, execute the following awk command.

[root@localhost ~]# awk -F ':' '{print $1,$3,$4}' /etc/passwd
root 0 0
bin 1 1
daemon 2 2
...    // part of the content omitted
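To make the field extraction reproducible without touching the real /etc/passwd, the sketch below runs the same kind of command on a made-up three-line file in passwd format. The sample entries and variable names are assumptions.

```shell
#!/bin/sh
# Hypothetical passwd-style sample; real systems will differ.
sample=$(mktemp)
printf '%s\n' \
    'root:x:0:0:root:/root:/bin/bash' \
    'bin:x:1:1:bin:/bin:/sbin/nologin' \
    'daemon:x:2:2:daemon:/sbin:/sbin/nologin' > "$sample"

# -F ':' sets the field delimiter; $1, $3, $4 are the user name,
# user ID and group ID. The commas make print separate them with a space.
fields=$(awk -F ':' '{ print $1, $3, $4 }' "$sample")

# A condition in front of the action limits output to matching records.
nologin_users=$(awk -F ':' '$7 == "/sbin/nologin" { print $1 }' "$sample")

printf '%s\n' "$fields" "$nologin_users"
rm -f "$sample"
```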

awk reads its input from files or standard input and, like sed, reads it line by line. The difference is that awk treats each line of a text file as a record and each part (column) of a line as a field of that record. To operate on these fields, awk borrows an approach similar to positional parameters in the shell, using $1, $2, $3, and so on to denote the fields of a line (record) in order, and $0 to denote the entire line (record). Fields are separated by a delimiter, which by default is whitespace. A different delimiter can be specified on the command line with "-F delimiter".

Awk contains several special built-in variables (which can be used directly) as follows:

FS: the field delimiter for each line of text, which defaults to spaces or tabs;
NF: the number of fields in the line being processed;
NR: the line number (ordinal) of the line being processed;
$0: the entire content of the line being processed;
$n: the nth field (nth column) of the line being processed;
FILENAME: the name of the file being processed;
RS: the record separator, which defaults to \n, that is, one record per line.

Usage example
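Before the article's own examples, here is a brief sketch of several of the built-in variables listed above. The input data and variable names are made up for illustration.

```shell
#!/bin/sh
# Hypothetical passwd-style sample.
sample=$(mktemp)
printf '%s\n' \
    'root:x:0:0:root:/root:/bin/bash' \
    'bin:x:1:1:bin:/bin:/sbin/nologin' > "$sample"

# FS is set in BEGIN, before any input is read. For every record print
# its number (NR), its field count (NF), the first and the last field.
summary=$(awk 'BEGIN { FS = ":" } { print NR, NF, $1, $NF }' "$sample")

# RS changes what counts as a record: here a comma instead of a newline.
records=$(printf 'a,b,c' | awk 'BEGIN { RS = "," } { print NR ": " $0 }')

printf '%s\n' "$summary" "$records"
rm -f "$sample"
```

Because NF holds the number of fields, $NF always refers to the last field of the current record.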

Output text by line

[root@localhost opt]# awk '{print}' test.txt    // output all content, equivalent to cat test.txt
1
2
3
4
5
6
7
the this
8
this the
90
then
THE this
[root@localhost opt]# awk '{print $0}' test.txt    // output all content, equivalent to cat test.txt
1
2
3
4
5
6
7
the this
8
this the
90
then
THE this
[root@localhost opt]# awk 'NR==1,NR==3 {print}' test.txt    // output the content of lines 1 to 3
1
2
3
[root@localhost opt]# awk '(NR>=1)&&(NR ...
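The line-selection forms can be reproduced with a small sketch. The sample file is made up, and the second command, which combines two NR comparisons with "&&", is only an assumption about an equivalent way to select the same lines, not a reconstruction of the truncated command.

```shell
#!/bin/sh
# Hypothetical five-line sample file.
sample=$(mktemp)
printf '%s\n' 'one' 'two' 'three' 'four' 'five' > "$sample"

# Range pattern: from the record where NR==1 through the one where NR==3.
by_range=$(awk 'NR==1,NR==3 { print }' "$sample")

# The same selection expressed as a single condition on NR.
by_condition=$(awk '(NR >= 1) && (NR <= 3) { print }' "$sample")

# print and print $0 are interchangeable: both output the whole record.
whole=$(awk '{ print $0 }' "$sample")

printf '%s\n' "$by_range" "$by_condition" "$whole"
rm -f "$sample"
```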
