Linux/Unix Tools and the POSIX Specification of Regular Expressions

2026-02-13 Update. From: SLTechnology News&Howtos

Readers with basic knowledge of regular expressions will be familiar with expressions such as "\d" and "[a-z]+": the former matches one digit and the latter matches one or more lowercase letters. But if you have used Linux/Unix tools such as vi, grep, awk, and sed, you may have found that although these tools support regular expressions, their syntax is quite different. Expressions such as "\d" and "[a-z]+" are often either not recognized or fail to match as expected. Moreover, the tools differ among themselves: the same construct needs escaping in some tools and not in others. Why is this? The reason is that most tools under Unix/Linux follow the POSIX specification, and the POSIX specification itself is divided into two schools (flavors). So first of all, it is necessary to understand the POSIX specification.

I. POSIX specification

The common notation for regular expressions actually comes from Perl. In fact, Perl gave rise to a prominent school of regular expressions called PCRE (Perl Compatible Regular Expressions), characterized by notations such as "\d", "\w", and "\s". But besides PCRE there are other schools, such as the regular expressions of the POSIX specification introduced below. The full name of POSIX is Portable Operating System Interface for uniX. It consists of a series of specifications that define the functions a UNIX operating system should support, so "regular expressions of the POSIX specification" really means "the part of the POSIX specification that concerns regular expressions". It defines two major schools: BRE (Basic Regular Expression) and ERE (Extended Regular Expression). On POSIX-compatible UNIX systems, tools such as grep and egrep follow the POSIX specification, and the regular expressions in some database systems also conform to it.

1.1 BRE

Among the common Linux/Unix tools, grep, vi, and sed belong to the BRE school, whose syntax looks strange. The metacharacters "(", ")", "{", and "}" must be escaped before they take on a special meaning. So the regular expression "(a)b" can only match the string (a)b, not the string ab. The regular expression "a{1,2}" can only match the string a{1,2}, while the regular expression "a\{1,2\}" matches the string a or aa. The reason for this awkwardness is that these tools were born very early, while many regular expression features evolved gradually. In the early days these characters had no special meaning, so to preserve backward compatibility the only option was to require escaping.

Some features are not supported at all. For example, BRE supports neither the "+" and "?" quantifiers nor the alternation structure "(...|...)". Backreferences ("\1", "\2", ...) are a different case: POSIX does define them for BRE. Today, however, pure BRE is rare. It is taken for granted that regular expressions support features such as alternation and backreferences, and doing without them is simply too inconvenient. So although vi belongs to the BRE school, it provides these features. GNU also extends BRE to support "+", "?", and "|", but they must be written as "\+", "\?", and "\|"; backreferences such as "\1" and "\2" are supported as well. As a result, tools such as GNU grep are nominally of the BRE school, but the more exact name is GNU BRE.
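The escaping rule described above can be sketched in Python. Python's re module is a Perl-style engine, not a POSIX one, so the hypothetical helper below (gnu_bre_to_ere is not part of any library) only rewrites a GNU BRE pattern into the unescaped form, swapping the role of the backslash for "(", ")", "{", "}", "+", "?", and "|". It is a minimal sketch that assumes the pattern contains no bracket expressions and ignores BRE corner cases such as a leading "*".

```python
import re

# Characters whose escaped/unescaped roles are swapped between GNU BRE and ERE.
SWAPPED = set("(){}+?|")

def gnu_bre_to_ere(pattern):
    """Hypothetical helper: rewrite a simple GNU BRE pattern in ERE-style syntax.

    Assumption: the pattern has no bracket expressions such as [a-z].
    """
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            if nxt in SWAPPED:
                out.append(nxt)        # \( in BRE is the metacharacter ( in ERE
            else:
                out.append(ch + nxt)   # keep other escapes, e.g. \. or \1
            i += 2
        elif ch in SWAPPED:
            out.append("\\" + ch)      # a bare ( in BRE is a literal character
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)

print(gnu_bre_to_ere("(a)b"))             # \(a\)b
print(gnu_bre_to_ere(r"a\{1,2\}"))        # a{1,2}
print(gnu_bre_to_ere(r"\(ab\|cd\)\+"))    # (ab|cd)+
```

For these simple patterns the rewritten form happens to be valid in Python's re as well, which makes it easy to check that "(a)b" matches only the literal string (a)b while "a\{1,2\}" matches a or aa.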

1.2 ERE

Among the common Linux/Unix tools, egrep and awk belong to the ERE school. Although BRE is called "basic" and ERE "extended", ERE does not require compatibility with BRE syntax; it is a self-contained system. Its metacharacters therefore need no escaping (putting a backslash in front of a metacharacter cancels its special meaning), so "(ab|cd)" matches the string ab or cd, and the quantifiers "+", "?", and "{n,m}" can be used directly. ERE does not explicitly specify backreferences, but many tools support backreferences such as "\1" and "\2". Tools such as GNU egrep belong to the ERE school (the more accurate name is GNU ERE). But because GNU has extended BRE so heavily, GNU ERE is really just a name: every feature it has is also available in GNU BRE, and the only difference is that its metacharacters need no escaping. The following table briefly illustrates the differences between the POSIX schools [1] (in fact, there is no functional difference between BRE and ERE today; the main difference lies in the escaping of metacharacters).

[Table: Descriptions of several POSIX genres]
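The ERE conventions from section 1.2 can be tried out with a short sketch. Python's re module is a Perl-style engine rather than a POSIX ERE implementation, but for the constructs named here (grouping, alternation, intervals, escaping, backreferences) it follows the same unescaped-metacharacter convention, so it serves as a stand-in under that assumption. The matches helper is introduced only for illustration.

```python
import re

# Unescaped metacharacters are special, as in ERE.
alternation = re.compile(r"(ab|cd)")
interval = re.compile(r"a{1,2}")

# A backslash in front of a metacharacter cancels its special meaning.
literal = re.compile(r"\(ab\|cd\)")

# Backreference: \1 must repeat whatever group 1 matched.
backref = re.compile(r"(a+)b\1")

def matches(regex, text):
    """Return True if the whole of text matches regex."""
    return regex.fullmatch(text) is not None

print(matches(alternation, "ab"), matches(alternation, "cd"))   # True True
print(matches(interval, "aa"), matches(interval, "aaa"))        # True False
print(matches(literal, "(ab|cd)"), matches(literal, "ab"))      # True False
print(matches(backref, "aabaa"), matches(backref, "aaba"))      # True False
```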

For ease of reference, the following table lists how the basic regular expression features are written in common tools, taking the GNU versions of the tools as the reference.

[Table: Representations in common Linux/Unix tools]

Note: in PCRE, "\b" is commonly used to indicate "the beginning or end of a word". In Linux/Unix tools, "\<" is usually used to match "the beginning of a word" and "\>" to match "the end of a word". In GNU awk, "\y" matches both positions (GNU grep and GNU sed also accept "\b" for this).
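The difference between the one-sided anchors and the two-sided "\b" can be illustrated in Python. Python's re has no "\<" or "\>", so the sketch below emulates them with "\b" plus a lookaround; the constant names WORD_START and WORD_END are introduced here for illustration only.

```python
import re

# Emulations of the word anchors used by Linux/Unix tools (assumed equivalents):
WORD_START = r"\b(?=\w)"    # like \<  : a boundary followed by a word character
WORD_END = r"\b(?<=\w)"     # like \>  : a boundary preceded by a word character

text = "cat concat catalog"

starts = [m.start() for m in re.finditer(WORD_START + "cat", text)]
ends = [m.start() for m in re.finditer("cat" + WORD_END, text)]
both = [m.start() for m in re.finditer(r"\bcat\b", text)]

print(starts)   # [0, 11]  "cat" at the beginning of a word: cat, catalog
print(ends)     # [0, 7]   "cat" at the end of a word: cat, concat
print(both)     # [0]      "cat" as a whole word
```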

II. POSIX character group

In some documents you will also come across notations such as "[:digit:]" and "[:lower:]". They are not hard to understand (digit means "number" and lower means "lowercase"), yet they look odd. These are POSIX character groups. They appear not only in common Linux/Unix tools but also in some programming languages, so a brief introduction is in order to avoid confusion.

In the POSIX specification, notations such as "[a-z]" and "[aeiou]" are still legal, and they mean the same as character groups in PCRE. The exact name of this notation, however, is the POSIX bracket expression, and it is used mainly on Unix/Linux systems. The main difference between a POSIX bracket expression and a PCRE character group is that inside a POSIX bracket expression the backslash "\" is not an escape character. So the POSIX bracket expression "[\d]" matches only the two characters \ and d, not the digits matched by "[0-9]".

To deal with characters that have a special meaning inside a character group, the POSIX bracket expression lays down two rules. To include the character ] itself (rather than have it close the group), place it immediately after the opening square bracket, so in POSIX the regular expression "[]a]" matches the characters ] and a. To include the character - itself (rather than have it denote a range), place it immediately before the closing square bracket, so "[a-]" matches the characters a and -.
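These bracket rules can be compared against a Perl-style engine. The sketch below uses Python's re, which treats the backslash as an escape inside a character group (unlike POSIX) but happens to follow the same placement rules for ] and -. The variable names are illustrative, and the POSIX reading of "[\d]" is reproduced by writing the backslash explicitly as "[\\d]".

```python
import re

sample = r"a\d7"   # the four characters: a, backslash, d, 7

# Perl-style reading: \d inside the group still means "digit".
perl_style = re.findall(r"[\d]", sample)

# POSIX reading of "[\d]" (a backslash or the letter d), spelled out for Python.
posix_style = re.findall(r"[\\d]", sample)

# A ] placed right after the opening bracket is a literal character.
bracket_first = re.findall(r"[]a]", "a]b")

# A - placed right before the closing bracket is a literal character.
hyphen_last = re.findall(r"[a-]", "a-b")

print(perl_style)      # ['7']
print(posix_style)     # ['\\', 'd']
print(bracket_first)   # ['a', ']']
print(hyphen_last)     # ['a', '-']
```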

The POSIX specification also defines POSIX character groups, roughly equivalent to PCRE's character group shorthands: an intuitive name stands for a set of characters, such as digit for "numeric characters" and alpha for "alphabetic characters". POSIX has another noteworthy concept, however: the locale. A locale is a set of language- and culture-related settings, including date format, currency format, character encoding, and so on. The meaning of a POSIX character group changes with the locale. The following table describes the meaning of the common POSIX character groups in the ASCII locale and in the Unicode locale for reference.

[Table: POSIX character groups]

Note 1: the character groups marked with * are not part of the POSIX specification, but they are widely used, are provided by most languages, and also appear in documentation.

Note 2: for the corresponding Unicode properties, please refer to the article on Unicode published earlier in this series.

POSIX character groups are also used differently. The main difference is that a PCRE character group shorthand can appear on its own, without square brackets, while a POSIX character group must appear inside a bracket expression. Take matching a digit as an example: standing alone, PCRE can write "\d" directly, while the POSIX character group must be written as "[[:digit:]]".
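Python's re module does not understand POSIX character groups, so one way to experiment with them is to expand the names by hand. The sketch below assumes the ASCII locale and covers only a handful of class names; the table POSIX_ASCII and the helper expand_posix_classes are hypothetical names introduced for illustration.

```python
import re

# ASCII-locale contents of a few POSIX character groups (not an exhaustive list).
POSIX_ASCII = {
    "alpha": "a-zA-Z",
    "digit": "0-9",
    "alnum": "a-zA-Z0-9",
    "lower": "a-z",
    "upper": "A-Z",
    "xdigit": "0-9A-Fa-f",
    "space": r" \t\n\r\f\v",
}

def expand_posix_classes(pattern):
    """Replace [:name:] inside bracket expressions with its ASCII range.

    Raises KeyError for a class name that is not in POSIX_ASCII.
    """
    return re.sub(r"\[:([a-z]+):\]",
                  lambda m: POSIX_ASCII[m.group(1)],
                  pattern)

digits = expand_posix_classes("[[:digit:]]+")
identifier = expand_posix_classes("[[:alpha:]_][[:alnum:]_]*")

print(digits)                                        # [0-9]+
print(identifier)                                    # [a-zA-Z_][a-zA-Z0-9_]*
print(re.fullmatch(digits, "2026") is not None)      # True
print(re.fullmatch(identifier, "foo_1") is not None) # True
```

Because the class name is only meaningful inside the outer square brackets, the helper replaces just the inner "[:name:]" part and leaves the enclosing brackets in place, which also lets a class be combined with other characters, as in "[[:alpha:]_]".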

Generally speaking, POSIX character groups can be used directly in Linux/Unix tools, whereas most PCRE character group shorthands such as "\w" and "\d" are not supported, so don't be surprised to see "[[:space:]]" instead of "\s". Among common programming languages, Java, PHP, and Ruby also support POSIX character groups. In Java and PHP, POSIX character groups match according to the ASCII locale. The case of Ruby is more complicated: Ruby 1.8 matches according to the ASCII locale and does not support "[:word:]" and "[:alnum:]", while Ruby 1.9 matches according to the Unicode locale and supports both "[:word:]" and "[:alnum:]".

Note: this is the last in a series of articles on regular expressions. The author has recently finished a book on regular expressions, which covers the problems encountered when using them in more detail and more comprehensively. The book is tentatively titled "Regular Guide" and is expected to be available soon. Interested readers, please stay tuned.

[1] For the detailed specifications of ERE and BRE, see http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap09.html.

About the author

Yu Sheng is a programmer. Formerly a senior consultant at Shengxiang.com, he now works at Shanda Innovation Institute and is interested in search and distributed algorithms. A translation enthusiast, he has translated "Mastering Regular Expressions" (third edition) and "The Road of Technical Leadership", and is currently writing the "Regular Expression Fool Book" (tentative title). He hopes to contribute a practical regular expression tutorial for domestic developers.
