
What are the POSIX specifications in Linux

2025-07-08 Update. From: SLTechnology News&Howtos


Many newcomers are unsure which regular expression conventions the POSIX specifications define for Linux. This article explains them in detail for anyone who needs a reference.

POSIX specification

The regular expression notation most people know comes from Perl. It gave rise to a prominent school called PCRE (Perl Compatible Regular Expressions), which is characterized by notations such as "\d", "\w" and "\s". Besides PCRE there are other schools of regular expressions, such as the POSIX regular expressions introduced below.

The full name of POSIX is Portable Operating System Interface for uniX. It consists of a series of specifications that define the functions a UNIX operating system should support, so "POSIX regular expressions" really means "the part of the POSIX specifications that deals with regular expressions". That part defines two major schools: BRE (Basic Regular Expression) and ERE (Extended Regular Expression). On POSIX-compatible UNIX systems, tools such as grep and egrep follow the POSIX specification, and the regular expressions in some database systems also conform to it.

BRE

Among the common Linux/Unix tools, grep, vi and sed belong to the BRE school, whose syntax looks odd. The metacharacters "(", ")", "{" and "}" must be escaped before they take on a special meaning, so the regular expression "(a)b" matches only the string (a)b, not the string ab. Likewise, the regular expression "a{1,2}" matches only the string a{1,2}, while "a\{1,2\}" matches either a or aa.
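To make the escaping rule concrete, here is a minimal Python sketch. Python's re module is not a POSIX engine, so the helper bre_to_python (a hypothetical name, not part of any library) rewrites a strict-BRE pattern into roughly equivalent Python syntax by swapping which characters need a backslash. It assumes the pattern contains no bracket expressions and ignores other BRE corner cases.

```python
import re

# Metacharacters only when escaped in BRE, but when bare in Python.
SWAPPED = set("(){}")
# Ordinary characters in strict POSIX BRE, but special in Python.
LITERAL_IN_BRE = set("+?|")


def bre_to_python(bre: str) -> str:
    """Translate a simple strict-BRE pattern into Python `re` syntax.

    Illustrative only: bracket expressions are not handled.
    """
    out = []
    i = 0
    while i < len(bre):
        ch = bre[i]
        if ch == "\\" and i + 1 < len(bre):
            nxt = bre[i + 1]
            # \( \) \{ \} are metacharacters in BRE: emit them bare.
            out.append(nxt if nxt in SWAPPED else ch + nxt)
            i += 2
        elif ch in SWAPPED or ch in LITERAL_IN_BRE:
            # Bare ( ) { } + ? | are literals in BRE: escape them for Python.
            out.append("\\" + ch)
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


print(bre_to_python("(a)b"))        # \(a\)b  -> matches only "(a)b"
print(bre_to_python(r"a\{1,2\}"))   # a{1,2}  -> matches "a" or "aa"
print(bool(re.fullmatch(bre_to_python("(a)b"), "ab")))         # False
print(bool(re.fullmatch(bre_to_python(r"a\{1,2\}"), "aa")))    # True
```

The translation is purely syntactic; it says nothing about matching semantics such as POSIX leftmost-longest behavior.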

The syntax is this awkward because these tools were born very early, while many regular expression features evolved gradually. When the tools were written, these characters had no special meaning, so to stay backward compatible the special meaning could only be introduced through escaping. Some features are missing altogether: BRE has no "+" or "?" quantifiers and no alternation "(…|…)". Backreferences "\1", "\2", … are a different case, because POSIX BRE does define them.

Today, however, pure BRE is rare. It is now taken for granted that regular expressions support features such as alternation and backreferences, and doing without them is genuinely inconvenient. So although vi belongs to the BRE school, it provides these features. GNU also extends BRE to support "+", "?" and "|", which must be written as "\+", "\?" and "\|"; backreferences such as "\1" and "\2" are available as well. Tools such as GNU grep are therefore nominally BRE, but the more exact name is GNU BRE.

ERE

Among the common Linux/Unix tools, egrep and awk belong to the ERE school. Although BRE is called "basic" and ERE "extended", ERE does not require compatibility with BRE syntax; it is a self-contained system. Its metacharacters need no escaping (putting a backslash before a metacharacter cancels its special meaning), so "(ab|cd)" matches the string ab or cd, and the quantifiers "+", "?" and "{n,m}" can be used directly. ERE does not explicitly define backreferences, but many tools support "\1", "\2" and so on.

GNU tools such as egrep belong to the ERE school (the more accurate name is GNU ERE). Because GNU has extended BRE so heavily, however, "GNU ERE" is really just a label: it offers the same features as GNU BRE, only without the need to escape the metacharacters.
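Since the two GNU flavors differ mainly in escaping, converting between them can be sketched as toggling the backslash on seven characters. The function name flip_escapes below is hypothetical, and the sketch assumes the pattern has no bracket expressions (inside brackets these characters are always literal).

```python
import re

# Characters whose backslash is toggled between GNU BRE and GNU ERE.
FLIPPED = set("(){}+?|")


def flip_escapes(pattern: str) -> str:
    """Convert GNU BRE <-> GNU ERE by toggling backslashes on ( ) { } + ? |.

    Illustrative only: bracket expressions are not handled.
    """
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            # Escaped special character becomes bare; other escapes are kept.
            out.append(nxt if nxt in FLIPPED else ch + nxt)
            i += 2
        else:
            # Bare special character gains a backslash.
            out.append("\\" + ch if ch in FLIPPED else ch)
            i += 1
    return "".join(out)


gnu_bre = r"\(ab\|cd\)\+"
ere = flip_escapes(gnu_bre)
print(ere)                                # (ab|cd)+
print(flip_escapes(ere) == gnu_bre)       # True
# For this pattern, Python's own syntax happens to agree with ERE.
print(bool(re.fullmatch(ere, "abcd")))    # True
```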

The following table briefly summarizes the differences between the POSIX schools [1]. In practice there is little functional difference between GNU BRE and GNU ERE today; the main difference lies in how metacharacters are escaped.

Descriptions of the POSIX schools

School    Description                                                                  Tools
BRE       ( ) { } must be escaped; + ? | are not supported                             grep, sed, vi (vi also accepts alternation, written \|)
GNU BRE   ( ) { } + ? | must all be escaped                                            GNU grep, GNU sed
ERE       No escaping needed; + ? ( ) { } | work directly; \1 \2 are not guaranteed    egrep, awk
GNU ERE   No escaping needed; + ? ( ) { } | work directly; \1 \2 are supported         grep -E (GNU awk shares the syntax but has no backreferences in patterns)

For ease of reference, the following table lists how basic regular expression features are written in common tools. Where versions differ, the GNU version of each tool is used.

Representations in common Linux/Unix tools

PCRE notation   vi/vim       grep         awk             sed
*               *            *            *               *
+               \+           \+           +               \+
?               \=           \?           ?               \?
{m,n}           \{m,n}       \{m,n\}      {m,n}           \{m,n\}
\b *            \< \>        \< \>        \y \< \>        \< \>
(…|…)           \(…\|…\)     \(…\|…\)     (…|…)           \(…\|…\)
(…)             \(…\)        \(…\)        (…)             \(…\)
\1 \2           \1 \2        \1 \2        not supported   \1 \2

The sed column shows the default (BRE) mode; with the -E option, GNU sed takes the unescaped ERE forms instead.

Note: PCRE commonly uses \b to mean "the beginning or end of a word". In Linux/Unix tools, \< usually matches the beginning of a word and \> the end of a word, while \y in GNU awk matches either position.
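The difference between the two-sided boundary and the one-sided anchors can be seen in a small Python sketch. Python has no \< or \>, so the lookaround patterns below are one possible emulation (an assumption for illustration, not how the POSIX tools implement them); the names WORD_START, WORD_END and positions are hypothetical.

```python
import re

# Python, like PCRE, spells the two-sided boundary \b.
WORD_START = r"\b(?=\w)"    # stands in for \<  (boundary followed by a word character)
WORD_END = r"\b(?<=\w)"     # stands in for \>  (boundary preceded by a word character)


def positions(pattern: str, text: str) -> list:
    """Return the offsets at which a zero-width pattern matches."""
    return [m.start() for m in re.finditer(pattern, text)]


text = "hi there"
either = positions(r"\b", text)
starts = positions(WORD_START, text)
ends = positions(WORD_END, text)

print(either)   # [0, 2, 3, 8]
print(starts)   # [0, 3]
print(ends)     # [2, 8]
```

Every position matched by \b is either a word start or a word end, which is why a single \b (or \y) can stand in for both anchors.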

POSIX character classes

In some documents you will also find notations like "[:digit:]" and "[:lower:]". They are not hard to read (digit means "digits" and lower means "lowercase"), but they look odd: these are POSIX character classes. They appear not only in the common Linux/Unix tools but also in some programming languages, so a brief introduction is worthwhile to avoid confusion.

In the POSIX specification, notations such as "[a-z]" and "[aeiou]" are still legal, and they mean the same as character classes in PCRE. The exact name of this notation is the POSIX bracket expression, and it is used mainly on Unix/Linux systems. The main difference from a PCRE character class is that inside a POSIX bracket expression the backslash \ is not an escape character. The POSIX bracket expression "[\d]" therefore matches only the characters \ and d, not the digits that "[0-9]" matches.

To handle special characters without an escape mechanism, POSIX bracket expressions rely on position. To include the character ] itself (rather than have it close the expression), place it immediately after the opening bracket, so the POSIX regular expression "[]a]" matches the characters ] and a. To include the character - itself (rather than have it denote a range), place it immediately before the closing bracket, so "[a-]" matches the characters a and -.
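These rules can be tried out in Python with one adjustment. Python's re module honors the same positional rules for ] and -, but it does treat the backslash as an escape inside brackets, so the sketch below doubles every backslash first. Both helper names are hypothetical, and the code assumes a single bracket expression with no named classes inside it.

```python
import re


def posix_bracket_to_python(expr: str) -> str:
    """Double every backslash so Python treats it as a literal, as POSIX does.

    Illustrative only: assumes `expr` is one bracket expression without
    POSIX character classes such as [:digit:].
    """
    return expr.replace("\\", "\\\\")


def matched_chars(expr: str, candidates: str) -> str:
    """Return the candidate characters that the bracket expression accepts."""
    regex = re.compile(posix_bracket_to_python(expr))
    return "".join(c for c in candidates if regex.fullmatch(c))


print(matched_chars(r"[\d]", "\\d7"))   # \d   (backslash and d, but not 7)
print(matched_chars("[]a]", "]ab"))     # ]a
print(matched_chars("[a-]", "a-b"))     # a-
```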

The POSIX specification also defines POSIX character classes, which are roughly equivalent to PCRE's shorthand character classes: an intuitive name stands for a set of characters, such as digit for "digits" and alpha for "letters".

POSIX has one more noteworthy concept: the locale. A locale is a set of language- and culture-related settings, including the date format, the currency format, the character encoding, and so on. The meaning of a POSIX character class changes with the locale. The following table lists, for reference, what the common POSIX character classes mean in an ASCII locale and in a Unicode locale.

POSIX character classes

Class        Description                              ASCII locale                            Unicode locale
[:alnum:]    letters and digits                       [a-zA-Z0-9]                             [\p{L&}\p{Nd}]
[:alpha:]    letters                                  [a-zA-Z]                                \p{L&}
[:ascii:]*   ASCII characters                         [\x00-\x7F]                             \p{InBasicLatin}
[:blank:]    space and tab                            [ \t]                                   [\p{Zs}\t]
[:cntrl:]    control characters                       [\x00-\x1F\x7F]                         \p{Cc}
[:digit:]    digits                                   [0-9]                                   \p{Nd}
[:graph:]    characters other than whitespace         [\x21-\x7E]                             [^\p{Z}\p{C}]
[:lower:]    lowercase letters                        [a-z]                                   \p{Ll}
[:print:]    like [:graph:], but including the space  [\x20-\x7E]                             \P{C}
[:punct:]    punctuation                              [][!"#$%&'()*+,./:;<=>?@\^_`{|}~-]      [\p{P}\p{S}]
[:space:]    whitespace                               [ \t\r\n\v\f]                           [\p{Z}\t\r\n\v\f]
[:upper:]    uppercase letters                        [A-Z]                                   \p{Lu}
[:word:]*    word characters                          [A-Za-z0-9_]                            [\p{L}\p{N}\p{Pc}]
[:xdigit:]   hexadecimal digits                       [A-Fa-f0-9]                             [A-Fa-f0-9]

Note 1: classes marked with * are not part of the POSIX specification, but they are widely used, are provided by many languages and tools, and also appear in documentation.

Note 2: for the corresponding Unicode properties, please refer to the section on Unicode published earlier in this series.
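The effect of the ASCII versus Unicode columns can be illustrated by analogy in Python. Python's re module implements neither POSIX character classes nor POSIX locales, so this sketch uses the shorthand \d and the re.ASCII flag as stand-ins for [:digit:] under the two kinds of locale; the variable names are illustrative.

```python
import re

ARABIC_INDIC_THREE = "\u0663"  # a decimal digit outside ASCII (Unicode category Nd)

unicode_digit = re.compile(r"\d")            # Unicode-aware for str patterns
ascii_digit = re.compile(r"\d", re.ASCII)    # restricted to [0-9]

matches_unicode = bool(unicode_digit.fullmatch(ARABIC_INDIC_THREE))
matches_ascii = bool(ascii_digit.fullmatch(ARABIC_INDIC_THREE))

print(matches_unicode)   # True
print(matches_ascii)     # False
```

The same pattern text accepts a different set of characters depending on the environment it is compiled for, which is the point the table makes about POSIX locales.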

POSIX character classes are also used differently. The main difference is that a PCRE shorthand can appear on its own, without square brackets, while a POSIX character class must appear inside a bracket expression. To match a single digit, PCRE lets you write "\d" directly, whereas POSIX requires "[[:digit:]]".
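As a minimal sketch of how the named classes sit inside a bracket expression, the following Python snippet expands a few of them to their ASCII-locale equivalents from the table above and hands the result to the re module. The mapping POSIX_ASCII and the function expand_posix_classes are hypothetical names, cover only a subset of the classes, and assume an ASCII locale.

```python
import re

# ASCII-locale expansions for a subset of the POSIX classes (bracket bodies only).
POSIX_ASCII = {
    "[:alnum:]": "a-zA-Z0-9",
    "[:alpha:]": "a-zA-Z",
    "[:blank:]": " \\t",
    "[:digit:]": "0-9",
    "[:lower:]": "a-z",
    "[:space:]": " \\t\\r\\n\\v\\f",
    "[:upper:]": "A-Z",
    "[:xdigit:]": "A-Fa-f0-9",
}


def expand_posix_classes(pattern: str) -> str:
    """Replace each [:name:] with its ASCII range, keeping the outer brackets."""
    for name, body in POSIX_ASCII.items():
        pattern = pattern.replace(name, body)
    return pattern


print(expand_posix_classes("[[:digit:]]+"))                # [0-9]+
print(expand_posix_classes("[[:alpha:]_][[:alnum:]_]*"))   # [a-zA-Z_][a-zA-Z0-9_]*

identifier = re.compile(expand_posix_classes("[[:alpha:]_][[:alnum:]_]*"))
print(bool(identifier.fullmatch("var_1")))   # True
print(bool(identifier.fullmatch("1var")))    # False
```

Because only the inner [:name:] is replaced, the outer pair of brackets survives, which mirrors why the class has to be written as "[[:digit:]]" rather than "[:digit:]".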
