Detailed introduction of re Module in python

2025-07-12 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/03 Report--

This article introduces the re module in Python. Regular expressions come up constantly in practical work and trip up many people, so the sections below walk through the metacharacters, the match object, and the main functions of the module with examples.

The metacharacters of regular expressions are: . ^ $ * + ? { } [ ] \ | ( )

. matches any character except a newline (any character at all in DOTALL mode)

[] is used to match a specified character class. A character class is the set of characters you want to match; the characters in the set are alternatives, so the class matches any one of them.

^ placed first inside [] negates the class: [^5] matches any character other than 5. If ^ is not the first character in the class, it represents itself.

Metacharacters with repetition function:

* repeats the previous character 0 or more times

+ repeats the previous character 1 or more times

? repeats the previous character 0 or 1 times

{m,n} repeats the previous character from m to n times, where {0,} = *, {1,} = +, {0,1} = ?

{m} repeats the previous character exactly m times
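The repetition rules above can be checked with a small sketch. The helper name fits and the sample strings are made up for illustration; re.fullmatch is used so that a result of True means the whole string fits the pattern.

```python
import re

def fits(pattern, text):
    """Hypothetical helper: True if the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

print(fits(r"ab*", "a"))         # True: "*" allows zero "b"
print(fits(r"ab+", "a"))         # False: "+" needs at least one "b"
print(fits(r"ab?", "abb"))       # False: "?" allows at most one "b"
print(fits(r"ab{2,3}", "abbb"))  # True: between 2 and 3 "b"
print(fits(r"ab{2}", "abbb"))    # False: exactly 2 "b" required
```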

\d matches any decimal digit; it is equivalent to the class [0-9].

\D matches any non-digit character; it is equivalent to the class [^0-9].

\s matches any whitespace character; it is equivalent to the class [ \t\n\r\f\v].

\S matches any non-whitespace character; it is equivalent to the class [^ \t\n\r\f\v].

\w matches any alphanumeric character or underscore; it is equivalent to the class [a-zA-Z0-9_].

\W matches any character that is not alphanumeric or an underscore; it is equivalent to the class [^a-zA-Z0-9_].

These equivalences hold for bytes patterns or with the re.ASCII flag; for str patterns in Python 3, \d, \s and \w also match the corresponding Unicode characters.
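A short sketch of these shorthand classes in action, assuming Python 3 and an invented sample string:

```python
import re

text = "Room 42, floor_3\tcode A7"

digits = re.findall(r"\d", text)               # single digits
spaces = re.findall(r"\s", text)               # spaces and the tab
words = re.findall(r"\w+", text)               # runs of letters, digits, underscore
non_digits = re.findall(r"\D+", "2025-07-12")  # runs of non-digits

print(digits)      # ['4', '2', '3', '7']
print(spaces)      # [' ', ' ', '\t', ' ']
print(words)       # ['Room', '42', 'floor_3', 'code', 'A7']
print(non_digits)  # ['-', '-']
```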

Regular expressions (REs, regexes, regex patterns) are essentially a small, highly specialized programming language embedded in Python and made available through the re module. Regular expression patterns are compiled into a series of bytecodes, which are then executed by a matching engine written in C. Here is a brief introduction to the syntax of regular expressions.
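To make the compile step visible, a pattern can be compiled explicitly with re.compile and the resulting object reused. This is only a minimal sketch; the variable name word and the sample strings are illustrative.

```python
import re

# Compile once, then reuse the compiled pattern object.
word = re.compile(r"\w+")

all_words = word.findall("to be or not")
first_word = word.search("  hello").group()

print(word.pattern)  # the source text of the pattern: \w+
print(all_words)     # ['to', 'be', 'or', 'not']
print(first_word)    # hello
```

The module-level functions such as re.search compile the pattern internally, so explicit compilation mainly helps when the same pattern is used many times.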

Regular expressions use the following metacharacters: . ^ $ * + ? { } [ ] \ | ( )

1. Metacharacter ([]), which is used to specify a character class. A character class is a set of characters that you want to match. Characters can be listed individually, or a range can be given by separating two characters with "-". For example, [abc] matches any of the characters a, b, or c, which can also be written as the range [a-c]. To match a single uppercase letter, use [A-Z].

Metacharacters are not active inside a character class. For example, [akm$] matches any of the characters "a", "k", "m", or "$"; here "$", normally a metacharacter, is an ordinary character.
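The character class rules can be tried out as follows; the sample strings are made up for illustration.

```python
import re

listed = re.findall(r"[abc]", "aXbYcZ")     # characters listed individually
ranged = re.findall(r"[a-c]", "aXbYcZ")     # the same set written as a range
upper = re.findall(r"[A-Z]", "aXbYcZ")      # any single uppercase letter
literal = re.findall(r"[akm$]", "make $5")  # "$" is an ordinary character here

print(listed)   # ['a', 'b', 'c']
print(ranged)   # ['a', 'b', 'c']
print(upper)    # ['X', 'Y', 'Z']
print(literal)  # ['m', 'a', 'k', '$']
```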

2. Metacharacter [^]. A complemented set matches characters that are not in the class. This is done by putting "^" as the first character of the class; "^" anywhere else in the class simply matches the "^" character itself. For example, [^5] matches any character except "5". Outside [], the metacharacter ^ matches the beginning of the string, so "^ab+" matches a string that begins with "a" followed by one or more "b".

An example to verify this:

>>> import re
>>> m = re.search("^ab+", "asdfabbbb")
>>> print(m)
None
>>> m = re.search("ab+", "asdfabbbb")
>>> print(m.group())
abbbb

re.match cannot be used to verify this, because match only ever matches at the beginning of the string, so it cannot show what the metacharacter "^" contributes.

>>> m = re.match("^ab+", "asdfabbbb")
>>> print(m)
None
>>> m = re.match("ab+", "asdfabbbb")
>>> print(m)
None

# Verify the meaning of "^" at different positions inside [].
>>> m = re.search("[^abc]", "abcd")  # "^" as the first character negates the class: any character other than a, b, c
>>> m.group()
'd'
>>> m = re.search("[abc^]", "^")  # "^" that is not the first character in [] is an ordinary character
>>> m.group()
'^'

There is one more point about the metacharacter "^". The official documentation, http://docs.python.org/library/re.html, says of "^": "Matches the start of the string, and in MULTILINE mode also matches immediately after each newline."

In other words, "^" matches at the beginning of the string and, in MULTILINE mode, also at the position right after each newline character, that is, at the start of every line.

>>> m = re.search(r"^a\w+", "abcdfa\na1b2c3")
>>> m.group()
'abcdfa'
>>> m = re.search(r"^a\w+", "abcdfa\na1b2c3", re.MULTILINE)
>>> m.group()
'abcdfa'

With flags set to re.MULTILINE, "^" can also match right after the newline, so one might expect m.group() to include "a1b2c3" as well, yet it does not. The reason is that search and match return as soon as the first match is found rather than collecting every match. findall shows both:

>>> re.findall(r"^a\w+", "abcdfa\na1b2c3", re.MULTILINE)
['abcdfa', 'a1b2c3']
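To see where "^" actually matches in MULTILINE mode, the start position of each match can be printed. This is a small sketch reusing the same sample string; the variable name hits is illustrative.

```python
import re

text = "abcdfa\na1b2c3"

# Collect (start index, matched text) for every line that starts with "a".
hits = [(m.start(), m.group())
        for m in re.finditer(r"^a\w+", text, re.MULTILINE)]

print(hits)  # [(0, 'abcdfa'), (7, 'a1b2c3')]
```

The second match starts at index 7, the position immediately after the newline at index 6.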

3. Metacharacter (\), the backslash. As in Python string literals, the backslash can be followed by various characters to signal special sequences.

It is also used to escape metacharacters so that they can be matched literally in a pattern. For example, to match "[" or "\", precede them with a backslash to remove their special meaning: \[ or \\.
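A minimal sketch of escaping in practice, assuming raw string literals for the patterns so that Python itself does not interpret the backslashes. The sample strings are invented; re.escape is the standard helper for escaping arbitrary literal text.

```python
import re

# A raw string passes the backslashes through to the regex engine unchanged.
brackets = re.findall(r"\[", "a[b[c")          # literal "[" characters
slashes = re.findall(r"\\", r"C:\temp\x")      # literal backslashes

# re.escape builds a pattern that matches the given text literally.
literal = "1+1=2?"
found = re.search(re.escape(literal), "is 1+1=2? yes").group()

print(len(brackets))  # 2
print(len(slashes))   # 2
print(found)          # 1+1=2?
```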

4. The metacharacter ($) matches at the end of the string or just before the newline at the end of the string. (In MULTILINE mode, "$" also matches before every newline.)

The regular expression "foo" matches both "foo" and "foobar" while "foo$" matches only "foo".

>>> re.findall("foo.$", "foo1\nfoo2\n")  # "$" matches just before the newline at the end of the string
['foo2']
>>> re.findall("foo.$", "foo1\nfoo2\n", re.MULTILINE)
['foo1', 'foo2']
>>> m = re.search("foo.$", "foo1\nfoo2\n")
>>> m.group()
'foo2'
>>> m = re.search("foo.$", "foo1\nfoo2\n", re.MULTILINE)
>>> m.group()
'foo1'

So re.MULTILINE changes the behavior of "$" considerably: it then matches before every newline, not only at the end of the string.

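5. Metacharacter (*), which matches 0 or more repetitions of the preceding expression

6. Metacharacter (?), which matches 0 or 1 repetitions of the preceding expression

7. Metacharacter (+), which matches 1 or more repetitions of the preceding expression

8. Metacharacter (|), which means "or": in A|B, where A and B are regular expressions, the result matches either A or B.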
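A brief sketch of alternation, using invented sample strings. It also illustrates that "|" applies to everything on either side of it unless parentheses limit its scope.

```python
import re

animals = re.findall(r"cat|dog", "cat, dog, bird")

# Parentheses restrict the alternation to a single letter.
grouped = re.fullmatch(r"gr(a|e)y", "grey") is not None

# Without parentheses the alternatives are "gra" and "ey".
ungrouped = re.fullmatch(r"gra|ey", "grey") is not None

print(animals)    # ['cat', 'dog']
print(grouped)    # True
print(ungrouped)  # False
```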

9. Metacharacter ({})

{m} matches exactly m copies of the previous regular expression; for example, "a{5}" matches exactly five "a" characters, that is, "aaaaa".

>>> re.findall("a{5}", "aaaaaaaaaa")
['aaaaa', 'aaaaa']
>>> re.findall("a{5}", "aaaaaaaaa")
['aaaaa']

{m,n} matches from m to n copies of the previous regular expression, trying to match as many copies as possible.

>>> re.findall("a{2,4}", "aaaaaaaa")
['aaaa', 'aaaa']

The example shows that the regular expression tries n copies before m; otherwise the result would be the one that "a{2}" gives:

>>> re.findall("a{2}", "aaaaaaaa")
['aa', 'aa', 'aa', 'aa']

{m,n}? matches from m to n copies of the previous regular expression, trying to match as few copies as possible.

>>> re.findall("a{2,4}?", "aaaaaaaa")
['aa', 'aa', 'aa', 'aa']
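The same greedy versus non-greedy distinction applies to * and +. A common illustration, sketched here with a made-up snippet of markup, is matching tags:

```python
import re

html = "<b>bold</b> and <i>italic</i>"

greedy = re.findall(r"<.*>", html)  # ".*" grabs as much as possible
lazy = re.findall(r"<.*?>", html)   # ".*?" stops at the first ">"

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```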

10. The metacharacter ("()") marks the beginning and end of a group.

The most commonly used forms are (...) and (?P<name>...), which are unnamed groups and named groups respectively. A named group can be retrieved with matchObject.group(name), while an unnamed group is retrieved by its group number, starting at 1, such as matchObject.group(1). Concrete usage is shown with examples in the description of the group() method.
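As a quick sketch of both kinds of group in one pattern (the group names year and month and the date string are chosen only for illustration):

```python
import re

m = re.match(r"(?P<year>\d{4})-(?P<month>\d{2})-(\d{2})", "2025-07-12")

by_name = m.group("year")   # named group, accessed by name
by_index = m.group(1)       # named groups are also numbered
unnamed = m.group(3)        # the unnamed third group, by number

print(by_name)   # 2025
print(by_index)  # 2025
print(unnamed)   # 12
```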

11. Metacharacter (.)

In the default mode, "." matches any character except a newline. In DOTALL mode, it matches any character, including a newline.

>>> import re
>>> m = re.match(".", "\n")
>>> print(m)
None
>>> m = re.match(".", "\n", re.DOTALL)
>>> m.group()
'\n'
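The effect of DOTALL is easier to see on a longer string. A small sketch with an invented two-line text:

```python
import re

text = "first line\nsecond line"

default_mode = re.match(r".*", text).group()             # stops at the newline
dotall_mode = re.match(r".*", text, re.DOTALL).group()   # runs across the newline

print(repr(default_mode))  # 'first line'
print(repr(dotall_mode))   # 'first line\nsecond line'
```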

Let's first look at the methods of the match object. Here is a brief introduction to the most commonly used ones.

1.group([group1, …])

Returns one or more subgroups of the match. With a single argument the result is a single string; with several arguments the result is a tuple with one item per argument. Without arguments, group1 defaults to 0 and the whole match is returned. If a groupN argument is 0, the corresponding return value is the entire matching string; if it is in the range [1..99], it is the string matched by the corresponding parenthesized group. If a group number is negative or larger than the number of groups defined in the pattern, an IndexError exception is raised. If a group is contained in a part of the pattern that did not match, the corresponding result is None. If a group is contained in a part of the pattern that matched several times, the last match is returned. Subgroups are numbered from left to right by their opening parentheses.

>>> m = re.match(r"(\w+) (\w+)", "abcd efgh, chaj")
>>> m.group()  # the entire match
'abcd efgh'
>>> m.group(1)  # the subgroup of the first parentheses
'abcd'
>>> m.group(2)
'efgh'
>>> m.group(1, 2)  # more than one argument returns a tuple
('abcd', 'efgh')
>>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "sam lee")
>>> m.group("first_name")  # use group() to get a subgroup by name
'sam'
>>> m.group("last_name")
'lee'

Now remove the parentheses.

>>> m = re.match(r"\w+ \w+", "abcd efgh, chaj")
>>> m.group()
'abcd efgh'
>>> m.group(1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: no such group

If a group matches multiple times, only the last match is accessible:

>>> m = re.match(r"(..)+", "a1b2c3")
>>> m.group(1)
'c3'
>>> m.group()
'a1b2c3'

The default argument of group() is 0, which returns the whole string matched by the pattern.

>>> s = "afkak1aafal12345adadsfa"
>>> pattern = r"(\d)\w+(\d{2})\w"
>>> m = re.match(pattern, s)
>>> print(m)
None
>>> m = re.search(pattern, s)
>>> m.group()
'1aafal12345a'
>>> m.group(1)
'1'
>>> m.group(2)
'45'
>>> m.group(1, 2, 0)
('1', '45', '1aafal12345a')

2.groups([default])

Returns a tuple containing all the subgroups of the match. The default argument sets the value used for groups that did not participate in the match; it defaults to None.

>>> m = re.match(r"(\d+)\.(\d+)", "23.123")
>>> m.groups()
('23', '123')
>>> m = re.match(r"(\d+)\.?(\d+)?", "24")  # the second group does not match, so the default value None is used
>>> m.groups()
('24', None)
>>> m.groups("0")
('24', '0')

3.groupdict([default])

Returns a dictionary of all the named subgroups of the match. The keys are the group names and the values are the matched strings. The default argument is the value used for groups that did not participate in the match, as in groups(); it defaults to None.

>>> m = re.match(r"(\w+) (\w+)", "hello world")
>>> m.groupdict()
{}
>>> m = re.match(r"(?P<first>\w+) (?P<second>\w+)", "hello world")
>>> m.groupdict()
{'first': 'hello', 'second': 'world'}

As the example shows, groupdict() returns only named subgroups; unnamed groups do not appear in it.
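The default argument of groupdict() can be sketched with an optional named group. The group names key and value and the input string are hypothetical, chosen only to show a group that does not participate in the match.

```python
import re

# The value group is optional; in "timeout=" nothing follows the "=".
m = re.match(r"(?P<key>\w+)=(?P<value>\w+)?", "timeout=")

plain = m.groupdict()
filled = m.groupdict("n/a")

print(plain)   # {'key': 'timeout', 'value': None}
print(filled)  # {'key': 'timeout', 'value': 'n/a'}
```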

Regular expression object

search(string[, pos[, endpos]])

Scans string for the first location where the regular expression matches. If a match is found, a MatchObject is returned (only the first match, not all of them). Returns None if nothing matches.

The second parameter, pos, is the index in the string where the search starts. The default is 0.

The third parameter, endpos, limits how far the string is searched. The default is the length of the string.

pos and endpos are accepted by the search method of a compiled pattern object; the module-level re.search(pattern, string[, flags]) used in the following session does not take them.

>>> m = re.search("abcd", "1abcd2abcd")
>>> m.group()  # search returns a match object; its methods give access to the result
'abcd'
>>> m.start()
1
>>> m.end()
5
>>> re.findall("abcd", "1abcd2abcd")
['abcd', 'abcd']
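The pos and endpos parameters can be tried on a compiled pattern. This is a minimal sketch that reuses the sample string from the session above; the variable names are illustrative.

```python
import re

p = re.compile("abcd")
text = "1abcd2abcd"

first = p.search(text).span()      # search from the start
second = p.search(text, 2).span()  # pos=2 skips the first occurrence
limited = p.search(text, 2, 9)     # endpos=9 cuts off the second occurrence

print(first)    # (1, 5)
print(second)   # (6, 10)
print(limited)  # None
```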

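re.split(pattern, string[, maxsplit=0, flags=0])

Splits string by the occurrences of pattern. If pattern contains capturing parentheses, the text of all groups in the pattern is also returned as part of the resulting list.

>>> re.split(r"\W+", "words,words,works", maxsplit=1)
['words', 'words,works']
>>> re.split("[a-z]", "0A3b9z", re.IGNORECASE)
['0A3', '9', '']
>>> re.split("[a-z]+", "0A3b9z", re.IGNORECASE)
['0A3', '9', '']
>>> re.split("[a-zA-Z]+", "0A3b9z")
['0', '3', '9', '']
>>> re.split("[a-f]+", "0a3B9", re.IGNORECASE)  # pitfall: the third positional argument is maxsplit, so the flag is not applied
['0', '3B9']

In the calls where re.IGNORECASE is passed positionally, it is taken as maxsplit (its integer value is 2), not as a flag, which is why "A" and "B" are not treated as separators. To ignore case, pass it by keyword as flags=re.IGNORECASE. Recent Python versions also deprecate passing maxsplit and flags positionally.

If the separator contains a capturing group and it matches at the start of the string, the result starts with an empty string. The same holds for the end of the string.

>>> re.split(r"(\W+)", "...words, words...")
['', '...', 'words', ', ', 'words', '...', '']
>>> re.split(r"(\W+)", "words, words...")
['words', ', ', 'words', '...', '']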
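For comparison, here is a short sketch of re.split with flags and maxsplit passed by keyword, so that re.IGNORECASE really acts as a flag. The sample string is the one from the official documentation example; the variable names are illustrative.

```python
import re

case_insensitive = re.split(r"[a-f]+", "0a3B9", flags=re.IGNORECASE)
case_sensitive = re.split(r"[a-f]+", "0a3B9")
limited = re.split(r"[a-f]+", "0a3B9", maxsplit=1, flags=re.IGNORECASE)

print(case_insensitive)  # ['0', '3', '9']
print(case_sensitive)    # ['0', '3B9']
print(limited)           # ['0', '3B9']
```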

re.findall(pattern, string[, flags])

Returns all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left to right, and matches are returned in the order found. If one or more groups are present in the pattern, a list of groups is returned; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result.

>>> re.findall(r"(\W+)", "words, words...")
[', ', '...']
>>> re.findall(r"(\W+)d", "words, words...d")
['...']
>>> re.findall(r"(\W+)d", "...dwords, words...d")
['...', '...']

re.finditer(pattern, string[, flags])

Similar to findall, except that instead of a list it returns an iterator over match objects.
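Because finditer yields match objects rather than strings, positions are available as well as text. A minimal sketch with an invented input string:

```python
import re

# Collect (span, matched text) for every run of digits.
found = [(m.span(), m.group()) for m in re.finditer(r"\d+", "a12b345c6")]

for span, value in found:
    print(span, value)
# (1, 3) 12
# (4, 7) 345
# (8, 9) 6
```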

Let's look at an example of sub and subn.

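>>> re.sub(r"\d", "abc1def2hijk", "RE")  # arguments in the wrong order: "RE" is the string searched, and it contains no digit
'RE'
>>> x = re.sub(r"\d", "abc1def2hijk", "RE")
>>> x
'RE'
>>> re.sub(r"\d", "RE", "abc1def2hijk")  # correct order: pattern, repl, string
'abcREdefREhijk'
>>> re.subn(r"\d", "RE", "abc1def2hijk")
('abcREdefREhijk', 2)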

The example shows the difference between sub and subn: sub returns the string after replacement, while subn returns a tuple of the string after replacement and the number of substitutions made.

re.sub(pattern, repl, string[, count, flags])

Replaces the occurrences of pattern in string with repl. If the pattern does not match, string is returned unchanged. repl can be either a string or a function. If it is a string, any backslash escapes in it are processed. If repl is a function, it is called for every non-overlapping occurrence of pattern; the function takes a single match object as its argument and returns the replacement string. Here is the example provided in the official documentation:

>>> def dashrepl(matchobj):
...     if matchobj.group(0) == '-': return ' '
...     else: return '-'
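A runnable sketch of a function used as repl, together with the count parameter and subn. The dashrepl function follows the example in the official re documentation; since the snippet in this article is cut off, the else branch and the call are assumptions taken from that documentation. The fruit sentence is invented.

```python
import re

def dashrepl(matchobj):
    # A single dash becomes a space; a double dash becomes a single dash.
    if matchobj.group(0) == '-':
        return ' '
    else:
        return '-'

replaced = re.sub('-{1,2}', dashrepl, 'pro----gram-files')
print(replaced)  # pro--gram files

# count limits the number of substitutions; subn also reports how many were made.
once = re.sub(r"\d+", "N", "3 apples and 10 pears", count=1)
counted = re.subn(r"\d+", "N", "3 apples and 10 pears")
print(once)     # N apples and 10 pears
print(counted)  # ('N apples and N pears', 2)
```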
