Introduction to Python regular expressions, with examples

2025-07-06 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/02 Report--

This article introduces Python regular expressions through worked examples, collecting simple and practical ways of using them.

Python has included the re module since version 1.5; it provides Perl-style regular expression patterns. Before Python 1.5, Emacs-style patterns were provided through the regex module. Emacs-style patterns are less readable and less powerful, so avoid the regex module when writing new code, although you may occasionally come across it in old code.

1. Regular expression basics

1.1. A brief introduction

Regular expressions are not part of Python itself. They are a powerful tool for processing strings, with their own syntax and an independent matching engine. They may be less efficient than the built-in str methods, but they are far more capable. Because of this independence, regular expression syntax is largely the same across the languages that support it; what differs is how much of the syntax each implementation supports. Unsupported syntax is usually the rarely used part, so if you have already used regular expressions in another language, a quick read of this section is enough.

[Figure: the process of matching with a regular expression]

Roughly, matching proceeds by comparing the expression with the characters of the text one after another: if every character matches, the match succeeds; as soon as one character fails to match, the match fails. Quantifiers and boundaries change the process slightly, but it remains easy to follow once you have looked at the example in the figure and tried it a few times.

[Figure: regular expression metacharacters and syntax supported by Python]

1.2. Greedy mode and non-greedy mode of quantifiers

Regular expressions are often used to find matching strings in text. Quantifiers in Python are greedy by default (in a few languages the default may be non-greedy): they always try to match as many characters as possible. Non-greedy quantifiers do the opposite and always try to match as few characters as possible. For example, searching "abbbc" with the regular expression "ab*" finds "abbb", whereas the non-greedy "ab*?" finds "a".
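The difference can be checked directly. The following is a minimal sketch that runs both quantifiers over the same input with re.search from the standard re module (described in section 2); the variable names are illustrative only.

```python
import re

text = "abbbc"

# Greedy: b* consumes as many 'b' characters as possible.
greedy = re.search(r"ab*", text).group()

# Non-greedy: b*? consumes as few 'b' characters as possible (here, none).
lazy = re.search(r"ab*?", text).group()

print(greedy)  # abbb
print(lazy)    # a
```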

1.3. The backslash problem

As in most programming languages, regular expressions use "\" as the escape character, which can lead to a pile-up of backslashes. To match a literal "\" in the text, the regular expression written in an ordinary string literal needs four backslashes, "\\\\": the programming language first turns each pair into a single backslash, leaving two, and the regular expression engine then reads those two as one escaped backslash. Python's raw strings solve this neatly: the regular expression in this example can be written as r"\\". Similarly, "\\d", which matches a digit, can be written as r"\d". With raw strings you no longer need to worry about a missing backslash, and expressions are easier to read.
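As a small illustration of the point about raw strings, the sketch below compares the ordinary and raw spellings of the same pattern. The sample text and variable names are made up for this example.

```python
import re

text = "C:\\temp"   # the string C:\temp, which contains one backslash

plain = "\\\\"      # ordinary literal: four backslashes in the source code
raw = r"\\"         # raw literal: two backslashes in the source code

# Both literals denote the same two-character pattern,
# and that pattern matches a single backslash in the text.
same = (plain == raw)
found = re.search(raw, text).group()

# r"\d" is the same pattern as "\\d": it matches one digit.
digits = re.findall(r"\d", "a1b22")

print(same)    # True
print(found)   # \
print(digits)  # ['1', '2', '2']
```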

1.4. Matching modes

Regular expressions provide several matching modes (flags), such as ignoring case and multiline matching. They are described together with re.compile(pattern[, flags]), the factory function of the Pattern class.

2. The re module

2.1. Start using re

Python provides support for regular expressions through the re module. The usual way to use re is to first compile the string form of the regular expression into a Pattern instance, then use the Pattern instance to process the text and obtain a matching result (a Match instance), and finally use the Match instance to get the information you need.

    # encoding: UTF-8
    import re

    # Compile the regular expression into a Pattern object
    pattern = re.compile(r'hello')

    # Use the Pattern to match text; match() returns None if there is no match
    match = pattern.match('hello world!')

    if match:
        # Use the Match object to get the matched text
        print(match.group())

    ### output ###
    # hello

re.compile(strPattern[, flag]):

This function is the factory for the Pattern class; it compiles a regular expression given as a string into a Pattern object. The second parameter, flag, is the matching mode. Several modes can be combined with the bitwise OR operator '|' so that they take effect together, for example re.I | re.M. Alternatively, the mode can be specified inside the regex string: re.compile('pattern', re.I | re.M) is equivalent to re.compile('(?im)pattern').
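The equivalence of the two spellings can be sketched as follows. The sample text and the '^' anchor are additions for this illustration, chosen so that both the I and M flags affect the result.

```python
import re

text = "first line\nPATTERN on the second line"

# Flags passed as the second argument, combined with bitwise OR.
with_argument = re.compile(r"^pattern", re.I | re.M)

# The same flags written inline at the start of the expression.
inline = re.compile(r"(?im)^pattern")

result_argument = with_argument.findall(text)
result_inline = inline.findall(text)

print(result_argument)  # ['PATTERN']
print(result_inline)    # ['PATTERN']
```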

Available values are:

re.I (re.IGNORECASE): ignore case (the full name is given in parentheses, likewise below)

M (MULTILINE): multiline mode; changes the behavior of '^' and '$' so that they also match at the start and end of each line

S (DOTALL): dot-matches-all mode; changes the behavior of '.' so that it also matches a newline

L (LOCALE): makes the predefined character classes \w \W \b \B \s \S depend on the current locale

U (UNICODE): makes the predefined character classes \w \W \b \B \s \S \d \D depend on the character properties defined by Unicode

X (VERBOSE): verbose mode. In this mode a regular expression can span multiple lines, whitespace is ignored, and comments can be added.

The following two regular expressions are equivalent:

    a = re.compile(r"""\d +  # the integral part
                       \.    # the decimal point
                       \d *  # some fractional digits""", re.X)
    b = re.compile(r"\d+\.\d*")
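The MULTILINE and DOTALL flags from the list above can be checked in the same spirit. This is a minimal sketch using the module-level functions re.findall and re.match; the two-line sample text is made up for the illustration.

```python
import re

text = "one\ntwo"

# Without re.M, '^' matches only at the start of the whole string.
starts_default = re.findall(r"^\w+", text)
# With re.M, '^' also matches right after each newline.
starts_multiline = re.findall(r"^\w+", text, re.M)

# Without re.S, '.' does not match a newline.
dot_default = re.match(r"one.two", text)
# With re.S, '.' matches any character, including a newline.
dot_all = re.match(r"one.two", text, re.S)

print(starts_default)           # ['one']
print(starts_multiline)         # ['one', 'two']
print(dot_default is None)      # True
print(dot_all.group() == text)  # True
```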

The re module also provides a number of module-level functions for working with regular expressions. Each can be replaced by the corresponding method of a Pattern instance; their only advantage is saving one line of re.compile() code, at the cost of not being able to reuse the compiled Pattern object. These functions are described together with the instance methods of the Pattern class. For example, the example above can be shortened to:

    m = re.match(r'hello', 'hello world!')
    print(m.group())

The re module also provides the function escape(string), which returns string with an escape character added in front of regular expression metacharacters such as * / + / ?. It is useful when you need to match text that contains many metacharacters.
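A short sketch of why escape() helps, assuming Python 3.4 or later for re.fullmatch. The sample string is made up; the exact text returned by re.escape differs between Python versions, so it is not printed here.

```python
import re

literal = "1+1=2? (yes*)"

# Used directly as a pattern, the metacharacters + ? ( ) * change the meaning,
# so the pattern does not match the very text it was copied from.
unescaped = re.search(literal, literal)

# After escaping, every metacharacter is matched literally.
escaped = re.escape(literal)
matched = re.fullmatch(escaped, literal)

print(unescaped is None)    # True
print(matched is not None)  # True
```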

2.2. Match

The Match object is the result of a match and contains a lot of information about the match, which can be obtained using the readable properties or methods provided by Match.

Attributes:

string: the text used in the match.

re: the Pattern object used in the match.

pos: the index in the text at which the regular expression starts searching. It has the same value as the parameter of the same name of Pattern.match() and Pattern.search().

endpos: the index in the text at which the regular expression stops searching. It has the same value as the parameter of the same name of Pattern.match() and Pattern.search().

lastindex: the number of the last captured group. None if no group was captured.

lastgroup: the name of the last captured group. None if that group has no name or no group was captured.

Methods:

group([group1, ...]):

Returns the string captured by one or more groups; when several arguments are given, the result is a tuple. group1 can be a number or a name; the number 0 stands for the entire matched substring, and with no arguments group(0) is returned. A group that captured nothing returns None; a group that captured several times returns the last captured substring.

groups([default]):

Returns the strings captured by all groups as a tuple. It is equivalent to calling group(1, 2, ... last). default is the value used for groups that captured nothing; it defaults to None.

groupdict([default]):

Returns a dictionary whose keys are the names of the named groups and whose values are the substrings captured by those groups. Groups without names are not included. default has the same meaning as above.

start([group]):

Returns the index in string where the substring captured by the specified group starts (the index of the first character of the substring). group defaults to 0.

end([group]):

Returns the index in string where the substring captured by the specified group ends (the index of the last character of the substring + 1). group defaults to 0.

span([group]):

Returns (start(group), end(group)).

expand(template):

Substitutes the matched groups into template and returns the result. In template, groups can be referenced with \id or with \g<id> and \g<name>. \id is equivalent to \g<id>, but \10 is read as the 10th group, so to write the character '0' right after \1 you must use \g<1>0. The number 0 cannot be used in the \id form; the whole match is written \g<0>.

    import re

    m = re.match(r'(\w+) (\w+)(?P<sign>.*)', 'hello world!')

    print("m.string:", m.string)
    print("m.re:", m.re)
    print("m.pos:", m.pos)
    print("m.endpos:", m.endpos)
    print("m.lastindex:", m.lastindex)
    print("m.lastgroup:", m.lastgroup)

    print("m.group(1,2):", m.group(1, 2))
    print("m.groups():", m.groups())
    print("m.groupdict():", m.groupdict())
    print("m.start(2):", m.start(2))
    print("m.end(2):", m.end(2))
    print("m.span(2):", m.span(2))
    print(r"m.expand(r'\2 \1\3'):", m.expand(r'\2 \1\3'))

    ### output ###
    # m.string: hello world!
    # m.re: re.compile('(\\w+) (\\w+)(?P<sign>.*)')
    # m.pos: 0
    # m.endpos: 12
    # m.lastindex: 3
    # m.lastgroup: sign
    # m.group(1,2): ('hello', 'world')
    # m.groups(): ('hello', 'world', '!')
    # m.groupdict(): {'sign': '!'}
    # m.start(2): 6
    # m.end(2): 11
    # m.span(2): (6, 11)
    # m.expand(r'\2 \1\3'): world hello!
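Two details of the Match methods are easy to miss: the default argument of groups() and groupdict(), and the \g<1>0 form in expand(). The sketch below illustrates both; the pattern with an optional named group is a variation made up for this example, not the one used above.

```python
import re

# The third group is optional and does not take part in this match.
m = re.match(r"(\w+) (\w+)(?P<sign>!)?", "hello world")

all_groups = m.groups()            # unmatched group reported as None
with_default = m.groups("n/a")     # unmatched group replaced by the default
named = m.groupdict("")            # same idea for named groups

# \g<1>0 means "group 1 followed by the character 0";
# \10 would instead be read as a reference to group 10.
followed_by_zero = m.expand(r"\g<1>0")
swapped = m.expand(r"\g<2> \1")

print(all_groups)        # ('hello', 'world', None)
print(with_default)      # ('hello', 'world', 'n/a')
print(named)             # {'sign': ''}
print(followed_by_zero)  # hello0
print(swapped)           # world hello
```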

2.3. Pattern

The Pattern object is a compiled regular expression that matches the text through a series of methods provided by Pattern.

Pattern cannot be instantiated directly and must be constructed using re.compile().

Pattern provides several readable properties to get information about the expression:

pattern: the expression string used when compiling.

flags: the matching mode used when compiling, in numeric form.

groups: the number of groups in the expression.

groupindex: a dictionary whose keys are the names of the named groups in the expression and whose values are the corresponding group numbers. Groups without names are not included.

    import re

    p = re.compile(r'(\w+) (\w+)(?P<sign>.*)', re.DOTALL)

    print("p.pattern:", p.pattern)
    print("p.flags:", p.flags)
    print("p.groups:", p.groups)
    print("p.groupindex:", dict(p.groupindex))

    ### output ###
    # p.pattern: (\w+) (\w+)(?P<sign>.*)
    # p.flags: 48   (re.DOTALL | re.UNICODE for a str pattern in Python 3)
    # p.groups: 3
    # p.groupindex: {'sign': 3}

Instance methods [| re module functions]:

match(string[, pos[, endpos]]) | re.match(pattern, string[, flags]):

This method tries to match the pattern starting at index pos of string. If the pattern is matched through to its end, a Match object is returned; if the pattern fails to match along the way, or the matching reaches endpos before the pattern is complete, None is returned.

The default values of pos and endpos are 0 and len(string) respectively; re.match() cannot specify these two parameters, and its flags parameter specifies the matching mode used when compiling pattern.

Note: this method does not require the whole string to match. If characters remain in string after the pattern ends, the match still counts as a success. To require a complete match, add the boundary matcher '$' at the end of the expression.

For an example, see section 2.1.
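The note about incomplete matches can be sketched as follows. Pattern.fullmatch() is not mentioned in this article; it is included here as an assumption that Python 3.4 or later is in use.

```python
import re

pattern = re.compile(r"hello")
anchored = re.compile(r"hello$")

# match() succeeds even though " world!" is left over after the pattern ends.
loose = pattern.match("hello world!")

# With '$' the pattern must also reach the end of the string.
strict_fail = anchored.match("hello world!")
strict_ok = anchored.match("hello")

# fullmatch() requires the whole string to match without changing the pattern.
full_fail = pattern.fullmatch("hello world!")

print(loose.group())        # hello
print(strict_fail is None)  # True
print(strict_ok.group())    # hello
print(full_fail is None)    # True
```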

search(string[, pos[, endpos]]) | re.search(pattern, string[, flags]):

This method looks for a substring of string that matches. It tries to match the pattern starting at index pos of string and returns a Match object if the pattern is matched through to its end; if the match fails, pos is increased by 1 and the match is tried again; if nothing has matched by the time pos = endpos, None is returned.

The default values of pos and endpos are 0 and len(string) respectively; re.search() cannot specify these two parameters, and its flags parameter specifies the matching mode used when compiling pattern.
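The pos and endpos parameters are only available on the methods of a compiled Pattern. A minimal sketch of their effect, with a made-up sample text:

```python
import re

pattern = re.compile(r"\d+")
text = "a1b22c333"

first = pattern.search(text)          # search starts at index 0
later = pattern.search(text, 2)       # search starts at index 2
bounded = pattern.search(text, 6, 8)  # only text[6:8] is considered

print(first.group(), first.span())      # 1 (1, 2)
print(later.group(), later.span())      # 22 (3, 5)
print(bounded.group(), bounded.span())  # 33 (6, 8)
print(bounded.pos, bounded.endpos)      # 6 8
```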

    # encoding: UTF-8
    import re

    # Compile the regular expression into a Pattern object
    pattern = re.compile(r'world')

    # Use search() to find a matching substring; it returns None if there is none.
    # In this example, match() would not succeed.
    match = pattern.search('hello world!')

    if match:
        # Use the Match object to get the matched text
        print(match.group())

    ### output ###
    # world

split(string[, maxsplit]) | re.split(pattern, string[, maxsplit]):

Splits string at each matching substring and returns the pieces as a list. maxsplit specifies the maximum number of splits; if it is not given, string is split at every match.
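The effect of maxsplit can be sketched like this; the variable names are illustrative only.

```python
import re

p = re.compile(r"\d+")
text = "one1two2three3four4"

# At most two splits: the rest of the string is returned unsplit.
two_splits = p.split(text, 2)

# The module-level function accepts the same limit.
one_split = re.split(r"\d+", text, maxsplit=1)

print(two_splits)  # ['one', 'two', 'three3four4']
print(one_split)   # ['one', 'two2three3four4']
```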

    import re

    p = re.compile(r'\d+')
    print(p.split('one1two2three3four4'))

    ### output ###
    # ['one', 'two', 'three', 'four', '']

findall(string[, pos[, endpos]]) | re.findall(pattern, string[, flags]):

Searches string and returns all matching substrings as a list.

    import re

    p = re.compile(r'\d+')
    print(p.findall('one1two2three3four4'))

    ### output ###
    # ['1', '2', '3', '4']

finditer(string[, pos[, endpos]]) | re.finditer(pattern, string[, flags]):

Searches string and returns an iterator that yields each match result (Match object) in order.

    import re

    p = re.compile(r'\d+')
    for m in p.finditer('one1two2three3four4'):
        print(m.group(), end=' ')

    ### output ###
    # 1 2 3 4

sub(repl, string[, count]) | re.sub(pattern, repl, string[, count]):

Replaces each matching substring in string with repl and returns the resulting string.

When repl is a string, groups can be referenced with \id or with \g<id> and \g<name>. The number 0 cannot be used in the \id form; the whole match is written \g<0>.

When repl is a function, it should take a single parameter (the Match object) and return the string to use as the replacement (group references are no longer expanded in the returned string).

count specifies the maximum number of replacements; if it is not given, all matches are replaced.
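Named references, count, and \g<0> can be sketched together. The group names first and second are made up for this illustration; \g<0> is the form the Python re documentation gives for the whole match.

```python
import re

p = re.compile(r"(?P<first>\w+) (?P<second>\w+)")
s = "I say, hello world!"

# Named references in the replacement string.
swapped = p.sub(r"\g<second> \g<first>", s)

# count=1 replaces only the first match.
swapped_once = p.sub(r"\2 \1", s, count=1)

# \g<0> stands for the whole match.
bracketed = p.sub(r"[\g<0>]", s)

print(swapped)       # say I, world hello!
print(swapped_once)  # say I, hello world!
print(bracketed)     # [I say], [hello world]!
```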

    import re

    p = re.compile(r'(\w+) (\w+)')
    s = 'I say, hello world!'

    print(p.sub(r'\2 \1', s))

    def func(m):
        return m.group(1).title() + ' ' + m.group(2).title()

    print(p.sub(func, s))

    ### output ###
    # say I, world hello!
    # I Say, Hello World!

subn(repl, string[, count]) | re.subn(pattern, repl, string[, count]):

Returns (sub(repl, string[, count]), number of replacements).

    import re

    p = re.compile(r'(\w+) (\w+)')
    s = 'I say, hello world!'

    print(p.subn(r'\2 \1', s))

    def func(m):
        return m.group(1).title() + ' ' + m.group(2).title()

    print(p.subn(func, s))

    ### output ###
    # ('say I, world hello!', 2)
    # ('I Say, Hello World!', 2)
