
How to analyze Python regular expression re module

2025-03-01 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/02 Report--

This article explains how to use Python's regular expression module, re: compiling patterns, matching and searching text, splitting and substituting, and the difference between greedy and non-greedy matching.

Brief introduction

A regular expression is a pattern that matches snippets of text. The simplest regular expression is an ordinary string, which matches itself. For example, the regular expression 'hello' matches the string 'hello'.

Note that a regular expression is not a program but a pattern for processing strings. To process strings with it, you need a tool that supports regular expressions, such as awk, sed, or grep on Linux, or a programming language such as Perl, Python, or Java.

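Regular expressions come in many different flavors. The metacharacters used in this article follow the syntax shared by Python and Perl: \d matches a digit, \w a word character, \s a whitespace character, . any character except a newline, + one or more repetitions, * zero or more repetitions, ? zero or one repetition (or, after another quantifier, non-greedy matching), [...] a character set, and (...) a group.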

Re module

In Python, we can use the built-in re module to use regular expressions.

Note that regular expressions use \ to escape special characters. For example, to match the string 'python.org' we need the regular expression 'python\.org'. Python string literals also use \ for escaping, so in an ordinary string the regular expression above has to be written as 'python\\.org', and it is easy to get lost among the backslashes. We therefore recommend Python raw strings, which take an r prefix; the regular expression above can then be written as:

    r'python\.org'
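As a quick check of the escaping rules described above, here is a minimal sketch (written for Python 3; the variable names are illustrative only) that compares the doubled-backslash literal with the raw-string literal and shows why the dot has to be escaped.

```python
import re

escaped = 'python\\.org'   # ordinary string literal: backslash doubled
raw = r'python\.org'       # raw string literal: backslash written once

# Both literals produce exactly the same pattern text.
same_pattern = (escaped == raw)

# The escaped dot matches only a literal '.'.
hit = re.match(raw, 'python.org') is not None
miss = re.match(raw, 'pythonXorg') is not None

# An unescaped dot matches any character, so 'pythonXorg' slips through.
loose = re.match('python.org', 'pythonXorg') is not None

print(same_pattern, hit, miss, loose)
```

The raw-string form keeps the pattern readable because what you type is what the regular expression engine receives.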

The re module provides a number of useful functions to match strings, such as:

Compile function

Match function

Search function

Findall function

Finditer function

Split function

Sub function

Subn function

The general steps for using the re module are as follows:

Use the compile function to compile the string form of a regular expression into a Pattern object

Match the text through a series of methods provided by the Pattern object to get the matching result (a Match object)

Finally, use the attributes and methods provided by the Match object to obtain information and perform other operations as needed
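The three steps above can be sketched as follows. This is only an illustration written for Python 3: the sample text and the variable names are made up, and search is just one of the Pattern methods that could be used in step 2.

```python
import re

# Step 1: compile the pattern string into a Pattern object.
pattern = re.compile(r'\d+')

# Step 2: call a Pattern method on the text to get a Match object (or None).
m = pattern.search('order 66 shipped')

# Step 3: read information from the Match object.
if m is not None:
    matched_text = m.group()   # the text that matched
    position = m.span()        # (start, end) indexes of the match
    print(matched_text, position)
```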

Compile function

The compile function is used to compile regular expressions to generate a Pattern object, which is generally used in the following forms:

    re.compile(pattern[, flags])

Where pattern is a regular expression in the form of a string, and flags is an optional parameter that selects a matching mode, such as ignoring case or multiline mode.
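To make the optional matching-mode argument concrete, here is a small sketch for Python 3 using two standard flags, re.I (ignore case) and re.M (multiline). The sample strings and variable names are assumptions chosen for illustration.

```python
import re

# re.I (re.IGNORECASE): letters match regardless of case.
ignore_case = re.compile(r'python', re.I)
found = ignore_case.findall('Python, PYTHON and python')

# re.M (re.MULTILINE): ^ and $ also match at the start and end of each line.
line_start = re.compile(r'^\w+', re.M)
first_words = line_start.findall('alpha one\nbeta two')

# Flags can be combined with the | operator.
combined = re.compile(r'^b\w+', re.I | re.M)
combined_hits = combined.findall('alpha\nBeta\nbravo')

print(found, first_words, combined_hits)
```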

Next, let's look at an example.

    import re

    # Compile the regular expression into a Pattern object
    pattern = re.compile(r'\d+')

Above, we have compiled a regular expression into a Pattern object, and then we can use a series of pattern methods to match the text. Some common methods of Pattern objects are:

Match method

Search method

Findall method

Finditer method

Split method

Sub method

Subn method

Match method

The match method looks for a match at the beginning of a string (you can also specify a starting position). It matches once: it returns as soon as one match is found, rather than finding all matches. Its general form is:

    match(string[, pos[, endpos]])

Where string is the string to be matched, and pos and endpos are optional parameters that specify the start and end positions within the string. Their default values are 0 and len(string) (the length of the string), respectively. Therefore, when you do not specify pos and endpos, the match method matches at the beginning of the string.

When the match succeeds, a Match object is returned, and if there is no match, None is returned.

Take a look at the example.

    >>> import re
    >>> pattern = re.compile(r'\d+')                      # matches one or more digits
    >>> m = pattern.match('one12twothree34four')          # match at the beginning: no match
    >>> print(m)
    None
    >>> m = pattern.match('one12twothree34four', 2, 10)   # start at the position of 'e': no match
    >>> print(m)
    None
    >>> m = pattern.match('one12twothree34four', 3, 10)   # start at the position of '1': matches
    >>> print(m)                                          # a Match object is returned
    >>> m.group(0)   # the 0 can be omitted
    '12'
    >>> m.start(0)   # the 0 can be omitted
    3
    >>> m.end(0)     # the 0 can be omitted
    5
    >>> m.span(0)    # the 0 can be omitted
    (3, 5)

Above, a Match object is returned when the match is successful, where:

The group([group1, ...]) method returns the string matched by one or more groups; to get the entire matched substring, use group() or group(0).

The start([group]) method returns the starting position, within the whole string, of the substring matched by the group (the index of the substring's first character). The default value of the parameter is 0.

The end([group]) method returns the end position, within the whole string, of the substring matched by the group (the index of the substring's last character, plus 1). The default value of the parameter is 0.

The span([group]) method returns (start(group), end(group)).
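The relationship between these four methods can be checked with a short sketch (Python 3; the pattern, the sample text and the variable names are illustrative assumptions). It also shows that group accepts several group numbers at once and then returns a tuple.

```python
import re

text = '10-20 rest'
m = re.compile(r'(\d+)-(\d+)').match(text)

whole = m.group()            # same as m.group(0): the entire match
both = m.group(1, 2)         # several groups at once: returns a tuple
second_start = m.start(2)    # index of the first character of group 2
second_end = m.end(2)        # index just past the last character of group 2

# span(group) is (start(group), end(group)).
assert m.span(2) == (second_start, second_end)

# Slicing the original string with those indexes gives back the group's text.
assert text[second_start:second_end] == m.group(2)

print(whole, both, m.span(2))
```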

Look at another example:

    >>> import re
    >>> pattern = re.compile(r'([a-z]+) ([a-z]+)', re.I)   # re.I means ignore case
    >>> m = pattern.match('Hello World Wide Web')
    >>> print(m)       # the match succeeds and a Match object is returned
    >>> m.group(0)     # the entire matched substring
    'Hello World'
    >>> m.span(0)      # the index range of the entire matched substring
    (0, 11)
    >>> m.group(1)     # the substring matched by the first group
    'Hello'
    >>> m.span(1)      # the index range of the substring matched by the first group
    (0, 5)
    >>> m.group(2)     # the substring matched by the second group
    'World'
    >>> m.span(2)      # the index range of the substring matched by the second group
    (6, 11)
    >>> m.groups()     # equivalent to (m.group(1), m.group(2), ...)
    ('Hello', 'World')
    >>> m.group(3)     # there is no third group
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    IndexError: no such group

Search method

The search method looks for a match at any position in a string. Like match, it matches once: it returns as soon as one match is found, instead of finding all matches. Its general form is:

    search(string[, pos[, endpos]])

When the match succeeds, a Match object is returned, and if there is no match, None is returned.

Let's look at examples:

    >>> import re
    >>> pattern = re.compile(r'\d+')
    >>> m = pattern.search('one12twothree34four')           # match() would not match here
    >>> m                                                   # a Match object
    >>> m.group()
    '12'
    >>> m = pattern.search('one12twothree34four', 10, 30)   # restrict the search interval
    >>> m                                                   # a Match object
    >>> m.group()
    '34'
    >>> m.span()
    (13, 15)

Let's look at another example:

    # -*- coding: utf-8 -*-

    import re

    # Compile the regular expression into a Pattern object
    pattern = re.compile(r'\d+')

    # Use search() to find a matching substring; it returns None if there is none.
    # match() would not succeed here.
    m = pattern.search('hello 123456 789')

    if m:
        # Use the Match object to get information about the match
        print('matching string: {}'.format(m.group()))
        print('position: {}'.format(m.span()))

Execution result:

    matching string: 123456
    position: (6, 12)

Findall method

Both the match and search methods above match once: they return as soon as one match is found. Most of the time, however, we need to search the entire string and get all the matches.

The findall method is used as follows:

    findall(string[, pos[, endpos]])

findall returns all the matching substrings as a list, or an empty list if there is no match.

Look at the example:

    import re

    pattern = re.compile(r'\d+')   # find numbers
    result1 = pattern.findall('hello 123456 789')
    result2 = pattern.findall('one1two2three3four4', 0, 10)

    print(result1)
    print(result2)

Execution result:

    ['123456', '789']
    ['1', '2']
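One detail the example above does not exercise: when the pattern contains capturing groups, findall returns the groups rather than the whole matches. The sketch below (Python 3; the sample text and names are illustrative assumptions) shows the three cases.

```python
import re

text = 'x=1, y=22'

# No groups: each item is the whole matched substring.
no_groups = re.compile(r'\w=\d+').findall(text)

# One group: each item is the text captured by that group.
one_group = re.compile(r'\w=(\d+)').findall(text)

# Several groups: each item is a tuple with one entry per group.
two_groups = re.compile(r'(\w)=(\d+)').findall(text)

print(no_groups)
print(one_group)
print(two_groups)
```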

Finditer method

The behavior of the finditer method is similar to that of findall, which searches the entire string and gets all the matching results. But it returns an iterator that sequentially accesses each matching result (Match object).

Look at the example:

    # -*- coding: utf-8 -*-

    import re

    pattern = re.compile(r'\d+')

    result_iter1 = pattern.finditer('hello 123456 789')
    result_iter2 = pattern.finditer('one1two2three3four4', 0, 10)

    print(type(result_iter1))   # prints the iterator type
    print(type(result_iter2))   # prints the iterator type

    print('result1...')
    for m1 in result_iter1:   # m1 is a Match object
        print('matching string: {}, position: {}'.format(m1.group(), m1.span()))

    print('result2...')
    for m2 in result_iter2:
        print('matching string: {}, position: {}'.format(m2.group(), m2.span()))

Execution result:

    result1...
    matching string: 123456, position: (6, 12)
    matching string: 789, position: (13, 16)
    result2...
    matching string: 1, position: (3, 4)
    matching string: 2, position: (7, 8)

Split method

The split method splits the string at every substring that matches the pattern and returns the pieces as a list. It is used in the following form:

    split(string[, maxsplit])

maxsplit specifies the maximum number of splits; if it is not specified, the string is split at every match.

Look at the example:

    import re

    p = re.compile(r'[\s\,\;]+')
    print(p.split('a,b;; c   d'))

Execution result:

    ['a', 'b', 'c', 'd']
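The effect of maxsplit is easiest to see side by side. The following is a minimal sketch for Python 3; the sample text and variable names are assumptions made for illustration.

```python
import re

p = re.compile(r'[\s,;]+')
text = 'a,b;; c   d'

split_all = p.split(text)                 # no maxsplit: split at every match
split_once = p.split(text, maxsplit=1)    # one split; the rest stays intact
split_twice = p.split(text, maxsplit=2)   # two splits

print(split_all)
print(split_once)
print(split_twice)
```

With maxsplit set, the last list element holds whatever remains of the string, separators included.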

Sub method

The sub method is used for substitution. It is used in the following form:

    sub(repl, string[, count])

Where repl can be a string or a function:

If repl is a string, it replaces each matching substring of string, and the resulting string is returned. repl can also refer to a group in the form \id, but the number 0 cannot be used.

If repl is a function, the function should take a single parameter (the Match object) and return the string to substitute (group references can no longer be used in the returned string).

count specifies the maximum number of replacements; if it is not specified, all matches are replaced.

Look at the example:

    import re

    p = re.compile(r'(\w+) (\w+)')
    s = 'hello 123, hello 456'

    def func(m):
        return 'hi' + ' ' + m.group(2)

    print(p.sub(r'hello world', s))   # replace 'hello 123' and 'hello 456' with 'hello world'
    print(p.sub(r'\2 \1', s))         # reference groups
    print(p.sub(func, s))
    print(p.sub(func, s, 1))          # replace at most once

Execution result:

    hello world, hello world
    123 hello, 456 hello
    hi 123, hi 456
    hi 123, hello 456
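A function is most useful as repl when the replacement has to be computed from the matched text. As a hedged illustration (Python 3; the helper name double and the sample data are hypothetical), the sketch below doubles every number in a string.

```python
import re

p = re.compile(r'\d+')
s = 'hello 123, hello 456'

def double(m):
    # m is the Match object for one occurrence; return its replacement text.
    return str(int(m.group()) * 2)

doubled_all = p.sub(double, s)             # every number is replaced
doubled_first = p.sub(double, s, count=1)  # only the first number is replaced

print(doubled_all)
print(doubled_first)
```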

Subn method

The subn method behaves similarly to the sub method and is also used for substitution. It is used in the following forms:

    subn(repl, string[, count])

It returns a tuple:

    (sub(repl, string[, count]), number of substitutions)

The tuple has two elements: the first is the result of the sub method, and the second is the number of substitutions made in the original string.

Look at the example:

    import re

    p = re.compile(r'(\w+) (\w+)')
    s = 'hello 123, hello 456'

    def func(m):
        return 'hi' + ' ' + m.group(2)

    print(p.subn(r'hello world', s))
    print(p.subn(r'\2 \1', s))
    print(p.subn(func, s))
    print(p.subn(func, s, 1))

Execution result:

    ('hello world, hello world', 2)
    ('123 hello, 456 hello', 2)
    ('hi 123, hi 456', 2)
    ('hi 123, hello 456', 1)
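The substitution count is handy when you need to know whether anything changed. Here is a small sketch under that assumption (Python 3; the sample strings and names are illustrative).

```python
import re

p = re.compile(r'\s+')

# Collapse every run of whitespace into a single space and count the runs.
cleaned, n = p.subn(' ', 'too   many    spaces')

# When nothing matches, the string comes back unchanged and the count is 0.
unchanged, zero = p.subn(' ', 'nospaces')

print(cleaned, n)
print(unchanged, zero)
```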

Other functions

In fact, most of the methods of the Pattern object produced by compile have a counterpart among the module-level functions of re, with slight differences in how they are called.

Match function

The match function is used in the following form:

    re.match(pattern, string[, flags])

Where pattern is the string form of a regular expression, such as \d+ or [a-z]+.

The match method of the Pattern object, in contrast, is used as follows:

    match(string[, pos[, endpos]])

As you can see, the match function cannot specify an interval of the string; it can only match at the beginning of the string. Take a look at the example:

    import re

    m1 = re.match(r'\d+', 'One12twothree34four')
    if m1:
        print('matching string: {}'.format(m1.group()))
    else:
        print('m1 is: {}'.format(m1))

    m2 = re.match(r'\d+', '12twothree34four')
    if m2:
        print('matching string: {}'.format(m2.group()))
    else:
        print('m2 is: {}'.format(m2))

Execution result:

    m1 is: None
    matching string: 12

Search function

The search function is used in the following form:

    re.search(pattern, string[, flags])

The search function cannot specify the search interval of a string, and the usage is similar to the search method of the Pattern object.

Findall function

The findall function is used in the following form:

    re.findall(pattern, string[, flags])

The findall function cannot specify the search interval of a string, and the usage is similar to the findall method of the Pattern object.

Look at the example:

    import re

    print(re.findall(r'\d+', 'hello 12345 789'))

    # output
    # ['12345', '789']

Finditer function

The finditer function is used in a way similar to Pattern's finditer method, in the following form:

    re.finditer(pattern, string[, flags])
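As a brief illustration of the module-level form (Python 3; the sample text and the name matches are assumptions), the sketch below collects the text and position of every match in one pass.

```python
import re

# Each item yielded by re.finditer is a Match object.
matches = [(m.group(), m.span())
           for m in re.finditer(r'\d+', 'one1two22three333')]

for text, span in matches:
    print('matching string: {}, position: {}'.format(text, span))
```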

Split function

The split function is used in the following form:

    re.split(pattern, string[, maxsplit])

Sub function

The sub function is used in the following form:

    re.sub(pattern, repl, string[, count])

Subn function

The subn function is used in the following form:

    re.subn(pattern, repl, string[, count])
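The module-level split, sub and subn functions can be tried together in a short sketch (Python 3; the sample text and variable names are illustrative). The optional arguments are passed by keyword here, which keeps the calls unambiguous.

```python
import re

text = 'a1b22c333'

# Split at the first run of digits only.
parts = re.split(r'\d+', text, maxsplit=1)

# Replace at most two runs of digits.
masked = re.sub(r'\d+', '#', text, count=2)

# Replace every run of digits and report how many were replaced.
masked_all, n = re.subn(r'\d+', '#', text)

print(parts)
print(masked)
print(masked_all, n)
```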

Which way should we use?

As you can see from the above, there are two ways to use the re module:

Use the re.compile function to generate a Pattern object, and then use a series of methods of the Pattern object to match the text

Directly use functions such as re.match, re.search and re.findall to search for text matches.

Next, let's use an example to show these two methods.

Let's first look at the first usage:

    import re

    # Compile the regular expression into a Pattern object first
    pattern = re.compile(r'\d+')

    print(pattern.match('123,123'))
    print(pattern.search('234,234'))
    print(pattern.findall('345,345'))

Take a look at the second usage:

    import re

    print(re.match(r'\d+', '123,123'))
    print(re.search(r'\d+', '234,234'))
    print(re.findall(r'\d+', '345,345'))

If a regular expression is used many times (such as \d+ above), it is better, for the sake of efficiency, to precompile it into a Pattern object and then use that object's methods on the text to be matched. If you call re.match, re.search and similar functions directly, the pattern string is passed in on every call and has to be looked up in the module's internal cache of compiled patterns, or compiled again if it is no longer cached, which adds overhead to each call.

Therefore, we recommend using the first usage.
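Both ways give the same results; the difference is only in how often the pattern string has to be processed. The sketch below (Python 3; the list of sample texts and the variable names are assumptions) applies one pattern to several strings in each style and checks that the outputs agree.

```python
import re

texts = ['123,123', 'abc 234', 'no digits']

# Way 1: compile once, then reuse the Pattern object for every string.
pattern = re.compile(r'\d+')
compiled_results = [pattern.findall(t) for t in texts]

# Way 2: pass the pattern string to the module-level function on every call.
direct_results = [re.findall(r'\d+', t) for t in texts]

# The two approaches return identical results.
assert compiled_results == direct_results

print(compiled_results)
```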

Match Chinese

In some cases, we want to match the Chinese characters in a text. Note that Chinese characters mostly fall in the Unicode range [\u4e00-\u9fa5]. We say "mostly" because this range is not complete; for example, it does not include full-width (Chinese) punctuation. In most cases, though, it is sufficient.

Suppose you want to extract the Chinese from the string title = u'你好，hello，世界'; you can do it like this:

    # -*- coding: utf-8 -*-

    import re

    title = u'你好，hello，世界'
    pattern = re.compile(ur'[\u4e00-\u9fa5]+')
    result = pattern.findall(title)

    print(result)

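Notice that the regular expression carries the two-letter prefix ur: r marks a raw string and u marks a unicode string. The ur prefix exists only in Python 2; in Python 3, where every str is already Unicode, the plain raw string r'[\u4e00-\u9fa5]+' does the same job.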

Execution result:

    [u'\u4f60\u597d', u'\u4e16\u754c']
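The example above is written in Python 2 syntax. For readers on Python 3, a roughly equivalent sketch follows; it assumes the same sample string and relies on the fact that the re module itself interprets \uXXXX escapes in str patterns.

```python
import re

title = '你好，hello，世界'

# In Python 3 every str is Unicode, so a plain raw string is enough.
pattern = re.compile(r'[\u4e00-\u9fa5]+')
result = pattern.findall(title)

print(result)
```

The full-width commas are not returned because they lie outside the range used in the character set.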

Greedy matching

In Python, regular expression matching is greedy by default (in a few languages the default may be non-greedy), that is, it matches as many characters as possible.

For example, we want to find all the div blocks in the string:

    import re

    content = 'aa<div>test1</div>bb<div>test2</div>cc'
    pattern = re.compile(r'<div>.*</div>')
    result = pattern.findall(content)

    print(result)

Execution result:

    ['<div>test1</div>bb<div>test2</div>']

Because matching is greedy, that is, it matches as much as possible, the engine keeps trying further to the right to see whether a longer substring can still match, and it returns the longest one.

If we want a non-greedy match, we can add a ? after the quantifier, as follows:

    import re

    content = 'aa<div>test1</div>bb<div>test2</div>cc'
    pattern = re.compile(r'<div>.*?</div>')   # add ?
    result = pattern.findall(content)

    print(result)

Results:

    ['<div>test1</div>', '<div>test2</div>']
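The same contrast shows up with any pair of delimiters. As a further illustration (Python 3; the quoted sample text and the variable names are assumptions), the sketch below extracts quoted phrases with a greedy and a non-greedy quantifier.

```python
import re

text = 'say "one" and "two"'

# Greedy: .* runs to the last quote, so a single long match is returned.
greedy = re.findall(r'".*"', text)

# Non-greedy: .*? stops at the nearest closing quote.
lazy = re.findall(r'".*?"', text)

print(greedy)
print(lazy)
```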

Summary

The general steps for using the re module are as follows:

Use the compile function to compile the string form of a regular expression into a Pattern object

Match the text through a series of methods provided by the Pattern object to get the matching result (a Match object)

Use the properties and methods provided by the Match object to obtain information and perform other operations as needed

Python's regular matching defaults to greedy matching.

