How to understand and master python regular expressions and re modules 04/17 Update SLTechnology News&Howtos

2025-04-17 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/02 Report--

This article explains how to understand and master Python regular expressions and the re module. Many people have questions about this topic in day-to-day work, so the material below collects simple, easy-to-use methods, with examples you can follow along with.

1. Regular expression

A regular expression (often abbreviated as regex, regexp or RE in code) is a concept in computer science. Regular expressions are commonly used to retrieve and replace text that conforms to a certain pattern (rule).

The following are common usage scenarios for regular expressions:

Check the validity of the string

Verify a user name (a-z, 0-9; not all digits, not all letters)

Verify mailbox format (xxx@qq.com)

Verify phone number (11 digits)

Verify ID card (18 digits)

Verify a QQ number (5-12 digits only, the first digit cannot be 0)

Extract information from a string

Extract a number from a text message

Extract the suffix of the file name

Collector (web crawler)

Replacement string

Replace illegal characters in a string

Mask part of a phone number (for example 1852***0102)

Replace a placeholder such as "hello {{name}}" with an actual value (template frameworks)

Split string

Split a string according to the specified rules

In plain terms, a regular expression retrieves strings of a specific form; the object it works on is always a string.
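
The four kinds of use listed above (validate, extract, replace, split) can be sketched with Python's re module as follows. This is only an illustrative sketch: the function names and the sample inputs are made up for the example and do not come from the article.

```python
import re

def is_valid_number(s):
    """Validate: 5-12 digits, first digit not 0."""
    return re.fullmatch(r'[1-9]\d{4,11}', s) is not None

def file_suffix(name):
    """Extract: the suffix after the last dot, or None."""
    m = re.search(r'\.(\w+)$', name)
    return m.group(1) if m else None

def mask_phone(s):
    """Replace: hide the middle three digits of an 11-digit number."""
    return re.sub(r'(\d{4})\d{3}(\d{4})', r'\1***\2', s)

def split_fields(s):
    """Split: break on commas, semicolons, or whitespace."""
    return re.split(r'[,;\s]+', s)

print(is_valid_number('1427319758'))    # True
print(file_suffix('report.final.pdf'))  # pdf
print(mask_phone('18520000102'))        # 1852***0102
print(split_fields('a, b;c  d'))        # ['a', 'b', 'c', 'd']
```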

1.1 metacharacter

Use metacharacters to match a single character

Character   Function
.           matches any single character (except \n)
[]          matches any one of the characters listed inside the brackets
\d          matches a digit, i.e. 0-9
\D          matches a non-digit
\s          matches whitespace, e.g. a space or a tab
\S          matches a non-whitespace character
\w          matches a word character, i.e. a-z, A-Z, 0-9, _
\W          matches a non-word character
*           matches the preceding character zero or more times, i.e. it is optional
+           matches the preceding character one or more times, i.e. at least once

import re

text = '''
This is the string used to match.
From:1427319758@qq.com
Tel:88888888
'''

The examples below demonstrate metacharacter matching on the string above.

Use . to match any character:

res = re.findall('.', text)
print(res)

Running result (note that a list is returned, and that the newline characters are not matched):

['T', 'h', 'i', 's', ' ', 'i', 's', ..., 'T', 'e', 'l', ':', '8', '8', '8', '8', '8', '8', '8', '8']

\d matches digits:

res = re.findall(r'\d', text)
print(res)

Running result:

['1', '4', '2', '7', '3', '1', '9', '7', '5', '8', '8', '8', '8', '8', '8', '8', '8', '8']

+ and * match multiple characters:

res = re.findall(r'\d+', text)
res_1 = re.findall(r'\d*', text)
print(res, res_1)

Running result (the empty strings in the second list are abbreviated):

['1427319758', '88888888'] ['', '', ..., '1427319758', '', ..., '88888888', '', '']

* allows the preceding character to occur zero times, so \d* also produces an empty match at every position where no digit starts. \d+ is equivalent to \d\d*.
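
The difference between + and * can be checked with a short sketch. The sample string here is made up for the illustration; the point is only that \d+ and \d\d* agree, while \d* additionally yields empty strings.

```python
import re

sample = 'tel:88888888 ext 42'

plus = re.findall(r'\d+', sample)         # one or more digits
plus_equiv = re.findall(r'\d\d*', sample) # a digit followed by zero or more digits
star = re.findall(r'\d*', sample)         # zero or more digits: includes empty matches

# Dropping the empty strings from the \d* result leaves the same digit runs
non_empty = [s for s in star if s]

print(plus)        # ['88888888', '42']
print(plus_equiv)  # ['88888888', '42']
print(non_empty)   # ['88888888', '42']
```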

[] (character set)

A character set restricts a match to the characters listed inside it. For example, suppose you want to find a QQ number in a string: a QQ number cannot start with 0 and has 5-12 digits.

This can be expressed as [1-9]\d{4,11}.

Mailboxes may appear anywhere in a string. You can use [1-9a-zA-Z]\w+[@]\w+[.][a-zA-Z]+ to match mailboxes in the usual format.

res = re.findall(r'[1-9a-zA-Z]\w+[@]\w+[.][a-zA-Z]+', '''
Write to 1427319758@163.com about the meeting,
or maybe to 1427319758@edu.cn
1427319758@xx.mail is another one
and this line has no address: asdfbglsafhlf.
''')
print(res)

Running result:

['1427319758@163.com', '1427319758@edu.cn', '1427319758@xx.mail']
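
A common slip when writing letter ranges is A-z instead of A-Z. The range A-z runs over ASCII codes 65 to 122 and therefore also accepts the six punctuation characters that sit between Z and a. The sketch below, with variable names chosen just for this illustration, shows which characters slip through.

```python
import re

loose = re.compile(r'[a-zA-z]')   # A-z spans ASCII 65..122
strict = re.compile(r'[a-zA-Z]')  # letters only

# Characters accepted by the loose set but rejected by the strict one
extras = [ch for ch in '[\\]^_`'
          if loose.fullmatch(ch) and not strict.fullmatch(ch)]

print(extras)  # ['[', '\\', ']', '^', '_', '`']
```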

Use .* to match any number of characters:

res = re.findall('.*', text)
print(res)

Running result:

['', 'This is the string used to match.', '', 'From:1427319758@qq.com', '', 'Tel:88888888', '', '']

By default . does not match the newline character '\n', so each match stops at the end of a line and every line is matched separately. Because * also allows zero characters, an empty match is produced at each newline and at the end of the string, which is why empty elements appear.
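
If you do want . to cross line boundaries, the re.S (re.DOTALL) flag changes that default. A minimal sketch with a made-up two-line string:

```python
import re

two_lines = 'line one\nline two'

# Default: . stops at the newline, so each line is a separate match
default = re.findall(r'.+', two_lines)

# With re.S (DOTALL), . also matches '\n', so the whole string is one match
dotall = re.findall(r'.+', two_lines, re.S)

print(default)  # ['line one', 'line two']
print(dotall)   # ['line one\nline two']
```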

1.2 quantifier

Use quantifiers to match multiple characters

Character   Function
{m}         matches the preceding character exactly m times
{m,n}       matches the preceding character from m to n times

Limit the number of times the matched character appears:

res = re.findall(r'[1-9]\d{4,11}', text)
res_1 = re.findall(r'([1-9]\d{4,11})@', text)
print(res, res_1)

Running result:

['1427319758', '88888888'] ['1427319758']

The first pattern also matches the phone number, because it satisfies the same rule. The second takes into account that the QQ number is embedded in the email address, so it adds @ after the group to restrict the match: only the 5-12 digits (not starting with 0) immediately before @ are captured.
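
The two quantifier forms can also be tried on their own. The sketch below uses invented sample values: {11} accepts only strings of exactly 11 digits, and {2,3} takes three digits where it can and at least two otherwise.

```python
import re

candidates = ['13912345678', '1391234567', '139123456789']

# {m}: exactly 11 digits, nothing more and nothing less
eleven_digits = [c for c in candidates if re.fullmatch(r'\d{11}', c)]

# {m,n}: greedy, so it takes 3 digits when available; a single leftover digit is skipped
chunks = re.findall(r'\d{2,3}', '1234567')

print(eleven_digits)  # ['13912345678']
print(chunks)         # ['123', '456']
```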

1.3 exact matching and universal matching

Universal matching

Universal matching matches everything, including the marker strings themselves.

res = re.findall('Hello.*like', 'Hello world! I like python!')
print(res)

Running result:

['Hello world! I like']

Exact matching

Exact matching returns only what is inside the parentheses (the capture group).

res_1 = re.findall('Hello (.*) like', 'Hello world! I like python!')
print(res_1)

Running result:

['world! I']

Suppose you want the characters between Hello and like. The result of universal matching contains the leading and trailing markers Hello and like, while exact matching returns only the string between the markers.
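
When a pattern contains more than one pair of parentheses, re.findall() returns a tuple per match, one element per group. A group written as (?:...) is non-capturing and does not affect the result. A small sketch, using a sample string in the style of the later examples:

```python
import re

s = 'python = 9999, c = 7890'

# Two capture groups: each match becomes a (name, value) tuple
pairs = re.findall(r'(\w+) = (\d+)', s)

# Non-capturing group: findall returns the whole match again
whole = re.findall(r'(?:\w+) = \d+', s)

print(pairs)  # [('python', '9999'), ('c', '7890')]
print(whole)  # ['python = 9999', 'c = 7890']
```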

1.4 greedy matching and non-greedy matching

Quantifiers in Python are greedy by default (a few languages may default to non-greedy): they always try to match as many characters as possible.

Non-greedy quantifiers do the opposite and always try to match as few characters as possible.

Add ? after "*", "?", "+" or "{m,n}" to turn a greedy quantifier into a non-greedy one.

res = re.findall('Hello (.*) like', 'Hello world! I like python! I like you!')
res_1 = re.findall('Hello (.*?) like', 'Hello world! I like python! I like you!')
print(res, res_1)

Running result:

['world! I like python! I'] ['world! I']

res is the greedy match: after finding Hello it matches as many characters as possible and stops at the last like.

res_1 is the non-greedy match: after finding Hello it matches only the characters before the first like.
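
The same contrast shows up with tag-like delimiters and with the {m,n} quantifier. The sample strings below are invented for the illustration.

```python
import re

html = '<b>one</b> and <b>two</b>'

greedy = re.findall(r'<b>(.*)</b>', html)   # runs to the last </b>
lazy = re.findall(r'<b>(.*?)</b>', html)    # stops at the first </b> each time

greedy_digits = re.findall(r'\d{2,4}', '123456')   # takes 4 digits when it can
lazy_digits = re.findall(r'\d{2,4}?', '123456')    # takes only the minimum of 2

print(greedy)         # ['one</b> and <b>two']
print(lazy)           # ['one', 'two']
print(greedy_digits)  # ['1234', '56']
print(lazy_digits)    # ['12', '34', '56']
```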

2. The re module

So far we have only used the re.findall() function, but the regular expression module contains several other functions that make it easy to manipulate strings. The re module can be used in two ways: the object approach (compile a pattern and call its methods) and the functional approach (call the module-level functions directly).

2.1 re.match

match() looks for a match only at the beginning of the string (a starting position can also be specified). It matches once: it returns as soon as a match is found, rather than finding all matches. Its general form is:

match(pattern, string[, flags])

Here pattern is the regular expression string, string is the string to be matched, and flags is an optional parameter.

When the match succeeds, a Match object is returned; if there is no match, None is returned.

import re

pattern = 'Python'
string = 'dsgfaPythonahsdgjasghPythonasdjajsk'
result = re.match(pattern, string)
result_1 = re.match(pattern, string[5:])
print(result, result_1)

Running result:

None <re.Match object; span=(0, 6), match='Python'>
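
The module-level re.match() has no position argument, which is why the example slices the string. With the object approach, the match() method of a compiled pattern accepts a start position directly. A short sketch using the same string:

```python
import re

string = 'dsgfaPythonahsdgjasghPythonasdjajsk'
pattern = re.compile('Python')

at_start = pattern.match(string)   # None: the string does not begin with 'Python'
at_five = pattern.match(string, 5) # try to match starting at index 5

print(at_start)        # None
print(at_five.span())  # (5, 11)
```

Unlike slicing, the reported span stays relative to the original string.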

2.2 re.search

search() looks for a match at any position in the string. It also matches once: it returns as soon as a match is found, instead of finding all matches. Its general form is:

search(pattern, string[, flags])

When the match succeeds, a Match object is returned; if there is no match, None is returned.

ret = re.search(r'\d+', "python = 9999, c = 7890, C++ = 12345")
print(ret.group())

Running result:

9999
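
Because search() returns None when nothing matches, calling group() directly on the result can fail. The sketch below checks the result first and also reads the position of the match; the helper name first_number is made up for this example.

```python
import re

def first_number(s):
    """Return (text, span) of the first run of digits, or None if there is none."""
    m = re.search(r'\d+', s)
    if m is None:
        return None
    return m.group(), m.span()

found = first_number("python = 9999, c = 7890, C++ = 12345")
missing = first_number("no digits here")

print(found)    # ('9999', (9, 13))
print(missing)  # None
```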

2.3 re.findall (key point!)

Both match() and search() above match once and return as soon as a match is found. Most of the time, however, we need to search the entire string and get all the matches. The general form of findall() is:

findall(pattern, string[, flags])

findall() returns all the matching substrings as a list, or an empty list if there is no match.

ret = re.findall(r"\d+", "python = 9999, c = 7890, C++ = 12345")
print(ret)

Running result:

['9999', '7890', '12345']
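
findall() returns only the matched text. When the position of each match is needed as well, re.finditer() is a related function that yields Match objects instead. It is not covered in this article, so the following is offered only as a side note:

```python
import re

s = "python = 9999, c = 7890, C++ = 12345"

# Each item is a Match object, so both the text and its start index are available
numbers = [(m.group(), m.start()) for m in re.finditer(r'\d+', s)]

print(numbers)  # [('9999', 9), ('7890', 19), ('12345', 31)]
```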

2.4 re.split

split() splits the string at each substring that matches the pattern and returns a list. Its general form is:

split(pattern, string[, maxsplit, flags])

maxsplit specifies the maximum number of splits; if it is not given, the string is split at every match.

Notes on split():

It splits the string and removes the matched substrings.

The result is a list.

maxsplit: the default 0 means split at every match;

1 means split once;

2 means split twice.

pattern = r'\d+'
string = 'Pythonasdkjasd464654adhuiaghsdk564654akjsdhkashdkja'
result = re.split(pattern, string, maxsplit=2)
print(result)

Running result:

['Pythonasdkjasd', 'adhuiaghsdk', 'akjsdhkashdkja']

It is essentially an upgraded version of str.split for string manipulation.
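
To make the comparison with str.split concrete, here is a small sketch with made-up input. str.split only accepts one fixed separator, while re.split can split on a whole class of separators, and it keeps the separators in the result if the pattern is wrapped in a capture group.

```python
import re

line = 'apple, banana;cherry  date'

by_str = line.split(',')                    # one fixed separator only
by_regex = re.split(r'[,;\s]+', line)       # commas, semicolons, or whitespace

kept = re.split(r'(\d+)', 'a1b22c')         # capture group: separators are kept
once = re.split(r'\d+', 'a1b22c', maxsplit=1)

print(by_str)    # ['apple', ' banana;cherry  date']
print(by_regex)  # ['apple', 'banana', 'cherry', 'date']
print(kept)      # ['a', '1', 'b', '22', 'c']
print(once)      # ['a', 'b22c']
```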

2.5 re.sub

sub() performs replacement. Its general form is:

sub(pattern, repl, string[, count, flags])

The first parameter is the regular expression, the second is the replacement, the third is the source string, and the optional fourth parameter is the maximum number of substitutions. If it is omitted, every substring that matches the pattern is replaced.

pattern = 'Java'
repl = '*'
string = 'kjasdJavaadhuiaghsdkJavaakjsd'
result = re.sub(pattern, repl, string, count=1)
print(result)

Running result:

kjasd*adhuiaghsdkJavaakjsd

It is essentially an upgraded version of str.replace.

The repl parameter

Method 1 uses a string:

ret = re.sub(r"\d+", '18', "age = 12")
print(ret)

Running result:

age = 18

Method 2 uses functions:

In essence, re.sub() finds each substring of the string that matches the pattern and passes it to repl. When repl is a plain string, it behaves like a function that ignores its input: whatever substring comes in, the output is the repl string. In that case the replacement cannot depend on the matched substring.

def replace(string, repl):
    return repl

If the replacement should depend on the matched substring, for example turning the phone number 15654862043 into 1565***2043, a fixed repl string is not enough; we need to pass a function as repl.

import re

text = '''
15654561654
13905641750
15646575635
18976534547
'''

def replace(match):
    # match is a Match object; group() returns the matched phone number
    string = match.group()
    # keep the first four and the last four digits, mask the middle three
    repl = string[0:4] + '***' + string[-4:]
    return repl

ret = re.sub(r"\d+", repl=replace, string=text)
print(ret)

Running result:

1565***1654
1390***1750
1564***5635
1897***4547
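
A function passed as repl can also look values up, which is how the template placeholder case from the introduction ("hello {{name}}") could be handled. This is a minimal sketch, not a real template engine; the names render and lookup are invented for the example, and unknown placeholders are simply left unchanged.

```python
import re

def render(template, values):
    """Replace each {{key}} in template with values[key], if present."""
    def lookup(match):
        key = match.group(1)
        # Fall back to the original placeholder text when the key is unknown
        return str(values.get(key, match.group(0)))
    return re.sub(r'\{\{\s*(\w+)\s*\}\}', lookup, template)

print(render('hello {{name}}', {'name': 'world'}))                 # hello world
print(render('hi {{ user }} and {{missing}}', {'user': 'Bob'}))    # hi Bob and {{missing}}
```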

2.6 re.compile

Use the compile() function to compile the string form of a regular expression into a Pattern object. The text is then matched through the methods the object provides, and the matching result (a Match object) is obtained. Compiling once makes repeated matching and searching more efficient.

The compile() function

The compile() function compiles a regular expression into a Pattern object. It is generally used as follows:

import re

# compile regular expressions into Pattern objects
pattern_1 = re.compile(r'\d+', re.S)
pattern_2 = re.compile(r'\D+', re.I)
pattern_3 = re.compile(r'\w+', re.S)

The earlier pattern definitions did not involve the flags parameter, so a plain assignment such as pattern = r'\d+' was enough, without re.compile. The advantages of the compile function are: 1. the flags are stored together with the pattern; 2. the compiled object can be reused later.

results1 = re.findall(pattern_1, '540775360@qq.com')
results2 = re.findall(pattern_2, "python = 9999, c = 7890, C++ = 12345")
results3 = re.findall(pattern_3, "python = 997")
print(results1, results2, results3)
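
The object approach can be sketched as follows: the pattern and its flag are compiled once, and the resulting object is then reused for several operations. The variable names and sample strings are illustrative only.

```python
import re

# Compile once, with the case-insensitive flag attached to the pattern
word = re.compile(r'python', re.I)

found = word.findall('Python PYTHON python')  # all spellings match
replaced = word.sub('Py', 'I like PYTHON')    # the same object used for replacement
pieces = word.split('aPythonb')               # and for splitting

print(found)     # ['Python', 'PYTHON', 'python']
print(replaced)  # I like Py
print(pieces)    # ['a', 'b']
```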

2.7 Native (raw) strings

>>> mm = "c:\\a\\b\\c"
>>> mm
'c:\\a\\b\\c'
>>> print(mm)
c:\a\b\c
>>> re.match("c:\\\\", mm).group()
'c:\\'
>>> ret = re.match("c:\\\\", mm).group()
>>> print(ret)
c:\
>>> ret = re.match("c:\\\\a", mm).group()
>>> print(ret)
c:\a
>>> ret = re.match(r"c:\\a", mm).group()
>>> print(ret)
c:\a
>>> ret = re.match(r"c:\a", mm).group()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'

In Python, prefixing a string literal with r marks it as a raw (native) string.

As in most programming languages, regular expressions use "\" as the escape character, which can cause backslash trouble. If you need to match a literal "\" in the text, the regular expression written as an ordinary string needs four backslashes: the first two and the last two are each turned into a single backslash by the programming language, giving two backslashes, which the regular expression engine then interprets as one literal backslash.

Raw strings in Python solve this problem well. With raw strings you no longer have to worry about a missing backslash, and the expression is more intuitive.

>>> ret = re.match(r"c:\\a", mm).group()
>>> print(ret)
c:\a
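
The equivalence between the four-backslash form and the raw-string form can be verified directly. The sketch below only uses the path from the session above; the variable names are chosen for the illustration.

```python
import re

plain = "c:\\\\"   # ordinary string: four backslashes typed, two stored
raw = r"c:\\"      # raw string: two backslashes typed, two stored

path = "c:\\a\\b\\c"   # the text c:\a\b\c

print(plain == raw)                  # True
print(len(raw))                      # 4 characters: c, :, \, \
print(re.match(raw, path).group())   # c:\
```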

2.8 match the beginning and end

Character   Function
^           matches the beginning of the string
$           matches the end of the string

Matching the end

Requirement: match email addresses at 163.com

# coding=utf-8
import re

email_list = ["xiaoWang@163.com", "xiaoWang@163.comheihei", ".com.xiaowang@qq.com"]

for email in email_list:
    ret = re.match(r"[\w]{4,20}@163\.com", email)
    if ret:
        print("%s is a compliant email address, and the matching result is: %s" % (email, ret.group()))
    else:
        print("%s does not meet the requirements" % email)

Running result:

xiaoWang@163.com is a compliant email address, and the matching result is: xiaoWang@163.com
xiaoWang@163.comheihei is a compliant email address, and the matching result is: xiaoWang@163.com
.com.xiaowang@qq.com does not meet the requirements

Improved version, with $ anchoring the end:

email_list = ["xiaoWang@163.com", "xiaoWang@163.comheihei", ".com.xiaowang@qq.com"]

for email in email_list:
    ret = re.match(r"[\w]{4,20}@163\.com$", email)
    if ret:
        print("%s is a compliant email address, and the matching result is: %s" % (email, ret.group()))
    else:
        print("%s does not meet the requirements" % email)

Running result:

xiaoWang@163.com is a compliant email address, and the matching result is: xiaoWang@163.com
xiaoWang@163.comheihei does not meet the requirements
.com.xiaowang@qq.com does not meet the requirements

This example is for demonstration only and is not a practical way to match mailboxes, because it cannot handle a multi-line string that contains mailbox information but does not end with com, such as:

xiaoWang@163.com
xiaoKang@163.com
This is the mailbox.
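
One way to handle such multi-line text, not shown in the original example, is the re.M (re.MULTILINE) flag, which makes ^ and $ match at the start and end of every line rather than only of the whole string. A minimal sketch:

```python
import re

lines = 'xiaoWang@163.com\nxiaoKang@163.com\nThis is the mailbox.'
pattern = r'^\w{4,20}@163\.com$'

# Without re.M, ^ and $ anchor to the whole string, so nothing matches
whole_string = re.findall(pattern, lines)

# With re.M, ^ and $ anchor to each line
per_line = re.findall(pattern, lines, re.M)

print(whole_string)  # []
print(per_line)      # ['xiaoWang@163.com', 'xiaoKang@163.com']
```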

The all-purpose pattern

(.*?) matches any string that does not contain a newline, however long or short, and stops at the first point where the rest of the pattern can match (non-greedy matching).

This pattern can extract most of the data you are likely to want. Try this combination first when writing a regular expression and you may get twice the result with half the effort. It is often combined with the re.findall() function.

2.9 case: crawling Movie Paradise data

The approach for Movie Paradise:

1. Go to Latest Movies -> More, which opens the first list page.

2. Turn the pages: https://www.dytt8.net/html/gndy/dyzz/list_23_{}.html

1> extract the detail-page URL of each movie on every list page

2> send a request and get a response

3> extract the links with a regular expression

4> save the data (to a file)

Get familiar with the layout and structure of the web page before crawling! Understand how the URLs relate to each other, and look for the data in the page source (Ctrl+F).

import re
import requests

for page in range(1, 5):
    url_list = f'https://www.dytt8.net/html/gndy/dyzz/list_23_{page}.html'
    # request the list page first in order to find the detail-page URLs
    r_list = requests.get(url_list)
    # specify the encoding
    r_list.encoding = 'gb2312'
    # extract the detail-page URLs; returns a list
    # (the pattern is missing from the source page; it should capture each detail-page link)
    url_detail = re.findall('', r_list.text)
    for u in url_detail:
        url = 'https://www.dytt8.net' + u
        # print(url)
        # send another request to get the detail page
        response = requests.get(url)
        # set the encoding here too, otherwise the text is garbled
        response.encoding = 'gb2312'
        # extract the data (the tags around .*? are missing from the source page)
        result = re.findall('.*?', response.text)[0:]
        print(result)
        try:
            with open('dytt.txt', 'a', encoding='utf-8') as fp:
                # write() accepts only strings (or bytes in binary mode), not lists or dicts
                fp.write(result[0] + '\n')
        except Exception:
            print('No data extracted!')
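
The link-extraction step can be tried offline before sending any requests. In the sketch below the HTML snippet, the class name "ulink" and the pattern are all assumptions made up for illustration; they are not taken from the article or verified against the real site, so inspect the actual page source and adjust the pattern accordingly.

```python
import re
from urllib.parse import urljoin

BASE = 'https://www.dytt8.net'

# Hypothetical list-page markup, used only to exercise the pattern
sample_html = '''
<a href="/html/gndy/dyzz/20200101/00001.html" class="ulink">Movie A</a>
<a href="/html/gndy/dyzz/20200102/00002.html" class="ulink">Movie B</a>
'''

# Assumed pattern: capture the href of every link with the assumed class
links = re.findall(r'<a href="(.*?)" class="ulink">', sample_html)

# Turn the relative links into absolute detail-page URLs
detail_urls = [urljoin(BASE, link) for link in links]

for url in detail_urls:
    print(url)
```

Testing the pattern on a saved copy of the page in this way avoids repeated requests while the expression is still being adjusted.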

Song download:

Whatever you can see on the page, you can crawl.

Approach:

1. Capture the traffic and find the paging URL: http://www.htqyy.com/genre/musicList/3?pageIndex=6&pageSize=20&order=hot

2. Request the URL above and extract the song IDs

3. Download each song: http://f2.htqyy.com/play7/{id}/mp3/1

import re
import requests

for page in range(1, 3):  # pages 1 and 2
    # turn the page
    url_song = f'http://www.htqyy.com/genre/musicList/3?pageIndex={page}&pageSize=20&order=hot'
    # send a request and get the response that contains the song IDs
    response_song = requests.get(url_song)
    # extract the IDs; returns a list
    # (the pattern is cut off after ">" in the source page)
    id_songs = re.findall(r'value="(\d+)">', response_song.text)
    # iterate over the song IDs and download each song
    for ids in id_songs:
        song_url = 'http://f2.htqyy.com/play7/{}/mp3/1'.format(ids)
        try:
            # request the song URL and get a response
            response = requests.get(song_url, timeout=5)
            # save the song
            with open(f'{ids}.mp3', 'wb') as fp:
                fp.write(response.content)
        except Exception:
            print(f'Error downloading song {ids}')

That concludes this look at how to understand and master Python regular expressions and the re module. Combining theory with practice is the best way to learn, so go and try the examples yourself.
