Example Analysis of regular expression re Module

2025-04-28 Update From: SLTechnology News&Howtos shulou

Shulou(Shulou.com)06/02 Report--

This article walks through the regular expression re module with examples. It is intended as a reference for interested readers, and we hope you learn a lot from it.

Regular expression:

Official definition: a regular expression is a logical formula for operating on strings. Predefined special characters, and combinations of them, form a "rule string", and this "rule string" expresses a filtering logic to apply to strings.

What is a regular expression: a set of rules for matching strings

Regular expressions are concerned only with strings. What we need to consider is the range of characters that can appear at a given position.

What regular expressions can do:

1. Check whether an input string is legal -- form validation in web development projects

◦ When a user enters some content, we need to check it in advance

◦ This can improve the efficiency of the program and reduce the pressure on the server

2. Find all the content that conforms to the rules in a large file -- log analysis, crawlers

◦ Content that conforms to the rules can be found efficiently and quickly in a large piece of text

Character group: [character group]

The various characters that may appear at the same position form a character group, written with [] in a regular expression. One pair of square brackets describes exactly one character position.

Characters fall into many categories, such as digits, letters and punctuation. If a position must hold "exactly one digit", then the character at that position can only be one of the 10 digits 0 through 9.

A character group describes all the possibilities that can appear at one position.

# Ranges are accepted; several ranges can be described by writing them one after another

# [abc] one pair of square brackets describes only one character position, matching a or b or c

# [0-9] matches the digits 0-9; ranges are compared according to ASCII order

# [a-z] matches all lowercase letters

# [A-Z] matches all uppercase letters

# [a-zA-Z] matches all uppercase and lowercase letters

# [0-9a-z]

# [0-9a-zA-Z_]
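The character groups listed above can be tried out directly with Python's re module. The following is a minimal sketch; the sample strings and variable names are made up for illustration.

```python
import re

# Illustrative sample text (not from the article).
text = "Room 7B, floor 3, code x9_Z"

# [0-9] describes one position that must hold a digit.
digits = re.findall(r"[0-9]", text)

# [a-z] and [A-Z] cover lowercase and uppercase letters.
lowercase = re.findall(r"[a-z]", "aBcD")
uppercase = re.findall(r"[A-Z]", "aBcD")

# Several ranges can be written one after another in the same brackets.
word_chars = re.findall(r"[0-9a-zA-Z_]", "x9_Z!")

# Two character groups in a row describe two consecutive positions.
digit_then_upper = re.findall(r"[0-9][A-Z]", text)

print(digits)            # ['7', '3', '9']
print(lowercase)         # ['a', 'c']
print(uppercase)         # ['B', 'D']
print(word_chars)        # ['x', '9', '_', 'Z']
print(digit_then_upper)  # ['7B']
```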

Metacharacters:

Metacharacter    What it matches

.    matches any character except the newline character

\w    matches a letter, digit or underscore

\s    matches any whitespace character

\d    matches a digit

\n    matches a newline character

\t    matches a tab

\b    matches a word boundary (the start or end of a word)

^    matches the beginning of a string

$    matches the end of a string

\W    matches any character that is not a letter, digit or underscore

\D    matches a non-digit

\S    matches a non-whitespace character

a|b    matches the character a or the character b

expression a|expression b    matches the content of expression a or expression b. If a matches successfully, b is not tried. Therefore, if the two rules overlap, always put the longer one first.

()    a group: matches the expression in parentheses. It constrains the scope of a metacharacter, which then only takes effect within ()

[]    character group, matches one of the characters in the group

[^]    negated character group, matches any character except those in the group

Symbols that describe what to match in a regular expression are called metacharacters.

# [0-9] --> \d matches any single digit

# [0-9a-zA-Z_] --> \w matches a digit, letter or underscore

# space --> a literal space character

# tab --> \t

# enter (newline) --> \n

# space, tab and enter --> \s matches all whitespace, including space, tab and newline

# [\d] and \d both match a digit

# [\d\D] [\w\W] [\s\S] each match every character

# [^\d] matches all non-digits

# [^1] matches every character except the digit 1

# [1-9]\d matches a two-digit integer

# [1357]\d matches a two-digit integer whose first digit is 1, 3, 5 or 7
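The shorthand metacharacters in the notes above can be checked with re.findall. This is a small sketch on made-up sample strings, not output taken from the article.

```python
import re

# Illustrative sample containing a tab, a newline and a space.
sample = "id_42\tok\nend 7"

digits = re.findall(r"\d", sample)          # every single digit
words = re.findall(r"\w+", sample)          # runs of letters, digits, underscores
blanks = re.findall(r"\s", sample)          # tab, newline and space
non_digits = re.findall(r"[^\d]", "a1b2")   # everything except digits
two_digit = re.findall(r"[1-9]\d", "07 15 99 3")
odd_start = re.findall(r"[1357]\d", "12 24 36 58 70")

# [\s\S] matches every character, including the newline that . skips.
everything = re.findall(r"[\s\S]", sample)

print(digits)      # ['4', '2', '7']
print(words)       # ['id_42', 'ok', 'end', '7']
print(blanks)      # ['\t', '\n', ' ']
print(non_digits)  # ['a', 'b']
print(two_digit)   # ['15', '99']
print(odd_start)   # ['12', '36', '58', '70']
print(len(everything) == len(sample))  # True
```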

Example 1: match multiple URLs:

www\.oldboy\.com|www\.baidu\.com|www\.jd\.com|www\.taobao\.com   # \. cancels the special meaning of .

www\.(oldboy|baidu|jd|taobao)\.com   # () constrains the scope of what | describes
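A quick way to see that the two URL rules are equivalent is to run both against the same candidates. The sketch below assumes a made-up list of host names and uses re.fullmatch so the whole string has to match.

```python
import re

long_form = re.compile(r"www\.oldboy\.com|www\.baidu\.com|www\.jd\.com|www\.taobao\.com")
short_form = re.compile(r"www\.(oldboy|baidu|jd|taobao)\.com")

# Illustrative candidates: four expected hits and two expected misses.
hosts = [
    "www.oldboy.com", "www.baidu.com", "www.jd.com", "www.taobao.com",
    "wwwXjd.com", "www.example.com",
]

matched_long = [h for h in hosts if long_form.fullmatch(h)]
matched_short = [h for h in hosts if short_form.fullmatch(h)]

# Without the backslash, . matches any character, so "wwwXjd.com" slips through.
unescaped_hit = re.fullmatch(r"www.jd.com", "wwwXjd.com") is not None
escaped_hit = re.fullmatch(r"www\.jd\.com", "wwwXjd.com") is not None

print(matched_long == matched_short)  # True
print(matched_short)                  # the first four hosts
print(unescaped_hit, escaped_hit)     # True False
```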

Metacharacters to remember: all of them describe what can be matched. A metacharacter always represents the content of one character position.

# \d \w \s \t \n \D \W \S

# [] [^] .

# ^ $

# | ()

Quantifier:

Quantifier    Usage

*    repeat 0 or more times, same as {0,}

+    repeat 1 or more times, same as {1,}

?    repeat 0 or 1 times, same as {0,1}

{n}    repeat exactly n times

{n,}    repeat n or more times, that is, at least n times

{n,m}    repeat n to m times, that is, at least n times and at most m times
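Each quantifier in the table can be exercised against a few short strings. The sketch below uses a small helper named full, which is a hypothetical convenience wrapper around re.fullmatch and not part of the re module.

```python
import re

def full(pattern, s):
    """Hypothetical helper: True if the whole string s matches pattern."""
    return re.fullmatch(pattern, s) is not None

star = [full(r"ab*", s) for s in ("a", "ab", "abbb")]          # b zero or more times
plus = [full(r"ab+", s) for s in ("a", "ab", "abbb")]          # b one or more times
optional = [full(r"ab?", s) for s in ("a", "ab", "abb")]       # b zero or one time
exact = [full(r"\d{3}", s) for s in ("12", "123", "1234")]     # exactly 3 digits
at_least = [full(r"\d{2,}", s) for s in ("1", "12", "12345")]  # 2 or more digits
between = [full(r"\d{2,4}", s) for s in ("1", "12", "1234", "12345")]  # 2 to 4 digits

print(star)      # [True, True, True]
print(plus)      # [False, True, True]
print(optional)  # [True, True, False]
print(exact)     # [False, True, False]
print(at_least)  # [False, True, True]
print(between)   # [False, True, True, False]
```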

Example:

Match an integer: \d+

Match a decimal: \d+\.\d+

Match an integer or a decimal: \d+\.?\d*   # this has a problem: something like "1." is also matched --> this is where grouping helps: \d+(\.\d+)?
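The difference between the loose rule and the grouped rule shows up as soon as both are run on "1.". The following sketch uses illustrative inputs; it relies on finditer with group() when extracting, so that the full match is returned rather than the content of the parentheses.

```python
import re

loose = re.compile(r"\d+\.?\d*")      # also accepts a trailing dot
strict = re.compile(r"\d+(\.\d+)?")   # the dot must be followed by digits

samples = ["42", "3.14", "1."]
loose_ok = [loose.fullmatch(s) is not None for s in samples]
strict_ok = [strict.fullmatch(s) is not None for s in samples]

# Extracting numbers from running text with the grouped rule.
numbers = [m.group() for m in strict.finditer("pi is 3.14, answer is 42, bad 1.")]

print(loose_ok)   # [True, True, True]
print(strict_ok)  # [True, True, False]
print(numbers)    # ['3.14', '42', '1']
```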

Example: match a mobile phone number. The number starts with 1, the second digit is 3-9, and there are 11 digits in total.

1[3-9]\d{9}

# Check whether the content entered by the user is legal. If the input is correct, a result is found; if the input is incorrect, no result is found.

^1[3-9]\d{9}$

# Find all the content that conforms to the rules in a large file

1[3-9]\d{9}
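The two uses of the phone number rule, validating one input and scanning a larger text, can be sketched as follows. The function name is_valid_phone and the sample numbers are invented for illustration.

```python
import re

def is_valid_phone(s):
    """Hypothetical validator: the whole input must be an 11-digit number."""
    return re.search(r"^1[3-9]\d{9}$", s) is not None

checks = {
    "13812345678": is_valid_phone("13812345678"),    # valid
    "12812345678": is_valid_phone("12812345678"),    # second digit is 2
    "1381234567": is_valid_phone("1381234567"),      # only 10 digits
    "138123456789": is_valid_phone("138123456789"),  # 12 digits
}

# Without ^ and $, the same rule finds every number inside a larger text.
log_line = "call 13812345678 or 19900001111, not 12345678901"
found = re.findall(r"1[3-9]\d{9}", log_line)

print(checks)
print(found)  # ['13812345678', '19900001111']
```

Note that $ also matches just before a trailing newline, so re.fullmatch with the unanchored rule is a stricter alternative when the input may end with "\n".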

Escape character:

Characters that normally have a special meaning need to be escaped when they should stand for themselves.

. has a special meaning; \. cancels that special meaning.

Some characters with a special meaning lose it when placed inside a character group.

# Each of these represents only the symbol itself

[().*+?] All of these lose their special meaning inside a character group.

# Matches a, - or c

[a\-c] A - denotes a range inside a character group. If you do not want it to denote a range, escape it or put it at the very front or very end of the character group.

There are two ways to cancel the special meaning of a metacharacter:

1. Precede the metacharacter with \

2. For some characters only: put the metacharacter inside a character group

# [.()+?*]
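Both ways of cancelling a special meaning can be compared side by side. This is a minimal sketch with made-up inputs.

```python
import re

# An unescaped . matches any character; \. matches only a literal dot.
any_char = re.findall(r"a.b", "a.b a-b")
literal_dot = re.findall(r"a\.b", "a.b a-b")

# Inside a character group these symbols stand for themselves.
symbols = re.findall(r"[().*+?]", "f(x)*2+1?.")

# - is a range inside a group unless it is escaped or placed first or last.
as_range = re.findall(r"[a-c]", "a-b-c")
escaped = re.findall(r"[a\-c]", "a-b-c")
at_front = re.findall(r"[-ac]", "a-b-c")
at_end = re.findall(r"[ac-]", "a-b-c")

print(any_char)     # ['a.b', 'a-b']
print(literal_dot)  # ['a.b']
print(symbols)      # ['(', ')', '*', '+', '?', '.']
print(as_range)     # ['a', 'b', 'c']
print(escaped)      # ['a', '-', '-', 'c']
print(at_front == escaped == at_end)  # True
```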

Greedy match:

1. Greedy matching: match as much as possible within the range the quantifier allows.

.*x matches any character any number of times and stops only at the last x it can reach (backtracking algorithm).

2. Non-greedy (lazy) matching: always match as little as possible within the range the quantifier allows.

The plain quantifiers *, + and so on are all greedy, that is, they match as much as possible. Adding a ? after the quantifier makes the match lazy.

.*?x matches any character any number of times but stops as soon as it encounters x.

.+?x matches any character at least once and then stops as soon as it encounters x.

metacharacter + quantifier + ? gives a lazy match.

Several commonly used non-greedy forms:

*? repeat any number of times, but as few times as possible

+? repeat one or more times, but as few times as possible

?? repeat 0 or 1 times, but as few times as possible

{n,m}? repeat n to m times, but as few times as possible

{n,}? repeat n or more times, but as few times as possible

Example: match an ID card number, which has either 18 or 15 digits

# 15 digits: the first digit is 1-9, 15 digits in total

[1-9]\d{14}

# 18 digits: the first digit is 1-9, the last character is 0-9 or x, 18 characters in total

[1-9]\d{16}[\dx]

[1-9]\d{16}[0-9x]

# Method 1:

[1-9]\d{16}[0-9x]|[1-9]\d{14}   # find all the content that conforms to the rules in a large file

This tries [1-9]\d{16}[0-9x] first and, if that does not match, tries [1-9]\d{14}.

^([1-9]\d{16}[0-9x]|[1-9]\d{14})$   # check whether an input string is legal

# Method 2: simplified

[1-9]\d{14}(\d{2}[\dx])?   # find all the content that conforms to the rules in a large file

^[1-9]\d{14}(\d{2}[\dx])?$   # check whether an input string is legal

() indicates grouping. Putting \d{2}[\dx] into a group lets you constrain it, as a whole, to appear 0 or 1 times.
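Greedy versus lazy quantifiers and the two ID card rules can both be checked in a few lines. The sketch below uses invented test strings, and is_valid_id is a hypothetical helper name. It also shows why the longer alternative should come first when the rule is not anchored.

```python
import re

# Greedy: .* runs to the end of the string, then backtracks to the last x.
greedy = re.search(r".*x", "axbxc").group()
# Lazy: .*? grows one character at a time and stops at the first x.
lazy = re.search(r".*?x", "axbxc").group()
# .+? must consume at least one character before it looks for x.
lazy_plus = re.search(r".+?x", "xaxbx").group()

ID_GROUPED = r"^[1-9]\d{14}(\d{2}[\dx])?$"
ID_ALTERNATION = r"^([1-9]\d{16}[0-9x]|[1-9]\d{14})$"

def is_valid_id(s, pattern=ID_GROUPED):
    """Hypothetical validator for the 15- or 18-character format."""
    return re.search(pattern, s) is not None

# Synthetic inputs that only exercise the length and first-digit rules.
id15 = "1" + "2" * 14
id18 = "1" + "2" * 16 + "x"
id16 = "1" + "2" * 15
leading_zero = "0" + "2" * 14

candidates = [id15, id18, id16, leading_zero]
grouped = [is_valid_id(s) for s in candidates]
alternation = [is_valid_id(s, ID_ALTERNATION) for s in candidates]

# Unanchored alternation: the first branch that matches wins.
long_first = re.search(r"[1-9]\d{16}[0-9x]|[1-9]\d{14}", id18).group()
short_first = re.search(r"[1-9]\d{14}|[1-9]\d{16}[0-9x]", id18).group()

print(greedy, lazy, lazy_plus)            # axbx ax xax
print(grouped)                            # [True, True, False, False]
print(grouped == alternation)             # True
print(len(long_first), len(short_first))  # 18 15
```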
For example:

Rule: 1\d+?3   Content to be matched: 1243333344   Match result: 1243

Rule: 1\d+3   Content to be matched: 1243333344   Match result: 12433333

re module:

import re

# findall still matches by the complete rule but displays only the content matched inside the parentheses.
# It returns every match that meets the rule and gives priority to the content of the group.
ret = re.findall(r'9\d\d', '19740ash93010uru')
print(ret)   # ['974', '930']
ret = re.findall(r'9(\d)\d', '19740ash93010uru')
print(ret)   # ['7', '3']

# search also matches by the complete rule, returns only the first match, and does not give priority to groups.
# The result is a variable (a match object). Passing a number to its group method returns the content of that group, that is, the content inside ().
# group() gives exactly the same result as group(0); group(n) returns the content matched by the nth group.
ret = re.search(r'9(\d)(\d)', '19740ash93010uru')
print(ret)   # the match object
if ret:
    print(ret.group())    # 974, same as ret.group(0); 0 is the default and need not be written
    print(ret.group(1))   # 7
    print(ret.group(2))   # 4

Why is group priority not needed in search but needed in findall? Parentheses are added to extract what is really needed. search can hand back any group through group(n), whereas findall returns only a list, so it shows the group content by default.

Why use grouping? Put what we want into groups. When the data we are looking for sits in a complex context and has no distinctive feature of its own, it may even be mixed with unwanted clutter. In that case we write a rule that matches the whole context, then put parentheses around the part of the regular expression that corresponds to the data we really need. This filters out exactly the data we want.
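As a complement, the sketch below reuses the same sample string to show what findall returns with two groups, how a non-capturing group written as (?:...) switches group priority off, and how finditer gives access to both the full match and the groups. Variable names are illustrative.

```python
import re

s = "19740ash93010uru"

# With two groups, findall returns one tuple per match.
two_groups = re.findall(r"9(\d)(\d)", s)

# (?:...) groups without capturing, so findall shows the full match again.
non_capturing = re.findall(r"9(?:\d)\d", s)

# Mixing both: only the capturing group is shown.
mixed = re.findall(r"9(?:\d)(\d)", s)

# finditer yields match objects, so group() and group(n) are both available.
pairs = [(m.group(), m.group(1)) for m in re.finditer(r"9(\d)\d", s)]

print(two_groups)     # [('7', '4'), ('3', '0')]
print(non_capturing)  # ['974', '930']
print(mixed)          # ['4', '0']
print(pairs)          # [('974', '7'), ('930', '3')]
```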
# How to cancel grouping priority
# If, as a last resort, content you do not want has to be written inside a group, (?:...) cancels the priority display of that group.

# findall
ret = re.findall(r'(\w+)', 'askh930s02391j192agsj')
print(ret)   # ['askh930s02391j192agsj']

# search
ret = re.search(r'(\w+)', 'askh930s02391j192agsj')
print(ret.group())    # askh930s02391j192agsj
print(ret.group(1))   # askh930s02391j192agsj
# ret.group(2) would raise IndexError here, because this rule contains only one group

# Match the first addition or the first subtraction in exp, of the form a+b or a-b, and calculate the result
exp = '2-3*(5+6)'
ret = re.search(r'(\d+)[+](\d+)', exp)
print(ret)
print(ret.group(1))   # 5
print(ret.group(2))   # 6
print(int(ret.group(1)) + int(ret.group(2)))   # 11

# Put the Douban page source into douban.html and get the movie titles:
with open('douban.html', encoding='utf-8') as f:
    content = f.read()
ret = re.findall(r'...(.*?)...(?:\s*....*?...)?', content)   # each ... stands for literal tag text from the page source
print(ret)

# Except for Farewell My Concubine, every other movie follows the format of The Shawshank Redemption.
# (.*?) is the movie name to be displayed; the ? marks a non-greedy match
# (?:\s*....*?...)? breaks down as follows:
#   ?: cancels the priority display of this group, so it is not displayed
#   \s* covers all the whitespace between the two lines of source code
#   ....*?... is the English name of the movie
#   the final ? means this part appears 0 times or once

# What is a crawler?
# It fetches the source code of a web page through code. What we need is the page content embedded in that source code, and regular expressions extract it.

# Install the extension module: File -- Settings -- Project Interpreter -- + -- find the package -- Install Package
import requests
ret = requests.get('https://movie.douban.com/top250?start=0&filter=')
print(ret.content.decode('utf-8'))

Thank you for reading this article carefully. I hope the article "Example Analysis of regular expression re Module" shared by the editor will be helpful to you.