What is a Python regular expression and how to use it

2025-07-16 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/02 Report--

This article introduces what Python regular expressions are and how to use them. Many people run into difficulties with them in real-world work, so this article walks through how to handle those situations. I hope you read it carefully and get something out of it!

1. re

Let's first introduce the functions of the re module. They are easy to follow if you already know a little about regular expressions; if you don't know regular expressions at all, read the regular expression section below first.

1.1 match

The match function matches a pattern from the beginning of the string and returns None if the match is not successful.

    re.match(pattern, string, flags=0)

pattern: the regular expression
string: the string to be matched
flags: matching flags (case sensitivity, single-line or multi-line matching, and so on)

match returns a re.Match object; the methods of Match are described in more detail later.

    import re
    content = "Cats are smarter than dogs"
    # the first parameter is a regular expression, and re.I means to ignore case
    match = re.match(r'(cats)', content, re.I)
    print(type(match))
    print(match.groups())
    match = re.match(r'dogs', content, re.I)
    print(type(match))
    # print(match.groups())

match is mainly used together with capturing groups, so try to put groups in the pattern; without a group, groups() returns an empty tuple even when the pattern matches, and only group() gives you the matched text. Passing re.I as flags means ignoring case.
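As a small sketch of the point above (the variable names are only illustrative and not from the original article), this is how groups() and group() behave with and without a capturing group:

```python
import re

content = "Cats are smarter than dogs"

# With a capturing group, groups() returns the captured text.
with_group = re.match(r"(cats)", content, re.I)
captured = with_group.groups()

# Without a group, groups() is empty, but group() still returns the match.
without_group = re.match(r"cats", content, re.I)
empty_groups = without_group.groups()
whole_match = without_group.group()

# Without re.I the pattern is case-sensitive, so "cats" does not match "Cats".
case_sensitive = re.match(r"cats", content)

print(captured, empty_groups, whole_match, case_sensitive)
```

Checking the result of re.match for None before calling its methods avoids an AttributeError when nothing matches.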

Another very important point is that match only returns the first match, at the start of the string, even if the pattern occurs again later:

    import re
    content = "aa aa smarter aa dogs"
    match = re.match(r'(aa)', content, re.I)
    if match:
        print(match.groups())

The output above is: ('aa',)

1.2 search

search scans the entire string and returns the first successful match. It differs from match in that it does not require the match to start at the beginning of the string.
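The difference between the two can be sketched as follows, reusing the sample sentence from section 1.1 (the variable names are illustrative):

```python
import re

content = "Cats are smarter than dogs"

# match only succeeds if the pattern matches at position 0.
at_start = re.match(r"dogs", content)

# search scans forward and returns the first match anywhere in the string.
anywhere = re.search(r"dogs", content)

print(at_start)
print(anywhere.group(), anywhere.span())
```

Here match returns None because "dogs" is not at the start, while search finds it at the end of the sentence.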

    re.search(pattern, string, flags=0)

    import re
    content = '+123abc456*def789ghi'
    # \w matches [a-zA-Z0-9_], + means to match at least once
    reg = r"\w+"
    match = re.search(reg, content)
    if match:
        print(match.group())

1.3 sub

Replaces matches in a string.

    re.sub(pattern, repl, string, count=0, flags=0)

pattern: the regular expression
repl: the replacement string; it can also be a function
string: the string to search and replace in
count: optional; the maximum number of replacements. The default 0 means replace all matches
flags: optional; matching flags. The default is 0

Replacing censored words:

    import re
    content = "do something fuck you"
    rs = re.sub(r'fuck', "*", content)
    print(rs)

Very simple: fuck is replaced with *.

Now the requirement changes: each blocked word should be replaced by as many * as it has characters. What should we do?

    import re
    def calcWords(matched):
        # number of characters in the matched word
        num = len(matched.group())
        return num * '*'
    content = "do something fuck you"
    rs = re.sub(r'fuck', calcWords, content)
    print(rs)

The replacement can be a function, which receives the Match object and makes it easy to compute the replacement text.
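A minimal sketch of the remaining re.sub options, using a neutral sample string and a hypothetical helper named mask (neither is from the original article): the count parameter limits the number of replacements, and a replacement string can refer to groups with \1, \2 and so on.

```python
import re

content = "spam eggs spam ham spam"

def mask(matched):
    # Replace each matched word with as many '*' as it has characters.
    return "*" * len(matched.group())

# Replace every occurrence.
all_masked = re.sub(r"spam", mask, content)

# Replace only the first occurrence.
first_only = re.sub(r"spam", mask, content, count=1)

# A replacement string can reference captured groups.
swapped = re.sub(r"(\w+) (\w+)", r"\2 \1", "hello world")

print(all_masked)
print(first_only)
print(swapped)
```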

1.4 findall

Finds all the substrings matched by the regular expression in the string and returns a list, or an empty list if no matches are found.

    re.findall(pattern, string, flags=0)

pattern: the regular expression
string: the string to be matched
flags: optional; matching flags. The default is 0

    import re
    content = '+123abc456*def789ghi'
    reg = r"\d+"
    rs = re.findall(reg, content)
    # ['123', '456', '789']
    print(rs)

One annoying thing about findall is that if the regular expression contains a group, only the text matched by the group is returned.
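To illustrate this behavior (a sketch; the variable names are assumptions), compare one group, two groups, and a non-capturing group (?:...), which keeps the grouping but lets findall return the whole match:

```python
import re

content = "+123abc456*def789ghi"

# One group: only the group's text is returned.
one_group = re.findall(r"\d+([a-z]+)", content)

# Several groups: a list of tuples, one item per group.
two_groups = re.findall(r"(\d+)([a-z]+)", content)

# Non-capturing group: the whole match is returned again.
non_capturing = re.findall(r"\d+(?:[a-z]+)", content)

print(one_group)
print(two_groups)
print(non_capturing)
```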

    import re
    content = '+123abc456*def789ghi'
    reg = r"\d+([a-z]+)"
    rs = re.findall(reg, content)
    # ['abc', 'ghi']
    print(rs)

1.5 finditer

Finds all the substrings matched by the regular expression in the string and returns them as an iterator of Match objects.

    re.finditer(pattern, string, flags=0)

pattern: the regular expression
string: the string to be matched
flags: optional; matching flags. The default is 0

    import re
    content = '+123abc456*def789ghi'
    reg = r"\d+"
    rss = re.finditer(reg, content)
    # 123 456 789
    for rs in rss:
        print(rs.group(), end=' ')

finditer is similar to findall, but it does not have findall's annoying behavior of returning only the groups when the expression contains a group.

    import re
    content = '+123abc456*def789ghi'
    reg = r"\d+([a-z]+)"
    rss = re.finditer(reg, content)
    # 123abc 789ghi
    for rs in rss:
        print(rs.group(), end=' ')

1.6 split

Splits the string at each match of the pattern and returns the pieces as a list.
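A short sketch of how re.split behaves on the sample string used throughout this section (the variable names are illustrative). The maxsplit parameter limits the number of splits, and wrapping the pattern in a group keeps the separators in the result:

```python
import re

content = "+123abc456*def789ghi"

# Split at every run of digits.
parts = re.split(r"\d+", content)

# Split at most once; the rest of the string is returned unchanged.
limited = re.split(r"\d+", content, maxsplit=1)

# With a capturing group, the matched separators are included in the list.
with_separators = re.split(r"(\d+)", content)

print(parts)
print(limited)
print(with_separators)
```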

    re.split(pattern, string, maxsplit=0, flags=0)

    import re
    content = '+123abc456*def789ghi'
    reg = r"\d+"
    rs = re.split(reg, content)
    print(rs)

1.7 compile

Compiles a regular expression into a Pattern object. The functions above first compile their pattern into a Pattern object and then use the Pattern method of the same name.

The Pattern object is introduced in the next section.

    re.compile(pattern, flags=0)

2. Pattern

The Pattern object is a compiled regular expression. Pattern cannot be instantiated directly; it must be constructed with re.compile().

2.1 Attributes

pattern: the regular expression used at compile time
flags: the matching flags used at compile time, as a number
groups: the number of groups in the expression
groupindex: a dictionary of the aliased groups; the key is the alias of the group and the value is the group number

    import re
    pattern = re.compile(r'(\w+) (?P<gname>.*)', re.S)
    # pattern: (\w+) (?P<gname>.*)
    print("pattern:", pattern.pattern)
    # flags: 48
    print("flags:", pattern.flags)
    # groups: 2
    print("groups:", pattern.groups)
    # groupindex: {'gname': 2}
    print("groupindex:", pattern.groupindex)

2.2 Methods

The functions described earlier for the re module are also available as Pattern methods, except that they take no pattern parameter.

The reason is simple: each re module function uses its pattern parameter to build a Pattern object, as re.compile does, and then calls the Pattern method of the same name.
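The correspondence can be sketched like this (the name digits is an assumption for illustration). Pattern.search additionally accepts an optional start position, which the module-level function does not:

```python
import re

content = "+123abc456*def789ghi"
digits = re.compile(r"\d+")

# Same name, same arguments minus the pattern.
same_result = digits.findall(content) == re.findall(r"\d+", content)
first = digits.search(content).group()
replaced = digits.sub("#", content)

# Pattern.search can start scanning at a given index.
from_pos = digits.search(content, 4).group()

print(same_result, first, replaced, from_pos)
```

Compiling once and reusing the Pattern object is mainly a convenience when the same expression is applied many times.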

3. Match

The Match object is the result of a match that contains information about the match, which can be obtained using the properties or methods provided by Match.

3.1 Attributes

string: the text used for matching
re: the Pattern object used for the match
pos: the index in the text where the regular expression starts searching
endpos: the index in the text where the regular expression stops searching
lastindex: the index of the last captured group; None if no group was captured
lastgroup: the alias of the last captured group; None if no group was captured

    import re
    content = "123456<h1>first</h1>123456"
    reg = r'\d+<h(?P<num>[1-6])>.*?</h(?P=num)>'
    match = re.match(reg, content)
    # string: 123456<h1>first</h1>123456
    print("string:", match.string)
    # re: re.compile('\\d+<h(?P<num>[1-6])>.*?</h(?P=num)>')
    print("re:", match.re)
    # pos: 0
    print("pos:", match.pos)
    # endpos: 26
    print("endpos:", match.endpos)
    # lastindex: 1
    print("lastindex:", match.lastindex)
    # lastgroup: num
    print("lastgroup:", match.lastgroup)

In my experience, the attributes of Match are of limited use.

3.2 Methods

groups(): gets the strings matched by all groups and returns a tuple
group([group1, ...]): gets the string matched by the given group; with several arguments, returns a tuple
start(group): gets the start position of the group's match in the original string
end(group): gets the end position of the group's match in the original string
span(group): gets the start and end positions of the group's match in the original string as a tuple
groupdict(): gets the strings matched by the aliased groups and returns a dictionary whose keys are the aliases
expand(template): builds a string from the template, in which matched groups can be referenced by alias or number

Note: group() with no arguments is equivalent to group(0) and returns the entire matched string.

    import re
    match = re.match(r'(\w+) (\w+) (?P<name>.*)', 'You love sun')
    # groups(): ('You', 'love', 'sun')
    print("groups():", match.groups())
    # group(2): love
    print("group(2):", match.group(2))
    # start(2): 4
    print("start(2):", match.start(2))
    # end(2): 8
    print("end(2):", match.end(2))
    # span(2): (4, 8)
    print("span(2):", match.span(2))
    # groupdict(): {'name': 'sun'}
    print("groupdict():", match.groupdict())
    # expand(r'I \2 \1!'): I love You!
    print(r"expand(r'I \2 \1!'):", match.expand(r'I \2 \1!'))

The methods in Match above are important, because we basically end up getting matches through methods in the Match object.
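A few further calls on the same match, as a sketch (variable names are illustrative): group accepts several indexes or an alias, span accepts an alias, and expand can reference an aliased group with \g<name>.

```python
import re

m = re.match(r"(\w+) (\w+) (?P<name>.*)", "You love sun")

whole = m.group()            # same as m.group(0)
second = m.group(2)          # a single group returns a string
pair = m.group(2, 3)         # several groups return a tuple
by_name = m.group("name")    # groups can be fetched by alias
name_span = m.span("name")   # positions can be fetched by alias too
expanded = m.expand(r"\g<name>: \1 \2")

print(whole, second, pair, by_name, name_span, expanded)
```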

4. Regular expressions

4.1 Common expressions

. : matches any character except a newline; when the re.S flag is specified, it matches any character including a newline
? : matches 0 or 1 occurrence of the preceding expression; placed after another quantifier, it makes that quantifier non-greedy
+ : matches 1 or more occurrences of the preceding expression
* : matches 0 or more occurrences of the preceding expression
[] : represents a set of characters listed individually; [abc] matches the character 'a', 'b' or 'c'
[^] : not in the set; [^abc] matches any character other than a, b, c
^ : matches the beginning of the string; in multiline mode it also matches the beginning of each line
\A : matches the beginning of the string
$ : matches the end of the string; in multiline mode it also matches the end of each line
\Z : matches the end of the string
{n} : exactly n times; "o{2}" matches food, but not fod or foood
{n,} : at least n times; "o{2,}" matches food and foood, but not fod
{n,m} : n to m times; "o{2,3}" matches food and foood, but not fod or fooood
| : a|b matches a or b
- : represents a range; [0-9] matches any digit from 0 to 9

The most commonly used is ., which matches any character: a.b can match abb, acb, adb, aqb, a8b, etc.

? means at most one occurrence: abb? can match ab and abb, but not abbabb, because ? applies only to the immediately preceding element.

+ means at least one occurrence: abb+ can match abb, abbb, abbbb, etc., but not ab.

* means zero or more occurrences: abb* can match ab, abb, abbb, abbbb, etc.

[] contains a set of characters, and the relationship between them is "or": any one of them may match.
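The quantifier examples above can be checked with a small sketch. The helper matches is hypothetical and uses re.fullmatch so that the whole word must fit the pattern:

```python
import re

def matches(pattern, words):
    # Keep only the words that the pattern matches in full.
    return [w for w in words if re.fullmatch(pattern, w)]

words = ["ab", "abb", "abbb", "abbabb"]

optional = matches(r"abb?", words)
one_or_more = matches(r"abb+", words)
zero_or_more = matches(r"abb*", words)
char_set = matches(r"a[bc]b", ["abb", "acb", "adb"])
bounded = matches(r"fo{2,3}d", ["fod", "food", "foood", "fooood"])

print(optional, one_or_more, zero_or_more, char_set, bounded)
```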

4.2 Boundaries and whitespace

\t : tab character
\n : line feed
\f : form feed
\w : matches digits, letters and underscores; for ASCII text it is equivalent to [a-zA-Z0-9_]
\W : matches anything that is not a digit, letter or underscore; equivalent to [^a-zA-Z0-9_]
\s : matches whitespace characters; equivalent to [ \t\n\r\f]
\S : matches non-whitespace characters; equivalent to [^ \t\n\r\f]
\d : matches digits; equivalent to [0-9]
\D : matches non-digits; equivalent to [^0-9]
\b : matches a word boundary; 'er\b' can match the 'er' in "over" but not the 'er' in "service"
\B : matches a non-word boundary; 'er\B' can match the 'er' in "service" but not the 'er' in "over"
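The boundary examples can be verified with a short sketch (the sample strings other than "over" and "service" are assumptions for illustration):

```python
import re

# \b requires a word boundary after "er"; \B requires that there is none.
over_end = bool(re.search(r"er\b", "over"))
service_end = bool(re.search(r"er\b", "service"))
service_inside = bool(re.search(r"er\B", "service"))
over_inside = bool(re.search(r"er\B", "over"))

# Character classes in action.
words = re.findall(r"\w+", "a_1 b-2\tc")
fields = re.split(r"\s+", "a b\tc\nd")
digits = re.findall(r"\d", "a1b22")

print(over_end, service_end, service_inside, over_inside)
print(words, fields, digits)
```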

4.3 Grouping

(re) : groups the enclosed expression; nested groups are numbered from left to right and from outside to inside
\number : references a group; \1, \2, \3 ... refer to groups 1, 2, 3 ...
(?P<name>re) : gives the group the alias name
(?P=name) : references the group by its alias name

One of the most important uses of grouping is the backreference, that is, referring to text that a group has already matched.

Think about it: how do you match all the h tags in html?

    reg = '<h[1-6]>.*?</h[1-6]>'

Many readers might write an expression like the one above. Is there a problem with it?

Look at the following example:

    import re
    content = '<h1>first</h1><p>p tag</p><h3>h3 illegal tag</h4>'
    rs = re.findall(r'<h[1-6]>.*?</h[1-6]>', content)
    print(rs)
    rs = re.findall(r'<h([1-6])>.*?</h\1>', content)
    print(rs)
    rs = re.findall(r'((<h([1-6])>).*?</h\3>)', content)
    print(rs)
    rs = re.findall(r'((<h(?P<num>[1-6])>).*?</h(?P=num)>)', content)
    print(rs)

Looking at the output, we can see that

    reg = '<h[1-6]>.*?</h[1-6]>'

also matches the 'illegal tag' part.

We can solve this problem by grouping and then referencing the group.

    reg1 = r'<h([1-6])>.*?</h\1>'
    reg2 = r'((<h([1-6])>).*?</h\3>)'

Because findall returns only the groups when the expression contains groups, we use reg2, whose outermost group captures the whole tag.

Why \3?

Because groups are numbered from left to right and from outside to inside, ([1-6]) is the third group.

If you don't want to count, or if you're afraid of making a mistake, you can use an alias.
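A minimal sketch of the alias approach. The sample HTML string and the alias level are assumptions made for this illustration, and finditer is used so that the whole match is returned without an extra outer group:

```python
import re

# Hypothetical sample: the last heading is closed with the wrong tag.
html = "<h1>first</h1><p>p tag</p><h3>illegal tag</h4>"

# Without a backreference, the mismatched heading is accepted.
loose = re.findall(r"<h[1-6]>.*?</h[1-6]>", html)

# (?P<level>...) names the group; (?P=level) requires the same digit again.
strict_pattern = re.compile(r"<h(?P<level>[1-6])>.*?</h(?P=level)>")
strict = [m.group() for m in strict_pattern.finditer(html)]

print(loose)
print(strict)
```

With an alias there is no need to recount group numbers when the expression gains or loses a group.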
