Python crawler preparation (2): regular expressions

2025-07-15 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/03 Report--

The re module

The re module gives Python full regular expression support.

The compile function generates a regular expression object based on a pattern string and optional flag parameters. The object has a series of methods for regular expression matching and replacement.

Escape characters

Regular expressions use "\" as the escape character, and so do Python string literals. Whenever a special character needs escaping, you would have to work out how many backslashes to write. To avoid this, write regular expressions as raw strings.

The method is simple: put an r in front of the string literal, as follows:

r'\d{2}-\d{8}'
r'\bt\w*\b'
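
A minimal sketch of what the r prefix changes. The sample phone string is a made-up input, used here only for illustration:

```python
import re

# Without the r prefix, every regex backslash has to be doubled.
plain = '\\d{2}-\\d{8}'
raw = r'\d{2}-\d{8}'
same_pattern = (plain == raw)

# In a normal string "\b" is the single backspace character,
# while r"\b" is two characters: a backslash followed by "b".
backspace_len = len('\b')
raw_len = len(r'\b')

# Hypothetical sample input.
phone = re.match(raw, '10-12345678')

print(same_pattern, backspace_len, raw_len, phone.group())
```

Both spellings produce the same pattern. The raw form is simply easier to read and harder to get wrong.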

Common functions

re.match()

Matches from the start of the string. If the match succeeds, a match object is returned; otherwise None is returned.

Syntax: re.match(pattern, string, flags=0)
pattern: the regular expression to match
string: the string to match against
flags: flag bits that control how the regular expression is matched, such as case sensitivity and multi-line matching; flags=0 means no flags are set

Modifiers are passed through the optional flags argument. Multiple flags can be combined with bitwise OR (|). For example, re.I | re.M sets both the I (ignore case) and M (multi-line) flags.
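
As a small illustration of combining flags, the sketch below runs the same pattern with and without re.I | re.M. The sample text is an assumption chosen to make the difference visible:

```python
import re

text = "Hello\nhello world"  # hypothetical sample text

# No flags: ^ anchors only at the start of the whole string, case-sensitive.
no_flags = re.findall(r"^hello", text)

# re.I ignores case; re.M lets ^ match at the start of every line.
both_flags = re.findall(r"^hello", text, re.I | re.M)

print(no_flags)
print(both_flags)
```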

re.search()

Scans the entire string and returns the first successful match; otherwise returns None.

Syntax: re.search(pattern, string, flags=0)

The difference between re.match and re.search

re.match only matches at the beginning of the string. If the beginning of the string does not match the regular expression, the match fails and the function returns None. re.search scans through the string until it finds a match (note: only the first one).
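
The difference can be seen in a short sketch. The URL-like string is a made-up example:

```python
import re

text = "www.example.com"  # hypothetical sample text

at_start = re.match(r"www", text)      # the pattern occurs at position 0
not_at_start = re.match(r"com", text)  # "com" is not at the start, so None
anywhere = re.search(r"com", text)     # search scans forward until it finds "com"

print(at_start.span())
print(not_at_start)
print(anywhere.span())
```

Because both functions can return None, check the result before calling group() or span() on it.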

re.findall()

Finds all substrings of the string that match the regular expression and returns them as a list, or an empty list if there are no matches.

Note: match and search match once, while findall finds all matches.

re.split()

Splits the string wherever the regular expression matches a delimiter and returns the resulting pieces as a list.
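
A minimal sketch of re.split with several possible delimiters. The input line and the delimiter pattern are illustrative assumptions. maxsplit is the standard optional argument that limits the number of splits:

```python
import re

line = "a, b;c  d"  # hypothetical sample input

# Split on a comma, a semicolon, or a whitespace character,
# each followed by optional extra whitespace.
parts = re.split(r"[,;\s]\s*", line)

# maxsplit limits the number of splits; the rest of the string stays intact.
first_two = re.split(r"[,;\s]\s*", line, maxsplit=1)

print(parts)
print(first_two)
```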

re.sub()

Replaces matches in a string.

Syntax: re.sub(pattern, repl, string, count=0)

pattern: the regular expression pattern string
repl: the replacement string; it can also be a function
string: the original string to search and replace in
count: the maximum number of matches to replace; the default of 0 replaces all matches
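
The sketch below shows both forms of repl and the effect of count. The input string and the helper name double are hypothetical, chosen only for illustration:

```python
import re

text = "A23G4HFD567"  # hypothetical sample input

def double(match):
    """Return the matched number multiplied by two, as a string."""
    return str(int(match.group()) * 2)

# repl as a function: it is called once per match with the match object.
all_doubled = re.sub(r"\d+", double, text)

# repl as a string, with count=1: only the first match is replaced.
first_only = re.sub(r"\d+", "#", text, count=1)

print(all_doubled)
print(first_only)
```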

re.compile()

The compile function compiles a regular expression into a regular expression (Pattern) object, which is then used to match strings.

Syntax: re.compile(pattern, flags=0)

pattern: a regular expression in string form
flags: optional; sets the matching mode, such as ignoring case or multi-line mode
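
A minimal sketch of compiling once and reusing the Pattern object. The sample strings are assumptions:

```python
import re

# Compile once, then reuse the Pattern object for several strings.
word_pattern = re.compile(r"[a-z]+", re.I)

samples = ["Python 3", "42 Regex", "1234"]  # hypothetical inputs

first_words = []
for s in samples:
    m = word_pattern.search(s)
    # search returns None when nothing matches, so guard before group().
    first_words.append(m.group() if m else None)

print(first_words)
```

The Pattern object offers the same operations as the module-level functions (match, search, findall, split, sub), without passing the pattern each time.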

Greedy matching and non-greedy matching

Greedy matching: match as many characters as possible; non-greedy matching: match as few characters as possible.

Python's regular expression matching is greedy by default.

>>> re.match(r'^(\w+)(\d*)$', 'abc123').groups()
('abc123', '')
>>> re.match(r'^(\w+?)(\d*)$', 'abc123').groups()
('abc', '123')

Expression 1: \w+ matches letters, digits, underscores, or Chinese characters, one or more times; \d* matches digits, zero or more times. Group 1 (\w+) is greedy, so it takes as many characters as it can while still allowing group 2 (\d*) to match. Because \d* is satisfied by zero digits, group 1 matches the whole string and group 2 can only match the empty string.

Expression 2: adding a ? after the quantifier makes it non-greedy, as in (\w+?) above. Because group 1 is now non-greedy, it matches as few characters as possible while group 2 still matches, so group 2 (\d*) takes all the digits (123) and group 1 matches abc.
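
For crawling, the same distinction often shows up when pulling text out of markup. The sketch below uses a made-up HTML fragment. It illustrates greediness only and is not a recommended way to parse HTML:

```python
import re

html = "<b>first</b><b>second</b>"  # hypothetical fragment

# Greedy: .* runs to the last </b>, swallowing the tags in between.
greedy = re.findall(r"<b>(.*)</b>", html)

# Non-greedy: .*? stops at the nearest </b>.
lazy = re.findall(r"<b>(.*?)</b>", html)

print(greedy)
print(lazy)
```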

Common matching patterns

Regular expressions often need to match strings of variable length, so they need a way to express repetition. Python's regular expressions offer rich and flexible repetition operators. The general form of a repetition rule is a character rule followed by a count rule, which says how many times the preceding rule must be repeated.

Examples of matching rules

1. Ordinary characters: most characters and letters match themselves

>>> re.findall("alexsel", "gtuanalesxalexselericapp")
['alexsel']
>>> re.findall("alexsel", "gtuanalesxalexswxericapp")
[]
>>> re.findall("alexsel", "gtuanalesxalexselwupeiqialexsel")
['alexsel', 'alexsel']

2. Metacharacters: . ^ $ * + ? {} [] | () \

.: matches any character except the newline character

>>> re.findall("alexsel.w", "aaaalexselaw")
['alexselaw']
# A dot matches exactly one character

^: matches only if the pattern that follows it is at the beginning of the string

>>> re.findall("^alexsel", "gtuanalesxalexselgeappalexsel")
[]
>>> re.findall("^alexsel", "alexselgtuanalesxalexselwgtappqialexsel")
['alexsel']
# "^" anchors the beginning, so it is written at the start of the pattern

$: matches only if the pattern before it is at the end of the string

>>> re.findall("alexsel$", "alexselseguanalesxalexselganapp")
[]
>>> re.findall("alexsel$", "alexselgtaanalesxalexsssiqialexsel")
['alexsel']

*: applies to the character before it and matches 0 or more occurrences of that character

>>> re.findall("alexsel*", "aaaalexse")
['alexse']
>>> re.findall("alexsel*", "aaaalexsel")
['alexsel']
>>> re.findall("alexsel*", "aaaalexsellllll")
['alexsellllll']

+: matches the preceding character 1 or more times

>>> re.findall("alexsel+", "aaaalexselll")
['alexselll']
>>> re.findall("alexsel+", "aaaalexsel")
['alexsel']
>>> re.findall("alexsel+", "aaaalexse")
[]

?: matches the preceding character 0 or 1 times; any further repeats of it are left out of the match

>>> re.findall("alexsel?", "aaaalexse")
['alexse']
>>> re.findall("alexsel?", "aaaalexsel")
['alexsel']
>>> re.findall("alexsel?", "aaaalexsellll")
['alexsel']

{}: controls how many times the preceding character is matched. A closed interval {m,n} can be given, in which case any count from m to n matches.

>>> re.findall("alexsel{3}", "aaaalexselllll")
['alexselll']
>>> re.findall("alexsel{3}", "aaaalexsell")
[]
>>> re.findall("alexsel{3}", "aaaalexse")
[]
>>> re.findall("alexsel{3}", "aaaalexselll")
['alexselll']
>>> re.findall("alexsel{3,5}", "aaaalexsellllllll")
['alexselllll']
>>> re.findall("alexsel{3,5}", "aaaalexselll")
['alexselll']
>>> re.findall("alexsel{3,5}", "aaaalexsell")
[]
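
Counted repetition is handy for validating fixed-format fields. The sketch below assumes a hypothetical phone-number format (3 or 4 digits, a hyphen, then 7 or 8 digits) and uses re.fullmatch so that the whole string must fit the pattern:

```python
import re

# Hypothetical format: area code of 3-4 digits, hyphen, number of 7-8 digits.
phone_pattern = re.compile(r"\d{3,4}-\d{7,8}")

candidates = ["010-12345678", "0755-1234567", "01-123456"]  # sample inputs

# fullmatch succeeds only when the entire string matches the pattern.
valid = [c for c in candidates if phone_pattern.fullmatch(c)]

print(valid)
```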

\: the backslash

Followed by a metacharacter, it removes that character's special function.

Followed by an ordinary character, it gives that character a special function.

Followed by a number, it refers to the string matched by the group with that number (each pair of parentheses is one group).

Add r at the beginning of the pattern so that Python itself does not process the escapes.

# \2 refers to the second group (eric)
>>> re.search(r"(alexsel)(eric)com\2", "alexselericcomeric").group()
'alexselericcomeric'
# \1 refers to the first group (alexsel)
>>> re.search(r"(alexsel)(eric)com\1", "alexselericcomalexsel").group()
'alexselericcomalexsel'
>>> re.search(r"(alexsel)(eric)com\1\2", "alexselericcomalexseleric").group()
'alexselericcomalexseleric'
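
A common use of numbered backreferences is finding accidentally doubled words. This is a minimal sketch on a made-up sentence:

```python
import re

text = "this is is a test test sentence"  # hypothetical sample text

# (\w+) captures a word; \1 requires the same word again after one space.
doubled = re.findall(r"\b(\w+) \1\b", text)

# In the replacement string, \1 inserts the captured word once.
cleaned = re.sub(r"\b(\w+) \1\b", r"\1", text)

print(doubled)
print(cleaned)
```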

\d: matches any decimal digit; it is equivalent to the class [0-9]

>>> re.findall(r"\d", "aaazz1111344444c")
['1', '1', '1', '1', '3', '4', '4', '4', '4', '4']
>>> re.findall(r"\d\d", "aaazz1111344444c")
['11', '11', '34', '44', '44']
>>> re.findall(r"\d0", "aaazz1111344444c")
[]
>>> re.findall(r"\d3", "aaazz1111344444c")
['13']
>>> re.findall(r"\d4", "aaazz1111344444c")
['34', '44', '44']

\D: matches any non-digit character; it is equivalent to the class [^0-9]

>>> re.findall(r"\D", "aaazz1111344444c")
['a', 'a', 'a', 'z', 'z', 'c']
>>> re.findall(r"\D\D", "aaazz1111344444c")
['aa', 'az']
>>> re.findall(r"\D\d\D", "aaazz1111344444c")
[]
>>> re.findall(r"\D\d\D", "aaazz1z111344444c")
['z1z']

\s: matches any whitespace character; it is equivalent to the class [ \t\n\r\f\v]

>>> re.findall(r"\s", "aazz1 z11.34c")
[' ']

\S: matches any non-whitespace character; it is equivalent to the class [^ \t\n\r\f\v]

\w: matches any alphanumeric character or underscore; for ASCII text it is equivalent to the class [a-zA-Z0-9_]

>>> re.findall(r"\w", "aazz1z11..34c")
['a', 'a', 'z', 'z', '1', 'z', '1', '1', '3', '4', 'c']

\W: matches any character that is not alphanumeric or an underscore; for ASCII text it is equivalent to the class [^a-zA-Z0-9_]
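
The predefined classes are usually combined with a quantifier. A small sketch on a made-up key=value string:

```python
import re

text = "id=42, name=alex_1"  # hypothetical sample input

digits = re.findall(r"\d+", text)     # runs of digits
words = re.findall(r"\w+", text)      # runs of letters, digits and underscores
spaces = re.findall(r"\s", text)      # single whitespace characters
non_words = re.findall(r"\W", text)   # everything that \w does not match

print(digits)
print(words)
print(spaces)
print(non_words)
```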

\b: matches a word boundary, that is, the position between a word character and a non-word character (such as a space) or the start or end of the string

>>> re.findall(r"\babc\b", "abc sdsadasabcasdsadasdabcasdsa")
['abc']
>>> re.findall(r"\balexsel\b", "abc alexsel abcasdsadasdabcasdsa")
['alexsel']
>>> re.findall("\\balexsel\\b", "abc alexsel abcasdsadasdabcasdsa")
['alexsel']
>>> re.findall("\balexsel\b", "abc alexsel abcasdsadasdabcasdsa")
[]
# Without r or a doubled backslash, Python turns "\b" into the backspace character, so nothing matches

(): treats the characters in parentheses as a whole

>>> re.search(r"a(\d+)", "a222bz1144c").group()
'a222'
>>> re.findall("(ab)*", "aabz1144c")
['', 'ab', '', '', '', '', '', '', '']
# The string in parentheses is matched as a whole, zero or more times, at each position in turn.
# At the first character, a matches but the next character is a rather than b, so the group matches zero times and yields ''.
# At the second character, a and then b both match, so the result is 'ab'.
# At every later position the group fails on its first character, so each position yields an empty match.

>>> re.search(r"a(\d+)", "a222bz1144c").group()
'a222'
>>> re.search(r"a(\d+?)", "a222bz1144c").group()  # the minimum for + is 1
'a2'
>>> re.search(r"a(\d*?)", "a222bz1144c").group()  # the minimum for * is 0
'a'
# Adding ? switches on non-greedy matching.
# However, if the group is followed by another character that must also match, the non-greedy quantifier has to keep expanding until that character matches
# (with matching conditions both before and after, the non-greedy mode has no visible effect).
>>> re.findall(r"a(\d+?)b", "aa2017666bz1144c")
['2017666']
>>> re.search(r"a(\d*?)b", "a222bz1144c").group()
'a222b'
>>> re.search(r"a(\d*?)b", "a277722bz1144c").group()
'a277722b'

[]: inside a character set, metacharacters lose their special meaning and stand for themselves (with a few exceptions)

>>> re.findall("a[.]d", "aaaacd")
[]
>>> re.findall("a[.]d", "aaaa.d")
['a.d']

Exceptions

[-] [^] [\]

[-] matches a single character from a range, for example any character from a to z

>>> re.findall("[a-z]", "aaaa.d")
['a', 'a', 'a', 'a', 'd']
>>> re.findall("[a-z]", "aaazzzzzaaccc")
['a', 'a', 'a', 'z', 'z', 'z', 'z', 'z', 'a', 'a', 'c', 'c', 'c']
>>> re.findall("[1-3]", "aaazz1111344444c")
['1', '1', '1', '1', '3']

[^] matches any character except those in the set (here ^ means "not")

>>> re.findall("[^1-3]", "aaazz1111344444c")
['a', 'a', 'a', 'z', 'z', '4', '4', '4', '4', '4', 'c']
>>> re.findall("[^1-4]", "aaazz1111344444c")
['a', 'a', 'a', 'z', 'z', 'c']

[\]: a backslash keeps its meaning inside a set, so classes such as \d still work

>>> re.findall(r"[\d]", "aazz1144c")
['1', '1', '4', '4']

The first metacharacters we examined are "[" and "]". They are used to specify a character class, which is the set of characters you want to match. Characters can be listed individually, or a range can be given as two characters separated by "-". For example, [abc] matches any one of the characters "a", "b", or "c"; the range [a-c] expresses the same set. To match only lowercase letters, write the RE as [a-z].

Metacharacters do not act as metacharacters inside a class. For example, [akm$] matches any of the characters "a", "k", "m", or "$". "$" is normally a metacharacter, but inside a character class it loses that role and is treated as an ordinary character.
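
The behaviour described above can be checked with a short sketch. The sample strings are assumptions picked to include both ordinary characters and metacharacters:

```python
import re

text = "a-k$m^z"  # hypothetical sample input

# Inside a class, $ is an ordinary character.
literal_dollar = re.findall(r"[akm$]", text)

# A leading ^ negates the class: everything that is not a lowercase letter.
not_lower = re.findall(r"[^a-z]", text)

# [abc] and [a-c] describe the same set.
listed = re.findall(r"[abc]", "abcdef")
ranged = re.findall(r"[a-c]", "abcdef")

print(literal_dollar)
print(not_lower)
print(listed == ranged)
```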

Word boundary

>>> re.findall(r"\babc", "abcsd abc")
['abc', 'abc']
>>> re.findall(r"abc\b", "abcsd abc")
['abc']
>>> re.findall(r"abc\b", "abcsd abc*")
['abc']
>>> re.findall(r"\babc", "*abcsd*abc")
['abc', 'abc']

# A word boundary is not necessarily a space; it can also be any special character other than a letter, digit, or underscore
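
A minimal sketch of using \b to count whole-word occurrences rather than substrings. The sample text is made up:

```python
import re

text = "cat concat cat* category (cat)"  # hypothetical sample text

# \bcat\b matches "cat" only when it is not glued to other word characters.
whole_words = re.findall(r"\bcat\b", text)

# Without boundaries, "cat" inside "concat" and "category" also matches.
substrings = re.findall(r"cat", text)

print(len(whole_words), len(substrings))
```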
