Tutorial on the use of Python regular expressions

2025-07-01 Update From: SLTechnology News&Howtos shulou

Shulou(Shulou.com)06/03 Report--

This article introduces the use of Python regular expressions. Many people have questions about them in day-to-day work, so the editor consulted a range of materials and collected simple, easy-to-use methods. I hope it helps answer those questions. Please follow along!

Introduction:

Regular expressions are used to identify whether a pattern exists in a given sequence of characters (a string). They help in manipulating text data, which is often a prerequisite for data science projects that involve text mining. You have probably come across some applications of regular expressions: they are used on the server side to validate the format of email addresses or passwords during registration, to parse text data files to find, replace, or delete certain strings, and so on.

Content:

Regular expressions are very powerful, and in this tutorial, you will learn to use them in Python. You will cover the following topics:

Regular expressions in Python

Basic patterns: ordinary characters

Wildcards: special characters

Repetitions

Grouping using regular expressions

Greedy vs non-greedy matching

The re Python library: search() and match()

Regular expressions in Python

Importing the re module

In Python, the re module supports regular expressions. Use the following command to import this module:

import re

Basic patterns: ordinary characters

You can handle many basic patterns in Python using ordinary characters. Ordinary characters are the simplest regular expressions. They match themselves exactly and have no special meaning in regular expression syntax.

Examples are "A", "a", "X", "5".

Ordinary characters can be used to perform simple exact matches:

>>> import re
>>> pattern = r"Cookie"
>>> sequence = "Cookie"
>>> if re.match(pattern, sequence):
...     print("Match!")
... else:
...     print("Not a match!")
...
Match!

The match() function returns a match object if the text matches the pattern. Otherwise, it returns None.

But for now, let's focus on ordinary characters! Did you notice the r at the start of the pattern Cookie?

This is called a raw string literal. It changes how the string literal is interpreted. Such literals are stored as they appear.

For example, \ is just a backslash when the string is prefixed with r, rather than being interpreted as the start of an escape sequence. You will see what this means with special characters. Sometimes the syntax involves backslash-escaped characters, and to prevent these characters from being interpreted as escape sequences, you use the raw r prefix. You don't actually need it in this example, but it is good practice to use it for consistency.
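As a minimal sketch of why the r prefix matters (the sample text is made up for illustration), compare how a normal literal and a raw literal treat a backslash:

```python
import re

# A normal string literal turns "\n" into one newline character,
# while the raw literal keeps the backslash and the letter n.
normal = "\n"
raw = r"\n"
print(len(normal), len(raw))  # 1 2

# "\b" in a normal literal is a backspace character, so the word-boundary
# pattern only works as intended when it is written as a raw string.
text = "a cat sat"
print(re.search("\bcat", text))           # None
print(re.search(r"\bcat", text).group())  # cat
```

The second pair shows the practical risk: without the r prefix the pattern silently looks for a backspace character instead of a word boundary.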

Wildcards: special characters

Special characters are characters that do not match themselves literally but have a special meaning when used in a regular expression.

The most widely used special characters are:

. - Matches any single character except the newline character.

>>> re.search(r'Co.k.e', 'Cookie').group()
'Cookie'

The group() function returns the string matched by the regular expression. You will see this function in more detail later.

\w - Lowercase w. Matches any single letter, digit, or underscore.

>>> re.search(r'Co\wk\we', 'Cookie').group()
'Cookie'

\W - Uppercase W. Matches any single character that is not matched by \w (lowercase w).

>>> re.search(r'C\Wke', 'C@ke').group()
'C@ke'

\s - Lowercase s. Matches a single whitespace character, such as a space, newline, tab, or carriage return.

>>> re.search(r'Eat\scake', 'Eat cake').group()
'Eat cake'

\S - Uppercase S. Matches any single character that is not matched by \s (lowercase s).

>>> re.search(r'Cook\Se', 'Cookie').group()
'Cookie'

\t - Lowercase t. Matches a tab.

>>> re.search(r'Eat\tcake', 'Eat\tcake').group()
'Eat\tcake'

\n - Lowercase n. Matches a newline.

\r - Lowercase r. Matches a carriage return.

\d - Lowercase d. Matches a decimal digit 0-9.

>>> re.search(r'c\d\dkie', 'c00kie').group()
'c00kie'

^ - Caret. Matches a pattern at the start of the string.

>>> re.search(r'^Eat', 'Eat cake').group()
'Eat'

$ - Matches a pattern at the end of the string.

>>> re.search(r'cake$', 'Eat cake').group()
'cake'

[abc] - Matches a or b or c.

[a-zA-Z0-9] - Matches any single character in the ranges a to z, A to Z, or 0 to 9. Characters that are not within a range can be matched by complementing the set. If the first character of the set is ^, all characters that are not in the set will be matched.

>>> re.search(r'Number: [0-6]', 'Number: 5').group()
'Number: 5'
>>> # Matches any character except 5
>>> re.search(r'Number: [^5]', 'Number: 0').group()
'Number: 0'

\A - Uppercase A. Matches only at the start of the string. Unlike ^, this holds even in MULTILINE mode, where ^ also matches at the start of each line.

>>> re.search(r'\A[A-E]ookie', 'Cookie').group()
'Cookie'

\b - Lowercase b. Matches only at the beginning or end of a word.

>>> re.search(r'\b[A-E]ookie', 'Cookie').group()
'Cookie'

\ - Backslash. If the character following the backslash is a recognized escape character, the special meaning of that sequence is used. For example, \n is treated as a newline. If the character following the backslash is itself a special character, such as . or \, the backslash removes its special meaning and the character is matched literally.

Let's look at a few examples:

>>> # The doubled backslash matches a literal '\' in the string instead of starting the '\s' class
>>> re.search(r'Back\\stail', r'Back\stail').group()
'Back\\stail'
>>> # Here '\s' is treated as the whitespace class because no extra '\' precedes it
>>> re.search(r'Back\stail', 'Back tail').group()
'Back tail'
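The \n and \r entries in the list above have no example of their own, and the difference between ^ and \A only shows up on multi-line text. The following is a small sketch on a made-up two-line string; it assumes the re.MULTILINE flag, which is one of the standard flags of the re module:

```python
import re

text = "Eat cake\r\nBake cookies"

# \r and \n match the carriage return and the newline that end the first line.
print(repr(re.search(r'cake\r\n', text).group()))  # 'cake\r\n'

# By default ^ only matches at the very start of the string...
print(re.search(r'^Bake', text))  # None
# ...but with re.MULTILINE it also matches right after each newline.
print(re.search(r'^Bake', text, flags=re.MULTILINE).group())  # Bake
# \A still matches only at the start of the string, even with re.MULTILINE.
print(re.search(r'\ABake', text, flags=re.MULTILINE))  # None
```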

Repetitions

Looking for long patterns in a sequence quickly becomes tedious. Fortunately, the re module handles repetitions using the following special characters:

+ - Checks for one or more occurrences of the character to its left.

>>> re.search(r'Co+kie', 'Cooookie').group()
'Cooookie'

* - Checks for zero or more occurrences of the character to its left.

>>> # Checks for any number of occurrences of a, then of o, in the given sequence
>>> re.search(r'Ca*o*kie', 'Caokie').group()
'Caokie'

? - Checks for exactly zero or one occurrence of the character to its left.

>>> # Checks for zero or one occurrence of u in the given sequence
>>> re.search(r'Colou?r', 'Color').group()
'Color'

But what if you want to check the exact number of sequence repeats?

For example, you might check the validity of a phone number in an application. The re module handles this well too, using the following regular expressions:

{x} - Repeat exactly x times.

{x,} - Repeat at least x times.

{x,y} - Repeat at least x times but no more than y times.

>>> re.search(r'\d{9,10}', '0987654321').group()
'0987654321'
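To make the three counted forms concrete, here is a minimal sketch. The phone format (3 digits, a dash, then 3 or 4 digits) and the helper name looks_like_phone are hypothetical, chosen only for illustration and not taken from any real numbering plan:

```python
import re

# {2}: exactly two "o" characters; {2,}: two or more (as many as possible).
print(re.search(r'o{2}', 'Cooookie').group())   # oo
print(re.search(r'o{2,}', 'Cooookie').group())  # oooo

# Hypothetical format for illustration only: 3 digits, a dash, then 3 or 4 digits.
phone_pattern = r'^\d{3}-\d{3,4}$'

def looks_like_phone(candidate):
    # ^ and $ anchor the pattern so the whole string has to fit the format.
    return re.search(phone_pattern, candidate) is not None

for candidate in ['555-0199', '555-019', '55-0199', '555-01999']:
    print(candidate, looks_like_phone(candidate))
```

The first two candidates fit the format and the last two do not, because they have too few digits before the dash or too many after it.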

The + and * qualifiers are said to be greedy: they match as much text as possible.

Grouping using regular expressions

Suppose that, when validating email addresses, you want to check the user name and the host separately.

This is when the group feature of regular expressions comes in handy. It allows you to pick out parts of the matching text.

Parts of a regular expression pattern enclosed in parentheses () are called groups. The parentheses do not change what the expression matches, but form groups within the matched sequence. You have been using the group() function throughout the examples in this tutorial. As usual, match.group() with no argument is still the whole matched text.

>>> email_address = 'Please contact us at: support@datacamp.com'
>>> match = re.search(r'([\w\.-]+)@([\w\.-]+)', email_address)
>>> if match:
...     print(match.group())   # The whole matched text
...     print(match.group(1))  # The username (group 1)
...     print(match.group(2))  # The host (group 2)
...
support@datacamp.com
support
datacamp.com
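As a short complement to the example above, the match object's groups() method returns all captured groups at once. The variable names username and host are just illustrative:

```python
import re

email_address = 'Please contact us at: support@datacamp.com'
match = re.search(r'([\w\.-]+)@([\w\.-]+)', email_address)

# group(0) is the same as group(): the whole matched text.
print(match.group(0))  # support@datacamp.com
# groups() returns all captured groups as a tuple.
print(match.groups())  # ('support', 'datacamp.com')
# Tuple unpacking gives each part a name.
username, host = match.groups()
print(username, host)  # support datacamp.com
```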

Greedy vs non-greedy matching

When a special character matches as much of the search sequence (string) as possible, it is called a "greedy match". This is the normal behavior of regular expressions, but sometimes it is not what you want:

>>> heading = r'<h1>TITLE</h1>'
>>> re.match(r'<.*>', heading).group()
'<h1>TITLE</h1>'

The pattern <.*> matched the whole string, right up to the second occurrence of >.

However, if you only want to match the first <h1> tag, you can use the non-greedy qualifier *?, which matches as little text as possible.

Adding ? after the qualifier makes it perform the match in a non-greedy or minimal fashion; that is, it will match as few characters as possible. When you run <.*?>, you only get a match with <h1>.

>>> heading = r'<h1>TITLE</h1>'
>>> re.match(r'<.*?>', heading).group()
'<h1>'
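The same contrast can be sketched on a made-up HTML fragment with more than one tag, and the ? suffix works on + as well as on *:

```python
import re

html = '<b>bold</b> and <i>italic</i>'

# Greedy: .* runs to the last '>' in the string, so the match is the whole fragment.
print(re.search(r'<.*>', html).group())   # <b>bold</b> and <i>italic</i>
# Non-greedy: .*? stops at the first '>' it reaches.
print(re.search(r'<.*?>', html).group())  # <b>
# The same applies to +: +? takes as few characters as possible.
print(re.search(r'Co+', 'Cooookie').group())   # Coooo
print(re.search(r'Co+?', 'Cooookie').group())  # Co
```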

The re Python library

The re library in Python provides several functions that make it worth mastering. You have already seen some of them, such as re.search() and re.match(). Let's examine some useful functions in detail:

search(pattern, string, flags=0)

With this function, you scan through the given string/sequence to find the first location where the regular expression produces a match. If one is found, the corresponding match object is returned; if no position in the string matches the pattern, None is returned. Note that returning None is different from finding a zero-length match at some point in the string.

>>> pattern = "cookie"
>>> sequence = "Cake and cookie"
>>> re.search(pattern, sequence).group()
'cookie'
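The difference between None and a zero-length match can be seen in a small sketch (the strings are made up for illustration):

```python
import re

# x* may match zero characters, so search() succeeds with an empty match
# at position 0 even though the string contains no "x".
empty = re.search(r'x*', 'Cake')
print(repr(empty.group()), empty.span())  # '' (0, 0)

# x+ needs at least one "x", so there is no match at all and search() returns None.
print(re.search(r'x+', 'Cake'))  # None

# Calling .group() on None raises AttributeError, so test the result first.
result = re.search(r'cookie', 'Cake and pie')
print(result.group() if result else 'no match')  # no match
```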

match(pattern, string, flags=0)

If zero or more characters at the beginning of the string match the pattern, the corresponding match object is returned. Otherwise, None is returned.

>>> pattern = "C"
>>> sequence1 = "IceCream"
>>> # No match since "C" is not at the start of "IceCream"
>>> re.match(pattern, sequence1)
>>> sequence2 = "Cake"
>>> re.match(pattern, sequence2).group()
'C'

search() versus match()

The match() function checks for a match only at the beginning of the string (by default), whereas the search() function checks for a match anywhere in the string.
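A minimal side-by-side sketch of the two functions on the same string:

```python
import re

sequence = "Cake and cookie"

# match() only looks at the start of the string.
print(re.match(r'cookie', sequence))           # None
# search() scans the whole string.
print(re.search(r'cookie', sequence).group())  # cookie
# Anchoring the pattern with ^ makes search() behave like match() here.
print(re.search(r'^cookie', sequence))         # None
print(re.match(r'Cake', sequence).span())      # (0, 4)
```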

findall(pattern, string, flags=0)

Finds all possible matches in the entire sequence and returns them as a list of strings. Each returned string represents one match.

>>> email_address = "Please contact us at: support@datacamp.com, xyz@datacamp.com"
>>> # 'addresses' is a list that stores all the possible matches
>>> addresses = re.findall(r'[\w\.-]+@[\w\.-]+', email_address)
>>> for address in addresses:
...     print(address)
...
support@datacamp.com
xyz@datacamp.com
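One detail worth knowing when combining findall() with groups: if the pattern contains more than one group, each list item is a tuple of the groups rather than the whole match. A short sketch using the same sample text:

```python
import re

email_address = "Please contact us at: support@datacamp.com, xyz@datacamp.com"

# Without groups, each item is the whole match.
whole = re.findall(r'[\w\.-]+@[\w\.-]+', email_address)
print(whole)  # ['support@datacamp.com', 'xyz@datacamp.com']

# With two groups, each item is a (username, host) tuple.
parts = re.findall(r'([\w\.-]+)@([\w\.-]+)', email_address)
print(parts)  # [('support', 'datacamp.com'), ('xyz', 'datacamp.com')]
```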

sub(pattern, repl, string, count=0, flags=0)

This is the substitute function. It returns the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string with the replacement repl. If the pattern is not found, the string is returned unchanged.

>>> email_address = "Please contact us at: xyz@datacamp.com"
>>> new_email_address = re.sub(r'([\w\.-]+)@([\w\.-]+)', r'support@datacamp.com', email_address)
>>> print(new_email_address)
Please contact us at: support@datacamp.com
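The count parameter in the signature and group references in repl are not exercised by the example above. The following sketch uses made-up addresses to show both:

```python
import re

text = "Write to abc@datacamp.com or xyz@datacamp.com"
pattern = r'([\w\.-]+)@([\w\.-]+)'

# count=0 (the default) replaces every match.
replaced_all = re.sub(pattern, 'support@datacamp.com', text)
print(replaced_all)    # Write to support@datacamp.com or support@datacamp.com

# count=1 replaces only the leftmost match.
replaced_first = re.sub(pattern, 'support@datacamp.com', text, count=1)
print(replaced_first)  # Write to support@datacamp.com or xyz@datacamp.com

# \1 and \2 in the replacement refer to group 1 (username) and group 2 (host).
spelled_out = re.sub(pattern, r'\1 at \2', text)
print(spelled_out)     # Write to abc at datacamp.com or xyz at datacamp.com
```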

compile(pattern, flags=0)

Compiles a regular expression pattern into a regular expression object. When you need to use an expression several times in a single program, using compile() and saving the resulting regular expression object for reuse is more efficient. The re module also caches the compiled versions of the most recent patterns passed to compile() and to the module-level matching functions.

>>> pattern = re.compile(r"cookie")
>>> sequence = "Cake and cookie"
>>> pattern.search(sequence).group()
'cookie'
>>> # This is equivalent to:
>>> re.search(pattern, sequence).group()
'cookie'
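A sketch of the reuse case that compile() is meant for, with a made-up list of lines; the name email_pattern is illustrative:

```python
import re

# Compile once, then reuse the pattern object for several calls.
email_pattern = re.compile(r'[\w\.-]+@[\w\.-]+')

lines = [
    "Please contact us at: support@datacamp.com",
    "No address on this line",
    "Two here: abc@datacamp.com, xyz@datacamp.com",
]

# The compiled object exposes the same functions (search, findall, sub, ...) as methods.
found = [email_pattern.findall(line) for line in lines]
for addresses in found:
    print(addresses)
```

This prints one list per line: a single address, an empty list, and then two addresses.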

Tip: you can modify the behavior of an expression by specifying a flags value. You can pass flags as an extra argument to the various functions seen in this tutorial. Some of the flags used are: IGNORECASE, DOTALL, MULTILINE, VERBOSE, etc.
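A brief sketch of the four flags named in the tip, on a made-up two-line string. Each flag is a constant of the re module and is passed through the flags argument:

```python
import re

text = "Eat CAKE\nbake cookies"

# IGNORECASE: letters match regardless of case.
print(re.search(r'cake', text, flags=re.IGNORECASE).group())  # CAKE

# DOTALL: . also matches the newline character.
print(re.search(r'CAKE.bake', text))  # None
print(repr(re.search(r'CAKE.bake', text, flags=re.DOTALL).group()))  # 'CAKE\nbake'

# VERBOSE: whitespace and comments inside the pattern are ignored.
email_pattern = re.compile(r'''
    [\w\.-]+   # username
    @
    [\w\.-]+   # host
''', flags=re.VERBOSE)
print(email_pattern.search('mail support@datacamp.com now').group())  # support@datacamp.com

# MULTILINE: ^ matches at the start of every line. Flags are combined with |.
line_starts = re.findall(r'^\w+', text, flags=re.MULTILINE | re.IGNORECASE)
print(line_starts)  # ['Eat', 'bake']
```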

Case study: using regular expressions

Now that you have seen how regular expressions work in Python through some examples, it's time to get some practice! In this case study, you will apply what you have learned.

import re
import requests

the_idiot_url = 'https://www.gutenberg.org/files/2638/2638-0.txt'

def get_book(url):
    # Sends an HTTP request to get the text from Project Gutenberg
    raw = requests.get(url).text
    # Discards the metadata from the beginning of the book
    start = re.search(r"\*\*\* START OF THIS PROJECT GUTENBERG EBOOK .* \*\*\*", raw).end()
    # Discards the metadata from the end of the book
    stop = re.search(r"II", raw).start()
    # Keeps the relevant text
    text = raw[start:stop]
    return text

def preprocess(sentence):
    # Replaces runs of characters other than letters, digits and '.' with a space
    return re.sub('[^A-Za-z0-9.]+', ' ', sentence).lower()

book = get_book(the_idiot_url)
processed_book = preprocess(book)
print(processed_book)
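Because get_book() needs network access, it can help to try the preprocessing step offline first. The sketch below runs a preprocess() function on a made-up sample string; it assumes the replacement string is a single space, and the sample text is not from the book:

```python
import re

def preprocess(sentence):
    # Replace every run of characters other than letters, digits and "."
    # with one space, then lowercase the result.
    return re.sub('[^A-Za-z0-9.]+', ' ', sentence).lower()

sample = 'He said--"I know, i think."\nThe End!'
cleaned = preprocess(sample)
print(repr(cleaned))  # 'he said i know i think. the end '
```

Note the trailing space: the final "!" is also replaced, and the periods survive because "." is inside the allowed set.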

Find the number of occurrences of the word "the" in the corpus. Hint: use the len() function.

>>> len(re.findall(r'the', processed_book))
302
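Note that a bare pattern such as the also matches inside longer words. As a hedged sketch on a made-up sentence (not the book text), compare the substring count with a whole-word count that uses the \b word boundary:

```python
import re

sample = "the other day they gathered there by the theatre"

# Counts every occurrence of the letters t-h-e, including inside other words.
substring_count = len(re.findall(r'the', sample))
# Counts only the standalone word "the".
word_count = len(re.findall(r'\bthe\b', sample))
print(substring_count, word_count)  # 7 2
```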

Try to convert every standalone instance of "i" in the corpus to "I". Make sure you do not change an "i" that appears inside a word:

>>> processed_book = re.sub(r'\si\s', " I ", processed_book)
>>> print(processed_book)
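An alternative sketch for the same task uses the \b word boundary instead of \s. This is a swapped-in technique rather than the article's own approach: word boundaries consume no characters, so a standalone "i" at the start of the text or next to punctuation is also handled. The sample sentence is made up:

```python
import re

sample = "i think i said it. did i?"

# \bi\b matches "i" only when it stands alone as a word, and the word
# boundaries consume no characters, so spacing and punctuation are kept.
capitalized = re.sub(r'\bi\b', 'I', sample)
print(capitalized)  # I think I said it. did I?
```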

Find the number of times anyone was quoted ("") in the corpus.

>>> len(re.findall(r'\"', book))
96

Which words are connected by '--' in the corpus?

>>> re.findall(r'[a-zA-Z0-9]*--[a-zA-Z0-9]*', book)
['ironical--it', 'malicious--smile', 'fur--or', ...]

(The rest of the output is omitted.)

This concludes the tutorial on the use of Python regular expressions. I hope it has answered your questions. Combining theory with practice is the best way to learn, so go and try it!
