How to use regular expressions in Python3 projects

2025-04-06 Update From: SLTechnology News&Howtos (Shulou.com)

This article explains how to use regular expressions in Python 3 projects, working through the main methods of the re library with examples.

For example, suppose the text to be matched is as follows:

Hello, my phone number is 010-86432100 and email is cqc@cuiqingcai.com, and my website is http://cuiqingcai.com.

This string contains a phone number, an email address, and a URL. Let's try to extract them with regular expressions.

In a regular expression testing tool, if we choose to match email addresses, the email address in the text is extracted and shown below the input. If we choose to match URLs, the URL in the text is shown instead. Isn't it amazing?

What is happening here is regular expression matching: specific text is extracted according to certain rules. For example, an email address begins with a string, followed by an @ symbol and then a domain name, so it has a specific format. Likewise, a URL starts with the protocol type, followed by a colon and a double slash, and then a domain name plus a path.

For URL, we can match with the following regular expression:

[a-zA-Z]+://[^\s]*

If we match a string against this regular expression and the string contains URL-like text, that text is extracted.

This regular expression may look like a mess, but it follows specific syntax rules. For example, a-z matches any lowercase letter, \s matches any whitespace character, and * means the preceding element may be repeated any number of times. The whole expression is a combination of such rules that together implement a specific matching function.

Once we have written a regular expression, we can apply it to a long string. No matter what else is in the string, we can find whatever meets the rules we wrote. So for a web page, if we want to find the URLs in its source code, we can apply a regular expression that matches URLs and get every URL in the source.
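As a minimal sketch of the idea above, the following applies a URL rule and an email rule to the sample text with Python's re module. The URL rule assumes the form [a-zA-Z]+://[^\s]*; the email rule is a simplified, hypothetical pattern written for this illustration and does not cover every valid address.

```python
import re

text = ('Hello, my phone number is 010-86432100 and email is '
        'cqc@cuiqingcai.com, and my website is http://cuiqingcai.com.')

# URL rule: protocol letters, then "://", then any run of non-whitespace.
url_pattern = r'[a-zA-Z]+://[^\s]*'
# Simplified, hypothetical email rule (real addresses can be more complex).
email_pattern = r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+'

urls = re.findall(url_pattern, text)
emails = re.findall(email_pattern, text)
print(urls)
print(emails)
```

Note that the URL rule keeps the sentence's final period, because [^\s]* accepts any non-whitespace character. Rules like this usually need tightening for real data.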

We mentioned several matching rules above, so how many rules do regular expressions have? Here is a summary of the common ones:

Pattern    Description
\w         Matches a letter, digit, or underscore
\W         Matches any character that is not a letter, digit, or underscore
\s         Matches any whitespace character; for ASCII text, equivalent to [ \t\n\r\f\v]
\S         Matches any non-whitespace character
\d         Matches any digit; for ASCII text, equivalent to [0-9]
\D         Matches any non-digit
\A         Matches the start of the string
\Z         Matches the end of the string (in Python's re, only at the very end; in some other flavors it also matches before a final newline)
\z         Matches the end of the string in other regex flavors; Python's re has traditionally used \Z for this
\G         Matches the position where the last match finished (found in other regex flavors; Python's re does not support it)
\n         Matches a newline character
\t         Matches a tab
^          Matches the start of the string
$          Matches the end of the string
.          Matches any character except a newline; when the re.DOTALL flag is specified, matches any character including a newline
[...]      Matches any one character from a set: [amk] matches 'a', 'm', or 'k'
[^...]     Matches any one character not in the set: [^abc] matches any character other than a, b, or c
*          Matches 0 or more repetitions of the preceding expression
+          Matches 1 or more repetitions of the preceding expression
?          Matches 0 or 1 occurrence of the preceding expression; placed after another quantifier, it makes that quantifier non-greedy
{n}        Matches exactly n repetitions of the preceding expression
{n, m}     Matches n to m repetitions of the preceding expression, greedily
a|b        Matches a or b
( )        Matches the expression in parentheses and also marks a group

That may be a little dizzying. Don't worry: below we explain the usage of the common rules in detail and show how to use them to extract the information we want from a web page.
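To make a few of these rules concrete, here is a small sketch that tries them out with re.findall(). The sample strings are made up for illustration only.

```python
import re

sample = 'abc 123_x\tEND'

digits = re.findall(r'\d', sample)           # each single digit
words = re.findall(r'\w+', sample)           # runs of letters, digits, underscores
spaces = re.findall(r'\s', sample)           # whitespace characters
in_set = re.findall(r'[amk]', 'make')        # characters from the set a, m, k
not_in_set = re.findall(r'[^abc]', 'abcd')   # characters outside the set a, b, c
pairs = re.findall(r'\d{2}', '2016-12')      # exactly two digits at a time
either = re.findall(r'cat|dog', 'cat dog bird')  # one alternative or the other

for name, value in [('digits', digits), ('words', words), ('spaces', spaces),
                    ('in_set', in_set), ('not_in_set', not_in_set),
                    ('pairs', pairs), ('either', either)]:
    print(name, value)
```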

Usage in Python

Regular expressions are not unique to Python; they can be used in other programming languages as well. Python's re library provides a complete implementation of regular expressions, and it is what is almost always used to work with regular expressions in Python.

Let's take a look at its usage.

match()

The first commonly used matching method is match(). We pass it a regular expression and the string to match, and it checks whether the regular expression matches the string.

The match() method tries to match the regular expression from the beginning of the string. If it matches, it returns the match result; if not, it returns None.

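Let's look at an example:

import re

content = 'Hello 123 4567 World_This is a Regex Demo'
print(len(content))
# ^ anchors at the start; \s is whitespace, \d a digit, \w a letter, digit, or underscore
result = re.match(r'^Hello\s\d\d\d\s\d{4}\s\w{10}', content)
print(result)          # the match object
print(result.group())  # the matched text
print(result.span())   # (start, end) of the match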

Running result:

41
<_sre.SRE_Match object; span=(0, 25), match='Hello 123 4567 World_This'>
Hello 123 4567 World_This
(0, 25)

Here we first declare a string that contains English letters, whitespace characters, digits, and so on, and then we write the regular expression ^Hello\s\d\d\d\s\d{4}\s\w{10} to match it.

The leading ^ matches the start of the string, so the match must begin with Hello. Then \s matches a whitespace character, here the space in the target string. \d matches a digit, so three \d match 123, and another \s matches the following space. Next comes 4567. We could write four \d again, but that is tedious, so we write \d{4}, where {4} repeats the preceding \d four times. After one more \s, \w{10} matches 10 letters, digits, or underscores, and the regular expression ends there. Notice that the pattern does not cover the whole target string; the match still succeeds, but the matched result is shorter than the string.

We call the match() method with the regular expression as the first argument and the string to match as the second.

Printing the result shows an SRE_Match object (newer Python versions display it as re.Match), which proves the match succeeded. It has two methods: group() returns the matched text, here Hello 123 4567 World_This, which is exactly what our regular expression describes; span() returns the range of the match, here (0, 25), which is the position of the matched text within the original string.

From the above example, we can basically understand how to use regular expressions to match a paragraph of text in Python.
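Because match() returns None on failure, it is worth checking the result before calling group(). The sketch below shows one way to do that; the describe() helper is a hypothetical name introduced only for this illustration.

```python
import re

content = 'Hello 123 4567 World_This is a Regex Demo'


def describe(pattern, text):
    """Return the matched text, or None when match() fails at the start."""
    result = re.match(pattern, text)
    return result.group() if result else None


at_start = describe(r'Hello\s\d{3}', content)  # pattern fits the beginning
not_at_start = describe(r'World', content)     # 'World' is not at the beginning
print(at_start)
print(not_at_start)
```

The second call returns None even though World appears in the string, because match() only tries the beginning.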

Matching target

We just used match() to get the full matched string, but what if we want to extract only part of it? For example, extracting an email address or phone number from a piece of text, as in the earlier example.

Here we can enclose the substring we want to extract in parentheses (). The parentheses mark the start and end of a subexpression, and each marked subexpression corresponds to a group, numbered in order. We can pass a group's index to the group() method to get the extracted result.

Let's look at an example:

import re

content = 'Hello 1234567 World_This is a Regex Demo'
# The parentheses capture the digits as group 1
result = re.match(r'^Hello\s(\d+)\sWorld', content)
print(result)
print(result.group())   # the whole match
print(result.group(1))  # the first parenthesized group
print(result.span())

Using a similar string, here we want to extract 1234567 from it. We enclose the regular expression for the digits in (), then call group(1) to get the matched result.

The running results are as follows:

<_sre.SRE_Match object; span=(0, 19), match='Hello 1234567 World'>
Hello 1234567 World
1234567
(0, 19)

You can see that we successfully got 1234567 in the result. Unlike group(), which returns the complete match, group(1) returns the first result enclosed in (). If the regular expression contains further () groups, we can get them in turn with group(2), group(3), and so on.
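A short sketch with more than one group may help here. It reuses the earlier sample string and captures three parts; groups() returns all captured groups at once as a tuple.

```python
import re

content = 'Hello 123 4567 World_This is a Regex Demo'

# Three groups: a 3-digit number, a 4-digit number, and the following word.
result = re.match(r'^Hello\s(\d{3})\s(\d{4})\s(\w+)', content)

whole = result.group()     # the complete match
first = result.group(1)    # first parenthesized group
second = result.group(2)   # second parenthesized group
everything = result.groups()  # all groups as a tuple
print(whole)
print(first, second)
print(everything)
```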

Universal matching

The regular expression we just wrote is rather complicated: wherever there is whitespace we write \s, and wherever there are digits we write \d, which is a lot of work. In fact there is no need for this. There is a universal match, .* (dot star). The . matches any character (except the newline character), and * means the preceding element can be repeated any number of times, so together they match any run of characters. With it, we do not have to match character by character.

So following the example above, we can rewrite the regular expression:

import re

content = 'Hello 123 4567 World_This is a Regex Demo'
# .* stands in for everything between Hello and Demo
result = re.match(r'^Hello.*Demo$', content)
print(result)
print(result.group())
print(result.span())

Here we omit the middle part entirely and replace it with .*, and finally add the ending string. The running result is as follows:

<_sre.SRE_Match object; span=(0, 41), match='Hello 123 4567 World_This is a Regex Demo'>
Hello 123 4567 World_This is a Regex Demo
(0, 41)

You can see that group() outputs the entire string, that is, our regular expression matches the whole target string, and span() outputs (0, 41), which covers the full length of the string.

Therefore, we can use .* to simplify regular expressions.

Greedy matching and non-greedy matching

When using the universal match .*, sometimes what we match is not the desired result. Let's take a look at the following example:

import re

content = 'Hello 1234567 World_This is a Regex Demo'
result = re.match(r'^He.*(\d+).*Demo$', content)
print(result)
print(result.group(1))

Here we still want to get the digits in the middle, so we still write (\d+) there. Because the content on both sides of the digits is messy, we omit it and write .* on both sides, which gives ^He.*(\d+).*Demo$. There seems to be no problem with it. Let's take a look at the result:

<_sre.SRE_Match object; span=(0, 40), match='Hello 1234567 World_This is a Regex Demo'>
7

Something strange happened. We only got the number 7. What's going on?

This comes down to greedy versus non-greedy matching. In greedy matching, .* matches as many characters as possible. In our regular expression .* is followed by \d+, which requires at least one digit but does not specify how many, so .* takes as many characters as it can, including 123456, and leaves only the single digit 7 to satisfy \d+. So all \d+ gets is the digit 7.

This is clearly inconvenient, since part of the result we want goes missing. Here we just need non-greedy matching, which is written .*? with one extra ?. What effect does it have? Let's look at another example:

import re

content = 'Hello 1234567 World_This is a Regex Demo'
result = re.match(r'^He.*?(\d+).*Demo$', content)
print(result)
print(result.group(1))

Here we only changed the first .* to .*?, making it non-greedy. The results are as follows:

<_sre.SRE_Match object; span=(0, 40), match='Hello 1234567 World_This is a Regex Demo'>
1234567

Now we successfully get 1234567. Greedy matching matches as many characters as possible, and non-greedy matching matches as few as possible. After .*? comes \d+, which matches digits. Once .*? has matched the whitespace after Hello, the next character is a digit, which \d+ can match, so .*? stops there and leaves the following digits to \d+. Because .*? matches as few characters as possible, the result of \d+ is 1234567.

So when matching, we should prefer non-greedy matching in the middle of a string, that is, use .*? instead of .*, to avoid losing part of the result.

Note, however, that if the part we want is at the end of the string, .*? may match nothing at all, because it matches as few characters as possible. For example:

import re

content = 'http://weibo.com/comment/kEraCN'
result1 = re.match(r'http.*?comment/(.*?)', content)  # non-greedy group at the end
result2 = re.match(r'http.*?comment/(.*)', content)   # greedy group at the end
print('result1', result1.group(1))
print('result2', result2.group(1))

Running result:

result1
result2 kEraCN

Observe that .*? matched nothing, while .* matched as much as possible and successfully captured the result.

Understanding how greedy and non-greedy matching work is very helpful when writing regular expressions later.
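The difference also shows up clearly when a delimiter occurs more than once. The following sketch uses a made-up snippet with two pairs of tags and compares the two styles.

```python
import re

# Made-up sample with two <b>...</b> pairs.
snippet = '<b>one</b><b>two</b>'

# Greedy: .* runs to the last </b> it can still match.
greedy = re.match(r'<b>(.*)</b>', snippet).group(1)
# Non-greedy: .*? stops at the first </b>.
lazy = re.match(r'<b>(.*?)</b>', snippet).group(1)

print(greedy)
print(lazy)
```

The greedy version swallows the inner closing tag, which is usually not what is wanted when extracting one element at a time.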

Modifiers

Regular expressions accept optional flag modifiers that control how matching is performed. A modifier is passed as an optional flag argument.

Let's look at an example first:

import re

content = '''Hello 1234567 World_This
is a Regex Demo
'''
result = re.match(r'^He.*?(\d+).*?Demo$', content)
print(result.group(1))

This is similar to the example above, except that we add a newline character to the string. The regular expression is the same and is meant to extract the digits. Take a look at the result of the run:

AttributeError                            Traceback (most recent call last)
      5 '''
      6 result = re.match(r'^He.*?(\d+).*?Demo$', content)
----> 7 print(result.group(1))

AttributeError: 'NoneType' object has no attribute 'group'

The run raises an error: the regular expression does not match the string, so the result is None, and calling group() on None raises AttributeError.

So why does the match fail once we add a newline character? Because . matches any character except the newline, so when .*? reaches the newline it cannot continue, and the match fails.

To fix this, we only need to add the modifier re.S:

result = re.match(r'^He.*?(\d+).*?Demo$', content, re.S)

Passing re.S as the third argument of match() makes . match all characters, including newline characters.

Running result:

1234567

re.S is often used when matching web pages, because HTML nodes are often separated by line breaks, and with it we can match across the line breaks between nodes.

There are also some modifiers that can be used if necessary:

Modifier    Description
re.I        Makes matching case-insensitive
re.L        Makes matching locale-aware
re.M        Multiline matching; affects ^ and $
re.S        Makes . match all characters, including newlines
re.U        Interprets characters according to the Unicode character set; affects \w, \W, \b, \B
re.X        Allows a more flexible (verbose) format, so regular expressions are easier to read

re.S and re.I are commonly used in web page matching.
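The sketch below tries several of these modifiers on a small two-line string made up for illustration. Flags can be combined with the | operator.

```python
import re

text = 'first line\nSecond LINE'

# re.I: ignore case, so both "line" and "LINE" are found.
case_hits = re.findall(r'line', text, re.I)
# re.M: ^ matches at the start of every line, not just the string.
line_starts = re.findall(r'^\w+', text, re.M)
# Without re.S the dot stops at the newline; with re.S it crosses it.
without_s = re.match(r'first.*LINE', text)
with_s = re.match(r'first.*LINE', text, re.S)
# Flags combine with |.
combined = re.match(r'FIRST.*line$', text, re.I | re.S)

# re.X: whitespace and # comments inside the pattern are ignored.
year_month = re.compile(r'''
    (\d{4})  # year
    -        # literal hyphen
    (\d{2})  # month
''', re.X)
parts = year_month.match('2016-12').groups()

print(case_hits, line_starts)
print(without_s, with_s.group() == text, combined is not None)
print(parts)
```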

Escape matching

We know that regular expressions define many special matching patterns; for example, . matches any character except the newline. But what if the target string itself contains a literal . character? How do we match it?

Here we need escape matching. Let's look at an example:

import re

content = '(Baidu)www.baidu.com'
# Backslashes escape the special characters ( ) and .
result = re.match(r'\(Baidu\)www\.baidu\.com', content)
print(result)

When we encounter characters that are special in regular expressions, we escape them with a backslash to match them literally. For example, . can be matched with \. The running result:

<_sre.SRE_Match object; span=(0, 20), match='(Baidu)www.baidu.com'>

You can see that the original string is successfully matched.

These are the points most commonly needed when writing regular expressions, and mastering them will be very helpful when we write regular expressions later.
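When the literal text is long or comes from a variable, escaping by hand is error-prone. As a small sketch, the standard library's re.escape() does the escaping for us; the example also shows why escaping matters, since an unescaped dot matches any character. The second test string is made up for illustration.

```python
import re

literal = '(Baidu)www.baidu.com'

# re.escape() backslash-escapes the characters that are special in a pattern.
pattern = re.escape(literal)
escaped_match = re.match(pattern, literal)

# An unescaped dot matches any character, so this unintended string matches too.
loose_match = re.match(r'www.baidu.com', 'wwwxbaiduxcom')
# With escaped dots only a literal "." is accepted.
strict_match = re.match(r'www\.baidu\.com', 'wwwxbaiduxcom')

print(escaped_match.group())
print(loose_match is not None, strict_match is None)
```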

search()

We mentioned earlier that match() starts matching at the beginning of the string; if the beginning does not match, the whole match fails.

Let's look at the following example:

import re

content = 'Extra stings Hello 1234567 World_This is a Regex Demo Extra stings'
result = re.match(r'Hello.*?(\d+).*?Demo', content)
print(result)

Here the string starts with Extra, but the regular expression starts with Hello. The whole regular expression matches part of the string, yet the match fails: as soon as the first character does not match, the whole match fails. The result is as follows:

None

So when using match() we have to take the beginning of the string into account, which makes it less convenient for extraction. It is better suited to checking whether a string conforms to the rules of a regular expression.

So here is another method, search(). It scans the entire string and returns the first successful match; the regular expression only needs to match part of the string. search() scans the string from left to right until it finds the first substring that matches and returns it, and if nothing is found by the end, it returns None.

Let's change match() in the code above to search() and take a look at the running result:

<_sre.SRE_Match object; span=(13, 53), match='Hello 1234567 World_This is a Regex Demo'>

In this way, we get the matching result.

So, for convenience, we can use search() wherever possible.
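The contrast between the two methods can be sketched side by side, using the same string and pattern as in the example above.

```python
import re

content = 'Extra stings Hello 1234567 World_This is a Regex Demo Extra stings'
pattern = r'Hello.*?(\d+).*?Demo'

matched = re.match(pattern, content)   # only tries the start of the string
found = re.search(pattern, content)    # scans for the first match anywhere

print(matched)
if found:
    print(found.group(1), found.span())
```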

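Let's look at a few more examples of the search() method.

First, here is a piece of HTML text to be matched. We will then write a few regular expressions to extract the corresponding information.

html = '''<h2>Classic Old Songs</h2>
<p>
    List of classic old songs
</p>
<ul>
    <li>You Were There All the Way</li>
    <li>
        <a href="/2.mp3" singer="Ren Xianqi">A Laugh from the Sea</a>
    </li>
    <li class="active">
        <a href="/3.mp3" singer="Qi Qin">The Past Goes with the Wind</a>
    </li>
    <li><a href="/4.mp3" singer="beyond">Glory Days</a></li>
    <li><a href="/5.mp3" singer="Chen Huilin">Notepad</a></li>
    <li>
        <a href="/6.mp3" singer="Teresa Teng">May We All Be Blessed with Longevity</a>
    </li>
</ul>'''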

Observe that the ul node contains many li nodes. Some li nodes contain an a node and some do not, and the a nodes carry attributes: the hyperlink (href) and the singer name (singer).

First, let's try to extract the singer name and song title from the hyperlink inside the li node whose class is active.

That is, we need to extract the singer attribute and the text of the a node under the third li node.

So the regular expression can start with <li and then look for the marker active, with .*? matching everything in between. Next we want the value of the singer attribute, so we write singer="(.*?)". The part to extract is enclosed in parentheses so that group() can return it, and the double quotation marks on both sides delimit it. Then we need the text of the a node: its left boundary is > and its right boundary is </a>, and the target content between them is again matched with (.*?). The final regular expression is therefore <li.*?active.*?singer="(.*?)">(.*?)</a>. We then call search(), which scans the entire HTML text and returns the first content that matches the regular expression.

In addition, because the HTML contains line breaks, we pass re.S as the third argument.

So the whole matching code is as follows:

result = re.search('<li.*?active.*?singer="(.*?)">(.*?)</a>', html, re.S)
if result:
    print(result.group(1), result.group(2))

Since the singer and song title are both enclosed in parentheses, we can get them with group(), whose argument is the index of the corresponding group.

Running result:

Qi Qin The Past Goes with the Wind

You can see that this is exactly the singer name and song title from the hyperlink inside the li node whose class is active.

So what happens if the regular expression does not include active? That is, we match li nodes regardless of whether they carry the active class. We remove active from the regular expression and rewrite the code as follows:

result = re.search('<li.*?singer="(.*?)">(.*?)</a>', html, re.S)
if result:
    print(result.group(1), result.group(2))

Since search() returns the first match that meets the criteria, the result changes here.

The running results are as follows:

Ren Xianqi A Laugh from the Sea

After we remove active, the search starts from the beginning of the string, and the first node that meets the criteria is the second li node. Nothing after it is matched, so the result is the content of the second li node.

Notice that in both matches above we passed re.S as the third argument of search(), so that .*? can match newlines and li nodes that contain line breaks can be matched. What happens if we remove it?

result = re.search('<li.*?singer="(.*?)">(.*?)</a>', html)
if result:
    print(result.group(1), result.group(2))

Running result:

beyond Glory Days

You can see that the result is now the content of the fourth li node. Both the second and third li nodes contain newline characters, and without re.S, .*? can no longer match newlines, so the regular expression does not match those nodes. The fourth li node contains no newline, so it matches.

Since the vast majority of HTML text contains newline characters, the lesson is to add the re.S modifier wherever possible to avoid failed matches.

findall()

Earlier we covered search(), which returns the first content that matches the regular expression. But what if we want everything that matches? For that we need the findall() method.

The findall() method searches the entire string and returns everything that matches the regular expression.

Taking the HTML text above again, if we want the hyperlink, singer, and song title of every a node, we can replace search() with findall(). If there are results, they come back as a list, so we traverse the list to get each set of content in turn.

results = re.findall('<li.*?href="(.*?)".*?singer="(.*?)">(.*?)</a>', html, re.S)
print(results)
print(type(results))
for result in results:
    print(result)
    print(result[0], result[1], result[2])

Running result:

[('/2.mp3', 'Ren Xianqi', 'A Laugh from the Sea'), ('/3.mp3', 'Qi Qin', 'The Past Goes with the Wind'), ('/4.mp3', 'beyond', 'Glory Days'), ('/5.mp3', 'Chen Huilin', 'Notepad'), ('/6.mp3', 'Teresa Teng', 'May We All Be Blessed with Longevity')]
<class 'list'>
('/2.mp3', 'Ren Xianqi', 'A Laugh from the Sea')
/2.mp3 Ren Xianqi A Laugh from the Sea
('/3.mp3', 'Qi Qin', 'The Past Goes with the Wind')
/3.mp3 Qi Qin The Past Goes with the Wind
('/4.mp3', 'beyond', 'Glory Days')
/4.mp3 beyond Glory Days
('/5.mp3', 'Chen Huilin', 'Notepad')
/5.mp3 Chen Huilin Notepad
('/6.mp3', 'Teresa Teng', 'May We All Be Blessed with Longevity')
/6.mp3 Teresa Teng May We All Be Blessed with Longevity

As you can see, each element of the returned list is a tuple, whose items can be read in turn by index.

So if you only need the first result, use search(); when you need to extract multiple results, use findall().
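The shape of what findall() returns depends on how many groups the pattern contains. The sketch below illustrates this on a made-up string, and also shows re.finditer(), which yields match objects when positions are needed as well.

```python
import re

text = 'a1 b22 c333'

no_group = re.findall(r'[a-z]\d+', text)        # no groups: whole matches
one_group = re.findall(r'[a-z](\d+)', text)     # one group: that group's text
two_groups = re.findall(r'([a-z])(\d+)', text)  # several groups: tuples

# finditer() yields match objects, so span() is available for each match.
spans = [m.span() for m in re.finditer(r'\d+', text)]

print(no_group)
print(one_group)
print(two_groups)
print(spans)
```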

sub()

Besides extracting information, we sometimes need regular expressions to modify text. For example, to remove all the digits from a string, using only the string replace() method would be too tedious. Here we can use the sub() method.

Let's look at an example:

import re

content = '54aK54yr5oiR54ix5L2g'
content = re.sub(r'\d+', '', content)  # replace every run of digits with an empty string
print(content)

Running result:

aKyroiRixLg

Here we pass \d+ as the first argument to match all the digits. The second argument is the replacement string; to remove the matches, pass an empty string. The third argument is the original string.

The result is the content after replacement.
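A few further variations of sub() are sketched below on a made-up string: limiting the number of replacements with count, counting replacements with subn(), and passing a function as the replacement.

```python
import re

content = 'a1b22c333'

# count=1 replaces only the first run of digits.
first_only = re.sub(r'\d+', '', content, count=1)
# A non-empty replacement string substitutes rather than removes.
masked = re.sub(r'\d+', '#', content)
# subn() also reports how many replacements were made.
cleaned, replaced = re.subn(r'\d+', '', content)
# The replacement may be a function that receives each match object.
doubled = re.sub(r'\d+', lambda m: str(int(m.group()) * 2), content)

print(first_only, masked, cleaned, replaced, doubled)
```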

In the HTML text above, suppose we want to get the song titles of all li nodes. Extracting them directly with a regular expression is rather cumbersome. For example, we could write:

# [\w ]+ matches the title text, allowing spaces inside a title
results = re.findall(r'<li.*?>\s*?(<a.*?>)?([\w ]+)(</a>)?\s*?</li>', html, re.S)
for result in results:
    print(result[1])

Running result:

You Were There All the Way
A Laugh from the Sea
The Past Goes with the Wind
Glory Days
Notepad
May We All Be Blessed with Longevity

With the help of sub(), this becomes easier: we can first remove the a nodes with sub(), leaving only the text, and then extract it with findall().

html = re.sub('<a.*?>|</a>', '', html)  # strip the opening and closing a tags
print(html)
results = re.findall('<li.*?>(.*?)</li>', html, re.S)
for result in results:
    print(result.strip())

Running result:

<h2>Classic Old Songs</h2>
<p>
    List of classic old songs
</p>
<ul>
    <li>You Were There All the Way</li>
    <li>
        A Laugh from the Sea
    </li>
    <li class="active">
        The Past Goes with the Wind
    </li>
    <li>Glory Days</li>
    <li>Notepad</li>
    <li>
        May We All Be Blessed with Longevity
    </li>
</ul>
You Were There All the Way
A Laugh from the Sea
The Past Goes with the Wind
Glory Days
Notepad
May We All Be Blessed with Longevity

You can see that the a tags are gone after processing with sub(), and findall() can then extract the titles directly. So at the right moment, a little preprocessing with sub() gets twice the result with half the effort.

compile()

The methods discussed so far all process strings directly. Finally, we introduce compile(), which compiles a regular expression string into a regular expression object so that it can be reused in later matches.

import re

content1 = '2016-12-15 12:00'
content2 = '2016-12-17 12:55'
content3 = '2016-12-22 13:21'
pattern = re.compile(r'\d{2}:\d{2}')  # compile once, reuse three times
result1 = re.sub(pattern, '', content1)
result2 = re.sub(pattern, '', content2)
result3 = re.sub(pattern, '', content3)
print(result1, result2, result3)

For example, here are three date-time strings, and we want to remove the time from each. We can use sub(), whose first argument is a regular expression, but there is no need to write the same regular expression three times. Instead we use compile() to compile it into a regular expression object and reuse it.

Running result:

2016-12-15 2016-12-17 2016-12-22

In addition, compile() also accepts modifiers such as re.S, so they do not need to be passed again to methods such as search() and findall(). In this sense compile() wraps up a regular expression so that we can reuse it more conveniently.
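As a brief sketch of both points, the compiled object below carries its flags with it, and a compiled pattern exposes the same operations as methods (match(), search(), findall(), sub()). The sample strings are made up for illustration.

```python
import re

# Flags are given once at compile time and apply to every later call.
pattern = re.compile(r'^he.*?(\d+).*?demo$', re.I | re.S)
content = 'Hello 1234567 World_This\nis a Regex Demo'
number = pattern.match(content).group(1)

# The compiled object has its own sub() method, so re.sub() is not required.
times = re.compile(r'\d{2}:\d{2}')
dates = [times.sub('', s).strip()
         for s in ('2016-12-15 12:00', '2016-12-17 12:55')]

print(number)
print(dates)
```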

That covers how to use regular expressions in Python 3 projects. If you have had similar questions, the analysis above may serve as a reference.
