How to use regular expressions in a Python web crawler

2025-04-04 Update From: SLTechnology News&Howtos

Shulou(Shulou.com) 06/03 Report--

This article explains how to use regular expressions in a Python web crawler. The examples are detailed and should serve as a useful reference; if the topic interests you, read on!

1. Common matching rules
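The original table of matching rules is not reproduced here; as a minimal sketch, the rules used throughout this article behave as follows:

```python
import re

# \w  word character (letter, digit, underscore)    \W  non-word character
# \d  digit                                         \s  whitespace character
# ^   anchors at the start of the string, $ at the end
# *   zero or more repetitions, + one or more, {n} exactly n
assert re.match(r'\w', 'a')         # a letter is a word character
assert re.match(r'\W', ',')         # a comma is not a word character
assert re.match(r'\d{3}', '123')    # exactly three digits
assert re.match(r'\s', ' ')         # a space is whitespace
assert re.match(r'^abc$', 'abc')    # anchors force a whole-string match
```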

2. Common matching methods

1. match()

The match() method matches from the beginning of the string. It takes two arguments: the first is a regular expression, and the second is the string to be matched:

re.match(pattern, string)

If the match succeeds, the method returns a Match object; otherwise it returns None.

A successful result offers two useful methods: group() returns the matched string, and span() returns the start and end indices of the match.

import re

content = 'Hello_World,123 456'
result = re.match(r'^Hello\w{6}\W\d{3}\s\d{3}', content)
print(result)
print(result.group())
print(result.span())

[running result]

<re.Match object; span=(0, 19), match='Hello_World,123 456'>
Hello_World,123 456
(0, 19)

Substring matching

The example above matches the complete string, but in practice we often need only part of it. To capture a substring, simply wrap the corresponding part of the pattern in parentheses.

import re

content = 'Hello_World,123 456'
result = re.match(r'^Hello\w{6}\W(\d+)\s(\d{3})', content)
print(result)
print(result.group())
print(result.span())
print(result.group(1))
print(result.group(2))

[running result]

<re.Match object; span=(0, 19), match='Hello_World,123 456'>
Hello_World,123 456
(0, 19)
123
456

The parentheses capture the two numbers in the string as separate groups.

Universal matching character

.* combines two rules: . matches any character (except the newline character), and * means the preceding element may repeat any number of times, including zero. The previous match can therefore be written as:

import re

content = 'Hello_World,123 456'
result = re.match(r'^Hello.*456$', content)
print(result.group())

[running result]

Hello_World,123 456

Greedy matching and non-greedy matching

.* is a greedy match, while .*? is a non-greedy match.

The main difference is that a greedy match consumes as many characters as possible, while a non-greedy match consumes as few as possible. The following code makes the difference clear:

import re

content = 'number 12345678 test'
result_1 = re.match(r'^number.*(\d+).*test$', content)
print('greedy matching number: ' + result_1.group(1))
result_2 = re.match(r'^number.*?(\d+).*test$', content)
print('non-greedy matching number: ' + result_2.group(1))

[running result]

greedy matching number: 8

non-greedy matching number: 12345678

You might wonder why the greedy match captures fewer digits than the non-greedy one, which seems to contradict the definitions above.

Note that the greediness applies to the .* before the group, not to the group itself. In the greedy case, .* consumes as many characters as possible, swallowing '1234567' and leaving only '8' for \d+. In the non-greedy case, .*? consumes as few characters as possible, matching only the space after 'number' and leaving all of '12345678' for \d+, which produces the results above.

Modifier

re.I  ignore case when matching
re.L  locale-aware matching
re.M  multiline matching; affects ^ and $
re.S  makes . match any character, including the newline character
re.U  interprets characters according to the Unicode character set
re.X  allows more flexible (verbose) formatting of the regular expression
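The re.S modifier is the one a web crawler needs most often, because HTML source spans many lines; a small illustration with an assumed two-line string:

```python
import re

content = 'Hello\nWorld'
# Without re.S, . does not match the newline, so the match fails.
print(re.match(r'Hello.World', content))  # None
# With re.S, . matches any character including \n, so the match succeeds.
print(re.match(r'Hello.World', content, re.S).group())
```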

Escape matching

To match a special character literally (such as . or $), put a backslash (\) in front of it to escape it.
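For example, to match a price string that contains the metacharacters $ and . literally (the string here is an illustrative assumption):

```python
import re

content = 'price: $5.00'
# \$ and \. escape the metacharacters so they match the literal characters.
result = re.search(r'\$5\.00', content)
print(result.group())  # $5.00
```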

2. search()

search() scans the entire string and returns the first successful match. If no part of the string matches, None is returned.
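A short comparison with match(), using an illustrative string:

```python
import re

content = 'abc 123 def 456'
# search() scans the whole string and returns the first match.
print(re.search(r'\d+', content).group())  # 123
# match() only matches at the beginning, so it returns None here.
print(re.match(r'\d+', content))           # None
```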

3. findall()

Unlike search(), the findall() method returns every non-overlapping match of the regular expression as a list. If the pattern contains more than one group, each element of the list is a tuple.
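Both cases can be seen on the same illustrative string:

```python
import re

content = 'abc 123 def 456'
# Without groups, findall() returns a list of matched strings.
print(re.findall(r'\d+', content))          # ['123', '456']
# With two groups, each element of the list becomes a tuple.
print(re.findall(r'(\w+) (\d+)', content))  # [('abc', '123'), ('def', '456')]
```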

4. sub()

sub() modifies text by replacing every match of the pattern with the given replacement.

import re

temp = "abcdef123ghi456"
temp = re.sub(r'\d+', '', temp)
print(temp)

[running result]

abcdefghi

In sub(), the first argument is the regular expression matching the content to be replaced, the second argument is the replacement text, and the third argument is the string to operate on.
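The replacement text may also refer back to captured groups with \1, \2, and so on; a small illustration:

```python
import re

temp = 'abcdef123ghi456'
# \1 in the replacement reuses whatever the first group captured.
print(re.sub(r'(\d+)', r'[\1]', temp))  # abcdef[123]ghi[456]
```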

5. compile()

compile() turns a regular-expression string into a pattern object that can be reused in later matches.
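A minimal sketch of reusing one compiled pattern across several method calls:

```python
import re

# Compile once, reuse the pattern object for multiple matches.
pattern = re.compile(r'\d{3}')
print(pattern.match('123 456').group())   # 123
print(pattern.search('abc 456').group())  # 456
```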

That is all of the content of the article "How to use regular expressions in a Python web crawler". Thank you for reading! I hope the content shared here is helpful; for more related knowledge, please follow the industry information channel.
