
How to use regular expressions in Python

2025-03-26 Update From: SLTechnology News&Howtos


This article focuses on how to use regular expressions in Python. The method introduced here is simple, fast, and practical, so interested readers may want to follow along.

I. Introduction to regular expressions

1. Why do you need to know regular expressions to write crawlers?

When crawling a web page, we sometimes need only part of the content of a tag, or only the value of one of its attributes. XPath or CSS selectors alone often cannot isolate such a fragment, so we use regular expressions to match and extract it.
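As a rough illustration of that situation, the sketch below pulls one attribute value and part of a tag's text out of an HTML fragment. The fragment, the `data-price` attribute, and the variable names are invented for this example and do not come from any real page.

```python
import re

# Hypothetical HTML fragment; the tag, attribute and text are made up for illustration.
html = '<span class="item" data-price="42.50">Price: 42.50 USD (tax included)</span>'

# Capture the value of the data-price attribute.
attr_match = re.search(r'data-price="([^"]*)"', html)
price_attr = attr_match.group(1) if attr_match else None

# Capture only the number inside the tag's text, not the whole text node.
text_match = re.search(r'>Price:\s*([\d.]+)\s*USD', html)
price_text = text_match.group(1) if text_match else None

print(price_attr, price_text)
```

In a real crawler, a selector would usually locate the tag first, and the regular expression would then be applied to the selected text.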

2. What is the official definition of a regular expression?

A regular expression (often abbreviated as regex, regexp, or RE in code) is a concept in computer science. Regular expressions are typically used to retrieve and replace text that conforms to a certain pattern (rule).
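A minimal sketch of "retrieve and replace" with Python's `re` module follows. The sample sentence and the `A-` order-id format are assumptions made up for this example.

```python
import re

# Made-up sample text; the "A-<digits>" order-id format is only for illustration.
text = 'Order A-1001 shipped; order A-1002 pending.'

# Retrieve: find every substring that looks like "A-" followed by digits.
order_ids = re.findall(r'A-\d+', text)

# Replace: mask the digits of each order id.
masked = re.sub(r'A-\d+', 'A-####', text)

print(order_ids)
print(masked)
```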

II. Learn and memorize regular expressions by reading code

Day01:

'''
author: minimalist XksA
date: 2018.7.27
goal: regular expression
'''

import re

line = 'jijianXksA123'

# ^a means the string must begin with a (matches once)
# .  matches any single character except a newline (matches once)
# *  means the preceding character may appear any number of times (0 or more)
reg_str01 = '^j.*'  # a string that begins with j
# re.match function
# The first parameter is the pattern
# The second parameter is the string to match
# Return value: a match object on success, otherwise None

if re.match(reg_str01, line):
    print("match succeeded!")  # reg_str = '^j.*' matches
else:
    print("match failed!")  # reg_str = '^i.*' would fail
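The return value described in the comments above can be inspected directly. The sketch below also contrasts `re.match`, which only tries the start of the string, with `re.search`, which scans the whole string. The variable names are chosen for this example.

```python
import re

line = 'jijianXksA123'

m = re.match(r'j.*', line)        # re.match is anchored at the start, so '^' is optional
no_m = re.match(r'Xks', line)     # 'Xks' is not at the start, so the result is None
found = re.search(r'Xks', line)   # re.search looks for the pattern anywhere in the string

matched_text = m.group(0)         # the full text matched by the pattern
span = found.span()               # (start, end) indexes of the match

print(matched_text, no_m, span)
```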

# 23$ means the string must end with 23
reg_str02 = '^j.*23$'
if re.match(reg_str02, line):
    print("match succeeded!")  # reg_str = '^j.*23$' matches
else:
    print("match failed!")  # reg_str = '^j.*13$' would fail
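To see what `$` adds, the following sketch compares a pattern with and without it, and shows `re.fullmatch` as an alternative way to require that the whole string matches. The patterns here are illustrative variations, not the ones from the original listing.

```python
import re

line = 'jijianXksA123'

# re.match only anchors the start: the match may stop before the end of the string.
prefix_only = re.match(r'j.*A', line)
# Adding $ forces the match to reach the end of the string, which fails here.
with_dollar = re.match(r'j.*A$', line)
# re.fullmatch requires the whole string to match without writing ^ or $.
whole = re.fullmatch(r'j.*23', line)

print(prefix_only.group(0), with_dollar, whole.group(0))
```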

line01 = 'boooboaobxby'
# The part inside () is a capture group; its match can be retrieved with the group function
# Greedy matching: the leading .* consumes as much as possible,
# so the group ends up matching as far to the right as it can
reg_str03 = '.*(b.*b).*'
test01 = re.match(reg_str03, line01)
if test01:
    print(test01.group(1))  # result: bxb
else:
    print("match failed!")

# Non-greedy matching: the pattern is satisfied starting from the left
# ? after a quantifier makes it match as few characters as possible,
# so the group starts at the first position that fits the pattern
reg_str03 = '.*?(b.*b).*'   # semi-greedy: lazy prefix, greedy group
reg_str04 = '.*?(b.*?b).*'  # non-greedy: lazy prefix, lazy group
test01 = re.match(reg_str03, line01)
test02 = re.match(reg_str04, line01)
if test01 and test02:
    print(test01.group(1))  # result: boooboaobxb
    print(test02.group(1))  # result: booob
else:
    print("match failed!")
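The same greedy versus non-greedy behaviour shows up when matching tag-like text, which is a common case in crawling. This is a small sketch on a made-up string.

```python
import re

# Made-up sample with two pairs of tags.
text = '<b>bold</b> and <i>italic</i>'

greedy = re.findall(r'<.*>', text)   # .* runs to the last '>' in the string
lazy = re.findall(r'<.*?>', text)    # .*? stops at the first '>' it can

print(greedy)
print(lazy)
```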

Day02:

'''
author: minimalist XksA
date: 2018.7.28
goal: regular expression
'''
import re
line01 = 'boooboaobcxby'

def regtest(reg_str, line=line01):
    test = re.match(reg_str, line)
    if test:
        print(test.group(1))
    else:
        print("match failed!")

# + means the preceding character appears at least once
reg_str04 = '.*(b.+b).*'  # (b.+b) requires at least one character between b and b
regtest(reg_str04)  # result: bcxb

# {n} controls how many times the preceding character appears
# a{2}: a appears exactly twice
# b{3,4}: b appears at least 3 times and at most 4 times
# c{4,}: c appears at least 4 times
reg_str05 = '.*(b.{2}b).*'    # (b.{2}b): exactly two characters between b and b
reg_str06 = '.*(b.{3,4}b).*'  # (b.{3,4}b): at least 3 and at most 4 characters between b and b
reg_str07 = '.*(b.{4,}b).*'   # (b.{4,}b): at least 4 characters between b and b
regtest(reg_str05)  # result: bcxb
regtest(reg_str06)  # result: boaob
regtest(reg_str07)  # result: boaobcxb
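The quantifiers can also be checked in isolation. The sketch below uses `re.fullmatch` so that each test string must consist of nothing but the repeated character. The helper `full` and the test strings are made up for this example.

```python
import re

def full(pattern, s):
    """Return True when the whole string s matches pattern."""
    return re.fullmatch(pattern, s) is not None

results = {
    'a{2}': [full(r'a{2}', s) for s in ('a', 'aa', 'aaa')],
    'b{3,4}': [full(r'b{3,4}', s) for s in ('bb', 'bbb', 'bbbb', 'bbbbb')],
    'c{4,}': [full(r'c{4,}', s) for s in ('ccc', 'cccc', 'cccccc')],
    'd+': [full(r'd+', s) for s in ('', 'd', 'ddd')],
}

for pattern, outcome in results.items():
    print(pattern, outcome)
```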

# | means "or"
# (abc|123) matches either abc or 123; matching either one counts as success
reg_str08 = '.*(boo|abc)'
reg_str09 = '.*(abc|boo)'
regtest(reg_str08)  # result: boo
regtest(reg_str09)  # result: boo
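Both orderings above print boo because only boo occurs in the string. When both alternatives can match at the same position, the order does matter, and parentheses control how far `|` reaches. The following sketch uses made-up strings to show both points.

```python
import re

# Alternatives are tried from left to right at each position.
first = re.match(r'(abc|abcdef)', 'abcdef').group(1)    # the earlier alternative wins
second = re.match(r'(abcdef|abc)', 'abcdef').group(1)

# Parentheses limit the scope of |: gr(a|e)y means "gray or grey",
# while gra|ey means "gra or ey".
words = ('gray', 'grey', 'gra')
grouped = [bool(re.fullmatch(r'gr(a|e)y', w)) for w in words]
ungrouped = [bool(re.fullmatch(r'gra|ey', w)) for w in words]

print(first, second, grouped, ungrouped)
```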

# [] matches any one of the characters listed inside; most metacharacters inside are taken literally
# [abcd] matches one character that is a, b, c or d
# [0-9] matches one character in the range 0-9
# [^x] matches one character that is not x
line02 = 'telephone number: 15573563467'
reg_str10 = '.*(1[3458][0-9]{9}).*'
reg_str11 = '.*(1[3458][^1]{9}).*'
regtest(reg_str10, line02)  # result: 15573563467
regtest(reg_str11, line02)  # result: 15573563467
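A compact sketch of the same character-class ideas follows, assuming the phone-number shape used in reg_str10. The names `PHONE` and `find_phone` and the extra sample strings are hypothetical.

```python
import re

# Same shape as reg_str10: 1, then one of 3/4/5/8, then nine digits.
PHONE = re.compile(r'1[3458][0-9]{9}')

def find_phone(text):
    """Return the first 11-digit number of that shape, or None."""
    m = PHONE.search(text)
    return m.group(0) if m else None

# [^...] negates the class: remove every character that is not a digit.
only_digits = re.sub(r'[^0-9]', '', 'a1b22c333')
# Inside [], a dot is a literal dot rather than "any character".
dots = re.findall(r'[.]', 'a.b.c')

print(find_phone('telephone number: 15573563467'), only_digits, dots)
```

Using `search` on a compiled pattern avoids the leading and trailing `.*` that `re.match` needs in order to skip the surrounding text.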

# \s matches one whitespace character
# \S matches one non-whitespace character
# \w matches one character from A-Z, a-z, 0-9 or _ (for str patterns, Python 3 also accepts other Unicode word characters)
# \W is the opposite of \w
# \d matches one digit
# [\u4E00-\u9FA5] matches one common Chinese character, written as a Unicode range
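The shorthand classes listed above can be tried on a short mixed string. In this sketch the sample text is invented, and the Chinese character is written as the escape `\u5143` so the snippet does not depend on file encoding.

```python
import re

# Made-up sample; '\u5143' is the Chinese character for "yuan".
text = 'id_7 cost 25 \u5143'

words = re.findall(r'\w+', text)                   # Unicode-aware by default for str patterns
ascii_words = re.findall(r'\w+', text, re.ASCII)   # re.ASCII restricts \w to [A-Za-z0-9_]
digits = re.findall(r'\d+', text)
pieces = re.split(r'\s+', text)                    # split on runs of whitespace
chinese = re.findall(r'[\u4E00-\u9FA5]+', text)
non_word = re.findall(r'\W', 'a-b c')              # characters that \w does not match

print(words, ascii_words, digits, pieces, chinese, non_word)
```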

def regtest_test(reg_str, line=line01):
    test = re.match(reg_str, line)
    if test:
        print(test.group(1) + ':' + test.group(2) + '-' + test.group(3) + '-' + test.group(4))
    else:
        print("match failed!")
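The helper above reads its four groups by position. One possible alternative, sketched here under the assumption that the input looks like "NAME was born in DATE", is to use named groups so that each field is self-describing. The names `BIRTH` and `parse_birth` are hypothetical, and zero-padding the month and day is a choice of this example rather than of the original helper.

```python
import re

# \u5e74 and \u6708 are the Chinese characters for "year" and "month",
# accepted here as date separators alongside '.', '/' and '-'.
BIRTH = re.compile(
    r'(?P<name>.*) was born in '
    r'(?P<year>\d{4})[.\u5e74/-](?P<month>\d{1,2})[.\u6708/-](?P<day>\d{1,2})'
)

def parse_birth(s):
    """Return (name, 'YYYY-MM-DD') with zero-padded month and day, or None."""
    m = BIRTH.match(s)
    if m is None:
        return None
    date = '{}-{:02d}-{:02d}'.format(
        m.group('year'), int(m.group('month')), int(m.group('day')))
    return m.group('name'), date

print(parse_birth('Wang Wu was born in 1997.2.5'))
```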

# Simple example
# The dates use four notations; 年, 月 and 日 mean year, month and day
str01 = 'Zhang San was born in 1997年12月20日'
str02 = 'Li Si was born in 1989-01-20'
str03 = 'Wang Wu was born in 1997.2.5'
str04 = 'Zhao Liu was born in 1997/12/20'
str_list = [str01, str02, str03, str04]
# Extract the name + date of birth
# Matching pattern (a raw string, so \d reaches the regex engine unchanged)
reg_str12 = r'(.*) was born in (\d{4})[.年/-](\d{1,2})[.月/-](\d{1,2}).*?'
for s in str_list:
    regtest_test(reg_str12, s)
# result:
# Zhang San:1997-12-20
# Li Si:1989-01-20
# Wang Wu:1997-2-5
# Zhao Liu:1997-12-20

By now you should have a deeper understanding of how to use regular expressions in Python. Try the examples out in practice!
