2025-01-16 update. From: SLTechnology News&Howtos, Shulou (Shulou.com), 06/02 report
This article explains the regular expressions commonly used by crawlers in Python. The methods introduced here are simple, fast and practical, so let's walk through them together.
Crawling involves four main steps:
·Define the goal (know which domain or website you are going to crawl)
·Crawl (download all the content of the site)
·Extract (discard the data that is of no use to us)
·Process the data (store and use it the way we want)
For filtering text, the most powerful tool is the regular expression, and it is indispensable in the Python crawler world.
What is a regular expression?
Regular expressions (often shortened to "regex") are usually used to find and replace text that conforms to a pattern (rule).
A regular expression is a logical formula for operating on strings: certain predefined special characters, and combinations of them, form a "pattern string" that expresses a filtering logic for strings.
Given a regular expression and another string, we can do the following:
·Check whether the string conforms to the filtering logic of the regular expression ("matching");
·Extract the specific part we want from the string ("filtering").
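To make the two uses concrete, here is a minimal sketch. The date pattern and the sample text are made up for illustration, and the search and findall methods it relies on are covered in detail later in this article.

```python
import re

# Hypothetical pattern and text, chosen only for illustration.
date_pattern = re.compile(r'\d{4}-\d{2}-\d{2}')
text = 'Updated 2025-01-16, first published 2024-06-02.'

# "Matching": does the text contain something that fits the pattern?
has_date = date_pattern.search(text) is not None

# "Filtering": pull out the parts we want.
dates = date_pattern.findall(text)

print(has_date)
print(dates)
```

The first print shows whether anything matched, the second shows the extracted substrings.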
Regular expression matching rules
Python's re module
In Python, regular expressions are available through the built-in re module.
One thing to note in particular: regular expressions use the backslash to escape special characters, and so do ordinary Python string literals. To pass a pattern through unchanged, write it as a raw string by adding the r prefix, for example r'\d+'.
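A small sketch of what the r prefix changes. The variable names are only illustrative.

```python
import re

# In a normal string literal, a literal backslash has to be written as '\\'.
plain = '\\d+'
# With the r prefix the backslash is kept exactly as typed.
raw = r'\d+'

print(plain == raw)            # both are the three characters \, d, +
print(len(r'\n'), len('\n'))   # raw: backslash and n; normal: one newline character
print(re.findall(raw, 'a1b22'))
```

Writing every pattern as a raw string avoids having to think about which backslashes Python would otherwise interpret first.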
The general steps for using the re module are as follows:
·Use the compile() function to compile the string form of a regular expression into a Pattern object
·Match the text with one of the methods the Pattern object provides, obtaining a match result (a Match object)
·Finally, use the attributes and methods of the Match object to obtain information and carry out further operations as needed
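The three steps can be sketched as follows. The e-mail-like pattern and the sample text are assumptions made up for this illustration, not a robust address matcher.

```python
import re

# Step 1: compile the pattern string into a Pattern object.
pattern = re.compile(r'(\w+)@(\w+)\.com')

# Step 2: match against some text to get a Match object (or None).
m = pattern.search('contact: alice@example.com')

# Step 3: read information off the Match object.
if m:
    user, domain = m.group(1), m.group(2)
    print(user, domain, m.span())
```

Checking the result with "if m" before using it matters, because a failed match returns None rather than raising an error.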
compile function
The compile function is used to compile regular expressions and generate a Pattern object. Its general use form is as follows:
import re
# Compile the regular expression into a Pattern object
pattern = re.compile(r'\d+')
In the above, we have compiled a regular expression into a Pattern object. Next, we can use a series of methods of pattern to match the text.
Some commonly used methods of the Pattern object are:
·match method: search from the starting position, match once
·search method: search from any position, match once
·findall method: match all, return a list
·finditer method: match all, return an iterator
·split method: split the string, return a list
·sub method: replace
match method
The match method looks for a match at the head of a string (a different starting position can also be specified). It matches once: as soon as one match is found it returns, rather than looking for all matches. Its general form is:
match(string[, pos[, endpos]])
where string is the string to match, and pos and endpos are optional parameters giving the start and end positions within the string; their defaults are 0 and len (the string length) respectively. So when pos and endpos are not specified, match tries to match at the head of the string.
It returns a Match object when a match is found, or None otherwise.
>>> import re
>>> pattern = re.compile(r'\d+')  # matches one or more digits
>>> m = pattern.match('okk12hellohai34fine')  # look at the head: no match
>>> print(m)
None
>>> m = pattern.match('okk12hellohai34fine', 2, 10)  # start at the second 'k': no match
>>> print(m)
None
>>> m = pattern.match('okk12hellohai34fine', 3, 10)  # start at '1': matches
>>> print(m)  # a Match object is returned
<re.Match object; span=(3, 5), match='12'>
>>> m.group(0)  # the 0 can be omitted
'12'
>>> m.start(0)  # the 0 can be omitted
3
>>> m.end(0)  # the 0 can be omitted
5
>>> m.span(0)  # the 0 can be omitted
(3, 5)
Above, a Match object is returned when the match succeeds, where:
·The group([group1, …]) method returns the substrings matched by one or more groups; to get the whole matched substring, use group() or group(0);
·The start([group]) method returns the start position, within the whole string, of the substring matched by the group (the index of its first character); the parameter defaults to 0;
·The end([group]) method returns the end position, within the whole string, of the substring matched by the group (the index of its last character plus 1); the parameter defaults to 0;
·The span([group]) method returns (start(group), end(group)).
Here's another example:
>>> import re
>>> pattern = re.compile(r'([a-z]+) ([a-z]+)', re.I)  # re.I means ignore case
>>> m = pattern.match('Hello World Wife Web')
>>> print(m)  # match succeeded, a Match object is returned
<re.Match object; span=(0, 11), match='Hello World'>
>>> m.group(0)  # the whole matched substring
'Hello World'
>>> m.span(0)  # the index range of the whole matched substring
(0, 11)
>>> m.group(1)  # the substring matched by the first group
'Hello'
>>> m.span(1)  # the index range of the first group's substring
(0, 5)
>>> m.group(2)  # the substring matched by the second group
'World'
>>> m.span(2)  # the index range of the second group's substring
(6, 11)
>>> m.groups()  # equivalent to (m.group(1), m.group(2), ...)
('Hello', 'World')
>>> m.group(3)  # there is no third group
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: no such group
search method
The search method looks for a match anywhere in the string. It also matches once: as soon as one match is found it returns, rather than looking for all matches. Its general form is:
search(string[, pos[, endpos]])
where string is the string to match, and pos and endpos are optional parameters giving the start and end positions within the string; their defaults are 0 and len (the string length) respectively.
It returns a Match object when a match is found, or None otherwise.
Let's look at an example:
>>> import re
>>> pattern = re.compile(r'\d+')
>>> m = pattern.search('okk12hellohai34fine')  # the match method would find nothing here
>>> m
<re.Match object; span=(3, 5), match='12'>
>>> m.group()
'12'
>>> m = pattern.search('okk12hellohai34fine', 10, 30)  # restrict the search to a range of the string
>>> m
<re.Match object; span=(13, 15), match='34'>
>>> m.group()
'34'
>>> m.span()
(13, 15)
Here's another example:
import re
# Compile the regular expression into a Pattern object
pattern = re.compile(r'\d+')
# Use search() to find a matching substring; it returns None if there is none
# match() cannot be used here, because the string does not start with a digit
m = pattern.search('hello 123456 789')
if m:
    # Use the Match object to get the matched text
    print('matching string:', m.group())
    # Start and end positions
    print('position:', m.span())
Output:
matching string: 123456
position: (6, 12)
findall method
The match and search methods above both match once: they return as soon as one match is found. Most of the time, however, we need to search the entire string and get every match.
The findall method has the following form:
findall(string[, pos[, endpos]])
where string is the string to match, and pos and endpos are optional parameters giving the start and end positions within the string; their defaults are 0 and len (the string length) respectively.
findall returns a list of all matching substrings, or an empty list if there is no match.
For example:
import re
pattern = re.compile(r'\d+')  # find numbers
result1 = pattern.findall('hello 123456 789')
result2 = pattern.findall('one1two2three3four4', 0, 10)
print(result1)
print(result2)
Output:
['123456', '789']
['1', '2']
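One detail worth knowing when extracting fields from crawled text: if the pattern contains groups, findall returns the group contents rather than the whole match. The sample text below is made up for illustration.

```python
import re

text = 'price: 30 usd, weight: 12 kg'

# No groups: each item is the whole matched substring.
whole = re.compile(r'\d+ \w+').findall(text)

# One group: each item is just that group's text.
numbers = re.compile(r'(\d+) \w+').findall(text)

# Several groups: each item is a tuple with one entry per group.
pairs = re.compile(r'(\d+) (\w+)').findall(text)

print(whole)
print(numbers)
print(pairs)
```

This makes groups a convenient way to pick out only the part of each match that is needed.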
finditer method
The finditer method behaves like findall: it searches the entire string and finds all matches. But instead of a list, it returns an iterator that yields each match result (a Match object) in order.
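A minimal sketch of finditer, reusing the digit pattern from the findall example. Because each item is a Match object, positions are available as well as the matched text.

```python
import re

pattern = re.compile(r'\d+')

# finditer yields Match objects lazily, in the order they occur.
found = [(m.group(), m.span())
         for m in pattern.finditer('hello 123456 789')]
print(found)
```

For large pages, iterating this way avoids building the full list of results when only the first few matches are needed.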
split method
The split method splits a string at each place where the pattern matches and returns the pieces as a list. Its form is:
split(string[, maxsplit])
maxsplit specifies the maximum number of splits; if it is not given, the string is split at every match.
For example:
import re
p = re.compile(r'[\s,;]+')
print(p.split('a,b;; c d'))
Output:
['a', 'b', 'c', 'd']
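To see the effect of maxsplit, the same separator pattern can be limited to a single split. This is a small sketch, and the variable name is illustrative.

```python
import re

p = re.compile(r'[\s,;]+')

# Split only at the first separator; the rest of the string is left intact.
limited = p.split('a,b;; c d', maxsplit=1)
print(limited)
```

The last element keeps the remaining separators untouched, which is handy for splitting a "key, rest of line" style record.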
sub method
The sub method performs substitution. Its form is:
sub(repl, string[, count])
where repl can be a string or a function:
·If repl is a string, it replaces every matching substring of string, and the resulting string is returned. repl can also refer to a group with the form \id (for example \1 or \2), but the number 0 cannot be used this way;
·If repl is a function, it should take a single argument (a Match object) and return the string to substitute (group references are no longer expanded in the returned string).
·count specifies the maximum number of substitutions; if it is not given, all matches are replaced.
For example:
import re
p = re.compile(r'(\w+) (\w+)')  # \w matches a word character (letter, digit or underscore)
s = 'hello 123, hello 456'
print(p.sub(r'hello world', s))  # replace 'hello 123' and 'hello 456' with 'hello world'
print(p.sub(r'\2 \1', s))  # reference the groups
def func(m):
    return 'hi' + ' ' + m.group(2)
print(p.sub(func, s))
print(p.sub(func, s, count=1))  # replace at most once
Output:
hello world, hello world
123 hello, 456 hello
hi 123, hi 456
hi 123, hello 456
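Since \0 is not available as a group reference in repl, the whole match can be referenced with \g<0> instead, or through m.group(0) when repl is a function. A short sketch, with illustrative variable names:

```python
import re

p = re.compile(r'\d+')
s = 'hello 123, hello 456'

# \g<0> stands for the whole match.
bracketed = p.sub(r'[\g<0>]', s)

# A function can use m.group(0) and compute the replacement.
doubled = p.sub(lambda m: str(int(m.group(0)) * 2), s)

print(bracketed)
print(doubled)
```

The function form is the more flexible of the two, since the replacement can depend on the matched text.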