What is the regular expression commonly used by crawlers in python

2025-01-16 Update From: SLTechnology News&Howtos

This article explains the regular expressions commonly used by crawlers in Python. The methods introduced here are simple, fast, and practical, so let's walk through them together.

A crawler works in four main steps:

· Define the goal (know which site or range of pages you are going to search)

· Crawl (download all the content of those pages)

· Extract (discard the data that is of no use to us)

· Process the data (store and use it the way we want)

Among text-filtering tools, regular expressions are the most powerful, and they are indispensable in the world of Python crawlers.

What is a regular expression?

A regular expression (often shortened to regex) is typically used to search for and replace text that conforms to a certain pattern (rule).

A regular expression is a logical formula for operating on strings: predefined special characters, and combinations of them, form a "rule string" that expresses a filtering logic for strings.

Given a regular expression and another string, we can do the following:

· Check whether the given string conforms to the filtering logic of the regular expression ("matching");

· Extract the specific part we want from the string through the regular expression ("filtering").

Regular expression matching rules
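
As a rough illustration of a few common matching rules, the sketch below pairs some frequently used metacharacters with a sample string each. The sample strings are made up for this example, and the list is far from complete.

```python
import re

# Each entry pairs a pattern with a sample string it is searched in.
examples = [
    (r'\d+', 'order 66'),             # \d: a digit, +: one or more
    (r'\w+', 'hello_world!'),         # \w: letter, digit, or underscore
    (r'a.c', 'xxabcxx'),              # . : any character except a newline
    (r'^py', 'python'),               # ^ : start of the string
    (r'on$', 'python'),               # $ : end of the string
    (r'colou?r', 'color'),            # ? : zero or one of the previous item
    (r'[aeiou]{2}', 'crawler book'),  # [...]: character set, {2}: exactly two
]

results = [re.search(p, s).group() for p, s in examples]
print(results)
```

Every pattern here is written as a raw string so that the backslashes reach the re module unchanged.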

Python's re module

In Python, we can use the built-in re module to use regular expressions.

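One point deserves particular attention: regular expressions use the backslash to escape special characters, and so do ordinary Python string literals. To avoid having to double every backslash, write patterns as raw strings by adding an r prefix, for example r'\d+'.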
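
A minimal sketch of why the r prefix matters, assuming a word-boundary pattern (\b) chosen purely for illustration:

```python
import re

text = 'cat concat cat'

# In a normal literal every backslash must be doubled;
# a raw literal is passed to re unchanged.
plain = '\\bcat\\b'
raw = r'\bcat\b'
same_pattern = (plain == raw)

matches_raw = re.findall(raw, text)

# Without the r prefix, '\b' is the backspace character,
# not a word boundary, so nothing matches.
matches_unescaped = re.findall('\bcat\b', text)

print(same_pattern, matches_raw, matches_unescaped)
```

The raw form is usually easier to read, which is why the examples in this article write patterns as r'...'.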

The general steps for using the re module are as follows:

· Use the compile() function to compile the string form of a regular expression into a Pattern object;

· Match the text through the methods provided by the Pattern object to obtain a match result, a Match object;

· Finally, use the attributes and methods provided by the Match object to obtain information and perform other operations as needed.
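
The three steps above can be sketched as follows. The date pattern and the sample sentence are assumptions made for this illustration only.

```python
import re

# Step 1: compile the pattern string into a Pattern object.
pattern = re.compile(r'(\d{4})-(\d{2})-(\d{2})')

# Step 2: call a Pattern method on the text to obtain a Match object (or None).
m = pattern.search('Updated on 2025-01-16 by the editor')

# Step 3: read information from the Match object.
if m:
    year, month, day = m.groups()
    print(m.group(), m.span(), year, month, day)
```

Checking `if m:` before using the result matters because the matching methods return None when nothing is found.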

compile function

The compile function compiles a regular expression and produces a Pattern object. It is typically used like this:

import re

# Compile the regular expression into a Pattern object
pattern = re.compile(r'\d+')

The code above compiles a regular expression into a Pattern object. Next, we can use the methods of pattern to match and search text.

Some common methods of Pattern objects are:

· match method: start from the starting position, match once

· search method: start from any position, match once

· findall method: match all, return a list

· finditer method: match all, return an iterator

· split method: split the string, return a list

· sub method: replace

match method

The match method tries to match at the head of the string (a starting position can also be specified). It performs a single match: it returns as soon as one match is found rather than looking for all matches. Its general form is:

match(string[, pos[, endpos]])

Here string is the string to match, and pos and endpos are optional parameters that specify the start and end positions within the string. Their default values are 0 and len (the string length) respectively. Therefore, when pos and endpos are not given, the match method matches at the head of the string.

It returns a Match object when the match succeeds, or None if there is no match.

>>> import re
>>> pattern = re.compile(r'\d+')  # match one or more digits
>>> m = pattern.match('okk12hellohai34fine')  # match at the head: no match
>>> print(m)
None
>>> m = pattern.match('okk12hellohai34fine', 2, 10)  # start at the second 'k': no match
>>> print(m)
None
>>> m = pattern.match('okk12hellohai34fine', 3, 10)  # start at '1': matches
>>> print(m)  # a Match object is returned
<re.Match object; span=(3, 5), match='12'>
>>> m.group(0)  # the 0 can be omitted
'12'
>>> m.start(0)  # the 0 can be omitted
3
>>> m.end(0)  # the 0 can be omitted
5
>>> m.span(0)  # the 0 can be omitted
(3, 5)

In the example above, a Match object is returned when the match succeeds, where:

· The group([group1, …]) method returns the string matched by one or more groups. To get the whole matched substring, call group() or group(0);

· The start([group]) method returns the starting position, within the whole string, of the substring matched by the group (the index of the substring's first character). The parameter defaults to 0;

· The end([group]) method returns the end position, within the whole string, of the substring matched by the group (the index of the substring's last character + 1). The parameter defaults to 0;

· The span([group]) method returns (start(group), end(group)).

Here's another example:

>>> import re
>>> pattern = re.compile(r'([a-z]+) ([a-z]+)', re.I)  # re.I means ignore case
>>> m = pattern.match('Hello World Wide Web')
>>> print(m)  # the match succeeds and a Match object is returned
<re.Match object; span=(0, 11), match='Hello World'>
>>> m.group(0)  # the entire matched substring
'Hello World'
>>> m.span(0)  # the index range of the entire matched substring
(0, 11)
>>> m.group(1)  # the substring matched by the first group
'Hello'
>>> m.span(1)  # the index range of the first group's substring
(0, 5)
>>> m.group(2)  # the substring matched by the second group
'World'
>>> m.span(2)  # the index range of the second group's substring
(6, 11)
>>> m.groups()  # equivalent to (m.group(1), m.group(2), ...)
('Hello', 'World')
>>> m.group(3)  # there is no third group
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: no such group

search method

The search method looks for a match at any position in the string. Like match, it performs a single match: it returns as soon as one match is found rather than looking for all matches. Its general form is:

search(string[, pos[, endpos]])

Here string is the string to search, and pos and endpos are optional parameters that specify the start and end positions within the string. Their default values are 0 and len (the string length) respectively.

It returns a Match object when the match succeeds, or None if there is no match.

Let's look at an example:

>>> import re
>>> pattern = re.compile(r'\d+')
>>> m = pattern.search('okk12hellohai34fine')  # the match method would find nothing here
>>> m
<re.Match object; span=(3, 5), match='12'>
>>> m.group()
'12'
>>> m = pattern.search('okk12hellohai34fine', 10, 30)  # restrict the search interval
>>> m
<re.Match object; span=(13, 15), match='34'>
>>> m.group()
'34'
>>> m.span()
(13, 15)

Here's another example:

import re

# Compile the regular expression into a Pattern object
pattern = re.compile(r'\d+')

# Use search() to find a matching substring; it returns None if there is none.
# match() cannot be used here because the string does not start with a digit.
m = pattern.search('hello 123456 789')

if m:
    # Use the Match object to get the matched text
    print('matching string:', m.group())
    # Start and end positions
    print('position:', m.span())

Output:

matching string: 123456
position: (6, 12)

findall method

The match and search methods above both perform a single match: they return as soon as one match is found. Most of the time, however, we need to search the entire string and obtain all the matches.

The findall method is used as follows:

findall(string[, pos[, endpos]])

Here string is the string to search, and pos and endpos are optional parameters that specify the start and end positions within the string. Their default values are 0 and len (the string length) respectively.

findall returns a list of all matching substrings, or an empty list if there is no match.

For example:

import re

pattern = re.compile(r'\d+')  # find digits

result1 = pattern.findall('hello 123456 789')
result2 = pattern.findall('one1two2three3four4', 0, 10)

print(result1)
print(result2)

Output:

['123456', '789']
['1', '2']
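
One detail worth knowing when extracting fields is that findall changes what it returns once the pattern contains groups. The following is a small sketch; the key=value text is invented for the illustration.

```python
import re

text = 'price=30 qty=4 discount=5'

# No groups: each list item is the whole match.
whole = re.findall(r'\w+=\d+', text)

# One group: each item is just that group's text.
values = re.findall(r'\w+=(\d+)', text)

# Several groups: each item is a tuple with one entry per group.
pairs = re.findall(r'(\w+)=(\d+)', text)

print(whole)
print(values)
print(pairs)
```

The tuple form is convenient for crawlers, since each match can carry several fields at once.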

finditer method

The finditer method behaves much like findall: it also searches the entire string and obtains all the matches. The difference is that it returns an iterator that yields each match result (a Match object) in order.
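
A short sketch of finditer, reusing the sample string from the findall example. Because each item is a Match object, the position of every match is available as well as its text.

```python
import re

pattern = re.compile(r'\d+')

# finditer yields one Match object per non-overlapping match, in order.
found = [(m.group(), m.span()) for m in pattern.finditer('hello 123456 789')]

for text, span in found:
    print('matching string: {}, position: {}'.format(text, span))
```

For very long texts, iterating directly over pattern.finditer(...) avoids building the whole list in memory.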

split method

The split method splits the string at every place where the pattern matches and returns the pieces as a list. It is used as follows:

split(string[, maxsplit])

maxsplit specifies the maximum number of splits. If it is not given, the string is split at every match.

For example:

import re

p = re.compile(r'[\s\,\;]+')
print(p.split('a,b;; c d'))

Output:

['a', 'b', 'c', 'd']
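
To see the effect of maxsplit, here is a minimal sketch on the same input, limited to a single split:

```python
import re

p = re.compile(r'[\s,;]+')

# maxsplit=1 stops after the first split; the rest of the string is kept as is.
limited = p.split('a,b;; c d', maxsplit=1)
print(limited)
```

Inside a character class the comma and semicolon need no backslash, so this pattern is equivalent to the one used above.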

sub method

The sub method performs substitution. It is used as follows:

sub(repl, string[, count])

Here repl can be a string or a function:

· If repl is a string, it replaces every matching substring in string, and the resulting string is returned. repl can also reference groups in the form \id or \g<id> (or \g<name> for named groups). Note that \0 is not a group reference, although \g<0> stands for the whole match;

· If repl is a function, it should accept a single argument (a Match object) and return the replacement string (group references in the returned string are not expanded);

· count specifies the maximum number of substitutions. If it is not given, all matches are replaced.

For example:

import re

p = re.compile(r'(\w+) (\w+)')  # \w matches [A-Za-z0-9_] (and other Unicode word characters in Python 3)
s = 'hello 123, hello 456'

print(p.sub(r'hello world', s))  # replace 'hello 123' and 'hello 456' with 'hello world'
print(p.sub(r'\2 \1', s))  # reference the groups

def func(m):
    return 'hi' + ' ' + m.group(2)

print(p.sub(func, s))
print(p.sub(func, s, 1))  # replace at most once

Output:

hello world, hello world
123 hello, 456 hello
hi 123, hi 456
hi 123, hello 456
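
To tie the methods back to crawling, the sketch below extracts links from a piece of page source. The HTML snippet is hypothetical and hard-coded, where a real crawler would download it first. The pattern is also deliberately simple and assumes this exact attribute layout. For arbitrary real-world pages a dedicated HTML parser is usually more robust.

```python
import re

# Hypothetical page source standing in for a downloaded page.
html = '''
<ul>
  <li><a href="/news/1.html">First  story</a></li>
  <li><a href="/news/2.html">Second story</a></li>
</ul>
'''

# Non-greedy groups capture the link target and the link text.
link_pattern = re.compile(r'<a href="(.*?)">(.*?)</a>', re.S)

links = []
for m in link_pattern.finditer(html):
    url, title = m.groups()
    # Collapse runs of whitespace inside the title.
    links.append((url, re.sub(r'\s+', ' ', title).strip()))

print(links)
```

The non-greedy .*? keeps each group from running past the first closing quote or tag, which is the usual choice when several similar tags sit on one page.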

At this point, you should have a deeper understanding of the regular expressions commonly used by crawlers in Python. The best way to consolidate it is to try the examples yourself.
