How to use regular expressions in python3 crawler

2025-07-06 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/02 Report--

This article shares how to use regular expressions in a Python 3 crawler. Many readers may not be familiar with the topic, so it is offered here for reference; hopefully you will learn a lot from it.

Crawl a specified page with Python:

The code is as follows:

    import urllib.request

    url = "http://www.baidu.com"
    data = urllib.request.urlopen(url).read()
    # data = data.decode('UTF-8')
    print(data)

According to the official documentation, urllib.request.urlopen(url) returns an http.client.HTTPResponse object. That object has various methods, such as the read() method used here, which returns the response body as bytes.
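Because read() returns bytes, the body usually has to be decoded before any string processing. The following is a minimal sketch, not the article's own code: the helper name fetch_text, the timeout value, and the fallback charset are assumptions made for illustration.

```python
import urllib.request


def fetch_text(url, timeout=10, default_charset="utf-8"):
    """Fetch a URL and return its body decoded to str (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        raw = response.read()  # bytes
        # Prefer the charset declared in the Content-Type header, if any
        charset = response.headers.get_content_charset() or default_charset
    return raw.decode(charset, errors="replace")


# A data: URL keeps this demonstration offline;
# a real crawler would pass an http:// URL instead.
print(fetch_text("data:text/plain;charset=utf-8,hello"))
```

Falling back to a default charset with errors="replace" is one possible design choice; it trades strictness for never failing on a badly encoded page.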

Request a URL with variable parameters:

    import urllib.parse
    import urllib.request

    data = {}
    data['word'] = 'one peace'
    url_values = urllib.parse.urlencode(data)
    url = "http://www.baidu.com/s?"
    full_url = url + url_values
    a = urllib.request.urlopen(full_url)
    data = a.read()
    data = data.decode('UTF-8')
    print(data)
    # print the requested URL: a.geturl()

data is a dictionary; urllib.parse.urlencode() converts it to the string 'word=one+peace', which is then appended to url to form full_url.
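The encoding step can be checked on its own, without sending a request. This short sketch only builds the query string and then reverses it with parse_qs; the variable names are illustrative.

```python
from urllib.parse import urlencode, parse_qs

params = {"word": "one peace"}
query = urlencode(params)  # spaces are encoded as '+'
full_url = "http://www.baidu.com/s?" + query

print(query)
print(full_url)

# parse_qs reverses the encoding; each value comes back as a list
decoded = parse_qs(query)
print(decoded)
```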

Introduction to Python regular expressions:

Queue introduction

The crawler uses a breadth-first algorithm, which relies on a queue data structure. You could implement the queue with a list, but that is inefficient. The collections module provides a container for this purpose: collections.deque

    # simple queue test:
    from collections import deque

    queue = deque(["peace", "rong", "sisi"])
    queue.append("nick")
    queue.append("pishi")
    print(queue.popleft())
    print(queue)
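To show why a queue suits breadth-first traversal, here is a small sketch over a hypothetical link graph (the page names and the links dictionary are invented for illustration). popleft() on a deque takes constant time, whereas list.pop(0) has to shift every remaining element.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to
links = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d"],
    "d": [],
}


def bfs_order(start):
    """Return pages in breadth-first order starting from `start`."""
    queue = deque([start])
    seen = {start}
    order = []
    while queue:
        page = queue.popleft()  # O(1); list.pop(0) would be O(n)
        order.append(page)
        for nxt in links[page]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order


print(bfs_order("a"))
```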

Set introduction:

In the crawler, to avoid crawling pages that have already been crawled, we put the URL of each crawled page into a set. Before crawling a URL, we check whether the set already contains it. If it does, we skip the URL; if it does not, we add the URL to the set and then crawl the page.

Python also includes a set data type. A set is an unordered collection of unique elements. Its basic uses include membership testing and eliminating duplicate elements. Set objects also support mathematical operations such as union, intersection, difference, and symmetric difference.

Curly braces or the set() function can be used to create sets. Note: to create an empty set, you must use set() instead of {}; {} creates an empty dictionary.

Creating sets is demonstrated as follows:

    a = {"peace", "peace", "rong", "rong", "nick"}
    print(a)
    "peace" in a
    b = set(["peace", "peace", "rong", "rong"])
    print(b)
    # union
    print(a | b)
    # intersection
    print(a & b)
    # difference
    print(a - b)
    # symmetric difference
    print(a ^ b)

    # output (element order may vary):
    # {'peace', 'rong', 'nick'}
    # {'peace', 'rong'}
    # {'peace', 'rong', 'nick'}
    # {'peace', 'rong'}
    # {'nick'}
    # {'nick'}
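The check-then-add logic described for crawled URLs can be sketched as follows. The helper name should_crawl and the /about URL are assumptions for illustration only.

```python
visited = set()


def should_crawl(url):
    """Return True the first time a URL is seen, False afterwards."""
    if url in visited:
        return False
    visited.add(url)
    return True


candidates = [
    "http://rlovep.com",
    "http://rlovep.com/about",  # hypothetical sub-page
    "http://rlovep.com",        # duplicate, will be skipped
]
to_crawl = [u for u in candidates if should_crawl(u)]
print(to_crawl)

# {} is an empty dict, not an empty set
print(type({}), type(set()))
```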

Regular expression

What a crawler collects is usually a character stream, and picking URLs out of it requires string processing, which regular expressions handle easily.

Using a regular expression takes three steps:

1. Compile the regular expression.
2. Match the regular expression against a string.
3. Process the result.
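The three steps can be sketched on a small piece of HTML. The sample markup and the link pattern are assumptions chosen for illustration; a pattern this simple will miss many real-world link forms.

```python
import re

html = (
    '<a href="http://rlovep.com">home</a> '
    '<a href="http://rlovep.com/blog">blog</a>'
)

# 1. compile the regular expression
link_pattern = re.compile(r'href="(http://[^"]+)"')

# 2. match it against the string (findall returns the captured group)
urls = link_pattern.findall(html)

# 3. process the result
for u in urls:
    print(u)
```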

The following figure lists the syntax of regular expressions:

To use regular expressions in Python, import the re module; some of its methods are described below.

1. compile and match

In the re module, compile generates a Pattern object. Calling the match method of that Pattern instance on the text yields a Match instance, and the Match instance is then used to get the information.

    import re

    # compile the regular expression into a Pattern object
    pattern = re.compile(r'rlovep')
    # use the Pattern to match text and get the match result;
    # None is returned if there is no match
    m = pattern.match('rlovep.com')
    if m:
        # use the Match object to get group information
        print(m.group())
    # output:
    # rlovep

re.compile(strPattern[, flag]):

This method is the factory method of the Pattern class; it compiles a regular expression given as a string into a Pattern object. The second parameter, flag, is the matching mode. Values can be combined with the bitwise OR operator '|' so that they take effect together, such as re.I | re.M. Alternatively, you can specify the mode inside the regex string: re.compile('pattern', re.I | re.M) is equivalent to re.compile('(?im)pattern').

Available values are:

re.I (re.IGNORECASE): ignore case (the full name is in parentheses, same below)
re.M (re.MULTILINE): multiline mode; changes the behavior of '^' and '$' (see figure above)
re.S (re.DOTALL): dot-matches-all mode; changes the behavior of '.'
re.L (re.LOCALE): makes the predefined character classes \w \W \b \B \s \S depend on the current locale (in Python 3 this flag can be used only with bytes patterns)
re.U (re.UNICODE): makes the predefined character classes \w \W \b \B \s \S \d \D depend on the character properties defined by Unicode (this is already the default for str patterns in Python 3)
re.X (re.VERBOSE): verbose mode. In this mode the regular expression can span multiple lines, white space is ignored, and comments can be added.
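A short sketch of the flags in action; the sample text and patterns are made up for illustration.

```python
import re

text = "Hello\nhello world"

# Flags passed as an argument and flags embedded in the pattern are equivalent
by_argument = re.compile(r"^hello", re.I | re.M)
embedded = re.compile(r"(?im)^hello")
print(by_argument.findall(text), embedded.findall(text))

# Without re.S, '.' does not match a newline
print(re.search(r"Hello.hello", text))
print(re.search(r"Hello.hello", text, re.S) is not None)

# re.X allows white space and comments inside the pattern
date_pattern = re.compile(r"""
    (\d{4})   # year
    -         # separator
    (\d{2})   # month
""", re.X)
print(date_pattern.match("2025-07").groups())
```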

Match: The Match object is the result of a match and contains a lot of information about it, which can be read through the properties and methods that Match provides.

Attributes:

string: the text used in the match.
re: the Pattern object used in the match.
pos: the index in the text where the regular expression starts searching. Its value is the same as the parameter of the same name of Pattern.match() and Pattern.search().
endpos: the index in the text where the regular expression stops searching. Its value is the same as the parameter of the same name of Pattern.match() and Pattern.search().
lastindex: the index (group number) of the last captured group. None if no group was captured.
lastgroup: the name of the last captured group. None if the group has no name or no group was captured.
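The attributes can be inspected on a concrete match. The text and the group names user and host below are assumptions made for the example.

```python
import re

pattern = re.compile(r"(?P<user>\w+)@(?P<host>\w+)")
text = "mail: peace@rlovep now"

# Start matching at index 6, just after "mail: "
m = pattern.match(text, 6)

print(m.string)     # the text passed to match()
print(m.re is pattern)
print(m.pos, m.endpos)
print(m.lastindex)  # number of the last captured group
print(m.lastgroup)  # name of the last captured group
```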

Methods:

group([group1, ...]):

Returns the string captured by one or more groups; when multiple arguments are given, a tuple is returned. group1 can be a number or a name; the number 0 stands for the entire matched substring; with no arguments, group(0) is returned. A group that captured nothing returns None; a group that captured several times returns the last captured substring.

groups([default]):

Returns the strings captured by all groups as a tuple. It is equivalent to calling group(1, 2, ..., last). default is the value substituted for groups that captured nothing; it defaults to None.

groupdict([default]):

Returns a dictionary whose keys are the names of the named groups and whose values are the substrings those groups captured. Groups without names are not included. default has the same meaning as above.

start([group]):

Returns the start index in string of the substring captured by the specified group (the index of the substring's first character). group defaults to 0.

end([group]):

Returns the end index in string of the substring captured by the specified group (the index of the substring's last character + 1). group defaults to 0.

span([group]):

Returns (start(group), end(group)).

expand(template):

Substitutes the matched groups into template and returns the result. In template you can reference groups with \id or \g<id>, \g<name>, but you cannot use the number 0. \id is equivalent to \g<id>; however, \10 is treated as the 10th group, so to express the character '0' after \1 you must use \g<1>0.
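The Match methods can be tried on a small example. The pattern, the group names, and the optional third group are assumptions chosen so that one group captures nothing.

```python
import re

m = re.match(r"(?P<first>\w+) (?P<second>\w+)(?P<third> \w+)?", "hello world")

print(m.group())           # entire match
print(m.group(1, 2))       # several groups at once -> tuple
print(m.group("first"))    # by name
print(m.groups())          # unmatched group -> None
print(m.groups("n/a"))     # unmatched group -> default
print(m.groupdict())
print(m.start(2), m.end(2), m.span(2))
print(m.expand(r"\2 \g<first>"))
print(m.expand(r"\g<1>0"))  # group 1 followed by a literal '0'

# A group that captures several times keeps only the last capture
repeated = re.match(r"(\w)+", "abc")
print(repeated.group(1))
```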

Pattern: The Pattern object is a compiled regular expression; the methods it provides are used to match and search text.

Pattern cannot be instantiated directly; it must be constructed with re.compile().

Pattern provides several readable properties for getting information about the expression:

pattern: the expression string used when compiling.
flags: the matching mode used when compiling, as a number.
groups: the number of groups in the expression.
groupindex: a dictionary whose keys are the names of the named groups in the expression and whose values are the corresponding group numbers. Groups without names are not included.
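A brief sketch of reading these properties; the pattern itself is an arbitrary example. Note that for str patterns in Python 3 the numeric flags value also includes re.UNICODE, so it is safer to test individual bits than to compare the whole number.

```python
import re

p = re.compile(r"(\w+) (?P<second>\w+)", re.I)

print(p.pattern)
print(p.flags)               # numeric; includes re.UNICODE for str patterns
print(bool(p.flags & re.I))  # test a single flag bit
print(p.groups)
print(dict(p.groupindex))    # only named groups appear
```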

Instance method [| re module method]:

match(string[, pos[, endpos]]) | re.match(pattern, string[, flags]):

This method tries to match pattern starting at index pos of string. If pattern is matched through to its end, a Match object is returned; if pattern fails to match along the way, or the match reaches endpos before pattern is finished, None is returned.

The default values of pos and endpos are 0 and len(string), respectively. re.match() cannot specify these two parameters; its flags parameter specifies the matching mode used when compiling pattern.

Note: this method is not an exact match. If string still has characters left when pattern ends, the match is still considered successful. For an exact match, add the boundary matcher '$' at the end of the expression.
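The effect of pos, endpos, and the trailing '$' can be sketched as below. The use of fullmatch() at the end is an addition not mentioned in the text; it is available in the standard re module from Python 3.4 and is one alternative to appending '$'.

```python
import re

p = re.compile(r"rlovep")
url = "http://rlovep.com"

print(p.match("rlovep.com") is not None)  # leftover ".com" is allowed
print(p.match(url))                       # None: pattern is not at index 0
print(p.match(url, 7).span())             # start matching at index 7
print(p.match(url, 7, 10))                # None: endpos cuts the text short

# Exact match with '$'
print(re.match(r"rlovep$", "rlovep.com"))
print(re.match(r"rlovep$", "rlovep") is not None)

# Alternative (not from the text above): fullmatch requires the whole string to match
print(p.fullmatch("rlovep.com"))
```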

search(string[, pos[, endpos]]) | re.search(pattern, string[, flags]):

This method looks for a substring of string that matches. It tries to match pattern starting at index pos of string and returns a Match object if pattern is matched through to its end; if the match fails, pos is incremented by 1 and the match is tried again; if nothing has matched by the time pos = endpos, None is returned. The default values of pos and endpos are 0 and len(string), respectively. re.search() cannot specify these two parameters; its flags parameter specifies the matching mode used when compiling pattern.

split(string[, maxsplit]) | re.split(pattern, string[, maxsplit]):

Splits string at each matching substring and returns the pieces as a list. maxsplit specifies the maximum number of splits; if it is not specified, string is split at every match.

findall(string[, pos[, endpos]]) | re.findall(pattern, string[, flags]):

Searches string and returns all matching substrings as a list.

finditer(string[, pos[, endpos]]) | re.finditer(pattern, string[, flags]):

Searches string and returns an iterator that visits each match result (Match object) in order.
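A compact sketch of split, findall, and finditer on the same input; the sample text is invented for illustration.

```python
import re

p = re.compile(r"\d+")
text = "one1two22three333four"

print(p.split(text))              # split at every run of digits
print(p.split(text, maxsplit=1))  # split only once
print(p.findall(text))            # all matching substrings

# finditer yields Match objects, so positions are available too
found = [(m.group(), m.span()) for m in p.finditer(text)]
print(found)
```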

sub(repl, string[, count]) | re.sub(pattern, repl, string[, count]):

Replaces each matching substring in string with repl and returns the resulting string. When repl is a string, it can reference groups with \id or \g<id>, \g<name>, but cannot use the number 0. When repl is a function, it should accept a single argument (the Match object) and return the replacement string (group references can no longer be used in the returned string). count specifies the maximum number of replacements; if it is not specified, all matches are replaced.

subn(repl, string[, count]) | re.subn(pattern, repl, string[, count]):

Returns (sub(repl, string[, count]), number of replacements).
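The following sketch exercises sub with a string template, with a function, and with count, and then subn. The sample sentence and the helper name shout are assumptions for illustration.

```python
import re

p = re.compile(r"(\w+) (\w+)")
s = "hello world, good morning"

# repl as a string with group references
swapped = p.sub(r"\2 \1", s)
print(swapped)


def shout(match):
    """Upper-case the first word of each matched pair."""
    return match.group(1).upper() + " " + match.group(2)


# repl as a function that receives the Match object
print(p.sub(shout, s))

# count limits the number of replacements
print(p.sub(r"\2 \1", s, count=1))

# subn also reports how many replacements were made
print(p.subn(r"\2 \1", s))

# \g<1>0 means "group 1 followed by the character 0"
print(re.sub(r"(\d)", r"\g<1>0", "a1b2"))
```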

2. re.match(pattern, string, flags=0)

Function parameter description:

Parameter    Description
pattern      The regular expression to match.
string       The string to match.
flags        Flag bits that control how the regular expression is matched, such as case sensitivity and multiline matching.

We can use the group(num) or groups() method of the match object to get the matched expression.

Match object method    Description
group(num=0)           The string matched by the entire expression. group() can take several group numbers at once, in which case it returns a tuple containing the values of those groups.
groups()               Returns a tuple containing the strings of all groups, from group 1 to the last group.

The demonstration is as follows:

    # re.match
    import re

    print(re.match("rlovep", "rlovep.com"))         # matches rlovep
    print(re.match("rlovep", "rlovep.com").span())  # matches rlovep from the beginning
    print(re.match("com", "http://rlovep.com"))     # not at the start position, so the match fails
    # output (the first print shows the repr of the Match object, which varies by Python version):
    # (0, 6)
    # None

Example 2: using group

    import re

    line = "This is my blog"
    # match a string containing "is"
    matchObj = re.match(r'(.*) is (.*?) .*', line, re.M | re.I)
    # group output: with no argument, group() returns the entire match;
    # with argument 1, it returns the first (leftmost) parenthesized group, and so on
    if matchObj:
        print("matchObj.group():", matchObj.group())    # the entire match
        print("matchObj.group(1):", matchObj.group(1))  # the first parenthesized group
        print("matchObj.group(2):", matchObj.group(2))  # the second parenthesized group
    else:
        print("No match!")
    # output:
    # matchObj.group(): This is my blog
    # matchObj.group(1): This
    # matchObj.group(2): my

3. re.search method

re.search scans the entire string and returns the first successful match.

Function syntax:

    re.search(pattern, string, flags=0)

Function parameter description:

Parameter    Description
pattern      The regular expression to match.
string       The string to match.
flags        Flag bits that control how the regular expression is matched, such as case sensitivity and multiline matching.

We can use the group(num) or groups() method of the match object to get the matched expression.

Match object method    Description
group(num=0)           The string matched by the entire expression. group() can take several group numbers at once, in which case it returns a tuple containing the values of those groups.
groups()               Returns a tuple containing the strings of all groups, from group 1 to the last group.

Example 1:

    import re

    print(re.search("rlovep", "rlovep.com").span())
    print(re.search("com", "http://rlovep.com").span())
    # output:
    # (0, 6)
    # (14, 17)

Example 2:

    import re

    line = "This is my blog"
    # match a string containing "is"
    matchObj = re.search(r'(.*) is (.*?) .*', line, re.M | re.I)
    # group output: with no argument, group() returns the entire match;
    # with argument 1, it returns the first (leftmost) parenthesized group, and so on
    if matchObj:
        print("matchObj.group():", matchObj.group())    # the entire match
        print("matchObj.group(1):", matchObj.group(1))  # the first parenthesized group
        print("matchObj.group(2):", matchObj.group(2))  # the second parenthesized group
    else:
        print("No match!")
    # output:
    # matchObj.group(): This is my blog
    # matchObj.group(1): This
    # matchObj.group(2): my

The difference between search and match: re.match only matches at the beginning of a string; if the beginning of the string does not match the regular expression, the match fails and the function returns None. re.search scans the entire string until it finds a match.

A small Python crawler test

The earlier python3 introductory series has covered the basics of Python. Starting from this chapter, we introduce a Python crawler tutorial and share it with you. Put simply, a crawler grabs data from the network for analysis and processing. This chapter mainly introduces a few small crawler tests, along with the tools a crawler uses, such as sets, queues, and regular expressions.

Use Python to grab all the http links in a page and recursively grab the links in its sub-pages, using a set and a queue. The example crawls my own website; this first version has many bugs. The code is as follows:

    import re
    import urllib.request
    import urllib
    from collections import deque

    # use a queue to store URLs
    queue = deque()
    # use visited to avoid crawling the same page repeatedly
    visited = set()
    url = 'http://rlovep.com'  # entry page; you can replace it with another
    queue.append(url)
    cnt = 0
    while queue:
        url = queue.popleft()  # take the first element of the queue
        visited |= {url}       # mark as visited
        print('crawled: ' + str(cnt) + ' is being crawled
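The listing above breaks off in the source, so the rest of the author's loop is not known. As a rough sketch of how a breadth-first crawler built from a queue, a set, and a regular expression might look, the following is offered under stated assumptions: the function name crawl, the fetch parameter, the max_pages limit, the link pattern, and the FAKE_PAGES data are all invented for illustration. Unlike the fragment above, this sketch marks a URL as visited when it is enqueued rather than when it is dequeued, which avoids queuing the same URL twice.

```python
import re
from collections import deque

# Assumed pattern: absolute http links inside href attributes
LINK_RE = re.compile(r'href=["\'](http://[^"\'\s>]+)["\']')


def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl; `fetch(url)` must return the page text."""
    queue = deque([start_url])
    visited = {start_url}
    crawled = []
    while queue and len(crawled) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that cannot be fetched
        crawled.append(url)
        for link in LINK_RE.findall(html):
            if link not in visited:
                visited.add(link)
                queue.append(link)
    return crawled


# Offline stand-in for the network (hypothetical pages)
FAKE_PAGES = {
    "http://rlovep.com": '<a href="http://rlovep.com/a">a</a> <a href="http://rlovep.com/b">b</a>',
    "http://rlovep.com/a": '<a href="http://rlovep.com">home</a> <a href="http://rlovep.com/c">c</a>',
    "http://rlovep.com/b": "",
    "http://rlovep.com/c": "",
}

print(crawl("http://rlovep.com", FAKE_PAGES.__getitem__))
```

Passing fetch as a parameter keeps the traversal logic testable without network access; for real use, a function that downloads and decodes the page with urllib.request could be supplied instead.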
