
Python Automated Development Learning: How to Implement a Crawler


This article mainly introduces how to implement a crawler as part of learning Python automated development. It is quite detailed and has some reference value; interested readers should find it worth reading.

Lecturer's blog: https://www.cnblogs.com/wupeiqi/articles/6283017.html

Establish a local cache

A page can be fetched with the command below. But before doing anything else with it, cache the fetched content locally:

import requests

r = requests.get('http://www.autohome.com.cn/news')  # fetch the page
print(r.text)  # print the response content

There is a lot to experiment with here, but avoid fetching the same page every time. Crawling that frequently might get you blocked. So fetch the page once, cache it locally, and you will not need to fetch it again.

What we want to cache is r = requests.get('http://www.autohome.com.cn/news'), that is, the r object. Without caching, r lives only in memory and is gone once the program exits. So serialize the object r and save it as a local file. Since r is a Python object, it cannot be serialized with JSON; use pickle to save it as a binary file instead.

Serialization and deserialization

First, serialize the object and save it as a local binary file:

import pickle

with open('test.pk', 'wb') as f:
    pickle.dump(r, f)

When it is needed again, there is no need to fetch the page with requests.get; just deserialize the local file back into the r object:

import pickle

with open('test.pk', 'rb') as f:
    r = pickle.load(f)

Encapsulate it as a module

Having to think every time about whether the page has already been cached is also troublesome, so encapsulate that check: automatically determine whether a cache exists. If not, fetch the page and create the cache; if it does, read from the cached file.

Create a folder called "pk" to store the cache files. If the Python file you are testing is s1.py, a cache file pk/s1.pk is generated. By checking whether that file exists, you know whether the page has already been cached:

import os
import pickle
import requests

def get_pk_name(path):
    basedir = os.path.dirname(path)
    fullname = os.path.basename(path)
    name = os.path.splitext(fullname)[0]
    pk_name = '%s/pk/%s.%s' % (basedir, name, 'pk')
    return pk_name

pk_name = get_pk_name(__file__)
response = None
if os.path.exists(pk_name):
    print("Already crawled, reading the cached content...")
    with open(pk_name, 'rb') as f:
        response = pickle.load(f)

# fetch the page only if it has not been cached
if not response:
    print("Start crawling the page...")
    response = requests.get('http://www.autohome.com.cn/news')
    # remember to save after fetching, so there is no need to fetch next time
    with open(pk_name, 'wb') as f:
        pickle.dump(response, f)

# the real code goes here
print(response.text)

Requests

Chinese official document: http://cn.python-requests.org/zh_CN/latest/user/quickstart.html

Install the module:

pip install requests

Send a request

r = requests.get('http://www.autohome.com.cn/news')

Read response content

print(r.text)

Text encoding

The output above may be garbled, which means the encoding is wrong. You can check the current encoding, and you can also change it. The default encoding is 'ISO-8859-1':

print(r.encoding)
r.encoding = 'ISO-8859-1'

Alternatively, you can detect the page's encoding automatically to solve the garbled-text problem:

r.encoding = r.apparent_encoding
print(r.text)

Binary response content

If you want to determine the encoding yourself, this is also where to look for it.

print(r.content)

The binary response content is what you use when downloading files.
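For example, a minimal sketch of saving a binary response to a file; the URL here is just a placeholder for some downloadable resource:

import requests

# hypothetical URL of a small binary resource; substitute a real file or image URL
r = requests.get('http://www.autohome.com.cn/favicon.ico')
with open('favicon.ico', 'wb') as f:
    f.write(r.content)  # r.content is bytes, so the file is opened in binary mode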

Response status code

print(r.status_code)

A normal response returns status code 200.
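A minimal sketch of checking the status code before using the response; raise_for_status() is a standard requests helper that raises an exception for 4xx/5xx responses:

import requests

r = requests.get('http://www.autohome.com.cn/news')
if r.status_code == 200:
    print("request succeeded")
r.raise_for_status()  # raises requests.exceptions.HTTPError if the status is 4xx or 5xx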

Cookie

cookie_obj = r.cookies
cookie_dict = r.cookies.get_dict()

r.cookies is an object that behaves like a dictionary but can also be used as an object. You can also convert it to a plain dictionary with the get_dict() method.
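A minimal sketch of both access styles; the cookie name here is hypothetical:

import requests

r = requests.get('http://www.autohome.com.cn/news')
cookie_jar = r.cookies                     # a RequestsCookieJar, dictionary-like
print(cookie_jar.get('some_cookie_name'))  # hypothetical cookie name; returns None if absent
cookie_dict = cookie_jar.get_dict()        # convert to a plain dict
for name, value in cookie_dict.items():
    print(name, value)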

Beautiful Soup

Chinese official document: https://beautifulsoup.readthedocs.io/

Install the module:

pip install beautifulsoup4

Now continue to analyze the content fetched above: first apply the detected encoding to the response, and then parse the r.text response text:

import requests
from bs4 import BeautifulSoup

r = requests.get('http://www.autohome.com.cn/news')
r.encoding = r.apparent_encoding
soup = BeautifulSoup(r.text, features='html.parser')

The features parameter specifies the parser. html.parser is the default; it is not very efficient, but it requires no extra installation. In a production environment there are more efficient parsers.
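For example, lxml is a commonly used faster parser, but it must be installed separately (pip install lxml). A minimal sketch of swapping it in:

import requests
from bs4 import BeautifulSoup

# assumes lxml has been installed separately: pip install lxml
r = requests.get('http://www.autohome.com.cn/news')
r.encoding = r.apparent_encoding
soup = BeautifulSoup(r.text, features='lxml')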

Now you finally have a soup object, and it offers a series of methods for extracting all kinds of content.

Search methods

The soup.find method finds the first object that matches the criteria. You can search by tag, by id, and so on, or combine multiple conditions:

Soup.find ("div") soup.find (id= "link3") soup.find ("div", id= "link3")

The soup.find_all method is used in the same way as find; in fact, find is implemented by calling find_all. find_all returns all objects that match the criteria, as a list.
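A minimal sketch of working with a find_all result; the tag name here is just illustrative:

import requests
from bs4 import BeautifulSoup

r = requests.get('http://www.autohome.com.cn/news')
r.encoding = r.apparent_encoding
soup = BeautifulSoup(r.text, features='html.parser')
li_list = soup.find_all('li')   # a list of matching tag objects
print(len(li_list))
for li in li_list:
    print(li.text)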

Printing objects and their text

Printing an object directly prints the entire HTML tag. If you only need the text inside the tag, use the object's text property:

soup = BeautifulSoup(r.text, features='html.parser')
target = soup.find('div', {'class': "article-bar"})
print(type(target), target, target.text)

Get all the attributes of an object

The object's attrs property holds all the attributes of the HTML tag:

target = soup.find(id='auto-channel-lazyload-article')
print(target.attrs)

Get the value of the attribute

Using the get method, you can get the corresponding value by the attribute's key. Either of the following two ways works:

v1 = target.get('name')
v2 = target.attrs.get('value')

# source code of the get method
def get(self, key, default=None):
    """Returns the value of the 'key' attribute for the tag, or
    the value given for 'default' if it doesn't have that attribute."""
    return self.attrs.get(key, default)

Hands-on practice

With just the knowledge above, we can start the hands-on practice below.

Crawl news from the Autohome news site

The following code finds the address and title of each news item's a link, and finally downloads the corresponding picture to the local disk (create an img folder first):

# check_cache.py "" small module used to check whether there is a local cache "" import osdef get_pk_name (path): basedir = os.path.dirname (path) fullname = os.path.basename (path) name = os.path.splitext (fullname) [0] pk_name ='% s pk_name% s.% s'% (basedir, name 'pk') return pk_name# s1.py "" crawl Auto Home New website Consulting "" import osimport pickleimport requestsfrom bs4 import BeautifulSoupfrom check_cache import get_pk_namepk_name = get_pk_name (_ _ file__) response = Noneif os.path.exists (pk_name): print ("already crawled Get the contents of the cache. ") With open (pk_name, 'rb') as f: response = pickle.load (f) # crawl only if the page has not been cached if not response: print ("start crawling page...") Response = requests.get ('http://www.autohome.com.cn/news') # remember to save after climbing, next time you don't have to crawl with open (pk_name,' wb') as f: pickle.dump (response, f) response.encoding = response.apparent_encoding # to get the page code Solve the garbled problem # print (response.text) soup = BeautifulSoup (response.text Features='html.parser') target = soup.find (id='auto-channel-lazyload-article') # print (target) # obj = target.find ('li') # print (obj) li_list = target.find_all (' li') # print (li_list) for i in li_list: a = i.find ('a') # print (a) # print (a.attrs) # some li tags do not have a tag So you may get an error if a: # it's good to judge # print (a.attrs) # this is a dictionary print (a.attrs.get ('href')) # then use the method of manipulating the dictionary to get the value # tittle = a.find (' h4') # this type is the object tittle = a.find ('h4'). Text # this is the way to get the text print (tittle Type (tittle)) # but it's pretty much printed. Will become a string, the difference is that the h4 tag img_url = a.find ('img'). Attrs.get (' src') print (img_url) # gets the url of the picture. You can now download img_response = requests.get ("http:%s"% img_url) if'/'in tittle: file_name = "img/%s%s"% (tittle.replace ('/','_'), os.path.splitext (img_url) [1]) else: file_name = "img/%s%s"% (tittle Os.path.splitext (img_url) [1]) with open (file_name, 'wb') as f: f.write (img_response.content) login drawer

Next there is a login problem to solve.

There are two types of login: Form form submission and AJAX request. This website logs in with an AJAX request.

A few screenshots from the browser's developer tools showed where the login request is submitted, what information is submitted, and what is returned.

(Screenshots of the AJAX login request, its request body, and its response body are not included here.)

The code for the login request is as follows:

import requests

post_dict = {
    'phone': '8613507293881',  # found in the request body; 86 is prepended to the mobile number
    'password': '123456',
}
# all the request headers can be found in the browser, but not every one is required
headers = {
    'User-Agent': '',  # this website checks that the header exists, but any value will do
}
# the requested url and the request method can also be seen in the developer tools
response = requests.post(
    url='https://dig.chouti.com/login',
    data=post_dict,
    headers=headers,
)
print(response.text)
# the returned cookies matter too; the key to a successful login is getting the right cookie
cookie_dict = response.cookies.get_dict()
print(cookie_dict)

The login mechanism

The username and password used above are wrong. Before continuing with login authentication, take a look at how the login mechanism works.

Logging in means submitting authentication information, usually a username and password. After the credentials are verified, the server records a session and returns a cookie to the client. From then on the client carries that cookie with every request, and the server knows which user submitted each request it receives.

This site is a little different, however. When submitting the authentication information, the user has to submit not only a username and password but also a gpsd value. After the server verifies the credentials, it records the gpsd received with that request, and from then on the user authenticates with this gpsd in the cookie. The gpsd for the authentication request can be obtained from the cookie returned by the first GET request. In addition, after verification succeeds the server returns another cookie that also contains a gpsd, but that one is new and useless; it is confusing and causes some trouble during verification.

The only way to handle this kind of special case is to open the browser's debugging tools and work it out bit by bit.

Log in and like it

The following logs in, gets the title and id of the first news item, and sends a POST request to like it:

import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': '',  # this website checks that the header exists, but any value will do
}

r1 = requests.get('https://dig.chouti.com', headers=headers)
r1_cookies = r1.cookies  # there is a gpsd in here
print(r1_cookies.get_dict())

# the password cannot be committed, so read it from a local file
with open('password/s2.txt') as f:
    auth = f.read()
auth = auth.split('\n')
post_dict = {
    'phone': '86%s' % auth[0],  # found in the request body; 86 is prepended to the mobile number
    'password': auth[1],
}

# the login mechanism of this website: send the authentication information plus the gpsd in the cookies.
# once that gpsd is authorized, only requests that carry the authorized gpsd in their cookies pass authentication
r2 = requests.post(
    url='https://dig.chouti.com/login',
    data=post_dict,
    headers=headers,
    cookies={'gpsd': r1_cookies['gpsd']},
)
print(r2.text)
r2_cookies = r2.cookies  # a new gpsd is returned here, but it is useless
print(r2_cookies.get_dict())

# get the news list, then like the first item
r3 = requests.get(
    url='https://dig.chouti.com',
    headers=headers,
    cookies={'gpsd': r1_cookies['gpsd']},
)
r3.encoding = r3.apparent_encoding
soup = BeautifulSoup(r3.text, features='html.parser')
target = soup.find(id='content-list')
item = target.find('div', {'class': 'item'})  # only like the first item
news = item.find('a', {'class': 'show-content'}).text
linksId = item.find('div', {'class': 'part2'}).attrs['share-linkid']
print('news:', news.strip())

# like it
r = requests.post(
    url='https://dig.chouti.com/link/vote?linksId=%s' % linksId,
    headers=headers,
    cookies={'gpsd': r1_cookies['gpsd']},
)
print(r.text)

Requests module details

Look at the source code of the requests.get() method. In the requests/api.py file there are the following methods:

requests.get()
requests.options()
requests.head()
requests.post()
requests.put()
requests.patch()
requests.delete()

There is also a requests.request() method; all of the methods above ultimately call it. Let's take a look at the parameters these methods provide.

Parameters

All the parameters of the requests.request() method are as follows (a sketch that combines several of them follows the list):

method: the request method. When request() is called from the helper methods above, this is filled in by the caller.
url: the request address.
params: parameters passed in the URL, i.e. GET-style query parameters.
data: parameters passed in the request body, the content submitted by a Form form.
json: parameters passed in the request body, the content submitted by AJAX. Unlike data, the argument is serialized and sent as one whole string.
headers: request headers. Several important request headers are covered below.
cookies: the Cookies, sent to the server in the Cookie request header.
files: upload files. Usage examples are given below.
auth: sets the authentication information for HTTP Auth. Expanded on below.
timeout: timeout in seconds, as a float. There is a connect timeout and a read timeout, and a single value sets both. It can also be a tuple that sets the two separately: (connect timeout, read timeout).
allow_redirects: whether redirects are followed. The default is True.
proxies: use a proxy. Expanded on below.
verify: for https requests, if set to False the certificate is ignored.
stream: streaming download. If False, everything is downloaded into memory at once, which is a problem if the content is too large. Expanded on below.
cert: if a certificate file is required to submit the request, set cert.
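As a rough sketch (not the module's documentation), here is a call that combines several of these parameters; the URL and values are placeholders:

import requests

r = requests.request(
    method='GET',
    url='http://httpbin.org/get',
    params={'k1': 'v1', 'k2': 'v2'},   # appended to the url as ?k1=v1&k2=v2
    headers={'User-Agent': 'my-crawler'},
    timeout=(3.05, 27),                # (connect timeout, read timeout)
    allow_redirects=True,
)
print(r.url)
print(r.status_code)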

The data and json parameters

Both parameters pass data in the request body, but the format differs; what finally travels over the network must be a serialized string. The two types also generate different request headers. You can find the following code in the requests/models.py file:

if not data and json is not None:
    content_type = 'application/json'

if data:
    if isinstance(data, basestring) or hasattr(data, 'read'):
        content_type = None
    else:
        content_type = 'application/x-www-form-urlencoded'

That is, different formats set different Content-Type request headers:

data request header: 'application/x-www-form-urlencoded'

json request header: 'application/json'

After receiving the request, the backend can first look at the Content-Type request header and then parse the data in the request body accordingly.

Why use two formats?

Form forms submit the data parameter, and a Form can only submit strings or lists, not dictionaries. That is, the values in the data dictionary can only be strings or lists, never dictionaries. (You cannot nest a dictionary inside the data dictionary.)

If you need to submit a nested dictionary to the backend, you can only use json.
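A minimal sketch of the difference, using httpbin.org as a stand-in backend that echoes what it receives:

import requests

# form-encoded: values may be strings or lists, but not dictionaries
r1 = requests.post('http://httpbin.org/post', data={'k1': 'v1', 'k2': ['v2', 'v3']})

# json-serialized: the whole structure is dumped to one string, so nesting is fine
r2 = requests.post('http://httpbin.org/post', json={'k1': 'v1', 'k2': {'k3': 'v3'}})

print(r1.request.headers['Content-Type'])  # application/x-www-form-urlencoded
print(r2.request.headers['Content-Type'])  # application/json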

Request header

Referer: the URL of the previous request

User-Agent: the browser used by the client
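A minimal sketch of setting these two headers explicitly; the values are only examples:

import requests

headers = {
    'Referer': 'https://dig.chouti.com',  # pretend the previous page was the site's front page
    'User-Agent': 'Mozilla/5.0',          # pretend to be a regular browser
}
r = requests.get('https://dig.chouti.com', headers=headers)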

Send a file

This is the most basic usage. The dictionary key 'f1' is the field name in the Form. This example submits the request with the request method; in the following examples only file_dict differs:

file_dict = {'f1': open('test1.txt', 'rb')}
requests.request(method='POST', url='http://127.0.0.1:8000/test/', files=file_dict)

Custom file name:

file_dict = {'f2': ('mytest.txt', open('test2.txt', 'rb'))}

Custom file contents (there is no file object here, so of course you have to provide the file name yourself):

file_dict = {'f3': ('test3.txt', "content you write yourself or read from a file")}

HTTP Auth

HTTP Auth is basic connection authentication. For example, home routers and APs that pop up a login box when you access their web interface (a basic login box, not a modal dialog on the page) use this kind of authentication. It base64-encodes the username and password and sends them in the Authorization request header.

Sample code:

import requests

def param_auth():
    from requests.auth import HTTPBasicAuth
    ret = requests.get('https://api.github.com/user',
                       auth=HTTPBasicAuth('wupeiqi', 'sdfasdfasdf'))
    print(ret.text)

There are several classes in requests.auth, which should be different encryption or authentication schemes, but in essence they all encode the authentication information and send it in the request header. Here HTTPBasicAuth is used as an example; its source code is as follows:

class HTTPBasicAuth(AuthBase):
    """Attaches HTTP Basic Authentication to the given Request object."""

    def __init__(self, username, password):
        self.username = username
        self.password = password

    def __eq__(self, other):
        return all([
            self.username == getattr(other, 'username', None),
            self.password == getattr(other, 'password', None)
        ])

    def __ne__(self, other):
        return not self == other

    def __call__(self, r):
        r.headers['Authorization'] = _basic_auth_str(self.username, self.password)
        return r

The process above is simple: encode the username and password with the _basic_auth_str method and add the result to the Authorization request header.

This kind of authentication is fairly simple, and websites published on the public Internet will not use it.

Proxies

Write all the proxy settings in a dictionary. Using a proxy looks like this:

import requests

proxies1 = {
    'http': '61.172.249.96:80',             # http requests go through this proxy
    'https': 'http://61.185.219.126:3128',  # https requests go through this proxy
}
proxies2 = {
    'http://10.20.1.128': 'http://10.10.1.10:5323',  # only requests to this specific host use the proxy
}
r = requests.get('http://www.google.com', proxies=proxies1)

If the proxy requires a username and password, you need the auth described above. Here, too, auth is placed in the request header:

from requests.auth import HTTPProxyAuth

auth = HTTPProxyAuth('my_username', 'my_password')  # pass in the username and password
r = requests.get('http://www.google.com', proxies=proxies1, auth=auth)

Stream download

After sending the request, don't download all the content immediately (that is, don't load the complete content into memory at once). Instead, download it iteratively, bit by bit:

import requests

def param_stream():
    from contextlib import closing
    with closing(requests.get('http://httpbin.org/get', stream=True)) as r:
        # handle the response here
        for i in r.iter_content():
            print(i)
            # here you could simply open a file in binary mode and write into it
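A sketch of writing a streamed download to a file in chunks; the URL and chunk size are just examples:

import requests

with requests.get('http://httpbin.org/bytes/102400', stream=True) as r:
    with open('download.bin', 'wb') as f:
        for chunk in r.iter_content(chunk_size=8192):
            if chunk:               # skip keep-alive chunks
                f.write(chunk)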

Session

Using requests.Session() automatically manages cookies across multiple requests, and it also lets you set some default information, such as request headers.

The usage is as follows:

import requests

session = requests.Session()  # create a Session instance
# afterwards, make requests with session instead of requests
# for example, a get request:
r1 = session.get('https://dig.chouti.com')

Take a look at the source code:

class Session(SessionRedirectMixin):
    """A Requests session.

    Provides cookie persistence, connection-pooling, and configuration.

    Basic Usage::

      >>> import requests
      >>> s = requests.Session()
      >>> s.get('http://httpbin.org/get')

    Or as a context manager::

      >>> with requests.Session() as s:
      >>>     s.get('http://httpbin.org/get')
    """

    __attrs__ = [
        'headers', 'cookies', 'auth', 'proxies', 'hooks', 'params', 'verify',
        'cert', 'prefetch', 'adapters', 'stream', 'trust_env', 'max_redirects',
    ]

Besides being used after instantiation, it can also be used as a context manager with with, just like file operations.
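A minimal sketch of the context-manager form:

import requests

with requests.Session() as s:
    s.get('http://httpbin.org/get')
# the session and its connection pool are closed automatically on exit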

The values in the __attrs__ list are all the properties that the session automatically manages for us.

For example, headers: the following request headers are added by default to every request sent through the session:

def default_headers():
    """
    :rtype: requests.structures.CaseInsensitiveDict
    """
    return CaseInsensitiveDict({
        'User-Agent': default_user_agent(),
        'Accept-Encoding': ', '.join(('gzip', 'deflate')),
        'Accept': '*/*',
        'Connection': 'keep-alive',
    })

# the default User-Agent looks like "python-requests/2.19.1";
# the requests module's version number follows and will change over time.
# it is easy to replace:
s = requests.Session()
s.headers['User-Agent'] = ""

With that in mind, whenever you send requests, and especially when you need to interact with a site multiple times, just create a Session and make requests through it. All settings are stored in the Session instance, reused, and managed automatically.

Optimize login and like

Rewriting the earlier log-in-and-like example with session makes it much simpler, with no need to handle cookies at all:

import requests
from bs4 import BeautifulSoup

session = requests.Session()
# the default User-Agent is something like "python-requests/2.19.1", which will be blocked by anti-crawling, so change it
session.headers['User-Agent'] = ""
session.get('https://dig.chouti.com')

# the password cannot be committed, so read it from a local file
with open('password/s2.txt') as f:
    auth = f.read()
auth = auth.split('\n')
post_dict = {
    'phone': '86%s' % auth[0],  # found in the request body; 86 is prepended to the mobile number
    'password': auth[1],
}
session.post('https://dig.chouti.com/login', data=post_dict)

# get the news list, then like the first item
r3 = session.get('https://dig.chouti.com')
r3.encoding = r3.apparent_encoding
soup = BeautifulSoup(r3.text, features='html.parser')
target = soup.find(id='content-list')
item = target.find('div', {'class': 'item'})
news = item.find('a', {'class': 'show-content'}).text
linksId = item.find('div', {'class': 'part2'}).attrs['share-linkid']
print('news:', news.strip())

# like it
r = session.post('https://dig.chouti.com/link/vote?linksId=%s' % linksId)
print(r.text)

That is all the content of "Python Automated Development Learning: How to Implement a Crawler". Thank you for reading, and I hope it helps!

