The definition of a crawler:
A crawler is a program that fetches information from web pages.
Crawlers are divided into general-purpose crawlers and focused crawlers.
1) General-purpose crawler: collects web pages and gathers information from the Internet to build the index that supports a search engine. It determines whether the content of the whole engine system is rich and whether the information is up to date, so its performance directly affects the effectiveness of the search engine.
Crawl process:
In addition to HTML files, search engines can usually crawl and index a variety of text-based file types, such as PDF, Word, WPS, XLS, PPT, TXT files and so on. We also often see these file types in search results.
Search engines cannot handle non-text content such as pictures, videos, and Flash, nor can they execute scripts and programs.
However, general-purpose search engines also have some limitations:
(1) The results returned by a general-purpose search engine are web pages, and in most cases 90% of the content of those pages is useless to the user.
(2) Users from different fields and backgrounds often have different retrieval purposes and needs, and a search engine cannot provide results tailored to a specific user.
(3) As data forms on the World Wide Web grow richer and network technology keeps developing, large amounts of data such as images, databases, and audio and video multimedia appear; a general-purpose search engine is powerless with these files and cannot discover or retrieve them well.
(4) Most general-purpose search engines provide keyword-based retrieval, so it is difficult for them to support queries based on semantic information or to accurately understand a user's specific needs.
2) Focused crawler
A focused crawler is a "topic-oriented" web crawler. It differs from a general search engine crawler in that a focused crawler processes and filters the content while crawling, trying to ensure that only web page information relevant to the requirements is crawled.
What happens when the browser sends an HTTP request:
1) When the user enters a URL in the browser's address bar and presses Enter, the browser sends an HTTP request to the HTTP server. HTTP requests are mainly divided into two methods, "GET" and "POST".
2) When we enter the URL http://www.baidu.com in the browser, the browser sends a Request to fetch the html file of http://www.baidu.com, and the server sends the Response file object back to the browser.
3) The browser analyzes the HTML in the Response and finds that it references many other files, such as image files, CSS files, and JS files. The browser then automatically sends Requests again to fetch those images, CSS files, or JS files.
4) When all the files have been downloaded successfully, the web page is displayed completely according to the HTML syntax structure.
URL (abbreviation for Uniform / Universal Resource Locator): a uniform resource locator is an identification method used to fully describe the addresses of web pages and other resources on the Internet.
Basic format: scheme://host[:port#]/path/.../[?query-string][#anchor]
scheme: protocol (e.g. http, https, ftp)
host: IP address or domain name of the server
port#: port of the server (can be omitted when the protocol's default port is used, e.g. 80 for HTTP)
path: path to the resource
query-string: parameters, the data sent to the HTTP server
anchor: anchor (jumps to the specified anchor position on the web page)
For example:
ftp://192.168.0.116:8080/index
http://www.baidu.com
http://item.jd.com/11936238.html#product-detail
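As a quick illustration, these components can also be split out programmatically; a minimal sketch using Python 2's standard urlparse module (matching the Python 2 urllib2 examples later in this article):
# url_parts.py - split a URL into its components with urlparse
from urlparse import urlparse

url = "http://item.jd.com/11936238.html#product-detail"
parts = urlparse(url)

print parts.scheme    # http
print parts.netloc    # item.jd.com (host, plus :port if present)
print parts.path      # /11936238.html
print parts.fragment  # product-detail (the anchor)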
5) HTTP requests are mainly divided into two methods: Get and Post:
GET is to get data from the server, and POST is to transfer data to the server.
The parameters of a GET request are displayed in the browser's URL; the HTTP server generates the response content according to the parameters in the URL contained in the request, i.e. the parameters of a "GET" request are part of the URL. For example: http://www.baidu.com/s?wd=Chinese
The parameters of a POST request are placed in the request body; the message length is unlimited and the data is sent implicitly. POST is usually used to submit larger amounts of data to the HTTP server (for example when a request has many parameters, or for a file upload). The media type and encoding of the message body are indicated by the Content-Type header.
Note: avoid using GET to submit forms, as this may cause security problems. For example, if GET is used in a login form, the user name and password entered by the user are exposed in the address bar.
6) Commonly used request headers:
Host (host and port number)
Connection (connection type, e.g. keep-alive)
Upgrade-Insecure-Requests (request to upgrade to HTTPS)
User-Agent (browser identification)
Accept (content types the client can accept)
Referer (the page the request came from)
Accept-Encoding (content encodings the client can decode)
Accept-Language (preferred languages)
Accept-Charset (accepted character encodings)
Cookie (cookies previously set by the server)
Content-Type (media type of the POST data)
(A short sketch of setting several of these headers follows this list.)
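A minimal sketch of attaching several of these headers to a request with urllib2 (the library covered in detail below); the header values here are only illustrative:
# request_headers.py - illustrative header values only
import urllib2

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "zh-CN,zh;q=0.8,en;q=0.6",
    "Connection": "keep-alive",
}
request = urllib2.Request("http://www.baidu.com/", headers=headers)
response = urllib2.urlopen(request)
print response.getcode()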
7) Commonly used response headers (for reference)
Cache-Control: must-revalidate, no-cache, private
This value tells the client that the server does not want it to cache the resource; the next time the resource is needed, it must be requested from the server again and must not be taken from a cached copy.
Connection:keep-alive
This field answers the client's Connection: keep-alive, telling the client that the server's TCP connection is also a persistent connection and that the client can continue to send HTTP requests over this TCP connection.
Content-Encoding:gzip
Tell the client that the resource sent by the server is encoded by gzip, and when the client sees this information, it should use gzip to decode the resource.
Content-Type:text/html;charset=UTF-8
Tells the client the type of the resource file and its character encoding; the client decodes the resource with utf-8 and then parses it as html. When we see garbled text on some websites, it is often because the server did not return the correct encoding.
Date:Sun, 21 Sep 2016 06:18:21 GMT
This is the server time when the resource was sent; GMT is the standard time at Greenwich. All times sent in the HTTP protocol are in GMT, mainly to avoid confusion when clients in different time zones request resources from one another over the Internet.
Expires:Sun, 1 Jan 2000 01:00:00 GMT
This response header is also cache-related. It tells the client that before this time it may use the cached copy directly. This value can obviously be problematic, because the client's and server's clocks may differ, and any difference causes trouble. So this response header is less accurate than Cache-Control: max-age=*, because max-age is a relative time, which is both easier to understand and more precise.
Pragma:no-cache
Its meaning is equivalent to Cache-Control: no-cache.
Server:Tengine/1.4.6
This identifies the server software and its version; it simply tells the client about the server.
Transfer-Encoding:chunked
This response header tells the client that the server sends the resource in chunks. Chunked resources are generally generated dynamically by the server, so their total size is not known when sending begins, hence the chunked transfer. Each chunk is independent and states its own length; the last chunk has length 0, and when the client reads this zero-length chunk it knows the resource has been fully transmitted.
Vary: Accept-Encoding
Tell the cache server to cache both compressed and uncompressed files, and this field is not very useful now, because today's browsers all support compression.
8) response status code
The response status code consists of three digits, the first of which defines the category of the response and has five possible values.
Common status codes:
100-199: the server has received part of the request; the client should continue submitting the rest to complete the whole process.
200-299: the server successfully received the request and completed processing. 200 (OK, the request succeeded) is the most common.
300-399: further action is needed to complete the request, for example the requested resource has moved to a new address. Common codes are 302 (the requested page has temporarily moved to a new url), 307 and 304 (use the cached resource).
400-499: client request error. Common codes are 404 (the server cannot find the requested page) and 403 (the server refuses access, insufficient permissions).
500-599: an error occurred on the server side. 500 is common (the request was not completed; the server encountered an unexpected condition).
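As a small illustration in plain Python (no libraries assumed), the leading digit of the code alone determines the category:
# status_category.py - classify an HTTP status code by its leading digit
def status_category(code):
    categories = {
        1: "informational",
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    return categories.get(code // 100, "unknown")

print status_category(200)  # success
print status_category(404)  # client error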
9) Cookie and Session:
The interaction between the server and the client is limited to the request/response cycle; the connection is closed afterwards, and on the next request the server treats the client as a new one.
To maintain the link between them and let the server know that a request was sent by a previously seen user, the client's information must be stored somewhere.
Cookie: the identity of the user is determined by the information recorded on the client.
Session: the identity of the user is determined by the information recorded on the server side.
Several libraries commonly used in crawler programs
1 urllib2 library
1) urllib2 is a module that ships with Python 2.7 (no need to download it; just import it and use it).
urllib2 official documentation: https://docs.python.org/2/library/urllib2.html
urllib2 source code: https://hg.python.org/cpython/file/2.7/Lib/urllib2.py
In Python 3.x, urllib2 was changed to urllib.request.
2) The Request() and urlopen() methods commonly used in this library
import urllib2

# the url, passed as the argument to Request(), builds and returns a Request object
request = urllib2.Request("http://www.baidu.com")

# the Request object, passed as the argument to urlopen(), is sent to the server, and the response is received
response = urllib2.urlopen(request)

html = response.read()
print html
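For reference, since urllib2 became urllib.request in Python 3.x as noted above, a minimal sketch of the same request under Python 3 might look like this:
# python3_equivalent.py - the same request using Python 3's urllib.request
import urllib.request

request = urllib.request.Request("http://www.baidu.com")
response = urllib.request.urlopen(request)
html = response.read()        # bytes in Python 3
print(html.decode("utf-8"))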
3) When creating a new Request instance, in addition to the url parameter you can also set two other parameters:
data (empty by default): the data to submit along with the url (for example data to POST); when it is supplied, the HTTP request changes from "GET" to "POST".
headers (empty by default): a dictionary containing the key-value pairs of HTTP headers to send.
We'll talk about these two parameters below.
User-Agent
If we use a legitimate identity to request other people's websites, they will obviously welcome it, so we should add an identity to our code: the so-called User-Agent header.
A browser is a recognized, allowed identity in the Internet world. If we want our crawler to look more like a real user, the first step is to pretend to be a recognized browser. Different browsers send different User-Agent headers with their requests. urllib2's default User-Agent header is Python-urllib/x.y (where x and y are the Python major and minor version numbers, e.g. Python-urllib/2.7).
Example:
Randomly add / modify User-Agent
# urllib2_add_headers.py
import urllib2
import random

url = "http://www.itcast.cn"

ua_list = [
    "Mozilla/5.0 (Windows NT 6.1;) Apple....",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0)...",
    "Mozilla/5.0 (Macintosh; U; PPC Mac OS X....",
    "Mozilla/5.0 (Macintosh; Intel Mac OS...",
]

user_agent = random.choice(ua_list)

request = urllib2.Request(url)

# you can also add/modify a specific header by calling Request.add_header()
request.add_header("User-Agent", user_agent)

# when reading a header back, only the first letter is capitalized and the rest are lowercase
request.get_header("User-agent")

response = urllib2.urlopen(request)
html = response.read()
print html
4) urllib2 only supports GET and POST methods of HTTP/HTTPS by default
urllib.urlencode():
urllib and urllib2 are both modules that handle URL requests, but they provide different functions. The two most significant differences are:
urllib can only accept a URL; it cannot create a Request instance with headers set;
but urllib provides the urlencode() method to generate GET query strings, while urllib2 does not. (This is the main reason urllib and urllib2 are often used together.) For encoding, urllib's urlencode() function converts key:value pairs into strings like "key=value"; for decoding, urllib's unquote() function can be used. (Note that it is not urllib2.urlencode().)
In general, data submitted in an HTTP request must first be encoded in URL-encoded format and then either appended as part of the url (GET) or passed as a parameter to the Request object (POST).
GET method:
GET requests are generally used to fetch data from the server. For example, searching Baidu for "Chuanzhi Boke": https://www.baidu.com/s?wd=Chuanzhi Boke
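A minimal sketch of building such a GET query string with urllib.urlencode() and fetching it with urllib2 (the search keyword is just an illustrative value):
# urllib2_get.py - build a GET query string with urllib.urlencode()
import urllib
import urllib2

word = {"wd": "Chuanzhi Boke"}  # illustrative search keyword
# urlencode() converts the dict into a URL-encoded "wd=..." string
querystring = urllib.urlencode(word)

url = "http://www.baidu.com/s?" + querystring
request = urllib2.Request(url)
response = urllib2.urlopen(request)
print response.read()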
5) Handler processor and custom Opener
An opener is an instance of urllib2.OpenerDirector; the urlopen() we have been using is a special opener that the module builds for us.
However, the basic urlopen() method does not support proxies, cookies, or other HTTP/HTTPS advanced features. To support these features:
use the relevant Handler processors to create processor objects with the desired functionality;
then pass these handler objects to urllib2.build_opener() to create a custom opener object;
finally, call the custom opener object's open() method to send requests, as the sketch below shows.
If all requests in the program should use the custom opener, you can call urllib2.install_opener() to make it the global opener, which means that later calls to urlopen() will also use this opener (choose according to your own needs).
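A minimal sketch of this handler/opener pattern, using urllib2.HTTPHandler with debuglevel=1 (a handler that prints the HTTP traffic) as just one example of a function-specific handler:
# urllib2_opener.py - build a custom opener from a Handler
import urllib2

# a handler that prints request/response details for debugging
http_handler = urllib2.HTTPHandler(debuglevel=1)

# build a custom opener from the handler
opener = urllib2.build_opener(http_handler)

# option 1: use the opener directly
request = urllib2.Request("http://www.baidu.com/")
response = opener.open(request)

# option 2: install it globally so plain urlopen() also uses it
# urllib2.install_opener(opener)
# response = urllib2.urlopen(request)

print response.getcode()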
6) Using proxies with a custom opener
ProxyHandler processor (proxy settings)
Proxy servers are configured through ProxyHandler in urllib2. The following code shows how to use a custom opener with a proxy:
import urllib2

# build two proxy handlers, one with a proxy IP and one without
httpproxy_handler = urllib2.ProxyHandler({"http": "124.88.67.81:80"})
nullproxy_handler = urllib2.ProxyHandler({})

proxySwitch = True  # define a proxy switch

# pass the proxy handler objects to urllib2.build_opener() to create custom opener objects
# use a different proxy mode depending on whether the proxy switch is on
if proxySwitch:
    opener = urllib2.build_opener(httpproxy_handler)
else:
    opener = urllib2.build_opener(nullproxy_handler)

request = urllib2.Request("http://www.baidu.com/")

# 1. If written this way, only requests sent with opener.open() use the custom proxy, while urlopen() does not.
response = opener.open(request)

# 2. If written this way, the opener is applied globally, and all requests, whether opener.open() or urlopen(), use the custom proxy.
# urllib2.install_opener(opener)
# response = urlopen(request)

print response.read()
Using open (free) proxies:
Examples of free, short-lived proxy websites:
Xici free proxy IP
Kuaidaili free proxy
Proxy360 proxy
Quanwang proxy IP
If you have enough proxy IPs, you can randomly pick one to access the site, just as we randomly picked a User-Agent earlier.
import urllib2
import random

proxy_list = [
    {"http": "124.88.67.81:80"},
    {"http": "124.88.67.81:80"},
    {"http": "124.88.67.81:80"},
    {"http": "124.88.67.81:80"},
    {"http": "124.88.67.81:80"},
]

# randomly pick a proxy
proxy = random.choice(proxy_list)

# build a proxy handler object with the selected proxy
httpproxy_handler = urllib2.ProxyHandler(proxy)

opener = urllib2.build_opener(httpproxy_handler)

request = urllib2.Request("http://www.baidu.com/")
response = opener.open(request)
print response.read()
However, these free, open proxies are usually used by many people, and they have drawbacks such as short lifetimes, slow speed, low anonymity, and unstable HTTP/HTTPS support.
Therefore, professional crawler engineers and crawler companies use high-quality private proxies.
Private proxies:
HTTPPasswordMgrWithDefaultRealm()
The HTTPPasswordMgrWithDefaultRealm() class creates a password management object that stores the user names and passwords associated with HTTP requests. There are two main scenarios:
verifying the user name and password of an authorized proxy (ProxyBasicAuthHandler());
verifying the user name and password of a Web client (HTTPBasicAuthHandler()).
Example:
import urllib
import urllib2

# user name
user = "test"
# password
passwd = "123456"
# web server IP
webserver = "http://192.168.199.107"

# build a password management object to store the user names and passwords that need to be handled
passwdmgr = urllib2.HTTPPasswordMgrWithDefaultRealm()

# add account information; the first parameter, realm, is domain information related to the remote server
# (usually None is passed), and the last three parameters are the web server, user name and password
passwdmgr.add_password(None, webserver, user, passwd)

# build an HTTPBasicAuthHandler processor object for HTTP basic username/password authentication;
# its parameter is the password management object created above
httpauth_handler = urllib2.HTTPBasicAuthHandler(passwdmgr)

# pass the handler object to build_opener() to create a custom opener object
opener = urllib2.build_opener(httpauth_handler)

# optionally make the opener global through the install_opener() method
urllib2.install_opener(opener)

# build a Request object
request = urllib2.Request("http://192.168.199.107")

# after the opener has been installed globally, urlopen() can be used directly to send the request
response = urllib2.urlopen(request)

# print the response content
print response.read()
7) Cookie:
HTTP is a stateless, connection-oriented protocol. To maintain state across connections, the Cookie mechanism was introduced. A Cookie is an attribute in the HTTP header and includes:
Cookie name (Name)
Value of Cookie (Value)
Expiration time of Cookie (Expires/Max-Age)
Cookie action pathway (Path)
Domain name where Cookie is located (Domain)
Use Cookie for secure connection (Secure).
The first two are required when using Cookies. There is also the Cookie size (Size; limits on the number and size of Cookies vary from browser to browser).
Cookie consists of variable names and values. According to the regulations of Netscape, the Cookie format is as follows:
Set-Cookie: NAME=VALUE;Expires=DATE;Path=PATH;Domain=DOMAIN_NAME;SECURE
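As a small sketch of seeing these attributes in practice with the urllib2 calls shown earlier, the Set-Cookie header of a response can be read directly (whether and what the server sets depends entirely on the site):
# inspect_set_cookie.py - look at the Set-Cookie header of a response
import urllib2

response = urllib2.urlopen("http://www.baidu.com/")

# response.info() returns the response headers; getheader() fetches one of them
print response.info().getheader("Set-Cookie")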
(2) Cookie application
The most typical use of Cookies in crawlers is determining whether a registered user has already logged in to a website; the user may be prompted to keep their information on the next visit so that the login procedure is simplified.
Example 1:
import urllib2

# 1. build the headers information of a user who has already logged in
headers = {
    "Host": "www.renren.com",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,/;q=0.8",
    "Accept-Language": "zh-CN,zh;q=0.8,en;q=0.6",

    # Accept-Encoding is left out so the terminal output is not compressed
    # "Accept-Encoding": "gzip, deflate, sdch",

    # important: this Cookie belongs to a user who does not need to re-enter the password to log in;
    # the user name is recorded in this Cookie, and the password is usually RSA-encrypted
    "Cookie": "anonymid=ixrna3fysufnwv; depovince=GW; _r01encryption1; JSESSIONID=abcmaDhEdqIlM7riy5iMv; jebe_key=f6fb270b-d06d-42e6-8b53Maie67c3156aaa7e%7Cc13c37f53bca9e1e7132d4b58ce00fa3%7C14840607478%7C14840607173; jebecookies=26fb58d1-cbe7-4fc3-a4ad-592233d1b42e |; ick_login=1f2b895d-34c7-4a1dafb7FUD 84666fad409; _de=BF09EE3A28DED52E6B65F6A4705D973F1383380866D39FF5; pendant 99e54330ba9f910b02e6b058f780479; ap=327550029; first_login_flag=1 Ln_uact=mr_mao_hacker@163.com; ln_hurl=http://hdn.xnimg.cn/photos/hdn521/20140529/1055/h_main_9A3Z_e0c300019f6a195a.jpg; tweets 214ca9a28f70ca6aa0801404dda4f6789; societyguester=214ca9a28f70ca6aa0801404dda4f6789; id=327550029; xnsid=745033c5; ver=7.0; loginfrom=syshome",
}

# 2. construct the Request object with the header information (mainly the Cookie) in headers
request = urllib2.Request("http://www.renren.com/", headers=headers)

# 3. visit the Renren home page directly; the server judges from the headers (mainly the Cookie) that this is a logged-in user and returns the corresponding page
response = urllib2.urlopen(request)

# 4. print the response content
print response.read()
But this is cumbersome: we have to log in to the account in a browser, save the password, and capture the traffic to obtain the Cookie. Is there a simpler, more convenient way?
Example 2:
The cookielib library
The main objects of this module are CookieJar, FileCookieJar, MozillaCookieJar and LWPCookieJar.
CookieJar: manages HTTP cookie values, stores the cookies generated by HTTP requests, and adds cookies to outgoing HTTP requests. The whole cookie store is kept in memory; the cookies are lost once the CookieJar instance is garbage-collected.
FileCookieJar(filename, delayload=None, policy=None): derived from CookieJar, used to create a FileCookieJar instance, retrieve cookie information, and store cookies in a file. filename is the name of the file where cookies are stored. When delayload is True, deferred file access is supported, i.e. the file is read or written only when needed.
MozillaCookieJar(filename, delayload=None, policy=None): derived from FileCookieJar, creates a FileCookieJar instance compatible with the Mozilla browser's cookies.txt.
LWPCookieJar(filename, delayload=None, policy=None): derived from FileCookieJar, creates a FileCookieJar instance compatible with the Set-Cookie3 file format of the libwww-perl standard.
In fact, in most cases we only use CookieJar(); if we need to interact with local files, we use MozillaCookieJar() or LWPCookieJar().
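A minimal sketch of the file-backed case, saving cookies with MozillaCookieJar and loading them back later (the file name here is just an illustrative choice):
# cookielib_save.py - save cookies to a Mozilla-format file and reload them
import urllib2
import cookielib

filename = "cookies.txt"  # illustrative file name

# save: attach a MozillaCookieJar to an opener, make a request, then write the jar to disk
cookiejar = cookielib.MozillaCookieJar(filename)
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookiejar))
opener.open("http://www.baidu.com/")
cookiejar.save(ignore_discard=True)   # also keep session cookies

# load: read the cookies back from the file and reuse them for new requests
newjar = cookielib.MozillaCookieJar()
newjar.load(filename, ignore_discard=True)
opener2 = urllib2.build_opener(urllib2.HTTPCookieProcessor(newjar))
print opener2.open("http://www.baidu.com/").getcode()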
Log in to Renren using cookielib and POST

import urllib
import urllib2
import cookielib

# 1. build a CookieJar object instance to save cookies
cookie = cookielib.CookieJar()

# 2. use HTTPCookieProcessor() to create a cookie processor object, with the CookieJar() object as its argument
cookie_handler = urllib2.HTTPCookieProcessor(cookie)

# 3. build an opener through build_opener()
opener = urllib2.build_opener(cookie_handler)

# 4. addheaders accepts a list in which each element is a tuple of header information; the opener will attach these headers to every request
opener.addheaders = [("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36")]

# 5. the account and password required to log in
data = {"email": "mr_mao_hacker@163.com", "password": "alaxxxxxime"}

# 6. encode the data with urlencode()
postdata = urllib.urlencode(data)

# 7. build a Request object that contains the user name and password to send
request = urllib2.Request("http://www.renren.com/PLogin.do", data=postdata)

# 8. send this request through the opener and obtain the post-login cookie value
opener.open(request)

# 9. the opener now holds the user's post-login cookie value and can directly access pages that require login
response = opener.open("http://www.renren.com/410043129/profile")

# 10. print the response content
print response.read()
There are several points to pay attention to when simulating login:
Login usually starts with an HTTP GET, which pulls some information and obtains a Cookie, followed by an HTTP POST to log in.
(1) The link for the HTTP POST login may be dynamic, obtained from the information returned by the GET.
(2) Some passwords are sent in clear text and some are encrypted; some websites even use dynamic encryption, along with encrypting a lot of other data.
(3) The encryption algorithm can only be obtained by reading the JS source code and then reimplementing it, which is very difficult.
(4) The overall login flow of most websites is similar, but some details may differ, so there is no guarantee that the same approach will successfully log in to other sites.
8) Exception and error handling in urllib2
URLError
The main causes of URLError are:
(1) No network connection
(2) Server connection failed
(3) the specified server cannot be found
We can use the try except statement to catch the corresponding exception. In the following example, we visit a domain name that does not exist:
# urllib2_urlerror.py
import urllib2

request = urllib2.Request('http://www.ajkfhafwjqh.com')

try:
    urllib2.urlopen(request, timeout=5)
except urllib2.URLError, err:
    print err
HTTPError
HTTPError is a subclass of URLError. When we make a request, the server returns a response object that includes a numeric "response status code".
If urlopen or opener.open cannot handle the response, an HTTPError is raised with the corresponding status code; the HTTP status code expresses the status of the response returned by the HTTP protocol.
Note that urllib2 handles redirected pages for us (responses whose codes begin with 3), and codes in the 100-299 range indicate success, so we usually only see error codes in the 400-599 range.
Improved version:
Since HTTPError's parent class is URLError, the parent-class exception should be caught after the subclass exception, so the code above can be rewritten as follows:
# urllib2_botherror.py
import urllib2

request = urllib2.Request('http://blog.baidu.com/itcast')

try:
    urllib2.urlopen(request)
except urllib2.HTTPError, err:
    print err.code
except urllib2.URLError, err:
    print err
else:
    print "Good Job"
2 Requests module
Requests inherits all the features of urllib2. Requests supports HTTP connection persistence and connection pooling, cookie session persistence, file upload, automatic determination of response content encoding, and automatic encoding of internationalized URL and POST data.
The underlying implementation of requests is actually urllib3.
The documentation of Requests is very complete, and the Chinese documentation is also quite good. Requests can fully meet the needs of today's web work; it supports Python 2.6-3.5 and runs perfectly under PyPy.
Open source address: https://github.com/kennethreitz/requests
Chinese document API: http://docs.python-requests.org/zh_CN/latest/index.html
GET request:
(1) the most basic GET request can directly use the get method.
import requests

response = requests.get("http://www.baidu.com/")

# it can also be written this way
# response = requests.request("get", "http://www.baidu.com/")
(2) add headers and query parameters
If you want to add headers, pass the headers parameter to add header information to the request. If you want to pass parameters in the url, use the params parameter.
import requests

kw = {'wd': 'Great Wall'}

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"}

# params accepts a dict or string of query parameters; a dict is automatically converted to url encoding, so urlencode() is not needed
response = requests.get("http://www.baidu.com/s?", params=kw, headers=headers)

# view the response content; response.text returns data in Unicode format
print response.text

# view the response content as the byte-stream data returned by response.content
print response.content

# view the full url address
print response.url

# view the character encoding of the response
print response.encoding

# view the response status code
print response.status_code
POST request:
(1) The most basic POST request can use the post method directly.
response = requests.post("http://www.baidu.com/", data=data)
(2) Passing form data with the data parameter
For a POST request we usually need to add some parameters to it; the most basic way to pass them is through the data parameter.
import requests

formdata = {
    "type": "AUTO",
    "i": "i love python",
    "doctype": "json",
    "xmlVersion": "1.8",
    "keyfrom": "fanyi.web",
    "ue": "UTF-8",
    "action": "FY_BY_ENTER",
    "typoResult": "true",
}

url = "http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule&smartresult=ugc&sessionFrom=null"

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"}

response = requests.post(url, data=formdata, headers=headers)

print response.text

# if the response is json, it can be displayed directly
print response.json()
Proxy (proxies parameter)
If you need to use a proxy, you can configure a single request by providing a proxies parameter for any request method:
import requests

# choose different proxies according to the protocol type
proxies = {
    "http": "http://12.34.56.79:9527",
    "https": "http://12.34.56.79:9527",
}

response = requests.get("http://www.baidu.com", proxies=proxies)
print response.text
You can also configure proxies through the local environment variables HTTP_PROXY and HTTPS_PROXY:
export HTTP_PROXY="http://12.34.56.79:9527"
export HTTPS_PROXY="https://12.34.56.79:9527"
Private proxy authentication (specific format) and Web client authentication (auth parameter)
urllib2's approach here is complicated; requests only needs one step:
Private proxy
import requests

# if the proxy requires HTTP Basic Auth, use the following format:
proxy = {"http": "mr_mao_hacker:sffqry9r@61.158.163.130:16816"}

response = requests.get("http://www.baidu.com", proxies=proxy)
print response.text
Web client authentication
For Web client verification, add auth = (account name, password):
import requests

auth = ('test', '123456')
response = requests.get('http://192.168.199.107', auth=auth)
print response.text
Cookies and Session
Cookies
If a response contains cookies, we can get them from the response's cookies attribute:
import requests

response = requests.get("http://www.baidu.com/")

# returns a CookieJar object:
cookiejar = response.cookies

# convert the CookieJar into a dictionary:
cookiedict = requests.utils.dict_from_cookiejar(cookiejar)

print cookiejar
print cookiedict

Running result:
{'BDORZ': '27315'}
Session
In requests, the session object is a very commonly used object that represents one user session: from the client browser connecting to the server until the client browser disconnects from the server.
A session lets us keep certain parameters across requests, for example keeping cookies between all requests made from the same Session instance.
Implementing a login to Renren

import requests

# 1. create a session object to save the cookie values
ssion = requests.session()

# 2. set up the headers
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"}

# 3. user name and password required for login
data = {"email": "mr_mao_hacker@163.com", "password": "alarmchime"}

# 4. send a request with the user name and password, get the post-login cookie values, and save them in ssion
ssion.post("http://www.renren.com/PLogin.do", data=data, headers=headers)

# 5. ssion now holds the user's post-login cookies and can directly access pages that require login
response = ssion.get("http://www.renren.com/410043129/profile")

# 6. print the response content
print response.text
Handling SSL certificate verification for HTTPS requests
Requests can also verify SSL certificates for HTTPS requests:
To check a host's SSL certificate, use the verify parameter (it can also be omitted, since it defaults to True).
import requests

response = requests.get("https://www.baidu.com/", verify=True)

# verify=True can also be omitted, since it is the default
# response = requests.get("https://www.baidu.com/")

print response.text
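Conversely, if a site's certificate cannot be validated (for example an expired or self-signed certificate), verification can be skipped with verify=False. A minimal sketch; the URL is only an illustrative example of such a site, and requests will normally emit an InsecureRequestWarning here, which is expected:
import requests

# skip SSL certificate verification (illustrative URL of a site whose certificate historically failed validation)
response = requests.get("https://www.12306.cn/mormhweb/", verify=False)
print response.status_code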