The definition of a crawler:
A crawler is a program that fetches information from web pages.
Crawlers are divided into general-purpose crawlers and focused crawlers.
1) General-purpose crawler: collects web pages and gathers information from the Internet to build the index that supports a search engine. It determines whether the content of the whole engine system is rich and whether the information is up to date, so its performance directly affects the effectiveness of the search engine.
Crawl process:
In addition to HTML files, search engines can usually crawl and index a variety of text-based file types, such as PDF, Word, WPS, XLS, PPT, TXT files and so on. We also often see these file types in search results.
Search engines cannot handle non-text content such as pictures, videos, and Flash, nor can they execute scripts and programs.
However, general-purpose search engines also have some limitations:
(1) The results returned by a general-purpose search engine are web pages, and in most cases 90% of the content of those pages is useless to the user.
(2) Users from different fields and backgrounds often have different retrieval purposes and needs, and a search engine cannot provide results tailored to a specific user.
(3) As data forms on the World Wide Web grow richer and network technology keeps developing, large amounts of data such as images, databases, and audio and video multimedia appear; a general-purpose search engine is powerless with these files and cannot discover or retrieve them well.
(4) Most general-purpose search engines provide keyword-based retrieval, so it is difficult for them to support queries based on semantic information or to accurately understand a user's specific needs.
2) Focused crawler
A focused crawler is a "topic-oriented" web crawler. It differs from a general search engine crawler in that a focused crawler processes and filters the content while crawling, trying to ensure that only web page information relevant to the requirements is crawled.
What happens when the browser sends an HTTP request:
1) When the user enters a URL in the browser's address bar and presses Enter, the browser sends an HTTP request to the HTTP server. HTTP requests are mainly divided into two methods, "GET" and "POST".
2) When we enter the URL http://www.baidu.com in the browser, the browser sends a Request to fetch the html file of http://www.baidu.com, and the server sends the Response file object back to the browser.
3) The browser analyzes the HTML in the Response and finds that it references many other files, such as image files, CSS files, and JS files. The browser then automatically sends Requests again to fetch those images, CSS files, or JS files.
4) When all the files have been downloaded successfully, the web page is displayed completely according to the HTML syntax structure.
URL (abbreviation for Uniform / Universal Resource Locator): a uniform resource locator is an identification method used to fully describe the addresses of web pages and other resources on the Internet.
Basic format: scheme://host[:port#]/path/.../[?query-string][#anchor]
scheme: protocol (e.g. http, https, ftp)
host: IP address or domain name of the server
port#: port of the server (can be omitted when the protocol's default port is used, e.g. 80 for HTTP)
path: path to the resource
query-string: parameters, the data sent to the HTTP server
anchor: anchor (jumps to the specified anchor position on the web page)
For example:
ftp://192.168.0.116:8080/index
http://www.baidu.com
http://item.jd.com/11936238.html#product-detail
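As a quick illustration, these components can also be split out programmatically; a minimal sketch using Python 2's standard urlparse module (matching the Python 2 urllib2 examples later in this article):
# url_parts.py - split a URL into its components with urlparse
from urlparse import urlparse

url = "http://item.jd.com/11936238.html#product-detail"
parts = urlparse(url)

print parts.scheme    # http
print parts.netloc    # item.jd.com (host, plus :port if present)
print parts.path      # /11936238.html
print parts.fragment  # product-detail (the anchor)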
5) HTTP requests are mainly divided into two methods: Get and Post:
GET is to get data from the server, and POST is to transfer data to the server.
The parameters of a GET request are displayed in the browser's URL; the HTTP server generates the response content according to the parameters in the URL contained in the request, i.e. the parameters of a "GET" request are part of the URL. For example: http://www.baidu.com/s?wd=Chinese
The parameters of a POST request are placed in the request body; the message length is unlimited and the data is sent implicitly. POST is usually used to submit larger amounts of data to the HTTP server (for example when a request has many parameters, or for a file upload). The media type and encoding of the message body are indicated by the Content-Type header.
Note: avoid using GET to submit forms, as this may cause security problems. For example, if GET is used in a login form, the user name and password entered by the user are exposed in the address bar.
6) Commonly used request headers:
Host (host and port number)
Connection (connection type, e.g. keep-alive)
Upgrade-Insecure-Requests (request to upgrade to HTTPS)
User-Agent (browser identification)
Accept (content types the client can accept)
Referer (the page the request came from)
Accept-Encoding (content encodings the client can decode)
Accept-Language (preferred languages)
Accept-Charset (accepted character encodings)
Cookie (cookies previously set by the server)
Content-Type (media type of the POST data)
(A short sketch of setting several of these headers follows this list.)
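A minimal sketch of attaching several of these headers to a request with urllib2 (the library covered in detail below); the header values here are only illustrative:
# request_headers.py - illustrative header values only
import urllib2

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "zh-CN,zh;q=0.8,en;q=0.6",
    "Connection": "keep-alive",
}
request = urllib2.Request("http://www.baidu.com/", headers=headers)
response = urllib2.urlopen(request)
print response.getcode()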
7) Commonly used response headers (for reference)
Cache-Control: must-revalidate, no-cache, private
This value tells the client that the server does not want it to cache the resource; the next time the resource is needed, it must be requested from the server again and must not be taken from a cached copy.
Connection:keep-alive
This field answers the client's Connection: keep-alive, telling the client that the server's TCP connection is also a persistent connection and that the client can continue to send HTTP requests over this TCP connection.
Content-Encoding:gzip
Tell the client that the resource sent by the server is encoded by gzip, and when the client sees this information, it should use gzip to decode the resource.
Content-Type:text/html;charset=UTF-8
Tells the client the type of the resource file and its character encoding; the client decodes the resource with utf-8 and then parses it as html. When we see garbled text on some websites, it is often because the server did not return the correct encoding.
Date:Sun, 21 Sep 2016 06:18:21 GMT
This is the server time when the resource was sent; GMT is the standard time at Greenwich. All times sent in the HTTP protocol are in GMT, mainly to avoid confusion when clients in different time zones request resources from one another over the Internet.
Expires:Sun, 1 Jan 2000 01:00:00 GMT
This response header is also cache-related. It tells the client that before this time it may use the cached copy directly. This value can obviously be problematic, because the client's and server's clocks may differ, and any difference causes trouble. So this response header is less accurate than Cache-Control: max-age=*, because max-age is a relative time, which is both easier to understand and more precise.
Pragma:no-cache
Its meaning is equivalent to Cache-Control: no-cache.
Server:Tengine/1.4.6
This identifies the server software and its version; it simply tells the client about the server.
Transfer-Encoding:chunked
This response header tells the client that the server sends the resource in chunks. Chunked resources are generally generated dynamically by the server, so their total size is not known when sending begins, hence the chunked transfer. Each chunk is independent and states its own length; the last chunk has length 0, and when the client reads this zero-length chunk it knows the resource has been fully transmitted.
Vary: Accept-Encoding
Tell the cache server to cache both compressed and uncompressed files, and this field is not very useful now, because today's browsers all support compression.
8) response status code
The response status code consists of three digits, the first of which defines the category of the response and has five possible values.
Common status codes:
100-199: the server has received part of the request; the client should continue submitting the rest to complete the whole process.
200-299: the server successfully received the request and completed processing. 200 (OK, the request succeeded) is the most common.
300-399: further action is needed to complete the request, for example the requested resource has moved to a new address. Common codes are 302 (the requested page has temporarily moved to a new url), 307 and 304 (use the cached resource).
400-499: client request error. Common codes are 404 (the server cannot find the requested page) and 403 (the server refuses access, insufficient permissions).
500-599: an error occurred on the server side. 500 is common (the request was not completed; the server encountered an unexpected condition).
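As a small illustration in plain Python (no libraries assumed), the leading digit of the code alone determines the category:
# status_category.py - classify an HTTP status code by its leading digit
def status_category(code):
    categories = {
        1: "informational",
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    return categories.get(code // 100, "unknown")

print status_category(200)  # success
print status_category(404)  # client error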
9) Cookie and Session:
The interaction between the server and the client is limited to the request/response cycle; the connection is closed afterwards, and on the next request the server treats the client as a new one.
To maintain the link between them and let the server know that a request was sent by a previously seen user, the client's information must be stored somewhere.
Cookie: the identity of the user is determined by the information recorded on the client.
Session: the identity of the user is determined by the information recorded on the server side.
Several libraries commonly used in crawler programs
1 urllib2 library
1) urllib2 is a module that ships with Python 2.7 (no need to download it; just import it and use it).
urllib2 official documentation: https://docs.python.org/2/library/urllib2.html
urllib2 source code: https://hg.python.org/cpython/file/2.7/Lib/urllib2.py
In Python 3.x, urllib2 was changed to urllib.request.
2) The Request() and urlopen() methods commonly used in this library
import urllib2

# the url, passed as the argument to Request(), builds and returns a Request object
request = urllib2.Request("http://www.baidu.com")

# the Request object, passed as the argument to urlopen(), is sent to the server, and the response is received
response = urllib2.urlopen(request)

html = response.read()
print html
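For reference, since urllib2 became urllib.request in Python 3.x as noted above, a minimal sketch of the same request under Python 3 might look like this:
# python3_equivalent.py - the same request using Python 3's urllib.request
import urllib.request

request = urllib.request.Request("http://www.baidu.com")
response = urllib.request.urlopen(request)
html = response.read()        # bytes in Python 3
print(html.decode("utf-8"))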
3) When creating a new Request instance, in addition to the url parameter you can also set two other parameters:
data (empty by default): the data to submit along with the url (for example data to POST); when it is supplied, the HTTP request changes from "GET" to "POST".
headers (empty by default): a dictionary containing the key-value pairs of HTTP headers to send.
We'll talk about these two parameters below.
User-Agent
If we use a legitimate identity to request other people's websites, they will obviously welcome it, so we should add an identity to our code: the so-called User-Agent header.
A browser is a recognized, allowed identity in the Internet world. If we want our crawler to look more like a real user, the first step is to pretend to be a recognized browser. Different browsers send different User-Agent headers with their requests. urllib2's default User-Agent header is Python-urllib/x.y (where x and y are the Python major and minor version numbers, e.g. Python-urllib/2.7).
Example:
Randomly add / modify User-Agent
# urllib2_add_headers.py
import urllib2
import random

url = "http://www.itcast.cn"

ua_list = [
    "Mozilla/5.0 (Windows NT 6.1;) Apple....",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0)...",
    "Mozilla/5.0 (Macintosh; U; PPC Mac OS X....",
    "Mozilla/5.0 (Macintosh; Intel Mac OS...",
]

user_agent = random.choice(ua_list)

request = urllib2.Request(url)

# you can also add/modify a specific header by calling Request.add_header()
request.add_header("User-Agent", user_agent)

# when reading a header back, only the first letter is capitalized and the rest are lowercase
request.get_header("User-agent")

response = urllib2.urlopen(request)
html = response.read()
print html
4) urllib2 only supports GET and POST methods of HTTP/HTTPS by default
urllib.urlencode():
urllib and urllib2 are both modules that handle URL requests, but they provide different functions. The two most significant differences are:
urllib can only accept a URL; it cannot create a Request instance with headers set;
but urllib provides the urlencode() method to generate GET query strings, while urllib2 does not. (This is the main reason urllib and urllib2 are often used together.) For encoding, urllib's urlencode() function converts key:value pairs into strings like "key=value"; for decoding, urllib's unquote() function can be used. (Note that it is not urllib2.urlencode().)
In general, data submitted in an HTTP request must first be encoded in URL-encoded format and then either appended as part of the url (GET) or passed as a parameter to the Request object (POST).
GET method:
GET requests are generally used to fetch data from the server. For example, searching Baidu for "Chuanzhi Boke": https://www.baidu.com/s?wd=Chuanzhi Boke
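A minimal sketch of building such a GET query string with urllib.urlencode() and fetching it with urllib2 (the search keyword is just an illustrative value):
# urllib2_get.py - build a GET query string with urllib.urlencode()
import urllib
import urllib2

word = {"wd": "Chuanzhi Boke"}  # illustrative search keyword
# urlencode() converts the dict into a URL-encoded "wd=..." string
querystring = urllib.urlencode(word)

url = "http://www.baidu.com/s?" + querystring
request = urllib2.Request(url)
response = urllib2.urlopen(request)
print response.read()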
5) Handler processor and custom Opener
An opener is an instance of urllib2.OpenerDirector; the urlopen() we have been using is a special opener that the module builds for us.
However, the basic urlopen() method does not support proxies, cookies, or other HTTP/HTTPS advanced features. To support these features:
use the relevant Handler processors to create processor objects with the desired functionality;
then pass these handler objects to urllib2.build_opener() to create a custom opener object;
finally, call the custom opener object's open() method to send requests, as the sketch below shows.
If all requests in the program should use the custom opener, you can call urllib2.install_opener() to make it the global opener, which means that later calls to urlopen() will also use this opener (choose according to your own needs).
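A minimal sketch of this handler/opener pattern, using urllib2.HTTPHandler with debuglevel=1 (a handler that prints the HTTP traffic) as just one example of a function-specific handler:
# urllib2_opener.py - build a custom opener from a Handler
import urllib2

# a handler that prints request/response details for debugging
http_handler = urllib2.HTTPHandler(debuglevel=1)

# build a custom opener from the handler
opener = urllib2.build_opener(http_handler)

# option 1: use the opener directly
request = urllib2.Request("http://www.baidu.com/")
response = opener.open(request)

# option 2: install it globally so plain urlopen() also uses it
# urllib2.install_opener(opener)
# response = urllib2.urlopen(request)

print response.getcode()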
6) Using proxies with a custom opener
ProxyHandler processor (proxy settings)
Proxy servers are configured through ProxyHandler in urllib2. The following code shows how to use a custom opener with a proxy:
import urllib2

# build two proxy handlers, one with a proxy IP and one without
httpproxy_handler = urllib2.ProxyHandler({"http": "124.88.67.81:80"})
nullproxy_handler = urllib2.ProxyHandler({})

proxySwitch = True  # define a proxy switch

# pass the proxy handler objects to urllib2.build_opener() to create custom opener objects
# use a different proxy mode depending on whether the proxy switch is on
if proxySwitch:
    opener = urllib2.build_opener(httpproxy_handler)
else:
    opener = urllib2.build_opener(nullproxy_handler)

request = urllib2.Request("http://www.baidu.com/")

# 1. If written this way, only requests sent with opener.open() use the custom proxy, while urlopen() does not.
response = opener.open(request)

# 2. If written this way, the opener is applied globally, and all requests, whether opener.open() or urlopen(), use the custom proxy.
# urllib2.install_opener(opener)
# response = urlopen(request)

print response.read()
Using open (free) proxies:
Examples of free, short-lived proxy websites:
Xici free proxy IP
Kuaidaili free proxy
Proxy360 proxy
Quanwang proxy IP
If you have enough proxy IPs, you can randomly pick one to access the site, just as we randomly picked a User-Agent earlier.
import urllib2
import random

proxy_list = [
    {"http": "124.88.67.81:80"},
    {"http": "124.88.67.81:80"},
    {"http": "124.88.67.81:80"},
    {"http": "124.88.67.81:80"},
    {"http": "124.88.67.81:80"},
]

# randomly pick a proxy
proxy = random.choice(proxy_list)

# build a proxy handler object with the selected proxy
httpproxy_handler = urllib2.ProxyHandler(proxy)

opener = urllib2.build_opener(httpproxy_handler)

request = urllib2.Request("http://www.baidu.com/")
response = opener.open(request)
print response.read()
However, these free, open proxies are usually used by many people, and they have drawbacks such as short lifetimes, slow speed, low anonymity, and unstable HTTP/HTTPS support.
Therefore, professional crawler engineers and crawler companies use high-quality private proxies.
Private proxies:
HTTPPasswordMgrWithDefaultRealm()
The HTTPPasswordMgrWithDefaultRealm() class creates a password management object that stores the user names and passwords associated with HTTP requests. There are two main scenarios:
verifying the user name and password of an authorized proxy (ProxyBasicAuthHandler());
verifying the user name and password of a Web client (HTTPBasicAuthHandler()).
Example:
import urllib
import urllib2

# user name
user = "test"
# password
passwd = "123456"
# web server IP
webserver = "http://192.168.199.107"

# build a password management object to store the user names and passwords that need to be handled
passwdmgr = urllib2.HTTPPasswordMgrWithDefaultRealm()

# add account information; the first parameter, realm, is domain information related to the remote server
# (usually None is passed), and the last three parameters are the web server, user name and password
passwdmgr.add_password(None, webserver, user, passwd)

# build an HTTPBasicAuthHandler processor object for HTTP basic username/password authentication;
# its parameter is the password management object created above
httpauth_handler = urllib2.HTTPBasicAuthHandler(passwdmgr)

# pass the handler object to build_opener() to create a custom opener object
opener = urllib2.build_opener(httpauth_handler)

# optionally make the opener global through the install_opener() method
urllib2.install_opener(opener)

# build a Request object
request = urllib2.Request("http://192.168.199.107")

# after the opener has been installed globally, urlopen() can be used directly to send the request
response = urllib2.urlopen(request)

# print the response content
print response.read()
7) Cookie:
HTTP is a stateless, connection-oriented protocol. To maintain state across connections, the Cookie mechanism was introduced. A Cookie is an attribute in the HTTP header and includes:
Cookie name (Name)
Value of Cookie (Value)
Expiration time of Cookie (Expires/Max-Age)
Cookie action pathway (Path)
Domain name where Cookie is located (Domain)
Use Cookie for secure connection (Secure).
The first two are required when using Cookies. There is also the Cookie size (Size; limits on the number and size of Cookies vary from browser to browser).
Cookie consists of variable names and values. According to the regulations of Netscape, the Cookie format is as follows:
Set-Cookie: NAME=VALUE;Expires=DATE;Path=PATH;Domain=DOMAIN_NAME;SECURE
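As a small sketch of seeing these attributes in practice with the urllib2 calls shown earlier, the Set-Cookie header of a response can be read directly (whether and what the server sets depends entirely on the site):
# inspect_set_cookie.py - look at the Set-Cookie header of a response
import urllib2

response = urllib2.urlopen("http://www.baidu.com/")

# response.info() returns the response headers; getheader() fetches one of them
print response.info().getheader("Set-Cookie")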
(2) Cookie application
The most typical use of Cookies in crawlers is determining whether a registered user has already logged in to a website; the user may be prompted to keep their information on the next visit so that the login procedure is simplified.
Example 1:
import urllib2

# 1. build the headers information of a user who has already logged in
headers = {
    "Host": "www.renren.com",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,/;q=0.8",
    "Accept-Language": "zh-CN,zh;q=0.8,en;q=0.6",

    # Accept-Encoding is left out so the terminal output is not compressed
    # "Accept-Encoding": "gzip, deflate, sdch",

    # important: this Cookie belongs to a user who does not need to re-enter the password to log in;
    # the user name is recorded in this Cookie, and the password is usually RSA-encrypted
    "Cookie": "anonymid=ixrna3fysufnwv; depovince=GW; _r01encryption1; JSESSIONID=abcmaDhEdqIlM7riy5iMv; jebe_key=f6fb270b-d06d-42e6-8b53Maie67c3156aaa7e%7Cc13c37f53bca9e1e7132d4b58ce00fa3%7C14840607478%7C14840607173; jebecookies=26fb58d1-cbe7-4fc3-a4ad-592233d1b42e |; ick_login=1f2b895d-34c7-4a1dafb7FUD 84666fad409; _de=BF09EE3A28DED52E6B65F6A4705D973F1383380866D39FF5; pendant 99e54330ba9f910b02e6b058f780479; ap=327550029; first_login_flag=1 Ln_uact=mr_mao_hacker@163.com; ln_hurl=http://hdn.xnimg.cn/photos/hdn521/20140529/1055/h_main_9A3Z_e0c300019f6a195a.jpg; tweets 214ca9a28f70ca6aa0801404dda4f6789; societyguester=214ca9a28f70ca6aa0801404dda4f6789; id=327550029; xnsid=745033c5; ver=7.0; loginfrom=syshome",
}

# 2. construct the Request object with the header information (mainly the Cookie) in headers
request = urllib2.Request("http://www.renren.com/", headers=headers)

# 3. visit the Renren home page directly; the server judges from the headers (mainly the Cookie) that this is a logged-in user and returns the corresponding page
response = urllib2.urlopen(request)

# 4. print the response content
print response.read()
But this is cumbersome: we have to log in to the account in a browser, save the password, and capture the traffic to obtain the Cookie. Is there a simpler, more convenient way?
Example 2:
The cookielib library
The main objects of this module are CookieJar, FileCookieJar, MozillaCookieJar and LWPCookieJar.
CookieJar: manages HTTP cookie values, stores the cookies generated by HTTP requests, and adds cookies to outgoing HTTP requests. The whole cookie store is kept in memory; the cookies are lost once the CookieJar instance is garbage-collected.
FileCookieJar(filename, delayload=None, policy=None): derived from CookieJar, used to create a FileCookieJar instance, retrieve cookie information, and store cookies in a file. filename is the name of the file where cookies are stored. When delayload is True, deferred file access is supported, i.e. the file is read or written only when needed.
MozillaCookieJar(filename, delayload=None, policy=None): derived from FileCookieJar, creates a FileCookieJar instance compatible with the Mozilla browser's cookies.txt.
LWPCookieJar(filename, delayload=None, policy=None): derived from FileCookieJar, creates a FileCookieJar instance compatible with the Set-Cookie3 file format of the libwww-perl standard.
In fact, in most cases we only use CookieJar(); if we need to interact with local files, we use MozillaCookieJar() or LWPCookieJar().
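A minimal sketch of the file-backed case, saving cookies with MozillaCookieJar and loading them back later (the file name here is just an illustrative choice):
# cookielib_save.py - save cookies to a Mozilla-format file and reload them
import urllib2
import cookielib

filename = "cookies.txt"  # illustrative file name

# save: attach a MozillaCookieJar to an opener, make a request, then write the jar to disk
cookiejar = cookielib.MozillaCookieJar(filename)
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookiejar))
opener.open("http://www.baidu.com/")
cookiejar.save(ignore_discard=True)   # also keep session cookies

# load: read the cookies back from the file and reuse them for new requests
newjar = cookielib.MozillaCookieJar()
newjar.load(filename, ignore_discard=True)
opener2 = urllib2.build_opener(urllib2.HTTPCookieProcessor(newjar))
print opener2.open("http://www.baidu.com/").getcode()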
Log in to Renren using cookielib and POST

import urllib
import urllib2
import cookielib

# 1. build a CookieJar object instance to save cookies
cookie = cookielib.CookieJar()

# 2. use HTTPCookieProcessor() to create a cookie processor object, with the CookieJar() object as its argument
cookie_handler = urllib2.HTTPCookieProcessor(cookie)

# 3. build an opener through build_opener()
opener = urllib2.build_opener(cookie_handler)

# 4. addheaders accepts a list in which each element is a tuple of header information; the opener will attach these headers to every request
opener.addheaders = [("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36")]

# 5. the account and password required to log in
data = {"email": "mr_mao_hacker@163.com", "password": "alaxxxxxime"}

# 6. encode the data with urlencode()
postdata = urllib.urlencode(data)

# 7. build a Request object that contains the user name and password to send
request = urllib2.Request("http://www.renren.com/PLogin.do", data=postdata)

# 8. send this request through the opener and obtain the post-login cookie value
opener.open(request)

# 9. the opener now holds the user's post-login cookie value and can directly access pages that require login
response = opener.open("http://www.renren.com/410043129/profile")

# 10. print the response content
print response.read()
There are several points to pay attention to when simulating login:
Login usually starts with an HTTP GET, which pulls some information and obtains a Cookie, followed by an HTTP POST to log in.
(1) The link for the HTTP POST login may be dynamic, obtained from the information returned by the GET.
(2) Some passwords are sent in clear text and some are encrypted; some websites even use dynamic encryption, along with encrypting a lot of other data.
(3) The encryption algorithm can only be obtained by reading the JS source code and then reimplementing it, which is very difficult.
(4) The overall login flow of most websites is similar, but some details may differ, so there is no guarantee that the same approach will successfully log in to other sites.
8) Exception and error handling in urllib2
URLError
The main causes of URLError are:
(1) No network connection
(2) Server connection failed
(3) the specified server cannot be found
We can use the try except statement to catch the corresponding exception. In the following example, we visit a domain name that does not exist:
# urllib2_urlerror.py
import urllib2

request = urllib2.Request('http://www.ajkfhafwjqh.com')

try:
    urllib2.urlopen(request, timeout=5)
except urllib2.URLError, err:
    print err
HTTPError
HTTPError is a subclass of URLError. When we make a request, the server returns a response object that includes a numeric "response status code".
If urlopen or opener.open cannot handle the response, an HTTPError is raised with the corresponding status code; the HTTP status code expresses the status of the response returned by the HTTP protocol.
Note that urllib2 handles redirected pages for us (responses whose codes begin with 3), and codes in the 100-299 range indicate success, so we usually only see error codes in the 400-599 range.
Improved version:
Since HTTPError's parent class is URLError, the parent-class exception should be caught after the subclass exception, so the code above can be rewritten as follows:
# urllib2_botherror.py
import urllib2

request = urllib2.Request('http://blog.baidu.com/itcast')

try:
    urllib2.urlopen(request)
except urllib2.HTTPError, err:
    print err.code
except urllib2.URLError, err:
    print err
else:
    print "Good Job"
2 Requests module
Requests inherits all the features of urllib2. Requests supports HTTP connection persistence and connection pooling, cookie session persistence, file upload, automatic determination of response content encoding, and automatic encoding of internationalized URL and POST data.
The underlying implementation of requests is actually urllib3.
The documentation of Requests is very complete, and the Chinese documentation is also quite good. Requests can fully meet the needs of today's web work; it supports Python 2.6-3.5 and runs perfectly under PyPy.
Open source address: https://github.com/kennethreitz/requests
Chinese document API: http://docs.python-requests.org/zh_CN/latest/index.html
GET request:
(1) the most basic GET request can directly use the get method.
import requests

response = requests.get("http://www.baidu.com/")

# it can also be written this way
# response = requests.request("get", "http://www.baidu.com/")
(2) add headers and query parameters
If you want to add headers, pass the headers parameter to add header information to the request. If you want to pass parameters in the url, use the params parameter.
import requests

kw = {'wd': 'Great Wall'}

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"}

# params accepts a dict or string of query parameters; a dict is automatically converted to url encoding, so urlencode() is not needed
response = requests.get("http://www.baidu.com/s?", params=kw, headers=headers)

# view the response content; response.text returns data in Unicode format
print response.text

# view the response content as the byte-stream data returned by response.content
print response.content

# view the full url address
print response.url

# view the character encoding of the response
print response.encoding

# view the response status code
print response.status_code
POST request:
(1) The most basic POST request can use the post method directly.
response = requests.post("http://www.baidu.com/", data=data)
(2) Passing form data with the data parameter
For a POST request we usually need to add some parameters to it; the most basic way to pass them is through the data parameter.
import requests

formdata = {
    "type": "AUTO",
    "i": "i love python",
    "doctype": "json",
    "xmlVersion": "1.8",
    "keyfrom": "fanyi.web",
    "ue": "UTF-8",
    "action": "FY_BY_ENTER",
    "typoResult": "true",
}

url = "http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule&smartresult=ugc&sessionFrom=null"

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"}

response = requests.post(url, data=formdata, headers=headers)

print response.text

# if the response is json, it can be displayed directly
print response.json()
Proxy (proxies parameter)
If you need to use a proxy, you can configure a single request by providing a proxies parameter for any request method:
import requests

# choose different proxies according to the protocol type
proxies = {
    "http": "http://12.34.56.79:9527",
    "https": "http://12.34.56.79:9527",
}

response = requests.get("http://www.baidu.com", proxies=proxies)
print response.text
You can also configure proxies through the local environment variables HTTP_PROXY and HTTPS_PROXY:
export HTTP_PROXY="http://12.34.56.79:9527"
export HTTPS_PROXY="https://12.34.56.79:9527"
Private proxy authentication (specific format) and Web client authentication (auth parameter)
urllib2's approach here is complicated; requests only needs one step:
Private proxy
import requests

# if the proxy requires HTTP Basic Auth, use the following format:
proxy = {"http": "mr_mao_hacker:sffqry9r@61.158.163.130:16816"}

response = requests.get("http://www.baidu.com", proxies=proxy)
print response.text
Web client authentication
For Web client verification, add auth = (account name, password):
import requests

auth = ('test', '123456')
response = requests.get('http://192.168.199.107', auth=auth)
print response.text
Cookies and Session
Cookies
If a response contains cookies, we can get them from the response's cookies attribute:
import requests

response = requests.get("http://www.baidu.com/")

# returns a CookieJar object:
cookiejar = response.cookies

# convert the CookieJar into a dictionary:
cookiedict = requests.utils.dict_from_cookiejar(cookiejar)

print cookiejar
print cookiedict

Running result:
{'BDORZ': '27315'}
Session
In requests, the session object is a very commonly used object that represents one user session: from the client browser connecting to the server until the client browser disconnects from the server.
A session lets us keep certain parameters across requests, for example keeping cookies between all requests made from the same Session instance.
Implementing a login to Renren

import requests

# 1. create a session object to save the cookie values
ssion = requests.session()

# 2. set up the headers
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"}

# 3. user name and password required for login
data = {"email": "mr_mao_hacker@163.com", "password": "alarmchime"}

# 4. send a request with the user name and password, get the post-login cookie values, and save them in ssion
ssion.post("http://www.renren.com/PLogin.do", data=data, headers=headers)

# 5. ssion now holds the user's post-login cookies and can directly access pages that require login
response = ssion.get("http://www.renren.com/410043129/profile")

# 6. print the response content
print response.text
Handling SSL certificate verification for HTTPS requests
Requests can also verify SSL certificates for HTTPS requests:
To check a host's SSL certificate, use the verify parameter (it can also be omitted, since it defaults to True).
import requests

response = requests.get("https://www.baidu.com/", verify=True)

# verify=True can also be omitted, since it is the default
# response = requests.get("https://www.baidu.com/")

print response.text
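Conversely, if a site's certificate cannot be validated (for example an expired or self-signed certificate), verification can be skipped with verify=False. A minimal sketch; the URL is only an illustrative example of such a site, and requests will normally emit an InsecureRequestWarning here, which is expected:
import requests

# skip SSL certificate verification (illustrative URL of a site whose certificate historically failed validation)
response = requests.get("https://www.12306.cn/mormhweb/", verify=False)
print response.status_code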