
How to Use the requests Library in a Python Crawler




The editor shares with you how to use the requests library in a Python crawler. Most people are probably not very familiar with it, so this article is shared for your reference; I hope you will learn a lot after reading it. Let's take a look!

Usage of the requests library in a Python crawler

Requests is an easy-to-use HTTP library implemented in Python, and it is much simpler to use than urllib. Requests lets you send HTTP/1.1 requests: specify a URL, add a query string, and start crawling web page information, among other operations.

Because it is a third-party library, you need to install it from the command line (cmd) before using it:

pip install requests

After the installation is complete, if import requests raises no error, you can start using it normally.
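
As a quick sanity check, a minimal sketch (assuming the install above succeeded; the version string printed will of course vary):

import requests

# If the import succeeds and a version string is printed, requests is ready to use
print(requests.__version__)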

Basic usage:

requests.get() is used to request the target website, and what it returns is a Response object.

import requests

response = requests.get('http://www.baidu.com')
print(response.status_code)  # print status code
print(response.url)          # print request url
print(response.headers)      # print header information
print(response.cookies)      # print cookie information
print(response.text)         # print web page source code
print(response.content)      # print as byte stream

Taking the status code as an example, running the code prints:

Status code: 200, which shows that the request to the target website completed normally.

If the status code is 403, the target usually sits behind a firewall that has triggered its anti-crawling policy and restricted your IP.
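
A small hedged sketch of how you might react to the status code in practice (the 403 branch and the messages are illustrative, not part of the original tutorial):

import requests

response = requests.get('http://www.baidu.com')
if response.status_code == 200:
    print('OK, the page was fetched normally')
elif response.status_code == 403:
    # 403 usually means the anti-crawl policy was triggered
    print('Blocked: consider changing headers, slowing down, or using a proxy')
else:
    # raise_for_status() raises requests.exceptions.HTTPError for other 4xx/5xx responses
    response.raise_for_status()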

Various request methods:

import requests

requests.get('http://www.baidu.com')
requests.post('http://www.baidu.com')
requests.put('http://www.baidu.com')
requests.delete('http://www.baidu.com')
requests.head('http://www.baidu.com')
requests.options('http://www.baidu.com')

Basic GET request:

import requests

response = requests.get('http://www.baidu.com')
print(response.text)

GET request with parameters:

The first way is to put the parameters directly in the URL:

import requests

response = requests.get("https://www.crrcgo.cc/admin/crr_supplier.html?params=1")
print(response.text)

The other way is to put the parameters in a dict (here called data) first and pass it via the params argument when initiating the request:

import requests

data = {'params': '1'}
response = requests.get('https://www.crrcgo.cc/admin/crr_supplier.html?', params=data)
print(response.text)
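
To see how params is appended to the request, you can print response.url after the call; a minimal sketch, using httpbin.org (which also appears later in this article) instead of the site above:

import requests

data = {'params': '1'}
response = requests.get('http://httpbin.org/get', params=data)
# The dict is encoded into the query string, e.g. http://httpbin.org/get?params=1
print(response.url)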

Basic POST request:

import requests

response = requests.post('http://baidu.com')
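
The request above sends an empty POST body; a hedged sketch of posting form data with the data argument (httpbin.org/post simply echoes back what it receives, and the field names here are made up for illustration):

import requests

payload = {'username': 'test', 'password': '123456'}  # illustrative form fields
response = requests.post('http://httpbin.org/post', data=payload)
print(response.text)  # httpbin echoes the submitted form back in the response body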

Parse JSON:

import requests

response = requests.get('http://httpbin.org/get')
print(response.text)
print(response.json())  # response.json() is equivalent to json.loads(response.text)
print(type(response.json()))
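
A minimal sketch, assuming httpbin.org is reachable, that checks the equivalence mentioned in the comment above:

import json
import requests

response = requests.get('http://httpbin.org/get')
# Both calls should produce the same Python dict
assert response.json() == json.loads(response.text)
print(type(response.json()))  # <class 'dict'>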

Simply save a binary file

import requests

response = requests.get('http://img.ivsky.com/img/tupian/pre/201708/30/kekeersitao-002.jpg')
b = response.content
with open('fv.jpg', 'wb') as f:  # the file must be opened in binary write mode
    f.write(b)
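
For large files, response.content loads the whole body into memory at once. A hedged sketch of streaming the download in chunks instead (stream=True and Response.iter_content() are standard requests features in recent versions; the chunk size and output filename are arbitrary):

import requests

url = 'http://img.ivsky.com/img/tupian/pre/201708/30/kekeersitao-002.jpg'
with requests.get(url, stream=True) as response:
    with open('fv_stream.jpg', 'wb') as f:
        # Write the body in 8 KB chunks instead of holding it all in memory
        for chunk in response.iter_content(chunk_size=8192):
            if chunk:
                f.write(chunk)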

Add header information to your request

import requests

heads = {}
heads['User-Agent'] = 'Mozilla/5.0 ' \
                      '(Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 ' \
                      '(KHTML, like Gecko) Version/5.1 Safari/534.50'
response = requests.get('http://www.baidu.com', headers=heads)

This method helps the request avoid detection by the firewall, since the crawler now identifies itself as an ordinary browser.
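
A hedged sketch of taking this one step further by rotating the User-Agent on each request (the pool of UA strings below is just a small illustrative sample):

import random
import requests

# Illustrative pool of User-Agent strings; extend it as needed
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Safari/605.1.15',
]

heads = {'User-Agent': random.choice(user_agents)}  # pick a different identity each time
response = requests.get('http://www.baidu.com', headers=heads)
print(response.status_code)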

Use proxy

Just like the headers argument, the proxies parameter is also a dict. In the example below, the requests library is used to crawl the IP, port, and type of each proxy from an IP proxy site. Because they are free, the proxy addresses used here will soon become invalid.

import requests
import re

def get_html(url):
    proxy = {
        'http': '120.25.253.234:812',
        'https': '163.125.222.244:8123'
    }
    heads = {}
    heads['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 ' \
                          '(KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0'
    req = requests.get(url, headers=heads, proxies=proxy)
    html = req.text
    return html

def get_ipport(html):
    # The HTML tags in these patterns were lost when the article was extracted;
    # adjust them to match the IP, port and type columns of the proxy list page
    regex = r'<td data-title="IP">(.+)</td>'
    iplist = re.findall(regex, html)
    regex2 = r'<td data-title="PORT">(.+)</td>'
    portlist = re.findall(regex2, html)
    regex3 = r'<td data-title="类型">(.+)</td>'
    typelist = re.findall(regex3, html)
    sumray = []
    for ip, port, ptype in zip(iplist, portlist, typelist):
        # combine each proxy as "type,ip:port"
        sumray.append(ptype + ',' + ip + ':' + port)
    print('High-anonymity proxies')
    print(sumray)

if __name__ == '__main__':
    url = 'http://www.baidu.com'  # replace with the URL of the proxy list page to crawl
    get_ipport(get_html(url))
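
Because free proxies die quickly, it can help to test each address before relying on it; a hedged sketch (the proxy address is a placeholder taken from the example above, and httpbin.org/ip is used only because it echoes back the requesting IP):

import requests

def proxy_alive(proxy_addr, timeout=3):
    # Return True if a request through the proxy succeeds within the timeout
    proxies = {'http': proxy_addr, 'https': proxy_addr}
    try:
        res = requests.get('http://httpbin.org/ip', proxies=proxies, timeout=timeout)
        return res.status_code == 200
    except requests.exceptions.RequestException:
        # Covers connection errors, proxy errors and timeouts
        return False

print(proxy_alive('120.25.253.234:812'))  # placeholder address from the example above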

Get cookies:

import requests

response = requests.get('http://www.baidu.com')
print(response.cookies)
print(type(response.cookies))
for k, v in response.cookies.items():
    print(k, v)

Session maintenance:

import requests

session = requests.Session()
session.get('https://www.crrcgo.cc/admin/crr_supplier.html')
response = session.get('https://www.crrcgo.cc/admin/')
print(response.text)

Certificate verification settings:

import requests
from requests.packages import urllib3

urllib3.disable_warnings()  # suppress the warning caused by disabling verification
response = requests.get('https://www.12306.cn', verify=False)  # certificate verification set to False
print(response.status_code)

Timeout exception capture:

import requests
from requests.exceptions import ReadTimeout

try:
    res = requests.get('http://httpbin.org', timeout=0.1)
    print(res.status_code)
except ReadTimeout:
    print('timeout')

Exception handling: use try...except to catch exceptions:

import requests
from requests.exceptions import ReadTimeout, HTTPError, RequestException

try:
    response = requests.get('http://www.baidu.com', timeout=0.5)
    print(response.status_code)
except ReadTimeout:
    print('timeout')
except HTTPError:
    print('httperror')
except RequestException:
    print('reqerror')

The above is the whole content of this article, "How to Use the requests Library in a Python Crawler". Thank you for reading! I hope the content shared here has helped you; if you want to learn more, welcome to follow the industry information channel!
