
Python crawler preparation (3): the urllib2 module


The default User-Agent sent by urllib/urllib2 is Python-urllib/2.7, which makes the request easy to identify as a crawler, so we need to construct a Request object with urllib2.Request() and set our own headers.

1. View header information (see the sketch below)

2. Set the User-Agent to impersonate a browser when requesting data
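
For step 1, a minimal sketch of viewing header information (assuming http://www.baidu.com/ is reachable): the opener's addheaders attribute shows the default User-Agent that urllib2 will send, and response.info() shows the headers the server returns.

# _*_ coding: utf-8 _*_
import urllib2

# The default opener attaches a User-Agent such as Python-urllib/2.7,
# which is what makes the request easy to identify as a crawler
opener = urllib2.build_opener()
print opener.addheaders

# info() returns the HTTP headers sent back by the server
response = opener.open('http://www.baidu.com/')
print response.info()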

Request takes three main parameters. In addition to the url parameter, there are the following two:

data (empty by default): the data submitted with the url (such as the data to POST); when it is supplied, the HTTP request changes from "GET" to "POST" (a POST sketch follows the example below).

headers (empty by default): a dictionary of the HTTP headers, as key-value pairs, to send.

# _*_ coding: utf-8 _*_
import urllib2

# The User-Agent is the first step in the crawler vs. anti-crawler game
ua_headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36'}

# Construct a Request object through the urllib2.Request() method
request = urllib2.Request('http://www.baidu.com/', headers=ua_headers)

# Send a request to the specified url and get back a file-like response object
response = urllib2.urlopen(request)

# The file-like object returned by the server supports Python file object methods;
# read() reads all of the content and returns it as a string
html = response.read()
print html
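
The data parameter described above switches the request from GET to POST. A minimal sketch, assuming a purely hypothetical form endpoint (http://www.example.com/login) and form fields user/passwd; note that urlencode() comes from urllib, not urllib2:

# _*_ coding: utf-8 _*_
import urllib
import urllib2

# Hypothetical endpoint and form fields, for illustration only
url = 'http://www.example.com/login'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36'}
form_data = {'user': 'test', 'passwd': '123456'}

# urlencode() turns the dict into 'user=test&passwd=123456' (key order may vary);
# passing it as the data argument makes urllib2 send a POST instead of a GET
data = urllib.urlencode(form_data)
request = urllib2.Request(url, data=data, headers=headers)

response = urllib2.urlopen(request)
print response.read()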

3. Choose a random User-Agent

To avoid getting the IP blocked, first build a User-Agent list and randomly select one from it for each request.

# _*_ coding: utf-8 _*_
import urllib2
import random

url = 'http://www.baidu.com/'

# You can maintain a User-Agent list (or a proxy list) the same way
ua_list = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
]

# Select a User-Agent at random from the list
user_agent = random.choice(ua_list)

# Construct a request
request = urllib2.Request(url)

# The add_header() method adds/modifies an HTTP header
request.add_header('User-agent', user_agent)

# get_header() returns the value of an existing HTTP header; note that only
# the first letter is upper-case and the remaining letters are lower-case
print request.get_header('User-agent')
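
The example above only builds the request and prints the header it set. To actually fetch the page with the randomly chosen User-Agent, pass the request to urlopen() as before; a minimal sketch (using a shortened list here for brevity):

# _*_ coding: utf-8 _*_
import urllib2
import random

# A shortened list just for illustration; in practice reuse the full ua_list above
ua_list = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
]

request = urllib2.Request('http://www.baidu.com/')
request.add_header('User-agent', random.choice(ua_list))

# Send the request and read the body, exactly as in the earlier example
response = urllib2.urlopen(request)
html = response.read()
print len(html)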

Main differences between urllib and urllib2

urllib and urllib2 are related modules that both make URL requests, but they provide different functionality. The most notable differences are as follows:

(1) urllib only accepts a URL string; it cannot build a Request instance with custom headers, while urllib2 can;

(2) urllib provides the urlencode() method for generating GET query strings, while urllib2 does not (this is the main reason urllib and urllib2 are so often used together);

(3) The encoding work uses urllib's urlencode() function, which converts key-value pairs like key:value into strings like 'key=value'; the decoding work can use urllib's unquote() function (see the sketch below).
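
A minimal sketch of how the two modules cooperate (the baidu search URL and the wd/ie parameters are only illustrative): urllib.urlencode() builds the GET query string, urllib2 sends the request, and urllib.unquote() reverses the percent-encoding.

# _*_ coding: utf-8 _*_
import urllib
import urllib2

# urlencode() converts key-value pairs into a percent-encoded query string
query = urllib.urlencode({'wd': 'python urllib2', 'ie': 'utf-8'})
print query                                  # e.g. wd=python+urllib2&ie=utf-8 (key order may vary)

# urllib2 itself has no urlencode(), so the two modules are combined:
# urllib builds the query string, urllib2 sends the request
url = 'http://www.baidu.com/s?' + query
response = urllib2.urlopen(urllib2.Request(url))
print response.geturl()

# unquote() does the decoding work in the other direction
print urllib.unquote('wd=python%20urllib2')  # prints: wd=python urllib2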
