This article is an analysis of "a very watery Python code": a grab bag of basic urllib spider snippets. The methods introduced here are simple, fast, and practical, so interested readers may wish to follow along and try them out.
# _*_ coding: utf-8 _*_
# python_spider.py by xianhu

import requests
import urllib.error
import urllib.parse
import urllib.request
import http.cookiejar
Part 1: the simplest page fetch
# First, define the following variables
url = "https://www.baidu.com"
headers = {"User-Agent": "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)"}

# The simplest way to fetch a web page
response = urllib.request.urlopen(url, timeout=10)
html = response.read().decode("utf-8")

# Use a Request instance instead of a bare URL
request = urllib.request.Request(url, data=None, headers={})
response = urllib.request.urlopen(request, timeout=10)
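Before moving on, it can help to confirm what actually came back. Below is a minimal sketch that inspects the response object before decoding it; the http://example.com URL is a stand-in for testing, not part of the original snippet:

import urllib.request

# Hedged sketch: inspect the response before decoding it.
# http://example.com is a stand-in test URL, not from the original article.
response = urllib.request.urlopen("http://example.com", timeout=10)
print(response.getcode())                      # HTTP status code, e.g. 200
print(response.headers.get("Content-Type"))   # e.g. "text/html; charset=UTF-8"
html = response.read().decode("utf-8")
print(html[:100])                              # the first 100 characters of the page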
Part 2: sending data and headers
# Send data, i.e. add a data parameter to the Request()
data = urllib.parse.urlencode({"act": "login", "email": "xianhu@qq.com", "password": "123456"})
request1 = urllib.request.Request(url, data=data.encode("utf-8"))  # POST: data must be bytes in Python 3
request2 = urllib.request.Request(url + "?%s" % data)              # GET: data goes into the query string
response = urllib.request.urlopen(request1, timeout=10)

# Send headers, i.e. add a headers parameter to the Request()
request = urllib.request.Request(url, data=data.encode("utf-8"), headers=headers)
# Another way to add a header to the Request(); Referer is added to deal with "anti-hotlinking"
request.add_header("Referer", "http://www.baidu.com")
response = urllib.request.urlopen(request, timeout=10)
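One detail worth stressing: in Python 3, urlopen() only accepts bytes for the data parameter, which is why the snippet above calls .encode("utf-8"). Here is a hedged sketch of a complete POST round trip; httpbin.org is an assumed public test endpoint, not part of the original code:

import json
import urllib.parse
import urllib.request

# Hedged sketch: POST form data and read back the echoed response.
# httpbin.org is an assumed test service, not from the original article.
data = urllib.parse.urlencode({"act": "login", "email": "xianhu@qq.com"}).encode("utf-8")
request = urllib.request.Request("http://httpbin.org/post", data=data)
response = urllib.request.urlopen(request, timeout=10)
echoed = json.loads(response.read().decode("utf-8"))
print(echoed["form"])   # {'act': 'login', 'email': 'xianhu@qq.com'}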
Part 3: handling exceptions and using proxies
# Fetching a page can raise urllib.error.HTTPError or urllib.error.URLError;
# HTTPError is a subclass of URLError, so catch it first
try:
    urllib.request.urlopen(request, timeout=10)
except urllib.error.HTTPError as e:
    print(e.code, e.reason)
except urllib.error.URLError as e:
    print(e.errno, e.reason)

# Use a proxy, to keep your IP from being blocked or rate-limited
proxy_handler = urllib.request.ProxyHandler(proxies={"http": "111.123.76.12:8080"})
opener = urllib.request.build_opener(proxy_handler)  # create an opener instance that uses the proxy
response = opener.open(url)                          # open the URL directly with the opener instance

urllib.request.install_opener(opener)                # install a global opener, then open the URL with urlopen
response = urllib.request.urlopen(url)
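Because HTTPError is a subclass of URLError, the order of the except clauses above matters. If you want to go one step further, the same two exceptions can drive a simple retry loop; the fetch_with_retries helper below is my own illustrative sketch, not something from the original article:

import time
import urllib.error
import urllib.request

def fetch_with_retries(url, retries=3, timeout=10):
    # Hypothetical helper (not from the original article): retry transient
    # network failures with backoff, but give up at once on HTTP error statuses.
    for attempt in range(retries):
        try:
            return urllib.request.urlopen(url, timeout=timeout)
        except urllib.error.HTTPError as e:
            print("HTTP error:", e.code, e.reason)   # 4xx/5xx: a retry rarely helps
            raise
        except urllib.error.URLError as e:
            print("network error:", e.reason)
            time.sleep(2 ** attempt)                 # simple exponential backoff
    raise urllib.error.URLError("all %d attempts failed" % retries)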
Part 4: cookies, file downloads, HTTP auth, and SOCKS proxies
# Use cookies and a cookiejar when the server checks cookies
cookie_jar = http.cookiejar.CookieJar()
cookie_jar_handler = urllib.request.HTTPCookieProcessor(cookiejar=cookie_jar)
opener = urllib.request.build_opener(cookie_jar_handler)
response = opener.open(url)

# Send cookies obtained from the browser, in one of two ways:
# (1) put them directly into the headers
headers = {
    "User-Agent": "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)",
    "Cookie": "PHPSESSID=btqkg9amjrtoeev8coq0m78396; USERINFO=n6nxTHTY%2BJA39z6CpNB4eKN8f0KsYLjAQTwPe%2BhLHLruEbjaeh5ulhWAS5RysUM%2B",
}
request = urllib.request.Request(url, headers=headers)

# (2) build a Cookie and add it to the cookiejar
# (http.cookiejar.Cookie() takes many more arguments, elided here as in the original)
cookie = http.cookiejar.Cookie(name="xx", value="xx", domain="xx", ...)
cookie_jar.set_cookie(cookie)
response = opener.open(url)

# Use the proxy and the cookiejar together
opener = urllib.request.build_opener(cookie_jar_handler)
opener.add_handler(proxy_handler)
response = opener.open("https://www.baidu.com/")

# Grab an image (or any other file) from the network:
# right-click the image, find the address in its properties, and save it
response = urllib.request.urlopen("http://ww3.sinaimg.cn/large/7d742c99tw1ee7dac2766j204q04qmxq.jpg", timeout=120)
with open("test.jpg", "wb") as file_img:
    file_img.write(response.read())

# HTTP basic authentication
password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()                     # create a password manager
password_mgr.add_password(realm=None, uri=url, user='username', passwd='password')  # add a username and password
handler = urllib.request.HTTPBasicAuthHandler(password_mgr)                         # create an HTTPBasicAuthHandler
opener = urllib.request.build_opener(handler)                                       # create an opener
response = opener.open(url, timeout=10)                                             # fetch the data

# Use a SOCKS proxy
import socks
import socket
socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, "127.0.0.1", 1080)
socket.socket = socks.socksocket
requests.get("http://www.baidu.com/s?ie=utf-8&wd=ip")

At this point, I believe you have a deeper understanding of this "very watery Python code". You might as well try it out in practice!
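As one concrete practice idea, you could make the cookiejar survive between runs. Here is a minimal sketch using http.cookiejar.MozillaCookieJar; the "cookies.txt" filename is an assumption of this example, not something from the article:

import http.cookiejar
import urllib.request

# Hedged sketch: save cookies to disk and reload them on the next run.
# "cookies.txt" is an assumed filename, not from the original article.
cookie_jar = http.cookiejar.MozillaCookieJar("cookies.txt")
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie_jar))
response = opener.open("https://www.baidu.com/")
cookie_jar.save(ignore_discard=True)     # also write session cookies

# Later, before making new requests:
cookie_jar.load(ignore_discard=True)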