This article mainly introduces how to use Python to crawl web pages. In daily work, many people have doubts about how to do this, so the editor has consulted various materials and sorted out a simple, easy-to-follow method of operation. Hopefully it will help answer your doubts about crawling web pages with Python. Now, please follow the editor and study!
The basic flow of a Python web crawler:
First, select a set of carefully chosen seed URLs.
Put these URLs into the queue of URLs to be crawled.
Take a URL from the to-be-crawled queue, resolve its DNS to obtain the host's IP, download the web page corresponding to the URL, and store it in the downloaded web page library. Then move the URL into the crawled URL queue.
Analyze the URLs in the crawled URL queue: extract the other URLs from the downloaded page data and compare them against the already-crawled URLs to remove duplicates. Finally, put the deduplicated URLs into the to-be-crawled queue, thus entering the next cycle (see the sketch below).
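To make the flow concrete, here is a minimal sketch of this loop, in the same Python 2 / urllib2 style as the examples below. The seed URL is a placeholder and extract_links is a hypothetical helper; a real crawler would also need politeness delays, robots.txt handling, and error handling.
import urllib2
from collections import deque

def extract_links(html):
    # hypothetical helper: in practice, parse the href values out of the
    # page with HTMLParser, a regex, or a third-party parser
    return []

to_crawl = deque(['http://www.zhihu.com'])  # queue of URLs to be crawled, seeded
crawled = set()                             # URLs that have already been crawled
pages = {}                                  # the downloaded web page library

while to_crawl:
    url = to_crawl.popleft()                # read a URL from the queue
    if url in crawled:                      # deduplicate against crawled URLs
        continue
    html = urllib2.urlopen(url).read()      # DNS resolution and download happen here
    pages[url] = html                       # store the page in the library
    crawled.add(url)                        # record it among the crawled URLs
    for link in extract_links(html):        # extract the other URLs from the page data
        if link not in crawled:
            to_crawl.append(link)           # enqueue it, entering the next cycle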
1. HTTP request implementation
Implementation with urllib2/urllib:
urllib2 and urllib are two modules built into Python 2. Together they implement HTTP request functionality, with urllib2 as the primary module and urllib in a supporting role. (Note that urllib2 exists only in Python 2; in Python 3 its functionality was merged into urllib.request and urllib.parse.)
urllib2 provides a basic function, urlopen, which fetches data by sending a request to a specified URL. The simplest form is:
import urllib2

response = urllib2.urlopen('http://www.zhihu.com')
html = response.read()
print html
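The object urlopen returns is file-like; besides read(), a few other methods are worth knowing. A small addition to the example above:
import urllib2

response = urllib2.urlopen('http://www.zhihu.com')
print response.getcode()   # the HTTP status code, e.g. 200
print response.geturl()    # the final URL, after any redirects
print response.info()      # the response headers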
In fact, the request-response exchange with http://www.zhihu.com can be divided into two steps, the request and the response, in the following form:
import urllib2

# request
request = urllib2.Request('http://www.zhihu.com')
# response
response = urllib2.urlopen(request)
html = response.read()
print html
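For reference, here is the same two-step request/response in Python 3, where urllib2's functionality lives in urllib.request. This is a direct translation of the example, not part of the original:
# Python 3 equivalent of the two-step example above
from urllib.request import Request, urlopen

request = Request('http://www.zhihu.com')
response = urlopen(request)
html = response.read()
print(html)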
There is also a POST request implementation:
import urllib
import urllib2

url = 'http://www.xxxxxx.com/login'
postdata = {'username': 'qiye', 'password': 'qiye_pass'}
# the form data needs to be encoded into a format urllib2 can understand; urllib does the encoding
data = urllib.urlencode(postdata)
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)
html = response.read()
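In practice a request can fail, and urllib2 signals this with exceptions. A sketch of the usual handling around urlopen, with the same placeholder URL as above:
import urllib2

req = urllib2.Request('http://www.xxxxxx.com/login')
try:
    response = urllib2.urlopen(req)
except urllib2.HTTPError as e:   # the server returned an error status code
    print e.code
except urllib2.URLError as e:    # the server could not be reached at all
    print e.reason
else:
    html = response.read()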
2. Request header (headers) processing
Let's rewrite the example above, adding request header information: we set the User-Agent and Referer fields in the request header.
import urllib
import urllib2

url = 'http://www.xxxxxx.com/login'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
referer = 'http://www.xxxxxx.com/'
postdata = {'username': 'qiye', 'password': 'qiye_pass'}
# write user_agent and referer into the header information
headers = {'User-Agent': user_agent, 'Referer': referer}
data = urllib.urlencode(postdata)
req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)
html = response.read()
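An equivalent way to attach the same headers, if you prefer to set them one at a time, is Request.add_header; a minimal variant of the example above (sent as a plain GET, no form data):
import urllib2

req = urllib2.Request('http://www.xxxxxx.com/login')
req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')
req.add_header('Referer', 'http://www.xxxxxx.com/')
response = urllib2.urlopen(req)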
3. Cookie processing
urllib2 also handles cookies automatically, using the cookielib.CookieJar class to manage them. If you need to get the value of a Cookie item, you can do this:
import urllib2
import cookielib

cookie = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
response = opener.open('http://www.zhihu.com')
for item in cookie:
    print item.name + ':' + item.value
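For reference, in Python 3 cookielib was renamed http.cookiejar and the opener machinery moved into urllib.request; the same example translates directly:
# Python 3 equivalent: cookielib became http.cookiejar
import urllib.request
from http.cookiejar import CookieJar

cookie = CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie))
response = opener.open('http://www.zhihu.com')
for item in cookie:
    print(item.name + ':' + item.value)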
But sometimes we don't want urllib2 to handle cookies automatically; we want to add the Cookie contents ourselves. We can do this by setting the request header:
import urllib2

opener = urllib2.build_opener()
opener.addheaders.append(('Cookie', 'email=' + "xxxxxxx@163.com"))
req = urllib2.Request("http://www.zhihu.com/")
response = opener.open(req)
print response.headers
retdata = response.read()
At this point, the study of "how to use Python to crawl web pages" is complete. Hopefully it has resolved your doubts. Pairing theory with practice is the best way to learn, so go and try it out! If you want to continue learning more related knowledge, please keep following the site; the editor will keep working hard to bring you more practical articles!