This article mainly shows you how to crawl KFC's store data with Python. The content is simple and clear; I hope it can help resolve your doubts. Let the editor lead you through "how Python crawls KFC".
Preparatory work
Check the request method used by KFC's official store locator: it is a POST request.
The request header X-Requested-With: XMLHttpRequest shows that KFC's site loads the store list through an AJAX request.
These two preparation steps make the goal of this crawler clear:

Send AJAX POST requests to KFC's official site and fetch the first 10 pages of KFC store locations in Shanghai.
Analysis.
To get the first 10 pages of Shanghai KFC locations, first analyze the url of each page.
Page one
# page1
# http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=cname
# POST
# cname: Shanghai
# pid:
# pageIndex: 1
# pageSize: 10
Page two
# page2
# http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=cname
# POST
# cname: Shanghai
# pid:
# pageIndex: 2
# pageSize: 10
Page 3 and onward follow the same pattern: the url and all other parameters stay the same, and only pageIndex increases.
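To make the pattern concrete, here is a tiny sketch (my illustration, not part of the original captures) that prints the POST payload for each of the first 10 pages; everything is fixed except pageIndex:

# Illustrative only: show the payload sent for each page; only pageIndex varies
for page in range(1, 11):
    payload = {
        'cname': 'Shanghai',
        'pid': '',
        'pageIndex': page,  # the only field that changes from page to page
        'pageSize': '10'
    }
    print(payload)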
Program entry
First, review the basic operations of urllib crawling:
# use urllib to fetch the source code of the Baidu home page
import urllib.request

# 1. Define a url: the address you want to visit
url = 'http://www.baidu.com'

# 2. Simulate the browser and send the request to the server; the server returns a response
response = urllib.request.urlopen(url)

# 3. Get the page source from the response
#    the read method returns binary data in byte form
#    decode converts the binary data to a string (bytes --> string: decode)
content = response.read().decode('utf-8')

# 4. Print the data
print(content)
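In real crawls the request can fail, so it is worth knowing urllib's standard error handling. Here is a minimal sketch of the same fetch with exception handling added (my addition, not from the original; the timeout value is arbitrary):

import urllib.request
import urllib.error

url = 'http://www.baidu.com'
try:
    response = urllib.request.urlopen(url, timeout=10)
    content = response.read().decode('utf-8')
    print(content[:200])  # print just the beginning
except urllib.error.HTTPError as e:
    # the server answered with an error status code (404, 500, ...)
    print('HTTP error:', e.code)
except urllib.error.URLError as e:
    # the request never reached a server (DNS failure, refused connection, ...)
    print('URL error:', e.reason)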
Define a url: the address you want to access.

Simulate the browser and send the request to the server; the server returns a response.

Read the page source content from the response.
if __name__ == '__main__':
    start_page = int(input('Please enter the starting page number: '))
    end_page = int(input('Please enter the ending page number: '))
    for page in range(start_page, end_page + 1):
        # customize the request object
        request = create_request(page)
        # get the web page source code
        content = get_content(request)
        # download the data
        down_load(page, content)
Correspondingly, the three functions called in the main block, create_request, get_content, and down_load, are defined below.
Locating the JSON data
The key to a crawler is finding the right interface. In this case, the JSON data for each page can be seen in the Preview tab of the browser's developer tools, confirming that this is the data we want.
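To inspect that JSON outside the browser, you can pretty-print it with the json module. A small sketch; the sample string and the field names Table1 and storeName are stand-ins for whatever the real response contains:

import json

# Stand-in for the JSON text seen in the Preview tab; real field names may differ
content = '{"Table1": [{"storeName": "Example Store"}]}'
print(json.dumps(json.loads(content), indent=2, ensure_ascii=False))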
Construct url
It is not difficult to see that all of KFC's page urls share a common prefix; save it as base_url.
base_url = 'http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=cname'
Parameters.
As usual, look for the pattern: only 'pageIndex' depends on the page number.
data = {
    'cname': 'Shanghai',
    'pid': '',
    'pageIndex': page,
    'pageSize': '10'
}

POST request
The parameters of the POST request must be encoded:

data = urllib.parse.urlencode(data).encode('utf-8')

Note that urlencode alone returns a string; the encode method must be called afterwards to convert it to bytes.
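A quick illustration of the two steps, with the outputs shown as comments:

import urllib.parse

data = {'cname': 'Shanghai', 'pid': '', 'pageIndex': 1, 'pageSize': '10'}

# Step 1: urlencode turns the dict into a query string (a str)
print(urllib.parse.urlencode(data))
# cname=Shanghai&pid=&pageIndex=1&pageSize=10

# Step 2: encode turns that str into bytes, which is what urlopen expects as POST data
print(urllib.parse.urlencode(data).encode('utf-8'))
# b'cname=Shanghai&pid=&pageIndex=1&pageSize=10'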
Where do the parameters go? Unlike a GET request, the parameters of a POST request are not spliced onto the url; they are passed via the data argument when customizing the request object.
This is also why the data had to be encoded to bytes first.

Header acquisition (a way to get past anti-crawler checks)
That is, the UA field of the request headers.
User-Agent (user agent): a special string header that lets the server identify the client's operating system and version, CPU type, browser and version, browser kernel, rendering engine, language, plug-ins, and so on.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36 Edg/94.0.992.38'
}

Request object customization
Once the parameters, base_url, and request headers are all ready, you can customize the request object.
request = urllib.request.Request(base_url, headers=headers, data=data)
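A small optional check (my addition): when a Request is created with a data argument, urllib automatically treats it as a POST, which you can confirm with get_method:

import urllib.parse
import urllib.request

# hypothetical payload, just to demonstrate the check
data = urllib.parse.urlencode({'pageIndex': 1}).encode('utf-8')
request = urllib.request.Request('http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=cname', data=data)
print(request.get_method())  # prints: POST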
Get the source code of the web page

Pass the request object to urlopen; this simulates the browser sending the request to the server and returns a response.
response = urllib.request.urlopen(request)
content = response.read().decode('utf-8')

Get the page source from the response and download the data
The read() method returns binary data in byte form, which must be decoded with decode() to get a string.
content = response.read().decode('utf-8')
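If you are not certain a page is UTF-8, the server usually declares its charset in the response headers. A hedged sketch (my addition) that decodes with the declared charset and falls back to utf-8:

import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')
# charset declared by the server, if any; fall back to utf-8
charset = response.headers.get_content_charset() or 'utf-8'
content = response.read().decode(charset)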
Then we write the downloaded data to a file; using the with open(...) as fp syntax, the file is closed automatically.
def down_load(page, content):
    with open('kfc_' + str(page) + '.json', 'w', encoding='utf-8') as fp:
        fp.write(content)

Full code

# ajax POST request to KFC's official site to obtain the first 10 pages of Shanghai KFC locations

# page1
# http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=cname
# POST
# cname: Shanghai
# pid:
# pageIndex: 1
# pageSize: 10

# page2
# http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=cname
# POST
# cname: Shanghai
# pid:
# pageIndex: 2
# pageSize: 10

import urllib.request
import urllib.parse


def create_request(page):
    base_url = 'http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=cname'
    data = {
        'cname': 'Shanghai',
        'pid': '',
        'pageIndex': page,
        'pageSize': '10'
    }
    data = urllib.parse.urlencode(data).encode('utf-8')
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36 Edg/94.0.992.38'
    }
    request = urllib.request.Request(base_url, headers=headers, data=data)
    return request


def get_content(request):
    response = urllib.request.urlopen(request)
    content = response.read().decode('utf-8')
    return content


def down_load(page, content):
    with open('kfc_' + str(page) + '.json', 'w', encoding='utf-8') as fp:
        fp.write(content)


if __name__ == '__main__':
    start_page = int(input('Please enter the starting page number: '))
    end_page = int(input('Please enter the ending page number: '))
    for page in range(start_page, end_page + 1):
        # customize the request object
        request = create_request(page)
        # get the web page source code
        content = get_content(request)
        # download the data
        down_load(page, content)
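Once the files kfc_1.json through kfc_10.json are saved, you may want to combine them. A sketch under the assumption that each response has a Table1 list with storeName and addressDetail fields (these names are not confirmed by the article; adjust them to the actual JSON):

import json

stores = []
for page in range(1, 11):
    with open('kfc_' + str(page) + '.json', encoding='utf-8') as fp:
        obj = json.load(fp)
    # 'Table1' is an assumed field name for the list of stores
    stores.extend(obj.get('Table1', []))

for store in stores:
    # 'storeName' and 'addressDetail' are assumed field names
    print(store.get('storeName'), '-', store.get('addressDetail'))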
Result after crawling
The above is the full content of "how Python crawls KFC". Thank you for reading! I believe you now have a good understanding, and I hope the content shared here helps you. If you want to learn more, welcome to follow the industry information channel!