2025-01-15 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/01 Report--
This article explains in detail how to crawl Meituan's food listings with Python. It is a practical walkthrough, shared here as a reference, and I hope you get something out of it.
1. Analyze the URL parameters of Meituan's food search page
1) Search settings
Meituan food, city: Beijing, search keyword: hot pot (火锅)
2) URL to crawl
https://bj.meituan.com/s/%E7%81%AB%E9%94%85/
3) Description
The browser percent-encodes Chinese characters in a URL automatically, so the keyword 火锅 (hot pot) appears as the unfamiliar string %E7%81%AB%E9%94%85.
The URL is built from the city and the keyword: the bj prefix stands for Beijing, and the search keyword follows /s/.
With this, the structure of the URL is clear.
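The URL structure described above can be reproduced with the standard library. This is a minimal sketch: the helper name build_search_url is hypothetical, and it assumes the pattern seen in the one example URL (city prefix as subdomain, keyword after /s/) holds for other cities and keywords.

```python
from urllib.parse import quote, unquote


def build_search_url(city_prefix, keyword):
    """Build a search URL of the form https://<city>.meituan.com/s/<keyword>/.

    quote() percent-encodes the UTF-8 bytes of the keyword, which is the same
    transformation the browser applies to Chinese characters.
    """
    return f"https://{city_prefix}.meituan.com/s/{quote(keyword)}/"


if __name__ == "__main__":
    print(build_search_url("bj", "火锅"))
    # Decoding goes the other way: the encoded string is just the keyword.
    print(unquote("%E7%81%AB%E9%94%85"))
```

Running it prints the URL from step 2) and then the decoded keyword 火锅.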
2. Analyze the page's data source (F12 developer tools)
Open the F12 developer tools and refresh the page. When you switch to the second page of results, the URL does not change and the site does not reload or redirect. (This is Ajax: a web technique for loading data without refreshing the page or changing the URL.)
So we need to find, under the XHR filter in the developer tools, the response that carries the current page's data.
The analysis shows that the data is exchanged in JSON format. Compare the request URL of the JSON file for page 2 with the one for page 3.
Page 2: https://apimobile.meituan.com/group/v4/poi/pcsearch/1?uuid=xxx&userid=-1&limit=32&offset=32&cateId=-1&q=%E7%81%AB%E9%94%85
Page 3: https://apimobile.meituan.com/group/v4/poi/pcsearch/1?uuid=xxx&userid=-1&limit=32&offset=64&cateId=-1&q=%E7%81%AB%E9%94%85
The offset parameter increases by 32 with each page turn. limit is the number of records requested, offset is the index of the first record requested, and q is the search keyword. The 1 in poi/pcsearch/1? is the ID of the city, Beijing.
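The paging rule can be written down as a small function. This is a sketch under stated assumptions: build_page_url is a hypothetical helper, the uuid value xxx is the placeholder used in this article, and page 1 is assumed to correspond to offset 0.

```python
from urllib.parse import urlencode

API_BASE = "https://apimobile.meituan.com/group/v4/poi/pcsearch/1"
PAGE_SIZE = 32  # value of the limit parameter seen in the captured requests


def build_page_url(page, keyword, uuid="xxx"):
    """Return the request URL for a 1-based page number.

    offset is the index of the first record, so it grows by PAGE_SIZE per page.
    """
    params = {
        "uuid": uuid,
        "userid": -1,
        "limit": PAGE_SIZE,
        "offset": (page - 1) * PAGE_SIZE,
        "cateId": -1,
        "q": keyword,  # urlencode percent-encodes the keyword as UTF-8
    }
    return f"{API_BASE}?{urlencode(params)}"


if __name__ == "__main__":
    for page in (2, 3):
        print(page, build_page_url(page, "火锅"))
```

For pages 2 and 3 this reproduces the two request URLs listed above, with offset=32 and offset=64.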
3. Construct a request to capture Meituan's food data
Next, construct the requests directly and iterate over the data on each page. The final code is as follows.
import json
import re

import requests


def start():
    # offset = page index x 32. The upper bound of 1600 caps the crawl at
    # 50 pages; adjust it to the actual number of result pages.
    for w in range(0, 1600, 32):
        # A page that raises an exception is skipped and treated as having no data.
        try:
            # Replace the xxx after uuid= with your own uuid value.
            url = ('https://apimobile.meituan.com/group/v4/poi/pcsearch/1'
                   '?uuid=xxx&userid=-1&limit=32&offset=' + str(w) +
                   '&cateId=-1&q=%E7%81%AB%E9%94%85')
            # Copy these values from Request Headers in the F12 developer tools.
            # Add a Cookie entry to headers if necessary.
            headers = {
                'Accept': '*/*',
                # 'br' is left out here: requests only decodes it when the
                # brotli package is installed.
                'Accept-Encoding': 'gzip, deflate',
                'Accept-Language': 'zh-CN,zh;q=0.9',
                'Connection': 'keep-alive',
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3741.400 QQBrowser/10.5.3863.400',
                'Host': 'apimobile.meituan.com',
                'Origin': 'https://bj.meituan.com',
                'Referer': 'https://bj.meituan.com/s/%E7%81%AB%E9%94%85/',
            }
            response = requests.get(url, headers=headers, timeout=10)
            # Extract the fields with regular expressions. The author found that
            # the json method could not get the store title key, hence the regexes.
            titles = re.findall(r'","title":"(.*?)","address":"', response.text)
            addresses = re.findall(r',"address":"(.*?)",', response.text)
            avgprices = re.findall(r',"avgprice":(.*?),', response.text)
            avgscores = re.findall(r'"avgscore":(.*?),', response.text)
            comments = re.findall(r',"comments":(.*?),', response.text)
            # Check whether each list has the expected 32 entries.
            print(len(titles), len(addresses), len(avgprices), len(avgscores), len(comments))
            # Walk the columns in parallel and write one record per store.
            for title, address, avgprice, avgscore, comment in zip(
                    titles, addresses, avgprices, avgscores, comments):
                file_data(title, address, avgprice, avgscore, comment)
        except Exception as exc:
            print('skipping offset', w, exc)


# File writing method
def file_data(title, address, avgprice, avgscore, comment):
    data = {
        'shop name': title,
        'store address': address,
        'average consumer price': avgprice,
        'store rating': avgscore,
        'number of comments': comment,
    }
    # Mode 'a' appends, so each store becomes one JSON line in the file.
    with open('Meituan cuisine.txt', 'a', encoding='utf-8') as fb:
        # ensure_ascii=False is required; otherwise json.dumps escapes the
        # Chinese text and the file looks garbled.
        fb.write(json.dumps(data, ensure_ascii=False) + '\n')


if __name__ == '__main__':
    start()
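To see what the regular expressions and ensure_ascii=False do without sending any request, the extraction step can be tried on a hand-written string. Everything in SAMPLE is hypothetical: it only mimics the field order the patterns expect and is not a captured Meituan response. The helper names extract_records and to_json_line are likewise made up for this sketch.

```python
import json
import re

# Hypothetical response fragment, written by hand for illustration only.
SAMPLE = (
    '{"data":[{"template":"def","title":"老北京火锅","address":"朝阳区1号",'
    '"areaname":"朝阳","avgprice":88,"avgscore":4.5,"comments":120,"id":1}]}'
)


def extract_records(text):
    """Apply the script's regular expressions and zip the columns into dicts."""
    titles = re.findall(r'","title":"(.*?)","address":"', text)
    addresses = re.findall(r',"address":"(.*?)",', text)
    avgprices = re.findall(r',"avgprice":(.*?),', text)
    avgscores = re.findall(r'"avgscore":(.*?),', text)
    comments = re.findall(r',"comments":(.*?),', text)
    keys = ('shop name', 'store address', 'average consumer price',
            'store rating', 'number of comments')
    rows = zip(titles, addresses, avgprices, avgscores, comments)
    # re.findall returns strings, so the numeric fields stay as text here.
    return [dict(zip(keys, row)) for row in rows]


def to_json_line(record):
    """Serialize one record as a JSON line, keeping Chinese text readable."""
    return json.dumps(record, ensure_ascii=False) + '\n'


if __name__ == '__main__':
    for record in extract_records(SAMPLE):
        print(to_json_line(record), end='')
        # Without ensure_ascii=False the same record is written with \u escapes.
        print(json.dumps(record))
```

The first printed line keeps the Chinese characters as they are; the second shows the escaped form that the article describes as garbled.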
The running results are as follows:
That concludes this article on how to crawl Meituan's food data with Python. I hope it is of some help. If you found it useful, please share it so more people can see it.