This article mainly introduces how to use python to crawl Meituan data. It is fairly detailed and should serve as a useful reference; interested readers are encouraged to read on!
1. Overview
In fact, crawling Meituan is fairly simple overall. Once you find the real data request address through developer mode, the data returned by requests is a standard json string, which is very easy to handle.
In this article we introduce two common ways of obtaining data: one through search results, and one through category filtering. The two differ slightly in how the real data request address is obtained, as we will see below.
2. Search results data collection
On the PC page you need to be logged in to your account before you can search.
2.1. Get the real data request address
As shown in the following picture, we first enter a search keyword such as "hot pot" in the search box, then take the following steps (using Chrome as an example):
Press F12 to enter developer mode
Click Network -> XHR at the top of the developer tools panel on the right
Scroll to the bottom of the search results page on the left and click the next page (such as 2)
Then click Name in the developer tools panel on the right and click the first uuid entry to show the information we need
Get the real data request address
We can see that the Request URL is the real data request address; its base part (for Beijing) is https://apimobile.meituan.com/group/v4/poi/pcsearch/1?...
The final request also needs the following parameters, which can be seen under Query String Parameters.
uuid:    # your uuid, obtained in developer mode after login
userid:  # your userid, obtained in developer mode after login
limit: 32    # number of store records per page
offset: 32   # current offset; page 1 is 0, page 2 is (2-1)*limit (see the paging sketch below)
cateId: -1   # type
q: hot pot   # search keyword
token: htN9Eu125pjzVK798YMMXMkJDi8AAAAAZgwAAAW6Dw8Qi2ZMfzNT1glWhl_WHtjjasoQfOZFt_VQdtMpw4VHEYL5DNiMixUTOVxTPw   # token; testing shows it is not actually required
If you select other filter items, you will find that the corresponding parameters increase or decrease accordingly; set the parameters according to your own needs.
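To make the offset/page relationship in the list above concrete, here is a minimal sketch of how the query parameters could be built for several pages; the page range is an arbitrary assumption, and the parameter names are the ones listed above.

limit = 32
for page in range(1, 4):  # pages 1 to 3, purely for illustration
    parameters = {
        'limit': limit,
        'offset': limit * (page - 1),  # page 1 -> 0, page 2 -> 32, page 3 -> 64
        'cateId': -1,
        'q': 'hotpot',
    }
    print(page, parameters['offset'])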
2.2. Request data (requests)
Here we only need two libraries: requests to request the data and json to parse the json string.
import requests
import json

# Take Beijing as an example; the basic link is as follows.
# Shanghai differs only in the number before the question mark: Shanghai is 10, Beijing is 1,
# i.e. 'https://apimobile.meituan.com/group/v4/poi/pcsearch/10?'
base_url = 'https://apimobile.meituan.com/group/v4/poi/pcsearch/1?'

uuid = 'xxxx'    # your uuid, obtained in developer mode after login
userid = 'xxx'   # your userid, obtained in developer mode after login
key = 'hotpot'   # search keyword

# Here we request the first page of data
page = 1

# Set the request parameters
parameters = {
    'uuid': uuid,               # your uuid, obtained in developer mode after login
    'userid': userid,           # your userid, obtained in developer mode after login
    'limit': 32,                # number of store records per page
    'offset': 32 * (page - 1),  # current offset: page 1 is 0, page 2 is (2-1)*limit
    'cateId': -1,               # type
    'q': key,                   # search keyword
}

# Set the request headers
header = {
    "Accept-Encoding": "Gzip",  # use gzip compression so transfers are faster
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:83.0) Gecko/20100101 Firefox/83.0",
}

# GET the page data; re.text is the data we need
re = requests.get(base_url, headers=header, params=parameters)
text = re.text
# Since it is a json-format string, parse it with json.loads()
js = json.loads(text)
data = js['data']
searchResult = data['searchResult']  # result list
The result is actually very structured.
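For orientation, an illustrative (entirely made-up) example of what a single searchResult entry looks like is sketched below; the field names are the ones used in the parsing code that follows, while the values are placeholders.

# Hypothetical shape of one searchResult element; values are placeholders, not real data
example_item = {
    'id': 123456,
    'title': 'Some Hotpot Restaurant',
    'address': 'No. 1 Some Road',
    'areaname': 'Some District',
    'avgprice': 100,
    'avgscore': 4.6,
    'comments': 2345,
    'backCateName': 'Hotpot',
    'longitude': 116.40,
    'latitude': 39.90,
    'lowestprice': 58.0,
    'deals': [
        {'id': 1, 'title': 'Set meal for two', 'value': 200, 'price': 158, 'sales': 1000},
    ],
}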
2.3. Parsing to get the data you need
Because a single page contains 32 records and each record has many fields, we use a list to hold the page's data and a dictionary for each store's information.
Since I want to record each store's coupon sales, and the coupons are stored as a list, I handle the coupon records separately and combine each one with the store's basic information to form a row. As a result, the same store can appear several times in the final data, once per coupon.
# Continues from 2.2
shops = []
for dic in searchResult:
    shop = {}
    shop['id'] = dic['id']
    shop['store name'] = dic['title']
    shop['address'] = dic['address']
    shop['region'] = dic['areaname']
    shop['average consumption'] = dic['avgprice']
    shop['score'] = dic['avgscore']
    shop['evaluations'] = dic['comments']
    shop['type'] = dic['backCateName']
    # shop['coupons'] = dic['deals']
    shop['longitude'] = dic['longitude']
    shop['latitude'] = dic['latitude']
    shop['minimum consumption'] = dic['lowestprice']
    if dic['deals'] is None:
        shops.append(shop)
    else:
        for deal in dic['deals']:
            shop_info = shop.copy()
            shop_info['coupon id'] = deal['id']
            shop_info['coupon name'] = deal['title']
            shop_info['original value of coupon'] = deal['value']
            shop_info['coupon price'] = deal['price']
            shop_info['coupon sales'] = deal['sales']
            shops.append(shop_info)
Store data list
2.4. Store the results locally (csv file)
Each time we collect a page of data we write it to disk immediately. Since each page ends up as a list of dictionaries, we can build a pandas DataFrame and append it with the DataFrame.to_csv() method.
# Continues from 2.3
import pandas as pd

df = pd.DataFrame(shops)
# The first page also stores the header
if page == 1:
    df.to_csv('hotpot store data.csv', index=False, mode='a+')
# Later pages skip the header; mode='a+' appends to the file
else:
    df.to_csv('hotpot store data.csv', header=False, index=False, mode='a+')
Exported data sheet
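Putting 2.2 through 2.4 together, a multi-page collection loop might look like the sketch below. The page range, the sleep interval and the file name are assumptions; the request, parsing and CSV steps are the ones shown above (the per-field parsing from 2.3 is abbreviated).

import time
import json
import requests
import pandas as pd

base_url = 'https://apimobile.meituan.com/group/v4/poi/pcsearch/1?'
uuid = 'xxxx'    # your uuid, from developer mode after login
userid = 'xxx'   # your userid, from developer mode after login
key = 'hotpot'

header = {
    "Accept-Encoding": "Gzip",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:83.0) Gecko/20100101 Firefox/83.0",
}

for page in range(1, 6):  # pages 1 to 5; adjust as needed
    parameters = {
        'uuid': uuid,
        'userid': userid,
        'limit': 32,
        'offset': 32 * (page - 1),
        'cateId': -1,
        'q': key,
    }
    re = requests.get(base_url, headers=header, params=parameters)
    searchResult = json.loads(re.text)['data']['searchResult']

    shops = []
    for dic in searchResult:
        shop = {'id': dic['id'], 'store name': dic['title'], 'address': dic['address']}
        # ... remaining fields and coupon handling as in 2.3 ...
        shops.append(shop)

    df = pd.DataFrame(shops)
    # Write the header only for the first page, then append
    df.to_csv('hotpot store data.csv', index=False, header=(page == 1), mode='a+')
    time.sleep(2)  # pause between requests to be polite to the server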
3. Data acquisition of classification and screening results
When filtering by category, we do not need to be logged in to get the data. However, analysis shows that the real data request address in this case differs from the one used for search results, and obtaining it is a little more troublesome, as we can see in the following analysis.
3.1. Get the real data request address
As with the search results collection in section 2, perform the following steps (using Chrome as an example):
Press F12 to enter developer mode
Click Network -> XHR at the top of the developer tools panel on the right
Refresh the page or click the next page (such as 2)
Then click Name in the developer tools panel on the right and click getPoiList? to get the information we need
Get the real data request address
We can see that the Request URL is the real data request address; its base part (for Beijing) is https://bj.meituan.com/meishi/api/poi/getPoiList?
The final request needs the following parameters, which can be seen under Query String Parameters.
cityName: Beijing   # city
cateId: 17   # category
areaId: 0   # region
sort:   # sort type
dinnerCountAttrId:   # number of diners
page: 2
userId:
uuid:
platform: 1   # platform
partner: 126
originUrl: https://bj.meituan.com/meishi/c17/pn2/
riskLevel: 1
optimusCode: 10
_token:
Among these parameters, page and originUrl change in a regular way as you turn pages, and _token also changes; this last value needs special handling to obtain. The other parameters are fixed once you have chosen the page filters.
3.2. _token parsing and generation
I refreshed the page three times and got the following three tokens. Let's work out what parameter information they contain.
token = [
    'eJx1T01vqkAU/S+zhTjDlzDusOLriLYqKkrTBTMgwyCKQBFt+t/fNGkXb/GSm5yPe3Jy7yeoSQJGGkIYIRV0aQ1GQBugwRCooG3kZogcjGyMho4lA+xfDxvSo/VuAkZvtm6p2MLv38Za6jcN60jVkIPe1V9uSq6bcr5TRIYAb9uqGUFIxaBM8/YjPg/YpYSSNzyHTLOhPOQ/ISBbyo1skVj8YPyD7a9eyIdkRZNnZ8nS2S0RTGtd4a34vuOm/wK3874mXbFih8uWbZtxJNDcw1SPRf8nyBSruhXXy9i+EC16tfFyo+DCyzo37dbR5vDKs7Xw+LHHT24HHWVvMHqf5Y9iYiPhw8VuscwpuVcnaswOxAj55slTomXY3aM01ruA6zBy58W09llw8pNhf95F3uGknSfXKrC9YxJTO5iEs+OY+c8hbaamYrrV3NKupAuzHRSY3hTRPjhqxXOkZ0bZOziaohey0OsyaZlFiL3flg9IDJTr4OsvJZyVbQ==',
    'eJyFT8luo0AQ/Ze+GrkXzNKWcrABB/DYBLPYTJSDWYJpDGZYTMJo/j0dKTnkNFJJb6mnp6q/oLVSsMQIUYQEcM9asAR4juYyEEDf8Y2MVIoUKhKJYgEkPz1ZpgKI21AHy2eFSAKV6MunceD6GVOCBIxU9CJ88wXnZMHnM2XxELj0fdMtIYzZvMqKfjjX8+RWQc67SwETrMCmJpAf8/8g4KWVz0s5ll94/sL+W+/4f7ytK/Kas8we08nvnbVuuOu9psSO7tZvj5nq2TjXf2tu5Sajfh/IERNom+lhsBxp864Z4iobVskxiEkemqdx84rZSsPHabrkm9bw7hNUiiddUV/rCXqB1kQ+vpr3Yj2LZK+wTk3P4q2bGDd0HNKtiYI6W+yalLTUl8IgKhWm2PsgweJWLsPWuZVFdC1aNfe13R9GyhG9T1ulrn/tI+twTdteS+loElna3VimPs3EdGxyNpO6JH6MK5WkohedAqObzrPMwezGWHrwTLhYqSS0af7wAP59AGUUnIY=',
    'eJyNT8tugkAU/ZfZQpwBxGFMuqhPQOUhYgpNFwgUUWAsDGBp+u+dJu2iuyY3OY97cnLvB6iNBEwlhAhCIujSGkyBNEKjCRABa/hmgjSCMFEnElFFEP/1ZCyJ4FQfF2D6jGVVJCp5+Tb2XD9LREaihDT0Iv7yMefymM93yuAhcGbs1kwhPF1GZZqzNqpGMS0h6805h7GE4a1SID/mP0EZAl5cHngxx+sPRj/IfvWO/8gbmzyrOEvNPhkOkv04LN1ZOsc38+I5fZUm4Symq72RG+OFbXfhfa0TCpdBfujcuC/sXq8yvTj7+L1NFcZUt9OCytiudj2m6/BkYE0T6GtHBGdQITXcGAWTIsRqFhbruV8XpU5xEczszXvmb4c1RHbUDtunxFNKIQ+8fauRfONfbSa3fq6qxeDGpnSEru4H7kUuK3J3/OHVijaLq7WyunqJ8dli/aYJnFRzBIXc89hO3oY+KZuEmUrIlFO2rKl3LLC1C99o4HvRF+tkXpjC48MD+PwC/5KgHA==',
]
The encryption used for these tokens is relatively simple: binary (zlib) compression followed by base64 encoding.
3.2.1. Analysis
Parsing a token is the reverse: base64-decode it and then decompress it. The base64 and zlib libraries are needed here.
import base64
import zlib

for s in token:
    temp = base64.b64decode(s)
    result = zlib.decompress(temp)
    print(result)

b'{"rId":100900,"ver":"1.0.6","ts":1608907906850,"cts":1608907906930,"brVD":[725,959],"brR":[[1920,1080],[1920,1040],24,24],"bI":["https://bj.meituan.com/meishi/c17/","https://bj.meituan.com/"],"mT":[],"kT":[],"aT":[],"tT":[],"aM":"","sign":"eJwdjc1tAjEQhXvh5KN/ULxrIvkQcYoUcUsBZj0LE9b2ajxGSg+5pwkqoB7oI1ZO79PT+9kEgvAevRZTYOhgRjEhfx9CAv/8+X3cbyJizkD70jK/MVMPibIyplb3JYI3WhTCE+ZPWvyZea2vSh3/ZALkFrKcSlKd6xnVZEYl1nDqpS7Efdab7SDWJfBcKHWbsF4+4ApL51qIvWgV/j9bw+jtzh0tjHZ2g3mx89ZF0NIM2rmdtc5II7XUmz/I30i2"}'
b'{"rId":100900,"ver":"1.0.6","ts":1608907932591,"cts":1608907932669,"brVD":[725,959],"brR":[[1920,1080],[1920,1040],24,24],"bI":["https://bj.meituan.com/meishi/c17/pn2/","https://bj.meituan.com/meishi/c17/"],"mT":[],"kT":[],"aT":[],"tT":[],"aM":"","sign":"eJwdzTtOBDEQBNC7bODQnxGe8SJ1gDZCQmQcwDvu2W12/JHdRuIO5FyCE3AeuAcWUb2gVHXwFf1jAC1WzzhgFnESvz/7iPD78fnz/SUCpYT1lHviB+Y6SiIXptjbKQcEo0WudKH0Une4Mpd2r9T5VUYk7j7JNUc13K6kVrOokiYlir8gTCMqj2kw0yzK7nnLNYIRldrtCd9wH265Moje8P+3dwpgj+5scbGbm82d3SYXUEsza+eO1jojjdRSH/4A82VJ9g=="}'
b'{"rId":100900,"ver":"1.0.6","ts":1608907956195,"cts":1608907956271,"brVD":[725,959],"brR":[[1920,1080],[1920,1040],24,24],"bI":["https://bj.meituan.com/meishi/c17/pn3/","https://bj.meituan.com/meishi/c17/pn2/"],"mT":[],"kT":[],"aT":[],"tT":[],"aM":"","sign":"eJwdzT1OAzEQBeC7pJjSPwnedZBcoFRIiI4DOOvZxGH9o/EYiTvQcwlOwHngHlhU7yue3tt5Qv8YnILFMw7oGZbI788+ofv9+Pz5/oIQc0Y6lZ75gZlGCUrlmHo7lYBOKygULzG/0OauzLXdS3m+iYSRu89iKUkOt2uUi55lzQcJ1V/QHUYQj2mn9xPUzfNaKDkNFNvrE77hNtwKsYPe8P+39xicOdqzwdmsdtJ3Zt3bgEroSVl7NMZqoYUSavcH9ClJ+A=="}'
Comparing the three parsed results, we find many parameters but only a few that change; among them, ts and cts are timestamps. Take the first token as an example:
ts = 1608907906850
cts = 1608907906930
Since both are millisecond timestamps, they refer to essentially the same moment, about 90 milliseconds apart, so when making a request we can simply take the current time and derive the ts and cts values from it.
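A minimal sketch of generating ts and cts at request time, following the observation above; the 90 ms gap is just the approximate difference seen in the captured tokens, not a documented requirement.

from datetime import datetime

ts = int(datetime.now().timestamp() * 1000)  # millisecond timestamp at request time
cts = ts + 90                                # roughly 90 ms later, matching the captured tokens

# Sanity check: the captured value is indeed a millisecond timestamp (late December 2020)
print(datetime.fromtimestamp(1608907906850 / 1000))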
The other value that changes is bI, which holds the original URLs of the current page and the preceding page; it can also be generated according to a regular pattern.
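Since bI is just the current page URL plus the preceding one, and the captured tokens show .../c17/ for page 1 and .../c17/pn{n}/ for page n, it can be derived from the page number. The helper below is a sketch under that assumed URL pattern.

def build_bI(page, base='https://bj.meituan.com/meishi/c17/'):
    # Current page URL: .../c17/ for page 1, .../c17/pn{n}/ for page n
    current = base if page == 1 else f'{base}pn{page}/'
    # Preceding page URL, following the pattern seen in the captured tokens
    if page == 1:
        previous = 'https://bj.meituan.com/'  # the site home page, as in the first captured token
    elif page == 2:
        previous = base
    else:
        previous = f'{base}pn{page - 1}/'
    return [current, previous]

print(build_bI(2))  # ['https://bj.meituan.com/meishi/c17/pn2/', 'https://bj.meituan.com/meishi/c17/']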
Finally, sign also changes, and it looks a lot like a token; using the same parsing approach we can find its pattern:
s = 'eJwdjc1tAjEQhXvh5KN/ULxrIvkQcYoUcUsBZj0LE9b2ajxGSg+5pwkqoB7oI1ZO79PT+9kEgvAevRZTYOhgRjEhfx9CAv/8+X3cbyJizkD70jK/MVMPibIyplb3JYI3WhTCE+ZPWvyZea2vSh3/ZALkFrKcSlKd6xnVZEYl1nDqpS7Efdab7SDWJfBcKHWbsF4+4ApL51qIvWgV/j9bw+jtzh0tjHZ2g3mx89ZF0NIM2rmdtc5II7XUmz/I30i2'
temp = base64.b64decode(s)
result = zlib.decompress(temp)
print(result)

b'"areaId=0&cateId=17&cityName=\xe5\x8c\x97\xe4\xba\xac&dinnerCountAttrId=&optimusCode=10&originUrl=https://bj.meituan.com/meishi/c17/&page=1&partner=126&platform=1&riskLevel=1&sort=&userId=&uuid=598b5e75f86145f28de0.1608895581.1.0.0"'
It turns out that sign is simply the request parameter string joined together, in which the originUrl and page parameters correspond to the page currently being accessed.
3.2.2. Generate
Once we know how a token is parsed, reversing the process lets us generate one. We take a parsed token as a template and build the generation steps.
# Parsed token format
{'rId': 100900,
 'ver': '1.0.6',
 'ts': 1608907906850,
 'cts': 1608907906930,
 'brVD': [725, 959],
 'brR': [[1920, 1080], [1920, 1040], 24, 24],
 'bI': ['https://bj.meituan.com/meishi/c17/', 'https://bj.meituan.com/'],
 'mT': [],
 'kT': [],
 'aT': [],
 'tT': [],
 'aM': '',
 'sign': 'eJwdjc1tAjEQhXvh5KN/ULxrIvkQcYoUcUsBZj0LE9b2ajxGSg+5pwkqoB7oI1ZO79PT+9kEgvAevRZTYOhgRjEhfx9CAv/8+X3cbyJizkD70jK/MVMPibIyplb3JYI3WhTCE+ZPWvyZea2vSh3/ZALkFrKcSlKd6xnVZEYl1nDqpS7Efdab7SDWJfBcKHWbsF4+4ApL51qIvWgV/j9bw+jtzh0tjHZ2g3mx89ZF0NIM2rmdtc5II7XUmz/I30i2'}
For sign, let's also look at the parsed result:
b'"areaId=0&cateId=17&cityName=\xe5\x8c\x97\xe4\xba\xac&dinnerCountAttrId=&optimusCode=10&originUrl=https://bj.meituan.com/meishi/c17/&page=1&partner=126&platform=1&riskLevel=1&sort=&userId=&uuid=598b5e75f86145f28de0.1608895581.1.0.0"'
Therefore, we construct the sign before constructing the token:
s_sign = f'"areaId=0&cateId=17&cityName=北京&dinnerCountAttrId=&optimusCode=10&originUrl={Url}&page={page}&partner=126&platform=1&riskLevel=1&sort=&userId=&uuid=598b5e75f86145f28de0.1608895581.1.0.0"'  # cityName is 北京 (Beijing)
After the city is selected, the sign variable parameters are originUrl and page.
Then generate the sign:
# Construct sign
Url = 'https://bj.meituan.com/meishi/c17/'
page = 1
s_sign = f'"areaId=0&cateId=17&cityName=北京&dinnerCountAttrId=&optimusCode=10&originUrl={Url}&page={page}&partner=126&platform=1&riskLevel=1&sort=&userId=&uuid=598b5e75f86145f28de0.1608895581.1.0.0"'
# Encode to bytes
encode = s_sign.encode()
# Binary (zlib) compression
compress = zlib.compress(encode)
# base64 encoding
b_encode = base64.b64encode(compress)
# Convert to string
sign = str(b_encode, encoding='utf-8')
The resulting sign, which will be used to construct the token, is as follows:
'eJwdjc1tAjEQhXvh5KN/ULxrIvkQcYoUcUsBZj0LE9b2ajxGSg+5pwkqoB7oI1ZO79PT+9kEgvAevRZTYOhgRjEhfx9CAv/8+X3cbyJizkD70jK/MVMPibIyplb3JYI3WhTCE+ZPWvyZea2vSh3/ZALkFrKcSlKd6xnVZEYl1nDqpS7Efdab7SDWJfBcKHWbsF4+4ApL51qIvWgV/j9bw+jtzh0tjHZ2g3mx89ZF0NIM2rmdtc5II7XUmz/I30i2'
Then construct the token:
# Continues from the sign above
from datetime import datetime

ts = int(datetime.now().timestamp() * 1000)
beforeUrl = 'https://bj.meituan.com/'
bI = [Url, beforeUrl]
dic_token = {
    'rId': 100900,
    'ver': '1.0.6',
    'ts': ts,
    'cts': ts + 90,
    'brVD': [725, 959],
    'brR': [[1920, 1080], [1920, 1040], 24, 24],
    'bI': bI,
    'mT': [],
    'kT': [],
    'aT': [],
    'tT': [],
    'aM': '',
    'sign': sign,
}
# Encode to bytes
encode = str(dic_token).encode()
# Binary (zlib) compression
compress = zlib.compress(encode)
# base64 encoding
b_encode = base64.b64encode(compress)
# Convert to string
token = str(b_encode, encoding='utf-8')
The token is as follows:
'eJx1j01zqjAARf+KOxZ0TMKHMd1hxTairRUVpdMFBASCKAKNaKf//ZEufDNv5i0yc+7JnTvJt1LRSHnsIQgJhA89RcRVFxXUh/2B0uWmlrcDOCRdAcOBiTrJ/rHYkDasNuNOf2DNfOgRk3z+uqVUH4ho3TqCQ9jZezJk0gx5fstUdpW0acr6EYCQ94s4a76CY5+dCtBxnWaAIQzkw/7TUuRQsZJDkvI7BXdq/rq5/Ktcq7PkKDmeXiLOUGNx+z3ditRwXsF61lZU5O9sd1qzdT3yOZzZJNQC3j67iWqWl/x8GuETRf4bJouVSnI7EVYslv5q95YmS26n+5Y8WQIM1a3Owus0u+VjDLkD5pv5IgvptTyE+nRHdS9dPdmqv/DE1Y8DTbipBnxrlk8qh7kHJxq0x41v7w7oOD6XLrb3URBid+xN9yPmvHhhPTFUwypnJjpT4SUbwEl4UXlzS2HDX3wt0Yt2SPwJfKVzrSqihpmU4u26uAGqw0xTfv4AOoWaFA=='

3.3. Request data (requests)
As with the search results collection, the data is requested with requests and the returned json string is parsed with json.
import requests
import json
import base64
import zlib
from datetime import datetime

# Take Beijing as an example; the basic link is as follows
base_url = 'https://bj.meituan.com/meishi/api/poi/getPoiList?'
uuid = '598b5e75f86145f28de0.1608895581.1.0.0'  # your uuid, obtained in developer mode after login

# Here we request the first page of data
page = 1

# Construct sign
Url = 'https://bj.meituan.com/meishi/c17/'
s_sign = f'"areaId=0&cateId=17&cityName=北京&dinnerCountAttrId=&optimusCode=10&originUrl={Url}&page={page}&partner=126&platform=1&riskLevel=1&sort=&userId=&uuid=598b5e75f86145f28de0.1608895581.1.0.0"'
encode = s_sign.encode()                 # encode to bytes
compress = zlib.compress(encode)         # binary (zlib) compression
b_encode = base64.b64encode(compress)    # base64 encoding
sign = str(b_encode, encoding='utf-8')   # convert to string

# Construct token from the sign
ts = int(datetime.now().timestamp() * 1000)
beforeUrl = 'https://bj.meituan.com/'
bI = [Url, beforeUrl]
dic_token = {
    'rId': 100900,
    'ver': '1.0.6',
    'ts': ts,
    'cts': ts + 90,
    'brVD': [725, 959],
    'brR': [[1920, 1080], [1920, 1040], 24, 24],
    'bI': bI,
    'mT': [],
    'kT': [],
    'aT': [],
    'tT': [],
    'aM': '',
    'sign': sign,
}
encode = str(dic_token).encode()         # encode to bytes
compress = zlib.compress(encode)         # binary (zlib) compression
b_encode = base64.b64encode(compress)    # base64 encoding
token = str(b_encode, encoding='utf-8')  # convert to string

# Set the request parameters
parameters = {
    'cityName': '北京',        # city (Beijing)
    'cateId': 17,              # category
    'areaId': 0,               # region
    'sort': '',                # sort type
    'dinnerCountAttrId': '',   # number of diners
    'page': page,
    'userId': '',
    'uuid': uuid,
    'platform': 1,             # platform
    'partner': 126,
    'originUrl': Url,
    'riskLevel': 1,
    'optimusCode': 10,
    '_token': token,
}

# Set the request headers
header = {
    "Accept-Encoding": "Gzip",  # use gzip compression so transfers are faster
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:83.0) Gecko/20100101 Firefox/83.0",
}

# GET the page data; re.text is the data we need
re = requests.get(base_url, headers=header, params=parameters)
text = re.text
# Since it is a json-format string, parse it with json.loads()
js = json.loads(text)
data = js['data']
poiInfos = data['poiInfos']  # result list
Result list
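For multi-page collection, the sign and _token construction from 3.2 can be wrapped in a small helper and reused for each page. The sketch below makes the same assumptions as above (fixed city 北京, category 17, the brVD/brR values copied from the captured tokens, and the uuid placeholder); it is a convenience wrapper added here for illustration, not part of the original walkthrough.

import base64
import zlib
from datetime import datetime

def build_token(page, uuid, base='https://bj.meituan.com/meishi/c17/'):
    # originUrl: .../c17/ for page 1, .../c17/pn{n}/ for page n
    url = base if page == 1 else f'{base}pn{page}/'
    # sign is the request parameter string, zlib-compressed and base64-encoded (see 3.2)
    s_sign = (f'"areaId=0&cateId=17&cityName=北京&dinnerCountAttrId=&optimusCode=10'
              f'&originUrl={url}&page={page}&partner=126&platform=1&riskLevel=1'
              f'&sort=&userId=&uuid={uuid}"')
    sign = str(base64.b64encode(zlib.compress(s_sign.encode())), encoding='utf-8')
    # token wraps the sign together with timestamps and browser-like metadata
    ts = int(datetime.now().timestamp() * 1000)
    dic_token = {'rId': 100900, 'ver': '1.0.6', 'ts': ts, 'cts': ts + 90,
                 'brVD': [725, 959], 'brR': [[1920, 1080], [1920, 1040], 24, 24],
                 'bI': [url, 'https://bj.meituan.com/'], 'mT': [], 'kT': [],
                 'aT': [], 'tT': [], 'aM': '', 'sign': sign}
    token = str(base64.b64encode(zlib.compress(str(dic_token).encode())), encoding='utf-8')
    return url, token

# Example: build the originUrl and _token for page 2, then plug them into 'parameters'
Url, token = build_token(2, '598b5e75f86145f28de0.1608895581.1.0.0')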
The above is the full content of "how to use python to crawl Meituan data". Thank you for reading, and I hope it helps!