I have long been selling large amounts of Weibo data and travel-website review data, and I also provide custom data-crawling services. Message YuboonaZhang@Yahoo.com. You are also welcome to join the social media data exchange group: 99918768
Preface
To obtain multi-source data, I need to gather comment information and pictures for some scenic spots from various websites. Ctrip and Mafengwo were chosen first, and this post records part of the crawling process.
Analyzing Ctrip's data
First, let's open Ctrip's Gulangyu attraction page to take a quick look at the pages we want to crawl. There are dozens of scenic spots, and the structure of each one should be similar, so we pick the first scenic spot and go in to see how its page should be crawled.
What we need is the part in the red circle. It is easy to see that the comment area is loaded dynamically, so we cannot extract the elements directly with bs4 or regular expressions; we need to analyze the interface the page uses to transfer data. Open Chrome's Inspect panel, switch to the Network tab, clear the existing entries to avoid interference, and then click to the next page, and we get:
By looking at the returned data, we can tell this is the interface we want. It uses POST, and the submitted Form Data has quite a few fields, whose meaning can be roughly guessed:
poiID is the scenic spot's poiID
pagenow is the current page number
star is the rating from 1 to 5 stars; 0 means all ratings
resourceId is a value corresponding to each resource
When crawling, you only need to change these values to fetch the content you want. The thing to pay attention to is that Ctrip's pagenow can only go up to 100 pages, and the values of poiID and resourceId follow no obvious pattern, so they have to be looked up one by one. I collected the values for all the scenic spots in Gulangyu and stored them in a text file, which is shared on GitHub at the end of the article.
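To illustrate how these fields fit together, here is a minimal sketch of such a POST request. The endpoint URL only appears in the screenshots, so COMMENT_API below is a placeholder, and the function name is my own rather than the original code.

import requests

# Hypothetical sketch: COMMENT_API stands for the POST interface captured in
# the DevTools Network tab (not reproduced in this post).
COMMENT_API = 'https://you.ctrip.com/...'
headers = {'User-Agent': 'Mozilla/5.0'}

def fetch_ctrip_comments(poi_id, resource_id, pagenow, star=0):
    # star=0 requests all ratings; Ctrip only serves pagenow up to 100
    form_data = {
        'poiID': poi_id,
        'pagenow': pagenow,
        'star': star,
        'resourceId': resource_id,
    }
    resp = requests.post(COMMENT_API, data=form_data, headers=headers)
    return resp.text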
Build a database
The first thing we need to do is figure out the structure of the database. I chose MySQL. The specific structure is as follows:
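The original table layout was shown as an image that is not reproduced here. As a rough guess at what such a table might look like for the fields discussed in this post (the column names are my assumption, not the author's actual schema):

import pymysql

# Hypothetical schema sketch; adjust the columns to whatever fields you store.
conn = pymysql.connect(host='localhost', user='root', password='***',
                       database='scenery', charset='utf8mb4')
with conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS ctrip_comments (
            id INT AUTO_INCREMENT PRIMARY KEY,
            scenery VARCHAR(64),     -- scenic spot name
            score FLOAT,             -- overall rating
            content TEXT,            -- comment text
            comment_date DATE        -- date of the comment
        ) DEFAULT CHARSET = utf8mb4
    """)
conn.commit()
conn.close()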
Get data
I won't analyze the code in detail; it's not difficult, but there are a few pitfalls to watch out for.
First, not all comments carry sub-ratings such as scenery or value for money, so add a check here.
Second, there used to be a travel-time field, but it seems to be gone now.
Third, the comment text may contain single quotation marks, which will cause an error when inserting into the database, so escape or replace them.
Fourth, don't crawl too fast; Ctrip's anti-scraping is quite aggressive.
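A small sketch of how the first and third pitfalls might be handled (the helper names are mine, not from the original code):

def optional_score(scores, key):
    # not every comment carries sub-ratings such as scenery or value for money,
    # so return None instead of raising when the field is absent
    return scores.get(key) if scores else None

def safe_text(comment):
    # single quotes break a hand-built INSERT statement, so escape them;
    # using parameterized queries with pymysql would avoid this entirely
    return comment.replace("'", "\\'")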
Analyzing Mafengwo's data
Similarly, Mafengwo's data is also loaded dynamically, and the data interface is analyzed in the same way.
We can see that Mafengwo fetches its data with GET, and we can work out the pattern of the requested URL. Comparing different scenic spots and different pages, the parameters change in two places: one is the poiid, which I represent with href, and the other is the page number, which I represent with num. To get a scenic spot's comment data, just change these two values.
url = 'http://pagelet.mafengwo.cn/poi/pagelet/poiCommentListApi?callback=jQuery18105332634542482972_1511924148475&params=%7B%22poi_id%22%3A%22{href}%22%2C%22page%22%3A{num}%2C%22justinitiate%22%3A1%7D'
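As a minimal sketch of fetching one page with this URL (the response is JSONP, so the callback wrapper has to be stripped first; the function name is my own):

import json
import re
import requests

def fetch_mafengwo_comments(href, num):
    # build the request from the URL pattern above
    url = ('http://pagelet.mafengwo.cn/poi/pagelet/poiCommentListApi'
           '?callback=jQuery18105332634542482972_1511924148475'
           '&params=%7B%22poi_id%22%3A%22{href}%22%2C%22page%22%3A{num}'
           '%2C%22justinitiate%22%3A1%7D').format(href=href, num=num)
    resp = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    # the response looks like jQuery...({...}); pull the JSON body out of the wrapper
    body = re.search(r'\((.*)\)', resp.text, re.S).group(1)
    return json.loads(body)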
Get the poiid of each scenic spot
Here we don't have to dig out the parameters for each scenic spot by hand as we did with Ctrip; we can visit the attraction list page and find all of them at once. However, this page's data is also loaded dynamically.
From the screenshot above, we can see that we only need to pass in a page number to get the poiids of all the scenic spots, and with these poiids we can then fetch all the comment data. A function handles this part:
import re
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}

def get_param():
    # get the parameters (poiid) of all scenic spots
    total = []
    router_url = 'http://www.mafengwo.cn/ajax/router.php'
    for num in range(1, 6):
        params = {
            'sAct': 'KMdd_StructWebAjax|GetPoisByTag',
            'iMddid': 12522,
            'iTagId': 0,
            'iPage': num
        }
        pos = requests.post(url=router_url, data=params, headers=headers).json()
        soup_pos = BeautifulSoup(pos['data']['list'], 'lxml')
        result = [{'scenery': p['title'],
                   'href': re.findall(r'/poi/(\d+)\.html', p['href'])[0]}
                  for p in soup_pos.find_all('a')]
        total.extend(result)
    return total
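A short hypothetical driver showing how this function and the comment URL above fit together (the loop and names are my own sketch, not the original code):

for spot in get_param():
    for page in range(1, 6):   # however many pages you need
        data = fetch_mafengwo_comments(spot['href'], page)
        # ... parse the returned comment HTML with BeautifulSoup and insert it into MySQL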
The rest of the code is similar and won't be elaborated here.
Personal blog: 8aoy1.cn