I have long been selling large amounts of Weibo data and travel-website review data, and I also provide custom data-crawling services. Message YuboonaZhang@Yahoo.com. You are also welcome to join the social media data exchange group: 99918768
Preface
To obtain multi-source data, I need to gather comment information and pictures for some scenic spots from various websites. Ctrip and Mafengwo were chosen first, and this post records part of the crawling process.
Analyzing Ctrip's data
First, let's open Ctrip's Gulangyu attraction page to take a quick look at the pages we want to crawl. There are dozens of scenic spots, and the structure of each one should be similar, so we pick the first scenic spot and go in to see how its page should be crawled.
What we need is the part in the red circle. It is easy to see that the comment area is loaded dynamically, so we cannot extract the elements directly with bs4 or regular expressions; we need to analyze the interface the page uses to transfer data. Open Chrome's Inspect panel, switch to the Network tab, clear the existing entries to avoid interference, and then click to the next page, and we get:
By looking at the returned data, we can tell this is the interface we want. It uses POST, and the submitted Form Data has quite a few fields, whose meaning can be roughly guessed:
poiID is the scenic spot's poiID
pagenow is the current page number
star is the rating from 1 to 5 stars; 0 means all ratings
resourceId is a value corresponding to each resource
When crawling, you only need to change these values to fetch the content you want. The thing to pay attention to is that Ctrip's pagenow can only go up to 100 pages, and the values of poiID and resourceId follow no obvious pattern, so they have to be looked up one by one. I collected the values for all the scenic spots in Gulangyu and stored them in a text file, which is shared on GitHub at the end of the article.
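To illustrate how these fields fit together, here is a minimal sketch of such a POST request. The endpoint URL only appears in the screenshots, so COMMENT_API below is a placeholder, and the function name is my own rather than the original code.

import requests

# Hypothetical sketch: COMMENT_API stands for the POST interface captured in
# the DevTools Network tab (not reproduced in this post).
COMMENT_API = 'https://you.ctrip.com/...'
headers = {'User-Agent': 'Mozilla/5.0'}

def fetch_ctrip_comments(poi_id, resource_id, pagenow, star=0):
    # star=0 requests all ratings; Ctrip only serves pagenow up to 100
    form_data = {
        'poiID': poi_id,
        'pagenow': pagenow,
        'star': star,
        'resourceId': resource_id,
    }
    resp = requests.post(COMMENT_API, data=form_data, headers=headers)
    return resp.text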
Build a database
The first thing we need to do is figure out the structure of the database. I chose MySQL. The specific structure is as follows:
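The original table layout was shown as an image that is not reproduced here. As a rough guess at what such a table might look like for the fields discussed in this post (the column names are my assumption, not the author's actual schema):

import pymysql

# Hypothetical schema sketch; adjust the columns to whatever fields you store.
conn = pymysql.connect(host='localhost', user='root', password='***',
                       database='scenery', charset='utf8mb4')
with conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS ctrip_comments (
            id INT AUTO_INCREMENT PRIMARY KEY,
            scenery VARCHAR(64),     -- scenic spot name
            score FLOAT,             -- overall rating
            content TEXT,            -- comment text
            comment_date DATE        -- date of the comment
        ) DEFAULT CHARSET = utf8mb4
    """)
conn.commit()
conn.close()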
Get data
I won't analyze the code in detail; it's not difficult, but there are a few pitfalls to watch out for.
First, not all comments carry sub-ratings such as scenery or value for money, so add a check here.
Second, there used to be a travel-time field, but it seems to be gone now.
Third, the comment text may contain single quotation marks, which will cause an error when inserting into the database, so escape or replace them.
Fourth, don't crawl too fast; Ctrip's anti-scraping is quite aggressive.
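A small sketch of how the first and third pitfalls might be handled (the helper names are mine, not from the original code):

def optional_score(scores, key):
    # not every comment carries sub-ratings such as scenery or value for money,
    # so return None instead of raising when the field is absent
    return scores.get(key) if scores else None

def safe_text(comment):
    # single quotes break a hand-built INSERT statement, so escape them;
    # using parameterized queries with pymysql would avoid this entirely
    return comment.replace("'", "\\'")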
Analyzing Mafengwo's data
Similarly, Mafengwo's data is also loaded dynamically, and the data interface is analyzed in the same way.
We can see that Mafengwo fetches its data with GET, and we can work out the pattern of the requested URL. Comparing different scenic spots and different pages, the parameters change in two places: one is the poiid, which I represent with href, and the other is the page number, which I represent with num. To get a scenic spot's comment data, just change these two values.
url = 'http://pagelet.mafengwo.cn/poi/pagelet/poiCommentListApi?callback=jQuery18105332634542482972_1511924148475&params=%7B%22poi_id%22%3A%22{href}%22%2C%22page%22%3A{num}%2C%22justinitiate%22%3A1%7D'
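As a minimal sketch of fetching one page with this URL (the response is JSONP, so the callback wrapper has to be stripped first; the function name is my own):

import json
import re
import requests

def fetch_mafengwo_comments(href, num):
    # build the request from the URL pattern above
    url = ('http://pagelet.mafengwo.cn/poi/pagelet/poiCommentListApi'
           '?callback=jQuery18105332634542482972_1511924148475'
           '&params=%7B%22poi_id%22%3A%22{href}%22%2C%22page%22%3A{num}'
           '%2C%22justinitiate%22%3A1%7D').format(href=href, num=num)
    resp = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    # the response looks like jQuery...({...}); pull the JSON body out of the wrapper
    body = re.search(r'\((.*)\)', resp.text, re.S).group(1)
    return json.loads(body)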
Get the poiid of each scenic spot
Here we don't have to dig out the parameters for each scenic spot by hand as we did with Ctrip; we can visit the attraction list page and find all of them at once. However, this page's data is also loaded dynamically.
From the screenshot above, we can see that we only need to pass in a page number to get the poiids of all the scenic spots, and with these poiids we can then fetch all the comment data. A function handles this part:
import re
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}

def get_param():
    # get the parameters (poiid) of all scenic spots
    total = []
    router_url = 'http://www.mafengwo.cn/ajax/router.php'
    for num in range(1, 6):
        params = {
            'sAct': 'KMdd_StructWebAjax|GetPoisByTag',
            'iMddid': 12522,
            'iTagId': 0,
            'iPage': num
        }
        pos = requests.post(url=router_url, data=params, headers=headers).json()
        soup_pos = BeautifulSoup(pos['data']['list'], 'lxml')
        result = [{'scenery': p['title'],
                   'href': re.findall(r'/poi/(\d+)\.html', p['href'])[0]}
                  for p in soup_pos.find_all('a')]
        total.extend(result)
    return total
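A short hypothetical driver showing how this function and the comment URL above fit together (the loop and names are my own sketch, not the original code):

for spot in get_param():
    for page in range(1, 6):   # however many pages you need
        data = fetch_mafengwo_comments(spot['href'], page)
        # ... parse the returned comment HTML with BeautifulSoup and insert it into MySQL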
The rest of the code is similar and won't be elaborated here.
Personal blog: 8aoy1.cn