How does Python get Amazon's comment information and process it?

2025-04-10 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/01 Report--

This article explains how to use Python to fetch Amazon review data and process it. The method is simple, fast, and practical; let's walk through it step by step.

I. Analyzing Amazon's review request

First, open the Network tab in the browser's developer tools, clear the log, and load the product page:

You'll find that the GET request under the Doc tab contains exactly the review information we want.

However, not all of the real review data is here. Scroll down the page and there is a button for turning the page:

Clicking it requests the next page: a new request appears under the Fetch/XHR tab, while no new GET request shows up under Doc. This tells us that all the review data comes from XHR-type requests.

From the captured request we get the POST URL and the payload data, which contains the parameters that control pagination. The real review request has been found.

What comes back is a blob of unprocessed data. Within it, the chunks carrying data-hook="review" are the ones that contain reviews. With the analysis done, let's write the request step by step.

II. Fetching Amazon review content

First, assemble the POST parameters and the request URL (so that the script can turn pages automatically later), then send the POST request with those parameters:

```python
import requests

headers = {
    'authority': 'www.amazon.it',
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36",
}
page = 1
post_data = {
    "sortBy": "recent",
    "reviewerType": "all_reviews",
    "formatType": "",
    "mediaType": "",
    "filterByStar": "",
    "filterByLanguage": "",
    "filterByKeyword": "",
    "shouldAppend": "undefined",
    "deviceType": "desktop",
    "canShowIntHeader": "undefined",
    "pageSize": "10",
    "asin": "B08GHGTGQ2",
}
# Assign the key payload parameters that control paging
post_data["pageNumber"] = page
post_data["reftag"] = f"cm_cr_getr_d_paging_btm_next_{page}"
post_data["scope"] = f"reviewsAjax{page}"
# Build the paging URL
spiderurl = f'https://www.amazon.it/hz/reviews-render/ajax/reviews/get/ref=cm_cr_getr_d_paging_btm_next_{page}'
res = requests.post(spiderurl, headers=headers, data=post_data)
if res and res.status_code == 200:
    res = res.content.decode('utf-8')
    print(res)
```

Now that we have this raw response, the next step is to process the data.

III. Processing Amazon review information

Looking at the response above, you can see that it is split into sections by "&&&", and each record inside a section is further separated by '","':

So use Python's split method to break the string into lists:

```python
# Process the returned string
contents = res.split('&&&')
for content in contents:
    infos = content.split('","')
```

由'","'分隔的数据通过split处理生成新的list列表,评论内容是列表的最后一个元素,去掉里面的"\","\n"和多余的符号,就可以通过css/xpath选择其进行处理了:

```python
from parsel import Selector  # also available as scrapy.selector.Selector

for content in contents:
    infos = content.split('","')
    info = infos[-1].replace('"]', '').replace('\\n', '').replace('\\', '')
    # Keep only the chunks that contain a review
    if 'data-hook="review"' in info:
        sel = Selector(text=info)
        data = {}
        data['username'] = sel.xpath('//span[@class="a-profile-name"]/text()').extract_first()       # username
        data['point'] = sel.xpath('//span[@class="a-icon-alt"]/text()').extract_first()              # rating
        data['date'] = sel.xpath('//span[@data-hook="review-date"]/text()').extract_first()          # date and location
        data['review'] = sel.xpath('//span[@data-hook="review-title"]/span/text()').extract_first()  # review title
        data['detail'] = sel.xpath('//span[@data-hook="review-body"]').extract_first()               # review body
        image = sel.xpath('//div[@class="review-image-tile-section"]').extract_first()
        data['image'] = image if image else "not image"                                              # image
        print(data)
```

IV. Putting the code together

4.1 Proxy setup

A stable IP proxy is the most powerful tool for data collection. At present it is still hard to reach Amazon reliably from mainland China, and connections often fail. Here I use an ipidea proxy to request the Italian Amazon site; proxies can be obtained either with account/password authentication or through an API, and the speed is quite stable.

Address: http://www.ipidea.net/?utm-source=csdn&utm-keyword=?wb
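Besides the API mode, account/password authentication is mentioned above. With requests, a credential-based proxy is just a URL with the credentials embedded; a minimal sketch (the host, port, and credentials below are placeholders, not real ipidea values):

```python
def make_auth_proxies(user, password, host, port):
    # Build a requests-style proxies dict with basic-auth credentials in the URL
    proxy_url = f"http://{user}:{password}@{host}:{port}"
    return {"http": proxy_url, "https": proxy_url}

proxies = make_auth_proxies("myuser", "mypass", "proxy.example.com", 2333)
print(proxies["http"])  # -> http://myuser:mypass@proxy.example.com:2333
# then: requests.post(spiderurl, headers=headers, data=post_data, proxies=proxies)
```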

The proxy is fetched as follows:

```python
    # Fetch an IP via the API
    def getApiIp(self):
        # Fetch one (and only one) IP -- Italy
        api_url = 'proxy API URL goes here'
        try:
            res = requests.get(api_url, timeout=5)
            if res.status_code == 200:
                api_data = res.json()['data'][0]
                proxies = {
                    'http': 'http://{}:{}'.format(api_data['ip'], api_data['port']),
                    'https': 'http://{}:{}'.format(api_data['ip'], api_data['port']),
                }
                print(proxies)
                return proxies
            else:
                print('Failed to fetch a proxy')
        except Exception:
            print('Failed to fetch a proxy')
```

4.2 Paging with a while loop

A while loop handles the paging. Reviews are capped at 99 pages, so after page 99 we break out of the loop:

```python
    def getPLPage(self):
        while True:
            # Assign the key payload parameters that control paging
            self.post_data["pageNumber"] = self.page
            self.post_data["reftag"] = f"cm_cr_getr_d_paging_btm_next_{self.page}"
            self.post_data["scope"] = f"reviewsAjax{self.page}"
            # Build the paging URL
            spiderurl = f'https://www.amazon.it/hz/reviews-render/ajax/reviews/get/ref=cm_cr_getr_d_paging_btm_next_{self.page}'
            res = self.getRes(spiderurl, self.headers, '', self.post_data, 'POST', check)  # self-written request wrapper
            if res:
                res = res.content.decode('utf-8')
                # Process the returned string
                contents = res.split('&&&')
                for content in contents:
                    infos = content.split('","')
                    info = infos[-1].replace('"]', '').replace('\\n', '').replace('\\', '')
                    # Keep only the chunks that contain a review
                    if 'data-hook="review"' in info:
                        sel = Selector(text=info)
                        data = {}
                        data['username'] = sel.xpath('//span[@class="a-profile-name"]/text()').extract_first()       # username
                        data['point'] = sel.xpath('//span[@class="a-icon-alt"]/text()').extract_first()              # rating
                        data['date'] = sel.xpath('//span[@data-hook="review-date"]/text()').extract_first()          # date and location
                        data['review'] = sel.xpath('//span[@data-hook="review-title"]/span/text()').extract_first()  # review title
                        data['detail'] = sel.xpath('//span[@data-hook="review-body"]').extract_first()               # review body
                        image = sel.xpath('//div[@class="review-image-tile-section"]').extract_first()
                        data['image'] = image if image else "not image"                                              # image
                        print(data)
            # Reviews are capped at 99 pages; break out after that
            if self.page < 99:
                self.page += 1
            else:
                break
```
