
How to Crawl Ajax-Generated Data in Python, Using Taobao Comments as an Example

2025-02-24 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/03 Report--

This article explains how to crawl ajax-generated data in Python, using Taobao comments as an example. It gives a detailed analysis and a working solution, in the hope of helping readers facing the same problem find a simple approach.

When learning Python, you will inevitably run into sites whose content is requested dynamically via ajax and refreshed asynchronously as JSON. The techniques used to crawl static web pages do not work there, so this article explains how to crawl ajax-generated data with Python.

Here we take crawling Taobao comments as an example to explain how to do it.

Step 1: Find the URL of the ajax request

First we need the URL that ajax requests when Taobao fetches comments. I use the Chrome browser here. Open Taobao, search for an item in the search box, for example "shoes", and select the first result.

The browser then jumps to a new page. Since we want to crawl the user comments, click on the accumulated reviews.

Now we can see users' reviews of the product. Right-click the page and choose "Inspect" (or simply press F12), then select the Network tab, as shown in the figure:

Scroll to the bottom of the reviews and click the next page (or page 2). Several new entries appear dynamically in the Network panel; choose the one that starts with list_detail_rate.htm?itemId=35648967399.

Click that entry, and the panel on the right shows the details of the request. Copy the link in the Request URL field.

Paste the URL we just copied into the browser's address bar and open it. The page returns exactly the data we need, though it looks messy, because it is JSON wrapped in a jsonp callback.

Step 2: Get the JSON data returned by the ajax request

Next we fetch the JSON data at that URL in Python. The editor I use is PyCharm. Here is the Python code:

```python
# -*- coding: utf-8 -*-
import requests

url = ('https://rate.tmall.com/list_detail_rate.htm?itemId=35648967399'
       '&spuId=226460655&sellerId=1809124267&order=3&currentPage=1'
       '&append=0&content=1&tagId=&posi=&picture='
       '&ua=011UW5TcyMNYQwiAiwQRHhBfEF8QXtHcklnMWc%3D'
       '%7CUm5OcktyT3ZCf0B9Qn9GeC4%3D'
       '%7CU2xMHDJ7G2AHYg8hAS8WKAYmCFQ1Uz9YJlxyJHI%3D'
       '%7CVGhXd1llXGVYYVVoV2pVaFFvWGVHe0Z%2FRHFMeUB4QHxCdkh8SXJcCg%3D%3D'
       '%7CVWldfS0RMQ47ASEdJwcpSDdNPm4LNBA7RiJLDXIJZBk3YTc%3D'
       '%7CVmhIGCUFOBgkGiMXNwswCzALKxcpEikJMwg9HSEfJB8%2FBToPWQ8%3D'
       '%7CV29PHzEfP29VbFZ2SnBKdiAAPR0zHT0BOQI8A1UD'
       '%7CWGFBET8RMQszDy8QLxUuDjIJNQA1YzU%3D'
       '%7CWWBAED4QMAU%2BASEYLBksDDAEOgA1YzU%3D'
       '%7CWmJCEjwSMmJXb1d3T3JMc1NmWGJAeFhmW2JCfEZmWGw6GicHKQcnGCUdIBpMGg%3D%3D'
       '%7CW2JfYkJ%2FX2BAfEV5WWdfZUV8XGBUdEBgVXVJciQ%3D'
       '&isg=82B6A3A1ED52A6996BCA2111C9DAAEE6&_ksTS=1440490222698_2142'
       '&callback=jsonp2143')  # the url is long because of the ua parameter

content = requests.get(url).text
```

print(content) prints out the same jsonp data we saw in the browser earlier, including the user comments.

This content is the JSON data we need; the next step is to parse it.

Step 3: Parse the JSON data with Python

```python
# -*- coding: utf-8 -*-
import json
import re

import requests

url = ('https://rate.tmall.com/list_detail_rate.htm?itemId=35648967399'
       '&spuId=226460655&sellerId=1809124267&order=3&currentPage=1'
       '&append=0&content=1&tagId=&posi=&picture='
       '&ua=011UW5TcyMNYQwiAiwQRHhBfEF8QXtHcklnMWc%3D'
       '%7CUm5OcktyT3ZCf0B9Qn9GeC4%3D'
       '%7CU2xMHDJ7G2AHYg8hAS8WKAYmCFQ1Uz9YJlxyJHI%3D'
       '%7CVGhXd1llXGVYYVVoV2pVaFFvWGVHe0Z%2FRHFMeUB4QHxCdkh8SXJcCg%3D%3D'
       '%7CVWldfS0RMQ47ASEdJwcpSDdNPm4LNBA7RiJLDXIJZBk3YTc%3D'
       '%7CVmhIGCUFOBgkGiMXNwswCzALKxcpEikJMwg9HSEfJB8%2FBToPWQ8%3D'
       '%7CV29PHzEfP29VbFZ2SnBKdiAAPR0zHT0BOQI8A1UD'
       '%7CWGFBET8RMQszDy8QLxUuDjIJNQA1YzU%3D'
       '%7CWWBAED4QMAU%2BASEYLBksDDAEOgA1YzU%3D'
       '%7CWmJCEjwSMmJXb1d3T3JMc1NmWGJAeFhmW2JCfEZmWGw6GicHKQcnGCUdIBpMGg%3D%3D'
       '%7CW2JfYkJ%2FX2BAfEV5WWdfZUV8XGBUdEBgVXVJciQ%3D'
       '&isg=82B6A3A1ED52A6996BCA2111C9DAAEE6&_ksTS=1440490222698_2142'
       '&callback=jsonp2143')

cont = requests.get(url).text
rex = re.compile(r'\w+\((.*)\)')  # strip the jsonp callback wrapper
content = rex.findall(cont)[0]
con = json.loads(content)
count = len(con['rateDetail']['rateList'])
for i in range(count):
    print(con['rateDetail']['rateList'][i]['appendComment']['content'])
```

Parsing:

First we import the packages we need: re for regular expressions and json for parsing the JSON data.

cont = requests.get(url).text  # fetches the jsonp response as text

rex = re.compile(r'\w+\((.*)\)')  # this regular expression removes the jsonp callback wrapper around cont, so that the data becomes true JSON of the form {"a": "b", "c": "d"}
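To see what this regex does, here is a minimal sketch using a made-up jsonp string (the callback name and payload are illustrative, not real Taobao data):

```python
import re

# A made-up jsonp response: a callback name wrapping a JSON payload.
raw = 'jsonp2143({"a": "b", "c": "d"})'
rex = re.compile(r'\w+\((.*)\)')
payload = rex.findall(raw)[0]
print(payload)  # → {"a": "b", "c": "d"}
```

The `\w+` matches the callback name and the capture group keeps only what is between the outer parentheses.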

con = json.loads(content) uses json's loads function to turn the JSON text into Python dicts and lists. The original code passed "gbk" as a second argument to declare the encoding (the Windows default); modern Python has removed that argument, and requests already decodes the response to text, so it is no longer needed.

count = len(con['rateDetail']['rateList'])  # gets the number of user comments (on the current page only)
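Since each request returns a single page, the remaining comments can be fetched by varying the currentPage query parameter. A sketch of building the per-page URLs (the trimmed base URL and the page count here are illustrative assumptions; a real request needs the full parameter set from the code above):

```python
# Build one URL per comment page by varying the currentPage parameter.
base_url = ('https://rate.tmall.com/list_detail_rate.htm'
            '?itemId=35648967399&sellerId=1809124267&currentPage={}')
page_urls = [base_url.format(page) for page in range(1, 4)]
print(page_urls[0])
```

Each of these URLs could then be fetched and parsed exactly as above.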

for i in range(count):
    print(con['rateDetail']['rateList'][i]['appendComment']['content'])

# loop through the user comments and print them (you can also save the data as you need; see Step 4)

The hard part here is finding the path to the user comments inside the cluttered JSON data.
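To make that path concrete, here is a toy response with the same nesting as the real one (the field values are invented for illustration):

```python
import json

# An invented sample shaped like Taobao's rate response.
sample = ('{"rateDetail": {"rateList": ['
          '{"appendComment": {"content": "still comfortable after a month"}},'
          '{"appendComment": {"content": "size runs a little small"}}'
          ']}}')
con = json.loads(sample)
# Walk the path rateDetail -> rateList -> appendComment -> content.
for rate in con['rateDetail']['rateList']:
    print(rate['appendComment']['content'])
```

Once the path is known, extracting the comments is just a matter of indexing into nested dicts and lists.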

Step 4: Save the parsed results

Here you can save the comments locally, for example in CSV format.
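For example, a minimal sketch that writes a list of comments to a CSV file (the comments list is a stand-in for the data extracted in Step 3, and the filename is arbitrary):

```python
import csv

# Stand-in for the comment strings extracted in Step 3.
comments = ['good quality', 'fast shipping', 'size runs a little small']

with open('comments.csv', 'w', newline='', encoding='utf-8-sig') as f:
    writer = csv.writer(f)
    writer.writerow(['comment'])  # header row
    for c in comments:
        writer.writerow([c])
```

The utf-8-sig encoding adds a BOM so that Excel on Windows opens Chinese text correctly.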

That concludes this walkthrough of crawling ajax-generated data in Python, using Taobao comments as an example. I hope the above content is of some help; if you still have questions, you can follow the industry information channel for more related knowledge.
