How to use a Python crawler to analyze Apps


This article explains how to use a Python crawler to analyze Apps. The content is simple and clear and easy to follow; just work through the walkthrough below.

1 Background

Previously we used Scrapy to crawl and analyze 6,000+ Apps from Kuan (Coolapk), so why is this article about crawling Apps again?

Partly because I like tinkering with Apps, but mainly for the following reasons:

First, the pages crawled last time were very simple.

When crawling Kuan, we used a for loop to traverse a few hundred pages and grab everything, which was very simple. Reality is often not that easy, though; sometimes the content we need is much larger, such as the data of an entire site. To sharpen our crawling skills, this article targets the Wandoujia (Pea Pod) website.

The goal is to crawl the App information under every category on the site and download the App icons, roughly 70,000 Apps in total, an order of magnitude more than on Kuan.

Second, to practice the powerful Scrapy framework again.

Last time I only used Scrapy in a basic way and never really felt how powerful it is, so this article tries to use it in more depth, adding a random UserAgent, proxy IPs, and image download settings.

Third, to compare the two sites, Kuan and Wandoujia.

I believe a lot of people use Wandoujia to download Apps, while I use Kuan more, so I would also like to compare the similarities and differences between the two sites.

Without further ado, let's start with the crawling process.

▌ Target analysis

First of all, let's see what the target web page looks like.

You can see that the Apps on the site are divided into many categories, such as "Audio & Video Playback" and "System Tools". There are 14 major categories, and each is subdivided into several subcategories. For example, Audio & Video Playback includes "Video", "Live Streaming", and so on.

Clicking "Video" opens the second-level subcategory page, where you can see some basic information about each App: icon, name, number of installs, size, comments, and so on.

We could then go one level deeper to the third-level page, the details page of each App, which shows parameters such as downloads, rating, and number of comments. The crawling approach is much the same as for the second-level page, and to reduce the load on the site, the App details pages will not be crawled.


So this is a multi-level, categorized crawling problem: grab all the subcategory data under each category in turn.

Once you have learned this crawling pattern, many other sites can be crawled the same way; for example, "Douban Movies", which many people like to crawl, has the same structure.

▌ Analysis content

After the data is captured, this article makes a simple exploratory analysis of the categorized data, covering the following aspects:

Overall ranking of the Apps with the most / fewest downloads

Ranking of the categories / subcategories with the most / fewest downloads

Distribution of App downloads by interval

How many Apps share the same name

Comparison with the Apps on Kuan

▌ Analysis tools

Python

Scrapy

MongoDB

Pyecharts

Matplotlib

2 Data crawling

▌ Website analysis

We have just made a preliminary analysis of the site. The general idea has two steps: first extract the URL links of all subcategories, then grab the App information under each of those URLs.

As you can see, a subcategory URL is made up of two numbers: the first is the category code and the second is the subcategory code. With these two codes you can grab all the App information under that subcategory. So how do you get the two numeric codes?

Going back to the category page and inspecting it, you can see that each category's information is wrapped in a li node, and the subcategory URLs sit in the href attribute of the child a nodes. There are 14 major categories and 88 subcategories in total.

At this point the idea is clear: use CSS selectors to extract all the subcategory URLs, then grab the required information from each one.

Also note that the first page of each subcategory is loaded statically, while page 2 onward is loaded dynamically via Ajax with a different URL, so the two have to be parsed and extracted separately.
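To make the difference concrete, here is roughly what the two URL forms look like, using the category code 5029 (video playback) and subcategory code 716 (video) that the spider extracts later; the exact query parameters follow the request-building code shown further down.

```python
# page 1 of a subcategory is a normal static page
static_url = 'https://www.wandoujia.com/category/5029_716'

# from page 2 on, the content comes from an Ajax endpoint
ajax_url = ('https://www.wandoujia.com/wdjweb/api/category/more?'
            'catId=5029&subCatId=716&page=2')
```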

▌ Crawling with Scrapy

We need to crawl two kinds of content: the App data itself, including name, number of installs, size, comments, and so on; and the icon of each App, downloaded and stored in folders.

Since the site has some anti-crawling measures, we need to add a random UA and proxy IPs.

For the random UA we use the scrapy-fake-useragent library, which takes just one line of configuration. For the proxy IPs we go straight to Abuyun's paid dynamic proxy, which costs only a few yuan and is easy to set up.

Next, let's go straight to the code.

items.py

```python
import scrapy

class WandoujiaItem(scrapy.Item):
    cate_name = scrapy.Field()        # category name
    child_cate_name = scrapy.Field()  # subcategory name
    app_name = scrapy.Field()         # App name
    install = scrapy.Field()          # number of installs
    volume = scrapy.Field()           # size
    comment = scrapy.Field()          # comment
    icon_url = scrapy.Field()         # icon URL
```

middlewares.py

The middleware is mainly used to set up the proxy IP.

```python
import base64
import logging

proxyServer = "http://http-dyn.abuyun.com:9020"
proxyUser = "your username"   # Abuyun proxy credentials
proxyPass = "your password"
proxyAuth = "Basic " + base64.urlsafe_b64encode(
    bytes((proxyUser + ":" + proxyPass), "ascii")).decode("utf8")

class AbuyunProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta["proxy"] = proxyServer
        request.headers["Proxy-Authorization"] = proxyAuth
        logging.debug('Using Proxy: %s' % proxyServer)
```

pipelines.py

This file stores the data in MongoDB and downloads the icons into per-category folders.

Save to MongoDB:

```python
import pymongo

# MongoDB storage
class MongoPipeline(object):
    def __init__(self, mongo_url, mongo_db):
        self.mongo_url = mongo_url
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_url=crawler.settings.get('MONGO_URL'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_url)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        name = item.__class__.__name__
        # self.db[name].insert(dict(item))
        self.db[name].update_one(dict(item), {'$set': dict(item)}, upsert=True)
        return item

    def close_spider(self, spider):
        self.client.close()
```

Download the icons into category folders:

```python
import scrapy
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem

# icon download, grouped into category/subcategory folders
class ImagedownloadPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        if item['icon_url']:
            yield scrapy.Request(item['icon_url'], meta={'item': item})

    def file_path(self, request, response=None, info=None):
        name = request.meta['item']['app_name']
        cate_name = request.meta['item']['cate_name']
        child_cate_name = request.meta['item']['child_cate_name']

        path2 = r'%s/%s' % (cate_name, child_cate_name)
        path = r'{}/{}.{}'.format(path2, name, 'jpg')
        return path

    def item_completed(self, results, item, info):
        image_path = [x['path'] for ok, x in results if ok]
        if not image_path:
            raise DropItem('Item contains no images')
        return item
```

settings.py

```python
BOT_NAME = 'wandoujia'
SPIDER_MODULES = ['wandoujia.spiders']
NEWSPIDER_MODULE = 'wandoujia.spiders'

MONGO_URL = 'localhost'
MONGO_DB = 'wandoujia'

# whether to obey the robots.txt rules
ROBOTSTXT_OBEY = False

# the Abuyun plan allows only 5 requests per second, so add a 0.2 s delay per request
DOWNLOAD_DELAY = 0.2

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 100,  # random UA
    'wandoujia.middlewares.AbuyunProxyMiddleware': 200,                 # Abuyun proxy
}

ITEM_PIPELINES = {
    'wandoujia.pipelines.MongoPipeline': 300,
    'wandoujia.pipelines.ImagedownloadPipeline': 400,
}

# do not deduplicate URLs
DUPEFILTER_CLASS = 'scrapy.dupefilters.BaseDupeFilter'
```

wandou.py

The key parts of the main program are listed here:

```python
def __init__(self):
    # category page url
    self.cate_url = 'https://www.wandoujia.com/category/app'
    # subcategory home page url
    self.url = 'https://www.wandoujia.com/category/'
    # subcategory ajax request page url
    self.ajax_url = 'https://www.wandoujia.com/wdjweb/api/category/more?'
    # instantiate the category extraction class
    self.wandou_category = Get_category()

def start_requests(self):
    yield scrapy.Request(self.cate_url, callback=self.get_category)

def get_category(self, response):
    cate_content = self.wandou_category.parse_category(response)
    # ...
```

Here we first define several URLs: the category page, the subcategory home page, and the subcategory Ajax page (i.e., the URL used from page 2 on). We then define a class, Get_category(), dedicated to extracting all the subcategory URLs; its code is expanded below.

The program starts from start_requests: it requests the category home page, and the response is handled by the get_category() method, which uses the parse_category() method of the Get_category() class to extract all the URLs, as follows:

```python
import re

class Get_category():
    def parse_category(self, response):
        category = response.css('.parent-cate')
        data = [{
            'cate_name': item.css('.cate-link::text').extract_first(),
            'cate_code': self.get_category_code(item),
            'child_cate_codes': self.get_child_category(item)
        } for item in category]
        return data

    # get the numeric code of a main category
    def get_category_code(self, item):
        cate_url = item.css('.cate-link::attr("href")').extract_first()
        pattern = re.compile(r'.*/(\d+)')  # extract the main category code
        cate_code = re.search(pattern, cate_url)
        return cate_code.group(1)

    # get all subcategory names and codes
    def get_child_category(self, item):
        child_cate = item.css('.child-cate a')
        child_cate_url = [{
            'child_cate_name': child.css('::text').extract_first(),
            'child_cate_code': self.get_child_category_code(child)
        } for child in child_cate]
        return child_cate_url

    # extract the subcategory code with a regular expression
    def get_child_category_code(self, child):
        child_cate_url = child.css('::attr("href")').extract_first()
        pattern = re.compile(r'.*_(\d+)')  # extract the subcategory code
        child_cate_code = re.search(pattern, child_cate_url)
        return child_cate_code.group(1)
```

Here, apart from the category name cate_name, which can be extracted directly, the category code and the subcategory names and codes are extracted with get_category_code() and the other two methods. The extraction uses CSS selectors and regular expressions, which is fairly straightforward.

The extracted category names and codes look like the following; with these codes we can construct URL requests and start extracting the App information under each subcategory.

```python
{'cate_name': 'video playback', 'cate_code': '5029', 'child_cate_codes': [
    {'child_cate_name': 'video', 'child_cate_code': '716'},
    {'child_cate_name': 'live streaming', 'child_cate_code': '1006'},
    ...
]},
{'cate_name': 'system tools', 'cate_code': '5018', 'child_cate_codes': [
    {'child_cate_name': 'WiFi', 'child_cate_code': '895'},
    {'child_cate_name': 'browser', 'child_cate_code': '599'},
    ...
]},
...
```

Then get_category(), shown above, continues and extracts the App information:

```python
def get_category(self, response):
    cate_content = self.wandou_category.parse_category(response)
    # ...
    for item in cate_content:
        child_cate = item['child_cate_codes']
        for cate in child_cate:
            cate_code = item['cate_code']
            cate_name = item['cate_name']
            child_cate_code = cate['child_cate_code']
            child_cate_name = cate['child_cate_name']

            page = 1  # the page number to start crawling from
            if page == 1:
                # construct the first-page url
                category_url = '{}{}_{}'.format(self.url, cate_code, child_cate_code)
            else:
                params = {
                    'catId': cate_code,           # category
                    'subCatId': child_cate_code,  # subcategory
                    'page': page
                }
                category_url = self.ajax_url + urlencode(params)
            dict = {'page': page, 'cate_name': cate_name, 'cate_code': cate_code,
                    'child_cate_name': child_cate_name, 'child_cate_code': child_cate_code}
            yield scrapy.Request(category_url, callback=self.parse, meta=dict)
```

Here all the category names and codes are extracted in turn and used to construct the request URLs.

Because the first-page URL differs from the URL used from page 2 on, an if statement constructs them separately. The URL is then requested and parsed by the self.parse() method, with the meta parameter passing the related values along.

```python
def parse(self, response):
    # check whether this page still has content; an empty page is only 87 bytes long
    if len(response.body) >= 100:
        page = response.meta['page']
        cate_name = response.meta['cate_name']
        cate_code = response.meta['cate_code']
        child_cate_name = response.meta['child_cate_name']
        child_cate_code = response.meta['child_cate_code']

        if page == 1:
            contents = response
        else:
            jsonresponse = json.loads(response.body_as_unicode())
            contents = jsonresponse['data']['content']
            # the response is json whose content field is html text;
            # it cannot be parsed with .css() directly, so convert it first
            contents = scrapy.Selector(text=contents, type="html")

        contents = contents.css('.card')
        for content in contents:
            # num += 1
            item = WandoujiaItem()
            item['cate_name'] = cate_name
            item['child_cate_name'] = child_cate_name
            item['app_name'] = self.clean_name(content.css('.name::text').extract_first())
            item['install'] = content.css('.install-count::text').extract_first()
            item['volume'] = content.css('.meta span:last-child::text').extract_first()
            item['comment'] = content.css('.comment::text').extract_first().strip()
            item['icon_url'] = self.get_icon_url(content.css('.icon-wrap a img'), page)
            yield item

        # recursively crawl the next page
        page += 1
        params = {
            'catId': cate_code,           # category
            'subCatId': child_cate_code,  # subcategory
            'page': page
        }
        ajax_url = self.ajax_url + urlencode(params)
        dict = {'page': page, 'cate_name': cate_name, 'cate_code': cate_code,
                'child_cate_name': child_cate_name, 'child_cate_code': child_cate_code}
        yield scrapy.Request(ajax_url, callback=self.parse, meta=dict)
```

Finally, the parse() method parses and extracts the App name, install count, and the other fields we need. After one page is parsed, page is incremented and parse() is called again, looping until the last page of every subcategory has been parsed.

After a few hours, we have all the App information. I got 73,755 records and 72,150 icons; the two numbers differ because some Apps have information but no icon.

Icon download:

Below is a simple exploratory analysis of the extracted information.
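Before plotting, the data has to come out of MongoDB and the install strings have to be turned into numbers. This is a minimal sketch, assuming the item fields crawled above; the collection name follows the pipeline's use of the item class name, and the install-string format and the helper to_count() are illustrative assumptions:

```python
import pymongo
import pandas as pd

# load the crawled items from MongoDB into a DataFrame
client = pymongo.MongoClient('localhost')
db = client['wandoujia']
data = pd.DataFrame(list(db['WandoujiaItem'].find()))

# convert install strings such as "3亿人安装" / "5万人安装" into a numeric column
# (the suffix handling here is an assumption about the site's format)
def to_count(text):
    text = text.replace('人安装', '')
    if text.endswith('亿'):
        return float(text[:-1]) * 100000000
    if text.endswith('万'):
        return float(text[:-1]) * 10000
    return float(text)

data = data.dropna(subset=['install'])
data['install_count'] = data['install'].apply(to_count)
```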

3 Data analysis

▌ Overall situation

First, let's look at the install counts. After all, with more than 70,000 Apps, we are naturally curious about which Apps are used the most and which the least.

The code is implemented as follows:

```python
plt.style.use('ggplot')
colors = '#6D6D6D'     # font color
colorline = '#63AB47'  # pea pod green (red would be #CC2824)
fontsize_title = 20
fontsize_text = 10

# overall ranking
def analysis_maxmin(data):
    data_max = (data[:10]).sort_values(by='install_count')
    data_max['install_count'] = (data_max['install_count'] / 100000000).round(1)
    # column names assumed to follow the crawled item fields
    data_max.plot.barh(x='app_name', y='install_count', color=colorline)
    for y, x in enumerate(list(data_max['install_count'])):
        plt.text(x + 0.1, y - 0.08, '%s' % round(x, 1), ha='center', color=colors)

    plt.title('Top 10 Apps by installs', color=colors)
    plt.xlabel('installs (100 million)')
    plt.ylabel('App')
    plt.tight_layout()
    # plt.savefig('most installed App.png', dpi=200)
    plt.show()
```

Looking at the chart, there are two surprises:

At the top of the list is a phone management tool.

I was surprised it took first place on Wandoujia. First, I wonder whether everyone really loves cleaning their phones or is simply afraid of viruses; after all, my own phone has gone unprotected for years. Second, the number one spot is not one of Tencent's other products, such as WeChat or QQ.

Scanning the list, what I expected to see did not appear, and what I did not expect did.

The top 10 includes relatively little-known names such as Shuqi Novels, while national Apps like WeChat and Alipay do not even make the list.

With doubt and curiosity, I looked up the detail pages of "Tencent Mobile Manager" and "WeChat" respectively:

Tencent Mobile Manager downloads and installs:

WeChat downloads and installs:

What's going on?

Tencent Mobile Manager's 300 million+ downloads are roughly equal to its install count, while WeChat's 2 billion+ downloads show only 10 million+ installs. Together, the two sets of numbers roughly point to two possibilities:

Either Tencent Mobile Manager's downloads are not really that high,

or WeChat's install count shown on the site is far too low.

Whichever it is, it reflects the same problem: the site's numbers are not rigorous.

To verify this, I compared the install and download counts of the top 10 and found that for many Apps the two figures are identical, which suggests their real install numbers are not that high; if so, this ranking carries a lot of padding.

Is this really the result after all that effort?

Not giving up, let's then look at the least installed Apps; here are ten of them:

One glance, and I was even more surprised:

"QQ Music" is second to last, with only 3 installs!

Is this really the same QQ Music that just went public with a market capitalization in the hundreds of billions?

I double-checked:

You read that correctly: it says 3 people installed it!

How careless do you have to be for that to happen? With an install count like this, can Tencent still "make good music with heart"?

To be honest, I hardly want to continue the analysis for fear of crawling out more unexpected things, but having come this far, let's keep looking.

Having looked at the head and the tail, let's look at the whole: the distribution of install counts across all Apps, excluding the heavily padded top 10.

I was surprised to find that as many as 67,195 Apps, 94% of the total, have fewer than 10,000 installs!

If all the data on this site were true, the Mobile Manager ranked first above would on its own nearly equal the combined installs of those 60,000+ Apps!

For most App developers we can only say: reality is cruel. The probability that an App you slave over ends up with no more than 10,000 users is as high as 95%.

The code is implemented as follows:

```python
def analysis_distribution(data):
    data = data.loc[10:, :]  # drop the top 10 outliers
    data['install_count'] = data['install_count'].apply(lambda x: x / 10000)
    # bin edges and labels, in units of 10,000 installs (values shown are illustrative)
    bins = [0, 1, 10, 100, 1000, 10000]
    group_names = ['<1w', '1w-10w', '10w-100w', '100w-1000w', '>1000w']
    cats = pd.cut(data['install_count'], bins, labels=group_names)
    cats = pd.value_counts(cats)
    bar = Bar('App download distribution', 'up to 94% of Apps have fewer than 10,000 downloads')
    bar.use_theme('macarons')
    bar.add(
        'App count',
        list(cats.index),
        list(cats.values),
        is_label_show=True,
        xaxis_interval=0,
        is_splitline_show=0,
    )
    bar.render(path='App download distribution.png', pixel_ratio=1)
```

▌ Categories

Next, let's look at the Apps under each category, this time by number of Apps rather than install count, to avoid the distortion above.

Across the 14 categories, the number of Apps per category does not differ much; the largest, "Life & Leisure", has a little more than twice as many as "Photography & Images".

Next, let's take a closer look at the number of Apps in the 88 subcategories and pick out the 10 subcategories with the most and the fewest Apps; a sketch of the counting step follows below.
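The article does not show code for this step; a minimal sketch of the per-subcategory counts with pandas, assuming the data DataFrame and column names from the loading snippet above:

```python
# count the number of Apps per subcategory
sub_counts = data.groupby('child_cate_name')['app_name'].count().sort_values(ascending=False)

print(sub_counts.head(10))  # the 10 subcategories with the most Apps
print(sub_counts.tail(10))  # the 10 subcategories with the fewest Apps
```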

Two interesting phenomena can be found:

The "Radio" subcategory has the most Apps, more than 1,300 of them.

This is quite unexpected: radio is practically an antique by now, yet so many people are still building Apps for it.

The number of Apps varies widely between subcategories.

The largest subcategory, "Radio", has nearly 20 times as many Apps as the smallest, "Live Wallpaper". If I were an App developer, I would rather try a niche with less competition, such as vocabulary or children's encyclopedia Apps.

After looking at the overall and per-category pictures, a question suddenly came to mind: with so many Apps, do any of them share the same name? A quick check is sketched below.
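A minimal sketch of the duplicate-name check with pandas, again assuming the data DataFrame and the app_name column from the loading snippet above:

```python
# count how many Apps share each name
name_counts = data['app_name'].value_counts()
duplicates = name_counts[name_counts > 1]
print(duplicates.head(10))  # the most duplicated App names
```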

I was surprised to find as many as 40 Apps called "One-Key Lock Screen". Is it really that hard to think of another name? Besides, many phones now support tap-to-lock, which is more convenient than a one-key lock screen anyway.

Next, let's briefly compare the Apps on the two sites, Wandoujia and Kuan.

▌ Comparison with Kuan

The most intuitive difference between the two is the number of Apps: Wandoujia has an absolute advantage, about ten times as many as Kuan, so we naturally wonder:

Does Wandoujia include all of the Apps that are on Kuan?

If it did ("everything you have, I have; plus what you don't"), Kuan would have no advantage at all. Counting them, it turns out Wandoujia includes only 3,018 of them, about half, and does not carry the other half. A sketch of the overlap count follows below.
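The article does not show the counting code; a minimal sketch that matches by App name, assuming a second DataFrame kuan_data from the earlier Kuan crawl with an app_name column (the name kuan_data is an assumption):

```python
# how many Kuan Apps also appear on Wandoujia, matched by App name
included = kuan_data[kuan_data['app_name'].isin(data['app_name'])]
print(len(included))                    # Apps present on both sites
print(len(kuan_data) - len(included))   # Kuan Apps not found on Wandoujia
```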

It is true that the same App may be named slightly differently on the two platforms, but there is more reason to believe that many niche, high-quality Apps on Kuan are exclusive to it and simply absent from Wandoujia.

The code is implemented as follows:

```python
include = data3.shape[0]
notinclude = data2.shape[0] - data3.shape[0]
sizes = [include, notinclude]
labels = [u'included', u'not included']
explode = [0, 0.05]
plt.pie(
    sizes,
    autopct='%.1f%%',
    labels=labels,
    colors=[colorline, '#7FC161'],  # pea pod green
    shadow=False,
    startangle=90,
    explode=explode,
    textprops={'fontsize': 14, 'color': colors}
)
plt.title('Wandoujia includes only about half of the Apps on Kuan', color=colorline, fontsize=16)
plt.axis('equal')
plt.axis('off')
plt.tight_layout()
plt.savefig('inclusion comparison.png', dpi=200)
plt.show()
```

Next, let's compare the download counts of the included Apps on the two platforms; a sketch of the comparison follows below.
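No code is shown for this comparison; a minimal sketch that merges the two DataFrames on the App name, assuming both carry a numeric install_count column (column and variable names are assumptions):

```python
# compare install counts of the overlapping Apps on the two platforms
both = pd.merge(data, kuan_data, on='app_name', suffixes=('_wandoujia', '_kuan'))
print(both[['app_name', 'install_count_wandoujia', 'install_count_kuan']].head(10))
```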

As you can see, the difference in download counts between the two platforms is still obvious.

Finally, let's look at which Apps are not included on Wandoujia:

Thank you for reading. That covers how to use a Python crawler to grab and analyze App data; hopefully it deepens your understanding, and the specifics are best verified in your own practice.
