How to use Python to crawl the game discount information on the ranking list

2025-02-22 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/02 Report--

This article walks through how to use Python to crawl the game discount information on the Steam ranking list, analyzing the problem and its solution in detail, in the hope of helping readers who want to solve the same problem find a simpler way.

Preface

Before I knew it, this year's Steam summer sale had quietly begun. Year after year of sales, big and small, have filled my game library with titles I have never even downloaded. But the idea that "buying is earning, Gaben must be losing money" keeps spreading, so perhaps I can count on those games to appreciate in value someday.

Sometimes, scrolling through the Steam rankings and spotting a game you like, you get swayed by the price on the right. Over time I realized that the games I had not bought were not un-fun; they simply had not gone on sale yet. Or perhaps some gems have not yet been discovered by anyone, sulking in a hidden corner of the list, waiting for the person who will "play with them" to appear.

So I simply used Python to crawl the information of the top 10,000 games on the Steam chart, including the game name, review score, price, and release date, so that I can pick out the games I am interested in from a more concise table and run further data analysis on them.

Enough talk; let's hurry, or the sale will be over before this gets any attention. (It never had any attention anyway.)

Start crawling

Let's start with the pros and cons of the data source this crawler uses:

First, I found that when Steam renders the ranking list, the backend issues a query request. Clicking on it reveals a block of JSON, so there is no need to simulate a browser or fill in request headers in the Python request. Parsing the JSON greatly simplifies the loop: one request returns information on 100 games.

Second, since we only need to traverse the JSON, it is much faster than visiting every individual game page.

Third, because we never open each game's own page, information such as review text, descriptions, and developers is not crawled. But there are plenty of tutorials online for crawling individual game pages, so there is no need to reinvent that wheel here.

First, open the ranking page on the official site. To keep DLC, bundles, and other item types from polluting the later analysis, remember to check only the "Games" category in the filter on the right.

Watching the XHR requests in the background, I found that each page load only shows the top 50 games. When we scroll down the page, the site sends a mysterious request:

On closer inspection, the request returns game entries beginning at index `start` and numbering `count` in total. For example, the figure below shows a request for the 50 games ranked 51 to 100.
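That pagination can be sketched as a small URL builder. The query string below is copied from the XHR request described above, so treat the exact parameter set as an observation from the network tab, not a documented API:

```python
# Base endpoint observed in the browser's XHR tab (an assumption based on
# that inspection, not official Steam documentation).
BASE = 'https://store.steampowered.com/search/results/'

def build_url(start, count=50):
    # start: index of the first game to return; count: how many to return.
    # The remaining parameters mirror the observed query string.
    return (BASE + '?query&start={}&count={}'.format(start, count)
            + '&dynamic_data=&sort_by=_ASC&category1=998'
              '&snr=1_7_7_globaltopsellers_7&filter=globaltopsellers&infinite=1')

# build_url(50) asks for the 50 games ranked 51-100, as in the screenshot
```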

Double-click the red box link in the above image, and the returned page looks like this:

The so-called JSON format is really just dictionaries and lists nested inside dictionaries, and it is how a great deal of data is stored today. Querying it is therefore very convenient, though I still use regular expressions when extracting the details, simply because they are handier here.
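As a toy illustration of that nesting: a dictionary whose `results_html` key carries the HTML for the game rows. The key names mimic the screenshots above; the exact payload shape is an assumption:

```python
import json

# Simplified stand-in for the response body: a JSON object whose
# 'results_html' value is an HTML fragment containing the game rows.
raw = '{"success": 1, "results_html": "<a class=\\"search_result_row\\">...</a>"}'

data = json.loads(raw)      # parse the JSON into a Python dict
print(data['results_html']) # everything we need lives under this one key
```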

Knowing this, the rest is just extracting the useful fields with Python into a new DataFrame so that it can be saved in CSV format later.

# Import the libraries to be used
import requests
from bs4 import BeautifulSoup
import re
import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

We open the JSON page above with requests and parse it with json.loads.

Here I have changed the parameters of start and count to make it easier to check whether the information is consistent with the original web page.

url = 'https://store.steampowered.com/search/results/?query&start=0&count=100&dynamic_data=&sort_by=_ASC&category1=998&snr=1_7_7_globaltopsellers_7&filter=globaltopsellers&infinite=1'
content = requests.get(url).content
jsontext = json.loads(content)
soup = BeautifulSoup(jsontext['results_html'], 'html.parser')

Take a look at what soup returns: it is the content of the 'results_html' key in the JSON. We no longer need anything else, because all the game information lives under this one key.

Then let's go back to the json page and see where everything we want is hidden:

The game's name is hidden in a span with class title, and the release date in its own div:

name = soup.find_all('span', class_='title')
listdate = soup.find_all('div', class_='col search_released responsive_secondrow')

In the same way you can find the game's link and ID, so I won't repeat that here.
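For illustration, a minimal sketch of pulling the link and app ID out of one result row; the HTML snippet is a simplified stand-in for what Steam actually returns, not a real response:

```python
from bs4 import BeautifulSoup

# Simplified stand-in for one row of Steam's results_html fragment.
html = '''<a href="https://store.steampowered.com/app/570/Dota_2/"
             class="search_result_row ds_collapse_flag" data-ds-appid="570">
            <span class="title">Dota 2</span>
          </a>'''

soup = BeautifulSoup(html, 'html.parser')
row = soup.find_all(class_='search_result_row ds_collapse_flag')[0]
link = row.attrs['href']            # the game's store page URL
appid = row.attrs['data-ds-appid']  # the numeric app ID, as a string
print(appid, link)
```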

The review score and the number of reviewers are hidden in a span tag; looking them up through the parsed tree would be tedious, so we will extract them with regular expressions later:

Unfortunately, some games have no reviews yet because they are not released, and the regular expression then returns garbage. So we wrap the extraction in functions that guard against that:

def get_reviewscore(review):
    gamereview = []
    for i in range(len(review)):
        try:
            # the percentage score sits right after a <br> in the tooltip
            score = re.search(r'br>(\d\d)%', str(review[i]))[1]
        except:
            score = ''
        gamereview.append(score)
    return gamereview

def get_reviewers(review):
    reviewers = []
    for i in range(len(review)):
        try:
            # the reviewer count sits between "the" and "user" in the tooltip
            ppl = re.search(r'the\s(.*?)\suser', str(review[i]))[1]
        except:
            ppl = ''
        reviewers.append(ppl)
    return reviewers

If you have found it easy so far, good, because crawling the price is more troublesome than the reviews. Only troublesome, though; there is no clever trick involved, and I did not bother finding a smarter way to get the result, because at this data volume a re-optimized version barely changes the run time. The result is the same either way.

Finding the final price of an item (that is, the current price of a free, discounted, or undiscounted game) is actually easy, because this is where it hides:

By convention the last two digits are the two decimal places, so we just grab this string of digits and divide by 100:

def get_finalprice(price):
    finalprice = []
    for i in range(len(price)):
        # the digits after "final" in the tag's attributes, divided by 100
        pricelist = int(re.search(r'final(\W+)(\d+)(\W)', str(price[i]))[2]) / 100
        finalprice.append(pricelist)
    return finalprice

But what if we also want its original price, so that we can analyze it later?

First, take a look at the prices on the Steam list. They are displayed in three ways:

First, a discounted item with a struck-through original price, which looks like this in the source code:

Second, free games:

The headache is that the free label has several variants:

(Even the capitalization of "to" differs. Steam, put a little more care into it!)

But "Free" always comes first, so all we have to do is look for it.
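That observation can be folded into a small helper. The variant strings below are the ones from the spot check, and the HK$ pattern mirrors the regex used later; the helper itself is a sketch, not the article's final code:

```python
import re

def normalize_price(text):
    # any of the "Free" variants starts with the word Free, so one match
    # catches them all and maps them to a price of 0
    if re.match(r'\s*Free', text):
        return '0'
    # otherwise keep the number after the HK$ sign (commas stripped)
    m = re.search(r'HK\$\s*([\d,]+\.?\d*)', text)
    return m.group(1).replace(',', '') if m else ''
```

So `normalize_price('Free to Play')` and `normalize_price('Free To Play')` both yield `'0'`, while `normalize_price('HK$ 78.00')` yields `'78.00'`.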

Third, the original price shows:


The screenshots above cover every pattern and variant I found while spot-checking. To avoid a mess when processing thousands of games, the code only looks for these three formats; anything stranger is simply written off as a null value:

def get_price(price):
    oripricelist = []
    for i in range(len(price)):
        try:
            # undiscounted (or free) items carry this class
            oripricelist.append(price[i].find_all(class_="col search_price responsive_secondrow")[0].text)
        except:
            # discounted items carry the "discounted" class instead
            oripricelist.append(price[i].find_all(class_="col search_price discounted responsive_secondrow")[0].text)
    ori_price = []
    for i in range(len(oripricelist)):
        try:
            # free games, in any of their variants, become 0
            search = re.search('Free', oripricelist[i])[0].replace('Free', '0')
        except:
            if oripricelist[i] == '\n':
                search = ''
            else:
                try:
                    # the first HK$ number is the original price
                    search = re.search(r'HK.*?(\d+\.\d+)\D', oripricelist[i])[1]
                except:
                    search = ''
        ori_price.append(search)
    return ori_price

With the extraction functions defined, we can start running the main loop.

First, give each piece of data we want a proper name:

def get_data(games=1000):
    num_games = games
    gamename = []
    gamereview = []
    gamereviewers = []
    gamerelease = []
    oriprice = []
    final_price = []
    appid = []
    website = []

Then we loop, querying 100 games per request, extract the information, and append it to the lists above:

    page = np.arange(0, num_games, 100)
    for num in page:
        url = 'https://store.steampowered.com/search/results/?query&start=' + str(num) + '&count=100&dynamic_data=&sort_by=_ASC&category1=998&snr=1_7_7_globaltopsellers_7&filter=globaltopsellers&infinite=1'
        print('Iteration {}: trying to connect...'.format(int(num / 100) + 1))
        content = requests.get(url).content
        jsontext = json.loads(content)
        soup = BeautifulSoup(jsontext['results_html'], 'html.parser')
        name = soup.find_all('span', class_='title')
        review = soup.find_all('div', class_='col search_reviewscore responsive_secondrow')
        listdate = soup.find_all('div', class_='col search_released responsive_secondrow')
        price = soup.find_all('div', class_='col search_price_discount_combined responsive_secondrow')
        href = soup.find_all(class_='search_result_row ds_collapse_flag')
        for i in name:
            gamename.append(i.text)
        for i in get_reviewscore(review):
            gamereview.append(i)
        for i in get_reviewers(review):
            gamereviewers.append(i)
        for i in listdate:
            gamerelease.append(i.text)
        for i in get_price(price):
            oriprice.append(i)
        for i in get_finalprice(price):
            final_price.append(i)
        for i in range(len(href)):
            appid.append(href[i].attrs['data-ds-appid'])
            website.append(href[i].attrs['href'])
        print('done')

The print statements mark every page visit and every completed loop, so that when something goes wrong we can quickly see which page failed.
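If a page does fail, one option is to retry it a few times before giving up. This is a hedged sketch of that idea, where `fetch` stands in for the `requests.get(url).content` call in the loop; it is not part of the article's original code:

```python
import time

def fetch_with_retry(fetch, url, retries=3, delay=2):
    # try the page up to `retries` times, sleeping `delay` seconds between
    # attempts; re-raise the last error if every attempt fails
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as exc:
            print('attempt {} failed: {}'.format(attempt, exc))
            if attempt == retries:
                raise
            time.sleep(delay)
```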

The resulting data is then stuffed into a data table:

    df = pd.DataFrame(data=[gamename, gamereview, gamereviewers, gamerelease,
                            oriprice, final_price, appid, website]).T
    df.columns = ['name', 'review_score', 'reviewers', 'release_date',
                  'ori_price', 'final_price', 'id', 'link']
    return df

# Call our function; the number is how many games to crawl
df = get_data(10000)

Then wait out the long run and enjoy the moment it finishes successfully:

The final dataset looks like this:

There are a total of three such mistakes in the top 1000 games:

eFootball PES 2020 is free as a demo; the full game does cost HK$78.

Life is Strange is free for the first episode and HK$23.8 for the episodes after it.
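With the table in hand, the "further data analysis" mentioned at the start might begin like this. The tiny DataFrame below is made-up stand-in data with the same columns, not real scrape output:

```python
import pandas as pd

# Stand-in for the scraped table (same column names, fabricated rows).
df = pd.DataFrame({
    'name': ['Game A', 'Game B', 'Game C'],
    'ori_price': [78.0, 0, 23.8],
    'final_price': [39.0, 0, 23.8],
})

df.to_csv('steam_games.csv', index=False)  # persist the table as CSV

# games currently on sale: final price strictly below the original price
on_sale = df[df['final_price'] < df['ori_price']]
print(on_sale['name'].tolist())  # ['Game A']
```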

The code runs fast, but the information it gathers is limited; to dig deeper into Steam data you still need the patience to traverse every game's own page.

This crawling exercise also surfaced some small inconsistencies in how Steam enters data into its large database, such as the three variants of the free label mentioned above, though Steam may not consider that a problem.

That is the answer to how to use Python to crawl the game discount information on the ranking list. I hope the above content is of some help; if you still have questions, you can follow the industry information channel for more related knowledge.
