2025-01-19 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/02 Report--
This article walks through "Python crawler data example analysis": crawling live-stream data from Douyu, storing it in MySQL, and analyzing and visualizing it with Python.
Get data
Open the Douyu live directory page and click through to the next page.
In the browser's developer tools, open the Network tab, look at the asynchronous (XHR) requests, and find the URL the page requests.
The corresponding URL is:
https://www.douyu.com/gapi/rkc/directory/0_0/2
Turning the page changes only the number at the end of the URL.
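Because only the trailing number changes, the page URLs can be generated from a template. A minimal sketch: the name `page_url`, the assumption that pages are numbered from 1, and the page range are illustrative and not taken from the original crawler.

```python
BASE_URL = "https://www.douyu.com/gapi/rkc/directory/0_0/"


def page_url(page: int) -> str:
    """Return the directory API URL for a page number (assumed to start at 1)."""
    if page < 1:
        raise ValueError("page numbers are assumed to start at 1")
    return f"{BASE_URL}{page}"


# URLs for the first three pages (the upper bound is arbitrary)
urls = [page_url(p) for p in range(1, 4)]
```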
Use requests + pyquery to crawl; the excerpt below parses the JSON response with jsonpath.
Part of the crawler code is as follows.
    import time

    import jsonpath

    # get_json and stampToTime are helper functions that this excerpt does not show
    def get_datas(url):
        data = []
        doc = get_json(url)
        jobs = doc['data']['rl']
        for job in jobs:
            dic = {}
            dic['user_name'] = jsonpath.jsonpath(job, '$..nn')[0]    # username
            dic['user_id'] = jsonpath.jsonpath(job, '$..uid')[0]     # user ID
            dic['room_name'] = jsonpath.jsonpath(job, '$..rn')[0]    # room name
            dic['room_id'] = jsonpath.jsonpath(job, '$..rid')[0]     # room ID
            dic['redu'] = jsonpath.jsonpath(job, '$..ol')[0]         # heat
            dic['c2name'] = jsonpath.jsonpath(job, '$..c2name')[0]   # zone
            dic['time'] = stampToTime(time.time())                   # crawl time
            data.append(dic)
        return data
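The excerpt calls two helpers, get_json and stampToTime, whose bodies are not shown. The following is one plausible sketch of them, not the author's code: the User-Agent header, the timeout, and the timestamp format are all assumptions.

```python
import time

import requests  # third-party: pip install requests


def get_json(url, timeout=10):
    """Fetch a URL and return the decoded JSON body (hypothetical helper)."""
    headers = {"User-Agent": "Mozilla/5.0"}  # assumed; some sites reject the default agent
    resp = requests.get(url, headers=headers, timeout=timeout)
    resp.raise_for_status()  # fail loudly on HTTP errors
    return resp.json()


def stampToTime(stamp):
    """Format a Unix timestamp as 'YYYY-MM-DD HH:MM:SS' in local time (hypothetical helper)."""
    return time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(stamp))
```

A string in this format sorts chronologically, which is convenient for date filtering on the time column.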
The rest is a matter of crawling continuously; I set the crawler to run every 10 minutes.
Store the crawled data in MySQL.
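The article does not show how the 10-minute schedule was implemented. One simple possibility is a loop that sleeps between rounds; in this sketch the function name `crawl_periodically` and its parameters are hypothetical, and `crawl_once` stands for whatever function performs a single crawl.

```python
import time


def crawl_periodically(crawl_once, interval_s=600, rounds=None,
                       sleep=time.sleep, clock=time.monotonic):
    """Call crawl_once() every interval_s seconds.

    rounds=None runs until interrupted; a number stops after that many rounds.
    Returns the number of rounds attempted.
    """
    done = 0
    while rounds is None or done < rounds:
        started = clock()
        try:
            crawl_once()
        except Exception as exc:  # keep the loop alive if one round fails
            print(f"crawl round {done} failed: {exc}")
        done += 1
        if rounds is not None and done >= rounds:
            break
        # Subtract the time the crawl took so rounds start about interval_s apart
        elapsed = clock() - started
        sleep(max(0.0, interval_s - elapsed))
    return done
```

For a crawl lasting several days, a system scheduler such as cron is a common alternative, since it survives a crash of the Python process.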
    # Save to MySQL (final_result is the DataFrame of crawled rows)
    from sqlalchemy import create_engine

    engine = create_engine('mysql+mysqldb://root:***password***@localhost:3306/demo?charset=utf8mb4')
    final_result.to_sql('data_douyu', con=engine, index=False, index_label=False,
                        if_exists='append', chunksize=1000)
The crawler ran for a little over seven days in a row and collected 20.62 million live-stream records.
Data analysis
Load the data into Python.
Deduplicate. The crawler already removes duplicates, so this is only a safeguard, and it turns out there are no repeated rows.
The crawl actually ran from the afternoon of 07-31 to the morning of 08-08. To simplify later calculations, only the seven full days from 08-01 to 08-07 are kept.
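    # Deduplicate
    data = data[['c2name', 'redu', 'room_id', 'room_name', 'time', 'user_id', 'user_name']].drop_duplicates()
    # Keep 2019-08-01 through 2019-08-07 (bounds assumed from the seven-day window described above)
    data = data.loc[(data['time'] >= '2019-08-01') & (data['time'] < '2019-08-08')]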
We also need to aggregate the records per anchor, grouped by ID.
First summarize with groupby, then compute the new columns.
    data_abc['av_redu'] = data_abc['redu'] / data_abc['time_num']
    # One sample every ten minutes over seven days: 6 per hour * 7 days = 42
    data_abc['hour'] = data_abc['time_num'] / 42
    data_abc.head()
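The groupby step itself is not shown in the article. The sketch below shows one way data_abc could be built. It assumes that time_num counts the ten-minute snapshots in which an anchor was live and that redu is the heat summed over those snapshots; the sample rows and zone names are made up.

```python
import pandas as pd

# Made-up sample: one row per anchor per ten-minute snapshot
data = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "user_name": ["a", "a", "a", "b", "b"],
    "c2name": ["zone_a", "zone_a", "zone_a", "zone_b", "zone_b"],
    "redu": [300, 600, 900, 50, 150],
    "time": ["2019-08-01 12:00", "2019-08-01 12:10", "2019-08-01 12:20",
             "2019-08-01 12:00", "2019-08-01 12:10"],
})

data_abc = (
    data.groupby("user_id")
    .agg(
        user_name=("user_name", "first"),
        c2name=("c2name", "first"),
        redu=("redu", "sum"),        # total heat over all snapshots
        time_num=("time", "count"),  # number of snapshots the anchor was live
    )
    .reset_index()
)

# Average heat per snapshot
data_abc["av_redu"] = data_abc["redu"] / data_abc["time_num"]
# Each snapshot stands for 1/6 hour; dividing by 7 days gives time_num / 42 hours per day
data_abc["hour"] = data_abc["time_num"] / 42
```

Under these assumptions, hour is the average number of hours streamed per day over the week, and av_redu is the mean heat while live.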
This gives us a second dataset, indexed by anchor.
In other words, more than 230,000 anchors went live during these seven days. Let's take a look at how they are doing.
Data visualization
Draw a scatter plot of the 230,000 anchors by average daily live time and average heat.
    import seaborn as sns
    import matplotlib as mpl
    import matplotlib.pyplot as plt

    # Configure fonts
    mpl.rcParams['font.sans-serif'] = ['SimHei']  # specify the default font
    mpl.rcParams['axes.unicode_minus'] = False    # render minus signs correctly with this font
    plt.figure(figsize=(8, 6))  # height is assumed; the original value was lost
    plt.xticks(fontsize=12)
    plt.yticks(fontsize=12)
    sns.scatterplot(x=data_test["hour"], y=data_test["av_redu"], hue=data_test["c2name"])
The result is shown in the following figure.
As the figure shows, most anchors sit at the bottom and very few become top anchors. The popular anchors are concentrated in a handful of popular zones, while anchors in other zones generally do no better than average.
Because more than 200,000 anchors are crowded at the bottom, it is difficult to see the distribution of their average live time.
The anchors are also sharply polarized. To show the trend more clearly, we take an average heat of 10,000 as the dividing line and analyze the average daily live time of the anchors on each side of it.
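    # Top anchors (average heat above 10,000)
    plt.figure(figsize=(10, 6))
    plt.xticks(fontsize=13)
    plt.yticks(fontsize=13)
    # distplot is deprecated in newer seaborn releases; histplot(..., kde=True) replaces it
    sns.distplot(data_abc.loc[data_abc['av_redu'] > 10000]["hour"], kde=True, rug=False, color='y')
    plt.show()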
The figure shows that the larger anchors mostly stream about 5 hours a day, and those 5 hours of gaming are not as easy as casual play: while live, an anchor has to concentrate on the game and interact with the audience at the same time.
Most of the smaller anchors, by contrast, stream about 1 hour a day. Without sustained broadcasts they attract few viewers; with few viewers they lose motivation, so breaking out becomes harder over time and a vicious circle forms.
There are some outliers in the figure above: live rooms with an average daily live time of more than 20 hours. Most of them are in the "watch together" zone, where movies and TV series can be played 24 hours a day. The rest are official channels for games or game events, which play official videos on a loop.
So when do most anchors broadcast live?
And is their audience watching at the same times?
Comparing the number of anchors live with the number of viewers online over the course of the day, the two diverge in two periods.
The first is from 21:00 to 6:00 the next morning. Anchors who stream for a living often put in 5-6 hours of intense, uninterrupted broadcasting and choose to rest late at night, while viewers who watch for entertainment are lying in bed watching.
The second is from 12:00 to 18:00, when viewers are at work or school, while many full-time anchors get up, have lunch, and begin their afternoon broadcast.
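The article does not show how the hourly comparison was computed. A minimal sketch of one approach: total each snapshot, then average the snapshots that fall in the same hour of day. The sample rows are made up, and using summed heat (redu) as a stand-in for audience size is an assumption.

```python
import pandas as pd

# Made-up snapshots: one row per live room per crawl
data = pd.DataFrame({
    "time": ["2019-08-01 13:00:00", "2019-08-01 13:00:00", "2019-08-01 13:10:00",
             "2019-08-01 22:00:00", "2019-08-02 22:00:00", "2019-08-02 22:00:00"],
    "user_id": [1, 2, 1, 3, 3, 4],
    "redu": [100, 200, 300, 5000, 7000, 1000],
})
data["time"] = pd.to_datetime(data["time"])

# Step 1: totals per snapshot (anchors live, summed heat)
per_snapshot = data.groupby("time").agg(
    anchors=("user_id", "nunique"),
    heat=("redu", "sum"),
)

# Step 2: average the snapshots that share an hour of day
hourly = per_snapshot.groupby(per_snapshot.index.hour).mean()
hourly.index.name = "hour_of_day"
```

Plotting the anchors and heat columns of hourly against the hour of day would show where the two curves move apart.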
Most anchors do not live the way we imagine, with free time and easy money. There are 100,000 or even millions of anchors live online every day, but very few of them truly win the audience's affection and receive large numbers of gifts from willing viewers. Short-lived traffic cannot hold an audience forever; how to retain viewers with content once the gimmick wears off is the question every anchor is exploring.
As industry regulation tightens, live-streaming platforms are gradually leaving the "bubble" behind: the traffic dividend is disappearing and the industry is returning to rationality. With "Panda" gone, competition is concentrated among the remaining top platforms, which also need to develop higher-quality content and more diverse directions.