How to use Python to visually analyze the data of the Top 500 list

2025-07-04 Update, from SLTechnology News&Howtos (Shulou.com)

Today I would like to share how to use Python to visually analyze the Top 500 ranking data. The content is detailed and the logic is clear. Most people probably do not know much about this topic, so this article is shared for your reference; I hope you get something out of it. Let's take a look.

I. Preface

Today, let's analyze the 2020 list of China's top 500 enterprises, with statistical analysis and visual display of the data from different angles.

The main analysis contents are as follows:

China's top 500 enterprises: distribution by province.

China's top 500 enterprises: annual growth rate of operating income.

China's top 500 enterprises: annual decline rate of operating income.

China's top 500 enterprises: annual profit growth rate.

China's top 500 enterprises: annual profit decline rate.

China's top 500 enterprises: fastest risers in the ranking.

China's top 500 enterprises: fastest fallers in the ranking.

China's top 500 enterprises: asset interval distribution.

China's top 500 enterprises: market capitalization interval distribution.

China's top 500 enterprises: operating income interval distribution.

China's top 500 enterprises: profit interval distribution.

China's top 500 enterprises: top 10 by operating income, profit, assets, market capitalization and shareholders' equity.

We start with data collection, move on to statistical analysis, and finish with visualization!

II. Data acquisition

1. Start crawling

Get the enterprise list:

    import requests
    import openpyxl

    url = "http://www.fortunechina.com/fortune500/c/2020-07/27/content_369925.htm"
    res = requests.get(url, headers=headers)
    res.encoding = 'utf-8'
    text = res.text

Get the detail-page URL of each enterprise:

    for i in range(0, len(table_tr)):
        try:
            # name = table_tr[i].xpath('.//td/a/text()')[0]
            href = table_tr[i].xpath('.//td/a/@href')[0].replace("../", "http://www.fortunechina.com/")
            column_list = get_detail(href)
            # write one spreadsheet row per enterprise
            for k in range(0, len(column_list)):
                outws.cell(row=count, column=k+1, value=column_list[k])
            print(count)
            count = count + 1
        except:
            # rows without a detail link are skipped
            pass

Get the data for each enterprise:

    name = selector.xpath('//*[@class="comp-name"]/text()')[0]
    R1 = selector.xpath('//*[@class="con"]/em[@class="R1"]/text()')[0]
    R2 = selector.xpath('//*[@class="con"]/span/em/font[@class="ft-red"]/text()')[0]
    # the original also calls .replace(...) on the address; its arguments are not recoverable
    address = selector.xpath('//*[@class="info"]/p')[0].xpath('.//text()')[0]
    table_tbody_tr = selector.xpath('//*[@class="table"]/table/tr')

2. Save to Excel

    outwb = openpyxl.Workbook()
    outws = outwb.create_sheet(index=0)
    # header row; columns 2 and 3 hold the two rankings scraped above (R1, R2)
    outws.cell(row=1, column=1, value="Enterprise name")
    outws.cell(row=1, column=2, value="ranking")
    outws.cell(row=1, column=3, value="ranking")
    outws.cell(row=1, column=4, value="headquarters address")
    outws.cell(row=1, column=5, value="operating income")
    outws.cell(row=1, column=6, value="Annual increase or decrease of operating income")
    outws.cell(row=1, column=7, value="profit")
    outws.cell(row=1, column=8, value="annual profit increase or decrease")
    outws.cell(row=1, column=9, value="assets")
    outws.cell(row=1, column=10, value="market capitalization")
    outws.cell(row=1, column=11, value="shareholder equity")
    outwb.save("China Top 500s ranking data.xlsx")  # save
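The crawl step turns each relative href into a full URL by string replacement. As a hedged alternative, the standard library's urllib.parse.urljoin resolves a relative link against the page it was found on. This is only a sketch: the helper name absolute_url and the sample hrefs are made up for illustration, and urljoin resolves relative to the list page's directory, so its result should be checked against the real links before swapping it in.

```python
from urllib.parse import urljoin

# URL of the list page, as given in the article.
LIST_URL = "http://www.fortunechina.com/fortune500/c/2020-07/27/content_369925.htm"


def absolute_url(href: str, base: str = LIST_URL) -> str:
    """Resolve a possibly relative href against the page it was found on."""
    return urljoin(base, href)


# Hypothetical hrefs, for illustration only.
print(absolute_url("../x.htm"))              # relative link, resolved against LIST_URL
print(absolute_url("http://example.com/a"))  # absolute link, returned unchanged
```

Unlike str.replace, which substitutes every "../" occurrence in the string, urljoin handles any number of parent-directory segments according to the usual URL resolution rules.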

The data is now saved to Excel. Let's move on to statistical analysis and visualization!

III. Visual analysis

1. Provincial distribution

Import the related visualization libraries:

    from pyecharts import options as opts
    from pyecharts.charts import Line
    from pyecharts.charts import Map
    import pandas as pd
    from pyecharts.globals import ThemeType
    from pyecharts.charts import Bar

Statistics

Read the headquarters address column from the Excel file, then take the first two characters of each address (the province) to count how many of the top 500 are located in each province.

    address = pd_data['headquarters address']
    address = address.tolist()
    address_03 = []
    for i in address:
        # take the province (first two characters)
        address_03.append(i[0:2])
    data = []
    address_03_set = set(address_03)  # the distinct provinces in address_03, without duplicates
    for item in address_03_set:
        data.append((item, address_03.count(item)))

Map visualization

    def map_china() -> Map:
        c = (
            Map()
            .add(series_name="number of Enterprises", data_pair=data, maptype="china", zoom=1, center=[105, 38])
            .set_global_opts(
                title_opts=opts.TitleOpts(),
                visualmap_opts=opts.VisualMapOpts(
                    max_=9999,
                    is_piecewise=True,
                    # one colour per count range
                    pieces=[
                        {"max": 9, "min": 0, "label": "0-9", "color": "#FFE4E1"},
                        {"max": 99, "min": 10, "label": "10-99", "color": "#FF7F50"},
                        {"max": 499, "min": 100, "label": "100-499", "color": "#F08080"},
                        {"max": 999, "min": 500, "label": "500-999", "color": "#CD5C5C"},
                        {"max": 9999, "min": 1000, "label": ">=1000", "color": "#8B0000"},
                    ],
                ),
            )
        )
        return c
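The per-province count can also be sketched with collections.Counter, which does the counting in one pass instead of calling list.count for every province. The addresses below are made-up sample values, not data from the spreadsheet.

```python
from collections import Counter

# Made-up sample addresses; the real values come from the
# "headquarters address" column of the spreadsheet.
addresses = ["北京市海淀区", "北京市西城区", "广东省深圳市", "上海市浦东新区", "广东省广州市"]

# The first two characters are treated as the province, as in the article.
province_counts = Counter(addr[:2] for addr in addresses)

# Same (name, count) shape as the `data` list passed to Map().add(...).
data = list(province_counts.items())
print(data)
```

One caveat with the two-character slice: provinces whose short names have three characters, such as 黑龙江 or 内蒙古, are truncated and may need special handling.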

2. Annual growth rate of operating income

Read the annual increase or decrease of operating income from the Excel file, then find the 50 companies with the largest growth rate and the 50 with the largest decline rate (negative values).

    import heapq

    income_rate = pd_data['Annual increase or decrease of operating income']
    compare_name = pd_data['Enterprise name']
    income_rate = income_rate.tolist()
    compare_name = compare_name.tolist()
    m = income_rate
    # find the 50 largest numbers in the list, in descending order
    max_number = heapq.nlargest(50, m)  # nsmallest would find the smallest numbers instead
    max_index = map(m.index, heapq.nlargest(50, m))
    # max_index is a map object; wrap it in list() or set() to see the values
    # print(set(max_index))  # {235, 140, 273, 148, 86}
    # note: set() drops duplicate indices and does not keep the ranking order
    max_index = list(set(max_index))
    # ss = [m.index(j) for j in max_number]
    name = [compare_name[k] for k in set(max_index)]
    outwb = openpyxl.Workbook()
    outws = outwb.create_sheet(index=0)
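Because m.index returns the first matching position and set() discards order, the names above may not line up with the sorted rates when values repeat. A minimal sketch of one way around this, using made-up sample data: pair each rate with its name before calling heapq.nlargest, so both stay together and in rank order.

```python
import heapq

# Made-up sample data for illustration.
names = ["A", "B", "C", "D", "E"]
rates = [12.5, 40.0, -3.2, 40.0, 7.1]

# Pair each rate with its company name, then take the n largest pairs by rate.
top = heapq.nlargest(3, zip(rates, names), key=lambda pair: pair[0])

top_names = [name for _, name in top]
top_rates = [rate for rate, _ in top]
print(top_names, top_rates)
```

Companies B and D share the same rate here, and both appear in the result, which an index lookup by value would not guarantee.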

3. Annual decline rate of operating income

    income_rate = income_rate.tolist()
    compare_name = compare_name.tolist()
    m = income_rate
    # find the smallest numbers in the list (60 here), in ascending order
    min_number = heapq.nsmallest(60, m)
    min_index = [m.index(j) for j in min_number]
    name = [compare_name[k] for k in set(min_index)]
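Since a decline is a negative rate, it can help to filter out non-negative values before picking the smallest ones, so that a short list never reports a growing company as a decliner. The following is a sketch with made-up numbers, not the article's own code.

```python
import heapq

# Made-up sample data for illustration.
names = ["A", "B", "C", "D", "E"]
rates = [12.5, -8.0, -3.2, 0.0, -15.4]

# Keep only companies whose rate is negative (an actual decline).
declines = [(rate, name) for rate, name in zip(rates, names) if rate < 0]

# The n smallest rates are the steepest declines, most negative first.
worst = heapq.nsmallest(2, declines, key=lambda pair: pair[0])
print(worst)
```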

4. Annual profit growth rate

Read the annual increase or decrease in profit from the Excel file, then find the 50 companies with the largest growth rate and the 50 with the largest decline rate (negative values).

5. Annual rate of decrease in profit

6. Top 20 companies rising in the rankings

Read the 2020 and 2019 rankings from the Excel file, then compare them to find the 20 companies that rose the most and the 20 that fell the most.

    # line chart
    def LinePic(x_data, y_data, name):
        (
            Line()
            # global options
            .set_global_opts(
                tooltip_opts=opts.TooltipOpts(is_show=True),  # show tooltips; this is the default and can be omitted
                xaxis_opts=opts.AxisOpts(type_="category"),
                yaxis_opts=opts.AxisOpts(
                    type_="value",
                    axistick_opts=opts.AxisTickOpts(is_show=True),
                    splitline_opts=opts.SplitLineOpts(is_show=True),
                ),
            )
            # add the x-axis points
            .add_xaxis(xaxis_data=x_data)
            # add the y-axis points
            .add_yaxis(
                series_name=name,
                y_axis=y_data,
                symbol="emptyCircle",
                is_symbol_show=True,
                label_opts=opts.LabelOpts(is_show=True),
            )
            # save as an html file
            .render(name + ".html")
        )
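The comparison itself is not shown in the article, so here is a minimal sketch of how the rank change might be computed, assuming each row carries a company name with its 2020 and 2019 ranks. The rows are made-up sample values and n is reduced from 20 to 2 to keep the example short.

```python
# Made-up (name, rank_2020, rank_2019) rows for illustration.
rows = [
    ("A", 10, 25),
    ("B", 40, 38),
    ("C", 5, 5),
    ("D", 120, 90),
    ("E", 60, 95),
]

# A positive change means the company moved up (its rank number got smaller).
changes = [(name, rank_2019 - rank_2020) for name, rank_2020, rank_2019 in rows]

# Sort once; the front holds the biggest risers, the back the biggest fallers.
by_change = sorted(changes, key=lambda item: item[1], reverse=True)

n = 2  # the article uses 20
risers = by_change[:n]
fallers = by_change[-n:][::-1]  # reversed so the steepest fall comes first
print(risers)
print(fallers)
```

The names and changes in risers or fallers can then be split into x-axis and y-axis lists for a line chart.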

7. Top 20 companies falling in the rankings

8. Asset interval distribution

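Read the assets from the Excel file, split them into intervals of width 90000 (the width used in the code), and count the number of companies in each interval.

    from itertools import groupby

    name, dict_value = [], []
    # groupby only merges adjacent items, so the list is sorted first
    for k, g in groupby(sorted(assets_list), key=lambda x: x // 90000):
        name.append(str(k * 90000) + "~" + str((k + 1) * 90000 - 1))
        dict_value.append(int(len(list(g))))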
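The same binning pattern recurs with different widths in the following steps, so it could be wrapped in a small function. This is a sketch only: bin_counts is a hypothetical name and the input values are made up.

```python
from itertools import groupby


def bin_counts(values, width):
    """Count how many values fall into each interval of the given width."""
    labels, counts = [], []
    # groupby only merges adjacent items, so the values must be sorted first.
    for k, group in groupby(sorted(values), key=lambda x: x // width):
        labels.append(f"{k * width}~{(k + 1) * width - 1}")
        counts.append(len(list(group)))
    return labels, counts


# Made-up integer sample values for illustration.
labels, counts = bin_counts([100, 89999, 90000, 250000, 260000], 90000)
print(labels, counts)
```

Intervals that contain no values are simply absent from the result, which matches the behaviour of the loop in the article.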

9. Market value interval distribution

Read the market capitalization from the Excel file, split it into intervals of width 7000, and count the number of companies in each interval.

    # assets_list is expected to hold the market capitalization values here
    for k, g in groupby(sorted(assets_list), key=lambda x: x // 7000):
        name.append(str(k * 7000) + "~" + str((k + 1) * 7000 - 1))
        dict_value.append(int(len(list(g))))

10. Operating income interval distribution

Read the operating income from the Excel file, split it into intervals of width 50000, and count the number of companies in each interval.

    # assets_list is expected to hold the operating income values here
    for k, g in groupby(sorted(assets_list), key=lambda x: x // 50000):
        name.append(str(k * 50000) + "~" + str((k + 1) * 50000 - 1))
        dict_value.append(int(len(list(g))))

11. Profit interval distribution

Read the profit from the Excel file, split it into intervals of width 5000, and count the number of companies in each interval.

    # assets_list is expected to hold the profit values here
    for k, g in groupby(sorted(assets_list), key=lambda x: x // 5000):
        name.append(str(k * 5000) + "~" + str((k + 1) * 5000 - 1))
        dict_value.append(int(len(list(g))))
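Profits can be negative, which is worth a quick check with floor division. The sketch below uses made-up values and swaps in collections.Counter for groupby, so no sorting is needed before counting.

```python
from collections import Counter

WIDTH = 5000

# Made-up sample profits, including losses.
profits = [-7200, -300, 0, 4999, 5000, 12000]

# Floor division rounds toward negative infinity, so -300 // 5000 == -1
# and the loss lands in the interval -5000~-1 rather than in 0~4999.
counts = Counter(p // WIDTH for p in profits)

for k in sorted(counts):
    print(f"{k * WIDTH}~{(k + 1) * WIDTH - 1}: {counts[k]}")
```

With this rule, losses get their own intervals below zero instead of being mixed into the first positive interval.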

12. China's top 500 enterprises: top 10 by operating income, profit, assets, market capitalization and shareholders' equity

Read the top rows from the Excel file: operating income, profit, assets, market capitalization and shareholders' equity.

    # [0:11] selects the first 11 rows; use [0:10] if exactly ten are wanted
    name = pd_data['Enterprise name'][0:11].tolist()
    data_1 = pd_data['operating income'][0:11].tolist()
    data_2 = pd_data['profit'][0:11].tolist()
    data_3 = pd_data['assets'][0:11].tolist()
    data_4 = pd_data['market capitalization'][0:11].tolist()
    data_5 = pd_data['shareholder equity'][0:11].tolist()

    # chain call
    c = (
        Bar(
            init_opts=opts.InitOpts(  # initial configuration
                theme=ThemeType.MACARONS,
                # initial animation delay and easing effect
                animation_opts=opts.AnimationOpts(animation_delay=1000, animation_easing="cubicOut"),
            )
        )
        .add_xaxis(xaxis_data=name)  # x axis
        .add_yaxis(series_name="revenue", yaxis_data=cleardata(data_1))  # y axis
        .add_yaxis(series_name="profit", yaxis_data=cleardata(data_2))  # y axis
        .add_yaxis(series_name="assets", yaxis_data=cleardata(data_3))  # y axis
        .add_yaxis(series_name="market capitalization", yaxis_data=cleardata(data_4))  # y axis
        .add_yaxis(series_name="shareholder equity", yaxis_data=cleardata(data_5))  # y axis
        .set_global_opts(
            # title configuration and position
            title_opts=opts.TitleOpts(
                title='',
                subtitle='Top 10 Economic conditions',
                title_textstyle_opts=opts.TextStyleOpts(font_family='SimHei', font_size=25, font_weight='bold', color='red'),
                pos_left="90%",
                pos_top="10",
            ),
            # x-axis name; rotate the labels because the enterprise names are long
            xaxis_opts=opts.AxisOpts(name='Enterprise name', axislabel_opts=opts.LabelOpts(rotate=20)),
            yaxis_opts=opts.AxisOpts(name='unit: millions of US dollars'),
        )
        .render("Top 500 China 2020-Top 10 Economic situation.html")
    )
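The chart code calls cleardata on each column, but the article does not show that helper. The following is only a guess at what such a cleaning step might look like, written as a hypothetical stand-in named cleardata_sketch. It assumes cells may hold numbers, numeric strings with thousands separators, or placeholders such as "--"; the real helper may behave differently.

```python
def cleardata_sketch(values):
    """Hypothetical stand-in for the article's cleardata helper.

    Assumes cells may hold numbers, numeric strings with thousands
    separators, or placeholders such as "--"; unparseable cells become None.
    """
    cleaned = []
    for value in values:
        if isinstance(value, (int, float)):
            cleaned.append(float(value))
            continue
        # Drop thousands separators and surrounding whitespace before parsing.
        text = str(value).replace(",", "").strip()
        try:
            cleaned.append(float(text))
        except ValueError:
            cleaned.append(None)
    return cleaned


print(cleardata_sketch(["1,234.5", 42, "--", " 7 "]))
```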

That is all for "How to use Python to visually analyze the data of the Top 500 list". Thank you for reading!