How Python depicts data

2025-02-24 Update From: SLTechnology News&Howtos (shulou)

This article introduces how to describe data with Python: first with numerical summaries (population parameters), then with classic statistical charts drawn with Matplotlib.

Population parameters

Population parameters are numbers that summarize the characteristics of a population. The statistical overview already introduced two of them, the population mean and the population variance. The population mean reflects the overall level of the population and is defined as follows:

$$\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$$

The population variance reflects how spread out the population is and is defined as follows:

$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$$

The square root of the variance, σ, is called the population standard deviation. The mean and the standard deviation have the same units as the original data. For many data sets, most members fall within one standard deviation of the mean.
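As a minimal sketch of these definitions, the snippet below computes the population mean, variance, and standard deviation with Python's standard `statistics` module, then checks what fraction of the values lies within one standard deviation of the mean. The `heights` list is made-up illustration data, not the article's data set.

```python
import statistics

# Hypothetical values (cm), used only for illustration.
heights = [158.0, 163.5, 167.2, 170.1, 171.8, 172.4, 174.9, 177.3, 181.0, 186.6]

mu = statistics.fmean(heights)            # population mean
var = statistics.pvariance(heights, mu)   # population variance (divides by N)
sigma = statistics.pstdev(heights, mu)    # population standard deviation

# Fraction of members inside [mu - sigma, mu + sigma]
inside = [h for h in heights if mu - sigma <= h <= mu + sigma]
fraction_within_one_sigma = len(inside) / len(heights)

print("mean:", mu)
print("variance:", var)
print("standard deviation:", sigma)
print("fraction within one sigma:", fraction_within_one_sigma)
```

NumPy's `np.var` and `np.std` also divide by N by default, so they compute the same population quantities.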

Some parameters can only be obtained after sorting the members of the population, for example the maximum (max) and the minimum (min). The median and the quartiles are also commonly used parameters of this kind. After sorting the members, the value of the middle member is the median. If the number of members is even, the median is the average of the two middle values. Splitting the members into those below and those above the median gives two groups of equal size. The medians of these two groups are the lower quartile and the upper quartile. The distance between Q1 and Q3, called the interquartile range (IQR), is another common population parameter. We use the following symbols:

Q1 = lower quartile

Q2 = M = median

Q3 = upper quartile

IQR = Q3 − Q1

The median splits the data at 50%, the lower quartile at 25%, and the upper quartile at 75%. In fact, the median and the quartiles are all percentiles. Splitting the data at any proportion gives a percentile: sort the data by value, and the value of the member at the p% position is called the p-th percentile.
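The two ways of locating quartiles described above can be sketched in plain Python. `quartiles_by_halves` follows the "median of each half" description, while `percentile` uses linear interpolation between neighbouring ranks, which is also the default behaviour of `np.percentile`. The function names and the sample data are illustrative assumptions; the two methods can give slightly different values on small data sets.

```python
def percentile(values, p):
    """Return the p-th percentile (0 <= p <= 100) using linear interpolation."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile() needs at least one value")
    position = (len(ordered) - 1) * p / 100.0   # fractional index into the sorted data
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    weight = position - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight


def median(values):
    """Middle value, or the average of the two middle values for an even count."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles_by_halves(values):
    """Q1, Q2, Q3 as the medians of the lower half, the whole set, and the upper half."""
    ordered = sorted(values)
    if len(ordered) < 2:
        raise ValueError("need at least two values")
    half = len(ordered) // 2
    lower_half = ordered[:half]
    upper_half = ordered[-half:]   # skips the middle member when the count is odd
    return median(lower_half), median(ordered), median(upper_half)


data = [1, 2, 3, 4, 5, 6, 7, 8]
q1, q2, q3 = quartiles_by_halves(data)
print("halves method:", q1, q2, q3, "IQR:", q3 - q1)
print("interpolated: ", percentile(data, 25), percentile(data, 50), percentile(data, 75))
```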

We can compute these descriptive parameters for the height data of the students of Shohoku (Xiangbei) High School:

    mean: 172.075924
    variance: 102.570849846
    standard deviation: 10.1277267857
    median: 172.21
    lower percentile: 165.31
    upper percentile: 178.9025
    IQR: 13.5925

The code is as follows:

    import numpy as np

    # Read one height value per line
    with open("xiangbei_height.txt", "r") as f:
        lines = f.readlines()
    x = list(map(float, lines))

    print("mean:", np.mean(x))
    print("variance:", np.var(x))
    print("standard deviation:", np.std(x))
    print("median:", np.median(x))
    print("lower percentile:", np.percentile(x, 25))
    print("upper percentile:", np.percentile(x, 75))
    print("IQR:", np.percentile(x, 75) - np.percentile(x, 25))

Data drawing

Data drawing takes advantage of human sensitivity to shape. By drawing data, we turn numbers into geometry, which makes the information in the data easier to digest. Data drawing used to be time-consuming manual work, but the development of computer graphics has made it much easier. In recent years "data visualization" has become popular, presenting data with many dazzling techniques. In the end, though, there are only a few classic chart types, such as pie charts, scatter plots, and line charts. The innovations of "data visualization" are derived from these classic methods. Because readers have established habits for reading charts, excessive innovation in the way of drawing may even mislead them. For that reason, only the classic forms of statistical drawing appear here.

Since this series of statistics tutorials mainly uses Python, I will introduce several classic chart types based on Matplotlib. Matplotlib is a Python plotting package built on NumPy that provides a rich set of drawing tools. Of course, Matplotlib is not the only option. Some statisticians prefer the R language, while web developers often use D3.js. Once you are familiar with one drawing tool, you can pick up the others quickly by analogy.

Pie chart

We will use the 2011 GDP of several countries as an example to see how to draw classic pie and bar charts. The data are as follows:

    USA      15094025
    China    11299967
    India     4457784
    Japan     4440376
    Germany   3099080
    Russia    2383402
    Brazil    2293954
    UK        2260803
    France    2217900
    Italy     1846950

This is a population with only 10 members. The value of each member is that country's total GDP in 2011, in millions of US dollars.
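A pie chart encodes each member's share of the total, so it can help to compute those shares before plotting. The sketch below does this in plain Python with the values copied from the data above; the variable names are illustrative.

```python
# GDP in millions of US dollars, copied from the data above.
gdp = {
    "USA": 15094025,
    "China": 11299967,
    "India": 4457784,
    "Japan": 4440376,
    "Germany": 3099080,
    "Russia": 2383402,
    "Brazil": 2293954,
    "UK": 2260803,
    "France": 2217900,
    "Italy": 1846950,
}

total = sum(gdp.values())

# Each country's share of the ten-country total, in percent.
shares = {country: 100.0 * value / total for country, value in gdp.items()}

for country, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{country:8s} {share:5.1f}%")
```

These percentages are what the `autopct` argument of `plt.pie` prints on each slice.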

Let's first draw a pie chart (pie plot). Drawing a pie chart is like dividing a pizza. The whole pizza represents the sum of the members' values, and each member gets a slice whose size is proportional to its value. Drawing the data above as a pie chart:

As the chart shows, the United States and China take large shares in this "pie-sharing" game. However, what readers get from a pie chart is proportion; they cannot read off the actual value of each member. A pie chart is therefore best suited to showing each member's percentage of the total. The code for the pie chart is as follows:

    import matplotlib.pyplot as plt

    # quants: GDP
    # labels: country name
    labels = []
    quants = []

    # Read data
    with open('major_country_gdp.txt', 'r') as f:
        for line in f:
            info = line.split()
            labels.append(info[0])
            quants.append(float(info[1]))
    print(quants)

    # make a square figure
    plt.figure(1, figsize=(6, 6))

    # For China, make the piece explode a bit
    def explode(label, target='China'):
        if label == target:
            return 0.1
        else:
            return 0

    expl = list(map(explode, labels))

    # Colors used. Recycle if not enough.
    colors = ["pink", "coral", "yellow", "orange"]

    # Pie Plot
    # autopct: format of "percent" string
    plt.pie(quants, explode=expl, colors=colors, labels=labels,
            autopct='%1.1f%%', pctdistance=0.8, shadow=True)
    plt.title('Top 10 GDP Countries (2011)', bbox={'facecolor': '0.8', 'pad': 5})
    plt.show()

Bar chart and histogram

The drawback of the pie chart is that it cannot express the actual values of the members. The bar chart (bar plot) is used to present the values themselves. It draws one vertical bar per member, and the height of the bar represents the value. Using the same 2011 GDP data, the bar chart looks like this:

The bar chart has a horizontal and a vertical axis. The horizontal axis labels the country that each bar belongs to, and the vertical axis shows the GDP value. In this way, readers can read off the GDP of each country. The code for the chart is as follows:

    import matplotlib.pyplot as plt
    import numpy as np

    # quants: GDP
    # labels: country name
    labels = []
    quants = []

    # Read data
    with open('major_country_gdp.txt') as f:
        for line in f:
            info = line.split()
            labels.append(info[0])
            quants.append(float(info[1]))

    width = 0.4
    ind = np.linspace(0.5, 9.5, 10)

    # Create the figure and axes
    # (the figure size is an assumption; the value was lost in the source)
    fig = plt.figure(1, figsize=(12, 6))
    ax = fig.add_subplot(111)

    # Bar Plot (current Matplotlib centers each bar on its x position by default)
    ax.bar(ind, quants, width, color='coral')

    # Set the ticks on x-axis
    ax.set_xticks(ind)
    ax.set_xticklabels(labels)

    # labels
    ax.set_xlabel('Country')
    ax.set_ylabel('GDP (Million US dollar)')

    # title
    ax.set_title('Top 10 GDP Countries (2011)', bbox={'facecolor': '0.8', 'pad': 5})
    plt.show()

A basic bar chart simply marks the data values. If you only want to know the values, you can read them directly from the data table without drawing a chart. In statistical drawing, a chart derived from the bar chart is used more often: the histogram. A histogram first preprocesses the population data and then draws the result as a bar chart. As a simple example, take the heights of all the students of Shohoku High School. If every student's height were drawn as its own vertical bar, the picture would be packed with thousands of bars and would offer little useful information. Drawn as a histogram, the data look like this:

In this chart, the horizontal axis is height. The width of each bar corresponds to a height range, such as 170 cm to 172 cm. The height of each bar is the number of students whose height falls in that range. So a histogram first groups the data into intervals and then draws the number of members in each interval as a bar chart. Grouping discards some of the original information, so the height of an individual student can no longer be read from the bars. But the simplified information is easier to understand. After looking at this chart, we can confidently say that most students are close to 170 cm tall, and that the proportion of students shorter than 150 cm or taller than 190 cm is very small. It would be hard to reach the same conclusion quickly by reading the raw data alone.
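The grouping step behind a histogram can be sketched in a few lines of plain Python. The function below counts how many values fall into each equal-width interval; the function name, the interval limits, and the `heights` list are illustrative assumptions rather than the article's data.

```python
def histogram_counts(values, low, high, bin_width):
    """Count values in equal-width intervals [start, start + bin_width) between low and high."""
    n_bins = int(round((high - low) / bin_width))
    counts = [0] * n_bins
    for v in values:
        if low <= v < high:
            index = int((v - low) // bin_width)
            counts[min(index, n_bins - 1)] += 1   # guard against floating-point edge effects
    return counts


# Hypothetical heights in cm, for illustration only.
heights = [151.2, 163.0, 165.5, 168.9, 170.3, 171.1, 171.9, 173.4, 176.0, 188.7]
counts = histogram_counts(heights, low=150, high=190, bin_width=10)

# A text rendering of the histogram: one row per interval.
for i, count in enumerate(counts):
    start = 150 + 10 * i
    print(f"{start}-{start + 10} cm: {'#' * count}")
```

The individual heights are gone after grouping, but the shape of the distribution is visible at a glance.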

The histogram drawing program is as follows:

    import numpy as np
    import matplotlib.pyplot as plt

    # Read one height value per line
    with open("xiangbei_height.txt", "r") as f:
        lines = f.readlines()
    x = list(map(float, lines))

    plt.title("Heights of Students (Shohoku High School)")
    plt.hist(x, 50)
    plt.xlabel("height (cm)")
    plt.ylabel("count")
    plt.show()

The hist() function in the code draws the histogram, and the argument 50 is the number of intervals (bins) to generate. If needed, you can instead specify the interval boundaries explicitly.
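When the intervals are given explicitly, `plt.hist` accepts a sequence of bin edges as its `bins` argument. As a rough sketch of what such grouping means, the plain-Python function below counts values against a list of edges, with every interval half-open except the last one, which also includes its right edge (the convention `numpy.histogram` documents). The edges and heights are made-up illustration values.

```python
from bisect import bisect_right


def counts_for_edges(values, edges):
    """Count values per interval for sorted bin edges.

    Intervals are [a, b) except the last, which is [a, b].
    """
    counts = [0] * (len(edges) - 1)
    for v in values:
        if v < edges[0] or v > edges[-1]:
            continue                                # outside all intervals
        index = bisect_right(edges, v) - 1
        index = min(index, len(counts) - 1)         # v equal to the last edge joins the last interval
        counts[index] += 1
    return counts


# Hypothetical, unequal intervals: wide at the tails, narrow in the middle.
edges = [140, 160, 170, 180, 200]
heights = [151.2, 160.0, 165.5, 168.9, 170.3, 171.1, 179.9, 180.0, 188.7, 200.0]
edge_counts = counts_for_edges(heights, edges)
print(edge_counts)
```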

Trend chart

Trend charts (run charts), also known as line charts, are often used to present time series. A time series is a set of data generated over time, such as last year's daily temperatures in Shanghai or China's GDP over the last 50 years. A trend chart connects the data points of adjacent times with straight lines, so it shows visually how the data change over time. Trend charts are very common in daily life. For example, investors often use them to follow stock prices over time. The following is the trend chart of China's GDP from 1960 to 2015:

In this trend chart, it is easy to see that China's GDP is growing rapidly over time. The code for the drawing is as follows:

    import numpy as np
    import matplotlib.pyplot as plt

    # read data
    with open("China_GDP.csv", "r") as f:
        lines = f.readlines()
    info = lines[1].split(",")

    # convert data
    x = []
    y = []

    def convert(info_item):
        # strip the surrounding double quotes before converting
        return float(info_item.strip('"'))

    for count, info_item in enumerate(info):
        try:
            y.append(convert(info_item))
            x.append(1960 + count)
        except ValueError:
            print("%s is not a float" % info_item)

    # plot
    plt.title("China GDP")
    plt.plot(x, y)
    plt.xlabel("year")
    plt.ylabel("GDP (USD)")
    plt.show()

Scatter chart

The charts above are all essentially two-dimensional statistical charts. The pie chart relates country and proportion, the histogram relates height and number of people, and the two dimensions of the trend chart are time and GDP. The scatter plot is the most direct way to express a two-dimensional relationship. Other two-dimensional charts can be understood as variations of the scatter plot.

A scatter plot presents data by marking data points on a two-dimensional plane. If we want to study the relationship between the height and weight of Shohoku High School students, we can mark every member on a "height-weight" plane:

In this scatter plot, the horizontal axis represents height, the vertical axis represents weight, and each dot represents one student. The height and weight of a student can be read from the horizontal and vertical coordinates of the dot. A scatter plot presents all the data directly, so it shows the characteristics of the overall distribution. We can see from the chart that weight generally increases with height.
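An impression such as "weight generally increases with height" can also be summarized by a single number. The sketch below computes the Pearson correlation coefficient in plain Python; the technique is an illustration added here rather than part of the original tutorial, and the paired values are made up.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical (height in cm, weight in kg) pairs, for illustration only.
sample_heights = [160, 165, 170, 175, 180, 185]
sample_weights = [52, 58, 61, 70, 72, 80]

r = pearson_r(sample_heights, sample_weights)
print("correlation:", r)
```

A value close to +1 means the points lie near an upward-sloping line, a value close to -1 a downward-sloping one, and a value near 0 indicates no linear trend.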

The drawing code is as follows:

    import numpy as np
    import matplotlib.pyplot as plt

    def read_data(filename):
        # Read one numeric value per line
        with open(filename) as f:
            lines = f.readlines()
        return np.array(list(map(float, lines)))

    height = read_data("xiangbei_height.txt")
    weight = read_data("xiangbei_weight.txt")

    plt.scatter(height, weight)
    plt.title("Shohoku High School")
    plt.xlabel("height (cm)")
    plt.ylabel("weight (kg)")
    plt.ylim([20, 120])
    plt.show()

A scatter plot represents data through two-dimensional positions. In practice, a third dimension can also be represented by the size of each point. This variation of the scatter plot is called a bubble plot. Besides the size of the points, a bubble plot sometimes uses their color to express even more dimensions.

Let's look at an example of a bubble chart. The following picture shows the population of major cities in Asia. The location of the city contains two-dimensional information, namely longitude and latitude. In addition, the population constitutes the third dimension. We use the size of the scatter to represent this dimension.

The data are as follows:

    Shanghai 23019148 31.23N 121.47E China
    Mumbai   12478447 18.96N  72.82E India
    Karachi  13050000 24.86N  67.01E Pakistan
    Delhi    16314838 28.67N  77.21E India
    Manila   11855975 14.62N 120.97E Philippines
    Seoul    23616000 37.56N 126.99E Korea (South)
    Jakarta  28019545  6.18S 106.83E Indonesia
    Tokyo    35682460 35.67N 139.77E Japan
    Peking   19612368 39.91N 116.39E China
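Each row holds a city name, a population, a latitude and longitude with a hemisphere suffix, and a country. A minimal sketch of turning such rows into numbers is shown below. The helper names are hypothetical, and this version represents southern latitudes and western longitudes as negative values, whereas the plotting code in this article maps western longitudes into the 0 to 360 range.

```python
def parse_coordinate(text):
    """Convert '31.23N', '6.18S' or '121.47E' to a signed float (south and west are negative)."""
    value = float(text[:-1])
    hemisphere = text[-1].upper()
    if hemisphere not in ("N", "S", "E", "W"):
        raise ValueError("unknown hemisphere suffix: %r" % text)
    return -value if hemisphere in ("S", "W") else value


def parse_city(line):
    """Split one whitespace-separated row into a dictionary of typed fields."""
    name, population, lat, lon, *country = line.split()
    return {
        "name": name,
        "population": int(population),
        "lat": parse_coordinate(lat),
        "lon": parse_coordinate(lon),
        "country": " ".join(country),   # the country may contain spaces
    }


rows = [
    "Shanghai 23019148 31.23N 121.47E China",
    "Jakarta 28019545 6.18S 106.83E Indonesia",
    "Seoul 23616000 37.56N 126.99E Korea (South)",
]
cities = [parse_city(row) for row in rows]
for city in cities:
    print(city)
```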

Matplotlib's Basemap module is used in the code to draw the map:

    from mpl_toolkits.basemap import Basemap
    import matplotlib.pyplot as plt
    import numpy as np

    # ============================================
    # read data
    names = []
    pops = []
    lats = []
    lons = []
    countries = []

    with open("major_city.txt", "r") as f:
        for line in f:
            info = line.split()
            names.append(info[0])
            pops.append(float(info[1]))
            lat = float(info[2][:-1])
            if info[2][-1] == 'S':
                lat = -lat
            lats.append(lat)
            lon = float(info[3][:-1])
            if info[3][-1] == 'W':
                lon = -lon + 360.0
            lons.append(lon)
            country = info[4]
            countries.append(country)

    # ============================================
    # set up the map projection, using low resolution coastlines
    m = Basemap(projection='ortho', lat_0=35, lon_0=120, resolution='l')
    # draw coastlines and country boundaries
    m.drawcoastlines(linewidth=0.25)
    m.drawcountries(linewidth=0.25)
    # draw the edge of the map projection region (the projection limb)
    m.drawmapboundary(fill_color='#689CD2')
    # draw lat/lon grid lines every 30 degrees
    m.drawmeridians(np.arange(0, 360, 30))
    m.drawparallels(np.arange(-90, 90, 30))
    # fill the continents with a different color
    m.fillcontinents(color='#BF9E30', lake_color='#689CD2', zorder=0)

    # compute native map projection coordinates of the cities
    x, y = m(lons, lats)
    max_pop = max(pops)

    # Set some parameters
    size_factor = 160.0
    y_offset = 30

    def adjust_size(k):
        # marker size grows with the population above 10 million
        return size_factor * (k - 10000000) / max_pop

    # Plot each city in a loop
    for i, j, k, name in zip(x, y, pops, names):
        cs = m.scatter(i, j, s=adjust_size(k), marker='o', color='#FF5600')
        plt.text(i, j + y_offset, name, fontsize=10)
        print(i, j)

    # Legend: three example bubbles along the bottom of the map
    # (positions are in map projection coordinates, reconstructed from the garbled source)
    examples = [12000000, 24000000, 36000000]
    legend_xs = [300000, 3300000, 6300000]
    legend_y = 300000
    for legend_x, pop in zip(legend_xs, examples):
        plt.scatter(legend_x, legend_y, s=adjust_size(pop), marker='o', color='red')
        plt.text(legend_x, legend_y + y_offset, str(pop // 1000000) + " million",
                 rotation=0, fontsize=10)

    plt.title('Major Cities in Asia & Population')
    plt.show()

The charts so far focus on the original data. There are also charts that show population parameters, such as the box plot. For example, the height data of Shohoku High School drawn as a box plot:

As annotated in the figure, the box plot mainly shows the median and the quartiles. The lower and upper quartiles form the box, which contains half of the data. Two whiskers extend from the box, each reaching at most 1.5 box heights (1.5 × IQR) beyond the lower or upper edge of the box. If that limit lies beyond the extreme value of the data, the whisker stops at the extreme value. Otherwise, some data points lie beyond the whisker. These points are considered outliers and are drawn as individual dots.
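The numbers behind a box plot can be computed directly. The sketch below uses `statistics.quantiles` with `method="inclusive"` (linear interpolation between ranks) for the quartiles, places the limits 1.5 × IQR beyond the box, and ends each whisker at the most extreme data point inside its limit. The function name and the sample data are illustrative assumptions.

```python
import statistics


def box_plot_summary(values, whisker=1.5):
    """Return the quartiles, whisker ends, and outliers used to draw a box plot."""
    ordered = sorted(values)
    q1, q2, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1
    low_limit = q1 - whisker * iqr
    high_limit = q3 + whisker * iqr
    inside = [v for v in ordered if low_limit <= v <= high_limit]
    outliers = [v for v in ordered if v < low_limit or v > high_limit]
    return {
        "q1": q1,
        "median": q2,
        "q3": q3,
        "whisker_low": inside[0],     # most extreme data point within the lower limit
        "whisker_high": inside[-1],   # most extreme data point within the upper limit
        "outliers": outliers,
    }


# Hypothetical data with one obvious outlier.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
summary = box_plot_summary(sample)
print(summary)
```

Here the upper limit is 7.75 + 1.5 × 4.5 = 14.5, so the upper whisker ends at 9 and the value 100 is reported as an outlier.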

The code is as follows:

    import matplotlib.pyplot as plt

    # Read one height value per line
    with open("xiangbei_height.txt", "r") as f:
        lines = f.readlines()
    x = list(map(float, lines))

    plt.boxplot(x)
    plt.title("boxplot of Shohoku High School")
    plt.xticks([1], ['Shohoku'])
    plt.ylabel("height (cm)")
    plt.show()

The box plot reflects a useful idea: drawing population parameters together with the original data helps us understand the data. For example, we can mark the mean and the standard deviation on the histogram:

The code is as follows:

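    import numpy as np
    import matplotlib.pyplot as plt

    # Read one height value per line
    with open("xiangbei_height.txt", "r") as f:
        lines = f.readlines()
    x = list(map(float, lines))

    plt.title("Heights of Students (Shohoku High School)")
    plt.hist(x, 50)
    plt.xlabel("height (cm)")
    plt.ylabel("count")

    mu = np.mean(x)
    std = np.std(x)

    h = 120
    text_color = "white"
    # Mark the mean and one standard deviation on either side.
    # The label strings and rotation are assumptions; the source lost these arguments.
    plt.axvline(x=mu, color="red")
    plt.text(mu, h, "mean", rotation=90, color=text_color)
    plt.axvline(x=mu - std, color="coral")
    plt.text(mu - std, h, "mean-std", rotation=90, color=text_color)
    plt.axvline(x=mu + std, color="coral")
    plt.text(mu + std, h, "mean+std", rotation=90, color=text_color)
    plt.show()

How to draw a good picture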

Although this article describes some commonly used chart types, data drawing involves many human choices. The same data set, even in the same chart form, can produce a variety of images, and those images can differ greatly in how effectively they convey information. How do you draw a good chart? Based on my own experience, I have summarized the following guidelines:

Determine the purpose. Research produces a large number of charts, but when presenting them, each chart should focus on one point.

Explain the main content of a data graph in the title.

Mark each axis and mark the scale and unit of the coordinates.

If there is no axis, you need to use a legend to illustrate the reading. For example, in a bubble diagram, a legend is used to illustrate the readings represented by the bubble size.

Label additional image elements in the diagram, such as the marking line representing the average value, the dotted curve representing the fitting, and so on.

Back up data, image files, and related code.

When introducing a data graph, you can also follow a certain order:

Say in one sentence what the chart shows: "This chart shows the height distribution of Shohoku High School students."

Explain the axes: "The horizontal axis represents height, and the vertical axis represents the number of people."

Explain the meaning of the main image elements: "Each vertical bar corresponds to a height range. The height of the bar is the number of students in that range."

Explain the meaning of the secondary image elements: "The red line marks the average height of the students."

Guide readers toward a deeper reading: "You can see that most students' heights are concentrated near the average..."

Of course, because data drawing depends on human choices, there is no single universal method. However, establishing a fixed process makes drawing more efficient, so I suggest you set up a drawing process of your own.
