Updated 2025-03-28. Source: SLTechnology News&Howtos, Shulou (Shulou.com), 06/02 report.
This article shows how to quickly make beautiful, cool, and insightful charts with Python. It is rich in content and approaches the topic from a professional point of view; I hope you get something out of it.
A regular plot of the positive correlation between Life Ladder (the happiness index) and GDP per capita (money)
The article presents three different ways to visualize data with Python, using data from the 2019 World Happiness Report as the example. I enriched the World Happiness Report data with information from Gapminder and Wikipedia in order to explore new relationships and visualizations.
The World Happiness Report tries to answer which factors affect happiness around the world.
The report derives its happiness index from answers to the Cantril ladder question, in which respondents rate their current lives on a scale from 0 (the worst possible life) to 10 (the best possible life).
Life Ladder, the happiness index, is used as the target variable.
Article structure
Photo Source: Nik MacMillan/Unsplash
The purpose of this article is to serve as a code guide and reference point when you are looking for a specific type of chart. To save space, I sometimes merge several charts into a single figure. Rest assured, you can find all the underlying code in this Repo or the corresponding Jupyter Notebook.
1. My experience of plotting with Python
About two years ago, I began to study Python more seriously. Since then, Python has surprised me almost every week: not only is the language itself easy to use, its ecosystem also has many amazing open-source libraries. The more familiar I become with its commands, patterns, and concepts, the more I can make full use of them.
(1) Matplotlib
Plotting with Python used to be the opposite. At first, almost every chart I created with matplotlib looked dated. To make matters worse, I had to spend hours on Stack Overflow to create these annoying things, for example digging up the basic command to change the slant of the x-axis tick labels or something equally silly. And don't get me started on multi-chart figures. Creating charts programmatically is amazing, for example generating 50 charts for different variables in one go, and the results are impressive. However, it involves a lot of work and a lot of otherwise useless commands to remember.
(2) Seaborn
Learning Seaborn saved me a lot of effort. Seaborn abstracts away much of the fine-tuning, and there is no doubt that it greatly improves the aesthetics of the charts. However, it is also built on top of matplotlib, so any non-standard adjustment usually still requires low-level matplotlib code.
(3) Bokeh
For a moment, I thought Bokeh would be the alternative. I came across Bokeh when I was doing geospatial visualization. However, I soon realized that although Bokeh is different, it is just as complex as matplotlib.
(4) Plotly
Not long ago I did try plot.ly (from here on simply plotly), also for geospatial visualization. At that time, plotly was more troublesome than the libraries mentioned above. You had to register your notebook with an account, plotly would then render the chart online, and you would download the final result. I gave up quickly. However, I recently saw a YouTube video about plotly express and plotly 4.0, and the point is that they got rid of all that online nonsense. I gave it a try, and this article is the result. Better late than never, I guess.
(5) Kepler.gl (Geospatial data Excellence Award)
Kepler.gl is not a Python library but a powerful web-based tool for visualizing geospatial data. All it needs are CSV files, which you can easily create with Python. Try it!
(6) Current workflow
In the end, I settled on Pandas' native plotting for quick checks and Seaborn for charts that go into reports and presentations (visuals matter).
2. The importance of distributions
While studying in San Diego, I was in charge of teaching statistics (Stats119). Stats119 is an introductory course covering the basics of statistics, such as data aggregation (visual and quantitative), the concept of probability, regression, sampling, and, most importantly, distributions. During that time, my understanding of quantities and phenomena shifted almost completely to a distribution-based one (Gaussian, most of the time).
To this day, I am still amazed at how much two quantities, the mean and the standard deviation, help in understanding a phenomenon. Knowing just these two, you can directly work out how likely a specific outcome is, and you immediately know where most of the outcomes will fall. They provide a frame of reference for quickly spotting statistically significant events without overly complex calculations.
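As a rough illustration of the point above, the sketch below uses Python's standard-library statistics.NormalDist to turn a mean and a standard deviation into probabilities. The values 5.5 and 1.1 are made-up placeholders rather than figures from the dataset, and the calculation assumes the quantity is normally distributed.

```python
from statistics import NormalDist

# Hypothetical mean and standard deviation (placeholders, not taken from the data).
mean, std = 5.5, 1.1
dist = NormalDist(mu=mean, sigma=std)

# Probability that an outcome falls within one standard deviation of the mean.
p_within_1_std = dist.cdf(mean + std) - dist.cdf(mean - std)

# Probability of an outcome more than two standard deviations above the mean.
p_above_2_std = 1 - dist.cdf(mean + 2 * std)

print(f"within 1 std: {p_within_1_std:.3f}")
print(f"above 2 std:  {p_above_2_std:.3f}")
```

Outcomes far out in the tails are rare under this assumption, which is what makes them stand out as candidates for a closer look.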
In general, my first step when dealing with new data is to try to visualize its distribution in order to better understand the data.
3. Load data and package import
First, load the data used throughout this article. I have already preprocessed the data and inferred values where it made sense.
import pandas as pd

# Load the data
data = pd.read_csv('https://raw.githubusercontent.com/FBosler/AdvancedPlotting/master/combined_set.csv')

# Assign a GDP label to every row, computed separately for each year
data['Mean Log GDP per capita'] = data.groupby('Year')['Log GDP per capita'].transform(
    pd.qcut, q=5, labels=['Lowest', 'Low', 'Medium', 'High', 'Highest']
)
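To make the per-year labelling step easier to follow, here is a minimal sketch of the same groupby/transform/qcut pattern on a tiny invented frame. The numbers and the column name "GDP group" are placeholders for illustration only; the column names "Year" and "Log GDP per capita" mirror the ones used above.

```python
import pandas as pd

# Tiny made-up frame: two years, five countries each (values are placeholders).
toy = pd.DataFrame({
    "Year": [2017] * 5 + [2018] * 5,
    "Log GDP per capita": [7.0, 8.0, 9.0, 10.0, 11.0, 7.5, 8.5, 9.5, 10.5, 11.5],
})

labels = ["Lowest", "Low", "Medium", "High", "Highest"]

# Within each year, cut the values into five equal-sized bins (quintiles)
# and replace every value with the label of its bin.
toy["GDP group"] = toy.groupby("Year")["Log GDP per capita"].transform(
    lambda s: pd.qcut(s, q=5, labels=labels)
)

print(toy)
```

Because the bins are computed inside each year, a country is compared only with the other countries measured in the same year.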
The dataset contains the following values:
Year: the year of measurement (2007-2018)
Life Ladder: respondents rate their current lives on a scale of 0 to 10 (10 being the best) according to the Cantril ladder.
Log GDP per capita: the logarithm of GDP per capita in terms of purchasing power parity (PPP), adjusted to constant 2011 international dollars, taken from the World Development Indicators (WDI) released by the World Bank on November 14, 2018
Social support: answers to the question "When you run into difficulties, can you count on relatives or friends to help you at any time?"
Healthy life expectancy at birth: based on data from the World Health Organization (WHO) Global Health Observatory (GHO) for 2005, 2010, 2015, and 2016.
Freedom to make life choices: answers to the question "Are you satisfied with your freedom to choose what you do with your life?"
Generosity: answers to "Have you donated money to a charity in the past month?", set in relation to GDP per capita
Perceptions of corruption: answers to "Is corruption widespread in the government?" and "Is corruption widespread within businesses?"
Positive affect: the average frequency of happiness, laughter, and enjoyment on the previous day.
Negative affect: the average frequency of worry, sadness, and anger on the previous day.
Confidence in national government: self-explanatory
Quality of democracy: how democratic a country is
Quality of delivery: how well a country delivers on its policies
Gapminder Life Expectancy: life expectancy according to Gapminder
Gapminder Population: the population of the country
Import
import plotly
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px
import matplotlib
import matplotlib.pyplot as plt  # used by the plt.* calls in the FacetGrid examples

%matplotlib inline

assert matplotlib.__version__ == "3.1.0", """
Please install matplotlib version 3.1.0 by running:
1) !pip uninstall matplotlib
2) !pip install matplotlib==3.1.0
"""
4. Quick: basic plotting with Pandas
Pandas has built-in plotting functions that can be called on a Series or a DataFrame. I like them because they are concise, use reasonably smart defaults, and quickly give you an idea of what is going on.
To create a chart, call .plot() on your data and pass the chart type as kind, like this:
np.exp(data[data['Year'] == 2018]['Log GDP per capita']).plot(kind='hist')
Running the command above generates the following chart.
2018: histogram of GDP per capita. Unsurprisingly, most countries are poor.
When plotting with Pandas, five parameters matter most:
kind: Pandas has to know what kind of chart to create. The options are hist (histogram), bar (bar chart), barh (horizontal bar chart), scatter (scatter plot), area (area chart), kde (kernel density estimate), line (line chart), box (box plot), hexbin (hexagonal binning), and pie (pie chart).
figsize: overrides the default output size of 6 inches wide by 4 inches high. It expects a tuple (for example, I often use figsize=(12,8)).
title: adds a title to the chart. Most of the time I use it to state what the chart shows, so that I can quickly tell what it is about when I come back to it. It expects a string.
bins: overrides the bin widths of a histogram. It expects a list or list-like sequence of bin edges (for example, bins=np.arange(2,8,0.25)).
xlim/ylim: override the default minimum and maximum of an axis. Both expect a tuple (for example, xlim=(0,5)).
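The five parameters above can be combined in a single call. The sketch below does so on synthetic data; the series, its name, and the title string are invented for illustration and are not part of the article's dataset.

```python
import numpy as np
import pandas as pd

# Synthetic scores roughly in the 2-8 range (placeholder data, not the real dataset).
rng = np.random.default_rng(0)
scores = pd.Series(rng.normal(loc=5.5, scale=1.0, size=500), name="score")

ax = scores.plot(
    kind="hist",                     # type of chart
    figsize=(12, 8),                 # width and height in inches
    title="Synthetic score histogram",
    bins=np.arange(2, 8.25, 0.25),   # bin edges from 2 to 8 in steps of 0.25
    xlim=(2, 8),                     # visible range of the x-axis
)
```

The call returns a matplotlib Axes object, so anything Pandas does not expose can still be adjusted on ax afterwards.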
Let's take a quick look at the different chart types.
(1) Vertical bar chart:
data[data['Year'] == 2018].set_index('Country name')['Life Ladder'].nlargest(15).plot(kind='bar', figsize=(12,8))
2018: Finland leads the list of the 15 happiest countries
(2) Horizontal bar chart:
np.exp(data[data['Year'] == 2018].groupby('Continent')['Log GDP per capita'].mean()).sort_values().plot(kind='barh', figsize=(12,8))
GDP per capita in 2011 USD: Australia and New Zealand have a clear lead
(3) Box plot
data['Life Ladder'].plot(kind='box', figsize=(12,8))
The box plot of the Life Ladder distribution shows a median of about 5.5 and a range from 3 to 8.
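For readers who want the numbers behind a box plot, the following sketch computes the quantities it draws: the median, the quartiles, and the whisker ends under matplotlib's default rule of 1.5 times the interquartile range. The values are placeholders standing in for a column such as 'Life Ladder'.

```python
import pandas as pd

# Placeholder values standing in for a column such as 'Life Ladder'.
values = pd.Series([3.2, 4.1, 4.8, 5.3, 5.5, 5.9, 6.4, 7.0, 7.6])

q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1  # interquartile range: the height of the box

# By default the whiskers reach to the most extreme points that still lie
# within 1.5 * IQR of the box; anything beyond is drawn as an outlier.
lower_whisker = values[values >= q1 - 1.5 * iqr].min()
upper_whisker = values[values <= q3 + 1.5 * iqr].max()

print(f"median={median}, box=({q1}, {q3}), whiskers=({lower_whisker}, {upper_whisker})")
```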
(4) Scatter plot
data[['Healthy life expectancy at birth', 'Gapminder Life Expectancy']].plot(kind='scatter', x='Healthy life expectancy at birth', y='Gapminder Life Expectancy', figsize=(12,8))
The scatter plot shows a high correlation between the healthy life expectancy from the World Happiness Report and the life expectancy from Gapminder.
(5) Hexbin chart
data[data['Year'] == 2018].plot(
    kind='hexbin',
    x='Healthy life expectancy at birth',
    y='Generosity',
    C='Life Ladder',
    gridsize=20,
    figsize=(12,8),
    cmap="Blues",   # defaults to greenish
    sharex=False    # required to get rid of a bug
)
2018: hexbin chart of healthy life expectancy against generosity. The color of each cell indicates the average Life Ladder in that cell.
(6) Pie chart
data[data['Year'] == 2018].groupby(['Continent'])['Gapminder Population'].sum().plot(
    kind='pie',
    figsize=(12,8),
    cmap="Blues_r"  # defaults to orangish
)
2018: pie chart of total population by continent
(7) Stacked area chart
data.groupby(['Year', 'Continent'])['Gapminder Population'].sum().unstack().plot(
    kind='area',
    figsize=(12,8),
    cmap="Blues"  # defaults to orangish
)
The global population is growing.
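The groupby, sum, and unstack chain in the stacked area example reshapes the data before plotting. Here is a small sketch of that reshaping on invented numbers; the frame and its "Population" column are placeholders, not the Gapminder figures.

```python
import pandas as pd

# Made-up populations in millions (placeholders, not the Gapminder figures).
toy = pd.DataFrame({
    "Year": [2017, 2017, 2017, 2018, 2018, 2018],
    "Continent": ["Asia", "Asia", "Europe", "Asia", "Asia", "Europe"],
    "Population": [10, 20, 5, 11, 21, 6],
})

# Long form: one total per (Year, Continent) pair, indexed by both keys.
totals = toy.groupby(["Year", "Continent"])["Population"].sum()

# Wide form: years as rows, one column per continent. This is the shape
# .plot(kind='area') stacks, one coloured band per column.
wide = totals.unstack()
print(wide)
```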
(8) Line chart
data[data['Country name'] == 'Germany'].set_index('Year')['Life Ladder'].plot(kind='line', figsize=(12,8))
A line chart showing the development of the happiness index in Germany
(9) Summary of plotting with Pandas
Plotting with Pandas is convenient: it is easy to access and fast. The charts are rather ugly, though, and deviating from the defaults is almost impossible. That is fine, because there are other tools for making more beautiful charts.
5. Beautiful: advanced plotting with Seaborn
Seaborn works with plotting defaults. To make sure your results match the ones in this article, run the following commands.
sns.reset_defaults()
sns.set(
    rc={'figure.figsize': (7,5)},
    style="white"  # nicer layout
)
(1) Plotting univariate distributions
As mentioned earlier, I am a big fan of distributions. Histograms and kernel density estimates are both effective ways to visualize the key features of a particular variable. Let's look at how to put the distribution of a single variable, or of several variables, into one chart.
Left: histogram and kernel density estimate of Life Ladder for Asian countries in 2018
Right: kernel density estimates of Life Ladder for the five GDP per capita groups, reflecting the relationship between money and happiness
(2) Plotting bivariate distributions
Whenever I want to visually explore the relationship between two or more variables, I usually end up with some form of scatter plot and an assessment of the distributions. There are three conceptually similar variations of this plot. In each of them, the center graph (scatter plot, bivariate KDE, or hexbin) helps to understand the joint frequency distribution of the two variables. In addition, the marginal univariate distribution of each variable (as a KDE or histogram) is drawn along the right and upper border of the center graph.
sns.jointplot(
    x='Log GDP per capita',
    y='Life Ladder',
    data=data,
    kind='scatter'  # or 'kde' or 'hex'
)
Seaborn joint plots with a scatter plot, a bivariate KDE, and a hexbin in the center graph, and the marginal distributions to the right of and above it.
(3) Scatter plot
A scatter plot visualizes the joint distribution of two variables. You can bring in a third variable through the hue (color) parameter and a fourth through the size parameter.
sns.scatterplot(
    x='Log GDP per capita',
    y='Life Ladder',
    data=data[data['Year'] == 2018],
    hue='Continent',
    size='Gapminder Population'
)
# both hue and size are optional
sns.despine()  # prettier layout
The relationship between Log GDP per capita and Life Ladder. Color indicates the continent and marker size the population.
(4) Violin plot
The violin plot combines a box plot with a kernel density estimate. Like a box plot, it shows the distribution of quantitative data across the levels of a categorical variable so that the distributions can be compared.
sns.set(rc={'figure.figsize': (18,6)}, style="white")
sns.violinplot(
    x='Continent',
    y='Life Ladder',
    hue='Mean Log GDP per capita',
    data=data
)
sns.despine()
A violin plot of Life Ladder by continent, with the data grouped by mean Log GDP per capita. The higher the GDP per capita, the higher the happiness index.
(5) Pair plot
A Seaborn pair plot draws every pairwise combination of the variables as scatter plots in one large grid. I usually find it a bit of an information overload, but it can help to spot patterns.
sns.set(style="white", palette="muted", color_codes=True)
sns.pairplot(
    data[data.Year == 2018][[
        'Life Ladder', 'Log GDP per capita', 'Social support',
        'Healthy life expectancy at birth', 'Freedom to make life choices',
        'Generosity', 'Perceptions of corruption', 'Positive affect',
        'Negative affect', 'Confidence in national government',
        'Mean Log GDP per capita'
    ]].dropna(),
    hue='Mean Log GDP per capita'
)
In the Seaborn scatter-plot grid, all the selected variables are plotted against each other in the lower and upper halves of the grid, and the diagonal contains KDE plots.
(6) FacetGrids
For me, Seaborn's FacetGrid is one of the most convincing arguments for using Seaborn, because it makes creating multi-chart figures easy. With the pair plot we have already seen an example of a FacetGrid. It creates multiple charts grouped by variables. For example, the rows can be one variable (the GDP per capita category) and the columns another (the continent).
It does need a bit more customization than I would like (that is, via matplotlib), but it is still convincing.
(7) FacetGrid: line chart
g = sns.FacetGrid(
    data.groupby(['Mean Log GDP per capita', 'Year', 'Continent'])['Life Ladder'].mean().reset_index(),
    row='Mean Log GDP per capita',
    col='Continent',
    margin_titles=True
)
g = (g.map(plt.plot, 'Year', 'Life Ladder'))
The y-axis shows Life Ladder and the x-axis the year. The grid's columns are the continents and its rows the different levels of GDP per capita. Overall, things seem to be improving for countries with a low mean Log GDP per capita in North America and for countries with a medium or high mean Log GDP per capita in Europe.
(8) FacetGrid: histogram
g = sns.FacetGrid(data, col="Continent", col_wrap=3, height=4)
g = (g.map(plt.hist, "Life Ladder", bins=np.arange(...)))  # the arguments to np.arange are cut off in the source
Life Ladder histogram by continent
(9) FacetGrid: annotated KDE plot
You can also add chart-specific annotations to each chart in the grid. The following example adds the mean and standard deviation, plus a vertical line drawn at the mean (code below).
Kernel density estimate of Life Ladder by continent, annotated with mean and standard deviation
def vertical_mean_line(x, **kwargs):
    plt.axvline(x.mean(), linestyle="--", color=kwargs.get("color", "r"))
    txkw = dict(size=15, color=kwargs.get("color", "r"))
    label_x_pos_adjustment = 0.08  # this needs customization based on your data
    label_y_pos_adjustment = ...   # value missing in the source; this needs customization based on your data
    if x.mean() ...  # the listing is cut off here in the source