What are the visualization skills of Python datasets

2025-07-09 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

This article looks at visualization techniques for datasets in Python. The methods introduced here are simple, fast, and practical, so let's walk through them.

This article covers three practical visualization tools:

Correlation plots that include categorical variables

Scatter plot matrices (pair plots)

Seaborn's categorical scatter plots (swarm plots) with chart annotations

In general, this article will teach you to make some good-looking and useful charts.

This article uses the FIFA 2019 complete player dataset on Kaggle, whose database contains the details of each registered player.

Because the dataset has many columns, we focus only on a subset of categorical and continuous columns.

    import numpy as np
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt
    # Jupyter notebook magic; drop this line when running as a plain script
    %matplotlib inline

    # We probably don't need the gridlines. If you do, comment out this line.
    sns.set(style="ticks")

    player_df = pd.read_csv("../input/data.csv")

    numcols = ['Overall', 'Potential', 'Crossing', 'Finishing', 'ShortPassing',
               'Dribbling', 'LongPassing', 'BallControl', 'Acceleration',
               'SprintSpeed', 'Agility', 'Stamina', 'Value', 'Wage']
    catcols = ['Name', 'Club', 'Nationality', 'Preferred Foot', 'Position', 'Body Type']

    # Subset the columns
    player_df = player_df[numcols + catcols]

    # Show the first few rows of data
    player_df.head(5)

Player data

Although the data is well formatted, the wage and value columns are strings denominated in euros, so some preprocessing is needed to turn them into numbers for the analysis that follows.

    def wage_split(x):
        # Drop the leading currency symbol and the trailing "K"; result is in thousands
        try:
            return int(x.split("K")[0][1:])
        except Exception:
            return 0

    player_df['Wage'] = player_df['Wage'].apply(lambda x: wage_split(x))

    def value_split(x):
        # Drop the leading currency symbol and the suffix; result is in millions
        try:
            if 'M' in x:
                return float(x.split("M")[0][1:])
            elif 'K' in x:
                return float(x.split("K")[0][1:]) / 1000
        except Exception:
            return 0

    player_df['Value'] = player_df['Value'].apply(lambda x: value_split(x))
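To see what this kind of preprocessing does, here is a minimal standalone sketch. The helper name `parse_euro` and the sample strings are hypothetical and only assume a format of a euro sign, a number, and an optional K or M suffix. Unlike the article's functions, which express wages in thousands and values in millions, this sketch converts everything to plain euros.

```python
def parse_euro(text):
    """Convert strings such as '€565K' or '€110.5M' to a number of euros.

    Hypothetical helper for illustration; anything unparseable becomes 0.0.
    """
    multipliers = {"K": 1_000, "M": 1_000_000}
    try:
        amount = text.strip().lstrip("€")
        suffix = amount[-1].upper()
        if suffix in multipliers:
            return float(amount[:-1]) * multipliers[suffix]
        return float(amount)
    except (AttributeError, IndexError, ValueError):
        # Missing values (None, NaN) and malformed strings end up here
        return 0.0


# Assumed sample inputs, not taken from the dataset
samples = ["€565K", "€110.5M", "€0", None]
print([parse_euro(s) for s in samples])
```

Keeping a single unit for both columns makes later comparisons between wage and value less error-prone, at the cost of larger numbers on the axes.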

Correlation plots with categorical variables

To put it simply, correlation is an indicator of how two variables move together.

For example, in real life, income and expenditure are positively correlated, and one variable increases with the increase of the other.

There is a negative correlation between academic achievement and the use of video games, and the increase of one variable means the decrease of the other.

Therefore, if a predictor variable is positively or negatively correlated with the target variable, it is worth investigating.

Studying the correlation between different variables is very meaningful for understanding data.

With Seaborn, you can easily create a decent-looking correlation plot.

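    # numeric_only=True keeps recent pandas versions from failing on the string columns
    corr = player_df.corr(numeric_only=True)
    g = sns.heatmap(corr, vmax=.3, center=0, square=True, linewidths=.5,
                    cbar_kws={"shrink": .5}, annot=True, fmt='.2f', cmap='coolwarm')
    sns.despine()
    # The height was lost in the source; 10 is assumed to match the later plots
    g.figure.set_size_inches(14, 10)
    plt.show()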

Where are all the categorical variables?

Did you notice anything wrong?

There is a problem: the chart only shows correlations between the numeric columns.

What happens if the target variable is Club or Position?

To get correlations for all three cases (numerical vs. numerical, categorical vs. categorical, and numerical vs. categorical), you can use the following correlation measures.

1. Numerical variables

This case is covered by the Pearson correlation, which measures how two variables move together and lies in the range [-1, 1].
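As a rough illustration of the Pearson correlation, the sketch below computes it from its definition in plain Python. The function name `pearson` and the small income, spending, and gaming lists are made up for this example and do not come from the dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Invented numbers: spending rises with income, grades fall with gaming hours
income = [10, 20, 30, 40, 50]
spending = [8, 15, 27, 33, 45]
gaming_hours = [1, 3, 5, 7, 9]
grades = [95, 88, 80, 71, 60]

print(round(pearson(income, spending), 3))     # close to +1
print(round(pearson(gaming_hours, grades), 3))  # close to -1
```

In practice `DataFrame.corr()` does the same computation for every pair of numeric columns at once.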

2. Categorical variables

For the categorical vs. categorical case, use Cramér's V. This coefficient measures the association between two discrete variables and works with variables that have two or more levels. It is also a symmetric measure, because the order of the variables does not matter: Cramér's V(A, B) == Cramér's V(B, A).

For example, in this dataset there must be some connection between Club and Nationality.

This can be verified with a stacked bar chart, which is a good way to understand the joint distribution of two categorical variables. Because the data contains many nationalities and clubs, only a subset of the data is used.

Only a handful of teams (FC Porto is included just to diversify the sample) and the most common nationalities are kept.

Club preferences largely reflect "nationality": knowing the former helps to predict the latter.

From the picture, English players are more likely to play for Chelsea or Manchester United than for Barcelona, Bayern Munich or Porto.

In the same way, Cramér's V captures the same information.

If all clubs have the same proportion of players of each nationality, then Cramér's V is 0.

If each club prefers players of a single nationality, then Cramér's V is 1; for example, all English players play for Manchester United, all German players play for Bayern Munich, and so on.

In all other cases, the value lies in the range [0, 1].
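The two extreme cases above can be checked with a small sketch. It computes the plain (uncorrected) Cramér's V from the chi-squared statistic in pure Python; the function name `cramers_v` and the toy club and nationality lists are assumptions for illustration, and library implementations may apply a bias correction that gives slightly different numbers.

```python
import math
from collections import Counter


def cramers_v(xs, ys):
    """Uncorrected Cramér's V between two equally long lists of labels."""
    n = len(xs)
    pair_counts = Counter(zip(xs, ys))
    x_counts = Counter(xs)
    y_counts = Counter(ys)

    # Chi-squared statistic: observed vs. expected counts under independence
    chi2 = 0.0
    for x, nx in x_counts.items():
        for y, ny in y_counts.items():
            expected = nx * ny / n
            observed = pair_counts.get((x, y), 0)
            chi2 += (observed - expected) ** 2 / expected

    k = min(len(x_counts), len(y_counts)) - 1
    if k == 0:
        return 0.0  # a variable with a single level carries no association
    return math.sqrt(chi2 / (n * k))


# Toy data: every club fields exactly one nationality
clubs = ["Manchester United"] * 3 + ["Bayern Munich"] * 3
single_nationality = ["England"] * 3 + ["Germany"] * 3
# Toy data: both clubs have the same nationality mix
same_mix = ["England", "Germany", "England", "England", "Germany", "England"]

print(cramers_v(clubs, single_nationality))
print(cramers_v(clubs, same_mix))
```

Swapping the two arguments gives the same value, which is the symmetry property mentioned earlier.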

3. Numerical and categorical variables

For the continuous vs. categorical case, use the correlation ratio.

Without going too deep into the mathematics, this measure is based on dispersion: how much of the spread in the numbers is explained by the categories.

Given a number, can you tell which category it belongs to?

For example, take the two columns "SprintSpeed" and "Position" from the dataset:

Goalkeepers: 58 (De Gea), 52 (T. Courtois), 58 (M. Neuer), 43 (G. Buffon)

Centre-backs: 68 (D. Godin), 59 (V. Kompany), 73 (S. Umtiti), 75 (M. Benatia)

Forwards: 91 (C.Ronaldo), 94 (G. Bale), 80 (S.Aguero), 76 (R. Lewandowski)

As you can see, these numbers predict the position quite well, so the correlation is very high.

In this sample, a player whose sprint speed exceeds 85 is certainly a forward.

The correlation ratio also lies in the range [0, 1].
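The sprint speed example can be turned into a short calculation. The sketch below is one common way to compute the correlation ratio (the square root of between-group variation over total variation), written in plain Python; the function name `correlation_ratio` is an assumption, and the twelve values are the ones listed above.

```python
import math
from collections import defaultdict


def correlation_ratio(categories, values):
    """Correlation ratio (eta) between a categorical and a numerical variable."""
    groups = defaultdict(list)
    for category, value in zip(categories, values):
        groups[category].append(value)

    overall_mean = sum(values) / len(values)
    # Variation of the group means around the overall mean, weighted by group size
    between = sum(
        len(g) * (sum(g) / len(g) - overall_mean) ** 2 for g in groups.values()
    )
    # Variation of all individual values around the overall mean
    total = sum((v - overall_mean) ** 2 for v in values)
    return math.sqrt(between / total) if total else 0.0


positions = ["Goalkeeper"] * 4 + ["Centre-back"] * 4 + ["Forward"] * 4
sprint_speed = [58, 52, 58, 43, 68, 59, 73, 75, 91, 94, 80, 76]

eta = correlation_ratio(positions, sprint_speed)
print(round(eta, 2))
```

A value near 1 means the position explains most of the variation in sprint speed; a value near 0 would mean that the group averages are almost identical.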

The code to do this is taken from the dython package, and there won't be much code. The final result is as follows:

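    from dython.nominal import associations

    player_df = player_df.fillna(0)
    # return_results is kept as given in the source; it may differ between dython versions
    results = associations(player_df, nominal_columns=catcols, return_results=True)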

Categorical vs. categorical, categorical vs. numerical, and numerical vs. numerical: all three together make the chart much more interesting.

It's beautiful, isn't it?

Just by looking at the data, you can learn a lot about football. For example:

A player's position is highly correlated with his dribbling ability. You would not play Messi as a defender.

Value is more highly correlated with passing and ball control than with dribbling. The rule is to always pass the ball, just like Neymar.

"Club" and "Wage" are highly correlated, as you would expect.

"Body Type" and "Preferred Foot" are highly correlated. Does this mean that a lean player is likely to be left-footed? This probably has little practical significance and needs further investigation.

Moreover, all of this information comes from one simple chart, and none of it is visible in a typical correlation plot that leaves out the categorical variables.

You can study this chart in depth and get more meaningful results, but the key is that the chart makes it easier for people to find certain rules in real life.

Scatter plot matrix

The previous section talked a lot about correlation, but correlation is a fickle metric. To see why, let's look at an example.

Anscombe's quartet consists of four datasets whose correlations are nearly identical (about 0.816), yet they have very different distributions and look very different when plotted.

Anscombe Quartet: correlation is fickle

As a result, it is sometimes critical to plot the correlated data and look at each distribution individually.
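The point about Anscombe's quartet can be checked numerically. The sketch below hard-codes the four classic datasets and computes the Pearson correlation of each in plain Python; the helper name `pearson` is an assumption of this example, not something from the article.

```python
import math


def pearson(xs, ys):
    """Pearson correlation computed from its definition."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Datasets I to III share the same x values
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
anscombe = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

correlations = {name: pearson(xs, ys) for name, (xs, ys) in anscombe.items()}
for name, r in correlations.items():
    print(f"dataset {name}: r = {r:.3f}")
```

All four values agree to about two decimal places, even though one dataset is roughly linear, one is curved, and two are dominated by a single outlier. Only a plot reveals the difference.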

The dataset has many columns, so plotting every pair by hand would be laborious.

In fact, it only takes a few lines of code.

    filtered_player_df = player_df[
        (player_df['Club'].isin(['FC Barcelona', 'Paris Saint-Germain', 'Manchester United',
                                 'Manchester City', 'Chelsea', 'Real Madrid', 'FC Porto',
                                 'FC Bayern München']))
        & (player_df['Nationality'].isin(['England', 'Brazil', 'Argentina', 'Italy',
                                          'Spain', 'Germany']))
    ]

    # Single line to create the pair plot
    g = sns.pairplot(filtered_player_df[['Value', 'SprintSpeed', 'Potential', 'Wage']])

Very good. You can see a lot of information in this chart.

Wage and Value are highly correlated.

Most of the other values are correlated as well, but the relationship between "Potential" and "Value" is unusual. You can see how the value rises exponentially once a particular potential threshold is reached. This information is very helpful for modeling: could you transform "Potential" to make it more correlated?

Warning: there are no categorical columns!

Can we do better? We always can.

    g = sns.pairplot(filtered_player_df[['Value', 'SprintSpeed', 'Potential', 'Wage', 'Club']],
                     hue='Club')

The chart now carries much more information, just by passing the categorical variable "Club" to the hue parameter.

The wage distribution of Porto is skewed toward the lower end.

Porto players do not show the steep value distribution seen at the other clubs; they are probably always on the lookout for an opportunity.

Many pink dots (representing Chelsea) form a cluster on the "Potential" vs. "Wage" plot. Chelsea have a lot of low-paid, high-potential players who deserve more attention.

You can also get some information from the Wage/Value subplot.

The blue dot with a wage of 500K is Messi. In addition, the orange dot that is more valuable than Messi is Neymar.

This trick still does not solve the problem of categorical variables, but there are other ways to study the distribution of categorical variables one at a time. An example follows.

Categorical scatter plot (swarm plot)

How do you view the relationship between categorical data and numerical data?

Enter the categorical scatter plot, which Seaborn calls a swarm plot. True to its name, it draws a swarm of points for each category and spreads them slightly along the y-axis so they are easy to see.

This is currently our preferred way of plotting such relationships.

    g = sns.swarmplot(y="Club", x='Wage', data=filtered_player_df,
                      # Decrease the size of the points to avoid crowding
                      size=7)
    # Remove the top and right lines of the graph
    sns.despine()
    g.figure.set_size_inches(14, 10)
    plt.show()

Categorical scatter plot

Why not use a box plot? Where is the median? Can you draw it? Yes, of course. Overlay the swarm plot on a box plot and you get a good-looking figure.

    g = sns.boxplot(y="Club", x='Wage', data=filtered_player_df, whis=np.inf)
    g = sns.swarmplot(y="Club", x='Wage', data=filtered_player_df,
                      # Decrease the size of the points to avoid crowding
                      size=7, color='black')
    # Remove the top and right lines of the graph
    sns.despine()
    # The size was garbled in the source; (12, 8) is assumed to match the next plot
    g.figure.set_size_inches(12, 8)
    plt.show()

Categorical scatter plot + box plot: interesting

Now you can see both the distribution of the individual points and some summary statistics on one chart, and the wage differences become clear.

The rightmost point in the chart is Messi, but the reader should not have to learn that from text beneath the chart.

This chart could go into a presentation. If your boss asks you to write Messi's name on it, you can add an annotation.

    max_wage = filtered_player_df.Wage.max()
    # Build the mask from the filtered frame so that its index matches
    max_wage_player = filtered_player_df[filtered_player_df['Wage'] == max_wage]['Name'].values[0]

    g = sns.boxplot(y="Club", x='Wage', data=filtered_player_df, whis=np.inf)
    g = sns.swarmplot(y="Club", x='Wage', data=filtered_player_df,
                      # Decrease the size of the points to avoid crowding
                      size=7, color='black')
    # Remove the top and right lines of the graph
    sns.despine()

    # Annotate. xy is the coordinate of the point: max_wage is x and 0 is y.
    # In this plot y ranges from 0 to 7, one step per club.
    # xytext is the coordinate where the text is placed.
    # The label is passed positionally because its keyword name changed across matplotlib versions.
    plt.annotate(max_wage_player, xy=(max_wage, 0), xytext=(500, 1),
                 # Shrink the arrow to avoid occlusion
                 arrowprops={'facecolor': 'gray', 'width': 3, 'shrink': 0.03},
                 backgroundcolor='white')
    g.figure.set_size_inches(12, 8)
    plt.show()

Annotated statistics and swarm of points, ready for a presentation.

Look at Porto at the bottom of the picture. The salary budget is so small that it is difficult to compete with other high-income teams.

Real Madrid and Barcelona have a lot of well-paid players.

The median wage at Manchester United is *.

Manchester United and Chelsea focus on equality, and many players earn the same salary.

Although Neymar is valued more highly than Lionel Messi, the wage gap between Messi and Neymar is huge.

It can be seen that in this crazy world, some normality is only superficial.
