2025-03-07 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/01 Report--
This article looks at the powerful data visualization options in Pandas. It walks through each option step by step, with the code needed to reproduce every plot.
One of the most common pitfalls in data science is spending hours searching for the best algorithm for a project without first spending enough time understanding the data.
A structured approach to data science and machine learning projects starts with the project goals. Different insights can be drawn from the same set of data points, and depending on what we are looking for, we need to focus on a different aspect of the data. Once we have a clear goal, we should start thinking about the data points we need. This lets us focus on the most relevant information and ignore data that may not be important.
In real life, data collected from multiple sources usually contains blank values, typing errors, and other anomalies. It is critical to clean the data before carrying out any analysis.
In this article, I'll discuss five powerful data visualization options that give an immediate sense of the characteristics of the data. Even before formal modeling or hypothesis testing, exploratory data analysis (EDA) can reveal a great deal about the data and the relationships between its features.
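As a minimal sketch of that cleaning step, the snippet below counts blank values, normalises an inconsistent label, and repairs a mistyped number. The table, its column names, and its values are made up for illustration and are not part of the datasets used later.

```python
import pandas as pd

# Hypothetical raw data with a blank value, a typing error and inconsistent labels.
raw = pd.DataFrame(
    {
        "city": ["Paris", "paris ", "London", None],
        "price": ["10.5", "12,0", "9.8", "11.2"],
    }
)

# Count the missing values in each column before doing anything else.
missing_per_column = raw.isna().sum()

cleaned = raw.copy()
# Normalise labels: trim whitespace and use a consistent capitalisation.
cleaned["city"] = cleaned["city"].str.strip().str.title()
# Repair the decimal comma, then convert; unparseable entries become NaN.
cleaned["price"] = pd.to_numeric(
    cleaned["price"].str.replace(",", ".", regex=False), errors="coerce"
)
# Drop the rows that still lack a city.
cleaned = cleaned.dropna(subset=["city"]).reset_index(drop=True)

print(missing_per_column)
print(cleaned)
```

Which repairs are appropriate depends on the data; dropping rows is only one option, and imputing or flagging the missing values may suit other projects better.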
Step 1 - We import the pandas, matplotlib, seaborn, and NumPy packages that we will use for the analysis. From pandas.plotting we also need the scatter matrix, autocorrelation plot, lag plot, and parallel coordinates functions.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pandas.plotting import (
    autocorrelation_plot,
    lag_plot,
    parallel_coordinates,
    scatter_matrix,
)
Step 2 - The seaborn package provides several small built-in datasets. We will use the "mpg", "tips" and "attention" datasets for the visualizations. Each dataset is loaded with seaborn's load_dataset function.
# Download the datasets used in the program
CarDatabase = sns.load_dataset("mpg")
MealDatabase = sns.load_dataset("tips")
AttentionDatabase = sns.load_dataset("attention")
Hexagonal bin plot (hexbin)
We often use scatter plots to get a quick sense of the relationship between variables. A scatter plot is very helpful as long as the data points are not too densely packed. In the following code, we draw a scatter plot of "horsepower" against "acceleration" from the "mpg" dataset.
plt.scatter(CarDatabase.acceleration, CarDatabase.horsepower, marker="^")
plt.show()
Because the points in the scatter plot are so densely packed, it is difficult to obtain meaningful information from it.
A hexbin plot is a good alternative to a scatter plot with overlapping points. Instead of plotting each point separately, a hexbin plot divides the plane into hexagonal bins and colors each bin by the number of points that fall into it. In the following code, we draw a hexbin plot of "horsepower" against "acceleration" using the same dataset.
CarDatabase.plot.hexbin(x="acceleration", y="horsepower", gridsize=10, cmap="YlGnBu")
plt.show()
The hexbin plot clearly shows where the "horsepower" and "acceleration" values are concentrated, and that there is a negative linear relationship between the two variables. The size of the hexagons depends on the gridsize parameter: a larger gridsize gives more, smaller hexagons.
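To see the effect of gridsize, the following sketch draws the same points with two settings and counts the hexagons in each plot. It uses synthetic data (an assumption, chosen so that it runs without downloading "mpg"), and the column names x and y are placeholders.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic, negatively related data standing in for acceleration and horsepower.
rng = np.random.default_rng(0)
x = rng.normal(15, 3, size=2000)
points = pd.DataFrame({"x": x, "y": 250 - 10 * x + rng.normal(0, 20, size=2000)})

hexagon_counts = {}
for gridsize in (10, 30):
    ax = points.plot.hexbin(x="x", y="y", gridsize=gridsize, cmap="YlGnBu")
    ax.set_title(f"gridsize={gridsize}")
    # The hexbin collection stores one value (the point count) per hexagon.
    hexagon_counts[gridsize] = len(ax.collections[0].get_array())

print(hexagon_counts)
plt.close("all")  # in an interactive session, call plt.show() instead
```

A coarse grid hides detail, while a very fine grid brings back the noise of the scatter plot, so it is usually worth trying a few values.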
Heatmap
The heatmap is my personal favorite for looking at the correlation between different variables. Those who follow my articles may have noticed that I use it a lot. In the following code, we calculate the pairwise correlation between all numeric variables in the seaborn "mpg" dataset and plot it as a heatmap.
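# numeric_only=True skips the text columns ("origin", "name"), which recent pandas versions refuse to correlate
sns.heatmap(CarDatabase.corr(numeric_only=True), annot=True, cmap="YlGnBu")
plt.show()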
We can see that "cylinders" and "horsepower" are strongly positively correlated (as expected for cars), while "weight" is negatively correlated with "acceleration". It takes only a few lines of code to get a quick indication of the relationships between all the variables.
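When a table has many columns, it can help to list the strongest correlations next to the heatmap. The sketch below is one way to do that, on a synthetic table (an assumption, with made-up coefficients) so that it runs without downloading "mpg".

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a table with numeric and text columns.
rng = np.random.default_rng(1)
weight = rng.normal(3000, 500, size=200)
cars = pd.DataFrame(
    {
        "weight": weight,
        "horsepower": 0.04 * weight + rng.normal(0, 5, size=200),
        "acceleration": 25 - 0.003 * weight + rng.normal(0, 1, size=200),
        "name": ["car"] * 200,  # text column, skipped by numeric_only=True
    }
)

corr = cars.corr(numeric_only=True)

# Keep each pair once (upper triangle, diagonal excluded) and rank by strength.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna().sort_values(key=abs, ascending=False)
print(pairs)
```

Sorting by absolute value keeps strong negative correlations near the top, where they are as easy to spot as strong positive ones.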
Autocorrelation plot
The autocorrelation plot is a quick litmus test for whether the data points are random. If the data follows a trend, one or more of the autocorrelations will be significantly non-zero. The dashed lines in the figure mark the 99% confidence band (the solid lines mark the 95% band). In the following code, we check whether the total bill amounts in the "tips" dataset are random.
autocorrelation_plot(MealDatabase.total_bill)
plt.show()
We can see that the autocorrelation is very close to zero at all lags, which indicates that the total_bill data points are random.
When we draw the autocorrelation plot for data points that follow a particular order, the autocorrelation is significantly non-zero.
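# The np.arange arguments were lost in the source; 1000 is an illustrative stand-in for any ordered range
data = pd.Series(np.arange(1000))
autocorrelation_plot(data)
plt.show()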
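To make the plot less of a black box, here is a rough sketch that computes a sample autocorrelation and the 99% band by hand. The helper names are hypothetical, and the estimator is the standard textbook one, which to my understanding is what the pandas plot draws; treat that as an assumption rather than a statement about the library's internals.

```python
import numpy as np

def autocorrelation(values, lag):
    """Sample autocorrelation of a 1-D sequence at a positive lag."""
    x = np.asarray(values, dtype=float)
    n = len(x)
    mean = x.mean()
    c0 = np.sum((x - mean) ** 2) / n
    ch = np.sum((x[: n - lag] - mean) * (x[lag:] - mean)) / n
    return ch / c0

def confidence_band(n, z=2.5758293035489004):
    """Half-width of the band around zero; the default z is the 99% normal quantile."""
    return z / np.sqrt(n)

ordered = np.arange(1000)
rng = np.random.default_rng(42)
shuffled = rng.permutation(ordered)

band = confidence_band(len(ordered))
print("99% band: +/-", round(band, 4))
print("ordered, lag 1:", round(autocorrelation(ordered, 1), 4))
print("shuffled, lag 1:", round(autocorrelation(shuffled, 1), 4))
```

The ordered sequence sits far outside the band at lag 1, while a shuffled copy of the same values would normally fall inside it.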
Lag plot
Lag plots also help to check whether a dataset is a random set of values or follows a trend. A lag plot of the "total_bill" values of the "tips" dataset agrees with the autocorrelation plot: the points are scattered all over the figure, which indicates random data.
lag_plot(MealDatabase.total_bill)
plt.show()
When we draw a lag plot of a non-random sequence, as in the following code, we get a smooth line.
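# The np.arange arguments were garbled in the source; this ordered range is an illustrative reconstruction
data = pd.Series(np.arange(-12 * np.pi, 300 * np.pi, 10))
lag_plot(data)
plt.show()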
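A lag plot is a scatter of each value against the value that follows it. The sketch below builds those pairs explicitly with a hypothetical helper, lag_pairs, and uses the pandas Series.autocorr method to put a number on how strongly the pairs line up. The sequences are synthetic.

```python
import numpy as np
import pandas as pd

def lag_pairs(series, lag=1):
    """Return the (y(t), y(t + lag)) pairs that a lag plot scatters."""
    return pd.DataFrame(
        {
            "y(t)": series.iloc[:-lag].to_numpy(),
            f"y(t+{lag})": series.iloc[lag:].to_numpy(),
        }
    )

ordered = pd.Series(np.arange(0.0, 100.0, 0.5))
rng = np.random.default_rng(7)
shuffled = pd.Series(rng.permutation(ordered.to_numpy()))

ordered_pairs = lag_pairs(ordered)
print(ordered_pairs.head())

# Series.autocorr is the Pearson correlation between a series and its lagged copy.
print("ordered:", ordered.autocorr(lag=1))
print("shuffled:", shuffled.autocorr(lag=1))
```

For the ordered sequence every pair differs by the same step, so the points fall on a straight line; shuffling the values breaks that structure.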
Parallel coordinates plot
It has always been a challenge to wrap our heads around data with more than three dimensions, let alone visualize it. Drawing the parallel coordinates of a high-dimensional dataset is useful here. Each dimension is represented by a vertical line.
In parallel coordinates, N equally spaced vertical lines represent the N dimensions of the dataset. The position of a point's vertex on the nth axis corresponds to the nth coordinate of that point.
Let's consider a small sample dataset that records five features for small and large widgets.
Each vertical line represents one feature of the widgets. A series of connected line segments represents the feature values of each "small" and "large" widget.
The following code draws the parallel coordinates of the "attention" dataset in seaborn. Notice that the lines of points belonging to the same cluster lie closer together.
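The widget example can be sketched as follows. The measurements, feature names, and the "size" label are invented for illustration only. Because all the vertical axes share a single y scale, the sketch first rescales every feature to the range 0 to 1 so that no single feature dominates the picture.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import parallel_coordinates

# Made-up widget measurements: five features plus a size label.
widgets = pd.DataFrame(
    {
        "size": ["small", "small", "small", "large", "large", "large"],
        "length": [2.0, 2.2, 1.9, 5.1, 4.8, 5.3],
        "width": [1.0, 1.1, 0.9, 2.4, 2.6, 2.5],
        "height": [0.5, 0.6, 0.5, 1.2, 1.1, 1.3],
        "weight": [10.0, 11.0, 9.5, 40.0, 42.0, 39.0],
        "cost": [1.5, 1.6, 1.4, 3.9, 4.2, 4.0],
    }
)
features = [c for c in widgets.columns if c != "size"]

# Min-max scale each feature so that all axes are comparable.
scaled = widgets.copy()
scaled[features] = (widgets[features] - widgets[features].min()) / (
    widgets[features].max() - widgets[features].min()
)

# One polyline per widget, colored by its size label.
ax = parallel_coordinates(scaled, "size", color=("#556270", "#C7F464"))
ax.set_ylabel("scaled feature value")
plt.close("all")  # in an interactive session, call plt.show() instead
```

With these values the "small" lines run along the bottom of the plot and the "large" lines along the top, which is the kind of separation between clusters that the plot is meant to reveal.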
parallel_coordinates(AttentionDatabase, "attention", color=("#556270", "#C7F464"))
plt.show()