Shulou (shulou.com), SLTechnology News&Howtos. Updated 2025-04-02.
This article compares seven Python data visualization tools by applying each of them to the same real dataset.
The scientific Python stack is quite mature, with libraries for many applications, including machine learning and data analysis. Data visualization is an important part of exploring data and presenting results, but in the past Python lagged behind tools such as R in this area.
Fortunately, many new Python data visualization libraries have emerged in the past few years and closed some of the gap. matplotlib has become the de facto main library for data visualization, and there are many others, such as vispy, bokeh, seaborn, pygal, folium, and networkx; some of them are built on matplotlib and others are not.
This article uses these libraries to visualize a real dataset. Through the comparison, we hope to understand what each library is suited for and how to make better use of the whole Python data visualization ecosystem.
We have built an interactive course at Dataquest that teaches Python's data visualization tools, for readers who want to study the topic further.
Explore the dataset
Before we get to visualization, let's take a quick look at the dataset we will be working with. The data comes from openflights. We will use the routes, airports, and airlines datasets. Each row of the routes data corresponds to a flight route between two airports; each row of the airports data corresponds to an airport somewhere in the world, with information about it; each row of the airlines data describes one airline.
First, let's read the data:
    # Import the pandas library.
    import pandas

    # Read in the airports data.
    airports = pandas.read_csv("airports.csv", header=None, dtype=str)
    airports.columns = ["id", "name", "city", "country", "code", "icao", "latitude", "longitude", "altitude", "offset", "dst", "timezone"]
    # Read in the airlines data.
    airlines = pandas.read_csv("airlines.csv", header=None, dtype=str)
    airlines.columns = ["id", "name", "alias", "iata", "icao", "callsign", "country", "active"]
    # Read in the routes data.
    routes = pandas.read_csv("routes.csv", header=None, dtype=str)
    routes.columns = ["airline", "airline_id", "source", "source_id", "dest", "dest_id", "codeshare", "stops", "equipment"]
The data has no column headers, so we add them by assigning to the columns attribute. We read every column as a string because this simplifies later steps, where we compare different DataFrames by matching rows on id. We do this by setting the dtype parameter when reading the data.
We can take a quick look at each DataFrame:
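As a small illustration of why `header=None` and `dtype=str` matter, the sketch below reads two made-up rows from an in-memory string instead of a file. The rows, the two-column layout, and the name `sample` are assumptions for the example only; they are not real openflights records.

```python
import io
import pandas

# Two made-up rows in a headerless CSV. The second id is the literal
# two-character text \N, the marker that the cleaning step later filters out.
raw = "0042,Example Air\n\\N,Unknown Carrier\n"

# header=None stops pandas from treating the first row as column names;
# dtype=str keeps every value as text, so "0042" is not turned into 42.
sample = pandas.read_csv(io.StringIO(raw), header=None, dtype=str)
sample.columns = ["id", "name"]

print(sample["id"].tolist())
```

Because every id stays a string, ids from different files can be compared directly without worrying about integer versus text mismatches.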
    airports.head()
    airlines.head()
    routes.head()
We could do many interesting explorations on each dataset individually, but we will get the most out of them by combining them in the analysis. Pandas helps here, because it makes it easy to filter rows or apply functions to them. We will dig into a few interesting measures, such as comparing airlines and routes.
Before we do that, we need to do some data cleaning.
    # The backslash is doubled so that Python compares against the literal text \N.
    routes = routes[routes["airline_id"] != "\\N"]
This command ensures that we only have numeric data in the airline_id column.
Make a histogram
Now that we understand the structure of the data, we can go a step further and start exploring it. First we will use matplotlib, a relatively low-level plotting library in the Python stack, so it takes more commands than other libraries to produce a good-looking plot. On the other hand, matplotlib can produce almost any plot because it is very flexible; the cost of that flexibility is that it can be difficult to use.
First, we make a histogram showing the distribution of route lengths. A histogram divides all route lengths into ranges (bins) and counts how many routes fall into each range. From it we can tell whether airlines fly more short routes or more long ones.
To do this, we first need to calculate route lengths. We will use the haversine formula, which computes the distance between two points given by latitude and longitude.
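For reference, the haversine formula in its usual textbook form is given below. Here $\varphi$ denotes latitude and $\lambda$ longitude, both in radians, and $R$ is the Earth's radius; these symbols are chosen for this note, and the article's code uses $R = 6367$ km.

```latex
a = \sin^2\!\left(\frac{\varphi_2 - \varphi_1}{2}\right)
    + \cos\varphi_1 \,\cos\varphi_2 \,\sin^2\!\left(\frac{\lambda_2 - \lambda_1}{2}\right),
\qquad
d = 2R \,\arcsin\!\left(\sqrt{a}\right)
```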
    import math

    def haversine(lon1, lat1, lon2, lat2):
        # Convert coordinates to floats.
        lon1, lat1, lon2, lat2 = [float(lon1), float(lat1), float(lon2), float(lat2)]
        # Convert to radians from degrees.
        lon1, lat1, lon2, lat2 = map(math.radians, [lon1, lat1, lon2, lat2])
        # Compute distance.
        dlon = lon2 - lon1
        dlat = lat2 - lat1
        a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
        c = 2 * math.asin(math.sqrt(a))
        km = 6367 * c
        return km
Then we can write a function that calculates the distance between the source and destination airports of a single route. We take the source_id and dest_id from the routes DataFrame, match them against the id column of the airports DataFrame, and then do the calculation. The function looks like this:
    def calc_dist(row):
        dist = 0
        try:
            # Match source and destination to get coordinates.
            source = airports[airports["id"] == row["source_id"]].iloc[0]
            dest = airports[airports["id"] == row["dest_id"]].iloc[0]
            # Use coordinates to compute distance.
            dist = haversine(dest["longitude"], dest["latitude"], source["longitude"], source["latitude"])
        except (ValueError, IndexError):
            # No matching airport or unusable coordinates: keep the distance at 0.
            pass
        return dist
This function can fail if the source_id or dest_id column does not hold a valid value, so we wrap the lookup in a try/except block to catch these cases.
Finally, we use pandas to apply the distance function to the routes DataFrame. This gives us a pandas Series containing the lengths of all routes, in kilometers.
    route_lengths = routes.apply(calc_dist, axis=1)
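Applying calc_dist row by row performs two DataFrame lookups per route, which can be slow on a large routes table. A possible alternative, sketched here under assumptions, is to join the coordinates onto the routes once and compute all distances with numpy. The toy data, the helper name `haversine_km`, and the `src_`/`dst_` prefixes are illustrative and not part of the original article; the sketch also assumes every coordinate parses as a float.

```python
import numpy
import pandas

# Toy stand-ins for the airports and routes DataFrames (values are made up).
airports = pandas.DataFrame({
    "id": ["1", "2"],
    "latitude": ["0.0", "0.0"],
    "longitude": ["0.0", "90.0"],
})
routes = pandas.DataFrame({"source_id": ["1", "2", "1"], "dest_id": ["2", "1", "\\N"]})

# Numeric coordinates indexed by airport id.
coords = airports.set_index("id")[["latitude", "longitude"]].astype(float)

# join() is a left join here: every route is kept, unmatched ids get NaN coordinates.
joined = (
    routes
    .join(coords.add_prefix("src_"), on="source_id")
    .join(coords.add_prefix("dst_"), on="dest_id")
)

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6367):
    """Vectorized haversine distance for coordinates given in degrees."""
    lat1, lon1, lat2, lon2 = (numpy.radians(v) for v in (lat1, lon1, lat2, lon2))
    a = (numpy.sin((lat2 - lat1) / 2) ** 2
         + numpy.cos(lat1) * numpy.cos(lat2) * numpy.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * numpy.arcsin(numpy.sqrt(a))

# Routes without a matching airport get 0, mirroring the fallback in calc_dist.
route_lengths = haversine_km(
    joined["src_latitude"], joined["src_longitude"],
    joined["dst_latitude"], joined["dst_longitude"],
).fillna(0)
```

The result is a Series aligned with the rows of routes, so it could be used in the same way as the output of routes.apply.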
Now that we have a Series of route lengths, we can create a histogram, which bins the values into ranges and counts how many routes fall into each range:
    import matplotlib.pyplot as plt
    %matplotlib inline

    plt.hist(route_lengths, bins=20)
We import the matplotlib plotting functions with import matplotlib.pyplot as plt. Then %matplotlib inline sets matplotlib up to display plots inside the IPython notebook, and finally plt.hist(route_lengths, bins=20) draws the histogram. As we can see, airlines fly more short routes than long ones.
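To make concrete what a histogram computes before anything is drawn, here is a minimal pure-Python sketch of equal-width binning. The function name `histogram_counts` and the sample lengths are made up for illustration, and the sketch assumes the values are not all identical; it is not how matplotlib implements plt.hist internally.

```python
def histogram_counts(values, bins):
    """Count how many values fall into each of `bins` equal-width ranges."""
    low, high = min(values), max(values)
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value belongs to the last bin rather than to a new one.
        index = min(int((v - low) / width), bins - 1)
        counts[index] += 1
    return counts

# Made-up route lengths in kilometers.
sample_lengths = [120, 450, 800, 950, 1500, 3200, 9800]
counts = histogram_counts(sample_lengths, bins=4)
print(counts)
```

Each entry of `counts` corresponds to the height of one bar in the plotted histogram.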
Use seaborn
We can make a similar plot with seaborn, a higher-level plotting library for Python. Seaborn is built on top of matplotlib and specializes in certain types of plots, often related to simple statistics. We can use the distplot function to draw a histogram with a kernel density estimate on top of it. A kernel density estimate is essentially a curve that is smoother than the histogram and makes patterns in the data easier to see.
    import seaborn

    seaborn.distplot(route_lengths, bins=20)
As you can see, seaborn also has nicer default styles than matplotlib. Seaborn does not have its own version of every matplotlib chart, but it is a good tool for quick plotting, and its charts can convey the meaning of the data better than matplotlib's defaults. It is also a good library if you want to do more in-depth statistical work.
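The idea behind a kernel density estimate can be sketched in a few lines of plain Python: every sample contributes a small bell curve, and the curves are summed and normalized. The function name `gaussian_kde`, the sample values, and the hand-picked bandwidth are assumptions for this illustration; seaborn chooses a bandwidth automatically and its implementation differs. Note also that recent seaborn releases deprecate distplot in favor of histplot and displot.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density of `samples` with Gaussian kernels."""
    n = len(samples)
    norm = 1 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        # Each sample contributes one bell curve centered on itself.
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density

# Made-up route lengths (km) and a hand-picked bandwidth.
density = gaussian_kde([300, 500, 650, 2400, 5200], bandwidth=400)

# Riemann sum over a wide grid: the area under the curve should be close to 1.
step = 10
area = sum(density(x) * step for x in range(-3000, 9000, step))
```

A larger bandwidth gives a smoother curve, while a smaller one follows the individual samples more closely.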
Bar chart
Histograms are great, but sometimes we want to see the average route length per airline. For this we can use a bar chart: each airline gets its own bar, showing the average length of its routes. This lets us see which carriers are regional and which are international. We can use pandas, a Python data analysis library, to compute the average route length for each airline.
    import numpy

    # Put relevant columns into a dataframe.
    route_length_df = pandas.DataFrame({"length": route_lengths, "id": routes["airline_id"]})
    # Compute the mean route length per airline.
    airline_route_lengths = route_length_df.groupby("id").aggregate(numpy.mean)
    # Sort by length so we can make a better chart.
    # sort_values replaces DataFrame.sort, which current pandas versions no longer provide.
    airline_route_lengths = airline_route_lengths.sort_values("length", ascending=False)
We first build a new DataFrame from the route lengths and airline ids. We split route_length_df into groups by airline id, which gives one group of rows per airline. Then we call the pandas aggregate function to take the mean of the length column within each group, and the results are recombined into a new DataFrame. Finally, the DataFrame is sorted so that the airlines with the longest average routes come first.
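The split, aggregate, and sort steps described above can be tried on a toy table. The ids and lengths below are invented for the example and the names `toy` and `toy_means` are hypothetical; this is a sketch of the pattern, not output from the real data.

```python
import pandas

# Made-up route lengths (km) for two hypothetical airline ids.
toy = pandas.DataFrame({
    "id": ["10", "10", "20", "20", "20"],
    "length": [500.0, 700.0, 4000.0, 5000.0, 6000.0],
})

# Split by airline id, average each group's lengths, then sort longest first.
toy_means = toy.groupby("id").aggregate("mean").sort_values("length", ascending=False)
print(toy_means)
```

After grouping, the airline ids become the index of the result, which is why a later step in the article copies the index back into a regular column.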
We can then plot the results with matplotlib.
    plt.bar(range(airline_route_lengths.shape[0]), airline_route_lengths["length"])
The plt.bar method draws one bar per airline, with its height given by that airline's average route length (airline_route_lengths["length"]).
The problem is that we cannot easily tell which airline has which route length. To solve this we need to be able to read the axis labels, which is hard with so many airlines. One way to make it easier is to make the chart interactive, so that we can zoom in and out to read the labels. The bokeh library does this: it makes it easy to build interactive, zoomable charts.
To use bokeh, we first need to preprocess the data:
    def lookup_name(row):
        try:
            # Match the row id to the id in the airlines dataframe so we can get the name.
            name = airlines["name"][airlines["id"] == row["id"]].iloc[0]
        except (ValueError, IndexError):
            name = ""
        return name

    # Add the index (the airline ids) as a column.
    airline_route_lengths["id"] = airline_route_lengths.index.copy()
    # Find all the airline names.
    airline_route_lengths["name"] = airline_route_lengths.apply(lookup_name, axis=1)
    # Remove duplicate values in the index.
    airline_route_lengths.index = range(airline_route_lengths.shape[0])
The code above looks up the airline name for each row of airline_route_lengths and stores it in a new name column. We also add the airline ids as an id column so that the lookup can use them (the apply function does not pass the index).
Finally, we reset the index so that all of its values are unique. Without this step, bokeh does not work properly.
Now we can move on to the chart itself:
    import numpy as np
    from bokeh.io import output_notebook
    from bokeh.charts import Bar, show

    output_notebook()
    p = Bar(airline_route_lengths, 'name', values='length', title="Average airline route lengths")
    show(p)
output_notebook sets bokeh up to display plots inside the IPython notebook. We then build a bar chart from the DataFrame and the chosen columns. Finally, the show function displays the chart. Note that bokeh.charts is only available in older Bokeh releases.
This chart is not really an image; it is a JavaScript widget. What is shown here is therefore a screenshot, not the actual chart.
With the live chart, we can zoom in to see which airline flies the longest routes. In the screenshot the bars look crowded together, but zoomed in they are much easier to read.
Horizontal bar chart
Pygal is a charting library that can quickly produce attractive charts. We can use it to break routes down by length. First we divide the routes into short, medium, and long, and calculate what fraction of route_lengths falls into each category.
    long_routes = len([k for k in route_lengths if k > 10000]) / len(route_lengths)
    medium_routes = len([k for k in route_lengths if k < 10000 and k > 2000]) / len(route_lengths)
    short_routes = len([k for k in route_lengths if k < 2000]) / len(route_lengths)
Then we can plot each one as a bar in a Pygal horizontal bar chart:
    import pygal
    from IPython.display import SVG

    chart = pygal.HorizontalBar()
    chart.title = 'Long, medium, and short routes'
    chart.add('Long', long_routes * 100)
    chart.add('Medium', medium_routes * 100)
    chart.add('Short', short_routes * 100)
    chart.render_to_file('routes.svg')
    SVG(filename='routes.svg')
First, we create an empty chart. Then we add elements, including the title and the bars. Each bar shows the percentage of routes in that category (the maximum value is 100).
Finally, we render the chart to a file, then load and display the file with IPython's SVG function. This chart looks much nicer than the default matplotlib charts, but we had to write more code to make it. Pygal may therefore be better suited to small, presentation-quality charts.
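One detail of the category calculation is worth noting: the comparisons in the article's code are all strict, so a route of exactly 2000 km or 10000 km is counted in no category. The sketch below shows one possible way to assign the boundary values explicitly so that the shares always sum to 100. The function name `route_shares`, its parameters, and the sample lengths are assumptions made for this example.

```python
def route_shares(lengths, short_max=2000, long_min=10000):
    """Return the percentage of short, medium, and long routes.

    Boundary values are assigned explicitly: exactly `short_max` counts as
    medium and exactly `long_min` counts as long, so the shares sum to 100.
    """
    counts = {"Short": 0, "Medium": 0, "Long": 0}
    for km in lengths:
        if km < short_max:
            counts["Short"] += 1
        elif km < long_min:
            counts["Medium"] += 1
        else:
            counts["Long"] += 1
    return {label: 100 * n / len(lengths) for label, n in counts.items()}

# Made-up route lengths in kilometers, including both boundary values.
shares = route_shares([150, 2000, 3500, 10000, 12000])
print(shares)
```

The resulting dictionary could be passed to the chart label by label, in the same way the article adds the three bars.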
Scatter plot
Scatter plots let us compare columns of data against each other. We can make a simple scatter plot that compares airline id numbers with the lengths of airline names:
    name_lengths = airlines["name"].apply(lambda x: len(str(x)))
    plt.scatter(airlines["id"].astype(int), name_lengths)
First, we use the pandas apply method to calculate the length of each name, that is, the number of characters in each airline's name. Then we use matplotlib to make a scatter plot of airline id against name length. When plotting, we convert the id column of airlines to an integer type; otherwise the plot will not work, because the x-axis needs numeric values. We can see that quite a few long names appear among the earlier ids. This may mean that airlines founded earlier tend to have longer names.
We can check this hunch with seaborn. Seaborn has an augmented version of the scatter plot, the joint plot, which shows how the two variables are correlated along with the individual distribution of each.
    data = pandas.DataFrame({"lengths": name_lengths, "ids": airlines["id"].astype(int)})
    seaborn.jointplot(x="ids", y="lengths", data=data)
The resulting plot shows that the correlation between the two variables is weak: r squared is low.
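For readers who want to see what r squared measures, here is a minimal pure-Python sketch of the Pearson correlation coefficient and its square. The function name `pearson_r` and the sample numbers are invented for the example; they are not values from the airlines data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up ids and name lengths.
ids = [1, 2, 3, 4, 5, 6]
name_lengths = [22, 9, 18, 11, 20, 10]

r = pearson_r(ids, name_lengths)
r_squared = r ** 2
```

An r squared near 0 means that one variable explains little of the variation in the other, while a value near 1 indicates an almost perfectly linear relationship.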
Static maps
Our data is naturally suited to mapping: every airport has a latitude and longitude, and so do the source and destination airports of every route.
The first map shows all the airports in the world. We can draw it with basemap, an extension to matplotlib that makes it possible to draw world maps and add points to them, and that is easy to customize.
    # Import the basemap package
    from mpl_toolkits.basemap import Basemap

    # Create a map on which to draw. We're using a mercator projection, and showing the whole world.
    m = Basemap(projection='merc', llcrnrlat=-80, urcrnrlat=80, llcrnrlon=-180, urcrnrlon=180, lat_ts=20, resolution='c')
    # Draw coastlines, and the edges of the map.
    m.drawcoastlines()
    m.drawmapboundary()
    # Convert latitude and longitude to x and y coordinates
    x, y = m(list(airports["longitude"].astype(float)), list(airports["latitude"].astype(float)))
    # Use matplotlib to draw the points onto the map.
    m.scatter(x, y, 1, marker='o', color='red')
    # Show the plot.
    plt.show()
The code above first draws a world map using a Mercator projection, which is a way of projecting the surface of the whole world onto a two-dimensional plane. It then draws the airports on the map as red dots.
The problem with this map is that it is hard to see where each airport is: in areas with many airports, the dots merge into a red blob.
Just as with bokeh, there is an interactive mapping library, folium, that lets us zoom in on the map to find individual airports.
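As an illustration of what a Mercator projection does, the sketch below implements the textbook spherical form of the mapping. The function name `mercator_xy` and the unit radius are assumptions for this example; Basemap's own implementation, in particular its handling of lat_ts, scales the result differently.

```python
import math

def mercator_xy(lon_deg, lat_deg, radius=1.0):
    """Project longitude/latitude in degrees onto a plane (spherical Mercator)."""
    lon, lat = math.radians(lon_deg), math.radians(lat_deg)
    x = radius * lon
    # y grows without bound toward the poles, so maps are cut off at some latitude.
    y = radius * math.log(math.tan(math.pi / 4 + lat / 2))
    return x, y

equator = mercator_xy(0, 0)
north = mercator_xy(10, 60)
south = mercator_xy(10, -60)
```

The unbounded growth of y near the poles is one reason the map in the article is limited to latitudes between -80 and 80.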
    import folium

    # Get a basic world map.
    airports_map = folium.Map(location=[30, 0], zoom_start=2)
    # Draw markers on the map.
    # circle_marker and create_map come from the folium API of the time;
    # newer releases use folium.CircleMarker(...).add_to(...) and Map.save(...).
    for name, row in airports.iterrows():
        # For some reason, this one airport causes issues with the map.
        if row["name"] != "South Pole Station":
            airports_map.circle_marker(location=[row["latitude"], row["longitude"]], popup=row["name"])
    # Create and show the map.
    airports_map.create_map('airports.html')
    airports_map
Folium uses leaflet.js to make fully interactive maps. You can click on each airport to see its name in a popup. Only a screenshot can be shown here, but the actual map is much more impressive. Folium also offers many options for customizing the markers or adding more things to the map.
Draw great circles
It would be cool to see all the air routes on a map, and fortunately we can use basemap to do this. We will draw great circles connecting source and destination airports; each arc shows the route of one flight. Unfortunately, there are so many routes that showing them all would be a mess. Instead, we will show only the first 3000 routes.
    # Make a base map with a mercator projection. Draw the coastlines.
    m = Basemap(projection='merc', llcrnrlat=-80, urcrnrlat=80, llcrnrlon=-180, urcrnrlon=180, lat_ts=20, resolution='c')
    m.drawcoastlines()

    # Iterate through the first 3000 rows.
    for name, row in routes[:3000].iterrows():
        try:
            # Get the source and dest airports.
            source = airports[airports["id"] == row["source_id"]].iloc[0]
            dest = airports[airports["id"] == row["dest_id"]].iloc[0]
            # Don't draw overly long routes.
            if abs(float(source["longitude"]) - float(dest["longitude"])) < 90:
                # Draw a great circle between source and dest airports.
                m.drawgreatcircle(float(source["longitude"]), float(source["latitude"]), float(dest["longitude"]), float(dest["latitude"]), linewidth=1, color='b')
        except (ValueError, IndexError):
            pass

    # Show the map.
    plt.show()
The code above draws a map and then draws the routes on top of it. We add a filter to leave out overly long routes, which would obscure the others.
Draw a network diagram
The final exploration is to draw a network diagram of airports. Each airport is a node in the network, and an edge is drawn between two nodes if there is a route between the airports. If there are multiple routes, they add to the weight of the edge, so that more strongly connected airports stand out. We will use the networkx library to do this.
First, we calculate the weights of the edges between airports.
    # Initialize the weights dictionary.
    weights = {}
    # Keep track of keys that have been added once; we only want edges with a weight of more than 1 to keep our network size manageable.
    added_keys = []
    # Iterate through each route.
    for name, row in routes.iterrows():
        # Extract the source and dest airport ids.
        source = row["source_id"]
        dest = row["dest_id"]

        # Create a key for the weights dictionary.
        # This corresponds to one edge, and has the start and end of the route.
        key = "{0}_{1}".format(source, dest)
        # If the key is already in weights, increment the weight.
        if key in weights:
            weights[key] += 1
        # If the key is in added keys, initialize the key in the weights dictionary, with a weight of 2.
        elif key in added_keys:
            weights[key] = 2
        # If the key isn't in added_keys yet, append it.
        # This ensures that we aren't adding edges with a weight of 1.
        else:
            added_keys.append(key)
Once this code has run, the weights dictionary contains every edge between two airports with a weight of 2 or more. So only pairs of airports connected by two or more routes will be shown.
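The counting step above can also be expressed with collections.Counter from the standard library, which avoids repeated membership tests on a growing list. The sketch below uses made-up (source_id, dest_id) pairs in place of the rows of routes; the name `pairs` and its values are assumptions for the example.

```python
from collections import Counter

# Made-up (source_id, dest_id) pairs standing in for rows of the routes data.
pairs = [("1", "2"), ("1", "2"), ("2", "3"), ("1", "2"), ("3", "4"), ("3", "4")]

# Count every source/destination pair, using the same "source_dest" key format.
counts = Counter("{0}_{1}".format(source, dest) for source, dest in pairs)

# Keep only edges that occur at least twice.
weights = {key: n for key, n in counts.items() if n >= 2}
print(weights)
```

The resulting dictionary has the same shape as the weights dictionary built by the loop, so the graph-building code that follows could consume it unchanged.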
    # Import networkx and initialize the graph.
    import networkx as nx

    graph = nx.Graph()
    # Keep track of added nodes in this set so we don't add twice.
    nodes = set()
    # Iterate through each edge.
    for k, weight in weights.items():
        try:
            # Split the source and dest ids and convert to integers.
            source, dest = k.split("_")
            source, dest = [int(source), int(dest)]
            # Add the source if it isn't in the nodes.
            if source not in nodes:
                graph.add_node(source)
            # Add the dest if it isn't in the nodes.
            if dest not in nodes:
                graph.add_node(dest)
            # Add both source and dest to the nodes set.
            # Sets don't allow duplicates.
            nodes.add(source)
            nodes.add(dest)

            # Add the edge to the graph.
            graph.add_edge(source, dest, weight=weight)
        except (ValueError, IndexError):
            pass

    pos = nx.spring_layout(graph)
    # Draw the nodes and edges.
    nx.draw_networkx_nodes(graph, pos, node_color='red', node_size=10, alpha=0.8)
    nx.draw_networkx_edges(graph, pos, width=1.0, alpha=1)
    # Show the plot.
    plt.show()