2025-02-22 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/01 Report--
This article introduces how to use Python for AQI analysis and visualization. Many people run into questions about this in day-to-day work, so the editor consulted various sources and put together a simple, easy-to-follow approach. Hopefully it answers those questions. Let's get started.
AQI analysis
1. Background information
AQI (Air Quality Index) measures how clean or polluted the air is. A smaller value means better air quality. In recent years, people have paid more and more attention to air quality because of environmental problems. Here we use data analysis techniques to study urban air quality across the country, hoping to answer the following questions:
Which cities have better / worse air quality?
Does the air quality have a certain regularity in the geographical distribution?
Is the air quality of the city related to whether it is near the sea?
What are the main factors affecting air quality?
What is the general level of urban air quality in the country?
We have a 2015 air quality index data set. It contains the air quality index of major cities across the country together with related data for each city. The columns are:
City: city name
AQI: air quality index
Precipitation: precipitation
GDP: urban gross domestic product
Longitude: longitude
Latitude: latitude
Altitude: altitude
PopulationDensity: population density
Temperature: temperature
Coastal: whether the city is near the sea
Incineration (10000ton): incineration volume, in units of 10,000 tons
Green Coverage Rate: green coverage rate
2. Data analysis process
Before conducting data analysis, we need to be clear about the basic process of data analysis.
3. Read data
Import the required libraries and initialize some settings.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

sns.set()
# Use a font with Chinese glyphs so that Chinese labels display correctly.
plt.rcParams["font.family"] = "SimHei"
# Keep the minus sign rendering correctly with a non-default font.
plt.rcParams["axes.unicode_minus"] = False
# Suppress warning messages to keep the output readable.
warnings.filterwarnings("ignore")
Load dataset
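The loading code itself is not shown here. A minimal sketch, assuming the data is stored as a CSV file, might look like the following. The file name and the sample rows are hypothetical; an in-memory string stands in for the real file so that the snippet runs on its own.

```python
import io

import pandas as pd

# Stand-in for the real file: a few made-up rows with some of the columns
# described above. With the actual data you would pass a path instead,
# e.g. pd.read_csv("data.csv") (the file name is hypothetical).
csv_text = """City,AQI,Precipitation,GDP,Coastal
A,45,1200.5,2500.0,Yes
B,130,500.0,8000.0,No
C,80,,1500.0,No
"""

data = pd.read_csv(io.StringIO(csv_text))
print(data.shape)   # (number of rows, number of columns)
print(data.head())  # first few records
```

An empty field in the CSV, such as the Precipitation of city C, is read as a missing value (NaN).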
4. Data cleaning
4.1 Missing values
Missing values can be handled in the following ways:
Delete the missing values
  Only suitable when very few values are missing.
Fill in the missing values
  Numerical variables
    Mean filling
    Median filling
  Categorical variables
    Mode filling
    Treat as a separate category
Other
First use info() or isnull() to see the missing values.
Then use skew() to view the skewness, and plot the distribution to take a look. Note that distplot() does not support data containing null values, so you must first use dropna() to drop them.
As you can see, the original data is slightly right-skewed. There are only 4 missing values, few enough that the rows could simply be deleted, but here we fill them with the median instead.
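The steps above can be sketched as follows. The column name Precipitation and the values are assumptions made for illustration; the text does not say which column holds the missing values.

```python
import numpy as np
import pandas as pd

# Made-up right-skewed column with 4 missing entries (illustrative only).
data = pd.DataFrame({
    "Precipitation": [300.0, 450.0, 500.0, np.nan, 620.0, 700.0,
                      np.nan, 800.0, 950.0, np.nan, 1500.0, np.nan, 2600.0],
})

missing_per_column = data.isnull().sum()      # count of missing values per column
skew_before = data["Precipitation"].skew()    # positive means right-skewed
print(missing_per_column)
print(skew_before)

# Fill the gaps with the median, which is less sensitive to the long right
# tail than the mean would be.
median = data["Precipitation"].median()
data["Precipitation"] = data["Precipitation"].fillna(median)
```

For a right-skewed column the mean is pulled toward the large values, which is one reason to prefer the median here.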
4.2 Outliers
How do we find outliers? There are several ways:
describe()
Box plot
3σ rule
Other anomaly detection algorithms
describe():
Calling the describe() method of a DataFrame object displays summary statistics and gives a first overview of the data.
It can be seen that for GDP, Latitude and PopulationDensity the gap between the maximum and the upper quartile is extremely large. These columns are right-skewed, that is, they contain many extremely large outliers.
3σ
3σ means three times the standard deviation. By the properties of the normal distribution, about 99.7% of values fall within 3σ of the mean, so values outside that range can be regarded as outliers. Taking GDP as an example, plot its distribution to check the skewness:
The data is severely right-skewed, that is, it contains many very large outliers. The 3σ method picks them out:
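A minimal sketch of the 3σ rule, using made-up GDP-like values rather than the real data:

```python
import pandas as pd

# Made-up values: 30 ordinary cities plus one extreme one (illustrative only).
gdp = pd.Series([1000.0 + 50.0 * i for i in range(30)] + [30000.0], name="GDP")

mean, std = gdp.mean(), gdp.std()
lower, upper = mean - 3 * std, mean + 3 * std

# Anything outside [mean - 3*std, mean + 3*std] is flagged as an outlier.
outliers = gdp[(gdp < lower) | (gdp > upper)]
print(outliers)
```

Note that extreme values inflate the standard deviation themselves, so on heavily skewed data the rule can miss moderate outliers.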
Box plot
The box plot shows that there are many very large outliers. How are they identified?
The box plot rule for outliers:
Q1, Q2 and Q3 denote the first, second and third quartiles, and IQR = Q3 - Q1.
A value smaller than Q1 - 1.5 × IQR or larger than Q3 + 1.5 × IQR is an outlier.
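The rule above can be written out directly. This is a sketch on made-up values; quantile() uses linear interpolation by default.

```python
import pandas as pd

# Made-up values with one obvious extreme (illustrative only).
values = pd.Series([10, 12, 13, 14, 15, 16, 17, 18, 20, 60], name="GDP")

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Values below the lower fence or above the upper fence are outliers.
outliers = values[(values < lower) | (values > upper)]
print(lower, upper)
print(outliers)
```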
Outliers are usually handled in one of the following ways:
Delete the outliers (not commonly used)
Treat them as missing values
Logarithmic transformation (for right-skewed data, when modeling)
Replace them with a boundary value
Discretize by binning (split into intervals and map to discrete values)
Take the logarithmic transformation as an example.
The logarithmic transformation suits data with very large outliers, that is, right-skewed distributions, but not left-skewed ones.
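As a small illustration on made-up values, taking the natural logarithm pulls in the long right tail and reduces the skewness:

```python
import numpy as np
import pandas as pd

# Made-up right-skewed values (illustrative only).
gdp = pd.Series([100.0, 200.0, 300.0, 500.0, 800.0, 1500.0, 20000.0], name="GDP")

# np.log needs strictly positive input; np.log1p is a common alternative
# when the column contains zeros.
log_gdp = np.log(gdp)

skew_raw = gdp.skew()
skew_log = log_gdp.skew()
print(skew_raw, skew_log)  # the second value is closer to zero
```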
4.3 Duplicate values
Handling duplicate values is simple. Use duplicated() to find them. The keep parameter takes three values: "first", False and "last". With "first", every occurrence except the first is marked as a duplicate; with False, all occurrences are marked; with "last", every occurrence except the last is marked.
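The effect of the three keep values is easiest to see on a tiny made-up table:

```python
import pandas as pd

# Made-up records; the row ("A", 50) appears three times.
data = pd.DataFrame({
    "City": ["A", "B", "A", "C", "A"],
    "AQI": [50, 80, 50, 120, 50],
})

first_kept = data.duplicated(keep="first")  # marks every copy except the first
last_kept = data.duplicated(keep="last")    # marks every copy except the last
all_marked = data.duplicated(keep=False)    # marks every copy

print(data[all_marked])                     # show all repeated records

# Drop the repeats, keeping the first occurrence of each record.
deduped = data.drop_duplicates(keep="first")
```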
The cleaned data can be exported directly.
5. Data analysis
Air quality sometimes determines whether people go or stay, school choice, employment, settlement, tourism and so on.
First of all, let's look at the best and worst cities.
5.1 The best & worst cities in air quality
The five cities with the best air
Sort by AQI (ascending by default) and take the first five records. The city names on the x-axis are rotated 45° so that they are easy to read.
As the picture above shows, the five cities with the best air quality are: 1. Shaoguan, 2. Nanping City, 3. Meizhou City, 4. Keelung City (Taiwan Province), 5. Sanming City. They are all southern cities.
The five cities with the worst air
As the picture above shows, the five cities with the worst air quality are: 1. Beijing, 2. Chaoyang, 3. Baoding City, 4. Jinzhou City, 5. Jiaozuo City. They are all northern cities.
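Both rankings come from the same sort. A sketch with made-up cities and values (the plotting step is left out):

```python
import pandas as pd

# Made-up cities and AQI values (illustrative only).
data = pd.DataFrame({
    "City": ["A", "B", "C", "D", "E", "F", "G"],
    "AQI": [95, 30, 210, 48, 150, 62, 41],
})

# sort_values() sorts in ascending order by default, so the first five rows
# are the cities with the lowest (best) AQI.
best5 = data.sort_values(by="AQI").head(5)

# Sorting in descending order puts the highest (worst) AQI first.
worst5 = data.sort_values(by="AQI", ascending=False).head(5)

print(best5)
print(worst5)
```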
5.2 Air quality in some cities across the country
5.2.1 Air quality classification
First, define a function that uses a few if statements to map an AQI value to an air quality level.
Then use apply(), which calls our function on each value and returns the function's return values.
As the picture shows, the air quality of major cities in China is mostly level 1 (first-class) or level 2, with a fair share at level 3 and only a few cities at the other levels.
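A sketch of the classification step. The function name, the level labels and the sample data are hypothetical, and the breakpoints (50, 100, 150, 200, 300) are an assumption based on the standard six-level AQI scale used in China; the text does not list the thresholds it used.

```python
import pandas as pd


def value_to_level(aqi):
    """Map an AQI value to a level label (assumed six-level breakpoints)."""
    if aqi <= 50:
        return "Level 1"
    elif aqi <= 100:
        return "Level 2"
    elif aqi <= 150:
        return "Level 3"
    elif aqi <= 200:
        return "Level 4"
    elif aqi <= 300:
        return "Level 5"
    else:
        return "Level 6"


# Made-up AQI values (illustrative only).
data = pd.DataFrame({"AQI": [35, 80, 120, 180, 250, 320, 50]})

# apply() calls value_to_level once per AQI value and collects the results.
data["Level"] = data["AQI"].apply(value_to_level)

# Count how many cities fall into each level.
counts = data["Level"].value_counts()
print(counts)
```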
5.2.2 Distribution of air quality index
Call scatterplot() to draw a scatter plot with the points colored by AQI. The palette parameter sets the color scheme, here running from green to red.
As can be seen from the picture, geographically speaking, the air quality of southern cities is better than that of northern cities, and that of western cities is better than that of eastern cities.
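A sketch of such a plot, assuming longitude on the x-axis and latitude on the y-axis. The coordinates and AQI values are made up, and "RdYlGn_r" is one colormap that runs from green (low) to red (high); the original palette setting is not shown.

```python
import matplotlib

matplotlib.use("Agg")  # off-screen backend so the sketch runs without a display

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up coordinates and AQI values (illustrative only).
data = pd.DataFrame({
    "Longitude": [113.0, 118.0, 116.0, 114.0, 121.0, 104.0],
    "Latitude": [24.0, 26.0, 40.0, 38.0, 25.0, 30.0],
    "AQI": [30, 35, 190, 160, 40, 90],
})

fig, ax = plt.subplots(figsize=(8, 6))

# hue colors each point by its AQI; the reversed RdYlGn colormap maps
# low values to green and high values to red.
sns.scatterplot(data=data, x="Longitude", y="Latitude", hue="AQI",
                palette="RdYlGn_r", ax=ax)
```

In an interactive session, drop the matplotlib.use("Agg") line and call plt.show() to display the figure.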
5.3 Is the air quality of a city related to whether it is near the sea?
Let's first take a look at the number of coastal and inland cities in this data:
Unsurprisingly, there are far more inland cities than coastal cities. Let's take a look at the scatter plot:
The picture suggests that the air quality in coastal cities is better than in inland cities. But we should still rely on the numbers, so we calculate the mean AQI of each group
using the groupby() grouping function:
The coastal mean is about 64 and the inland mean about 79. But that is very little information, so let's draw a box plot and a violin plot to learn more.
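The counting and grouping steps can be sketched as follows. The "Yes"/"No" coding of the Coastal column and the values are assumptions for illustration.

```python
import pandas as pd

# Made-up data; the real Coastal column may be coded differently.
data = pd.DataFrame({
    "Coastal": ["Yes", "Yes", "No", "No", "No", "Yes"],
    "AQI": [50, 60, 90, 80, 100, 70],
})

# Number of cities in each group.
counts = data["Coastal"].value_counts()

# Mean AQI per group: split by Coastal, then average the AQI column.
mean_by_group = data.groupby("Coastal")["AQI"].mean()

print(counts)
print(mean_by_group)
```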
The box plot shows that the AQI quartiles of coastal cities are lower than those of inland cities, so the air quality of coastal cities is better. However, a box plot does not show the density of the distribution well.
A violin plot shows both the box plot information and the density of the distribution.
We can also combine the violin plot with a swarm plot:
inner=None removes the inner markings (the "strings") from the violins.
Can we conclude that the air quality in coastal cities is generally better than that in inland cities?
Not yet. Our data has only a few hundred records. It is a sample and cannot stand in for the whole population; that is the difference between a sample and a population.
So how do we get a reliable conclusion? We need to run a significance test on the sample:
Run a t-test on the two samples to see whether the mean AQI of coastal cities differs significantly from that of inland cities. For a two-sample test, we first need to know whether the two sample variances are equal before carrying out the t-test.
First, import the relevant library, define the variables, and test homogeneity of variance with stats.levene(). It returns two values: the first is the test statistic, which we can ignore here; the second is the p-value, 0.77. We therefore do not reject the null hypothesis and treat the variances as equal (null hypothesis: the two sample variances are equal; alternative hypothesis: they differ), so we can proceed to the next step.
In the t-test, whether the two sample variances are equal affects the result. ttest_ind() performs the two independent samples t-test. The returned p-value is only 0.007, which is very small, so we reject the null hypothesis that the two means are equal.
The negative test statistic shows that the inland mean is greater than the coastal mean. How do we quantify that? The two independent samples t-test provided in stats is a two-sided test (= versus ≠), but what we want now is a greater-than or less-than relation (a one-sided test), so we compute the p-value ourselves with stats.t.sf(), passing the statistic and the degrees of freedom df. Here sf = 1 - cdf, where cdf is the cumulative distribution function and sf is the survival function. The result is 0.99666, which points to the coastal mean being the smaller one.
So far, the test gives us more than 99% confidence that the air quality of coastal cities is generally better than that of inland cities.
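The three steps can be sketched with scipy.stats as follows. The two arrays are made-up AQI samples, so the numbers will not match the 0.77, 0.007 and 0.99666 quoted above, and equal variances are assumed, as in the text.

```python
import numpy as np
from scipy import stats

# Made-up AQI samples (illustrative only).
coastal = np.array([50, 55, 60, 65, 70, 58, 62, 68], dtype=float)
inland = np.array([70, 75, 80, 85, 90, 78, 82, 88, 95, 72], dtype=float)

# Step 1: Levene's test. Null hypothesis: the two variances are equal.
levene_stat, levene_p = stats.levene(coastal, inland)

# Step 2: two-sided t-test for two independent samples, assuming equal
# variances. Null hypothesis: the two means are equal.
result = stats.ttest_ind(coastal, inland, equal_var=True)
t_stat, p_two_sided = result.statistic, result.pvalue

# Step 3: survival function of the t distribution, sf(t) = 1 - cdf(t),
# with n1 + n2 - 2 degrees of freedom for the equal-variance test.
df = len(coastal) + len(inland) - 2
p_upper = stats.t.sf(t_stat, df=df)

print(levene_p, t_stat, p_two_sided, p_upper)
```

With a negative statistic, sf(t) equals 1 - p/2, where p is the two-sided p-value, which is how a two-sided p-value near 0.007 corresponds to a value near 0.9966.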
5.4 What are the main factors that affect air quality?
Does high population density lead to low air quality?
Can a high green coverage rate improve air quality?
First draw a scatter plot matrix with pairplot(), using three columns of data.
In the matrix, the off-diagonal panels are scatter plots of pairs of different variables, while the diagonal panels are histograms of a single variable and only show counts. The correlation between variables cannot be read clearly from this figure, so we calculate the correlation coefficients.
The DataFrame object provides a method for this, which can be called directly as data.corr().
Then visualize the result to present it more clearly:
Result statistics
The results show that the air quality index is most strongly correlated with rainfall (-0.40) and latitude (0.55).
The more rainfall, the better the air quality.
The lower the latitude, the better the air quality.
In addition, some other relationships stand out:
GDP (gross urban product) is positively correlated with Incineration (incineration volume).
Temperature (temperature) is positively correlated with Precipitation (rainfall).
Temperature (temperature) is negatively correlated with Latitude (latitude).
Longitude (longitude) is negatively correlated with Altitude (altitude).
Latitude (latitude) is negatively correlated with Precipitation (rainfall).
Temperature (temperature) is negatively correlated with Altitude (altitude).
Altitude (altitude) is negatively correlated with Precipitation (rainfall) (-0.32).
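The correlation step can be sketched on a small made-up table. Selecting the numeric columns first is a precaution added here, since text columns such as City cannot enter the calculation.

```python
import pandas as pd

# Made-up data in which AQI rises with latitude and falls with
# precipitation (illustrative only).
data = pd.DataFrame({
    "City": list("ABCDEF"),
    "AQI": [40, 55, 70, 90, 120, 150],
    "Latitude": [22.0, 25.0, 28.0, 33.0, 37.0, 40.0],
    "Precipitation": [1800.0, 1500.0, 1300.0, 900.0, 600.0, 500.0],
})

# Pearson correlation matrix of the numeric columns.
corr = data.select_dtypes(include="number").corr()

# Correlation of every other variable with AQI, sorted from most negative
# to most positive.
aqi_corr = corr["AQI"].drop("AQI").sort_values()
print(aqi_corr)
```

Passing corr to seaborn's heatmap() with annot=True is one way to draw the annotated picture described above.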
That concludes this look at how to do AQI analysis and visualization with Python. Hopefully it has answered your questions. Theory sticks best when combined with practice, so go and try it yourself!