How to use Python to analyze the correlation of data 04/27 Update SLTechnology News&Howtos

2025-04-27 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/02 Report--

In data analysis, the data we work with is often multi-dimensional. Such data is harder to analyze because we have to consider the relationships between dimensions. These relationships need to be measured in some way, and correlation analysis is one such method. This article uses Python to explain correlation analysis.

Before the correlation analysis, we need to introduce three concepts: the dimension, the covariance, and the correlation coefficient. First, the dimension. Take Figure 1 as an example. It is an employee information table with n employees (employee 1, employee 2, ..., employee n), and each employee has five attributes: height, weight, age, length of service and education. The information for one employee is an observation, also called a sample. Each attribute of an employee is called an index, also known as a variable, dimension or attribute. So this table has n observations and five dimensions.

Figure 1. Employee information table

The covariance is defined as Cov(X, Y) = E{[X - E(X)][Y - E(Y)]}, that is, the expectation of the product of each variable's deviation from its own expectation. For a finite sample, the expectation is usually taken to be the mean. For example, in Figure 1, let height be X and weight be Y. Then E(X) is the mean height and E(Y) is the mean weight, and the covariance is computed from the differences between each height and E(X) and between each weight and E(Y). The correlation coefficient is ρ_XY = Cov(X, Y) / [σ(X) σ(Y)], where σ(X) and σ(Y) are the standard deviations of X and Y. In other words, the correlation coefficient is the covariance of the two variables divided by the product of their standard deviations. Similarly, if each observation has p dimensions and we compute the covariance between every pair of dimensions (including each dimension with itself), we get a p x p matrix in which each entry is the covariance of the corresponding two dimensions. This is called the covariance matrix. The correlation matrix is obtained from the covariance matrix by dividing each entry by the product of the two corresponding standard deviations, as above.
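The definitions above can be checked by hand. The following is a minimal sketch, assuming NumPy is installed. The height and weight values are made up for illustration, and the helper names covariance and correlation are hypothetical, not part of any library.

```python
import numpy as np

# Hypothetical height (cm) and weight (kg) values, made up for illustration.
height = np.array([160.0, 165.0, 170.0, 175.0, 180.0])
weight = np.array([52.0, 58.0, 63.0, 70.0, 77.0])


def covariance(x, y):
    """Cov(X, Y) = E{[X - E(X)][Y - E(Y)]}, with the mean standing in for E."""
    return np.mean((x - x.mean()) * (y - y.mean()))


def correlation(x, y):
    """rho_XY = Cov(X, Y) / [sigma(X) * sigma(Y)]."""
    return covariance(x, y) / (x.std() * y.std())


cov_xy = covariance(height, weight)
rho_xy = correlation(height, weight)

# Covariance matrix for p dimensions: entry (i, j) is Cov(dimension i, dimension j).
data = np.column_stack([height, weight])
# bias=True divides by n, matching the np.mean used in covariance() above.
cov_matrix = np.cov(data, rowvar=False, bias=True)

# Correlation matrix: divide each covariance by the product of the two standard deviations.
sigma = np.sqrt(np.diag(cov_matrix))
corr_matrix = cov_matrix / np.outer(sigma, sigma)

print(cov_xy, rho_xy)
print(corr_matrix)
```

The off-diagonal entry of corr_matrix should match rho_xy, and both should match what np.corrcoef(height, weight) reports.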

Let's walk through correlation analysis with Python code.

First, the dataset. The data used in this article comes from the plotting library seaborn: the well-known Iris dataset. Getting it is simple; just run the code below. One caveat: some people get an error from load_dataset and cannot read the data because the iris dataset cannot be found. This may be caused by the seaborn version, or by a failed download, since load_dataset fetches the file online when it is not already cached. If this happens, you can download the data yourself from seaborn's data repository on GitHub, https://github.com/mwaskom/seaborn-data, and extract it into the seaborn-data folder. The location of this folder varies between seaborn versions; sns.get_data_home() returns the path in use.

    import seaborn as sns

    data = sns.load_dataset('iris')
    df = data.iloc[:, :4]  # take the first four columns of data
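If load_dataset keeps failing, one alternative is to read a manually downloaded CSV with pandas. The sketch below assumes pandas is installed. The helper name read_local_iris is hypothetical, and the in-memory text stands in for a downloaded iris.csv so the example runs on its own; it contains the header and only four rows of the dataset.

```python
import io

import pandas as pd


def read_local_iris(source):
    """Read an iris CSV (a path or file-like object) and keep the first four columns."""
    data = pd.read_csv(source)
    return data.iloc[:, :4]


# Stand-in for a downloaded iris.csv: the header plus four rows of the dataset.
sample_csv = io.StringIO(
    "sepal_length,sepal_width,petal_length,petal_width,species\n"
    "5.1,3.5,1.4,0.2,setosa\n"
    "4.9,3.0,1.4,0.2,setosa\n"
    "7.0,3.2,4.7,1.4,versicolor\n"
    "6.3,3.3,6.0,2.5,virginica\n"
)

df_local = read_local_iris(sample_csv)
print(df_local.shape)  # (4, 4)
```

With a real download, the call would be read_local_iris('iris.csv'), where the path is wherever you saved the file.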

The dataset has 150 rows and 5 columns, and we use only the first four columns. A sample of the dataset is shown in Figure 2.

Figure 2. Sample dataset

Next, let's do correlation analysis.

Let's start with a relatively simple analysis: the correlation between the first and third columns of the dataset, that is, between sepal_length and petal_length. We can use NumPy, SciPy or pandas for this. NumPy comes first.

    import numpy as np

    X = df['sepal_length']
    Y = df['petal_length']
    result1 = np.corrcoef(X, Y)

result1 is a two-dimensional matrix, as shown in Figure 3.

Figure 3. Result1 calculation results

The values on the principal diagonal of the matrix (the diagonal from the upper left corner to the lower right corner) are 1, because each of them is the correlation of a variable with itself, and a variable is always perfectly correlated with itself. The other two values in Figure 3 are the correlation between X and Y. They are equal because they represent ρ_XY and ρ_YX, and the correlation coefficient is symmetric. In the same way, we can find the relationships among all four dimensions in df. The code is as follows, where rowvar=False tells NumPy that each column, rather than each row, is a dimension.

    result2 = np.corrcoef(df, rowvar=False)

The result is shown in figure 4.

Figure 4. Result2 calculation results

Figure 4 is a 4x4 matrix with 16 values in total, giving the correlation of each dimension with every dimension (including itself). The principal diagonal is 1, and the other values are symmetric about the principal diagonal.
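The effect of rowvar and the properties of the resulting matrix can be verified directly. This is a small sketch, assuming NumPy is installed. The random array samples is a stand-in with the same shape as df (150 observations, 4 dimensions), not the Iris data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for df: 150 observations (rows) of 4 dimensions (columns).
samples = rng.normal(size=(150, 4))

# rowvar=False: each column is a dimension, so the result is 4x4.
by_columns = np.corrcoef(samples, rowvar=False)
# Default rowvar=True: each row is treated as a dimension, so the result is 150x150.
by_rows = np.corrcoef(samples)

print(by_columns.shape)  # (4, 4)
print(by_rows.shape)     # (150, 150)

# The principal diagonal is 1 and the matrix is symmetric about it.
diagonal_is_one = np.allclose(np.diag(by_columns), 1.0)
is_symmetric = np.allclose(by_columns, by_columns.T)
# Transposing the data and keeping the default rowvar gives the same matrix.
same_as_transposed = np.allclose(by_columns, np.corrcoef(samples.T))
print(diagonal_is_one, is_symmetric, same_as_transposed)  # True True True
```

Forgetting rowvar=False on data laid out like df silently produces the much larger observation-by-observation matrix, so checking the shape of the result is a cheap safeguard.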

Next, we use scipy for analysis. The code is as follows.

    import scipy.stats as ss

    result3 = ss.pearsonr(X, Y)

The result is (0.8717537758865831, 1.0386674194498099e-47). (Recent SciPy versions wrap the pair in a result object, but it still unpacks into the same two numbers.) The first number is the correlation coefficient between X and Y, which matches the NumPy result. The second is the p value familiar from statistics: the probability of seeing a correlation at least this strong if X and Y were in fact uncorrelated. The smaller the p value, the stronger the evidence of a linear relationship. Here it is extremely small, so the correlation of about 0.87 is statistically significant. If the correlation coefficient is exactly 1, the p value is 0. pearsonr handles only one pair of variables at a time, so it does not return a correlation matrix.
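Because pearsonr works on one pair at a time, a matrix of coefficients and p values can be assembled by looping over the pairs. This is a sketch, assuming NumPy and SciPy are installed. The helper pearson_matrices and the random columns a, b and c are hypothetical stand-ins, not part of SciPy or of the Iris data.

```python
from itertools import combinations

import numpy as np
from scipy import stats


def pearson_matrices(columns):
    """Build correlation and p-value matrices from a dict of name -> 1-D array."""
    names = list(columns)
    n = len(names)
    r = np.eye(n)          # each variable correlates perfectly with itself
    p = np.zeros((n, n))   # and a correlation of exactly 1 has p value 0
    for i, j in combinations(range(n), 2):
        r_ij, p_ij = stats.pearsonr(columns[names[i]], columns[names[j]])
        r[i, j] = r[j, i] = r_ij
        p[i, j] = p[j, i] = p_ij
    return names, r, p


rng = np.random.default_rng(1)
a = rng.normal(size=50)
b = a + rng.normal(scale=0.5, size=50)  # strongly related to a
c = rng.normal(size=50)                 # generated independently of a and b

names, r, p = pearson_matrices({"a": a, "b": b, "c": c})
print(names)
print(r)
print(p)
```

The r matrix should agree with np.corrcoef on the same data, while the p matrix adds the significance information that NumPy does not provide.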

Finally, there is the pandas method.

Because df is already a pandas DataFrame, we can use it directly. The code is as follows.

    result4 = X.corr(Y)
    result5 = df.corr()

result4 is 0.871753775886583, and result5 is shown in Figure 5. Both agree with the previous results.

Figure 5. Result5 calculation results
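The agreement between the three libraries can be confirmed in code. This is a minimal sketch, assuming NumPy, SciPy and pandas are installed. The DataFrame frame holds random stand-in data, not the Iris columns.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(42)
# Stand-in data: y depends linearly on x, plus noise.
frame = pd.DataFrame({"x": rng.normal(size=100)})
frame["y"] = 2.0 * frame["x"] + rng.normal(size=100)

r_numpy = np.corrcoef(frame["x"], frame["y"])[0, 1]
r_scipy, p_value = stats.pearsonr(frame["x"], frame["y"])
r_pandas = frame["x"].corr(frame["y"])

# All three compute the same Pearson coefficient, up to floating-point rounding.
coefficients_agree = np.isclose(r_numpy, r_scipy) and np.isclose(r_numpy, r_pandas)

# The full matrices from pandas and NumPy agree as well.
matrices_agree = np.allclose(
    frame.corr().to_numpy(), np.corrcoef(frame.to_numpy(), rowvar=False)
)
print(coefficients_agree, matrices_agree)  # True True
```

Tiny differences in the last digits, like the one between the SciPy and pandas outputs above, come from floating-point rounding, which is why the comparison uses np.isclose instead of ==.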

The next step is plotting. Two kinds of graphics are common in correlation analysis: the scatter plot and the heat map. A scatter plot shows the distribution and trend of the individual data points, so an analyst can see the relationship between dimensions more intuitively. Its drawback is that it does not suit large amounts of data: generating the plots becomes slow, and too many panels are hard to read. A heat map describes the relationship between dimensions through numbers or colors. It conveys less information, but it suits large amounts of data better. Let's start with the scatter plot.
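One common way to soften the large-data drawback is to plot a random sample of the rows instead of all of them. The article itself does not use this workaround. The sketch below assumes NumPy and pandas are installed. The helper sample_for_plotting and its default of 1000 rows are hypothetical choices.

```python
import numpy as np
import pandas as pd


def sample_for_plotting(frame, max_rows=1000, seed=0):
    """Return at most max_rows randomly chosen rows; small frames are returned unchanged."""
    if len(frame) <= max_rows:
        return frame
    return frame.sample(n=max_rows, random_state=seed)


rng = np.random.default_rng(0)
# Stand-in for a large dataset: 100,000 observations of 4 dimensions.
big = pd.DataFrame(rng.normal(size=(100_000, 4)), columns=["a", "b", "c", "d"])

small = sample_for_plotting(big)
print(small.shape)  # (1000, 4)
```

The sampled frame can then be passed to the plotting functions shown next in place of the full data. Correlation coefficients themselves are cheap to compute and are better calculated on the full data.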

You can use seaborn or pandas to generate scatter plots. The code for seaborn is as follows.

    sns.pairplot(df)
    sns.pairplot(df, hue='sepal_width')

The first line produces Figure 6, a large picture with 16 subplots. Each subplot shows one dimension plotted against another, and the subplots on the main diagonal are histograms of each dimension's distribution. The second line draws the same grid but colors each data point by its sepal_width value, as shown in Figure 7. The sepal_width column has 23 distinct values, each assigned its own color, which is why the resulting plot is colored.

Figure 6. General correlation diagram drawn by seaborn

Figure 7. Correlation diagram drawn by seaborn based on a column of data
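With 23 distinct values the legend can get crowded. One option, not used in the article, is to group the hue column into a few bins first. The sketch below assumes NumPy and pandas are installed. The random values only mimic the rough range of sepal_width, and the column name width_group and its three labels are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Stand-in for the sepal_width column: 150 random values in roughly the same range.
frame = pd.DataFrame({"sepal_width": rng.uniform(2.0, 4.4, size=150).round(1)})

# Split the numeric range into three equal-width bins with readable labels.
frame["width_group"] = pd.cut(
    frame["sepal_width"], bins=3, labels=["narrow", "medium", "wide"]
)

print(frame["width_group"].value_counts())
```

Passing hue='width_group' to sns.pairplot would then color the points with three categories instead of one color per distinct value.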

The other way to draw the scatter plots is with pandas. The code is as follows.

    import pandas as pd

    pd.plotting.scatter_matrix(df, figsize=(12, 12), range_padding=0.5)

As Figure 8 shows, the picture drawn with pandas is roughly the same as seaborn's, but it is somewhat less customizable and less polished, so seaborn is generally recommended.

Figure 8. Correlation diagram generated by pandas

Finally, there is the heat map. The code is as follows.

    import matplotlib.pyplot as plt

    figure, ax = plt.subplots(figsize=(12, 12))
    sns.heatmap(df.corr(), square=True, annot=True, ax=ax)

The author ran into a small problem when first writing this code, as shown in Figure 9: the cells in the first and last rows are only partly displayed, while the other cells are displayed in full. This is a matplotlib bug that also affects seaborn, because seaborn is built on matplotlib, and upgrading matplotlib fixes it. The author's matplotlib version was 3.1.1 at first; after upgrading to 3.2.2, the bug is gone. The correct figure is shown in Figure 10. In the sns.heatmap call, square=True makes each cell a square, and annot=True writes each cell's value inside the cell.

Figure 9. Heat map generated by the old version of matplotlib

Figure 10. Heat map generated by the new version of matplotlib
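Since the correlation matrix is symmetric, half of the heat map repeats the other half. A common refinement, not used in the article, is to hide the upper triangle with a boolean mask. The sketch below assumes NumPy and pandas are installed and uses random stand-in data. It only builds the mask, without drawing anything.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Stand-in for df: 150 observations of 4 dimensions.
frame = pd.DataFrame(rng.normal(size=(150, 4)), columns=["a", "b", "c", "d"])
corr = frame.corr()

# True marks the cells to hide: everything above the principal diagonal.
# k=1 keeps the diagonal itself visible.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# DataFrame.mask replaces the hidden cells with NaN, which makes the effect easy to inspect.
lower_only = corr.mask(mask)
print(lower_only)
```

The same mask can be passed to seaborn as sns.heatmap(corr, mask=mask, square=True, annot=True), which leaves the masked cells blank.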

From calculation to visualization, this article has introduced several ways to find the correlation between multi-dimensional data with Python. You can choose among them according to your own needs.
