How to use Python for correlation analysis

2025-07-15 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/02 Report--

This article explains how to use Python for correlation analysis, a topic many readers may not know well. The main points are summarized below.

1. Is correlation the same thing as causation?

Correlation is not the same as causation. Take two variables, x1 and x2: correlation means that x1 and x2 stand side by side logically, with neither explaining the other, while causation means that x2 happens because of x1 (or x1 because of x2). The two are completely different.

A business-operations example illustrates the difference. During a promotion, goods are usually sold at a lower price to drive higher sales. Higher sales in turn put more pressure on the offline logistics and distribution system, which usually increases the number of damaged goods.

In this case, there is no causal relationship between the low price and the increase in damaged goods; we cannot say that damage increased because the price was low. The real link between the two is the promotion: both the low price and the extra damage result from it.

The real value of correlation is not to answer "why"; it is a way to describe patterns whose underlying causes cannot yet be explained. What correlation tells us is "what": whatever factors drive the result, the observed pattern is that the two variables rise or fall together.

Returning to the example above, correlation analysis tells us that low prices are accompanied by more damaged goods. So when the price is low (usually because of a sales promotion, though product quality, logistics and distribution, or packaging problems are also possible), we can expect the amount of damage to increase as well. What actually causes the increase, however, cannot be determined from correlation.
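The promotion example can be mimicked with made-up numbers. The following is a minimal sketch, assuming a hypothetical daily promotion flag that lowers the price and raises the damage count; every coefficient and variable name here is invented for illustration and does not come from the article's data.

```python
import numpy as np

rng = np.random.default_rng(0)
n_days = 1000

# Hypothetical setup: the promotion flag drives BOTH price and damage.
promotion = rng.integers(0, 2, size=n_days)                   # 1 = promotion day
price = 100 - 20 * promotion + rng.normal(0, 3, n_days)       # promotions lower the price
damaged = 10 + 15 * promotion + rng.normal(0, 3, n_days)      # promotions raise volume, hence damage

# Price never appears in the formula for damage, yet the two are correlated.
r_all = np.corrcoef(price, damaged)[0, 1]

# Holding the promotion fixed (promotion days only) removes the shared driver.
on_promo = promotion == 1
r_promo = np.corrcoef(price[on_promo], damaged[on_promo])[0, 1]

print("all days:", round(r_all, 2))
print("promotion days only:", round(r_promo, 2))
```

In this simulated setting the correlation over all days is strongly negative, while within promotion days it is close to zero, which is what one would expect when a third factor is responsible for both variables.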

2. Does a low correlation coefficient mean the variables are unrelated?

Does a low R (correlation coefficient) mean the variables are unrelated? Not necessarily.

R can be negative, and R = -0.8 indicates a stronger correlation than R = 0.5. A negative sign only means that the two variables move in opposite directions, so the strength of a correlation is judged by the absolute value of R.
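A small simulation may help separate the sign of R from its strength. This is only a sketch on synthetic data; the variable names and noise levels are assumptions chosen so that one relationship is strong and negative and the other weak and positive.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=2000)

# Hypothetical variables: y_neg falls as x rises (little noise),
# y_pos rises with x (a lot of noise).
y_neg = -x + rng.normal(scale=0.75, size=x.size)
y_pos = x + rng.normal(scale=1.7, size=x.size)

r_neg = np.corrcoef(x, y_neg)[0, 1]
r_pos = np.corrcoef(x, y_pos)[0, 1]

# Compare strength by absolute value, not by the signed value.
print("signed:", round(r_neg, 2), round(r_pos, 2))
print("stronger relationship:", "y_neg" if abs(r_neg) > abs(r_pos) else "y_pos")
```

Sorting by the signed value would rank the negative relationship last, whereas sorting by the absolute value ranks it first.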

Even a low absolute value of R does not necessarily mean that the variables are weakly related, because R measures only linear correlation. Besides linear relationships, variables can be related exponentially, polynomially, by a power law, and so on. These non-linear relationships fall outside what R (correlation analysis) measures.
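The point about non-linear relationships can be checked on constructed data. The sketch below uses two assumed examples, a quadratic and an exponential, in which y is completely determined by x and R still falls short of 1.

```python
import numpy as np

# Quadratic relationship on a symmetric range: y depends on x exactly.
x = np.linspace(-3, 3, 201)
y_quadratic = x ** 2
r_quadratic = np.corrcoef(x, y_quadratic)[0, 1]

# Exponential relationship: also exact, but not a straight line.
x_pos = np.linspace(0, 5, 200)
y_exp = np.exp(x_pos)
r_exp = np.corrcoef(x_pos, y_exp)[0, 1]

# Taking the logarithm makes the relationship linear again.
r_exp_log = np.corrcoef(x_pos, np.log(y_exp))[0, 1]

print("quadratic:", round(r_quadratic, 2))
print("exponential:", round(r_exp, 2))
print("exponential after log transform:", round(r_exp_log, 2))
```

For the quadratic case R is essentially zero even though x determines y exactly. For the exponential case, transforming the variable before computing R is one way to bring the relationship back within reach of a linear measure.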

3. Code practice: Python correlation analysis

This example uses NumPy for correlation analysis. The source file data5.txt is located in "attachment-chapter3". The attachment can be downloaded from:

http://www.dataivy.cn/book/python_book_v2.zip

import numpy as np  # import the library
data = np.loadtxt('data5.txt', delimiter='\t')  # read the tab-separated data file
x = data[:, :-1]  # slice out the independent variables (all columns except the last)
correlation_matrix = np.corrcoef(x, rowvar=0)  # correlation analysis; rowvar=0 treats each column as a variable
print(correlation_matrix.round(2))  # print the correlation matrix rounded to 2 decimal places

The implementation process in the example is as follows:

Import the NumPy library.

Use NumPy's loadtxt method to read the data file, whose fields are separated by tabs.

Slice the matrix to extract the independent variables for correlation analysis.

Use NumPy's corrcoef method to run the correlation analysis. The parameter rowvar=0 tells it that each column is a variable and each row is an observation, so the correlations are computed between columns.

Print the correlation matrix, using the round method to keep 2 decimal places. The results are as follows:

[[ 1.   -0.04  0.27 -0.05  0.21 -0.05  0.19 -0.03 -0.02]
 [-0.04  1.   -0.01  0.73 -0.01  0.62  0.    0.48  0.51]
 [ 0.27 -0.01  1.   -0.01  0.72  0.    0.65  0.01  0.02]
 [-0.05  0.73 -0.01  1.    0.01  0.88  0.01  0.7   0.72]
 [ 0.21 -0.01  0.72  0.01  1.    0.02  0.91  0.03  0.03]
 [-0.05  0.62  0.    0.88  0.02  1.    0.03  0.83  0.82]
 [ 0.19  0.    0.65  0.01  0.91  0.03  1.    0.03  0.03]
 [-0.03  0.48  0.01  0.7   0.03  0.83  0.03  1.    0.71]
 [-0.02  0.51  0.02  0.72  0.03  0.82  0.03  0.71  1.  ]]

The rows and columns of the correlation matrix correspond to the variables being compared, numbered as columns 1 to 9 from left to right and from top to bottom. The results show that:

The correlation between column 5 and column 7 is high, and the coefficient is 0.91.

The correlation between column 4 and column 6 is high, and the coefficient is 0.88.

The correlation between column 8 and column 6 is high, and the coefficient is 0.83.
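Rather than scanning the matrix by eye, the strongest pairs can be listed programmatically. The sketch below hard-codes the rounded matrix from the printed result (with the entries below the diagonal taken from their mirror entries above it, since a correlation matrix is symmetric) and ranks the column pairs by absolute value; the variable names are illustrative, not from the original code.

```python
import numpy as np

# Rounded correlation matrix copied from the printed result above.
correlation_matrix = np.array([
    [1.00, -0.04, 0.27, -0.05, 0.21, -0.05, 0.19, -0.03, -0.02],
    [-0.04, 1.00, -0.01, 0.73, -0.01, 0.62, 0.00, 0.48, 0.51],
    [0.27, -0.01, 1.00, -0.01, 0.72, 0.00, 0.65, 0.01, 0.02],
    [-0.05, 0.73, -0.01, 1.00, 0.01, 0.88, 0.01, 0.70, 0.72],
    [0.21, -0.01, 0.72, 0.01, 1.00, 0.02, 0.91, 0.03, 0.03],
    [-0.05, 0.62, 0.00, 0.88, 0.02, 1.00, 0.03, 0.83, 0.82],
    [0.19, 0.00, 0.65, 0.01, 0.91, 0.03, 1.00, 0.03, 0.03],
    [-0.03, 0.48, 0.01, 0.70, 0.03, 0.83, 0.03, 1.00, 0.71],
    [-0.02, 0.51, 0.02, 0.72, 0.03, 0.82, 0.03, 0.71, 1.00],
])

# Indices strictly above the diagonal, so each pair appears once
# and the trivial self-correlations are skipped.
rows, cols = np.triu_indices_from(correlation_matrix, k=1)
strength = np.abs(correlation_matrix[rows, cols])

# Rank pairs from strongest to weakest absolute correlation.
order = np.argsort(strength)[::-1]
top_pairs = [
    (int(rows[i]) + 1, int(cols[i]) + 1, float(correlation_matrix[rows[i], cols[i]]))
    for i in order[:3]
]

for col_a, col_b, r in top_pairs:
    print(f"column {col_a} and column {col_b}: {r}")
```

The column numbers are 1-based to match the text, and the three pairs returned are the same three listed above.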

To present the correlation results more clearly, we can draw them as a heat map with Matplotlib. The code is as follows:

import matplotlib.pyplot as plt  # import the plotting library (np, x and correlation_matrix come from the code above)
fig = plt.figure()  # create a figure object
ax = fig.add_subplot(111)  # add a single subplot (axes) to the figure
hot_img = ax.matshow(np.abs(correlation_matrix), vmin=0, vmax=1)  # draw the heat map with values ranging from 0 to 1
fig.colorbar(hot_img)  # add a color bar for the heat map
ticks = np.arange(0, 9, 1)  # tick positions 0 to 8, step 1
ax.set_xticks(ticks)  # set the x-axis ticks
ax.set_yticks(ticks)  # set the y-axis ticks
names = ['x' + str(i) for i in range(x.shape[1])]  # generate the axis label text
ax.set_xticklabels(names)  # set the x-axis labels
ax.set_yticklabels(names)  # set the y-axis labels
plt.show()  # display the figure

The comments explain what each line of the code does. Two points deserve attention:

Because the strength of a correlation is judged by its absolute value, the absolute value of correlation_matrix is taken before plotting, so the values fall in the range [0, 1].

Since the original data has no column headings, a list comprehension generates the labels x0 to x8 to represent the nine original features.

The result of the display is shown in the figure.

In the image, the brighter (more yellow) the color, the higher the correlation, which is why a yellow diagonal runs from the upper-left corner to the lower-right corner. The other bright cells are columns 5 and 7, 4 and 6, and 8 and 6, which correspond to x4 and x6, x3 and x5, and x7 and x5, respectively.

The key points in this process are understanding the difference between correlation and causation and knowing how to apply correlation. Correlation analysis can be used not only to analyze how different variables move together but also to test for multicollinearity.
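As one possible way to follow up on the multicollinearity remark, the sketch below computes variance inflation factors (VIF) from the correlation matrix. This is a technique swapped in for illustration and is not part of the original example: it relies on the standard result that the VIF of each feature equals the corresponding diagonal entry of the inverse correlation matrix. The four synthetic features are assumptions; with real data, x from the example above would take their place.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500

# Hypothetical features: c is almost a linear combination of a and b,
# d is unrelated to the others.
a = rng.normal(size=n)
b = rng.normal(size=n)
c = a + b + rng.normal(scale=0.1, size=n)
d = rng.normal(size=n)
features = np.column_stack([a, b, c, d])

# Pairwise correlations between columns (each column is a variable).
corr = np.corrcoef(features, rowvar=False)

# VIF of each feature is the matching diagonal entry of the inverse
# correlation matrix; values far above 1 signal multicollinearity.
vif = np.diag(np.linalg.inv(corr))

for name, value in zip(["a", "b", "c", "d"], vif):
    print(name, round(float(value), 1))
```

In this constructed case the pairwise correlations between c and each of a and b are only moderate, yet the VIF values for a, b and c are large, which suggests that the pairwise matrix is a useful first check but may not reveal every form of collinearity.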
