

Three ways to perform data correlation analysis in Python

2025-04-05 Update From: SLTechnology News&Howtos



This article introduces three ways to perform data correlation analysis in Python, with detailed content and clear steps. I hope it helps resolve your doubts.

Correlation realization

Statistics and data science are usually concerned with the relationships between two or more variables (or features) of a dataset. Each data point in the dataset is an observation, and the features are the properties or attributes of those observations.

Here are three main ways to calculate the correlation:

Pearson's r

Spearman's rho

Kendall's tau
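Before reaching for a library, it can help to see what Pearson's r actually computes. The sketch below (with a hypothetical helper `pearson_r`, not part of the original article) applies the textbook definition r = cov(x, y) / (std(x) * std(y)) to the sample data used throughout this article; it should agree with np.corrcoef().

```python
import numpy as np

# Hypothetical helper: Pearson's r straight from its definition,
# r = sum((x - mean(x)) * (y - mean(y))) / sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))
def pearson_r(x, y):
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xm, ym = x - x.mean(), y - y.mean()
    return (xm * ym).sum() / np.sqrt((xm ** 2).sum() * (ym ** 2).sum())

x = np.arange(10, 20)
y = np.array([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])
print(pearson_r(x, y))  # matches np.corrcoef(x, y)[0, 1]
```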

NumPy correlation calculation

np.corrcoef() returns the matrix of Pearson correlation coefficients.

```python
import numpy as np

x = np.arange(10, 20)
x
# array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])
y = np.array([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])
y
# array([ 2,  1,  4,  5,  8, 12, 18, 25, 96, 48])
r = np.corrcoef(x, y)
r
# array([[1.        , 0.75864029],
#        [0.75864029, 1.        ]])
```

SciPy correlation calculation

```python
import numpy as np
import scipy.stats

x = np.arange(10, 20)
y = np.array([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])

scipy.stats.pearsonr(x, y)    # Pearson's r
# (0.7586402890911869, 0.010964341301680832)
scipy.stats.spearmanr(x, y)   # Spearman's rho
# SpearmanrResult(correlation=0.9757575757575757, pvalue=1.4675461874042197e-06)
scipy.stats.kendalltau(x, y)  # Kendall's tau
# KendalltauResult(correlation=0.911111111111111, pvalue=2.9761904761904762e-05)
```

When testing hypotheses, you can use the p-values that these statistical methods return. The p-value is an important measure whose interpretation requires a solid understanding of probability and statistics.
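As an illustration of what that p-value approximates (this sketch is not part of the original article), here is a rough permutation test: shuffle y repeatedly and count how often the shuffled correlation is at least as extreme as the observed one. The estimate should land near the ~0.011 that scipy.stats.pearsonr reports for this data.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(10, 20)
y = np.array([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])

def r(a, b):
    return np.corrcoef(a, b)[0, 1]

observed = r(x, y)
# Count permutations whose |r| is at least as extreme as the observed |r|
count = sum(abs(r(x, rng.permutation(y))) >= abs(observed) for _ in range(10_000))
p_est = count / 10_000
print(p_est)  # roughly in the vicinity of 0.011
```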

```python
scipy.stats.pearsonr(x, y)[0]    # Pearson's r
# 0.7586402890911869
scipy.stats.spearmanr(x, y)[0]   # Spearman's rho
# 0.9757575757575757
scipy.stats.kendalltau(x, y)[0]  # Kendall's tau
# 0.911111111111111
```

Pandas correlation calculation

In Pandas the calculation is comparatively simple.

```python
import pandas as pd

x = pd.Series(range(10, 20))
y = pd.Series([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])

x.corr(y)                     # Pearson's r
# 0.7586402890911867
y.corr(x)
# 0.7586402890911869
x.corr(y, method='spearman')  # Spearman's rho
# 0.9757575757575757
x.corr(y, method='kendall')   # Kendall's tau
# 0.9111111111111111
```

Linear correlation measures how close the mathematical relationship between variables or data set features is to the linear function. If the relationship between the two features is closer to a linear function, then their linear correlation is stronger and the absolute value of the correlation coefficient is higher.
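A quick sketch of that claim, using illustrative data not from the article: a y that is an exact linear function of x gives |r| = 1, and adding noise pulls r below 1.

```python
import numpy as np

x = np.arange(10, 20)
y_linear = 3 * x - 5   # exact linear function of x
# The same line plus some arbitrary noise (illustrative values)
y_noisy = y_linear + np.array([0, 2, -1, 3, 0, -2, 1, 0, -3, 2])

print(np.corrcoef(x, y_linear)[0, 1])  # exactly 1: perfect linear correlation
print(np.corrcoef(x, y_noisy)[0, 1])   # slightly below 1
```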

Linear regression: SciPy implementation

Linear regression is the process of finding a linear function that is as close as possible to the actual relationship between features. In other words, you determine the linear function that best describes the correlation between features, also known as the regression line.

```python
import pandas as pd

x = pd.Series(range(10, 20))
y = pd.Series([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])
```

Use scipy.stats.linregress() to perform linear regression on two arrays of the same length.

```python
import scipy.stats

result = scipy.stats.linregress(x, y)
result
# LinregressResult(slope=7.4363636363636365, intercept=-85.92727272727274,
#                  rvalue=0.7586402890911869, pvalue=0.010964341301680825,
#                  stderr=2.257878767543913)
result.slope      # slope of the regression line
# 7.4363636363636365
result.intercept  # intercept of the regression line
# -85.92727272727274
result.rvalue     # correlation coefficient
# 0.7586402890911869
result.pvalue     # p-value
# 0.010964341301680825
result.stderr     # standard error of the estimated slope
# 2.257878767543913
```

For more on this topic, refer to the linear regression content in the machine learning column.
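One common use of the fitted slope and intercept, shown here as an illustrative sketch (x_new = 25 is an arbitrary value, not from the article), is predicting y for new x values with the regression line.

```python
import numpy as np
import scipy.stats

x = np.arange(10, 20)
y = np.array([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])
result = scipy.stats.linregress(x, y)

# Predict y for a new x value using y = intercept + slope * x
x_new = 25  # arbitrary illustrative value
y_pred = result.intercept + result.slope * x_new
print(round(y_pred, 2))  # about 99.98
```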

Rank correlation

Rank correlation compares the rankings (orderings) of the data associated with two variables or dataset features. If the orderings are similar, the correlation is strong, positive, and high; if they are close to reversed, the correlation is strong, negative, and low. In other words, rank correlation depends only on the order of values, not on the specific values in the dataset.

Figures 1 and 2 show observations where larger x values always correspond to larger y values, which is a perfect positive rank correlation. Figure 3 illustrates the opposite case: a perfect negative rank correlation.
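A useful way to see that rank correlation depends only on order: Spearman's rho equals Pearson's r computed on the ranks. A small sketch with the article's data:

```python
import numpy as np
import scipy.stats

x = np.arange(10, 20)
y = np.array([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])

rho_direct, _ = scipy.stats.spearmanr(x, y)
# Pearson's r of the rank-transformed data gives the same value
rho_from_ranks = np.corrcoef(scipy.stats.rankdata(x),
                             scipy.stats.rankdata(y))[0, 1]
print(rho_direct, rho_from_ranks)  # both 0.9757575757575757
```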

Ranking: SciPy implementation

Use scipy.stats.rankdata () to determine the ranking of each value in the array.

```python
import numpy as np
import scipy.stats

x = np.arange(10, 20)
y = np.array([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])
z = np.array([5, 3, 2, 1, 0, -2, -8, -11, -15, -16])

# Get the rank of each value
scipy.stats.rankdata(x)  # monotonically increasing
# array([ 1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10.])
scipy.stats.rankdata(y)
# array([ 2.,  1.,  3.,  4.,  5.,  6.,  7.,  8., 10.,  9.])
scipy.stats.rankdata(z)  # monotonically decreasing
# array([10.,  9.,  8.,  7.,  6.,  5.,  4.,  3.,  2.,  1.])
```

rankdata() treats nan values as extremely large.

```python
scipy.stats.rankdata([8, np.nan, 0, 2])
# array([3., 4., 1., 2.])
```

Rank correlation: NumPy and SciPy implementation

Use scipy.stats.spearmanr() to calculate the Spearman correlation coefficient.

```python
result = scipy.stats.spearmanr(x, y)
result
# SpearmanrResult(correlation=0.9757575757575757, pvalue=1.4675461874042197e-06)
result.correlation
# 0.9757575757575757
result.pvalue
# 1.4675461874042197e-06
rho, p = scipy.stats.spearmanr(x, y)
rho
# 0.9757575757575757
p
# 1.4675461874042197e-06
```

Rank correlation: Pandas implementation

Use Pandas to calculate the Spearman and Kendall correlation coefficients.

```python
import numpy as np
import pandas as pd

x = np.arange(10, 20)
y = np.array([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])
z = np.array([5, 3, 2, 1, 0, -2, -8, -11, -15, -16])
x, y, z = pd.Series(x), pd.Series(y), pd.Series(z)
xy = pd.DataFrame({'x-values': x, 'y-values': y})
xyz = pd.DataFrame({'x-values': x, 'y-values': y, 'z-values': z})
```

Calculate Spearman's rho with method='spearman'.

```python
x.corr(y, method='spearman')
# 0.9757575757575757
xy.corr(method='spearman')
#           x-values  y-values
# x-values  1.000000  0.975758
# y-values  0.975758  1.000000
xyz.corr(method='spearman')
#           x-values  y-values  z-values
# x-values  1.000000  0.975758 -1.000000
# y-values  0.975758  1.000000 -0.975758
# z-values -1.000000 -0.975758  1.000000
xy.corrwith(z, method='spearman')
# x-values   -1.000000
# y-values   -0.975758
# dtype: float64
```

Calculate Kendall's tau with method='kendall'.

```python
x.corr(y, method='kendall')
# 0.911111111111111
xy.corr(method='kendall')
#           x-values  y-values
# x-values  1.000000  0.911111
# y-values  0.911111  1.000000
xyz.corr(method='kendall')
#           x-values  y-values  z-values
# x-values  1.000000  0.911111 -1.000000
# y-values  0.911111  1.000000 -0.911111
# z-values -1.000000 -0.911111  1.000000
xy.corrwith(z, method='kendall')
# x-values   -1.000000
# y-values   -0.911111
# dtype: float64
```

Visualization of correlation

Data visualization is very important in statistics and data science. It can help you understand the data better and see the relationships between features more clearly.

Matplotlib is used here for data visualization.

```python
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import numpy as np
import scipy.stats

x = np.arange(10, 20)
y = np.array([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])
z = np.array([5, 3, 2, 1, 0, -2, -8, -11, -15, -16])
xyz = np.array([[10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
                [2, 1, 4, 5, 8, 12, 18, 25, 96, 48],
                [5, 3, 2, 1, 0, -2, -8, -11, -15, -16]])
```

XY graphs with regression lines

Use linregress() to get the slope and intercept of the regression line, as well as the correlation coefficient.

```python
slope, intercept, r, p, stderr = scipy.stats.linregress(x, y)
```

Construct the linear regression formula.

```python
line = f'y = {intercept:.2f} + {slope:.2f}x, r = {r:.2f}'
line
# 'y = -85.93 + 7.44x, r = 0.76'
```

Plot with .plot():

```python
fig, ax = plt.subplots()
ax.plot(x, y, linewidth=0, marker='s', label='Data points')
ax.plot(x, intercept + slope * x, label=line)
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.legend(facecolor='white')
plt.show()
```

Heat map of the correlation matrix: Matplotlib

A heat map is ideal for handling a correlation matrix with many features.

```python
corr_matrix = np.corrcoef(xyz).round(decimals=2)
corr_matrix
# array([[ 1.  ,  0.76, -0.97],
#        [ 0.76,  1.  , -0.83],
#        [-0.97, -0.83,  1.  ]])
```

For convenience, the correlation data is rounded and then plotted with .imshow().

```python
fig, ax = plt.subplots()
im = ax.imshow(corr_matrix)
im.set_clim(-1, 1)
ax.grid(False)
ax.xaxis.set(ticks=(0, 1, 2), ticklabels=('x', 'y', 'z'))
ax.yaxis.set(ticks=(0, 1, 2), ticklabels=('x', 'y', 'z'))
ax.set_ylim(2.5, -0.5)
for i in range(3):
    for j in range(3):
        ax.text(j, i, corr_matrix[i, j], ha='center', va='center', color='r')
cbar = ax.figure.colorbar(im, ax=ax, format='%.2f')
plt.show()
```

Heat map of the correlation matrix: seaborn

```python
import seaborn as sns

plt.figure(figsize=(11, 9), dpi=100)
sns.heatmap(data=corr_matrix,
            annot=True,  # write the coefficient into each cell
            annot_kws={'size': 8, 'weight': 'normal', 'color': '#253D24'})  # font size, weight, and color of the annotations
plt.show()
```

That concludes this overview of three ways to perform data correlation analysis in Python. To truly master these techniques, practice them on your own data.
