How to Implement Linear Regression with StatsModels in Python


This article shows how to implement linear regression in Python with StatsModels. The presentation is kept simple and clear; I hope it helps resolve your doubts as you study and work through "How to Implement Linear Regression with StatsModels in Python".

1. Background knowledge

1.1 Interpolation, fitting, regression and prediction

Interpolation, fitting, regression and prediction are concepts frequently mentioned in mathematical modeling, and they are often confused with one another.

Interpolation constructs a continuous function from discrete data so that the continuous curve passes through every given data point. It is an important method of discrete function approximation: from the values of a function at a finite number of points, it estimates approximate values of the function at other points.
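As a minimal illustration of this idea (np.interp is standard NumPy, not part of this article's own code, and the data here are made up), linear interpolation estimates the function between known points, and the resulting curve passes exactly through them:

import numpy as np

xp = [0.0, 1.0, 2.0]  # x coordinates of the known data points
fp = [0.0, 2.0, 4.0]  # function values at those points
print(np.interp(1.5, xp, fp))  # 3.0: estimate at x = 1.5, between the given points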

Fitting finds a continuous function (curve) that is close to the given discrete data, so that the curve agrees with the data as well as possible overall.

Both interpolation and fitting therefore find approximate curves with similar characteristics from known data points. The difference is that interpolation requires the curve to pass exactly through every given data point, while fitting only requires the curve to be as close to the data points as possible overall, reflecting the data's pattern of change and trend. Interpolation can be regarded as a special case of fitting in which the error at the data points is required to be 0. Because data points usually contain errors, zero error often means over-fitting, and an over-fitted model generalizes poorly to data outside the training set. In practice, therefore, interpolation is mostly used in image processing, while fitting is mostly used in processing experimental data.

Regression is a statistical method for studying the relationship between one set of random variables and another. It includes building a mathematical model, estimating the model parameters, and testing the credibility of the model; it also includes using the fitted model and estimated parameters for prediction or control.

Prediction is a very broad concept. It refers to studying the available data and information quantitatively, building a mathematical model suited to the purpose of prediction, and then quantitatively forecasting future development and change. Interpolation and fitting are generally regarded as prediction methods.

Regression is a method of data analysis, while fitting is a specific technique of data processing. Fitting focuses on optimizing curve parameters so that the curve agrees with the data; regression focuses on the relationship between two or more variables.

1.2 Linear regression

Regression analysis is a statistical method that studies the quantitative relationship between independent and dependent variables. It is often used in predictive analysis, time-series models, and the search for causal relationships between variables. According to the type of relationship between the variables, regression analysis is divided into linear regression and nonlinear regression.

Linear regression assumes that the target (y) and the features (X) in a given data set have a linear relationship, i.e. that they satisfy a multivariate linear equation. If the analysis involves only one independent variable and one dependent variable, and their relationship can be approximated by a straight line, it is called univariate linear regression; if it involves two or more independent variables with a linear relationship to the dependent variable, it is called multivariate linear regression.

From the sample data, the least-squares method yields estimates of the parameters of the linear regression model: the estimated parameters minimize the sum of squared errors between the values computed from the model and the given sample data.
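To make this concrete, here is a minimal sketch (illustrative made-up data; np.linalg.lstsq is standard NumPy) of computing the least-squares estimate directly and evaluating the minimized sum of squared errors:

import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])  # illustrative sample data
X = np.column_stack((np.ones_like(x), x))  # design matrix with an intercept column
beta = np.linalg.lstsq(X, y, rcond=None)[0]  # least-squares solution of X @ beta ≈ y
sse = np.sum((y - X @ beta) ** 2)  # the minimized sum of squared errors
print(beta, sse)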

Beyond the estimation itself, one must analyze whether linear regression is applicable to the sample data: whether the linear-correlation hypothesis is reasonable and whether the linear model is stable. This requires statistical significance tests of whether the linear relationship between the dependent and independent variables is significant, and whether a linear model is appropriate for describing their relationship.

2. Statsmodels linear regression

This section introduces linear fitting and regression analysis using the Statsmodels statistical analysis package. The linear model can be expressed as follows:

y = β0 + β1*x1 + … + βm*xm + e

2.1 Import Toolkit

import numpy as np
import statsmodels.api as sm
from statsmodels.sandbox.regression.predstd import wls_prediction_std

2.2 Import sample data

Sample data is usually stored in a data file, which must be read in to obtain it. To make the program easy to read and test, this article generates the sample data with random numbers; reading data from files is described later.

# generate sample data:
nSample = 100
x1 = np.linspace(0, 10, nSample)  # nSample points evenly spaced from 0 to 10
e = np.random.normal(size=len(x1))  # normally distributed random errors
yTrue = 2.36 + 1.58 * x1  # ideal model: y = b0 + b1*x1
yTest = yTrue + e  # generate model data

This case is a univariate linear regression problem. (x1, yTest) is the imported sample data; we want to obtain the quantitative relationship between the dependent variable y and the independent variable x by linear regression. yTrue holds the values of the ideal model, and yTest simulates experimentally measured data by adding normally distributed random errors to the ideal model.

2.3 Modeling and fitting

The equation of the univariate linear regression model is:

y = β0 + β1*x + e

sm.add_constant() adds an intercept column to the matrix X; sm.OLS() then builds the ordinary least squares model; finally, model.fit() fits the linear regression model and returns a summary of the fitting and statistical analysis results.

X = sm.add_constant(x1)  # add the intercept column x0 = [1, …, 1]
model = sm.OLS(yTest, X)  # build the ordinary least squares (OLS) model
results = model.fit()  # return the results of the model fit

statsmodels.OLS is defined in statsmodels.regression.linear_model and takes four parameters: (endog, exog, missing, hasconst).

The first parameter, endog, is the dependent variable y(t) of the regression model; it is a 1-d array.

The second parameter, exog, holds the independent variables x0(t), x1(t), …, xm(t); it is a 2-d array with m+1 columns, one per variable.

It should be noted that the regression model of statsmodels.OLS has no separate constant term; it has the form:

y = βX + e = β0*x0 + β1*x1 + e, x0 = [1, …, 1]

The data imported earlier, (x1, yTest), does not contain x0, so an intercept column x0 = [1, …, 1] must be added to convert the independent variable matrix to X = (x0, x1). The function sm.add_constant() does exactly this.

The parameter missing controls the handling of missing data, and hasconst indicates whether a constant is already included; neither is usually needed.
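Before looking at the full summary, note that individual results can be pulled directly from the fitted results object. The following minimal sketch uses standard statsmodels result attributes and the wls_prediction_std helper imported in section 2.1; it is an addition, not part of the original listing:

print(results.params)  # estimated coefficients [b0, b1]
yFit = results.fittedvalues  # fitted y values at the sample points
prstd, ivLow, ivUp = wls_prediction_std(results)  # prediction std and confidence band per point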

2.4 Output of the fitting and statistical results

Statsmodels produces very rich output for linear regression analysis; results.summary() returns a summary of the regression analysis.

print(results.summary())  # output the summary of the regression analysis

The summary contains a great deal of information; here we first discuss some of the most important results, found in the middle block of the summary.

                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          2.4669      0.186     13.230      0.000       2.097       2.837
x1             1.5883      0.032     49.304      0.000       1.524       1.652

coef: the regression coefficients, i.e. the estimated values of the model parameters β0, β1, ….

std err: the standard error of each coefficient estimate, the arithmetic square root of its variance; it reflects the average deviation between the estimate and the sample data. The larger the standard error, the less reliable the estimated regression coefficient.

t: the t-statistic, equal to the regression coefficient divided by its standard error. It tests each regression coefficient individually, i.e. whether each independent variable has a significant influence on the dependent variable. If the influence of an independent variable xi is not significant, that variable can be removed from the model.

P>|t|: the P value of the t test (Prob(t-statistic)), reflecting the significance of the hypothesis that each independent variable xi is correlated with the dependent variable y. If p < 0.05, the regression coefficient of xi is significant at the 95% confidence level.

[0.025, 0.975]: the lower and upper limits of the 95% confidence interval of each estimated coefficient.

The middle and bottom blocks of the summary for this example read:

                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          2.4669      0.186     13.230      0.000       2.097       2.837
x1             1.5883      0.032     49.304      0.000       1.524       1.652
==============================================================================
Omnibus:                        0.070   Durbin-Watson:                   2.016
Prob(Omnibus):                  0.966   Jarque-Bera (JB):                0.187
Skew:                           0.056   Prob(JB):                        0.911
Kurtosis:                       2.820   Cond. No.                         11.7
==============================================================================

OLS model: y = b0 + b1 * x
Parameters: [2.46688389 1.58832741]

4. Multiple linear regression

4.1 Multiple linear regression Python program:

# LinearRegression_v2.py
# Linear regression with statsmodels (OLS: Ordinary Least Squares)
# v2.0: call statsmodels to perform multiple linear regression
# date: 2021-05-04

import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.sandbox.regression.predstd import wls_prediction_std

def main():  # main program
    # generate test data:
    nSample = 100
    x0 = np.ones(nSample)  # intercept column x0 = [1, …, 1]
    x1 = np.linspace(0, 20, nSample)  # nSample points evenly spaced from 0 to 20
    x2 = np.sin(x1)
    x3 = (x1 - 5) ** 2
    X = np.column_stack((x0, x1, x2, x3))  # shape (nSample, 4): [x0, x1, x2, x3]
    beta = [5.0, 0.5, 0.5, -0.02]  # beta = [b0, b1, b2, b3]
    yTrue = np.dot(X, beta)  # vector dot product: y = b0 + b1*x1 + ... + bm*xm
    yTest = yTrue + 0.5 * np.random.normal(size=nSample)  # generate model data

    # multiple linear regression: ordinary least squares (OLS)
    model = sm.OLS(yTest, X)  # build the OLS model: y = b0 + b1*x1 + ... + bm*xm + e
    results = model.fit()  # return the results of the model fit
    yFit = results.fittedvalues  # y values of the fitted model
    print(results.summary())  # output the summary of the regression analysis
    print("\nOLS model: y = b0 + b1*x1 + ... + bm*xm")
    print('Parameters:', results.params)  # output the coefficients of the fitted model

    # plot: original data points, fitted curve, confidence interval
    prstd, ivLow, ivUp = wls_prediction_std(results)  # standard deviation and confidence interval
    fig, ax = plt.subplots(figsize=(10, 8))
    ax.plot(x1, yTest, 'o', label="data")  # experimental data (original data + error)
    ax.plot(x1, yTrue, 'b-', label="True")  # original data
    ax.plot(x1, yFit, 'r-', label="OLS")  # fitted data
    ax.plot(x1, ivUp, '--', color='orange', label="ConfInt")  # upper confidence bound
    ax.plot(x1, ivLow, '--', color='orange')  # lower confidence bound
    ax.legend(loc='best')  # show the legend
    plt.xlabel('x')
    plt.ylabel('y')
    plt.show()
    return

if __name__ == '__main__':
    main()

4.2 Results of the multiple linear regression program:

                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.932
Model:                            OLS   Adj. R-squared:                  0.930
Method:                 Least Squares   F-statistic:                     440.0
Date:                Thu, 06 May 2021   Prob (F-statistic):           6.04e-56
Time:                        10:38:51   Log-Likelihood:                -68.709
No. Observations:                 100   AIC:                             145.4
Df Residuals:                      96   BIC:                             155.8
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.0411      0.120     41.866      0.000       4.802       5.280
x1             0.4894      0.019     26.351      0.000       0.452       0.526
x2             0.5158      0.072      7.187      0.000       0.373       0.658
x3            -0.0195      0.002    -11.957      0.000      -0.023      -0.016
==============================================================================
Omnibus:                        1.472   Durbin-Watson:                   1.824
Prob(Omnibus):                  0.479   Jarque-Bera (JB):                1.194
Skew:                           0.011   Prob(JB):                        0.551
Kurtosis:                       2.465   Cond. No.                         223.
==============================================================================

OLS model: y = b0 + b1*x1 + ... + bm*xm
Parameters: [ 5.04111867  0.4893574   0.51579806 -0.01951219]
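Although the article stops at fitting and plotting, a fitted OLS model can also be applied to new data. The following is a minimal sketch: results.predict() is standard statsmodels API, the variable names follow the multiple regression program above, and x1New/XNew are illustrative names introduced here:

# predict y for new observations; columns must match the design matrix X
x1New = np.linspace(0, 20, 5)
XNew = np.column_stack((np.ones(5), x1New, np.sin(x1New), (x1New - 5) ** 2))
yNew = results.predict(XNew)  # apply the estimated coefficients to the new rows
print(yNew)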

5. Appendix: detailed description of the regression results

Dep. Variable: y, the dependent variable

Model: OLS, the least squares model

Method: Least Squares, the estimation method

No. Observations: the number of sample observations

Df Residuals: residual degrees of freedom (degree of freedom of residuals)

Df Model: model degrees of Freedom (degree of freedom of model)

Covariance Type: the robustness of the covariance matrix (nonrobust here)

R-squared: the coefficient of determination R²

Adj. R-squared: the adjusted coefficient of determination

F-statistic: the F statistic of the overall significance test

Prob (F-statistic): P value of F test

Log-Likelihood: the log-likelihood of the fit

coef: the estimated coefficients of the independent variables and the constant term, b0, b1, …, bm

Std err: the standard error of each coefficient estimate

t: the t statistic for testing each coefficient's significance

P > | t |: P value of t test

[0.025, 0.975]: lower and upper limits of 95% confidence interval of estimated parameters

Omnibus: testing data normality based on kurtosis and skewness

Prob (Omnibus): test probability of data normality based on kurtosis and skewness

Durbin-Watson: check whether there is autocorrelation in the residual

Skewness: skewness, reflecting the degree of asymmetry of data distribution

Kurtosis: kurtosis, reflecting the steepness or smoothness of data distribution

Jarque-Bera (JB): test of data normality based on kurtosis and skewness

Prob (JB): P value of Jarque-Bera (JB) test.

Cond. No.: the condition number, which indicates whether there is exact or high correlation (multicollinearity) among the variables.
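Most of these quantities can also be read programmatically from the fitted results object (the results returned by model.fit() above) instead of being parsed from the printed summary. A short sketch using standard statsmodels result attributes, added here for reference:

print(results.rsquared, results.rsquared_adj)  # R-squared and Adj. R-squared
print(results.fvalue, results.f_pvalue)  # F-statistic and Prob (F-statistic)
print(results.params)  # coef: b0, b1, ..., bm
print(results.bse)  # std err of each coefficient
print(results.tvalues, results.pvalues)  # t statistics and P>|t| values
print(results.conf_int(alpha=0.05))  # the [0.025, 0.975] confidence intervals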

These are all the contents of the article "How to Implement Linear Regression with StatsModels in Python". Thank you for reading! I hope this sharing has helped you.
