
How to understand Python linear regression


This article explains how to understand linear regression in Python. The explanation is simple and clear and easy to follow; work through it step by step to study Python linear regression.

Foreword:

The linear regression model is a classical statistical model. It is used to predict a continuous numerical variable (the dependent variable) from known variables (the independent variables). For example, a restaurant may predict dining volume or turnover from its operating data (menu prices, number of diners, number of reservations, discounts on special dishes, etc.); a website may predict its users' payment conversion rate from historical visit data (new-user registrations, activity of existing users, frequency of content updates, etc.); a hospital may predict the probability of certain diseases from patients' medical records (physical examination indicators, medication use, eating habits, etc.).

Understand what linear regression is.

Linear regression is also called ordinary least-squares (OLS) regression. The mathematical model is:

y = a + b*x + e

where a is called the constant term or intercept; b is called the regression coefficient or slope of the model; and e is the error term. a and b are the parameters of the model.

Of course, model parameters can only be estimated from sample data:

y' = a' + b'*x

Our goal is to select the appropriate parameters for this linear model to best fit the observed values. The better the fit, the better the model.
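As a minimal sketch of what "selecting the parameters" means under ordinary least squares (the sample values below are made up purely for illustration), the slope and intercept of a one-variable model can be computed from the sample covariance and variance:

import numpy as np

# Hypothetical sample data, for illustration only
x = np.array([4, 7, 8, 9, 12, 15], dtype=float)
y = np.array([2, 13, 16, 10, 28, 26], dtype=float)

# Least-squares estimates: b' = cov(x, y) / var(x), a' = mean(y) - b' * mean(x)
b_hat = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
a_hat = y.mean() - b_hat * x.mean()
print(a_hat, b_hat)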

Because there is already plenty of documentation on specific operations, this article focuses less on operations and more on methods.

Packages used in this article:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import scipy.stats as stats
import statsmodels.api as sm
from scipy.stats import chi2_contingency

1. Simple linear regression model

Also known as a one-variable linear regression model, it contains only one independent variable and one dependent variable.

Generally, the relationship between two variables can be characterized with a scatter plot, and a simple linear fit line can be drawn on top of it to make the relationship more intuitive.

Take the classic braking distance as an example:

ccpp = pd.read_csv('cars.csv')
sns.set(font=getChineseFont(8).get_name())  # getChineseFont is a helper for displaying Chinese fonts (definition not shown)
sns.lmplot(x='speed', y='dist', data=ccpp,
           legend_out=False,  # render legend inside the frame
           truncate=True)     # truncate the fit line to the actual data range
plt.show()

From the scatter plot, there is a significant positive correlation between the independent variable speed and the dependent variable dist: the greater the braking speed, the longer the braking distance. The shaded area in the plot is the 95% confidence interval of the fitted line, and the scatter points lie as close to the fitted line as possible.

The parameters of the regression model are estimated with the ols function; the formula 'y ~ x' specifies a simple linear regression model.

fit = sm.formula.ols('dist~speed', data=ccpp).fit()
print(fit.params)  # Intercept: -17.579095, speed: 3.932409

Thus, a simple linear model for braking distance is:

dist = -17.579095 + 3.932409 * speed

In other words, every unit increase in braking speed will increase the braking distance by 3.93 units.
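Continuing with the fit object above, a quick, hedged illustration of using the model for prediction (the speed value of 20 is arbitrary):

# Predict the braking distance for a hypothetical speed of 20
new_data = pd.DataFrame({'speed': [20]})
print(fit.predict(new_data))   # roughly -17.579095 + 3.932409 * 20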

2. Multiple linear regression model

In practice, simple linear regression models are rare, because there is often more than one independent variable affecting the dependent variable. The data used to construct a multiple linear regression model actually consists of two parts: a one-dimensional dependent variable y and a two-dimensional matrix of independent variables x.

Take the profit statement data as an example to study the factors affecting profit.

The table is structured as follows:

profit = pd.read_csv('Profit.csv', sep=',')
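To see that structure, a quick look at the first few rows (the column names RD_Spend, Administration, Marketing_Spend and Profit are assumed from the formula used in the next step):

print(profit.head())
print(profit.shape)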

fit = sm.formula.ols('Profit~RD_Spend+Administration+Marketing_Spend', data=profit).fit()
print(fit.params)
# Intercept: 50122.192990, RD_Spend: 0.805715, Administration: -0.026816, Marketing_Spend: 0.027228

Regardless of the significance of the model and the significance of the regression coefficients, the resulting regression model can be expressed as:

Profit = 50122.192990 + 0.805715 * RD_Spend - 0.026816 * Administration + 0.027228 * Marketing_Spend

However, in practice, a multiple linear regression model built directly from the partial regression coefficients returned by the ols function is often not ideal. The significance of the model and of the regression coefficients must then be checked with the statistical F-test and t-test.

2.1 Testing the significance of the model with the F-test

The procedure is as follows:

State the null hypothesis and alternative hypothesis of the problem;

Construct the statistic F under the null hypothesis;

Calculate the value of the statistic F from the sample information;

Compare the value of the statistic with the theoretical value: if the calculated statistic exceeds the theoretical value, reject the null hypothesis; otherwise accept it.

The null hypothesis is that all partial regression coefficients of the model are zero (i.e., that no linear combination of the independent variables explains the dependent variable).

In practice, the probability value p is usually compared with 0.05: if p is less than 0.05 the null hypothesis is rejected; otherwise it is accepted, as in the sketch below.
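A minimal sketch of both decision rules, applied to the fitted multiple regression model above (the 0.05 significance level follows the text; fvalue, f_pvalue, df_model and df_resid are attributes of the statsmodels result object):

# F statistic and its p-value
print(fit.fvalue, fit.f_pvalue)

# Theoretical F value at the 0.05 significance level
F_theory = stats.f.ppf(q=0.95, dfn=fit.df_model, dfd=fit.df_resid)
print(fit.fvalue > F_theory)   # True -> reject the null hypothesis
print(fit.f_pvalue < 0.05)     # equivalent decision based on p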

2.2 Testing the significance of the regression coefficients with the t-test

Passing the model significance test only shows that the model's linear combination of independent variables is reasonable for the dependent variable; it does not show that each individual independent variable is significant for the dependent variable, so the regression coefficients should also be tested for significance.

Only when a regression coefficient passes the t-test can it be considered significant. The starting point of the t-test is to verify whether each independent variable is an important factor affecting the dependent variable. The null hypothesis of the t-test is that the regression coefficient of the j-th variable is 0, i.e., that the variable is not an important factor for the dependent variable. If the t-statistic is greater than the theoretical t-distribution value, the null hypothesis is rejected; otherwise it is accepted. Equivalently, the decision can be made from the probability value p.
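Similarly, a hedged sketch of the coefficient-level decision (tvalues and pvalues are attributes of the fitted result; the 0.05 level is the same assumption as above):

# t statistics and p-values of each regression coefficient
print(fit.tvalues)
print(fit.pvalues)

# Theoretical two-sided t value at the 0.05 significance level
t_theory = stats.t.ppf(q=0.975, df=fit.df_resid)
print(np.abs(fit.tvalues) > t_theory)   # True -> the coefficient is significant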

# Return the model overview
print(fit.summary())

As can be seen from the summary output, F-statistic: 296.0 and Prob (F-statistic): 4.53e-30; the F statistic is 296.0 and the corresponding probability value p is far less than 0.05, so the null hypothesis is rejected and the model is considered significant.

Among the t-statistics of the individual variables, the probability values p for the Administration and Marketing_Spend variables are greater than 0.05, so the null hypothesis cannot be rejected: these variables are not significant and cannot be identified as important factors affecting Profit.

For the F-test, if the null hypothesis cannot be rejected, the model is considered invalid; the usual remedies are to increase the amount of data, change the independent variables, or choose another model. For the t-test, if the null hypothesis cannot be rejected, the corresponding independent variable is considered to have no linear relationship with the dependent variable; the usual remedies are to remove the variable or to modify it (e.g., apply a suitable mathematical transformation).

Based on the model overview returned for fit, since the t-test results for the Administration and Marketing_Spend variables are not significant, the scatter relationship between these variables and the dependent variable Profit can be explored; if there is indeed no linear relationship, they can be excluded from the model.

sns.lmplot(x='Administration', y='Profit', data=profit,
           legend_out=False,  # render legend inside the frame
           fit_reg=False)     # do not show the fit line
sns.lmplot(x='Marketing_Spend', y='Profit', data=profit,
           legend_out=False,  # render legend inside the frame
           fit_reg=False)     # do not show the fit line
plt.show()

In the figure, the independent variables Administration and Marketing_Spend show no obvious linear relationship with the dependent variable Profit, so it can be considered that there is no linear dependence between them.

# Remove the Administration and Marketing_Spend variables from the model
fit2 = sm.formula.ols('Profit~RD_Spend', data=profit).fit()
print(fit2.params)    # Intercept: 49032.899141, RD_Spend: 0.854291
print(fit2.summary()) # Prob (F-statistic): 3.50e-32; P>|t|: 0.000

After adjusting the model fit, the new model fit2 still passes the significance test, and the coefficients corresponding to each independent variable also pass the significance test.

The final result is:

Profit = 49032.899141 + 0.854291 * RD_Spend

The coefficients in this regression model are interpreted such that, ceteris paribus, each unit increase in RD_Spend increases Profit by 0.854291 units.
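As a small, hedged check of the adjusted model (rsquared is the goodness of fit; the RD_Spend value below is an arbitrary illustration, not from the article):

# Goodness of fit of the simplified model
print(fit2.rsquared)

# Predicted Profit for a hypothetical RD_Spend value
print(fit2.predict(pd.DataFrame({'RD_Spend': [100000]})))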

3. Identifying Outliers Based on Regression Model

The calculation of a regression model relies on means of the variables, and the biggest disadvantage of the mean is that it is easily affected by outliers (or extreme values).

If there are outliers in the modeling data, they will affect the effectiveness of the model to some extent.

For regression models, the hat matrix, the DFFITS criterion, studentized residuals, or Cook's distance are commonly used for outlier detection.

These four methods can determine whether the i-th sample in the dataset is an outlier, provided a linear regression model has already been constructed; the values of the four statistics are then obtained through the get_influence method.

Continue using the above data.

outliers = fit2.get_influence()
# high-leverage points (diagonal of the hat matrix)
leverage = outliers.hat_matrix_diag
# DFFITS values
dffits = outliers.dffits[0]
# studentized residuals
resid_stu = outliers.resid_studentized_external
# Cook's distance
cook = outliers.cooks_distance[0]

# Combine the statistics from the four outlier tests above
concat_result = pd.concat([pd.Series(leverage, name='leverage'),
                           pd.Series(dffits, name='dffits'),
                           pd.Series(resid_stu, name='resid_stu'),
                           pd.Series(cook, name='cook')], axis=1)

# Merge concat_result with the profit dataset
raw_outliers = pd.concat([profit, concat_result], axis=1)

The first five rows of the combined dataset can be inspected as follows:
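A minimal way to do that (the exact values depend on the Profit.csv data used):

print(raw_outliers.head())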

For simplicity, studentized residuals are used here. When the absolute studentized residual is greater than 2, the corresponding data point is considered an outlier.

# Calculate the proportion of outliers
outliers_ratio = sum(np.where(np.abs(raw_outliers.resid_stu) > 2, 1, 0)) / raw_outliers.shape[0]
print(outliers_ratio)  # 0.04

The results show that, judging by the studentized residuals, the proportion of outliers is 4%. Because this proportion is very small, the outliers can simply be removed from the dataset, and continuing the modeling on the remaining data gives a more robust and reasonable model.

# Remove outliers by filtering
none_outliers = raw_outliers.loc[np.abs(raw_outliers.resid_stu) <= 2, ]
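A hedged sketch of continuing the modeling on the filtered data, as suggested above (the coefficients will differ slightly from fit2):

fit3 = sm.formula.ols('Profit~RD_Spend', data=none_outliers).fit()
print(fit3.params)
print(fit3.summary())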
