How to use Python for multiple linear regression
This article introduces how to use Python for multiple linear regression. Many people run into difficulties with this in practice, so let's walk through how to handle these situations. I hope you read it carefully and get something out of it!
Figure 1. Formulas to be used in multiple regression models
As shown in figure 1, we assume that the linear regression model between the random variable y and the general variables x1, x2, ..., xp is formula (1), where y is the dependent variable, x1, x2, ..., xp are the independent variables, β1, β2, ..., βp are the regression coefficients, and β0 is the regression constant. For a practical problem, if we obtain n sets of observations (xi1, xi2, ..., xip; yi), i = 1, 2, ..., n, then we can write these n groups of observations in the matrix form y = Xβ + ε.
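Figure 1 is an image, so as a sketch, here is the standard notation its formulas presumably use; the first line is model (1) and the second is the matrix form:

y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon \qquad (1)

\mathbf{y} = X\boldsymbol{\beta} + \boldsymbol{\varepsilon}

where y = (y1, ..., yn)^T, β = (β0, β1, ..., βp)^T, ε = (ε1, ..., εn)^T, and X is the n × (p+1) matrix whose i-th row is (1, x_{i1}, x_{i2}, ..., x_{ip}).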
After finding the regression equation, we usually test its significance. The significance test here has three parts. The first is the F test, which checks whether the independent variables x1, x2, ..., xp as a whole have a significant influence on y; it mainly uses formulas (2), (3) and (4), where (2) and (3) are the same formula written with different symbols. The second is the t test, which tests the significance of each individual independent variable, that is, whether each independent variable has a significant effect on y; this differs from the overall test above. The third is the goodness of fit, R², whose value lies between 0 and 1: the closer it is to 1, the better the regression fits, and the closer to 0, the worse. However, R² only reflects how well the model fits the data; it cannot replace the F test as a strict significance test.
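Formulas (2) to (4) are not reproduced in the text, but the standard definitions they presumably correspond to are the following, with SSR the regression sum of squares, SSE the residual sum of squares, and SST the total sum of squares:

SST = \sum_i (y_i - \bar{y})^2, \quad SSR = \sum_i (\hat{y}_i - \bar{y})^2, \quad SSE = \sum_i (y_i - \hat{y}_i)^2, \quad SST = SSR + SSE

F = \frac{SSR / p}{SSE / (n - p - 1)} \sim F(p,\, n - p - 1), \qquad R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}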
The above is a brief introduction to multiple linear regression. Its detailed theory goes much further; interested readers can consult the relevant literature, and we will not repeat it here. Instead we focus on how to do the analysis in Python. Let's walk through the analysis process of multiple linear regression in code.
The data we use here come from the 2013 China Statistical Yearbook. The dependent variable y is residents' consumption expenditure, and the other nine variables are independent variables: x1 is residents' food expenditure, x2 is clothing expenditure, x3 is housing expenditure, x4 is health care expenditure, x5 is culture, education and entertainment expenditure, x6 is the average wage of employees, x7 is the region's per capita GDP, x8 is the regional consumer price index, and x9 is the regional unemployment rate. Of all these variables, x1 to x7 and y are in yuan, x9 is a percentage, and x8 has no units because it is a price index. The dataset is 31x10 overall, that is, 31 rows and 10 columns, roughly as shown in figure 2.
Figure 2. Partial contents of the dataset
The first step is to import the required libraries.
import numpy as np
import pandas as pd
import statsmodels.api as sm
Next comes the data preprocessing. Because the column labels of the original data are too long, we clean them up, removing the Chinese and leaving only short English names.
file = r'C:\Users\data.xlsx'  # path to the dataset
data = pd.read_excel(file)
data.columns = ['y', 'x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9']  # short English column names
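As a quick sanity check (a minimal addition, assuming the file loaded correctly), you can confirm the shape and the renamed columns:

print(data.shape)             # expect (31, 10): 31 rows, y plus x1..x9
print(data.columns.tolist())  # ['y', 'x1', 'x2', ..., 'x9']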
Then we build the multiple linear model with the following code.
x = sm.add_constant(data.iloc[:, 1:])  # generate independent variables (adds the intercept column)
y = data['y']                          # generate dependent variable
model = sm.OLS(y, x)                   # generate the model
result = model.fit()                   # fit the model
result.summary()                       # model description
Obviously, the independent variables here are the nine variables x1 to x9. The code data.iloc[:, 1:] takes every column of the original data except the first, that is, it drops the y column, and result.summary() generates a description of the result, as shown in figure 3.
Figure 3. Regression results containing all independent variables
In this result, we mainly look at "coef", "t" and "P > |t|". coef gives the regression coefficients mentioned earlier, and the const value is the regression constant, so the regression model we get is y = 320.640948 + 1.316588x1 + 1.649859x2 + 2.17866x3 - 0.005609x4 + 1.684283x5 + 0.01032x6 + 0.003655x7 - 19.130576x8 + 50.515575x9. "t" and "P > |t|" are equivalent; just use one of them. They are mainly used to judge the linear significance of each individual independent variable with respect to y, which we will come to shortly. The figure also shows that Prob (F-statistic) is 4.21e-20. This is the familiar P value; being close to zero, it indicates that our multiple linear equation is significant, that is, y has a significant linear relationship with x1, x2, ..., x9, and R-squared is 0.992, which also suggests the linear fit is good.

In theory this multiple linear equation is now solved and the fit is good, so we could use it to predict, but we need to go one step further. As noted above, y has a significant linear relationship with x1, x2, ..., x9, but note that this treats the nine variables x1 to x9 as a whole; it does not mean that y has a significant linear relationship with each individual independent variable. We therefore need to find the independent variables that have no significant linear relationship with y and eliminate them, keeping only the significant ones. This is the t test mentioned earlier. The theory behind the t test is somewhat involved; interested readers can look it up, and we will not repeat it here.

We judge from the "P > |t|" column in figure 3. Choose a threshold; the values commonly used in statistics are 0.05, 0.02 or 0.01, and here we use 0.05. Any independent variable whose value in this column exceeds 0.05 is not significantly linearly related to y and should be excluded. Note that "independent variable" here means x1 to x9; the const row in figure 3 is not included. There is one rule: remove only one variable at a time, namely the one with the largest P value. For example, x4 has the largest P value in figure 3, so remove it, then repeat the modeling process with the remaining x1, x2, x3, x5, x6, x7, x8, x9, again find the independent variable with the largest P value, eliminate it, and repeat until all P values are at most 0.05. The independent variables that remain are the ones whose linear relationship with y is significant, and these are the variables we use to build the model.
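A single elimination step can be sketched with the fitted result's attributes (standard statsmodels API; the 0.05 threshold follows the text):

pvalues = result.pvalues.drop('const')  # P values of x1..x9 only, const excluded
if pvalues.max() > 0.05:
    worst = pvalues.idxmax()            # variable with the largest P value, e.g. 'x4'
    print('remove', worst)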
We can write the above process as a function, named looper, with the following code.
def looper(limit):
    cols = ['x1', 'x2', 'x3', 'x5', 'x6', 'x7', 'x8', 'x9']  # x4 has already been removed
    for i in range(len(cols)):
        data1 = data[cols]
        x = sm.add_constant(data1)           # generate independent variables
        y = data['y']                        # generate dependent variable
        model = sm.OLS(y, x)                 # generate the model
        result = model.fit()                 # fit the model
        pvalues = result.pvalues             # get all P values in the result
        pvalues.drop('const', inplace=True)  # drop the const P value
        pmax = max(pvalues)                  # choose the maximum P value
        if pmax > limit:
            ind = pvalues.idxmax()           # find the index with the maximum P value
            cols.remove(ind)                 # remove this index from cols
        else:
            return result

result = looper(0.05)
result.summary()
The result is shown in figure 4. We can see that the variables remaining at the end are x1, x2, x3 and x5, and the multiple linear model we obtain is y = -1694.6269 + 1.3642x1 + 1.7679x2 + 2.2894x3 + 1.7424x5. This is the effective multiple linear model that we will ultimately use.
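Since the text mentions using the model to predict, here is a minimal sketch; the input values are made up for illustration, and the fitted model expects a const column plus x1, x2, x3, x5 in that order:

new_x = pd.DataFrame({'const': [1.0],
                      'x1': [5000.0], 'x2': [1200.0],
                      'x3': [1500.0], 'x5': [900.0]})  # hypothetical expenditures in yuan
print(result.predict(new_x))  # predicted consumption expenditure y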
Figure 4. Regression model after eliminating invalid variables
So here comes the question: between the earlier model that contains all the independent variables and the model with some variables removed, which should we choose? After all, the overall linear fit of the first model is also quite significant. In the author's experience, it depends on the specific project requirements. The problems we meet in real projects are examples from real life, no longer purely mathematical ones. For instance, the consumer price index x8 and the regional unemployment rate x9 in this case must have some influence on y; eliminating them blindly may hurt the final result. So we should decide according to the actual needs.
This is the end of "how to use Python for multiple linear regression". Thank you for reading. If you want to learn more about the field, you can follow the site, where more practical articles will be published!