This article introduces how to realize multiple linear regression in R language. Many people run into difficulties with this in actual practice; below, the editor walks you through how to handle these situations. I hope you will read carefully and get something out of it!
⑴ Establishment of the multiple regression model
When there is more than one independent variable, the model is a multiple linear regression (multivariable linear regression, MLR); polynomial regression can be regarded as a special case of multiple linear regression. Here we take the microbial community data as an example and explore the relationship between the α-diversity index and the environmental factors (Salinity, pH, TN and TP, which explained a large share of the microbial community variation in the VPA analysis of section 3.3.2.4). The code is as follows:

# read species and environmental factor information
data=read.csv("otu_table.csv", header=TRUE, row.names=1)
otu=t(data)
envir=read.table("environment.txt", header=TRUE)
rownames(envir)=envir[,1]
env=envir[,-1]
# calculate alpha diversity
library(vegan)
shannon=diversity(otu, index="shannon")
biodata=data.frame(shannon, env[, c("Salinity", "pH", "TN", "TP")])
fit=lm(shannon~Salinity+pH+TN+TP, data=biodata)
summary(fit)
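If you do not have the original otu_table.csv and environment.txt files at hand, the remaining steps can still be followed with a small simulated stand-in (a sketch only; all values and effect sizes below are hypothetical):

set.seed(1)
n=30
biodata=data.frame(Salinity=runif(n, 0, 35), pH=runif(n, 6, 9), TN=runif(n, 0.1, 5), TP=runif(n, 0.01, 0.5))
# a hypothetical Shannon index with some dependence on the simulated factors
biodata$shannon=with(biodata, 2+0.03*Salinity-0.1*pH+0.2*TN+rnorm(n, sd=0.3))
fit=lm(shannon~Salinity+pH+TN+TP, data=biodata)
summary(fit)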
The regression results are as follows:
The overall regression model is significant and the explained variance (R²) is 60%, but only three of the five coefficients are significantly different from zero. In multiple regression, R² generally increases as explanatory variables are added, whether or not those variables are actually related to the response variable, mainly because of chance correlations. More strictly, therefore, R² should be corrected for the degrees of freedom. The most commonly used correction is:

R²adj = 1 − (1 − R²) × (n − 1) / (n − p − 1)

where n is the number of samples and p the number of explanatory variables. This is known as the Ezekiel formula. The adjusted R² (51%) is already given in the multiple regression output above, and we can also use the RsquareAdj() function in the vegan package to correct R² in multiple-regression-type models (MLR, RDA, etc.), as follows:
library(vegan)
RsquareAdj(fit)
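As a cross-check, the Ezekiel correction can also be computed by hand from the model summary (a minimal sketch; n and p are taken from the fitted model):

r2=summary(fit)$r.squared
n=nrow(biodata)
p=length(coef(fit))-1 # number of explanatory variables, intercept excluded
1-(1-r2)*(n-1)/(n-p-1) # should match the adjusted R-squared from RsquareAdj(fit)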
The multiple regression above does not take interaction terms into account, but a model that includes interactions is often more interesting: an interaction means that the effects of two explanatory variables on the response variable are not independent. For example, the toxicity caused by increasing the concentrations of two heavy metals together can be greater than the toxicity of either metal alone. The code is as follows:
fit2=lm(shannon~Salinity+TN+TP:pH, data=biodata)
summary(fit2)
It can be seen that after using the interaction of TP and pH as a predictor, the significance of the coefficients improves overall, whereas in the first regression the coefficient of TP was not significant. The interaction term can be understood as meaning that the effect of TP on diversity depends on the level of pH, i.e. the effect of TP on the microbial community differs under different pH. More complex multiple linear regression can be realized through RDA analysis.
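Since fit and fit2 are not nested models, they cannot be compared with a simple F-test; one quick alternative (an addition to the original text) is to compare them by AIC, where the lower value indicates the better trade-off between fit and complexity:

AIC(fit, fit2)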
⑵ Regression diagnostics
We can first apply the diagnostic plots used for univariate regression as a quick check, as follows:
par(mfrow=c(2,2))
plot(fit)
In R, the car package provides more detailed diagnostic functions for regression models; below we evaluate the multiple regression model in detail.
① Normality
Normality can be tested by checking whether the studentized residuals follow a t-distribution, as follows:
library(car)
par(mfrow=c(1,1))
qqPlot(fit, labels=names(shannon), id.method="identify", simulate=TRUE)
With simulate=TRUE, the 95% confidence envelope is generated by a parametric bootstrap, which treats the sample as the population and repeatedly resamples with replacement to estimate the distribution of the parameters. As shown in the figure below, no points fall outside the confidence envelope, i.e. the residuals are consistent with normality:
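As a complementary numeric check (an addition to the original workflow), the Shapiro-Wilk test can be applied to the model residuals; a p value above 0.05 is consistent with normally distributed errors:

shapiro.test(residuals(fit))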
② Independence of residuals
Next, to check whether the residuals are autocorrelated, we can use the durbinWatsonTest() function to perform the Durbin-Watson test, as shown below:
durbinWatsonTest(fit, simulate=TRUE, reps=999)
The parameter reps sets the number of bootstrap replicates. The resulting p value is just above 0.05, so the null hypothesis of zero autocorrelation cannot be rejected, indicating that the residuals are independent. However, since this p value is small, the residuals may still not be truly independent: conceivably, the lower the diversity index, the smaller the error range and the closer the predicted value is to the observed value, so the residuals may in fact not be independent.
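If the observations have a meaningful order (e.g. a sampling sequence; this is an assumption, since the original text does not say), residual autocorrelation can also be inspected visually:

acf(residuals(fit), main="Autocorrelation of model residuals")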
③ Linearity
Whether the response variable has a linear relationship with the explanatory variables can be checked with component+residual plots (partial residual plots), as follows:
crPlots(fit)
As shown in the figure below, a component+residual plot takes each predictor as the x-axis and, as the y-axis, the residual of the whole model plus the product of that predictor and its coefficient (i.e. the part of the fitted value attributable to that variable). If all the plots are linear, the linear relationship is good; if the component+residual plot of some variable is nonlinear, a polynomial term for that variable should be added.
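For illustration, if the component+residual plot of, say, TN were curved, one hypothetical remedy is to add a quadratic term for that variable and refit (a sketch, not part of the original analysis):

fit_nl=lm(shannon~Salinity+pH+poly(TN, 2)+TP, data=biodata)
summary(fit_nl)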
④ Homoscedasticity
The ncvTest() function can be used to test for constant error variance, as shown below:
ncvTest(fit)
The null hypothesis of this test is that the error variance is constant; the p value is greater than 0.05, so the homoscedasticity test passes. In addition, the spreadLevelPlot() function plots the residuals against the fitted values and gives a suggested data transformation:
spreadLevelPlot(fit)
As shown in the figure below, the spread of the residuals first increases and then decreases with the fitted values (judging from the actual data, this is probably due to unevenness in the variables), and the suggestion given by the test is a power transformation of the response variable, raising it to the power 2.59.
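For illustration, the suggested transformation can be applied directly and the homoscedasticity test repeated (a sketch using the 2.59 reported above; the exponent from your own spreadLevelPlot() run may differ):

fit_pt=lm(I(shannon^2.59)~Salinity+pH+TN+TP, data=biodata)
ncvTest(fit_pt)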
⑤ Multicollinearity
When several explanatory variables are used in a regression model, the overall model is sometimes highly significant while the tests of the individual regression coefficients are not; this is likely a multicollinearity problem, i.e. strong correlation among the explanatory variables. Because a regression coefficient measures the effect of one explanatory variable on the response while the other explanatory variables are held fixed, collinearity inflates the confidence intervals of the regression parameters, making individual coefficients difficult to interpret.
In ecological analyses there is very likely to be collinearity among environmental factors. This matters greatly for methods based on multiple regression, such as RDA, CCA and CAP, because these methods use regression coefficients as the index of an explanatory variable's influence, and VPA analysis needs collinearity removed in order to test the significance of each variance fraction; the same applies to LDA analysis. In the RDA analysis of section 3.3.2.1 we used the statistic VIF (variance inflation factor) for detection. The VIF measures how much the confidence interval of a regression parameter is inflated relative to a model whose explanatory variables are uncorrelated; generally, VIF > 4 is taken to indicate a multicollinearity problem. The test is as follows:
vif(fit)
The results show that collinearity is not a serious problem here.
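If vif() did flag a problem, a simple first step (not from the original text) is to inspect the pairwise correlations among the explanatory variables to locate the collinear pair:

cor(biodata[, c("Salinity", "pH", "TN", "TP")])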
⑥ Screening unusual observations
Observations that the model predicts poorly (large residuals in the response variable) are called outliers; observations with unusual values of the predictor variables are high-leverage points; and observations that have an excessive influence on the model parameters, such that removing them would change the model substantially, are called influential points. For a model, we naturally hope that every point carries roughly the same weight; generally speaking, an influential point is both an outlier and a high-leverage point. The diagnostics above have already touched on detecting outliers and high-leverage points; some further detection methods follow:
# detect outliers (Bonferroni-corrected p < 0.05 indicates an outlier)
outlierTest(fit)
# visual method for detecting unusual points
influencePlot(fit, id.method="identify", ylim=c(-3,3))
In the plot above, the x-axis is the hat statistic: the larger it is, the higher the leverage. The y-axis is the studentized residual: the larger its absolute value, the more likely the point is an outlier. The size of each circle is proportional to the influence of that point.
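Cook's distance is another common measure of influence that complements influencePlot(); a rough rule of thumb (an addition to the original text) flags points whose distance exceeds 4/(n-p-1):

cookd=cooks.distance(fit)
cutoff=4/(nrow(biodata)-length(coef(fit))) # n minus (p + 1)
plot(cookd, type="h", ylab="Cook's distance")
abline(h=cutoff, lty=2)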
This concludes "how to realize multiple linear regression in R language". Thank you for reading. If you want to learn more about the field, you can follow the website, where the editor will publish more high-quality practical articles!