A Case Study of Linear Regression in R

2025-02-22 Update From: SLTechnology News&Howtos

This article walks through a case analysis of linear regression in R. The approach is simple, fast, and practical, so let's work through it together.

Regression analysis is a widely used statistical tool. Using existing experimental data, it quantitatively describes the relationship between variables through an equation. The variables fall into two categories:

Independent variable, also known as predictive variable

Dependent variable, also known as response variable

There can be multiple independent variables, but only one dependent variable. The essence of regression is to construct an equation relating the dependent variable to the independent variables. Regression analysis has two classical uses: the first is modeling and prediction, where the fitted regression equation is used to predict new data; the second is to quantitatively describe the correlation between variables. GWAS also relies on regression analysis. This article looks first at linear regression.

As the name implies, linear regression uses a linear equation to describe the relationship between variables. By the number of independent variables, it divides into univariate (simple) linear regression and multiple linear regression. Taking univariate linear regression as an example, the equation is as follows:

y = ax + b + ε

where x is the independent variable, y the dependent variable, a is called the regression coefficient, b the regression constant, and ε the error term, also known as the residual. Together, a and b are called the regression parameters, and the purpose of linear regression is to solve for them. Taking the linear relationship between height and weight as an example, the data are as follows.

Its distribution is as follows
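The article's original data table and scatter plot did not survive extraction. A minimal sketch of such a dataset and its plot, using made-up height/weight values that follow a roughly linear trend, might look like:

```r
# Hypothetical height (cm) / weight (kg) data; these values are
# illustrative only, not the author's original table.
set.seed(42)
height <- seq(150, 190, by = 2)
weight <- 0.67 * height - 60 + rnorm(length(height), sd = 2)
df <- data.frame(height = height, weight = weight)

# Scatter plot: the points should fall roughly along a straight line
plot(df$height, df$weight,
     xlab = "height (cm)", ylab = "weight (kg)",
     main = "Height vs. weight")
```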

From the plot it is intuitively clear that the two variables have a linear relationship. The essence of linear regression is to fit the best straight line to the actual data. "Best" matters here: for the same data, many different lines can be fitted, as shown below.

The two lines in the figure look similarly good, so how can we quantitatively compare the fit of different lines and choose the best one?

There are usually two methods. The first is the least squares method, which uses the differences between the actual values and the fitted values, that is, the residuals, to construct a statistic that measures goodness of fit, as shown below.

The scatter points in the figure are the actual observations, the fitted values lie on the straight line, and the line segment between each observation and its fitted value represents the residual. The corresponding statistic is the residual sum of squares, which goes by several names:

Residual sum of squares (RSS)

Sum of squared estimate of errors (SSE)

Sum of squared residuals (SSR)

The calculation formula is as follows:

RSS = Σᵢ (yᵢ − ŷᵢ)²

where yᵢ is the i-th observed value and ŷᵢ the corresponding fitted value.

The RSS can be viewed as a squared Euclidean distance, and the least squares method takes the line with the smallest residual sum of squares as the best line. The second method is the maximum likelihood method, which is based on probability: for a fitted line, compute the probability of the actual observations under that line, use this probability as the measure of fit, and take the line with the highest likelihood as the best fit.
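The least squares comparison can be sketched directly: compute the RSS for two candidate lines and keep the one with the smaller value. The data and both candidate parameter pairs below are made up for illustration.

```r
# Made-up height (cm) / weight (kg) observations
height <- c(150, 155, 160, 165, 170, 175, 180)
weight <- c(48, 51, 55, 57, 62, 64, 68)

# Residual sum of squares for a candidate line w = a*h + b
rss <- function(a, b) sum((weight - (a * height + b))^2)

rss_line1 <- rss(0.67, -52)  # candidate line 1
rss_line2 <- rss(0.60, -40)  # candidate line 2

# Least squares prefers whichever line has the smaller RSS
rss_line1 < rss_line2
```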

The least squares solution can be regarded as a special case of maximum likelihood (under the assumption of normally distributed errors) and can be derived from it. In simple linear regression, least squares is widely used. Taking R as an example, the code for univariate linear regression is as follows.
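The article's original R code block did not survive extraction. A minimal reconstruction of a univariate fit with `lm()`, using made-up data (so the coefficients will not match the 0.6746 quoted below), might look like:

```r
# Illustrative data; not the author's original height/weight table
df <- data.frame(
  height = c(150, 155, 160, 165, 170, 175, 180),
  weight = c(48, 51, 55, 57, 62, 64, 68)
)

# Fit weight as a linear function of height by least squares
fit <- lm(weight ~ height, data = df)
coef(fit)      # (Intercept) and the slope for height
summary(fit)   # residuals, t-tests, residual standard error, R-squared
```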

Intercept corresponds to the regression constant in the regression equation; for the independent variable height, the regression coefficient is 0.6746. Here we obtain the final regression parameters directly, but there are many more details, which can be viewed with summary().

The first is the distribution of the residuals, represented by five numbers: the minimum, the first quartile, the median, the third quartile, and the maximum. In R, the quantile function can be used to compute them.
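The five-number residual summary reported by `summary()` can be reproduced with `quantile()` on the model residuals. The data here are the same illustrative values assumed above, not the author's original table.

```r
# Illustrative data
df <- data.frame(
  height = c(150, 155, 160, 165, 170, 175, 180),
  weight = c(48, 51, 55, 57, 62, 64, 68)
)
fit <- lm(weight ~ height, data = df)

# Five-number summary of the residuals: Min, 1Q, Median, 3Q, Max
quantile(residuals(fit))
```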

The second is the test of the regression parameters: a t-test analyzes the association between each variable and the dependent variable in the regression equation, corresponding to the Pr(>|t|) column; a p-value less than 0.01 is considered significant.
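The Pr(>|t|) column can be pulled out of the coefficient table programmatically, which is convenient when screening many variables. Data are illustrative, as above.

```r
# Illustrative data
df <- data.frame(
  height = c(150, 155, 160, 165, 170, 175, 180),
  weight = c(48, 51, 55, 57, 62, 64, 68)
)
fit <- lm(weight ~ height, data = df)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
pvals <- summary(fit)$coefficients[, "Pr(>|t|)"]
pvals
```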

The third is the residual standard error, a statistic that measures the dispersion of the residuals. The formula is as follows:

RSE = sqrt( RSS / df )

where df is the residual degrees of freedom (n − 2 in univariate linear regression). The residual standard error is obtained by dividing the residual sum of squares by the degrees of freedom and taking the square root, so the best-fitting line should also have the smallest residual standard error.

The fourth is R², also written R-squared. The calculation formula is as follows:

R² = SSR / SST

Here SST is the total sum of squares of the actual observations about their mean, and SSR (in this formula, the regression sum of squares, not the residual sum of squares above) is the sum of squares of the fitted values about the mean; R² is their ratio and ranges from 0 to 1. R² is also called the goodness of fit: the closer the value is to 1, the better the fit. For a solved regression equation, the residual standard error and the R² value are both determined; the best-fitting line has the smallest residual standard error and the largest R².

Besides characterizing the fit, R² has another use: representing the correlation between the independent and dependent variables. This applies only to univariate linear regression, where R² equals the square of the correlation coefficient between the independent variable x and the dependent variable y. Therefore, in single-locus association analysis, loci with strong correlation can be screened by their R² values.
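The identity between R² and the squared correlation coefficient in simple regression is easy to verify numerically. Data are illustrative, as above.

```r
# Illustrative data
df <- data.frame(
  height = c(150, 155, 160, 165, 170, 175, 180),
  weight = c(48, 51, 55, 57, 62, 64, 68)
)
fit <- lm(weight ~ height, data = df)

# In univariate regression, R-squared == cor(x, y)^2
r2 <- summary(fit)$r.squared
all.equal(r2, cor(df$height, df$weight)^2)
```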

There is also an adjusted R², with the following calculation formula:

R²adj = 1 − (1 − R²) × (n − 1) / (n − p − 1)

where n is the number of observations and p the number of independent variables.
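This adjustment can likewise be checked against the `adj.r.squared` value that `summary()` reports. Data are illustrative, as above.

```r
# Illustrative data
df <- data.frame(
  height = c(150, 155, 160, 165, 170, 175, 180),
  weight = c(48, 51, 55, 57, 62, 64, 68)
)
fit <- lm(weight ~ height, data = df)

n  <- nrow(df)  # number of observations
p  <- 1         # number of independent variables
r2 <- summary(fit)$r.squared

# Adjusted R-squared penalizes for the number of predictors
adj <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
all.equal(adj, summary(fit)$adj.r.squared)
```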

The last is the significance test of the whole equation, judged by an F-test. In GWAS, linear regression can be used to analyze the association between SNP loci and continuous phenotypic traits, with the p-value determining significantly associated loci; furthermore, SNP loci with strong association can be screened by R².
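A single-locus association test of the kind described above can be sketched as a regression of a continuous trait on genotype. The genotypes (coded 0/1/2 for minor-allele count) and phenotypes below are simulated, not real GWAS data.

```r
# Simulated single-SNP association: genotype coded 0/1/2
set.seed(1)
geno  <- sample(0:2, 100, replace = TRUE)
pheno <- 0.5 * geno + rnorm(100)   # trait with a true genetic effect

fit <- lm(pheno ~ geno)

# Per-variable t-test p-value for the SNP
pval <- summary(fit)$coefficients["geno", "Pr(>|t|)"]

# Overall F-test p-value for the whole equation
fs     <- summary(fit)$fstatistic
pval_f <- pf(fs[1], fs[2], fs[3], lower.tail = FALSE)

# R-squared, usable to screen strongly associated loci
r2 <- summary(fit)$r.squared
```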

At this point, I believe you have a deeper understanding of linear regression case analysis in R. You might as well try it out in practice. Follow us and keep learning!
