This article explains how to program a linear regression model in Python. The content is simple and clear, and easy to learn and understand.
Interpretability is one of the biggest challenges in machine learning. A model whose decisions are easier to understand is more explainable than one whose decisions are not. Some models are so complex, and their internal structure so intricate, that it is almost impossible to understand how they reach their final result. These black boxes seem to break the link between the raw data and the final output because of the many processes that take place in between.
But among machine learning algorithms, some models are more transparent than others. Decision trees are definitely among them, and so are linear regression models. Their simple and straightforward approach makes them ideal tools for solving a variety of problems. Let's see how.
You can use a linear regression model to analyze how the salary at a given location depends on experience, education, position, city of work, and so on. Similarly, you can analyze whether real estate prices depend on factors such as size, the number of bedrooms or the distance from the city center.
Simple linear regression (SLR)
This is the simplest form of linear regression, used when there is a single input variable (predictor) for the output variable (target):

y = β0 + β1·X + ε
The input or predictor variable is the variable that helps predict the value of the output variable. It is usually called X.
The output or target variable is the variable we want to predict. It is usually called y.
The value of β0 (also known as the intercept) shows the point at which the estimated regression line crosses the y axis, while the value of β1 determines the slope of the estimated regression line. The random error ε describes the random component of the linear relationship between the dependent and independent variables (the disturbance of the model, the part that cannot be explained by X). The true regression model is usually unknown (because we cannot capture all the effects that influence the dependent variable), so the values of the random error terms corresponding to the observed data points remain unknown. However, the regression model can be estimated by calculating its parameters from an observed data set.
The idea behind regression is to estimate the parameters β0 and β1 from a sample. If we can determine the optimal values of these two parameters, we will have the line of best fit, which can be used to predict the value of y given a value of X. In other words, we try to fit a line that captures the observed relationship between the input and output variables, and then use it to predict the output for unseen inputs.
How do we estimate β0 and β1? We can use a method called ordinary least squares (OLS). The goal is to make the distance between the observed data points and the regression line as small as possible, which is achieved by minimizing the squared differences between the actual and predicted results.
The difference between the actual and predicted values is called the residual (e) and can be negative or positive depending on whether the model overestimates or underestimates the result. Simply adding up all the residuals to compute the net error would let positive and negative terms cancel out and understate the net effect. To avoid this, we take the sum of the squares of these errors, which is called the residual sum of squares (RSS).
The ordinary least squares method (OLS) minimizes this residual sum of squares: it chooses the regression line that minimizes the (squared) distance from each observation to the fitted line.
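As a minimal sketch (not part of the original article), the closed-form OLS estimates for simple linear regression, together with the RSS, can be computed with a few lines of NumPy; the toy data below is made up for illustration:

import numpy as np

# toy data: x is the predictor, y the target (made-up values for illustration)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

# closed-form OLS estimates for beta1 (slope) and beta0 (intercept)
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()

# residual sum of squares (RSS) of the fitted line
y_hat = beta0 + beta1 * x
rss = np.sum((y - y_hat) ** 2)
print(beta0, beta1, rss)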
Multiple linear regression (MLR)
Multiple linear regression is a form of linear regression used when there are two or more predictor (input) variables. It is similar to the SLR model described above, but includes additional predictors:

y = β0 + β1·X1 + β2·X2 + … + βn·Xn + ε
Note that this equation is simply an extension of the simple linear regression equation, in which each input/predictor variable has its corresponding slope coefficient (β). The first β term (β0) is the intercept constant: the value of y when all predictors are absent (that is, when all X terms are 0).
As the number of features grows, so does the complexity of our model, and it becomes harder to visualize or even understand the data. Because these models have more parameters than SLR, more care is needed when working with them. Adding more terms will inherently improve the fit to the data, but the new terms may not have any real meaning. This is dangerous because it can produce a model that fits the data well without actually being useful.
An example
The advertising data set contains the sales of a product in 200 different markets, along with advertising budgets for three different media (television, radio, and newspapers). We will use the data set to predict sales (the dependent variable) based on the television, radio, and newspaper advertising budgets (the independent variables).
Mathematically, the formula we will try to solve is:

sales = β0 + β1·TV + β2·radio + β3·newspaper
By minimizing the error function and fitting the best line or hyperplane (depending on the number of input variables), the regression model can find the values of these constants (β). Let's code.
Load the data and describe the dataset
Before loading the data, we will import the necessary libraries:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
from sklearn.metrics import r2_score
import statsmodels.api as sm
Now let's load the dataset:
df = pd.read_csv("Advertising.csv")
Let's understand the dataset and describe it:
df.head()
Since it is not needed, we will drop the first column ('Unnamed: 0'):
df = df.drop(['Unnamed: 0'], axis=1)
df.info()
Our dataset now contains 4 columns (including the target variable "sales"), 200 records, and no missing values. Let's visualize the relationships between the independent variables and the target variable.
sns.pairplot(df)
The relationship between TV and sales appears to be strong, and there seems to be some trend between radio and sales, but the relationship between newspapers and sales appears to be nonexistent. We can also verify this numerically with a correlation heatmap:
mask = np.tril(df.corr())
sns.heatmap(df.corr(), fmt='.1g', annot=True, cmap='cool', mask=mask)
As we expected, the strongest positive correlation occurs between sales and television, while the relationship between sales and newspapers is close to zero.
Select features and target variables
Next, we divide the variables into two groups: the dependent variable (or target, "y") and the independent variables (or features, "X").
X = df.drop(['sales'], axis=1)
y = df['sales']
Split the dataset
To evaluate the performance of the model, it is a good strategy to divide the data set into a training set and a test set. By splitting the dataset into two separate sets, we can train with one set and test the model's performance on unseen data with the other.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
We split the dataset into 70% training and 30% testing. The random_state parameter seeds the internal random number generator that decides how rows are assigned to the training and test indices. I set the random state to 0 so that you can compare the output across multiple runs of the code using the same parameter.
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
By printing the shape of the split set, we see that we have created:
Two sets of 140 records each (70% of the total), one containing the three independent variables and one containing only the target variable, which will be used to train the linear regression model.
Two sets of 60 records each (30% of the total), one containing the three independent variables and one containing only the target variable, which will be used to test the performance of the linear regression model.
Build the model
Modeling is very simple:
mlr = LinearRegression()
Train the model
Fitting the model to the training data is the training part of the modeling process. After training, the model can be called with the predict method to make predictions:
mlr.fit(X_train, y_train)
Let's look at the model's output after training, starting with the value of β0 (the intercept):
mlr.intercept_
We can also print the values of the coefficients (β):
coeff_df = pd.DataFrame(mlr.coef_, X.columns, columns=['Coefficient'])
coeff_df
In this way, we can now estimate the value of "sales" based on different budgets for television, radio and newspapers:
For example, if we determine that the budget value for television is 50, the budget value for radio is 30, and the budget value for newspapers is 10, the estimated value of "sales" will be:
example = [50, 30, 10]
output = mlr.intercept_ + sum(example * mlr.coef_)
output
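Equivalently (a small usage sketch that is not part of the original article), the same estimate can be obtained from the model's predict method, assuming the budgets are passed as a single row in the same column order as X (TV, radio, newspaper):

# same estimate via the fitted model; column order is assumed to match X
example_df = pd.DataFrame([[50, 30, 10]], columns=X.columns)
print(mlr.predict(example_df))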
Test the model
The test dataset is a dataset independent of the training dataset. This test data has not been seen by your model and helps you better understand its ability to generalize:
y_pred = mlr.predict(X_test)
Evaluate performance
The quality of the model depends on how well its predictions match the actual values of the test dataset:
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print('R Squared Score is:', r2_score(y_test, y_pred))
After validating our model against the test set, we get an R² of 0.86, which looks like a pretty good performance score. However, although a higher R² suggests a better fit to the data, a high score is not always a guarantee of a good model. We will look at some ways to interpret and improve regression models below.
How to explain and improve your model?
Okay, we created the model, and now what? Let's take a look at the model statistics on the training data to get some answers:
X2 = sm.add_constant(X_train)
model_stats = sm.OLS(y_train.values.reshape(-1), X2).fit()
model_stats.summary()
Let's take a look at what these numbers mean.
Hypothesis testing
One of the basic questions to answer when running an MLR model is whether at least one of the predictors is useful for predicting the output. What if the relationship between the independent variables and the target is only due to chance, and none of the predictors has an actual impact on sales?
We need to perform a hypothesis test to answer this question and check our assumptions. It all starts with forming a null hypothesis (H0), which states that all the coefficients are equal to zero and there is no relationship between the predictors and the target (meaning that a model with no independent variables fits the data as well as your model):

H0: β1 = β2 = … = βn = 0
On the other hand, we define an alternative hypothesis (Ha), which states that at least one coefficient is not zero and there is a relationship between the predictors and the target (meaning that your model fits the data better than the intercept-only model):

Ha: at least one βj ≠ 0
If we want to reject the null hypothesis and have confidence in our regression model, we need strong statistical evidence. To obtain it, we perform a hypothesis test using the F-statistic.
If the value of the F-statistic is equal to or very close to 1, the result supports the null hypothesis and we cannot reject it.
As we can see in the summary table above, the F-statistic is 439.9, which is far greater than 1 and provides strong evidence against the null hypothesis (that all coefficients are zero). We also need to check the probability of observing this F-statistic, Prob (F-statistic), which is 8.76e-70, an extremely small number well below 1%. This means that, if the null hypothesis were true, the probability of an F-statistic of 439.9 occurring by chance would be far less than 1%.
That said, we can reject the null hypothesis and be confident that at least one of the predictors is useful for predicting the output.
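As a small sketch (assuming the model_stats object fitted above), the F-statistic and its p-value can also be read directly from the statsmodels results rather than from the summary table:

# overall F-test of the regression, taken from the fitted statsmodels results
print('F-statistic:', model_stats.fvalue)
print('Prob (F-statistic):', model_stats.f_pvalue)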
Generating the model
Running a linear regression model with many unrelated variables leads to an unnecessarily complex model. Which predictors are important? Are they all important to our model? To find out, we need to perform a process called feature selection. The two main methods of feature selection are:
Forward selection: predictors are added one at a time, starting with the one most highly correlated with the target variable. Variables of decreasing theoretical importance are then added to the model sequentially until a stopping rule is reached.
Backward elimination: start with all variables in the model, then remove the least statistically significant variable (the one with the largest p-value) at each step until a stopping rule is reached.
Although either method can be used, backward elimination is usually preferred unless the number of predictors exceeds the sample size (or the number of events). A sketch of backward elimination follows.
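This is only an illustrative sketch, not the article's own code: backward elimination can be implemented by repeatedly dropping the predictor with the largest p-value from a statsmodels OLS fit; the 0.05 threshold is an assumed stopping rule.

import statsmodels.api as sm

def backward_elimination(X, y, threshold=0.05):
    # start with all predictors and drop the least significant one at each step
    features = list(X.columns)
    while features:
        X_const = sm.add_constant(X[features])
        p_values = sm.OLS(y, X_const).fit().pvalues.drop('const')
        worst = p_values.idxmax()
        if p_values[worst] > threshold:
            features.remove(worst)   # not significant: remove and refit
        else:
            break                    # all remaining predictors are significant
    return features

selected = backward_elimination(X_train, y_train)
print(selected)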
Assumptions
Because linear regression models are approximations of real-world processes, they need to make some assumptions about the data they represent in order to remain appropriate. Most statistical tests rely on certain assumptions about the variables used in the analysis, and if these assumptions are not met, the results may not be trustworthy (for example, leading to type I or type II errors).
Linear regression models are linear in the sense that the output is a linear combination of the input variables, so they are only suitable for modeling linearly related data. They work under several assumptions that must hold in order to produce appropriate estimates, rather than relying on the accuracy score alone:
Linearity: the relationship between the features and the target must be linear. One way to check this is to visually inspect scatter plots for linearity. If the relationship shown in the scatter plot is not linear, we need to run a nonlinear regression or transform the data.
Homoscedasticity: for any value of X, the variance of the residuals must be the same. Multiple linear regression assumes that the residuals have similar variance at every point of the linear model; this condition is called homoscedasticity. A scatter plot of the residuals is a good way to check it, and there are tests that verify the assumption numerically (for example, Goldfeld-Quandt, Breusch-Pagan, White).
No multicollinearity: the data should not show multicollinearity, which occurs when the independent (explanatory) variables are highly correlated with each other. If this happens, it becomes difficult to identify which specific variable contributes to the variance in the dependent/target variable. This assumption can be tested with the Variance Inflation Factor (VIF) method or through the correlation matrix. Alternative ways to address the problem include centering the data (subtracting the mean score) or performing a factor analysis and rotating the factors to ensure their independence in the linear regression analysis.
No autocorrelation: the residuals should be independent of each other. Correlation among the residuals dramatically reduces the accuracy of the model. If the error terms are correlated, the estimated standard errors tend to underestimate the true standard errors. To test this assumption, you can use the Durbin-Watson statistic.
Normality of residuals: the residuals must be normally distributed. You can check normality with a goodness-of-fit test such as the Kolmogorov-Smirnov or Shapiro-Wilk test, and if the data are not normally distributed, a nonlinear transformation such as a log transformation may solve the problem.
Assumptions are critical because, if they are invalid, the analysis is considered unreliable, unpredictable, and out of control. Failing to meet these assumptions can lead to conclusions that are invalid or scientifically unfounded. A short sketch of how a few of these checks can be run numerically follows.
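This is a hedged sketch, not the article's own code, and it assumes the X_train split and the model_stats results fitted above; it checks a few of these assumptions with statsmodels and scipy:

from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.stattools import durbin_watson
from scipy.stats import shapiro

X_const = sm.add_constant(X_train)

# multicollinearity: VIF per predictor (values well above 5-10 are a warning sign)
for i, col in enumerate(X_train.columns):
    print(col, variance_inflation_factor(X_const.values, i + 1))

# autocorrelation of residuals: Durbin-Watson statistic (values near 2 suggest no autocorrelation)
print('Durbin-Watson:', durbin_watson(model_stats.resid))

# normality of residuals: Shapiro-Wilk test (a small p-value suggests non-normal residuals)
print('Shapiro-Wilk:', shapiro(model_stats.resid))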
Final thoughts
Although MLR models extend the scope of SLR models, they are still linear models: the terms included in the model cannot show any nonlinear relationship between each other or represent any kind of nonlinear trend. You should also be careful when predicting points outside the observed range of the features, since the relationship between the variables may change as you move out of that range (a fact you cannot know, because you do not have the data).
The observed relationship may be locally linear, while unobserved nonlinear relationships may exist outside the range of the data.
Linear models can also model curvature by including nonlinear terms such as polynomials and transformations (for example, exponential or logarithmic functions). The linear regression equation is linear in its parameters, which means you can raise an independent variable to a power to fit a curve and still remain in the "linear world". Linear regression models can include logarithmic and inverse terms to follow different kinds of curves, yet the parameters remain linear.
Even when an independent variable is squared, the model is still linear in its parameters.
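As a brief sketch (an illustration, not the article's code), a model with polynomial terms can be built with scikit-learn's PolynomialFeatures while the estimator itself remains a LinearRegression, i.e., linear in its parameters:

from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

# degree-2 terms (e.g., TV^2, TV*radio) are still combined linearly by the model
poly_model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False), LinearRegression())
poly_model.fit(X_train, y_train)
print(poly_model.score(X_test, y_test))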
Regressions such as polynomial regression can model nonlinear relationships, and while linear equations have one basic form, nonlinear equations can take many different forms. The reason you might consider using a nonlinear regression model is that, although linear regression can model curves, it may not be able to model the specific curve that exists in your data.
You should also know that OLS is not the only way to fit a linear regression model; other optimization methods, such as gradient descent, are better suited to large datasets. Applying OLS to complex and nonlinear problems may not scale, and gradient descent can be computationally cheaper (faster) at finding a solution. Gradient descent is an algorithm for minimizing a function: given a function defined by a set of parameters, it starts from an initial set of parameter values and iteratively moves toward the set of values that minimizes the function. This iterative minimization uses derivatives, taking steps in the negative direction of the function's gradient.
Linear regression using gradient descent
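As a minimal sketch of the idea (not the article's own implementation), batch gradient descent for linear regression repeatedly updates the parameters in the direction opposite to the gradient of the mean squared error; the learning rate, iteration count, and feature standardization below are arbitrary choices for the illustration:

import numpy as np

def gradient_descent_lr(X, y, lr=0.01, n_iters=1000):
    # X: (n_samples, n_features), y: (n_samples,)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, d = X.shape
    # standardize features so a single learning rate works reasonably well;
    # note the returned weights are therefore on the standardized scale
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    weights = np.zeros(d)
    bias = 0.0
    for _ in range(n_iters):
        y_hat = X @ weights + bias
        error = y_hat - y
        # gradients of the mean squared error with respect to weights and bias
        grad_w = (2 / n) * X.T @ error
        grad_b = (2 / n) * error.sum()
        weights -= lr * grad_w
        bias -= lr * grad_b
    return weights, bias

w, b = gradient_descent_lr(X_train, y_train)
print(w, b)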
Another key thing to consider is that outliers can have a huge impact on the regression line and the correlation coefficients. To identify them, it is important to perform exploratory data analysis (EDA) and examine the data for anomalous observations, since they can strongly affect the results of our analysis and statistical modeling. If you identify any, you can impute the outliers (for example, with the mean / median / mode), cap them (replace values that exceed certain limits), or treat them as missing values and predict them.
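One possible capping approach, sketched here as an assumption rather than the article's method, clips each numeric column of the dataframe at the interquartile-range fences with pandas:

# cap each numeric column at the IQR (Tukey) fences; purely illustrative
q1 = df.quantile(0.25)
q3 = df.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df_capped = df.clip(lower=lower, upper=upper, axis=1)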
Finally, some limitations of the linear regression model are:
Omitted variables. A good theoretical model is needed to propose the variables that explain the dependent variable. In the case of a simple two-variable regression, other factors that might explain the dependent variable have to be considered, because there may be other "unobserved" variables that explain the output.
Reverse causality. Many theoretical models predict two-way causality: the dependent variable may cause changes in one or more explanatory variables. For example, higher income may enable people to invest more in their education, which in turn raises their income. This complicates the way regressions are estimated and requires special techniques.
Measurement error. Factors may be measured incorrectly. For example, ability is difficult to measure, and IQ tests have well-known problems. As a result, using IQ in a regression may not properly control for ability, leading to inaccurate or biased estimates of the relationship between variables such as education and income.
The focus is too limited. The regression coefficient provides only information about the relationship between a small change (not a big change) in one variable and a change in another. It shows how small changes in education may affect income, but it does not allow researchers to summarize the impact of larger changes. If everyone receives a college education at the same time, recent college graduates are unlikely to earn more money, because the total supply of college graduates will greatly increase.
Thank you for reading. That concludes "how to program a linear regression model in Python". After working through this article, you should have a deeper understanding of how to build, interpret, and improve a linear regression model in Python.