
How to implement linear regression with scikit-learn and pandas


This article explains in detail how to use scikit-learn and pandas to implement linear regression. I hope that after reading it you will have a working understanding of the relevant techniques.

1. Get the data and define the problem

Without data there is, of course, no machine learning to study. :) Here we use a dataset published in the UCI Machine Learning Repository to run our linear regression.

It contains readings from a combined cycle power plant: 9568 samples in total, each with 5 columns, namely AT (ambient temperature), V (exhaust vacuum), AP (ambient pressure), RH (relative humidity), and PE (net electrical output). We need not dwell on the exact physical meaning of each.

Our problem is to find a linear relationship in which PE is the sample output and the four columns AT/V/AP/RH are the sample features. The goal of the machine learning step is to obtain a linear regression model, that is:

PE = θ0 + θ1∗AT + θ2∗V + θ3∗AP + θ4∗RH

The five parameters to be learned are θ0, θ1, θ2, θ3, and θ4.
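To make these parameters concrete, here is a minimal NumPy sketch (my own, not part of this tutorial's code) of the least-squares fit that scikit-learn will carry out for us later; the array names X_feat and pe are hypothetical.

import numpy as np

# Minimal sketch: closed-form least-squares fit of
# PE = theta0 + theta1*AT + theta2*V + theta3*AP + theta4*RH.
# X_feat is a hypothetical (n, 4) array of AT/V/AP/RH values; pe is the (n,) output.
def fit_theta(X_feat, pe):
    n = X_feat.shape[0]
    X_design = np.hstack([np.ones((n, 1)), X_feat])  # prepend a column of 1s for theta0
    # Solve min ||X_design @ theta - pe||^2; lstsq is numerically safer than
    # inverting the normal equations (X^T X)^{-1} X^T y directly.
    theta, *_ = np.linalg.lstsq(X_design, pe, rcond=None)
    return theta  # theta[0] is the intercept, theta[1:] the four coefficients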

2. Organize the data

The downloaded data is a compressed file; after decompression you will find an xlsx file inside. Open it with Excel, then "Save as" csv format. We will use this csv file to run the linear regression.
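If you would rather skip the manual Excel step, pandas can do the conversion directly. This is an optional sketch, assuming the openpyxl engine is installed; the workbook name below is an assumption, so adjust it to whatever the archive actually unpacks to.

import pandas as pd

# Optional: convert the workbook to csv without opening Excel.
# The xlsx file name is an assumption; adjust it to the actual unpacked file.
pd.read_excel('./CCPP/Folds5x2_pp.xlsx').to_csv('./CCPP/ccpp.csv', index=False)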

Opening this csv, you can see that the data is already clean, with no illegal values, so no preprocessing is needed. The data is not standardized, however, that is, transformed to mean 0 and variance 1. We do not need to do this ourselves: ordinary least squares works fine on the raw features, so we can run the linear regression directly.
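For reference only, standardizing would look like the sketch below, using scikit-learn's StandardScaler; X stands for the feature DataFrame we build in section 4. The tutorial itself skips this step.

from sklearn.preprocessing import StandardScaler

# Optional aside: rescale each feature to mean 0 and variance 1.
# Ordinary least squares does not require it, so the tutorial skips this step.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # X is the feature DataFrame from section 4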

Well, with the data in csv format, we are ready to get to work.

3. Use pandas to read data

Let's open ipython notebook and create a new notebook. You could of course type directly into Python's interactive command line, but the notebook is recommended. The examples and outputs below were all run in a notebook.

First, import the required libraries:

import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import pandas as pd
from sklearn import datasets, linear_model

Then we can read the data with pandas:

# The argument to read_csv is the path of the csv file on your machine;
# here the file sits in the CCPP directory under the notebook directory.
data = pd.read_csv('./CCPP/ccpp.csv')

Test whether the data was read successfully:

# Read the first five rows of data; for the last five rows, use data.tail()
data.head()

The result should look like the following; seeing this output means pandas read the data successfully:

AT V AP RH PE

0 8.34 40.77 1010.84 90.01 480.48

1 23.64 58.49 1011.40 74.20 445.75

2 29.74 56.90 1007.15 41.91 438.76

3 19.07 49.69 1007.22 76.79 453.09

4 11.80 40.66 1017.13 97.20 464.43

4. Prepare the data to run the algorithm

Let's look at the dimensions of the data:

data.shape

The result is (9568, 5), meaning we have 9568 samples, each with five columns.

Now we prepare the sample features X, using the four columns AT, V, AP and RH as features.

X = data[['AT', 'V', 'AP', 'RH']]
X.head()

You can see that the first five outputs of X are as follows:

AT V AP RH

0 8.34 40.77 1010.84 90.01

1 23.64 58.49 1011.40 74.20

2 29.74 56.90 1007.15 41.91

3 19.07 49.69 1007.22 76.79

4 11.80 40.66 1017.13 97.20

Then we prepare the sample output y, using PE as the output.

y = data[['PE']]
y.head()

You can see that the first five outputs of y are as follows:

PE

0 480.48

1 445.75

2 438.76

3 453.09

4 464.43

5. Divide the data into training and test sets

We split the samples in X and y into two parts, a training set and a test set. The code is as follows:

from sklearn.model_selection import train_test_split  # formerly sklearn.cross_validation
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

Check the dimensions of the training set and the test set:

print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

The results are as follows:

(7176, 4)

(7176, 1)

(2392, 4)

(2392, 1)

As you can see, 75% of the samples went to the training set and 25% to the test set.
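That 75/25 split is train_test_split's default; to make it explicit, or to choose another ratio, pass test_size. A small variation on the call above:

# Equivalent to the call above, with the default split spelled out.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1)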

6. Run the scikit-learn linear model

Finally, it is time to fit our problem with scikit-learn's linear model. scikit-learn's linear regression is implemented with ordinary least squares. The code is as follows:

from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
linreg.fit(X_train, y_train)

After fitting, let's look at the model coefficients we were after:

print(linreg.intercept_)
print(linreg.coef_)

The output is as follows:

[447.06297099]

[[-1.97376045 -0.23229086  0.0693515  -0.15806957]]

So we have obtained the five values we set out to learn in step 1. In other words, PE relates to the other four variables as follows:

PE = 447.06297099 − 1.97376045∗AT − 0.23229086∗V + 0.0693515∗AP − 0.15806957∗RH
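With the fitted model in hand we can predict PE for a new operating point. The input values below are invented for illustration, not taken from the dataset:

# Predict PE for one hypothetical operating point (the values are made up).
new_point = pd.DataFrame([[15.0, 45.0, 1012.0, 80.0]],
                         columns=['AT', 'V', 'AP', 'RH'])
print(linreg.predict(new_point))  # a one-row array holding the predicted PE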

7. Model evaluation

We need to evaluate the quality of our model. For linear regression, we usually judge a model by its Mean Squared Error (MSE) or Root Mean Squared Error (RMSE) on the test set.
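Concretely, MSE is the mean of the squared gaps between the true and predicted values, and RMSE is its square root. A hand-rolled sketch of what metrics.mean_squared_error computes:

import numpy as np

# Hand-rolled MSE and RMSE, matching scikit-learn's metrics.mean_squared_error.
def mse(y_true, y_hat):
    y_true, y_hat = np.asarray(y_true), np.asarray(y_hat)
    return np.mean((y_true - y_hat) ** 2)

def rmse(y_true, y_hat):
    return np.sqrt(mse(y_true, y_hat))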

Let's compute the MSE and RMSE of our model. The code is as follows:

# Predict on the test set with the fitted model
y_pred = linreg.predict(X_test)
from sklearn import metrics
# Calculate MSE with scikit-learn
print("MSE:", metrics.mean_squared_error(y_test, y_pred))
# Calculate RMSE with scikit-learn
print("RMSE:", np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

The output is as follows:

MSE: 20.0804012021

RMSE: 4.48111606657

Once we can compute MSE or RMSE, we have a way to choose between models: if different feature sets or methods give different coefficients, we keep the model whose test-set MSE is smaller.
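That comparison procedure can be wrapped in a small helper function. This is a sketch of my own, not code from the original tutorial; the example that follows is exactly one pass through it.

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics

# Fit a linear model on the given feature columns and return its test-set MSE,
# so different feature subsets can be compared with one call each.
def test_mse(data, feature_cols, target_col='PE'):
    X = data[feature_cols]
    y = data[[target_col]]
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
    model = LinearRegression().fit(X_train, y_train)
    return metrics.mean_squared_error(y_test, model.predict(X_test))

For instance, test_mse(data, ['AT', 'V', 'AP', 'RH']) versus test_mse(data, ['AT', 'V', 'AP']) reproduces the comparison below.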

For example, this time we use only the three columns AT, V and AP as the sample features, dropping RH; the output is still PE. The code is as follows:

X = data[['AT', 'V', 'AP']]
y = data[['PE']]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
linreg.fit(X_train, y_train)
# Predict on the test set with the fitted model
y_pred = linreg.predict(X_test)
from sklearn import metrics
# Calculate MSE with scikit-learn
print("MSE:", metrics.mean_squared_error(y_test, y_pred))
# Calculate RMSE with scikit-learn
print("RMSE:", np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

The output is as follows:

MSE: 23.2089074701

RMSE: 4.81756239919

As you can see, after dropping RH the fit is worse than with RH included: the MSE becomes larger.

8. Cross validation

We can keep improving the model through cross-validation. The code below uses 10-fold cross-validation, that is, the cv parameter of cross_val_predict is 10:

X = data[['AT', 'V', 'AP', 'RH']]
y = data[['PE']]
from sklearn.model_selection import cross_val_predict
predicted = cross_val_predict(linreg, X, y, cv=10)
# Calculate MSE with scikit-learn
print("MSE:", metrics.mean_squared_error(y, predicted))
# Calculate RMSE with scikit-learn
print("RMSE:", np.sqrt(metrics.mean_squared_error(y, predicted)))

The output is as follows:

MSE: 20.7955974619

RMSE: 4.56021901469

As you can see, the MSE of the cross-validated model is larger than that in section 6. The main reason is that here the MSE is computed over the out-of-fold predictions for all samples, while section 6 computed MSE only on the 25% test set; the two numbers are therefore not measured under the same conditions.
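If you want the fold-by-fold picture rather than one pooled number, cross_val_score reports each fold's error separately. This is a sketch, not part of the original tutorial; note that scikit-learn's scoring convention returns negated MSE, so we flip the sign:

from sklearn.model_selection import cross_val_score

# Per-fold MSE: the 'neg_mean_squared_error' scorer returns negative values,
# so negate to recover the usual positive MSE for each of the 10 folds.
scores = -cross_val_score(linreg, X, y, cv=10,
                          scoring='neg_mean_squared_error')
print(scores)         # ten per-fold MSE values
print(scores.mean())  # their average, roughly comparable to the pooled figure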

9. Plot and observe the results

Here we plot the true values against the predicted values: the closer a point lies to the diagonal line in the middle, the smaller its prediction error. The code is as follows:

fig, ax = plt.subplots()
ax.scatter(y, predicted)
ax.plot([y.min(), y.max()], [y.min(), y.max()], 'k--', lw=4)
ax.set_xlabel('Measured')
ax.set_ylabel('Predicted')
plt.show()

The output is a scatter plot of measured versus predicted PE, with the points clustered along the diagonal reference line.
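A further diagnostic, not in the original tutorial, is to look at the residuals themselves; a roughly symmetric histogram centred on zero supports the linear fit:

import numpy as np
import matplotlib.pyplot as plt

# Histogram of residuals: true PE minus the cross-validated prediction.
residuals = np.asarray(y).ravel() - np.asarray(predicted).ravel()
plt.hist(residuals, bins=50)
plt.xlabel('PE residual')
plt.ylabel('Count')
plt.show()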

That is all on how to use scikit-learn and pandas to implement linear regression. I hope the content above has been of some help and lets you learn something new. If you found the article good, please share it for more people to see.
