Python Mathematical Modeling StatsModels Statistical regression Model data preparation 07/02 Update SLTechnology News&Howtos


2025-07-02 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/03 Report--

This article introduces how to prepare data for a statistical regression model built with StatsModels in Python mathematical modeling. Many readers have questions about this step, so the editor consulted various materials and collected simple, easy-to-use methods. Hopefully they help answer those questions. Let's get started.

Catalogue

1. Read the data file

(1) read the .csv file:

(2) read the .xls file:

(3) read .txt file:

2. Split and merge of data files

(1) split the Excel file into multiple files

(2) merge multiple Excel files into one file

3. Data preprocessing.

(1) processing of missing data

(2) processing of duplicate data

(3) abnormal value handling

4. Python routine (Statsmodels)

4.1 problem description

4.2 Python program

4.3 results of running the program:

1. Read the data file

The data used in regression analysis is saved in the data file, so the first step is to read the data from the data file.

Data files come in many formats. The most commonly used are .csv, .xls and .txt files; data can also be read from an SQL database.
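The SQL case is only mentioned in passing, so here is a minimal sketch of it, assuming a SQLite database. The table name `sales`, its columns and its values are hypothetical and exist only to make the example self-contained; a real project would connect to its own database instead of an in-memory one.

```python
import sqlite3
import pandas as pd

# Build a throwaway in-memory database (hypothetical table and values)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (period INTEGER, price REAL, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(1, 3.5, 7.0), (2, 3.6, 8.0), (3, 3.7, 9.0)],
)
conn.commit()

# Read the result of a query straight into a DataFrame
df_sql = pd.read_sql_query("SELECT * FROM sales ORDER BY period", conn)
conn.close()
print(df_sql)
```

Once the query result is in a DataFrame, it can be preprocessed in the same way as data read from a file.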

Importing data from a data file is simplest with pandas, as shown in the following examples:

(1) read the .csv file:

    import pandas as pd

    df = pd.read_csv("./example.csv", engine="python", encoding="utf_8_sig")
    # engine="python" allows file paths that contain Chinese characters
    # encoding="utf_8_sig" allows reading Chinese data

(2) read the .xls file:

    df = pd.read_excel("./example.xls", sheet_name='Sheet1', header=0)
    # sheet_name selects the sheet to read; header=0 means the first row is the header row

(3) read the .txt file:

    df = pd.read_table("./example.txt", sep="\t", header=None)
    # sep is the delimiter; header=None means there is no header row and the first row is data

2. Split and merge of data files

Statistical regression may involve a very large amount of data. When necessary, the data files can be split or merged, and this can also be done with pandas. Examples are as follows:
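The row arithmetic behind splitting and merging can be tried on an in-memory DataFrame before touching real files. The sketch below is only an illustration: the helper `split_rows`, the column names and the chunk size are hypothetical and not part of the original routine.

```python
import pandas as pd

def split_rows(df, chunk_size):
    """Return consecutive row blocks of at most chunk_size rows each."""
    return [df.iloc[start:start + chunk_size]
            for start in range(0, len(df), chunk_size)]

# Hypothetical data: 25 rows, split into blocks of 10 rows
df_all = pd.DataFrame({"period": range(1, 26), "sales": range(101, 126)})
parts = split_rows(df_all, chunk_size=10)

# Merge the blocks again and rebuild a clean 0..n-1 index
df_merged = pd.concat(parts, ignore_index=True)
print([len(p) for p in parts])
print(df_merged.equals(df_all))
```

Stepping through `range(0, len(df), chunk_size)` avoids an empty trailing block when the row count is an exact multiple of the chunk size, and `ignore_index=True` gives the merged frame a fresh index instead of repeating the index of each part.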

(1) split the Excel file into multiple files

    # split the Excel file into multiple files
    import pandas as pd

    dfData = pd.read_excel('./example.xls', sheet_name='Sheet1')
    nRow, nCol = dfData.shape  # number of rows and columns
    # assume the data has 198000 rows and is split into 20 files of 10000 rows each
    for i in range(0, int(nRow/10000) + 1):
        saveData = dfData.iloc[i*10000:(i+1)*10000, :]  # the i-th block of 10000 rows
        fileName = './example_{}.xls'.format(str(i))
        # note: writing .xls needs the xlwt engine, which pandas 2.0 removed; use .xlsx there
        saveData.to_excel(fileName, sheet_name='Sheet1', index=False)

(2) merge multiple Excel files into one file

    # merge multiple Excel files into one file
    import pandas as pd

    ## merge two Excel files
    # data1 = pd.read_excel('./example_0.xls', sheet_name='Sheet1')
    # data2 = pd.read_excel('./example_1.xls', sheet_name='Sheet1')
    # data = pd.concat([data1, data2])

    # merge multiple Excel files
    dfData = pd.read_excel('./example_0.xls', sheet_name='Sheet1')
    for i in range(1, 20):
        fileName = './example_{}.xls'.format(str(i))
        dfNew = pd.read_excel(fileName)
        dfData = pd.concat([dfData, dfNew])
    dfData.to_excel('./example.xls', index=False)

    # === Follow youcans, sharing the original series: https://blog.csdn.net/youcans ===

3. Data preprocessing

In practice, the raw data should be preprocessed before model fitting and analysis. Preprocessing includes handling missing values, duplicate data and abnormal values, converting variable formats, splitting off a training set, and standardizing or normalizing the data.

Many aspects of data preprocessing are beyond the scope of Statsmodels. Only the most basic methods are introduced here:

(1) processing of missing data

Imported data often contains missing values, and the simplest treatment is to delete the rows that contain them. Use .dropna() in pandas to delete rows or columns with missing values, or to delete only the rows in which specific columns have missing values.

    dfNew = dfData.dropna(axis=0)  # delete rows with missing values
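The three variants of `.dropna()` described above can be compared side by side. This is a small sketch on made-up data; the column names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing price and one missing advertising value
df_raw = pd.DataFrame({
    "price": [3.5, np.nan, 3.7, 3.8],
    "advertise": [5.0, 6.0, np.nan, 7.0],
    "sales": [7.0, 8.0, 9.0, 10.0],
})

rows_complete = df_raw.dropna(axis=0)          # drop rows containing any missing value
cols_complete = df_raw.dropna(axis=1)          # drop columns containing any missing value
price_known = df_raw.dropna(subset=["price"])  # only require 'price' to be present
print(rows_complete.shape, cols_complete.shape, price_known.shape)
```

Restricting the check with `subset` keeps more rows when only some columns are actually used in the model.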

Missing values are sometimes filled or replaced instead; those methods are not covered here.

(2) processing of duplicate data

Duplicate rows are usually deleted. In pandas, .duplicated() finds duplicate rows and .drop_duplicates() deletes them; both can also be restricted to specified columns.
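A short sketch of both calls on made-up data follows; the column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical data: row 2 repeats row 1 exactly, row 4 repeats only the period of row 3
df_raw = pd.DataFrame({
    "period": [1, 2, 2, 3, 3],
    "sales": [7.0, 8.0, 8.0, 9.0, 9.5],
})

dup_mask = df_raw.duplicated()        # True for rows that repeat an earlier row
df_unique = df_raw.drop_duplicates()  # returns a new frame without full duplicates
# deduplicate on one column only, keeping the first row for each period
df_by_period = df_raw.drop_duplicates(subset=["period"], keep="first")
print(dup_mask.tolist())
print(len(df_unique), len(df_by_period))
```

`drop_duplicates()` returns the deduplicated frame by default. With `inplace=True` it modifies the frame and returns `None`, so its result should not be assigned in that case.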

    dfNew = dfData.drop_duplicates()  # delete duplicate rows

(3) abnormal value handling

The data may contain abnormal values, also known as outliers: values in one sample that deviate significantly from the observations of the other samples in the data set. Outliers can be identified with box plots and normal distribution plots, or through regression and clustering models.

A box plot uses the quantiles of the data to identify outliers. Box plot analysis is beyond the scope of this article and is not described in detail. In general, inspecting the box plot shows the overall picture of abnormal values, from which the outliers can then be located.

    dfData.boxplot()  # draw a box plot
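The quantile rule that a box plot draws can also be computed directly. The sketch below assumes the common 1.5 × IQR whisker rule, which is the matplotlib default; the sales values are made up for illustration.

```python
import pandas as pd

# Hypothetical sales values with one suspicious observation
sales = pd.Series([7.4, 7.5, 7.9, 8.1, 8.3, 8.5, 8.8, 9.3, 15.0], name="sales")

q1 = sales.quantile(0.25)  # lower quartile
q3 = sales.quantile(0.75)  # upper quartile
iqr = q3 - q1              # interquartile range
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the whiskers are flagged as candidate outliers
is_outlier = (sales < lower) | (sales > upper)
print(sales[is_outlier])
```

A flagged value is only a candidate: whether it is a data error or a genuine extreme observation still has to be judged case by case.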

Abnormal values should usually not be deleted outright; how to treat them depends on the specific situation. With .drop() in pandas you can delete outlier rows directly, or select the outlier rows by a condition and delete them.
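Deleting rows by condition can be written in two equivalent ways. This sketch uses made-up data in which the value -1 in column `A` marks an abnormal row; the data and the marker are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data: A == -1 marks rows that should be removed
df_raw = pd.DataFrame({"A": [1, -1, 2, -1, 3], "B": [10, 20, 30, 40, 50]})

# Variant 1: collect the index labels of all matching rows and drop them
df_dropped = df_raw.drop(index=df_raw[df_raw["A"] == -1].index)

# Variant 2: keep the complement with a boolean mask
df_masked = df_raw[df_raw["A"] != -1]
print(df_dropped)
```

Passing the whole `.index` removes every matching row, whereas `.index[0]` would remove only the first match.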

    # delete by row; drop() defaults to axis=0 (rows)
    dfNew = dfData.drop(labels=0)  # delete by row label: removes the row with label 0
    dfNew = dfData.drop(index=dfData[dfData['A'] == -1].index[0])  # search by condition: removes the first row where dfData['A'] == -1

4. Python routine (Statsmodels)

4.1 problem description

The data file contains the sales volume, price, advertising cost and market average price of a toothpaste over the past 30 months.

(1) analyze the relationship between toothpaste sales, price and advertising investment, and establish a mathematical model.

(2) estimate the parameters of the mathematical model and make statistical analysis.

(3) use the fitted model to predict toothpaste sales under different prices and advertising costs.

This problem and its data come from: Jiang Qiyuan, Xie Jinxing, Mathematical Model (3rd Edition), Higher Education Press.

Note that this routine is not the best method or result for solving the problem; it only uses the problem and its data to demonstrate reading data files and processing data.

4.2 Python program

    # LinearRegression_v3.py
    # v1.0: call statsmodels to achieve univariate linear regression
    # v2.0: call statsmodels to achieve multiple linear regression
    # v3.0: read data samples from files
    # date: 2021-05-0
    # Copyright 2021 YouCans XUPT

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    import matplotlib.pyplot as plt

    def main():
        # read data file
        readPath = "../data/toothpaste.csv"  # path and file name of the data file
        try:
            if (readPath[-4:] == ".csv"):
                dfOpenFile = pd.read_csv(readPath, header=0, sep=",")  # separator is a comma, first row is the header row
                # dfOpenFile = pd.read_csv(filePath, header=None, sep=",")  # sep: separator, no header row
            elif (readPath[-4:] == ".xls") or (readPath[-5:] == ".xlsx"):  # sheet_name defaults to 0
                dfOpenFile = pd.read_excel(readPath, header=0)  # first row is the header row
                # dfOpenFile = pd.read_excel(filePath, header=None)  # no header row
            elif (readPath[-4:] == ".dat"):  # sep: separator, header: whether the first row is the header row
                dfOpenFile = pd.read_table(readPath, sep=" ", header=0)  # separator is a space, first row is the header row
                # dfOpenFile = pd.read_table(filePath, sep=",", header=None)  # separator is a comma, no header row
            else:
                print("unsupported file format.")
            print(dfOpenFile.head())
        except Exception as e:
            print("failed to read data file: {}".format(str(e)))
            return

        # data preprocessing
        dfData = dfOpenFile.dropna()  # delete rows with missing values
        print(dfData.dtypes)  # view the data type of each column
        print(dfData.shape)  # view the number of rows and columns
        # colNameList = dfData.columns.tolist()  # convert the column names to a list
        # print(colNameList)  # view the list of column names
        # featureCols = ['price', 'average', 'advertise', 'difference']  # select columns: list of independent variable names
        # X = dfData[['price', 'average', 'advertise', 'difference']]  # independent variable data set built from that list

        # prepare modeling data: relationship between dependent variable y (sales) and independent variables x1~x4
        y = dfData.sales  # dependent variable data set
        x0 = np.ones(dfData.shape[0])  # intercept column x0 = [1, ..., 1]
        x1 = dfData.price  # sales price
        x2 = dfData.average  # market average price
        x3 = dfData.advertise  # advertising cost
        x4 = dfData.difference  # price difference, x4 = x2 - x1 (as in the sample rows)
        X = np.column_stack((x0, x1, x2, x3, x4))  # [x0, x1, ..., x4]

        # model building and parameter estimation
        # Model 1: Y = b0 + b1*X1 + b2*X2 + b3*X3 + b4*X4 + e
        model = sm.OLS(y, X)  # build the OLS model
        results = model.fit()  # fit the model and return the results
        yFit = results.fittedvalues  # fitted y values
        print(results.summary())  # output the regression summary
        print("\nOLS model: y = b0 + b1*X + ... + bm*Xm")
        print('Parameters: ', results.params)  # output: coefficients of the fitted model

        # plot the fitting result
        fig, ax = plt.subplots(figsize=(10, 8))
        ax.plot(range(len(y)), y, 'bo', label='sample')
        ax.plot(range(len(yFit)), yFit, 'r--', label='predict')
        ax.legend(loc='best')  # display the legend
        plt.show()  # YouCans XUPT
        return

    if __name__ == '__main__':
        main()

4.3 results of running the program:

       period  price  average  advertise  difference  sales
    0       1   3.85     3.80       5.50       -0.05   7.38
    1       2   3.75     4.00       6.75        0.25   8.51
    2       3   3.70     4.30       7.25        0.60   9.52
    3       4   3.70     3.70       5.50        0.00   7.50
    4       5   3.60     3.85       7.00        0.25   9.33

                                OLS Regression Results
    ==============================================================================
    Dep. Variable:                  sales   R-squared:                       0.895
    Model:                            OLS   Adj. R-squared:                  0.883
    Method:                 Least Squares   F-statistic:                     74.20
    Date:                Fri, 07 May 2021   Prob (F-statistic):           7.12e-13
    Time:                        11:51:52   Log-Likelihood:                 3.3225
    No. Observations:                  30   AIC:                             1.355
    Df Residuals:                      26   BIC:                             6.960
    Df Model:                           3
    Covariance Type:            nonrobust
    ==============================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
    ------------------------------------------------------------------------------
    const          8.0368      2.480      3.241      0.003       2.940      13.134
    x1            -1.1184      0.398     -2.811      0.009      -1.936      -0.300
    x2             0.2648      0.199      1.332      0.195      -0.144       0.674
    x3             0.4927      0.125      3.938      0.001       0.236       0.750
    x4             1.3832      0.288      4.798      0.000       0.791       1.976
    ==============================================================================
    Omnibus:                        0.141   Durbin-Watson:                   1.762
    Prob(Omnibus):                  0.932   Jarque-Bera (JB):                0.030
    Skew:                           0.052   Prob(JB):                        0.985
    Kurtosis:                       2.885   Cond. No.                     2.68e+16
    ==============================================================================

    OLS model: y = b0 + b1*X + ... + bm*Xm
    Parameters:  const    8.036813
    x1      -1.118418
    x2       0.264789
    x3       0.492728
    x4       1.383207
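The summary reports Df Model: 3 and Cond. No. 2.68e+16 even though a constant and four regressors were supplied. That is consistent with one column being a linear combination of the others. The sketch below checks this with NumPy on the five sample rows printed by the program, under the assumption that the `difference` column equals the market average price minus the price, as rows such as 4.30 - 3.70 = 0.60 suggest.

```python
import numpy as np

# First five sample rows: price, market average price, advertising cost
price = np.array([3.85, 3.75, 3.70, 3.70, 3.60])
average = np.array([3.80, 4.00, 4.30, 3.70, 3.85])
advertise = np.array([5.50, 6.75, 7.25, 5.50, 7.00])
difference = average - price  # assumption: how the 'difference' column is defined

# Same column layout as the routine: [x0, x1, x2, x3, x4]
X = np.column_stack((np.ones(5), price, average, advertise, difference))
rank = np.linalg.matrix_rank(X)
print(X.shape[1], rank)
```

A rank below the number of columns means the individual coefficients of `price`, `average` and `difference` are not uniquely determined, so one of the three would normally be left out of the model. This matches the earlier remark that the routine is a data-handling demonstration rather than the best solution.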
