How to use a Python crawler to predict this year's Singles' Day sales


2025-07-09 Update. From: SLTechnology News&Howtos

Shulou(Shulou.com)06/02 Report--

This article explains how to use a Python crawler to predict this year's Singles' Day sales. The method introduced here is simple, fast and practical. Let's walk through it step by step.

NO.1 Singles' Day sales statistics over the years

Taobao/Tmall Double 11 (Singles' Day) sales data were collected from the Internet, in units of 100 million yuan, and organized into a DataFrame with Pandas. A 'year int' column was added for use in later calculations.

import pandas as pd

# Data collected from the Internet: Taobao/Tmall Singles' Day sales over the years,
# in units of 100 million yuan. For demonstration only.
double11_sales = {
    '2009': [0.50], '2010': [9.36], '2011': [34], '2012': [191],
    '2013': [350], '2014': [571], '2015': [912], '2016': [1207],
    '2017': [1682], '2018': [2135], '2019': [2684], '2020': [4982],
}
df = pd.DataFrame(double11_sales).T.reset_index()
df.rename(columns={'index': 'year', 0: 'sales'}, inplace=True)
# Year index 1..12, each wrapped in a list so the column can be fed to scikit-learn as 2-D data
df['year int'] = [[i] for i in range(1, len(df['year']) + 1)]
df

NO.2 Drawing a scatter plot

Using the plotly toolkit, draw a scatter plot of sales by year. The sharp jump in the 2020 figure is immediately visible.

# Scatter plot
import plotly as py
import plotly.graph_objs as go
import numpy as np

year = df[:]['year']
sales = df['sales']
trace = go.Scatter(x=year, y=sales, mode='markers')
data = [trace]
layout = go.Layout(title="2009-2020 Tmall Taobao Singles' Day sales over the years")
fig = go.Figure(data=data, layout=layout)
fig.show()
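To put a number on the 2020 jump seen in the scatter plot, year-over-year growth can be computed with pandas. This is a minimal sketch using the sales figures listed above; the names `sales` and `growth` are illustrative and not part of the original code.

```python
import pandas as pd

# Sales figures from the article, in units of 100 million yuan
sales = pd.Series(
    [0.50, 9.36, 34, 191, 350, 571, 912, 1207, 1682, 2135, 2684, 4982],
    index=[str(y) for y in range(2009, 2021)],
    name="sales",
)

# Year-over-year growth: (this year - last year) / last year
growth = sales.pct_change()

print(growth.round(3))
```

On these figures, 2020 grows by roughly 86% over 2019, against roughly 26% the year before, which is why a trend fitted to 2009-2019 undershoots it.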

NO.3 Building a model with the Scikit-Learn library

Univariate polynomial regression

Let's first look at how well-behaved the data for 2009-2019 are. To start, select only the data for 2009-2019:

df_2009_2019 = df[:-1]
df_2009_2019

Generate the quadratic-term data with the following code:

from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree=2)
X_ = poly_reg.fit_transform(list(df_2009_2019['year int']))

1. The first line of code imports the PolynomialFeatures module, which is used to add polynomial terms.

2. The second line of code sets the highest degree to 2, in preparation for generating the quadratic-term data (x squared).

3. The third line of code converts the original X into a new two-dimensional array X_, which contains the newly generated quadratic data (x squared) and the original first-degree data (x).
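The three steps above can be checked on a small stand-alone example. The sketch below assumes the same year index 1 to 11 and compares the output of `PolynomialFeatures` with a matrix built by hand; the name `manual` is hypothetical and only used for illustration.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Year index as a column vector, the same shape as the 'year int' column
X = np.arange(1, 12).reshape(-1, 1)

poly_reg = PolynomialFeatures(degree=2)
X_ = poly_reg.fit_transform(X)

# Build the same matrix by hand: columns x^0, x^1, x^2
manual = np.column_stack([np.ones(len(X)), X[:, 0], X[:, 0] ** 2])

print(X_[:3])                   # first three rows of the expanded features
print(np.allclose(X_, manual))  # whether both constructions agree
```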

X_ is the two-dimensional array shown in the following code. The first column is the constant term (x to the power zero), which has no special meaning and does not affect the analysis result; the second column is the original first-degree data (x); and the third column is the newly generated quadratic data (x squared).

X_
array([[  1.,   1.,   1.],
       [  1.,   2.,   4.],
       [  1.,   3.,   9.],
       [  1.,   4.,  16.],
       [  1.,   5.,  25.],
       [  1.,   6.,  36.],
       [  1.,   7.,  49.],
       [  1.,   8.,  64.],
       [  1.,   9.,  81.],
       [  1.,  10., 100.],
       [  1.,  11., 121.]])

from sklearn.linear_model import LinearRegression
regr = LinearRegression()
regr.fit(X_, list(df_2009_2019['sales']))
LinearRegression()

1. The first line of code imports the linear regression module LinearRegression from the Scikit-Learn library.

2. The second line of code constructs an initial linear regression model and names it regr.

3. The third line of code uses the fit() function to fit the model; regr is now a fitted linear regression model.
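As a sanity check on the fitted model, the coefficients can be compared with NumPy's own quadratic fit. This is a sketch under the assumption that the 2009-2019 figures above are used as-is; `np.polyfit` is a swapped-in cross-check, not part of the original workflow.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# 2009-2019 sales from the article, in units of 100 million yuan
sales = np.array([0.50, 9.36, 34, 191, 350, 571, 912, 1207, 1682, 2135, 2684])
x = np.arange(1, 12)

X_ = PolynomialFeatures(degree=2).fit_transform(x.reshape(-1, 1))
regr = LinearRegression()
regr.fit(X_, sales)

# np.polyfit returns coefficients from the highest power down: [x^2, x, constant]
c2, c1, c0 = np.polyfit(x, sales, 2)

print("sklearn:", regr.intercept_, regr.coef_)
print("polyfit:", c0, c1, c2)
```

Because LinearRegression fits its own intercept by default, the coefficient on the constant column should come out as numerically zero, which is consistent with that column not affecting the result.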

NO.4 Model prediction

Now we can use the fitted model regr to make predictions. Suppose the independent variable is 12 (the year index for 2020); the predict() function then gives the corresponding value of the dependent variable, as follows:

XX_ = poly_reg.fit_transform([[12]])
XX_
array([[  1.,  12., 144.]])

y = regr.predict(XX_)
y
array([3282.23478788])

So if we forecast 2020 from the 2009-2019 trend, the result is about 3282 (in units of 100 million yuan, i.e. 328.2 billion yuan), whereas the actual figure was 4982 (498.2 billion yuan). The reason is the sudden jump in the reported amount mentioned above. Plotting the fitted curve together with the actual data gives the following:
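The size of the miss can be worked out directly from the two numbers quoted above. A minimal arithmetic sketch, with hypothetical variable names:

```python
# Values quoted in the article, in units of 100 million yuan
predicted_2020 = 3282.23478788
actual_2020 = 4982

# Absolute gap and gap relative to the actual figure
shortfall = actual_2020 - predicted_2020
relative_error = shortfall / actual_2020

print(f"shortfall: {shortfall:.2f}")
print(f"relative error: {relative_error:.1%}")
```

The quadratic trend falls short by about 1700 (170 billion yuan), or roughly 34% of the actual figure.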

# Scatter plot with the fitted curve
import plotly as py
import plotly.graph_objs as go
import numpy as np

year = list(df['year'])
sales = df['sales']
trace1 = go.Scatter(
    x=year,
    y=sales,
    mode='markers',
    name="actual sales",  # first legend name
)
# Year indices 1-12 plus 13 (2021), so the x axis needs one matching extra label
XX_ = poly_reg.fit_transform(list(df['year int']) + [[13]])
regr = LinearRegression()
regr.fit(X_, list(df_2009_2019['sales']))
trace2 = go.Scatter(
    x=year + ['2021'],
    y=regr.predict(XX_),
    mode='lines',
    name="fitted data",  # second legend name
)
data = [trace1, trace2]
layout = go.Layout(title='Tmall Taobao Double 11 sales',
                   xaxis_title='year',
                   yaxis_title='sales')
fig = go.Figure(data=data, layout=layout)
fig.show()

NO.5 Forecasting sales for 2021

Since the data have deviated this much, let's not dig into the cause and simply apply brute force. Using the same method, include the real 2020 figure in the fit and see what happens:

from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree=5)
X_ = poly_reg.fit_transform(list(df['year int']))

# Fit on all the data, including 2020
regr = LinearRegression()
regr.fit(X_, list(df['sales']))
LinearRegression()

# Year indices 1-12 plus 13 (2021)
XXX_ = poly_reg.fit_transform(list(df['year int']) + [[13]])

# Scatter chart
import plotly as py
import plotly.graph_objs as go
import numpy as np

year = list(df['year'])
sales = df['sales']
trace1 = go.Scatter(
    x=year,
    y=sales,
    mode='markers',
    name="actual sales",  # first legend name
)
# XXX_ has one extra row (index 13), so only '2021' is appended to the x axis
trace2 = go.Scatter(
    x=year + ['2021'],
    y=regr.predict(XXX_),
    mode='lines',
    name="forecast sales",  # second legend name
)
trace3 = go.Scatter(
    x=['2021'],
    y=[regr.predict(XXX_)[-1]],
    mode='markers',
    name="2021 forecast sales",  # third legend name
)
data = [trace1, trace2, trace3]
layout = go.Layout(title='Tmall Taobao Double 11 sales',
                   xaxis_title='year',
                   yaxis_title='sales')
fig = go.Figure(data=data, layout=layout)
fig.show()
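The same degree-5 fit can also be written as a scikit-learn pipeline, so that feature expansion and regression travel together and new inputs are passed in as plain year indices. This is a sketch of an alternative arrangement, not the article's code; `model` and `forecast_2021` are hypothetical names.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# 2009-2020 sales from the article, in units of 100 million yuan
sales = np.array([0.50, 9.36, 34, 191, 350, 571, 912, 1207, 1682, 2135, 2684, 4982])
X = np.arange(1, 13).reshape(-1, 1)

# The pipeline expands the features and fits the regression in one step
model = make_pipeline(PolynomialFeatures(degree=5), LinearRegression())
model.fit(X, sales)

# Year index 13 corresponds to 2021
forecast_2021 = model.predict([[13]])[0]
print(forecast_2021)
```

With twelve data points and a degree-5 polynomial, the extrapolated value is very sensitive to the chosen degree, so it is better read as an illustration than as a forecast.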

NO.6 How to choose the degree for polynomial prediction

To choose the degree for the model, we can have the program compute the prediction error for each degree and then pick the parameter based on the results.

df_new = df.copy()
# Unwrap the one-element lists so 'year int' holds plain integers
df_new['year int'] = df['year int'].apply(lambda x: x[0])
df_new

# Choosing the degree for polynomial regression
# Compute the MSE of the degree-m polynomial regression predictions and plot it
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

train_df = df_new[:int(len(df) * 0.95)]
test_df = df_new[int(len(df) * 0.5):]

# Define the independent and dependent variables used for training and testing
train_x = train_df['year int'].values
train_y = train_df['sales'].values
# print(train_x)
test_x = test_df['year int'].values
test_y = test_df['sales'].values

train_x = train_x.reshape(len(train_x), 1)
test_x = test_x.reshape(len(test_x), 1)
train_y = train_y.reshape(len(train_y), 1)

mse = []    # stores the MSE for each polynomial degree
m = 1       # initial degree
m_max = 10  # maximum degree
while m
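The listing above is cut off at the `while m` loop. One way such a degree search could look is sketched below; it is an assumption about the intent, not a reconstruction of the missing code. It reuses the same 95% / 50% slicing, fits one pipeline per degree, and keeps the degree with the lowest test MSE. The name `best_degree` is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# 2009-2020 sales from the article, in units of 100 million yuan
sales = np.array([0.50, 9.36, 34, 191, 350, 571, 912, 1207, 1682, 2135, 2684, 4982])
X = np.arange(1, 13).reshape(-1, 1)

# Same slicing as the listing: first 95% for training, last 50% for testing
n = len(sales)
train_x, train_y = X[: int(n * 0.95)], sales[: int(n * 0.95)]
test_x, test_y = X[int(n * 0.5):], sales[int(n * 0.5):]

m_max = 10  # highest degree to try
mse = []    # test MSE for degrees 1..m_max
for m in range(1, m_max + 1):
    model = make_pipeline(PolynomialFeatures(degree=m), LinearRegression())
    model.fit(train_x, train_y)
    mse.append(mean_squared_error(test_y, model.predict(test_x)))

# Degrees start at 1, list indices at 0
best_degree = int(np.argmin(mse)) + 1
print(best_degree, mse[best_degree - 1])
```

Note that under this slicing the training and test rows overlap, and with only twelve points the selected degree can change with small changes to the split, so the result should be treated with caution.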

