Tencent Technical Engineering | Time Series Prediction Based on Prophet

2025-07-15 Update | From: SLTechnology News&Howtos | Shulou (Shulou.com) 06/03 Report

Predicting the future is always exciting and a little magical, and many time series prediction models have been studied for this reason. In practice, however, most of them perform poorly because the prediction problems are too complex: time series prediction requires not only a lot of statistical knowledge but, more importantly, the integration of background knowledge about the problem. Prophet combines the two, providing a simpler and more flexible prediction method whose accuracy is comparable to that of professional analysts. If you are still struggling with time series prediction, let's step into the world of Prophet.

Preface

Time series prediction has always been a difficult problem, and it is hard to find a general model that suits a wide range of scenarios. In reality, the background knowledge of each prediction problem, such as the process that generates the data, is often different. Even for problems of the same kind, the factors that influence the prediction and the degree of their influence often differ. In addition, prediction problems often require a lot of specialized statistical knowledge. All of this makes time series prediction particularly complex for analysts.

Traditional time series prediction methods, such as the ARIMA (autoregressive integrated moving average) model, are implemented in both R and Python. Although these traditional methods have been used in many scenarios, they often have the following drawbacks:

a. The applicable time series data is too limited.

For example, the most general ARIMA model requires the time series to be stationary, or stationary after differencing, and the differencing operation extracts information at a fixed period. Real data often fails to meet these requirements.

b. The missing value needs to be filled.

In the case of missing values in the data, the traditional methods need to fill the missing values first, which damages the reliability of the data to a great extent.

c. The model lacks flexibility.

The traditional models only capture temporal dependencies in the data and are too inflexible to let users introduce background knowledge of the problem or other useful assumptions.

d. The guiding role is weak.

Although R and Python implement these methods and provide visualization, which lowers the threshold for using the models, the nature of the models themselves still makes it hard for users to see clearly what is limiting prediction accuracy.

In short, traditional time series prediction struggles to combine model accuracy with effective interaction with the user.

Recently, Facebook released the Prophet project, which has attracted wide attention for its simpler and more flexible forecasting method and for results comparable to those of experienced analysts. The rest of this article introduces Prophet.

Prophet introduction

2.1 Overall framework

The figure above shows the overall framework of Prophet. The process is divided into four parts: Modeling, Forecast Evaluation, Surface Problems, and Visually Inspect Forecasts. Overall this is a loop, which the dotted line divides into a part operated by the analyst and an automated part, so the whole process is a cycle that combines the analyst with automation. It is also a process of combining background knowledge of the problem with statistical analysis, which greatly widens the model's scope of application and improves its accuracy. Following these four parts, the prediction process of Prophet is as follows:

A. Modeling: build a time series model. The analyst chooses an appropriate model according to the background of the forecasting problem.

B. Forecast Evaluation: evaluate the model. The model is used to simulate forecasts on historical data. When the model parameters are uncertain, several settings can be tried and compared by their simulation results.

C. Surface Problems: present the problems. If the overall performance is still unsatisfactory after trying a variety of parameters, the potential causes of the large errors are presented to the analyst.

D. Visually Inspect Forecasts: feed back the entire forecast visually. Once the problems have been fed back, the analyst decides whether to adjust and rebuild the model.

2.2 Applicable scenarios

As mentioned earlier, different time series prediction problems call for different solutions. Prophet is suitable for business problems with the following characteristics:

a. Hourly, daily, or weekly observations with at least a few months (preferably a year) of history.

b. Strong seasonality at several human scales: day of the week and time of the year.

c. Important holidays (such as National Day) that occur at irregular intervals but are known in advance.

d. A reasonable number of missing observations or large outliers.

e. Historical trend changes (for example, because of product releases).

f. A nonlinear growth trend that has a natural limit or saturates.

2.3 Model principle

The overall construction of the model is as follows:

$$y(t) = g(t) + s(t) + h(t) + \epsilon_t \qquad (1)$$

Model (1) consists of three parts: growth (the growth trend), seasonality (the seasonal trend), and holidays (the impact of holidays on the forecast). Here $g(t)$ is the growth function, which fits the non-periodic changes in the time series; $s(t)$ represents periodic changes, such as weekly and yearly seasonality; and $h(t)$ represents the effect of holidays, which may fall on non-periodic dates. Finally, $\epsilon_t$ is the noise term, which represents fluctuations not captured by the model and is assumed to follow a Gaussian distribution.

This model is similar to a generalized additive model (GAM). Unlike earlier time series models such as ARIMA, it treats the prediction problem as a curve-fitting problem, which has a lot of practical value:

a. It is highly flexible: seasonal components with different periods and different trend assumptions can be introduced easily.

b. The time series does not need to be regularly spaced, and missing values do not need to be filled before fitting, which traditional models such as ARIMA cannot offer.

c. Fitting is very fast, so analysts can explore the effect of the model interactively.

d. The parameters of the model are easy to interpret, so analysts can strengthen particular assumptions heuristically.

The construction of each part of the model is described below.

2.3.1 Growth trend

The growth trend is the core component of the entire model. It describes how the time series has grown and how it is expected to grow in the future. Two models are available to analysts: non-linear growth and linear growth.

1. Non-linear growth

Non-linear growth uses the logistic growth model:

$$g(t) = \frac{C}{1 + \exp(-k(t - b))}$$

Here, C is the carrying capacity, which limits the maximum amount of growth that can be achieved, k represents the growth rate and b represents the offset.
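As a minimal sketch of the saturating behaviour described here, the function below evaluates the standard logistic form C / (1 + exp(-k (t - b))), using the article's symbols C, k, and b. The function name and the numeric values are illustrative assumptions, not part of Prophet itself.

```python
import math


def logistic_growth(t: float, C: float, k: float, b: float) -> float:
    """Logistic growth curve: rises toward the carrying capacity C when k > 0."""
    return C / (1.0 + math.exp(-k * (t - b)))


# At t == b the curve sits exactly at half of the carrying capacity.
midpoint = logistic_growth(t=5.0, C=100.0, k=0.8, b=5.0)

# Well after the offset, the curve is close to (but still below) C.
late = logistic_growth(t=15.0, C=100.0, k=0.8, b=5.0)
```

The offset b shifts the curve along the time axis, while k sets how quickly the curve moves from near 0 to near C.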

Of course, real growth is far from that simple, and Prophet addresses two practical issues:

(1) the value of C is not necessarily constant;

(2) the growth rate is not necessarily constant.

For (1), C is constructed as a function of time: C(t) = K or C(t) = Mt + K. Issue (2) is discussed in detail below.

Solution to (2): first, the model defines the points in time at which the growth rate k changes, which we call changepoints. Suppose there are $S$ changepoints at times $s_j$, $j = 1, \dots, S$. The slope adjustment at $s_j$ is written $\delta_j$, and all the slope adjustments form a vector $\delta \in \mathbb{R}^S$. The growth rate after the changepoints that have already occurred then becomes $k + \sum_{j:\, t > s_j} \delta_j$. With the following definition:

$$a_j(t) = \begin{cases} 1, & t \ge s_j \\ 0, & \text{otherwise} \end{cases}$$

the growth rate at time t can be expressed as:

$$k + a(t)^\top \delta$$

When the growth rate k is adjusted, the offset b must also be adjusted at each changepoint so that the endpoints of adjacent segments connect:

$$\gamma_j = \Big(s_j - b - \sum_{l<j} \gamma_l\Big)\left(1 - \frac{k + \sum_{l<j} \delta_l}{k + \sum_{l \le j} \delta_l}\right)$$

Combining (1) and (2), the final piecewise logistic growth model is:

$$g(t) = \frac{C(t)}{1 + \exp\big(-(k + a(t)^\top \delta)\,(t - (b + a(t)^\top \gamma))\big)}$$
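The following sketch illustrates how the rate and offset adjustments at changepoints fit together. It assumes a constant carrying capacity (the C(t) = K case) and uses hypothetical names (`offset_adjustments`, `piecewise_logistic`) and made-up numbers. Each offset adjustment is chosen so that the curve stays continuous where the growth rate changes.

```python
import math
from typing import List, Sequence


def offset_adjustments(k: float, b: float,
                       changepoints: Sequence[float],
                       deltas: Sequence[float]) -> List[float]:
    """Offset adjustment at each changepoint so adjacent segments connect."""
    gammas: List[float] = []
    rate = k  # growth rate in force just before the current changepoint
    for s_j, d_j in zip(changepoints, deltas):
        new_rate = rate + d_j
        gamma_j = (s_j - b - sum(gammas)) * (1.0 - rate / new_rate)
        gammas.append(gamma_j)
        rate = new_rate
    return gammas


def piecewise_logistic(t: float, C: float, k: float, b: float,
                       changepoints: Sequence[float],
                       deltas: Sequence[float]) -> float:
    """Logistic growth whose rate changes by deltas[j] at changepoints[j]."""
    gammas = offset_adjustments(k, b, changepoints, deltas)
    rate, offset = k, b
    for s_j, d_j, g_j in zip(changepoints, deltas, gammas):
        if t >= s_j:  # this changepoint is already active at time t
            rate += d_j
            offset += g_j
    return C / (1.0 + math.exp(-rate * (t - offset)))


# Illustrative values: the rate speeds up at t = 10 and slows down at t = 20.
cps, ds = [10.0, 20.0], [0.3, -0.2]
just_before = piecewise_logistic(10.0 - 1e-9, 100.0, 0.2, 12.0, cps, ds)
at_changepoint = piecewise_logistic(10.0, 100.0, 0.2, 12.0, cps, ds)
```

The two values at the end agree up to rounding, which is the continuity that the offset adjustment is meant to guarantee. With no changepoints, the function reduces to the plain logistic curve.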

2. Linear growth

If you believe the overall growth trend of the time series is linear, you can use a piecewise linear model:

$$g(t) = (k + a(t)^\top \delta)\,t + (b + a(t)^\top \gamma)$$

The parameters are defined as in non-linear growth, except that the offset adjustment for each changepoint is $\gamma_j = -s_j \delta_j$.

Across both growth models, the most important part of predicting the growth trend is specifying the changepoints. They can be specified manually or identified automatically according to formulas (3) and (4). In the automatic case, it is assumed that

$$\delta_j \sim \mathrm{Laplace}(0, \tau)$$

where $\tau$ controls the smoothness of the model as a whole.
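For the linear growth case, a minimal sketch follows, assuming the usual piecewise linear trend in which the slope changes by a fixed amount at each changepoint and the intercept is shifted by minus the changepoint time multiplied by that slope change, so the line stays connected. The function name and values are illustrative.

```python
from typing import Sequence


def piecewise_linear(t: float, k: float, b: float,
                     changepoints: Sequence[float],
                     deltas: Sequence[float]) -> float:
    """Linear trend with base slope k and intercept b; slope shifts at changepoints."""
    rate, offset = k, b
    for s_j, d_j in zip(changepoints, deltas):
        if t >= s_j:
            rate += d_j
            offset += -s_j * d_j  # keeps the two segments joined at s_j
    return rate * t + offset


# Slope 1 until t = 10, slope 2 afterwards (illustrative values).
before = piecewise_linear(5.0, 1.0, 0.0, [10.0], [1.0])
at_cp = piecewise_linear(10.0, 1.0, 0.0, [10.0], [1.0])
after = piecewise_linear(12.0, 1.0, 0.0, [10.0], [1.0])
```

Here the trend reaches 10 at the changepoint under either slope, and two time units later it has risen by 4 instead of 2.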

2.3.2 Seasonal trend

Because a time series may contain seasonality with several different periods, a Fourier series can be used to approximate the periodic component:

$$s(t) = \sum_{n=1}^{N} \left( a_n \cos\frac{2\pi n t}{P} + b_n \sin\frac{2\pi n t}{P} \right)$$

Here P is the fixed period (for data measured in days, P = 365.25 for yearly seasonality and P = 7 for weekly seasonality), and N is the number of Fourier terms, so there are 2N coefficients to estimate. A larger N can fit more complex seasonal patterns, but it also increases the risk of over-fitting. Empirically, N = 10 is used for the yearly cycle and N = 3 for the weekly cycle.

When the Fourier terms of all the seasonal components in s(t) are combined into a vector X(t), the final seasonal model is:

$$s(t) = X(t)\,\beta$$

where $\beta \sim \mathrm{Normal}(0, \sigma^2)$, a prior that improves the smoothness of the seasonal model.
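To make the Fourier construction concrete, here is a small sketch that builds the cosine and sine features for one period and combines them with a coefficient vector. The helper names and the coefficient values are assumptions for illustration; in a fitted model the coefficients would be estimated from data.

```python
import math
from typing import List, Sequence


def fourier_features(t: float, period: float, order: int) -> List[float]:
    """Return [cos, sin] pairs for harmonics 1..order of the given period."""
    features: List[float] = []
    for n in range(1, order + 1):
        angle = 2.0 * math.pi * n * t / period
        features.append(math.cos(angle))
        features.append(math.sin(angle))
    return features


def seasonal_value(t: float, period: float, coeffs: Sequence[float]) -> float:
    """Seasonal component: dot product of the features with the coefficients."""
    order = len(coeffs) // 2
    return sum(c * x for c, x in zip(coeffs, fourier_features(t, period, order)))


# Weekly seasonality with N = 3 gives 2N = 6 features (coefficients are made up).
weekly_coeffs = [0.5, -0.2, 0.1, 0.3, -0.05, 0.02]
weekly_features = fourier_features(3.0, 7.0, 3)
yearly_features = fourier_features(3.0, 365.25, 10)
```

Because every feature repeats with the chosen period, `seasonal_value(t, 7.0, weekly_coeffs)` returns the same value one week later, up to floating-point rounding.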

2.3.3 Holiday model

Practical experience shows that holidays and other major events can have a large impact on a time series, and their dates are often not periodic. Analyzing these points is essential, and sometimes matters far more than the ordinary days.

Because each holiday (or known major event) differs in date and in degree of influence, the holiday model treats the effect of each holiday as an independent component. Each component also has a time window, because the effect of a holiday extends over a window period (for example, a few days before and after the Mid-Autumn Festival), and the model assigns the same effect to every time point within one window. If $i$ denotes a holiday and $D_i$ denotes the set of times $t$ contained in its window period, the holiday model h(t) for $L$ holidays can be expressed as:

$$h(t) = \sum_{i=1}^{L} \kappa_i \, \mathbf{1}(t \in D_i)$$

where $\kappa_i$ is the impact of holiday $i$ on the predicted value within its window period. As in the seasonal trend model, define:

$$Z(t) = [\mathbf{1}(t \in D_1), \dots, \mathbf{1}(t \in D_L)]$$

so that

$$h(t) = Z(t)\,\kappa$$

where $\kappa \sim \mathrm{Normal}(0, \nu^2)$.
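The holiday component can be sketched as a lookup over window sets: each holiday contributes its own fixed effect on every day inside its window and nothing outside it. The helper names, the effect sizes, and the window lengths below are illustrative assumptions. Windows are written as day offsets relative to the holiday, with a non-positive lower offset, which is the convention used by the open-source Prophet package.

```python
from datetime import date, timedelta
from typing import Dict, Set


def holiday_window(day: date, lower_window: int, upper_window: int) -> Set[date]:
    """All dates from day + lower_window to day + upper_window, inclusive."""
    return {day + timedelta(days=offset)
            for offset in range(lower_window, upper_window + 1)}


def holiday_effect(t: date, windows: Dict[str, Set[date]],
                   effects: Dict[str, float]) -> float:
    """Sum the effect of every holiday whose window contains t."""
    return sum(effects[name] for name, days in windows.items() if t in days)


# Illustrative setup: one day either side of Mid-Autumn, a week after National Day.
windows = {
    "mid_autumn": holiday_window(date(2016, 9, 15), -1, 1),
    "national_day": holiday_window(date(2016, 10, 1), 0, 6),
}
effects = {"mid_autumn": 30.0, "national_day": 80.0}  # made-up effect sizes

eve_effect = holiday_effect(date(2016, 9, 14), windows, effects)
ordinary_effect = holiday_effect(date(2016, 9, 20), windows, effects)
```

Every day inside one window receives the same value, matching the model's assumption that the influence within a window period is constant.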

The use of Prophet

3.1 Parameter usage

The parameters of the module are explained below. Users can make full use of them to adjust the model:

a. Model parameters of growth trend

growth: the growth trend model, the core component of the whole prediction model. It takes one of two values, "linear" and "logistic", which represent linear and non-linear growth respectively. The default value is "linear".

cap: the carrying capacity, that is, the maximum value at which the prediction saturates under a non-linear growth trend. It must be given when non-linear growth is selected.

changepoints (in the growth model): the change points. The user can supply the times at which the growth rate is known to change; if none are supplied, the system identifies them automatically. The default value is "None".

n_changepoints: the number of potential changepoints. The default value is 25.

changepoint_prior_scale (in the growth model): the flexibility of the growth trend model. It adjusts how readily changepoints are selected: the higher the value, the more changepoints are selected, which makes the model fit the historical data more closely but also increases the risk of over-fitting. Default value: 0.05.

b. Model parameters of seasonal trend

seasonality_prior_scale (in the seasonality model): adjusts the strength of the seasonal component. The higher the value, the larger the seasonal fluctuation the model will fit; the smaller the value, the more the seasonal fluctuation is suppressed. The default value is 10.0.

c. Model parameters of holidays

holidays_prior_scale (in the holidays model): adjusts the strength of the holiday component. The higher the value, the greater the impact of holidays on the model; the smaller the value, the smaller the impact. The default value is 10.0.

holidays: the definition of holidays, given as a JSON-format configuration. Each entry has four fields, all of which must be configured: "holiday" is the name of a type of holiday, "ds" specifies the specific holiday dates, "lower_window" is the number of days before the specified date, and "upper_window" is the number of days after the specified date.

d. Other parameters needed in the forecast

freq: the time unit (frequency) of the data. The default is "D", meaning daily data.

periods: the number of future time steps to predict. For example, to predict the coming year from daily data, set it to 365.

mcmc_samples: the number of MCMC samples used to estimate the uncertainty of the forecast. If it is greater than 0, full Bayesian inference is performed with that many MCMC samples. If it is 0, maximum a posteriori (MAP) estimation is used. The default value is 0.

interval_width: the width of the uncertainty interval reported for the forecast. The default value is 0.80. When mcmc_samples = 0, the interval reflects only the uncertainty of the growth trend model; when mcmc_samples > 0, it also includes the uncertainty of the seasonal trend.

uncertainty_samples: the number of simulated draws used to estimate the uncertainty interval of the future growth trend. Default value: 1000.
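As a rough illustration of how these parameters might map onto the open-source `prophet` Python package (published as `fbprophet` in older releases), consider the sketch below. The mapping is an assumption: the article describes a wrapped module, whereas in the open-source package `cap` is supplied as a DataFrame column and `periods` and `freq` are arguments of `make_future_dataframe`. The function name, the holiday dates, and the window sizes are hypothetical.

```python
def forecast_with_prophet(history, cap, periods=365, freq="D"):
    """Fit a logistic-growth model and forecast `periods` steps ahead.

    history: pandas DataFrame with columns 'ds' (dates) and 'y' (values).
    cap: carrying capacity applied to both the history and the future.
    """
    # Imported lazily so this sketch can be loaded without the packages installed.
    import pandas as pd
    from prophet import Prophet

    # Illustrative holiday table; the open-source package expects lower_window <= 0.
    holidays = pd.DataFrame({
        "holiday": "national_day",
        "ds": pd.to_datetime(["2016-10-01", "2017-10-01"]),
        "lower_window": -1,
        "upper_window": 6,
    })

    model = Prophet(
        growth="logistic",
        n_changepoints=25,
        changepoint_prior_scale=0.05,
        seasonality_prior_scale=10.0,
        holidays=holidays,
        holidays_prior_scale=10.0,
        mcmc_samples=0,            # 0 means MAP estimation
        interval_width=0.80,
        uncertainty_samples=1000,
    )

    history = history.assign(cap=cap)  # logistic growth requires a 'cap' column
    model.fit(history)

    future = model.make_future_dataframe(periods=periods, freq=freq)
    future["cap"] = cap
    forecast = model.predict(future)
    return model, forecast
```

With the returned objects, `model.plot(forecast)` draws the overall forecast and `model.plot_components(forecast)` draws the trend, holiday, and seasonal panels.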

3.2 Result reading and analysis

Once the above configuration is complete, you can run the model directly and get the results.

3.2.1 Visualization result

The overall forecast plot is the most direct way to judge how well the model predicts, and it is an important basis for evaluating the current model. The visual display also helps us analyze the prediction quality in each time range of the results.

The above figure is an overall forecast plot, which covers the period from the start of the historical data to the end of the forecast horizon. The ds axis represents time, and the y axis represents the predicted values. The black dots represent the known historical data, which makes outliers easy to spot, and the blue curve represents the model's predicted values. Looking closely at the blue curve, there is a light blue band along it, which marks the upper and lower bounds of the model's prediction. When evaluating the results, we treat the blue curve as the main prediction and the upper and lower bounds as a reference. The light blue band is also useful for model evaluation, for example in the following figure:

In the forecast part after 2016, the light blue band is too broad, and the upper and lower bounds predicted by the model widen many times over. This shows that the smoothness of the model is too large and that the outliers have a strong influence on the results. The model is therefore not reasonable enough, and the user needs to reset the parameters or preprocess the outliers in the historical data.

The above figure is the result with growth set to "linear". If we believe the time series follows a non-linear growth trend, the result looks like the following figure:

The result for non-linear growth is presented in much the same way as for linear growth. The only thing to note is that the horizontal dotted line in the chart represents the carrying capacity cap of the non-linear growth trend, and the predicted values saturate at the dotted line.

In addition to the overall forecast above, Prophet also provides component analysis, which analyzes the three components of formula (1) separately. Component analysis helps us examine the impact of each component on the prediction. Through the visual display we can pin down the specific causes that affect the prediction and address them, so component analysis is an important way to improve the accuracy of the model. For example, consider the following figure:

The four charts above show, from top to bottom, the growth trend model (trend), the holiday model (holidays), and the seasonal models (weekly and yearly). Note that if no holiday information is specified in the holidays parameter, the module will not analyze this part automatically. If something in these results looks unreasonable, you can change the components according to the parameter descriptions in 3.1. Use your professional background knowledge as much as possible here to make the influence of each part more realistic. For example, if you think the yearly component ("yearly") is over-fitted, you can adjust the seasonality_prior_scale parameter: the smaller the value, the smaller the seasonal fluctuation.

For the above visual analysis, here are some suggestions to facilitate you to locate the problems in the prediction:

a. If the prediction error is large, consider whether the selected model is appropriate, try adjusting the parameters of the growth trend model (growth) and, if necessary, adjust the seasonal (seasonality) parameters.

b. If the prediction for certain dates still shows a large error under most of the settings tried, the historical data probably contains outliers. The best approach is to find these outliers and remove them. Unlike with other methods, users do not need to interpolate the removed data: it is enough to keep the timestamps of the outliers and set their values to null (NA). The model can still give a prediction for those time points.

c. If the error increases sharply from one cut-off point to the next during simulated forecasts on historical data, the data generation process changed substantially between the two cut-off points, and a changepoint should be added between them to model the different stages of this period.
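Suggestion b can be sketched in a few lines: values that the analyst judges to be outliers are replaced with a null value while their timestamps are kept. The function name and the fixed thresholds are hypothetical; how outliers are identified is left to the analyst.

```python
from typing import List, Optional, Tuple

Observation = Tuple[str, Optional[float]]


def mask_outliers(series: List[Observation],
                  lower: float, upper: float) -> List[Observation]:
    """Keep every timestamp, but null out values outside [lower, upper]."""
    masked: List[Observation] = []
    for ds, y in series:
        if y is not None and lower <= y <= upper:
            masked.append((ds, y))
        else:
            masked.append((ds, None))  # timestamp kept, value treated as missing
    return masked


# Illustrative data: the second value is an obvious spike.
history = [("2016-01-01", 10.0), ("2016-01-02", 950.0), ("2016-01-03", 12.0)]
cleaned = mask_outliers(history, lower=0.0, upper=100.0)
```

No interpolation step is involved: the cleaned series has the same length and timestamps as the original, with the spike recorded as missing.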

Reference

Sean J. Taylor and Benjamin Letham. Forecasting at Scale. https://research.fb.com/prophet-forecasting-at-scale
