
Case Analysis of the Time Series Database ModelarDB


This article walks through a case analysis of the time series database ModelarDB. The ideas involved are simple and practical, so let's get into it.

Problem background

Industrial systems (such as wind turbines) generate more data than can feasibly be stored raw, so it is now common to store only aggregates. But aggregation loses the fluctuations and outliers in the original data, and that information is often valuable, for example for fault diagnosis.

A time series database for this setting needs several important properties: distribution, stream processing (data becomes visible for query as it is written), high compression, efficient retrieval, approximate query processing (AQP), and extensibility (adding domain knowledge without modifying code).

Time series

Time series (Time Series): a sequence of (time, value) tuples in which the time dimension is strictly increasing.

For example: (100, 28.3) (200, 30.7) (300, 28.3) (400, 28.3) (500, 15.2).

A time series with finitely many data points is called a bounded time series.

Fixed-frequency time series (Regular Time Series): a series in which the time interval between any two adjacent points is the same.

The example above is a fixed-frequency series.

Sampling interval (Sampling Interval): the time interval between two adjacent time points in a fixed frequency time series.

In the example above, the sampling interval is 100.
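
As a tiny illustration (the names here are mine, not ModelarDB's API), a bounded time series can be modeled as a sequence of (timestamp, value) pairs, with regularity checked from the adjacent deltas:

```scala
// Hypothetical representation; not ModelarDB's actual data types.
case class DataPoint(ts: Long, value: Double)

// The five-point example series from the text.
val series = Seq(
  DataPoint(100, 28.3), DataPoint(200, 30.7), DataPoint(300, 28.3),
  DataPoint(400, 28.3), DataPoint(500, 15.2)
)

// A bounded series is regular iff all adjacent timestamp deltas are equal;
// that shared delta is the sampling interval (SI).
def samplingInterval(s: Seq[DataPoint]): Option[Long] = {
  val deltas = s.sliding(2).collect { case Seq(a, b) => b.ts - a.ts }.toSeq
  if (deltas.nonEmpty && deltas.distinct.size == 1) Some(deltas.head) else None
}

samplingInterval(series) // Some(100), matching the example above
```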

Model

The concepts above are nothing new; the focus of the paper is the model, and the point of this section is to pin down what a model is:

Model: a representation of a time series, consisting of two functions (Mest, Merr). The first takes a timestamp and returns an estimated value. The second takes a time series plus the first function and returns a positive real number as an error estimate.

Taking the above time series containing five points as an example, you can give a model:

Mest(ti) = -0.0024 * ti + 29.5, 1 ≤ i ≤ 5

Merr = max(|vi - Mest(ti)|), 1 ≤ i ≤ 5

Here vi and ti come from the original time series. In effect, a linear function estimates the values; the absolute error is computed at every point and the largest one is kept.

This model works fine, but the original time series is still needed, at least while Merr is being computed.
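
Written out as code, the pair above is just the following (reusing the DataPoint sketch from earlier; the constants come from the text, not from any fitting routine of mine):

```scala
// Mest: linear estimate of the value at time ti.
val mEst: Long => Double = ti => -0.0024 * ti + 29.5

// Merr: the largest absolute deviation between the raw points and Mest.
def mErr(s: Seq[DataPoint], est: Long => Double): Double =
  s.iterator.map(p => math.abs(p.value - est(p.ts))).max

mErr(series, mEst) // ~13.1, dominated by the outlier (500, 15.2)
```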

Gap (GAP): a pair (ts, te) describing the hole between two fixed-frequency time series with the same sampling interval produced by the same data source, where te = ts + m * sampling interval and m ≥ 2. At least one point must be missing, because m = 1 means nothing is missing.

For example, a series such as (100) (200) (400) has a gap in the middle; it is a time series of irregular frequency.

Filling the gaps of such an irregular series with nulls turns it into a fixed-frequency time series with gaps.
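
A sketch of that null-filling step, assuming the sampling interval is known (again my own illustration):

```scala
// Fill gaps with None so an irregular series becomes regular again.
// Assumes s is non-empty and sorted by timestamp.
def fillGaps(s: Seq[DataPoint], si: Long): Seq[(Long, Option[Double])] = {
  val byTs = s.map(p => p.ts -> p.value).toMap
  (s.head.ts to s.last.ts by si).map(t => t -> byTs.get(t))
}

fillGaps(Seq(DataPoint(100, 1.0), DataPoint(200, 2.0), DataPoint(400, 4.0)), 100)
// => ((100,Some(1.0)), (200,Some(2.0)), (300,None), (400,Some(4.0)))
```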

Segment: a bounded, fixed-frequency time series with gaps, described by several elements: start time, end time, sampling interval, the set of null (gap) timestamps, a model, and an error.

The segment is the final boss: everything above was built up to introduce it, because segments are what the system actually stores. ModelarDB only supports fixed-frequency time series, which is a serious limitation.

For a time series with 5 points, if the fifth point cannot be represented within the user-defined error bound, the first four are represented by a segment, and a new segment is created for the fifth once the next data arrives, as shown in the illustration below:

System architecture

It is billed as a system, but it is really a jar package. The jar depends on Spark, the Spark-Cassandra-Connector, and Cassandra, and implements their interfaces.

The architecture diagram of ModelarDB is shown below; it basically consists of a data import module (which generates segments), a query interface, a storage interface, and a metadata cache.

The figure shows that every ModelarDB node runs both a Spark node and a Cassandra node to ensure data locality, which any client using the Spark-Cassandra-Connector can achieve.

Data flow: time series data passes through the segment generator, which selects a suitable model and produces a stream of segments. These are cached in memory, and older segments are persisted to Cassandra. Both the in-memory segments and those in Cassandra can be queried.

Why Spark and Cassandra? Because they are mature distributed systems, highly available by design, easy to integrate, and equipped with ready-made extension interfaces. The paper also mentions Simba, another Spark-based system, which manages spatio-temporal data and is similar in spirit to ModelarDB.

Usage

Query: submit the ModelarDB jar as a Spark job; Spark distributes the jar and executes it in parallel, which amounts to a distributed time series query.

Import: run the main class directly with java -jar; it starts a SparkSession in Spark local mode and writes the data to Cassandra.

Fault tolerance

The author does discuss fault tolerance, but because the system integrates existing distributed systems, it only considers the architecture level and leaves out the details, such as what happens when a Cassandra node dies.

There are only three places where data can be lost: (1) during import, (2) in memory, and (3) on disk. Each case has a different remedy.

(1) The first approach is to buffer the data in Kafka, so that if ModelarDB fails during import the data is still in Kafka. As a solution it feels like a cop-out, since it has nothing to do with ModelarDB itself, but it is very practical, and I would make the same choice in a real deployment.

The alternative is parallel import on multiple nodes (the author does not elaborate; I take it to mean handing the same data to several nodes to parse simultaneously, so that with identical keys only one copy survives), but that is resource-hungry and unnecessary.

(2) and (3) rely on the replication that Spark and Cassandra already provide. Cassandra's replicas are self-explanatory; it is a database, after all. But what are Spark's replicas? Personally I take this to mean Spark's RDD fault tolerance: if an RDD partition is lost, it is recomputed from its lineage.

And to keep the import fast, the author imports on a single node and accepts that some data may be lost; Kafka is not actually used. The fault-tolerance mechanisms of Spark and Cassandra are used directly, without modification.

So fault tolerance is really only discussed at the architecture level, and no extra work was actually done. That is the advantage of building on existing systems: even the parts you did not build yourself still count among the system's features.

Model compression example

When data is imported, it is automatically split according to the characteristics of the time series, producing many segments. This part is the focus of the paper; the rest is more of an engineering exercise.

The compression method proposed by ModelarDB strikes a balance between high compression ratio and low latency. Latency here is the stream-processing notion of a time window; in this paper it means the maximum number of points not yet visible to queries.

For example:

The system has three layers. The top layer is the segment generator, which holds a buffer of incoming data points (solid lines are buffered points, dotted lines are deleted ones). Here the maximum latency is set to 3 points, i.e., at most the 2 most recent points are invisible. When the third point arrives, a temporary segment (ST) must be created in memory to support queries.

T marks the last point of the in-memory segment. At t3 in the figure above, a segment is generated and copied to the cache; at that moment all points up to t3 are visible. When t4 arrives it can be added to the existing segment, but there is no hurry to make it visible to users, so it is left pending. Once the current segment has accumulated another 3 points, the cached copy is updated again.

If a point falls outside the user-defined error threshold, the current segment is closed, the cache is updated, and the buffer is cleared. In the figure, the last point of such a finalized segment is marked F.

Ye is the number of points in the buffer not yet covered by a segment. When the segments in the cache reach a certain size, they are flushed to persistent storage.
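
A minimal sketch of just the visibility rule described above (my reconstruction of the description, not the paper's code):

```scala
// With maxLatency = 3, a temporary segment (ST) is pushed to the cache on
// every third buffered point, so at most 2 points are ever invisible.
class VisibilityTracker(maxLatency: Int) {
  private var pending = 0                    // ingested but not yet queryable
  private var visibleThrough = Long.MinValue // ts of the last visible point

  def ingest(p: DataPoint): Unit = {
    pending += 1
    if (pending == maxLatency) {  // emit/update the ST in the cache
      visibleThrough = p.ts       // everything up to p is now queryable
      pending = 0
    }
  }
  def isVisible(ts: Long): Boolean = ts <= visibleThrough
}
```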

Model-agnostic compression: a model-independent compression algorithm

The example above used only one model. The actual algorithm supports multiple models for compressing one time series. With multiple models, one round of producing a finalized segment goes as follows:

Take a point from the time series and append it to the buffer. Try the current model first; when a new point cannot be represented by it, try to represent all buffered points with the next model. Once every model has been tried, select the one with the highest compression ratio and flush it to the cache as the finalized segment (SF).
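
Here is a compact sketch of that loop with two toy models (a PMC-like constant model and a swing-like linear model). The interface and both models are my simplifications of the idea, not the paper's classes, and for brevity each round re-fits every model over the buffered prefix instead of streaming the fall-through:

```scala
// A model eats points until one breaks its error bound.
trait Model {
  def tryAppend(p: DataPoint): Boolean // false: point not representable
  def length: Int                      // points represented so far
  def paramBytes: Int                  // size of the model's parameters
  def ratio: Double = length * 16.0 / paramBytes // vs. raw 16-byte (ts, v)
}

// Constant model: every value stays within maxError of the first one.
class ConstantModel(maxError: Double) extends Model {
  private var base = 0.0; private var n = 0
  def tryAppend(p: DataPoint): Boolean =
    if (n == 0) { base = p.value; n = 1; true }
    else if (math.abs(p.value - base) <= maxError) { n += 1; true }
    else false
  def length = n; def paramBytes = 8
}

// Linear model: a line through the first two points; later points must
// stay within maxError of that line.
class LinearModel(maxError: Double) extends Model {
  private var t0 = 0L; private var v0 = 0.0
  private var slope = 0.0; private var n = 0
  def tryAppend(p: DataPoint): Boolean = n match {
    case 0 => t0 = p.ts; v0 = p.value; n = 1; true
    case 1 => slope = (p.value - v0) / (p.ts - t0); n = 2; true
    case _ =>
      if (math.abs(p.value - (v0 + slope * (p.ts - t0))) <= maxError) {
        n += 1; true
      } else false
  }
  def length = n; def paramBytes = 24
}

// Let every candidate eat the longest prefix it can, then emit the model
// with the best compression ratio (not necessarily the longest prefix)
// as a finalized segment (SF); repeat on the remaining points.
def segment(points: Seq[DataPoint],
            mkModels: () => Seq[Model]): Seq[(Model, Int)] = {
  var rest = points
  val out = scala.collection.mutable.ArrayBuffer[(Model, Int)]()
  while (rest.nonEmpty) {
    val fits = mkModels().map(m => (m, rest.takeWhile(m.tryAppend).length))
    val (best, n) = fits.maxBy(_._1.ratio)
    out += ((best, n max 1)) // cover at least one point to guarantee progress
    rest = rest.drop(n max 1)
  }
  out.toSeq
}

// Reusing `series` from the earlier sketch:
val segments = segment(series, () => Seq(new ConstantModel(1.0),
                                         new LinearModel(1.0)))
```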

Consider the example: suppose the buffer holds a few points and all three models have been tried. The winner is not necessarily the one covering the most points; here model2 has the highest compression ratio, so it is the one flushed. What matters is who eats well, not who eats most.

Continuing the example: model2 wins the first round, so segment1 is flushed to the cache, and then all three models start eating again from the fourth point. This time model3 compresses best, so segment2 is flushed. Segment numbering simply starts at 1; it has nothing to do with model ids.

This is the model-agnostic compression algorithm; in essence, it selects the best model dynamically.

The models are extensible too: anyone can implement ModelarDB's model interface to add a new model, which keeps things flexible.

Query mode

ModelarDB provides two views to support queries: a segment view (segment id, start time, end time, sampling interval, model id, model parameters) and a point view (segment id, timestamp, value). The two views are two table schemas, and SQL is written against them.
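
For illustration, queries against the two views might look like the following; the view and column names here are paraphrased from the description above, not checked against the released code:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("modelardb-demo").master("local[*]").getOrCreate()

// Point view: looks like raw data, but points are reconstructed from models.
spark.sql("""
  SELECT sid, AVG(value) AS avg_value
  FROM DataPoint            -- hypothetical point-view name
  WHERE timestamp >= 1000 AND timestamp < 2000
  GROUP BY sid
""").show()

// Segment view: operate on the compressed segments directly.
spark.sql("""
  SELECT sid, start_time, end_time, mid
  FROM Segment              -- hypothetical segment-view name
  WHERE end_time - start_time > 3600000
""").show()
```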

The point interface is ultimately implemented on top of segments as well, so it suffices to consider segment queries.

Optimizing row reconstruction

This is a very engineering-flavored trick, used to speed up row reconstruction.

A SparkSQL query selects some columns of a view and hands the work to ModelarDB; the results must then be handed back to SparkSQL row by row, which is essentially what SparkSQL's interface requires.

Assembling each row by looking values up one by one against the column names SparkSQL supplies is laborious. So the author provides a function that takes a data point and returns a row directly.

How is this function generated? Take the point view (segment id, timestamp, value) as an example, with the columns indexed 1, 2, 3 respectively.

First, the requested columns are mapped to a concatenation of their indices in the point view: a query for (timestamp, value) becomes "23", and (value, segment id) becomes "31".

A function is written by hand for every combination. Because no view has more than 10 columns and the table schemas are fixed, the optimization is feasible and the workload acceptable. It would not work if the schemas were not fixed or the columns too numerous.
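
A sketch of the trick for the point view (the key encoding and all names are mine):

```scala
import org.apache.spark.sql.Row

// Point view columns, indexed 1 = sid, 2 = timestamp, 3 = value.
case class Point(sid: Int, timestamp: Long, value: Float)

// One hand-written builder per column combination, keyed by the
// concatenated column indices; chosen once per query, not once per row.
val rowBuilders: Map[String, Point => Row] = Map(
  "123" -> (p => Row(p.sid, p.timestamp, p.value)),
  "23"  -> (p => Row(p.timestamp, p.value)),  // query was (timestamp, value)
  "31"  -> (p => Row(p.value, p.sid))         // query was (value, sid)
)

val build = rowBuilders("23")
val row   = build(Point(1, 100L, 28.3f)) // no per-row column-name lookups
```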

Underlying storage

The table structure in Cassandra consists of three tables: the Time Series table stores each series' id and sampling interval, the Segment table stores segment information, and the Model table stores model information.

One time series corresponds to many segments, and one model corresponds to many segments. Predicate pushdown is possible, again courtesy of the Spark-Cassandra-Connector.
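
As a rough sketch, the three tables can be pictured with the following shapes (the column names are my guesses from the description, not the actual released schema):

```scala
// Hypothetical shapes of the three Cassandra tables.
case class TimeSeriesRow(tid: Int, samplingInterval: Long) // one per series
case class SegmentRow(tid: Int,                            // many per series
                      startTime: Long, endTime: Long,
                      mid: Int,                             // many per model
                      params: Array[Byte], gaps: Array[Byte])
case class ModelRow(mid: Int, className: String)            // model catalog
```

Presumably partitioning segments by series id is what makes pushing predicates on id and time range down to Cassandra cheap through the connector.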

Comparison

Compression ratio: representing data with models instead of raw points naturally compresses extremely well; the paper compares against other popular time series databases and big data file formats.

Write speed: it crushes the other systems and file formats, which goes without saying; ModelarDB does not store raw points, so its I/O advantage is large.

Limitations

Only fixed-frequency data is supported, which feels like a death sentence for many scenarios.

The paper opens by saying that industrial settings are messy, with missing and out-of-order data, but it never gives a solution for out-of-order arrival.

For a given time series, every model is tried on every segment. Write speed is therefore proportional to the number of models: more candidate models mean slower writes, which the author does not mention.

Personally I find lossy compression hard to accept; I have never seen a practical database that is lossy.
