Introduction to Time Series Databases (TSDB) and How to Choose One

Updated 2025-07-06. From: SLTechnology News&Howtos

Shulou(Shulou.com)06/03 Report--

To help you learn more about time series databases, this article summarizes what they are and how to choose one.

Background

Over the past two years, a new trend has emerged in the Internet industry, and we keep hearing all sorts of high-end new terms: big data, artificial intelligence, the Internet of Things, machine learning, business intelligence, intelligent early warning, and so on.

Earlier systems did data visualization, information management, and process control. Businesses are no longer satisfied with this kind of simple management and control. Visual data analysis, big data mining, statistical prediction, modeling and simulation, and intelligent control have become what every kind of business now pursues.

"Everything disappears in time like tears, time is dying." In the past, we used the Internet to solve real-world problems. Now we are no longer satisfied with the present alone: by linking data into time series, we can look back over its history, reveal its patterns, and then grasp and predict its trend.

So we began to store large amounts of time-related data (such as logs and user behavior), summarized the structural characteristics and common usage scenarios of this data, and kept improving and optimizing. The result is a new category of database: the time series database (Time Series Database, TSDB).

Time series model

A time series database is mainly used to handle data that carries a timestamp (data that changes in time order, that is, serialized by time). Such timestamped data is also called time series data.

Each data point in a time series has the following structure:

* Timestamp: the time of the data point, indicating when the data occurred.
* Metric: the name of the metric, which identifies the data; some systems call it name.
* Value: the value of the data, generally a double, such as CPU usage or traffic. Some systems allow only one value per data point, so multiple values become multiple time series. Other systems allow several values per point, distinguished by different keys.
* Tag: auxiliary attributes attached to the data point.
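The fields above can be sketched as a small Python structure. This is only an illustration of the model, not the format of any particular database; the names `DataPoint`, `make_point`, and `series_key` are hypothetical, and the single-value variant is assumed.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class DataPoint:
    """One point in a time series, using the four fields listed above."""
    timestamp: int                           # Unix time in seconds (assumed unit)
    metric: str                              # metric name, e.g. "cpu.usage"
    value: float                             # single-value model: one value per point
    tags: Tuple[Tuple[str, str], ...] = ()   # auxiliary attributes as sorted (key, value) pairs


def make_point(timestamp: int, metric: str, value: float, **tags: str) -> DataPoint:
    # Sorting the tags makes the tag set independent of argument order.
    return DataPoint(timestamp, metric, float(value), tuple(sorted(tags.items())))


def series_key(point: DataPoint) -> str:
    """Points that share a metric and a tag set belong to the same series."""
    tag_part = ",".join(f"{k}={v}" for k, v in point.tags)
    return f"{point.metric}{{{tag_part}}}"
```

Two points with the same metric and tags map to the same series key regardless of timestamp, which is what lets a store group them into one series.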

Implementation

For example, suppose I want to record time series data from a set of sensors. The data structure is as follows:

* Identifiers: device_id, timestamp
* Metadata: location_id, dev_type, firmware_version, customer_id
* Device indicators: cpu_1m_avg, free_mem, used_mem, net_rssi, net_loss, battery
* Sensor indicators: temperature, humidity, pressure, CO, NO2, PM10

If you use traditional RDBMS storage, you can create a table with the following structure:
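One possible shape for such a table is sketched below with Python's built-in `sqlite3` module, used here as a stand-in for a traditional RDBMS. The column names come from the sensor example above; the table name `sensor_data`, the column types, the composite primary key, and the helper functions are assumptions made for this sketch.

```python
import sqlite3

# Column names follow the sensor example; types and the primary key are assumed.
SCHEMA = """
CREATE TABLE sensor_data (
    device_id        TEXT    NOT NULL,
    timestamp        INTEGER NOT NULL,
    location_id      TEXT,
    dev_type         TEXT,
    firmware_version TEXT,
    customer_id      TEXT,
    cpu_1m_avg       REAL,
    free_mem         REAL,
    used_mem         REAL,
    net_rssi         REAL,
    net_loss         REAL,
    battery          REAL,
    temperature      REAL,
    humidity         REAL,
    pressure         REAL,
    co               REAL,
    no2              REAL,
    pm10             REAL,
    PRIMARY KEY (device_id, timestamp)
)
"""


def create_store() -> sqlite3.Connection:
    """Create an in-memory database containing the sensor table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    return conn


def latest_temperature(conn: sqlite3.Connection, device_id: str):
    """Return (timestamp, temperature) of the newest row for a device, or None."""
    return conn.execute(
        "SELECT timestamp, temperature FROM sensor_data "
        "WHERE device_id = ? ORDER BY timestamp DESC LIMIT 1",
        (device_id,),
    ).fetchone()
```

Keying the table on (device_id, timestamp) keeps one row per device per instant and lets the "latest value for a device" query use the primary key index.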

This is the simplest possible time series database. But it only satisfies the data model. We still need to do more in terms of performance, efficient storage, high availability, distribution, and ease of use.

You can think about how you would design a time series database, what performance optimizations you would consider, how to achieve high availability, and how to make it easy to use.

Timescale

Timescale is a time series database built by adapting the traditional relational database PostgreSQL. Readers who know PostgreSQL know that it is a powerful, open source, extensible database system.

So Timescale Inc. developed Timescale, a SQL-compatible time series database whose underlying storage is built on PostgreSQL. It is delivered as a PostgreSQL extension. Its characteristics are as follows:

Basics:

* All SQL natively supported by PostgreSQL is available, including the complete SQL interface (secondary indexes, non-time aggregations, subqueries, JOINs, window functions).
* PostgreSQL clients and tools can be used with the database directly, without changes.
* Time-oriented features, API functions, and corresponding optimizations.
* Reliable data storage.

Scalability:

* Transparent time/space partitioning, for both scaling up (single node) and scaling out.
* High data write rates (including batch commits, in-memory indexes, transaction support, and data backup support).
* Appropriately sized chunks (two-dimensional data partitions) on a single node, so that data can still be read quickly even when there is a lot of it.
* Parallel operations across chunks and servers.
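To make the idea of two-dimensional (time/space) partitions concrete, here is a minimal sketch of how a row could be assigned to a chunk. It is not TimescaleDB's actual algorithm: the one-day chunk width, the four space partitions, the use of CRC32 as the hash, and the function names are all assumptions for illustration.

```python
import zlib

CHUNK_SECONDS = 24 * 60 * 60   # assumed chunk width: one day
SPACE_PARTITIONS = 4           # assumed number of space partitions


def chunk_for(timestamp: int, device_id: str) -> tuple:
    """Map a row to a (time bucket, space bucket) chunk."""
    time_bucket = timestamp // CHUNK_SECONDS
    # crc32 is stable across runs, unlike the built-in hash() on strings.
    space_bucket = zlib.crc32(device_id.encode("utf-8")) % SPACE_PARTITIONS
    return (time_bucket, space_bucket)


def time_buckets_for_range(start: int, end: int) -> list:
    """Time buckets that a query over [start, end] has to touch."""
    return list(range(start // CHUNK_SECONDS, end // CHUNK_SECONDS + 1))
```

Because a time-range query only needs the buckets returned by `time_buckets_for_range`, older chunks are never scanned, which is the main reason time-based partitioning keeps reads fast as data grows.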

Disadvantages:

Because TimescaleDB does not use columnar storage, its compression of time series data is not very good; the compression ratio is about 4x. At present, it does not fully support distributed scale-out (the related features are under development), so it places high demands on the performance of a single server.

Timescale is worth a closer look. We are all familiar with RDBMSs, and studying it gives a deeper understanding of how an RDBMS is implemented and how it stores data. From its special handling of time series, we can learn the characteristics of time series data and how to optimize an RDBMS for the time series model.

A later article could take an in-depth look at the characteristics and implementation of this database.

InfluxDB

InfluxDB is a popular time series database in the industry, especially in IoT and monitoring. It is written in Go, and its hallmark is performance.

Features:

* Efficient write performance for time series data, with a custom TSM engine for fast writes and efficient compression.
* No additional storage dependencies.
* Simple, high-performance HTTP query and write API.
* Plugin support for ingesting data over many different protocols, such as Graphite, collectd, and OpenTSDB.
* A SQL-like query language that simplifies queries and aggregations.
* Tags are indexed, so time series can be queried quickly and efficiently.
* Retention policies remove expired data efficiently.
* Continuous queries compute aggregates automatically, making frequent queries more efficient.
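The last two features, retention policies and continuous queries, can be illustrated with a small conceptual sketch. This is plain Python, not InfluxDB's API or implementation; the function names and the (timestamp, value) tuple format are assumptions.

```python
from collections import defaultdict


def apply_retention(points, now, max_age):
    """Keep only points no older than max_age; points are (timestamp, value) pairs."""
    cutoff = now - max_age
    return [(ts, v) for ts, v in points if ts >= cutoff]


def downsample_mean(points, window):
    """Average values per fixed time window, as a continuous query might precompute."""
    buckets = defaultdict(list)
    for ts, v in points:
        buckets[ts - ts % window].append(v)   # align each point to its window start
    return [(start, sum(vs) / len(vs)) for start, vs in sorted(buckets.items())]
```

The two ideas usually work together: raw points are kept only for a short period, while the much smaller downsampled series is kept longer and answers the frequent dashboard-style queries.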

InfluxDB has made its distributed (clustered) version closed source. Distributed clustering is therefore a weakness of the open source version, and you have to implement it yourself.

OpenTSDB

"The Scalable Time Series Database." This is the first thing I saw when I opened the OpenTSDB official website: scalability is presented as its key feature. OpenTSDB runs on Hadoop and HBase and takes full advantage of HBase's features. It serves requests through independent Time Series Daemons (TSDs), so it can easily be scaled out or in by adding or removing service nodes.

OpenTSDB is a time series database based on HBase (the new version also supports Cassandra).

* Built on HBase's distributed column storage, it achieves high data availability and high performance.
* Constrained by HBase, it uses a lot of storage space and its compression is insufficient.
* It depends on a complete HBase and ZooKeeper stack.
* It uses a schemaless tagset data structure (sys.cpu.user 1436333416 23 host=web01 user=10001).
* The structure is simple, but it is not friendly to multi-value queries.
* Queries use an HTTP-based DSL.
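The tagset line shown above (metric, timestamp, value, then tag=value pairs) is simple enough to parse by hand. The sketch below only illustrates that structure; `parse_tagset_line` is a hypothetical helper, not part of OpenTSDB, and the validation rules are assumptions.

```python
def parse_tagset_line(line: str) -> dict:
    """Parse '<metric> <timestamp> <value> [tag=value ...]' into a dict."""
    parts = line.split()
    if len(parts) < 3:
        raise ValueError("expected: <metric> <timestamp> <value> [tag=value ...]")
    metric, timestamp, value = parts[0], int(parts[1]), float(parts[2])
    tags = {}
    for item in parts[3:]:
        key, sep, tag_value = item.partition("=")
        if not sep or not key or not tag_value:
            raise ValueError(f"malformed tag: {item!r}")
        tags[key] = tag_value
    return {"metric": metric, "timestamp": timestamp, "value": value, "tags": tags}
```

Each line carries exactly one value, which is why a reading with several measurements has to be written as several lines and why multi-value queries are awkward in this model.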

OpenTSDB's table design and RowKey design on HBase are well worth studying in depth. Interested readers can look for more detailed material.

Druid

Druid is a real-time online analytical processing (OLAP) system. Its architecture combines the characteristics of real-time online analytics, full-text search systems, and time series systems, so it can meet data storage needs in different scenarios.

* Column storage: supports efficient scans and aggregations, and makes data easy to compress.
* Scalable distributed system: Druid itself implements a scalable, fault-tolerant distributed cluster architecture. Deployment is simple.
* Powerful parallelism: Druid cluster nodes can serve queries in parallel.
* Real-time and batch ingestion: Druid can consume data in real time, for example through Kafka, and can also ingest data in bulk, for example by importing it through Hadoop.
* Self-healing, self-balancing, easy to operate and maintain: Druid's architecture provides fault tolerance and high availability. Nodes of each service type can be added or removed according to demand.
* Fault-tolerant architecture that protects against data loss: Druid can keep multiple replicas of data, and HDFS can be used as deep storage.
* Indexing: Druid builds inverted indexes and bitmap indexes on string columns, so it supports efficient filter and group-by operations.
* Time-based partitioning: Druid partitions raw data by time, so queries over time ranges are more efficient.
* Automatic pre-aggregation: Druid supports pre-aggregating data at ingestion time.

The Druid architecture is quite complex. The system is split by function into several kinds of services: query, data, and master services, each with its own responsibilities, are deployed independently and together present unified storage and query services. It provides the underlying data storage as a distributed cluster service.

Druid's architectural design is well worth studying. If you are interested not only in time series storage but also in distributed cluster architecture, take a look at Druid. The design of the segment (Druid's data storage structure) is another highlight: it implements both column storage and an inverted index.
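The idea behind a bitmap (inverted) index on a string column can be sketched in a few lines. This is a conceptual illustration only, not Druid's segment format: Python integers stand in for compressed bitmaps, and the function names are hypothetical.

```python
def build_bitmap_index(column):
    """Map each distinct value to an int whose bit i is set when row i holds that value."""
    index = {}
    for row, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << row)
    return index


def rows_of(bitmap):
    """Expand a bitmap into the list of row numbers whose bits are set."""
    rows, row = [], 0
    while bitmap:
        if bitmap & 1:
            rows.append(row)
        bitmap >>= 1
        row += 1
    return rows
```

With one index per column, a filter such as "host is web01 AND region is eu" becomes a bitwise AND of two bitmaps, and an OR filter becomes a bitwise OR, without scanning the rows themselves.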

Elasticsearch

Elasticsearch is a distributed open source search and analysis engine for all types of data, including text, numbers, geospatial, structured and unstructured data. Elasticsearch was developed on the basis of Apache Lucene and was first released by Elasticsearch N.V. (now Elastic) in 2010. Elasticsearch is known for its simple REST style API, distributed features, speed, and extensibility.

Elasticsearch is best known as part of the ELK stack. Many companies build log analysis and real-time search systems on ELK. We started by building a metrics monitoring system on top of ELK, which led to the idea of using Elasticsearch to store time series data. After we optimized the Elasticsearch mapping to better fit the time series data model, it worked well and fully met our business needs. We later found that newer versions of Elasticsearch had begun shipping Metrics and APM components, and that Elastic was heavily promoting its time series storage capability alongside full-text search. This matched what we had in mind at the time.

For tuning Elasticsearch for time series data, see the article "elasticsearch-as-a-time-series-data-store".

You can also take a look at Elasticsearch's Metric component: Elastic Metrics

Beringei

Beringei is a high-performance in-memory time series storage engine that Facebook open-sourced in 2017. It offers fast reads and writes and a high compression ratio.

In 2015, Facebook published the paper "Gorilla: A Fast, Scalable, In-Memory Time Series Database". Beringei is a time series database built on the ideas in that paper.

Beringei stores timestamps with delta-of-delta encoding and compresses values with XOR encoding, so it can hold a large amount of data in very little memory.
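The two transforms can be sketched as follows. This shows only the arithmetic that makes the data compressible; it is not Beringei's code, it omits the variable-length bit packing that a real encoder applies afterwards, and the function names are hypothetical.

```python
import struct


def delta_of_delta(timestamps):
    """Return the first timestamp, the first delta, then each delta-of-delta."""
    if len(timestamps) < 2:
        return list(timestamps)
    out = [timestamps[0], timestamps[1] - timestamps[0]]
    prev_delta = out[1]
    for prev, cur in zip(timestamps[1:], timestamps[2:]):
        delta = cur - prev
        out.append(delta - prev_delta)   # 0 when points arrive at a fixed interval
        prev_delta = delta
    return out


def float_bits(x):
    """Reinterpret a float as its 64-bit IEEE 754 integer pattern."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


def xor_stream(values):
    """First value's bits, then the XOR of each value's bits with the previous value's."""
    bits = [float_bits(v) for v in values]
    return bits[:1] + [a ^ b for a, b in zip(bits, bits[1:])]
```

Regularly sampled timestamps turn into long runs of zeros, and slowly changing values XOR to numbers with many leading and trailing zero bits; both patterns can then be stored in very few bits.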

How to choose a suitable time series database

Data model

There are generally two kinds of time series data models: a schemaless, multi-tag model, and a simple model of name, timestamp, and value. The former suits multi-value data and more complex business models. The latter suits one-dimensional data.

Query language

Most TSDBs currently support SQL-like queries over HTTP.

Reliability

Reliability is mainly reflected in the stability and high availability of the system, and in the high availability of data storage. An excellent system should have an elegant, highly available architecture that is simple and stable.

Performance

Performance is a factor we must consider. One major reason to look at more specialized data stores, besides data model requirements, is that general-purpose database systems cannot deliver the performance we need. Most time series databases favor write-heavy, read-light workloads, and users need to balance this against their own needs. A performance comparison of the databases is given below for reference.

Ecosystem

I have always thought that the ecosystem is something we must take seriously when choosing an open source component. A system with a strong ecosystem has more users and therefore fewer undiscovered pitfalls. If you run into problems and turn to the community for help, you can often get good solutions. With a good ecosystem, the surrounding tooling is also mature, which gives us more mature options when integrating with other systems.

Operational management

The system should be easy to operate and maintain.

Company and support

The company backing a system also matters. A strong company or organization behind a project gives better assurance of availability and of ongoing maintenance and updates.

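Performance comparison

Write (single node):

* Timescale: 15k/sec
* InfluxDB: 470k/sec
* OpenTSDB: 32k/sec
* Druid: 25k/sec
* Elasticsearch: 30k/sec
* Beringei: 10M/sec

Write (5 nodes): 128k/sec, 100k/sec, 120k/sec (only three values are given, without the systems they belong to)

Summary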

You can choose the appropriate storage according to the following requirements:

* Small and refined, high performance, small data volume (about 100 million): InfluxDB
* Simple, data volume not large (tens of millions), joint queries needed, relational database background: Timescale
* Large data volume, existing big data infrastructure, distributed cluster requirements: OpenTSDB, KairosDB
* Distributed cluster requirements, OLAP real-time online analysis, sufficient resources: Druid
* Performance as the ultimate goal, clear differences between hot and cold data: Beringei
* Both retrieval and loading, distributed aggregate computing: Elasticsearch

If you have both indexing and time series requirements, Druid and Elasticsearch are the best choices. Their performance is good, they support both retrieval and time series workloads, and both have highly available, fault-tolerant architectures.

Finally

From here, you can take a closer look at one or two TSDBs, such as InfluxDB, OpenTSDB, Druid, or Elasticsearch. Along the way you can learn the difference between row storage and column storage, how LSM trees work, how numerical data is compressed, how mmap improves read and write performance, and so on.

That concludes this introduction to time series databases and how to choose one. I hope it serves as a useful reference that you can put into practice. If you found it helpful, feel free to share it so more people can see it.
