What is the principle of InfluxDB?

2025-07-15 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/02 Report--

This article explains the principles behind InfluxDB: what a time series database has to do, how an LSM tree works, and how InfluxDB's storage engine and data model build on it.

References

LSM Tree explained in detail: https://zhuanlan.zhihu.com/p/181498475

https://www.jianshu.com/p/a3a2f8f5dd65

http://hbasefly.com/2017/12/08/influxdb-1/?qytefg=c4ft23

http://hbasefly.com/2017/11/19/timeseries-database-2/?fypwxu=zot6w

https://blog.fatedier.com/2016/08/05/detailed-in-influxdb-tsm-storage-engine-one/

What is a time series database

Data pattern: time series data grows continuously over time, the values of the same dimensions repeat, and the metrics change smoothly.

Write: continuous, highly concurrent writes with no updates. A time series database often has to ingest real-time data from millions or even tens of millions of terminal devices (for example, Mobike had 10 million vehicles in 2017), but most of the data represents device status and is never updated after it is written.

Query: statistical analysis of metrics along different dimensions. There is a clear split between hot and cold data, and usually only recent data is queried frequently.

So a time series database needs to solve the following problems:

Writing time series data: how to support tens of millions of data points per second.

Reading time series data: how to support grouping and aggregation over hundreds of millions of data points within seconds.

Cost sensitivity: storing massive amounts of data is expensive, so storing it at a lower cost is a top priority for a time series database.

LSM tree

Put simply, an LSM tree splits one big tree into N small trees. Data is first written into memory. As the small trees in memory grow, they are flushed in batches to independent files on disk, which improves IO performance. To improve read performance, the trees on disk are periodically merged into a bigger tree.

MemTable

MemTable is the in-memory counterpart of the WAL file: it holds the contents of that file in memory and is usually implemented as a SkipList. It provides an interface for writing, deleting and reading key-value data. Internally, the key-value pairs are kept sorted by key, so they can be serialized quickly to an SSTable file while preserving the order of the data.

Immutable MemTable

As the name implies, an Immutable MemTable is a read-only MemTable in memory. Because memory is limited, a threshold is usually set. When the memory occupied by the MemTable reaches the threshold, it is automatically converted to an Immutable MemTable. The difference between the two is that the Immutable MemTable is read-only, and the system creates a new MemTable to accept further writes. The Immutable MemTable exists so that write operations are not blocked while the contents of the MemTable are being serialized to disk.

SSTable

An SSTable is the ordered on-disk form of the data in a MemTable; its entries are arranged in ascending key order. To speed up lookups, a data index is usually added to the SSTable so that a given key-value entry can be located and read quickly.
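The interplay of MemTable, Immutable MemTable and SSTable can be sketched in a few lines of Python. This is a toy model under loud assumptions: the class name TinyLSM and its methods are made up for illustration, a plain dict stands in for the SkipList, "disk" is just a list of sorted lists, there is no WAL, and the flush runs inline rather than in the background.

```python
import bisect


class TinyLSM:
    """Toy LSM tree: a mutable MemTable, an immutable one being flushed, and sorted SSTables."""

    def __init__(self, memtable_limit=3):
        self.memtable_limit = memtable_limit  # flush threshold, counted in keys
        self.memtable = {}                    # mutable, receives all writes
        self.immutable = None                 # read-only MemTable waiting to be flushed
        self.sstables = []                    # oldest first; each is a sorted list of (key, value)

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._rotate_and_flush()

    def _rotate_and_flush(self):
        # Freeze the full MemTable and hand writes over to a fresh one.
        self.immutable, self.memtable = self.memtable, {}
        # Serialize in key order. A real engine does this in the background,
        # which is exactly why the frozen copy must not accept writes.
        self.sstables.append(sorted(self.immutable.items()))
        self.immutable = None

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        if self.immutable is not None and key in self.immutable:
            return self.immutable[key]
        for table in reversed(self.sstables):      # newest SSTable first
            i = bisect.bisect_left(table, (key,))  # binary search on the sorted keys
            if i < len(table) and table[i][0] == key:
                return table[i][1]
        return None

    def compact(self):
        # Merge all SSTables into one; later (newer) tables overwrite older values.
        merged = {}
        for table in self.sstables:
            merged.update(table)
        self.sstables = [sorted(merged.items())]
```

Reads check the newest data first, so an updated key shadows its older copies until compact() merges the files and drops them.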

TSM (Time-Structured Merge Tree)

The underlying storage engine of InfluxDB moved from LevelDB to BoltDB and then to the self-developed TSM. The reasoning behind each change is described in the official documentation and is worth studying: how a technology is chosen and then replaced is far more instructive than a plain description of a product's features.

In the first stage InfluxDB used LevelDB. The main reason was that LevelDB is built on an LSM tree, which is very write-friendly and provides high write throughput, a good match for time series data. In LevelDB, data is stored as key-value pairs sorted by key. InfluxDB's key is the combination SeriesKey+Timestamp, so the data of one SeriesKey is stored sorted by timestamp, which makes scans over a time range efficient.

However, the biggest problem with LevelDB is that InfluxDB supports automatic deletion of historical data (Retention Policy). For time series data, automatic deletion usually means removing large contiguous chunks of historical data. LevelDB supports neither range delete nor TTL (time to live), so data can only be deleted one key at a time, which generates heavy delete traffic. Moreover, in an LSM structure physical deletion is not immediate: it only takes effect at compaction.
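Why a SeriesKey+Timestamp key makes time-range scans cheap can be shown with a sorted list and binary search. This is a sketch only: the function name range_scan and the sample series are hypothetical, and Python tuples stand in for the byte-string keys a real key-value store would use.

```python
import bisect


def range_scan(sorted_keys, series_key, start, end):
    """Return the keys of one series whose timestamp lies in [start, end]."""
    lo = bisect.bisect_left(sorted_keys, (series_key, start))
    hi = bisect.bisect_right(sorted_keys, (series_key, end))
    return sorted_keys[lo:hi]  # one contiguous slice, no filtering needed


# Keys sort by series first and by timestamp second, as in the text.
keys = sorted(
    (series, ts)
    for series in ("cpu,host=a", "cpu,host=b")
    for ts in (100, 200, 300)
)

print(range_scan(keys, "cpu,host=a", 150, 300))
# [('cpu,host=a', 200), ('cpu,host=a', 300)]
```

Because all points of a series are adjacent and already in time order, the scan is two binary searches plus a sequential read.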
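The cost of key-by-key deletion can be illustrated with a small model. It assumes, as is typical for LSM stores, that a delete is recorded as a tombstone marker and that the data physically disappears only at compaction; the names TOMBSTONE, delete_before and compact are hypothetical, and a dict stands in for the store.

```python
TOMBSTONE = object()  # marker written in place of a deleted value


def delete_before(store, series_key, cutoff):
    """Expire old points without range delete: one tombstone write per key."""
    writes = 0
    for key in list(store):
        series, ts = key
        if series == series_key and ts < cutoff and store[key] is not TOMBSTONE:
            store[key] = TOMBSTONE
            writes += 1
    return writes


def compact(store):
    """Only compaction physically drops the tombstoned entries."""
    return {k: v for k, v in store.items() if v is not TOMBSTONE}


store = {("cpu", ts): float(ts) for ts in range(10)}
writes = delete_before(store, "cpu", cutoff=7)
print(writes, len(store), len(compact(store)))  # 7 10 3
```

Expiring seven points costs seven extra writes, and the store does not shrink until compaction runs.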

Time series database structure

OpenTSDB/HBase

OpenTSDB stores time series data in HBase. At the HBase level, the RowKey is designed as metric+timestamp+datasource (tags).

Problem 1: many useless fields. In a KeyValue only the rowkey is useful. The other fields, such as columnfamily, column, timestamp and keytype, carry no practical meaning here, but HBase's storage format requires them, so they consume a lot of storage.

Problem 2: redundant data sources and metrics. The rowkey in each KeyValue is metric+timestamp+datasource. Consider one metric from one data source that keeps emitting samples over time. In principle all of these samples share the same data source (datasource) and metric, but HBase's storage format cannot express that sharing, so the data source and the metric are stored redundantly with every sample.

Problem 3: data cannot be compressed effectively. HBase provides block-level compression algorithms such as snappy and gzip, but these general-purpose algorithms are not designed for time series data and compress it relatively poorly. HBase also provides encodings such as FastDiff, which help somewhat but not much. The main reason is that HBase has no concept of data type or schema, so it cannot apply an encoding tailored to a specific data type and has to fall back on a general-purpose one.

Problem 4: multi-dimensional queries are not fully supported. HBase has no schema and currently no inverted index. Every query must specify the metric, the timestamp, and either the complete tags or a tag prefix, so querying by a suffix dimension is difficult.
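A rough byte count makes the redundancy in Problem 2 concrete. The sketch below compares repeating metric and tags in every key with writing them once per series. It uses plain strings and an assumed 8-byte timestamp, which is not how HBase or OpenTSDB actually encode keys; the function names and sample values are hypothetical and the numbers are only illustrative.

```python
def tag_string(tags):
    return ",".join(f"{k}={v}" for k, v in sorted(tags.items()))


def per_point_key_bytes(metric, tags, timestamps):
    """Key bytes when every point repeats metric + timestamp + tags."""
    suffix = tag_string(tags)
    return sum(len(f"{metric}{ts}{suffix}".encode()) for ts in timestamps)


def per_series_key_bytes(metric, tags, timestamps, ts_size=8):
    """Key bytes when metric + tags are written once and only timestamps repeat."""
    return len(f"{metric}{tag_string(tags)}".encode()) + ts_size * len(timestamps)


metric = "cpu.usage"
tags = {"host": "web01", "dc": "east"}
timestamps = range(1700000000, 1700001000)  # 1000 samples of one series

print(per_point_key_bytes(metric, tags, timestamps))   # 37000
print(per_series_key_bytes(metric, tags, timestamps))  # 8027
```

The gap grows with the number of samples per series and with the length of the tag set.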
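To show what an encoding tailored to a data type can look like, here is delta encoding applied to timestamps. It is used purely as one well-known type-aware technique; the text does not say which encodings a particular database uses, and the function names are hypothetical.

```python
def delta_encode(timestamps):
    """Keep the first timestamp, then store only the gap to the previous one."""
    if not timestamps:
        return []
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return [timestamps[0]] + gaps


def delta_decode(deltas):
    """Rebuild the original timestamps with a running sum."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


ts = [1700000000, 1700000010, 1700000020, 1700000030]
encoded = delta_encode(ts)
print(encoded)  # [1700000000, 10, 10, 10]
assert delta_decode(encoded) == ts
```

Regularly sampled timestamps collapse into a run of small, identical integers, which any later compression step handles far better than the raw values. Knowing that the column is a timestamp is what makes this possible.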

InfluxDB

Compared with OpenTSDB and Druid, InfluxDB may be less familiar to many readers, yet it ranks far ahead among time series databases. InfluxDB is a purpose-built time series database that stores only time series data, so its data model and storage can be heavily optimized for that kind of data.

To keep writes efficient, InfluxDB also uses an LSM structure: data is written to memory first and flushed to a file when memory usage reaches a threshold. InfluxDB introduces a very important concept in its time series data model: the seriesKey, which is essentially measurement+datasource (tags). After time series data is written to memory, it is organized by seriesKey.

In memory the data is effectively a Map<SeriesKey+fieldKey, List<Timestamp|Value>>: each SeriesKey+fieldKey maps to a List that holds that timeline's data. When data arrives, the SeriesKey is assembled from measurement+datasource (tags), the fieldKey is appended, and the Timestamp|Value pair is added to the timeline's List. When the in-memory data is flushed to a file, the timeline data of one SeriesKey is written to the same Block, so all data in a Block belongs to the same field of the same data source.

One might object that combining datasource (tags) and metric into a SeriesKey makes multi-dimensional lookup impossible. That would be true, except that InfluxDB maintains an inverted index internally, mapping each tag to its SeriesKeys. To search by a tag, first look up the matching SeriesKeys in the inverted index, then locate the timeline data by SeriesKey. This storage engine is called TSM (Time-Structured Merge Tree), and its basic principle is similar to LSM. Later, the author will introduce the data writing, file format, inverted index and data reading of InfluxDB.
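The in-memory layout described above can be modelled as a dictionary of lists. The class MemCache and its methods are hypothetical names for illustration, the series key format is an assumption, and a "block" is simply a (key, points) pair rather than InfluxDB's actual file format.

```python
from collections import defaultdict


class MemCache:
    """Map of (SeriesKey, fieldKey) -> list of (timestamp, value) pairs."""

    def __init__(self):
        self.series = defaultdict(list)

    @staticmethod
    def series_key(measurement, tags):
        # SeriesKey = measurement + tags, with tags in a fixed (sorted) order.
        tag_part = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
        return f"{measurement},{tag_part}" if tag_part else measurement

    def write(self, measurement, tags, field_key, timestamp, value):
        key = (self.series_key(measurement, tags), field_key)
        self.series[key].append((timestamp, value))

    def flush(self):
        # One block per (SeriesKey, fieldKey), points ordered by timestamp.
        blocks = [
            (key, sorted(points, key=lambda p: p[0]))
            for key, points in sorted(self.series.items())
        ]
        self.series.clear()
        return blocks
```

Every block produced by flush() holds a single field of a single series, so the series key is stored once per block instead of once per point, and all values in the block share one type.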
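The tag-to-SeriesKey lookup can be sketched as a small inverted index. TagIndex and its methods are hypothetical names, and sets in memory stand in for whatever structure the real index uses.

```python
from collections import defaultdict


class TagIndex:
    """Inverted index: (tag key, tag value) -> set of SeriesKeys."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, series_key, tags):
        for pair in tags.items():
            self.postings[pair].add(series_key)

    def lookup(self, **conditions):
        # Intersect the posting sets of all requested tag=value conditions.
        sets = [self.postings.get(pair, set()) for pair in conditions.items()]
        return set.intersection(*sets) if sets else set()


index = TagIndex()
index.add("cpu,dc=east,host=a", {"dc": "east", "host": "a"})
index.add("cpu,dc=east,host=b", {"dc": "east", "host": "b"})
index.add("cpu,dc=west,host=c", {"dc": "west", "host": "c"})

print(sorted(index.lookup(dc="east")))
# ['cpu,dc=east,host=a', 'cpu,dc=east,host=b']
```

A query on any single tag, including one that sits at the end of the SeriesKey, resolves to a set of SeriesKeys first; the timeline data is then read by SeriesKey as usual.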

InfluxDB data model

Measurement: conceptually similar to a table in SQL.

Tags: dimension columns.

In InfluxDB, the combination of Tags in a table acts as the primary key of a record, so the primary key is not unique. For example, the first and third rows of the table above share the primary key 'location=1,scientist=langstroth'. Every time series query ultimately looks up by primary key and then filters by timestamp.

Fields: numeric columns, which store the user's time series values.

Point: similar to a row in SQL, rather than a single data value.
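The four concepts fit together as in the following sketch. Only the tag combination 'location=1,scientist=langstroth' comes from the text; the measurement name, the field names, the other rows and all values are made up for illustration, and the Point class is not a real client API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Point:
    measurement: str  # like a table name
    tags: dict        # dimension columns
    fields: dict      # numeric columns
    timestamp: int

    def primary_key(self):
        return ",".join(f"{k}={v}" for k, v in sorted(self.tags.items()))


def group_by_primary_key(points):
    """Collect the timestamps of all points that share a tag combination."""
    groups = defaultdict(list)
    for p in points:
        groups[p.primary_key()].append(p.timestamp)
    return dict(groups)


points = [
    Point("census", {"location": "1", "scientist": "langstroth"}, {"honeybees": 23}, 1),
    Point("census", {"location": "1", "scientist": "perpetua"}, {"honeybees": 30}, 1),
    Point("census", {"location": "1", "scientist": "langstroth"}, {"honeybees": 28}, 2),
]

print(group_by_primary_key(points)["location=1,scientist=langstroth"])  # [1, 2]
```

Two rows share one primary key, which is why a query locates the key first and then narrows the result by timestamp.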

InfluxDB system architecture
