2025-03-26 Update From: SLTechnology News&Howtos
OpenTSDB was designed to efficiently combine multiple distinct time series during query execution. The reason is that when users look at their data, they usually start with questions at a high level, such as "what is the total throughput of the data center" or "how much power is the current region consuming". After looking at these high-level values, one or more of them may stand out, prompting the user to drill down into finer-grained data sets, such as "what is the throughput per host in my LAX data center". We want to make those high-level questions easy to answer while still allowing drill-down into greater detail.
But how do you merge multiple separate time series into a single series of data? Aggregation functions provide a mathematical means of merging different time series into one. Filters are used to group the results by tag, and the aggregation is then applied to each group. Aggregation is similar to SQL's GROUP BY clause, where the user selects a predefined aggregation function to merge multiple records into a single result. In a TSD, however, the records are grouped and aggregated per timestamp.
Each aggregator has two components:
Function: the mathematical computation applied, such as summing all of the values, computing the average, or selecting the highest value.
Interpolation: a method of handling missing values, for example when time series A has a value at T1 but time series B does not.
This chapter focuses on how aggregators are used in the context of group by, i.e. when merging multiple time series into one. Aggregators can also be used to downsample time series (that is, to return a lower-resolution result set). For more information, see Downsampling.
Aggregation
When aggregating (grouping) a set of time series into one, the timestamps in every series are aligned. Then, for each timestamp, the values from all of the series are aggregated into a new value. That is, aggregation operates across the time series at each timestamp. Think of the raw data as a matrix or table, as in the following example, which illustrates how the sum aggregator works on two time series A and B that are merged to form a new Output series.
Time series  t0   t0+10s  t0+20s  t0+30s  t0+40s  t0+50s
A            5    5       10      15      20      5
B            10   5       20      15      10      0
Output       15   10      30      30      30      5
Summing the data points of A and B at timestamp t0 gives 5 + 10 = 15. In SQL, this would be:
SELECT SUM(value) FROM ts_table GROUP BY timestamp;
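The table above can be reproduced with a minimal sketch (this is illustrative Python, not OpenTSDB's implementation): group the data points of every series by timestamp, then apply the aggregation function to each group.

```python
# Sketch (not OpenTSDB code): sum-aggregate already-aligned series by timestamp.
from collections import defaultdict

def aggregate_sum(series_list):
    """Group data points by timestamp and sum the values at each one."""
    grouped = defaultdict(list)
    for series in series_list:
        for ts, value in series.items():
            grouped[ts].append(value)
    return {ts: sum(values) for ts, values in sorted(grouped.items())}

# The two series from the table above (seconds relative to t0):
A = {0: 5, 10: 5, 20: 10, 30: 15, 40: 20, 50: 5}
B = {0: 10, 10: 5, 20: 20, 30: 15, 40: 10, 50: 0}
print(aggregate_sum([A, B]))
# {0: 15, 10: 10, 20: 30, 30: 30, 40: 30, 50: 5}
```

This only works because every timestamp appears in both series; the next section covers what happens when they do not line up.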
Interpolation
In the example above, the time series A and B have data points at every timestamp, neatly lined up. But what happens when two series do not line up? Synchronizing all data sources to write at the same time can be difficult, and is sometimes undesirable. For example, if we have 10,000 servers each sending 100 system metrics every 5 minutes, that would be a burst of 1M data points in a single second. We would need a very beefy network and cluster to handle that traffic, not to mention that the system would sit idle for the other 4 minutes and 59 seconds. Instead, it makes far more sense to spread the writes over time, averaging about 3,333 writes per second and reducing our hardware and network requirements.
But how does a sum or average function operate on a value that does not exist? One option is to only emit a result where values exist for every series, but what about data sources like the thousands above whose data points simply never match up? For example, the following figure shows unaligned time series that produce confusing, jagged lines:
Missing data:
By "missing" we simply mean that a time series has no data point at a given timestamp. Usually the data is merely time-shifted before or after the requested timestamp, but it may genuinely be lost if the source or the TSD encountered an error and the data was never recorded. Some time series databases allow storing NaN at a timestamp to represent an unrecordable value, but OpenTSDB does not.
Alternatively, you could simply discard any timestamp at which one of the time series is missing a data point. But if two series are simply never aligned, the query would return an empty data set even though perfectly good data exists in storage, so this is not very useful either.
Another option is to define a scalar value (such as 0 or the maximum of a long) to substitute when a data point is missing. OpenTSDB 2.0 and later provide aggregation functions that substitute scalar values for missing data points. In fact, the figure above was generated with the zimsum aggregator, which substitutes 0 for unaligned values. Such substitution can be useful when working with discrete values, such as total sales over a given period, but not when computing averages or when trying to make graphs look good (smooth).
One solution OpenTSDB provides is to guess the value at that point in time using a well-defined numerical-analysis method: interpolation. Interpolation uses the existing data points in a time series to calculate a best-guess value at the required timestamp. Using OpenTSDB's linear interpolation, the unaligned graph can be smoothed into:
For a numerical example, look at these two time series, where each source emits a value every 20 seconds but the two are offset from each other by 10 seconds:
Time series  t0   t0+10s  t0+20s  t0+30s  t0+40s  t0+50s  t0+60s
A            na   5       na      15      na      5       na
B            10   na      20      na      10      na      20
When OpenTSDB computes an aggregation, it starts at the first data point found in any series, which in this case is the t0 data point of series B. We need a value for series A at t0, but there is none. We know series A has a value at t0+10s, but since we have no value before that, we cannot guess what the earlier one would have been. Therefore, we can only return the value of series B.
Next we look at timestamp t0+10s, where series A has a value but series B does not. Series B does have a value at t0+20s as well as one at t0, so we can now compute a guess for t0+10s. The formula for linear interpolation is y = y0 + (y1 - y0) * ((x - x0) / (x1 - x0)). For series B, y0 = 10, y1 = 20, x = t0+10s (or 10), x0 = t0 (or 0) and x1 = t0+20s (or 20), so: y = 10 + (20 - 10) * ((10 - 0) / (20 - 0)) = 15. So series B gives us an estimated value of 15 at t0+10s.
Iteration continues over every timestamp at which a data point was found in any of the series returned by the query. The resulting series produced by the sum aggregator would look like this:
Time series     t0   t0+10s  t0+20s  t0+30s  t0+40s  t0+50s  t0+60s
A               na   5       na      15      na      5       na
Interpolated A  -    -       10      -       10      -       -
B               10   na      20      na      10      na      20
Interpolated B  -    15      -       15      -       15      -
Summed result   10   20      30      30      20      20      20
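The whole procedure can be sketched in a few lines of Python (illustrative only, not OpenTSDB's source): interpolate each series at every timestamp where any series has a point, dropping a series when the timestamp falls outside its known range, then sum what remains.

```python
# Sketch (not OpenTSDB code): sum aggregation with linear interpolation.
def lerp_at(points, x):
    """Linearly interpolate a value at x from sorted (ts, value) pairs.
    Returns None when x lies outside the range of known points."""
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    return None

# The offset series from the example (seconds relative to t0):
A = [(10, 5), (30, 15), (50, 5)]
B = [(0, 10), (20, 20), (40, 10), (60, 20)]

timestamps = sorted({ts for s in (A, B) for ts, _ in s})
summed = [sum(v for v in (lerp_at(A, ts), lerp_at(B, ts)) if v is not None)
          for ts in timestamps]
print(summed)  # [10.0, 20.0, 30.0, 30.0, 20.0, 20.0, 20.0]
```

The result matches the "Summed result" row of the table above.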
More examples
Here is an example of how aggregation plays out on graphs. A fictional metric m is recorded in OpenTSDB. "Sum of m" is the top blue line produced by the query start=1h-ago&m=sum:m. It is made of the sum of the red line host=foo and the green line host=bar.
The picture above is intuitive: if you "stack" the red and green lines, you get the blue line. At any discrete point in time, the value of the blue line equals the sum of the red and green values at that time. Without interpolation, you would get something far less intuitive that is hard to make sense of, and not nearly as meaningful or useful.
Notice how the blue line drops toward the green data point at 18:46:48. You don't need to be a mathematician or have taken an advanced math course to see that interpolation is needed to correctly aggregate multiple time series and obtain meaningful results.
OpenTSDB currently supports mainly linear interpolation ("lerp" for short), along with some simple aggregators that substitute zeros or maximum/minimum values. Patches adding other interpolation algorithms are welcome.
Interpolation is only performed at query time, when multiple time series are found to match a query. Many metrics-collection systems interpolate on write, so you never record your exact original value; OpenTSDB stores the original value and lets you retrieve it at any time.
Here is another, slightly more complex example from the mailing list showing how multiple time series are aggregated by average:
The thick blue line with triangles is the avg aggregation of multiple time series from the query start=1h-ago&m=avg:duration_seconds. As we can see, the resulting series has a data point at every timestamp of all of the underlying series it aggregates, and each data point is computed by taking the average of all of the series' values at that timestamp. The same goes for the lone data point in the purple series (squares), which briefly raises the average until the next data point.
Note:
Aggregation functions return integer or double-precision values depending on the input data points. If both source values are stored as integers, the result is an integer. This means any fractional part of the computed value is dropped; no rounding occurs. If any data point is a floating-point value, the result is a float. However, if downsampling or rate calculation is enabled, the result is always a float.
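The truncation rule above can be mimicked in a short sketch (illustrative Python, not OpenTSDB code): integer inputs yield an integer result with the fraction dropped, while a single float input switches the result to floating point.

```python
# Illustration of the integer/float result rule (not OpenTSDB source).
def avg(values):
    """Average that mimics the note above: all-integer input truncates."""
    total = sum(values)
    if all(isinstance(v, int) for v in values):
        return total // len(values)   # decimals dropped, no rounding occurs
    return total / len(values)        # any float input yields a float result

print(avg([1, 2]))    # 1  (1.5 truncated, not rounded)
print(avg([1.0, 2]))  # 1.5
```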
Downsampling
As mentioned above, interpolation is one means of dealing with missing data, but some users dislike linear interpolation, seeing it as a way of making up data, of producing phantom values. Instead, another way of dealing with unaligned values is downsampling. For example, if sources report a value roughly every minute but with varying offsets within that minute, specifying a 1-minute downsample in the query gives every time series a value at the same timestamps, mostly avoiding interpolation. Interpolation still occurs when a downsample bucket is missing a value.
For more information and examples of avoiding interpolation, see Downsampling.
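A minimal sketch of the bucketing idea (illustrative Python, not OpenTSDB code): timestamps are snapped down to fixed-width boundaries, so points that were offset within the same interval end up sharing a timestamp.

```python
# Sketch of downsampling (not OpenTSDB source): bucket unaligned points into
# fixed intervals so series share timestamps without interpolation.
from collections import defaultdict

def downsample(points, interval, agg=sum):
    """Bucket (timestamp, value) pairs by interval and aggregate each bucket."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[(ts // interval) * interval].append(value)
    return {ts: agg(values) for ts, values in sorted(buckets.items())}

# Values reported about once a minute, but offset within each minute:
points = [(5, 1.0), (62, 2.0), (70, 2.5), (125, 3.0)]
print(downsample(points, 60))
# {0: 1.0, 60: 4.5, 120: 3.0}
```

The choice of `agg` here corresponds to the downsampling aggregator chosen in the query.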
Note:
In general, downsampling is a good idea for any query that includes multiple time series.
Available aggregators
The following describes the aggregation functions available in OpenTSDB. Note that some should only be used for group-by aggregation and others only for downsampling.
Aggregator  TSD Version  Description                                              Interpolation
avg         1.0          Averages the data points                                 Linear interpolation
count       2.2          The number of raw data points in the set                 Zero if missing
dev         1.0          Calculates the standard deviation                        Linear interpolation
ep50r3      2.2          Estimated 50th percentile with the R-3 method            Linear interpolation
ep50r7      2.2          Estimated 50th percentile with the R-7 method            Linear interpolation
ep75r3      2.2          Estimated 75th percentile with the R-3 method            Linear interpolation
ep75r7      2.2          Estimated 75th percentile with the R-7 method            Linear interpolation
ep90r3      2.2          Estimated 90th percentile with the R-3 method            Linear interpolation
ep90r7      2.2          Estimated 90th percentile with the R-7 method            Linear interpolation
ep95r3      2.2          Estimated 95th percentile with the R-3 method            Linear interpolation
ep95r7      2.2          Estimated 95th percentile with the R-7 method            Linear interpolation
ep99r3      2.2          Estimated 99th percentile with the R-3 method            Linear interpolation
ep99r7      2.2          Estimated 99th percentile with the R-7 method            Linear interpolation
ep999r3     2.2          Estimated 99.9th percentile with the R-3 method          Linear interpolation
ep999r7     2.2          Estimated 99.9th percentile with the R-7 method          Linear interpolation
first       2.3          Returns the first data point in the set. Only useful
                         for downsampling, not for aggregation                    Indeterminate
last        2.3          Returns the last data point in the set. Only useful
                         for downsampling, not for aggregation                    Indeterminate
mimmin      2.0          Selects the smallest data point                          Maximum if missing
mimmax      2.0          Selects the largest data point                           Minimum if missing
min         1.0          Selects the smallest data point                          Linear interpolation
max         1.0          Selects the largest data point                           Linear interpolation
none        2.3          Skips group-by aggregation of all time series            Zero if missing
p50         2.3          Calculates the 50th percentile                           Linear interpolation
p75         2.3          Calculates the 75th percentile                           Linear interpolation
p90         2.3          Calculates the 90th percentile                           Linear interpolation
p95         2.3          Calculates the 95th percentile                           Linear interpolation
p99         2.3          Calculates the 99th percentile                           Linear interpolation
p999        2.3          Calculates the 99.9th percentile                         Linear interpolation
sum         1.0          Adds the data points together                            Linear interpolation
zimsum      1.0          Adds the data points together                            Zero if missing
For details on percentile calculations, see the Wikipedia article on percentiles. For high-cardinality calculations, the estimated percentiles perform better.
Avg
Calculates the average of all values across the downsampling bucket or across multiple time series. This function performs linear interpolation across series. It is useful for looking at gauge metrics.
Note:
Even though the calculation will usually produce a floating-point value, if the data points are all recorded as integers, an integer is returned, losing some precision.
Count
Returns the number of data points stored in the series or range. When used to aggregate multiple series, missing values are substituted with 0. When used with downsampling, it reflects the number of data points in each downsample bucket. When used for group-by aggregation, it reflects the number of time series that had a value at the given time.
Dev
Calculates the standard deviation across a downsample bucket or across time series. This function performs linear interpolation across series. It is useful for looking at gauge metrics.
Note:
Even though the calculation will usually produce a floating-point value, if the data points are all recorded as integers, an integer is returned, losing some precision.
Estimated Percentiles
Calculates various percentiles using a choice of algorithms. These are useful for series with a large number of data points, as some data may be kicked out of the calculation. When used to aggregate multiple series, this function performs linear interpolation. See Wikipedia for details on percentile estimation. The implementation uses the Apache Math library.
First & Last
These aggregators return the first or last data point in a downsampling interval. For example, if a downsample bucket contains the series 2, 6, 1, 7, the first aggregator returns 2 and the last aggregator returns 7. Note that these aggregators are only useful for downsampling.
Warning:
When used as a group-by aggregator, the results are indeterminate, because the order in which time series are fetched from storage and held in memory may vary from TSD to TSD or from execution to execution.
Min
Returns the smallest data point across all time series or within a time range. This function performs linear interpolation across series. It is useful for looking at the lower bounds of gauge metrics.
Max
The opposite of min: returns the largest data point across all time series or within a time range. This function performs linear interpolation across series. It is useful for looking at the upper bounds of gauge metrics.
MimMin
The "maximum if missing minimum" function returns only the smallest data point across all time series or within a time range. This function does not perform interpolation; instead, it substitutes the maximum value for the data type when a value is missing: Long.MaxValue for integers or Double.MaxValue for floating-point values. See the documentation on raw data types for details. This is useful for looking at the lower bounds of gauge metrics.
MimMax
The "minimum if missing maximum" function returns only the largest data point across all time series or within a time range. This function does not perform interpolation; instead, it substitutes the minimum value for the data type when a value is missing: Long.MinValue for integers or Double.MinValue for floating-point values. See the documentation on raw data types for details. This is useful for looking at the upper bounds of gauge metrics.
None
Skips group-by aggregation. This aggregator is useful for fetching raw data from storage, as it returns one result set per time series matching the filters. Note that the query will throw an exception if this aggregator is used with a downsampler.
Percentiles
Calculates various percentiles. When used to aggregate multiple series, this function performs linear interpolation. The implementation uses the Apache Math library.
Sum
Calculates the sum of all data points across all time series, or across a time range if downsampled. This is the default aggregation function for the GUI, as it is usually the most useful when combining multiple time series such as gauges or counters. It performs linear interpolation when data points do not line up. If you have a distinct set of values across series and want a sum without interpolation, look at the zimsum function.
ZimSum
Calculates the sum of all data points at the specified timestamps across all time series or within a time range. This function does not perform interpolation; instead, it substitutes 0 for missing data points. This is useful when working with discrete values.
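The zero-substitution behavior can be contrasted with the interpolated sum from earlier in this chapter. The sketch below (illustrative Python, not OpenTSDB source) applies zimsum-style zero filling to the same offset series used in the interpolation example:

```python
# Sketch of zimsum-style aggregation (not OpenTSDB code):
# sum at each timestamp, substituting 0 for missing data points.
def zimsum(series_list, timestamps):
    """Sum values at each timestamp, treating missing points as 0."""
    return [sum(s.get(ts, 0) for s in series_list) for ts in timestamps]

# The offset series from the interpolation example (seconds relative to t0):
A = {10: 5, 30: 15, 50: 5}
B = {0: 10, 20: 20, 40: 10, 60: 20}
print(zimsum([A, B], range(0, 70, 10)))
# [10, 5, 20, 15, 10, 5, 20]
```

Compare this jagged result with the interpolated sums 10, 20, 30, 30, 20, 20, 20 from the earlier table: zero substitution is faithful for discrete totals but misleading for continuous gauges.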
Aggregator list
Call the HTTP API endpoint /api/aggregators on an HTTP-enabled TSD to list the aggregators implemented by that TSD.
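For example, a short Python sketch that queries this endpoint (it assumes a TSD listening at localhost:4242, the default port; adjust the address for your deployment):

```python
# Sketch: list the aggregators a running TSD supports via its HTTP API.
import json
import urllib.request

def aggregators_url(base="http://localhost:4242"):
    """Build the aggregator-list endpoint URL (TSD address is an assumption)."""
    return base + "/api/aggregators"

def list_aggregators(base="http://localhost:4242"):
    """Fetch and decode the JSON array of supported aggregator names."""
    with urllib.request.urlopen(aggregators_url(base)) as resp:
        return json.load(resp)

# list_aggregators() returns something like ["avg", "count", ..., "zimsum"]
```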