
Apache Calcite official documentation (Chinese version), Advanced: 3. Streaming


Part Two: Advanced. 3. Streaming

   Calcite extends SQL and relational algebra to support streaming queries.

3.1 Introduction

A stream is a collection of records that flows continuously and never ends. Unlike a table, a stream is not usually stored on disk; it flows over the network and is held in memory only for a short time.

Streams complement tables: they represent what is happening in the present and future of the enterprise, whereas tables represent the past. It is very common for a stream to be archived into a table.

As with tables, you often want to query streams in a high-level language based on relational algebra, validated against a schema, and optimized to take advantage of available resources and algorithms.

Calcite's SQL is an extension of standard SQL, not another "SQL-like" language. The distinction is important for the following reasons:

- Streaming SQL is easy to learn for anyone who knows regular SQL.
- The semantics are clear: the goal is to produce the same results on a stream as if the same data were in a table.
- You can write queries that combine streams and tables (or the history of a stream, which is basically an in-memory table).
- Many existing tools can generate standard SQL.

If you don't use the STREAM keyword, you are back in regular standard SQL.

3.2 An example schema

The streaming SQL examples use the following schema:

- Orders (rowtime, productId, orderId, units): a stream and a table
- Products (rowtime, productId, name): a table
- Shipments (rowtime, orderId): a stream

3.3 A simple query

The simplest streaming query is:

SELECT STREAM *
FROM Orders;

  rowtime | productId | orderId | units
----------+-----------+---------+-------
 10:17:00 |        30 |       5 |     4
 10:17:05 |        10 |       6 |     1
 10:18:05 |        20 |       7 |     2
 10:18:07 |        30 |       8 |    20
 11:02:00 |        10 |       9 |     6
 11:04:00 |        10 |      10 |     1
 11:09:30 |        40 |      11 |    12
 11:24:11 |        10 |      12 |     4

This query reads all columns and rows from the Orders stream. Like any streaming query, it never terminates. It outputs a record whenever a record arrives in Orders.

Hit Control-C to terminate the query.

The STREAM keyword is the main extension in streaming SQL. It tells the system that you are interested in incoming orders, not existing ones.

The query:

SELECT *
FROM Orders;

  rowtime | productId | orderId | units
----------+-----------+---------+-------
 08:30:00 |        10 |       1 |     3
 08:45:10 |        20 |       2 |     1
 09:12:21 |        10 |       3 |    10
 09:27:44 |        30 |       4 |    24

4 records returned.

is also valid, but it prints out all existing orders and then terminates. We call it a relational query, as opposed to a streaming query. It has traditional SQL semantics.

Orders is special: it exists as both a stream and a table. If you try to run a streaming query on a table, or a relational query on a stream, Calcite throws an error:

SELECT * FROM Shipments;
ERROR: Cannot convert stream 'SHIPMENTS' to a table

SELECT STREAM * FROM Products;
ERROR: Cannot convert table 'PRODUCTS' to a stream

3.4 Filtering rows

As in regular SQL, you use a WHERE clause to filter rows:

SELECT STREAM *
FROM Orders
WHERE units > 3;

  rowtime | productId | orderId | units
----------+-----------+---------+-------
 10:17:00 |        30 |       5 |     4
 10:18:07 |        30 |       8 |    20
 11:02:00 |        10 |       9 |     6
 11:09:30 |        40 |      11 |    12
 11:24:11 |        10 |      12 |     4

3.5 Projecting expressions

Use expressions in the SELECT clause to choose which columns to return or to compute derived values:

SELECT STREAM rowtime,
  'An order for ' || units || ' '
    || CASE units WHEN 1 THEN 'unit' ELSE 'units' END
    || ' of product #' || productId AS description
FROM Orders;

  rowtime | description
----------+--------------------------------------
 10:17:00 | An order for 4 units of product #30
 10:17:05 | An order for 1 unit of product #10
 10:18:05 | An order for 2 units of product #20
 10:18:07 | An order for 20 units of product #30
 11:02:00 | An order for 6 units of product #10
 11:04:00 | An order for 1 unit of product #10
 11:09:30 | An order for 12 units of product #40
 11:24:11 | An order for 4 units of product #10

We recommend that you always include the rowtime column in the SELECT clause. Having a sorted timestamp in every stream and every streaming query makes it possible to perform advanced calculations later, such as GROUP BY and JOIN.

3.6 Tumbling windows

There are several ways to compute aggregate functions on a stream. The differences are:

- How many rows come out for each row in?
- Does each incoming value appear in one total, or more?
- What defines the "window", the set of rows that contribute to a given output row?
- Is the result a stream or a relation?

The window types:

- Tumbling window (GROUP BY)
- Hopping window (multi GROUP BY)
- Sliding window (window functions)
- Cascading window (window functions)

(A figure in the original document shows the types of query in which each kind of window is used.)

First, let's look at a tumbling window, which is defined by a streaming GROUP BY. Here is an example:

SELECT STREAM CEIL(rowtime TO HOUR) AS rowtime,
  productId,
  COUNT(*) AS c,
  SUM(units) AS units
FROM Orders
GROUP BY CEIL(rowtime TO HOUR), productId;

  rowtime | productId | c | units
----------+-----------+---+-------
 11:00:00 |        30 | 2 |    24
 11:00:00 |        10 | 1 |     1
 11:00:00 |        20 | 1 |     7
 12:00:00 |        10 | 3 |    11
 12:00:00 |        40 | 1 |    12

The result is a stream. At 11 o'clock sharp, Calcite emits a subtotal for every productId that had an order between 10:00 and 11:00. At 12 o'clock, it emits the orders that occurred between 11:00 and 12:00. Each input row contributes to exactly one output row.

How did Calcite know that the 10:00:00 subtotals were complete at 11:00:00, so that it could emit them? It knows that rowtime is increasing, and it knows that CEIL(rowtime TO HOUR) is therefore also increasing. So, once it has seen a row at or after 11:00:00, it will never see a row that contributes to a 10:00:00 total.

A column or expression that is increasing or decreasing is called monotonic.

If column or expression values are slightly out of order, and the stream has a mechanism (such as punctuation or watermarks) to declare that a particular value will not be seen again, the column or expression is called quasi-monotonic.

Without a monotonic or quasi-monotonic expression in the GROUP BY clause, the aggregation cannot make progress, and the query is not allowed:

SELECT STREAM productId,
  COUNT(*) AS c,
  SUM(units) AS units
FROM Orders
GROUP BY productId;

ERROR: Streaming aggregation requires at least one monotonic expression

Monotonic and quasi-monotonic columns need to be declared in the schema. Monotonicity is enforced as records enter the stream and assumed by queries that read from the stream. We recommend giving each stream a timestamp column called rowtime, but you can also declare other columns, such as orderId, to be monotonic.

We discuss punctuation, watermarks, and other ways of making progress later in this document.

3.7 Tumbling windows, improved

The previous tumbling-window example was easy to write because the window is one hour. For intervals that are not a whole time unit, say 2 hours or 2 hours and 17 minutes, you cannot use CEIL, and the expression gets more complicated.
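To see why, here is a rough sketch (not taken from the original document) of what a 2-hour tumbling window could look like using only date-time arithmetic, assuming rowtime is a TIMESTAMP and taking the epoch as an arbitrary alignment point; whether the planner would even recognize such an expression as monotonic is another matter:

SELECT STREAM
  TIMESTAMPADD(HOUR,
      2 * (TIMESTAMPDIFF(HOUR, TIMESTAMP '1970-01-01 00:00:00', rowtime) / 2) + 2,
      TIMESTAMP '1970-01-01 00:00:00') AS rowtime,  -- end of the 2-hour window
  productId,
  COUNT(*) AS c,
  SUM(units) AS units
FROM Orders
GROUP BY
  TIMESTAMPADD(HOUR,
      2 * (TIMESTAMPDIFF(HOUR, TIMESTAMP '1970-01-01 00:00:00', rowtime) / 2) + 2,
      TIMESTAMP '1970-01-01 00:00:00'),
  productId;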

For this reason, Calcite supports an alternative syntax for tumbling windows:

SELECT STREAM TUMBLE_END(rowtime, INTERVAL '1' HOUR) AS rowtime,
  productId,
  COUNT(*) AS c,
  SUM(units) AS units
FROM Orders
GROUP BY TUMBLE(rowtime, INTERVAL '1' HOUR), productId;

  rowtime | productId | c | units
----------+-----------+---+-------
 11:00:00 |        30 | 2 |    24
 11:00:00 |        10 | 1 |     1
 11:00:00 |        20 | 1 |     7
 12:00:00 |        10 | 3 |    11
 12:00:00 |        40 | 1 |    12

As you can see, it returns the same results as the previous query. The TUMBLE function returns a grouping key that is the same for all rows that end up in a given summary row; the TUMBLE_END function takes the same arguments and returns the end time of the window; and, of course, there is also a TUMBLE_START function.

TUMBLE has an optional parameter to align the window. In the following example, we use a 30-minute interval aligned to 0:12, so the query emits summaries at 12 and 42 minutes past each hour:

SELECT STREAM
  TUMBLE_END(rowtime, INTERVAL '30' MINUTE, TIME '0:12') AS rowtime,
  productId,
  COUNT(*) AS c,
  SUM(units) AS units
FROM Orders
GROUP BY TUMBLE(rowtime, INTERVAL '30' MINUTE, TIME '0:12'),
  productId;

  rowtime | productId | c | units
----------+-----------+---+-------
 10:42:00 |        30 | 2 |    24
 10:42:00 |        10 | 1 |     1
 10:42:00 |        20 | 1 |     7
 11:12:00 |        10 | 2 |     7
 11:12:00 |        40 | 1 |    12
 11:42:00 |        10 | 1 |     4

3.8 Hopping windows

A hopping window is a generalization of a tumbling window that allows data to be kept in a window for longer than the interval between emissions.

For example, the following query emits a row with timestamp 11:00 containing data from 08:00 to 11:00 (or 10:59.9 to be precise), and a row with timestamp 12:00 containing data from 09:00 to 12:00.

SELECT STREAM
  HOP_END(rowtime, INTERVAL '1' HOUR, INTERVAL '3' HOUR) AS rowtime,
  COUNT(*) AS c,
  SUM(units) AS units
FROM Orders
GROUP BY HOP(rowtime, INTERVAL '1' HOUR, INTERVAL '3' HOUR);

  rowtime | c | units
----------+---+-------
 11:00:00 | 4 |    27
 12:00:00 | 8 |    50

In this query, because the retention period is three times the emit period, every input row contributes to three output rows. Imagine that the HOP function generates a collection of group keys for an incoming row, and that the row's values are added to the accumulator of each of those group keys. For example, HOP(10:18:00, INTERVAL '1' HOUR, INTERVAL '3' HOUR) generates three periods:

[08:00, 11:00)
[09:00, 12:00)
[10:00, 13:00)

This raises the possibility of letting users define their own partitioning functions if they are not satisfied with the built-in HOP and TUMBLE.

We can also build complex expressions, such as an exponentially decaying moving average:

SELECT STREAM HOP_END(rowtime),
  productId,
  SUM(unitPrice * EXP((rowtime - HOP_START(rowtime)) SECOND / INTERVAL '1' HOUR))
    / SUM(EXP((rowtime - HOP_START(rowtime)) SECOND / INTERVAL '1' HOUR))
FROM Orders
GROUP BY HOP(rowtime, INTERVAL '1' SECOND, INTERVAL '1' HOUR),
  productId;

It emits: at 11:00:00, a row containing the rows in [10:00:00, 11:00:00); at 11:00:01, a row containing the rows in [10:00:01, 11:00:01); and so on.

The expression weights recent orders more heavily than older ones. Extending the window from 1 hour to 2 hours or to 1 year would have very little effect on the accuracy of the result, but it would use more memory and compute.

Note that we use HOP_START inside an aggregate function (SUM); this is allowed because its value is constant for all rows within a sub-total. For typical (non-constant) values, this would not be allowed.

If you are familiar with GROUPING SETS, you may notice that partitioning functions can be seen as a generalization of GROUPING SETS, in that they allow a single input row to contribute to multiple sub-totals. The auxiliary functions for GROUPING SETS, such as GROUPING() and GROUP_ID, can be used inside aggregate functions, so it is not surprising that HOP_START and HOP_END can be used in the same way.

3.9 GROUPING SETS

GROUPING SETS is valid in a streaming query provided that every grouping set contains a monotonic or quasi-monotonic expression.
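For illustration, a streaming GROUPING SETS query might look something like the following sketch, in which both grouping sets contain the monotonic TUMBLE expression (one produces per-product sub-totals, the other hourly totals across all products):

SELECT STREAM TUMBLE_END(rowtime, INTERVAL '1' HOUR) AS rowtime,
  productId,
  COUNT(*) AS c,
  SUM(units) AS units
FROM Orders
GROUP BY GROUPING SETS (
  (TUMBLE(rowtime, INTERVAL '1' HOUR), productId),
  (TUMBLE(rowtime, INTERVAL '1' HOUR)));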

CUBE and ROLLUP do not make sense in streaming queries, because they produce at least one grouping set that aggregates everything (like GROUP BY ()).

3.10 Filtering after aggregation

As in standard SQL, you can apply a HAVING clause to filter the rows emitted by a streaming GROUP BY:

SELECT STREAM TUMBLE_END(rowtime, INTERVAL '1' HOUR) AS rowtime,
  productId
FROM Orders
GROUP BY TUMBLE(rowtime, INTERVAL '1' HOUR), productId
HAVING COUNT(*) > 2 OR SUM(units) > 10;

  rowtime | productId
----------+-----------
 10:00:00 |        30
 11:00:00 |        10
 11:00:00 |        40

3.11 Sub-queries, views, and the closure property of SQL

The HAVING query in the previous section can also be expressed using a WHERE clause on a sub-query:

SELECT STREAM rowtime, productId
FROM (
  SELECT TUMBLE_END(rowtime, INTERVAL '1' HOUR) AS rowtime,
    productId,
    COUNT(*) AS c,
    SUM(units) AS su
  FROM Orders
  GROUP BY TUMBLE(rowtime, INTERVAL '1' HOUR), productId)
WHERE c > 2 OR su > 10;

  rowtime | productId
----------+-----------
 10:00:00 |        30
 11:00:00 |        10
 11:00:00 |        40

HAVING was introduced in the early days of SQL as a way to perform a filter after aggregation (recall that WHERE filters rows before they reach the GROUP BY clause).

Since then, SQL has become a mathematically closed language, which means that any operation you can perform on a table can also be performed on a query.

The closure property of SQL is extremely powerful. Not only does it make HAVING obsolete (or at least reduce it to syntactic sugar), it makes views possible:

CREATE VIEW HourlyOrderTotals (rowtime, productId, c, su) AS
  SELECT TUMBLE_END(rowtime, INTERVAL '1' HOUR),
    productId,
    COUNT(*),
    SUM(units)
  FROM Orders
  GROUP BY TUMBLE(rowtime, INTERVAL '1' HOUR), productId;

SELECT STREAM rowtime, productId
FROM HourlyOrderTotals
WHERE c > 2 OR su > 10;

  rowtime | productId
----------+-----------
 10:00:00 |        30
 11:00:00 |        10
 11:00:00 |        40

Sub-queries in the FROM clause are sometimes called "inline views", but really they are more fundamental than views. Views are just a convenient way to carve SQL into manageable chunks by giving the pieces names and storing them in a metadata repository.

Many people find that nested queries and views are even more useful for streams than for relations. Streaming queries are pipelines of operators that run continuously, and these pipelines are often very long. Nested queries and views help to express and manage those pipelines.

By the way, a WITH clause can accomplish the same thing as a sub-query or a view:

WITH HourlyOrderTotals (rowtime, productId, c, su) AS (
  SELECT TUMBLE_END(rowtime, INTERVAL '1' HOUR),
    productId,
    COUNT(*),
    SUM(units)
  FROM Orders
  GROUP BY TUMBLE(rowtime, INTERVAL '1' HOUR), productId)
SELECT STREAM rowtime, productId
FROM HourlyOrderTotals
WHERE c > 2 OR su > 10;

  rowtime | productId
----------+-----------
 10:00:00 |        30
 11:00:00 |        10
 11:00:00 |        40

3.12 Converting between streams and relations

Look back at the definition of the HourlyOrderTotals view. Is the view a stream or a relation?

It does not contain the STREAM keyword, so it is a relation. However, it is a relation that can be converted into a stream.

You can use it in both relational and streaming queries:

# A relation; will query the historic Orders table.
# Returns the largest number of product #10 ever sold in one hour.
SELECT max(su)
FROM HourlyOrderTotals
WHERE productId = 10;

# A stream; will query the Orders stream.
# Returns every hour in which at least one product #10 was sold.
SELECT STREAM rowtime
FROM HourlyOrderTotals
WHERE productId = 10;

This approach is not limited to views and sub-queries. Following the approach set out in CQL [1], every query in streaming SQL is defined as a relational query, and the STREAM keyword in the topmost SELECT converts it into a stream.

If the STREAM keyword appears in a sub-query or a view definition, it has no effect.

At query preparation time, Calcite figures out whether the relations referenced in the query can be converted to streams or to historical relations.

Sometimes a stream makes available some of its history (say the last 24 hours of data in an Apache Kafka [2] topic), but not all. At run time, Calcite figures out whether there is sufficient history to run the query and, if not, gives an error.

3.13 The "pie chart" problem: relational queries on streams

One particular case where you need to convert a stream into a relation is what I call the "pie chart problem". Imagine that you need to write a web page with a chart that summarizes the number of orders for each product over the past hour.

But the Orders stream only contains a few recent records, not an hour's worth of summary. We need to run a relational query on the history of the stream:

SELECT productId, COUNT(*)
FROM Orders
WHERE rowtime BETWEEN current_timestamp - INTERVAL '1' HOUR
  AND current_timestamp;

If the history of the Orders stream is spooled into the Orders table, we can answer the query, albeit at a high cost. Better still would be to tell the system to materialize an hourly summary into a table, maintain it continuously as the stream flows, and rewrite queries automatically to use that table.
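As a sketch of that idea (the HourlyTotals table and the standing INSERT that maintains it are hypothetical, and the automatic query rewrite described above is not shown):

-- Hypothetical summary table, continuously maintained by a standing statement.
INSERT INTO HourlyTotals
SELECT STREAM TUMBLE_END(rowtime, INTERVAL '1' HOUR) AS rowtime,
  productId,
  COUNT(*) AS c
FROM Orders
GROUP BY TUMBLE(rowtime, INTERVAL '1' HOUR), productId;

-- The chart page then runs an ordinary relational query against that table.
SELECT productId, c
FROM HourlyTotals
WHERE rowtime = (SELECT MAX(rowtime) FROM HourlyTotals);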

3.14 Sorting

The story for ORDER BY is similar to GROUP BY. The syntax looks like regular SQL, but Calcite must be sure that it can deliver timely results, so it requires a monotonic expression on the leading edge of the ORDER BY key.

SELECT STREAM CEIL(rowtime TO hour) AS rowtime,
  productId,
  orderId,
  units
FROM Orders
ORDER BY CEIL(rowtime TO hour) ASC, units DESC;

  rowtime | productId | orderId | units
----------+-----------+---------+-------
 10:00:00 |        30 |       8 |    20
 10:00:00 |        30 |       5 |     4
 10:00:00 |        20 |       7 |     2
 10:00:00 |        10 |       6 |     1
 11:00:00 |        40 |      11 |    12
 11:00:00 |        10 |       9 |     6
 11:00:00 |        10 |      12 |     4
 11:00:00 |        10 |      10 |     1

Most queries will return results in the order that they were inserted, because the engine uses streaming algorithms, but you should not rely on it. For example, consider this:

SELECT STREAM *
FROM Orders
WHERE productId = 10
UNION ALL
SELECT STREAM *
FROM Orders
WHERE productId = 30;

  rowtime | productId | orderId | units
----------+-----------+---------+-------
 10:17:05 |        10 |       6 |     1
 10:17:00 |        30 |       5 |     4
 10:18:07 |        30 |       8 |    20
 11:02:00 |        10 |       9 |     6
 11:04:00 |        10 |      10 |     1
 11:24:11 |        10 |      12 |     4

The rows with productId = 30 are apparently out of order, probably because the Orders stream was partitioned on productId and the partitioned streams sent their data at different times.

If you require a particular ordering, add an explicit ORDER BY:
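For example, a sketch based on the UNION ALL query above, with the merge key made explicit:

SELECT STREAM *
FROM Orders
WHERE productId = 10
UNION ALL
SELECT STREAM *
FROM Orders
WHERE productId = 30
ORDER BY rowtime;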

Calcite would probably implement the UNION ALL by merging on rowtime, which is only slightly less efficient.

You only need to add an ORDER BY to the outermost query. If you needed to perform a GROUP BY after the UNION ALL, Calcite would add the ORDER BY implicitly, in order to make the GROUP BY algorithm possible.
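For instance, here is a sketch of a GROUP BY running on top of the same UNION ALL; the merge on rowtime that Calcite would insert is implicit:

SELECT STREAM TUMBLE_END(rowtime, INTERVAL '1' HOUR) AS rowtime,
  COUNT(*) AS c
FROM (
  SELECT * FROM Orders WHERE productId = 10
  UNION ALL
  SELECT * FROM Orders WHERE productId = 30)
GROUP BY TUMBLE(rowtime, INTERVAL '1' HOUR);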

3.15 Table constructor

The VALUES clause creates an inline table with a given set of rows.

Streaming is disallowed: the set of rows never changes, so a stream would never return any rows.

> SELECT STREAM * FROM (VALUES (1, 'abc'));
ERROR: Cannot stream VALUES

3.16 Sliding windows

A feature of standard SQL is the so-called "analytic functions" that can be used in the SELECT clause. Unlike GROUP BY, they do not collapse records: for each record that goes in, one record comes out. But the aggregate function is computed over a window of many rows.

Let's look at an example.

SELECT STREAM rowtime,
  productId,
  units,
  SUM(units) OVER (ORDER BY rowtime RANGE INTERVAL '1' HOUR PRECEDING) unitsLastHour
FROM Orders;

The feature packs a lot of power into little effort. You can have several analytic functions in the SELECT clause, based on several window specifications.

The following example returns orders where the average order size over the last 10 minutes is greater than the average order size over the last week:

SELECT STREAM *
FROM (
  SELECT STREAM rowtime,
    productId,
    units,
    AVG(units) OVER product (RANGE INTERVAL '10' MINUTE PRECEDING) AS m10,
    AVG(units) OVER product (RANGE INTERVAL '7' DAY PRECEDING) AS d7
  FROM Orders
  WINDOW product AS (ORDER BY rowtime PARTITION BY productId))
WHERE m10 > d7;

For brevity, we use a syntax where the WINDOW clause partially defines a window and each OVER clause then refines it. You could also define all windows in the WINDOW clause, or define all windows inline, if you prefer.

But the real power goes beyond syntax. Behind the scenes, this query maintains two tables, and adds values to and removes values from sub-totals using FIFO queues. Yet you can access those tables without introducing a join into the query.

Some other features of the windowed-aggregation syntax (a combined sketch follows this list):

- You can define windows based on row counts.
- A window can reference rows that have not yet arrived; the stream will wait until they do.
- You can compute order-dependent functions, such as RANK and median.
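Here is a sketch touching all three points (the window sizes and column names are illustrative):

SELECT STREAM rowtime,
  productId,
  units,
  -- a window defined by row count, including one row that has not arrived yet
  AVG(units) OVER (PARTITION BY productId ORDER BY rowtime
    ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS avgNearby,
  -- an order-dependent function
  RANK() OVER (PARTITION BY productId ORDER BY units) AS unitsRank
FROM Orders;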

3.17 Cascading windows

What if we want a query that returns a result for every record, like a sliding window, but resets the totals on a fixed time period, like a tumbling window? Such a pattern is called a cascading window. Here is an example:

SELECT STREAM rowtime,
  productId,
  units,
  SUM(units) OVER (PARTITION BY FLOOR(rowtime TO HOUR)) AS unitsSinceTopOfHour
FROM Orders;

It looks similar to a sliding-window query, but the monotonic expression appears within the PARTITION BY clause of the window. As rowtime moves from 10:59:59 to 11:00:00, FLOOR(rowtime TO HOUR) changes from 10:00:00 to 11:00:00, and a new partition starts. The first row to arrive in the new hour starts a new total; the second row has a total made up of two rows, and so on.

Calcite knows that the old partition will never be used again, so it removes all sub-totals for that partition from its internal storage.

Analytic functions that use cascading windows and sliding windows can be combined in the same query.

3.18 Joining streams to tables

There are two kinds of join involving streams: stream-to-table joins and stream-to-stream joins.

A stream-to-table join is straightforward if the contents of the table are not changing. This query enriches the stream of orders with each product's list price:

SELECT STREAM o.rowtime, o.productId, o.orderId, o.units,
  p.name, p.unitPrice
FROM Orders AS o
JOIN Products AS p
  ON o.productId = p.productId;

  rowtime | productId | orderId | units | name   | unitPrice
----------+-----------+---------+-------+--------+-----------
 10:17:00 |        30 |       5 |     4 | Cheese |        17
 10:17:05 |        10 |       6 |     1 | Beer   |      0.25
 10:18:05 |        20 |       7 |     2 | Wine   |         6
 10:18:07 |        30 |       8 |    20 | Cheese |        17
 11:02:00 |        10 |       9 |     6 | Beer   |      0.25
 11:04:00 |        10 |      10 |     1 | Beer   |      0.25
 11:09:30 |        40 |      11 |    12 | Bread  |       100
 11:24:11 |        10 |      12 |     4 | Beer   |      0.25

But what if the table is changing? Suppose, for example, that the unit price of product #10 is increased to 0.35 at 11:00. Orders placed before 11:00 should have the old price, and orders placed after 11:00 should reflect the new price.

One way to implement this is to have a table that keeps every version of each product, with its start and end effective dates, as in the ProductVersions table below:

SELECT STREAM *
FROM Orders AS o
JOIN ProductVersions AS p
  ON o.productId = p.productId
  AND o.rowtime BETWEEN p.startDate AND p.endDate;

  rowtime | productId | orderId | units | productId1 | name   | unitPrice
----------+-----------+---------+-------+------------+--------+-----------
 10:17:00 |        30 |       5 |     4 |         30 | Cheese |        17
 10:17:05 |        10 |       6 |     1 |         10 | Beer   |      0.25
 10:18:05 |        20 |       7 |     2 |         20 | Wine   |         6
 10:18:07 |        30 |       8 |    20 |         30 | Cheese |        17
 11:02:00 |        10 |       9 |     6 |         10 | Beer   |      0.35
 11:04:00 |        10 |      10 |     1 |         10 | Beer   |      0.35
 11:09:30 |        40 |      11 |    12 |         40 | Bread  |       100
 11:24:11 |        10 |      12 |     4 |         10 | Beer   |      0.35

The other way to implement this is to use a database with temporal support (the ability to see the contents of the database as of any moment in the past), and for the system to know that the rowtime column of the Orders stream corresponds to the transaction timestamp of the Products table.

For many applications, the cost and effort of temporal support or versioned tables is not worth it. It is acceptable for the query to give different results when replayed: in this example, on replay, all orders for product #10 would be assigned the later unit price, 0.35.

3.19 Joining streams to streams

It makes sense to join two streams if the join condition somehow forces them to remain a finite distance from one another. In the following query, the ship date is within one hour of the order date:

SELECT STREAM o.rowtime, o.productId, o.orderId, s.rowtime AS shipTime
FROM Orders AS o
JOIN Shipments AS s
  ON o.orderId = s.orderId
  AND s.rowtime BETWEEN o.rowtime AND o.rowtime + INTERVAL '1' HOUR;

  rowtime | productId | orderId | shipTime
----------+-----------+---------+----------
 10:17:00 |        30 |       5 | 10:55:00
 10:17:05 |        10 |       6 | 10:20:00
 11:02:00 |        10 |       9 | 11:58:00
 11:24:11 |        10 |      12 | 11:44:00

Note that quite a few orders do not appear, because they were not shipped within an hour. By the time the system receives the order timestamped 11:24:11, it has already removed orders up to and including order #8 (timestamped 10:18:07) from its hash table.

As you can see, the "lock step", tying together the monotonic or quasi-monotonic columns of the two streams, is necessary for the system to make progress. It will refuse to execute a query if it cannot deduce a lock step.

3.20 DML

It is not only queries on streams that make sense; it also makes sense to run DML statements (INSERT, UPDATE, DELETE, UPSERT and REPLACE) against streams.

DML is useful because it lets you materialize streams, or tables based on streams, and therefore save effort when those values are used frequently.

Consider that streaming applications usually consist of pipelines of queries, where each query transforms an input stream into an output stream. A component of a pipeline can be a view:

CREATE VIEW LargeOrders AS
SELECT STREAM * FROM Orders WHERE units > 1000;

Or a standing (continuously running) INSERT statement:

INSERT INTO LargeOrders
SELECT STREAM * FROM Orders WHERE units > 1000;

These look similar, and in both cases the next step in the pipeline can read LargeOrders without worrying about how it is populated. There is a difference in efficiency, though: the INSERT statement does the same work no matter how many consumers there are, whereas the view does work proportional to the number of consumers; in particular, if there are no consumers, it does no work.

Other forms of DML also make sense for streams. For example, the following standing UPSERT statement maintains a table that summarizes the orders of the last hour:

UPSERT INTO OrdersSummary
SELECT STREAM productId,
  COUNT(*) OVER lastHour AS c
FROM Orders
WINDOW lastHour AS (
  PARTITION BY productId
  ORDER BY rowtime
  RANGE INTERVAL '1' HOUR PRECEDING);

3.21 Punctuation

Punctuation [5] allows a streaming query to make progress even when there are not enough values in a monotonic key to push the results out.

(I prefer the term "rowtime bounds". Watermarks [6] are a related concept, but for these purposes punctuation suffices.)

If a stream has punctuation, it may not strictly be sorted, but it is still sortable. So, for the purposes of semantics, it is sufficient to work in terms of sorted streams.

By the way, an out-of-order stream is also sortable if it is t-sorted (that is, every record is guaranteed to arrive within t seconds of its timestamp) or k-sorted (that is, every record is guaranteed to be no more than k positions out of order). Queries on such streams can be planned like queries on streams with punctuation.

Also, we often want to aggregate over attributes that are not time-based but are nevertheless monotonic. "The number of times a team has shifted between winning and losing states" is one such monotonic attribute. The system needs to figure out for itself that it is safe to aggregate over such an attribute; punctuation does not add any extra information.

Here are some of the metadata (cost metrics) I have in mind for the planner:

1. Is this stream sorted on the given attribute (or attributes)?
2. Is it possible to sort the stream on the given attribute? (For finite relations the answer is always "yes"; for streams it depends on the existence of punctuation, or on a link between the attribute and the sort key.)
3. What latency do we need to introduce in order to perform that sort?
4. What is the cost (in CPU, memory, and so forth) of performing that sort?

We already have (1), in BuiltInMetadata.Collation. For (2), the answer is always "true" for finite relations. But we will need to implement (2), (3) and (4) for streams.

3.22 State of the stream

Not all of the concepts described in this article have been implemented in Calcite. Others may be implemented in Calcite but not in a particular adapter such as SamzaSQL [3] [4].

Implemented

- Streaming SELECT, WHERE, GROUP BY, HAVING, UNION ALL, ORDER BY
- FLOOR and CEIL functions
- Monotonicity
- Streaming VALUES is disallowed

Not implemented

The following features are presented in this document as if Calcite supports them, but in fact it does not (yet). Full support means that the reference implementation supports the feature (including negative cases) and that it is tested by the TCK.

- Stream-to-stream JOIN
- Stream-to-table JOIN
- Stream on view
- Streaming UNION ALL with ORDER BY (merge)
- Relational query on a stream
- Streaming windowed aggregation (sliding and cascading windows)
- Check that STREAM in sub-queries and views is ignored
- Check that a streaming ORDER BY cannot have OFFSET or LIMIT
- Limited history; at run time, check that there is sufficient history to run the query
- Quasi-monotonicity
- HOP and TUMBLE (and auxiliary HOP_START, HOP_END, TUMBLE_START, TUMBLE_END) functions

To do in this document:

- Re-visit whether you can stream VALUES.
- OVER clause to define windows on streams.
- Consider whether to allow CUBE and ROLLUP in streaming queries, with the understanding that some levels of aggregation will never complete (because they have no monotonic expressions) and thus will never be emitted.
- Fix the UPSERT example to remove records for products that have not occurred in the last hour.
- DML that outputs to multiple streams; perhaps an extension of the standard REPLACE statement.

3.23 Functions

The following functions are not present in standard SQL but are defined in streaming SQL.

Scalar functions:

- FLOOR(dateTime TO intervalType) rounds a date, time, or timestamp value down to the given interval type
- CEIL(dateTime TO intervalType) rounds a date, time, or timestamp value up to the given interval type

Partitioning functions:

- HOP(t, emit, retain) returns a collection of group keys for a row to be part of a hopping window
- HOP(t, emit, retain, align) returns a collection of group keys for a row to be part of a hopping window with a given alignment
- TUMBLE(t, emit) returns a group key for a row to be part of a tumbling window
- TUMBLE(t, emit, align) returns a group key for a row to be part of a tumbling window with a given alignment

Note:

TUMBLE(t, e) is equivalent to TUMBLE(t, e, TIME '00:00:00').

TUMBLE(t, e, a) is equivalent to HOP(t, e, e, a).

HOP(t, e, r) is equivalent to HOP(t, e, r, TIME '00:00:00').
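Putting the notes together, a plain TUMBLE call can be expanded step by step (illustrative only):

TUMBLE(rowtime, INTERVAL '1' HOUR)
  = TUMBLE(rowtime, INTERVAL '1' HOUR, TIME '00:00:00')                    -- by the first note
  = HOP(rowtime, INTERVAL '1' HOUR, INTERVAL '1' HOUR, TIME '00:00:00')    -- by the second note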
