The idea behind Flink dynamic tables

2025-02-24 Update From: SLTechnology News&Howtos


This article introduces the idea behind Flink dynamic tables. Many people have questions about how dynamic tables work, so the editor has gathered the relevant material and organized it into a simple, easy-to-follow explanation. I hope it helps resolve your doubts about Flink dynamic tables. Now, let's get started!

Traditional database SQL and real-time SQL processing still differ considerably. A few of the differences, briefly:

Traditional database SQL processing | Real-time SQL processing
Table data is bounded. | Streaming data is unbounded.
A batch query can access the complete data set. | The complete data set is never available; the query must wait for new data to arrive.
The query terminates after processing the data. | The query continuously updates its result table with the incoming data and never terminates.

Despite these differences, processing streams with relational queries and SQL is not impossible. Advanced relational database systems offer a feature called materialized views. A materialized view is defined by a SQL query, just like a regular virtual view. In contrast to a virtual view, however, a materialized view caches the result of the query, so the query does not need to be evaluated when the view is accessed. A common challenge with caching is avoiding stale results: a materialized view becomes outdated when the base tables of its defining query are modified. Eager view maintenance is a technique that updates a materialized view as soon as its base tables are updated.

The connection between eager view maintenance and SQL queries on streams becomes obvious once we consider the following:

A database table is the result of a stream of INSERT, UPDATE, and DELETE DML statements, often called a changelog stream.

A materialized view is defined by a SQL query. To keep the view up to date, the query must continuously process the changelog streams of the view's base tables.

A materialized view is therefore the result of a streaming SQL query.

With this foundation in place, we can introduce the concept of dynamic tables.

Dynamic tables and continuous queries

Dynamic tables are the core concept of Flink's Table API and SQL for processing streaming data. In contrast to static tables, dynamic tables change over time, yet they can be queried just like static tables; the difference is that querying a dynamic table requires a continuous query. A continuous query never terminates and produces a dynamic table as its result, constantly updating that (dynamic) result table to reflect changes on its (dynamic) input table. In the end, a continuous query on a dynamic table is very similar to the query that defines a materialized view.

It is worth noting that the result of a continuous query is always semantically equivalent to the result of running the same query in batch mode on a snapshot of the input table.

The following figure shows the relationship among streams, dynamic tables, and continuous queries:

A data stream is converted into a dynamic table.

A continuous query is evaluated on the dynamic table, producing a new dynamic result table.

The resulting dynamic table is converted back into a data stream.

Note: dynamic tables are above all a logical concept. They are not necessarily (fully) materialized during query execution.

In the following sections, dynamic tables and continuous queries are explained using a stream of click events with the following schema:

[
  user:  VARCHAR,   // the name of the user
  cTime: TIMESTAMP, // the time the URL was accessed
  url:   VARCHAR    // the URL the user accessed
]

Convert stream to table

Naturally, if you want to analyze streaming data with classic SQL, you must first convert the stream into a table. Conceptually, each new record in the stream is interpreted as an INSERT operation on the result table; in other words, the table is built from an INSERT-only changelog stream.

The following figure shows how the click event stream (left) is converted into a table (right). The resulting table keeps growing as more click records are inserted.

Note: a table produced from a stream in this way is not materialized internally.
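
As a concrete illustration, here is a minimal sketch of this conversion using Flink's Table API for Java. It is a sketch only: the API names follow Flink 1.14-era releases and vary across versions, the inline sample rows stand in for a real source such as Kafka, and the cTime column and its event-time declaration are omitted for brevity.

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.types.Row;

public class ClicksToTable {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // Stand-in click source; a real job would read from Kafka, a file, etc.
        DataStream<Row> clicks = env
                .fromElements(
                        Row.of("Mary", "./home"),
                        Row.of("Bob", "./cart"),
                        Row.of("Mary", "./prod?id=1"),
                        Row.of("Liz", "./home"))
                .returns(Types.ROW(Types.STRING, Types.STRING));

        // Each stream record is interpreted as an INSERT into the dynamic table.
        tEnv.createTemporaryView("clicks", tEnv.fromDataStream(clicks).as("user", "url"));
    }
}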

Continuous query

A continuous query is evaluated on a dynamic table and produces a new dynamic table as its result. Unlike a batch query, a continuous query never terminates and keeps updating its result table according to the updates on its input table. At any point in time, the result of a continuous query is semantically equivalent to the result of running the same query in batch mode on a snapshot of the input table.

In the following, we show two example queries over a clicks table defined on the click event stream.

The first query is a simple GROUP BY COUNT aggregation: it groups the clicks table by user and counts the number of URLs visited. The following figure shows how the query over the clicks table is evaluated as the data grows.

Suppose the clicks table is empty when the query starts. When the first row is inserted into the clicks table, the query begins computing the result table. When [Mary, ./home] is inserted, the query produces the row [Mary, 1] in the result table. When [Bob, ./cart] is inserted into the clicks table, the query updates the result table again, adding the row [Bob, 1]. When the third row, [Mary, ./prod?id=1], is inserted into the clicks table, the query updates the row [Mary, 1] to [Mary, 2]. Finally, after the fourth row is inserted into clicks, the query adds the row [Liz, 1] to the result table.
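
Continuing the sketch above, this first query can be issued as a continuous query roughly as follows (a sketch only; note that user is a reserved word in Flink SQL, hence the backticks):

// import org.apache.flink.table.api.Table;
Table urlCounts = tEnv.sqlQuery(
        "SELECT `user`, COUNT(url) AS cnt FROM clicks GROUP BY `user`");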

The second query only adds a 1-hour tumbling window to the previous query. The following figure shows the whole process.

This is similar to batch processing: results are computed for each hour and then appended to the result table. The window from 12:00:00 to 12:59:59 contains four rows (by cTime), for which the query computes two result rows and appends them to the result table. The window from 13:00:00 to 13:59:59 contains three rows, for which the query again appends two result rows. Over time, new clicks are appended to the clicks table, and new results keep being produced in the result table.
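
For reference, a hedged sketch of this windowed query using Flink's classic group-window syntax (newer releases also offer windowing table-valued functions); it assumes clicks was registered with cTime declared as an event-time attribute, which the earlier sketch omitted:

Table hourlyCounts = tEnv.sqlQuery(
        "SELECT `user`, " +
        "       TUMBLE_END(cTime, INTERVAL '1' HOUR) AS endT, " +
        "       COUNT(url) AS cnt " +
        "FROM clicks " +
        "GROUP BY `user`, TUMBLE(cTime, INTERVAL '1' HOUR)");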

Update and append queries

Although the two example queries look very similar (both compute grouped count aggregations), their internal logic differs considerably:

The first query updates previously emitted results, i.e. the changelog stream that defines the result table contains INSERT and UPDATE changes.

The second query only appends to the result table, i.e. the changelog stream of the result table contains only INSERT changes.

Whether a query produces an append-only table or a table that must be updated matters in two respects:

Queries that produce update changes usually have to maintain more state.

Converting an append-only table into a stream is different from converting a table that must be updated into a stream.

Query restrictions

Not all queries can be executed as streaming queries, because some queries are too expensive to compute: either the state that must be maintained is too large, or computing the updates is too costly.

State size: continuous queries are executed on unbounded streams and often need to run for weeks or months, around the clock. The total amount of data a continuous query processes can therefore be very large. A query that has to update previously emitted results may need to maintain all of the rows it has emitted so far. For example, the first example query (shown below) needs to store the URL count for each user so that it can increase the count and emit a new result whenever the input table receives a new row. If only registered users are counted, the number of counts to maintain may stay modest. But if every unregistered user is assigned a unique user name, the number of counts to maintain grows over time and may eventually cause the query to fail.

SELECT user, COUNT(url)
FROM clicks
GROUP BY user

Computing updates: some queries have to recompute and update a large fraction of the emitted result rows even when only a single input record is added or updated. Such queries are clearly unsuitable to run as continuous queries. The SQL below is an example: it computes a RANK for each user based on the time of the last click. As soon as the clicks table receives a new row, the user's lastAction is updated and a new rank must be computed. And because no two rows can share the same rank, all lower-ranked rows need to be updated as well.

SELECT user, RANK() OVER (ORDER BY lastAction)
FROM (SELECT user, MAX(cTime) AS lastAction FROM clicks GROUP BY user)

Convert table to stream

A dynamic table can be modified by INSERT, UPDATE, and DELETE changes, just like a regular database table. When a dynamic table is converted into a stream or written to an external system, these changes need to be encoded. Flink's Table API and SQL support three ways of encoding the changes of a dynamic table, illustrated with a code sketch after the list:

Append-only stream: if the dynamic table is modified only by INSERT changes, it can be converted into a stream by simply emitting the inserted rows.

Retract stream: a retract stream contains two kinds of messages, add messages and retract messages. A dynamic table is converted into a retract stream by encoding an INSERT change as an add message, a DELETE change as a retract message, and an UPDATE change as a retract message for the previous row followed by an add message for the new row. The following figure shows the conversion of a dynamic table into a retract stream.

Upsert stream: an upsert stream contains two kinds of messages, upsert messages and delete messages. A dynamic table that is converted into an upsert stream must have a unique key. The conversion encodes INSERT and UPDATE changes as upsert messages and DELETE changes as delete messages, and the operators consuming the stream need to know the unique key attribute to interpret the messages correctly. The main difference from a retract stream is that an UPDATE is encoded with a single message, which makes upsert streams more efficient. The following figure shows the conversion of a dynamic table into an upsert stream.
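
As an illustration, here are hedged sketches of these conversions with the Java bridge API, reusing the urlCounts (updating) and hourlyCounts (append-only) tables from the earlier sketches. toAppendStream and toRetractStream are the legacy bridge methods (deprecated in recent releases), while toChangelogStream is available from Flink 1.13 onward:

// import org.apache.flink.api.java.tuple.Tuple2;

// Append-only stream: valid only for INSERT-only tables such as hourlyCounts.
DataStream<Row> appendStream = tEnv.toAppendStream(hourlyCounts, Row.class);

// Retract stream: the Boolean flag marks add (true) vs. retract (false);
// an UPDATE arrives as (false, oldRow) followed by (true, newRow).
DataStream<Tuple2<Boolean, Row>> retractStream =
        tEnv.toRetractStream(urlCounts, Row.class);

// Changelog stream: each Row carries a RowKind (INSERT, UPDATE_BEFORE,
// UPDATE_AFTER, DELETE); with a primary key, sinks can consume it upsert-style.
DataStream<Row> changelogStream = tEnv.toChangelogStream(urlCounts);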

This concludes our study of the idea behind Flink dynamic tables. Hopefully it has resolved your doubts; pairing the theory with practice is the best way to learn, so go and try it out! If you want to keep learning about related topics, please keep following the site, where the editor will continue to bring you more practical articles!
