Network Security Internet Technology Development Database Servers Mobile Phone Android Software Apple Software Computer Software News IT Information

In addition to Weibo, there is also WeChat

Please pay attention

WeChat public account

Shulou

What are the three Sink modes of Flink Table?

2025-01-19 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Internet Technology >

Share

Shulou(Shulou.com)06/01 Report--

What are the three Sink modes of Flink Table? I believe many inexperienced people don't know what to do about it. Therefore, this paper summarizes the causes and solutions of the problem. Through this article, I hope you can solve this problem.

As a computing engine Flink application, the calculation results are always output in some way, such as printing to the console in the debugging phase or writing to the database in the production phase. For applications that already need to store intermediate and final results in Flink memory, such as aggregate statistics, the output is to synchronize the results in memory to the outside. As far as Flink Table/SQL API is concerned, there are three modes of synchronization here, namely, Append, Upsert, and Retract. In fact, these patterns for outputting calculation results are not limited to a particular computing framework, such as Storm, Spark, or Flink DataStream, but Flink Table/SQL has a complete concept and built-in implementation, which is easier to discuss.

Basic principle

I believe that students who have come into contact with Streaming SQL have understood or heard of the duality of flow table. To put it simply, flow and table are different manifestations of the same fact and can be converted into each other. The expression of flow and table is different in the industry, the author likes one: the flow reflects the change of facts in the time dimension, while the table reflects the view of facts at a certain point in time. If the flow is compared to the water flowing in the pipe, the meter will be the still water in the cup.

The method of converting a flow to a table is familiar to most readers, as long as the aggregate statistics function is applied to the flow, and the flow naturally becomes a table. (it is worth noting that the definition of Flink's Dynamic Table and table are slightly different, which will be described below). For example, for a simple stream computing job that calculates PV, the user browsing log data stream url classification statistics is changed into (url, views) such a table. However, readers may not have such a clear idea of how to convert a table into a stream.

Suppose the workflow of a typical real-time stream computing application can be simplified to the following figure:

The key point is whether Transformation aggregates the type of calculation. If not, the output is still a stream, and you can naturally use the original streaming Sink (connector with the external system); if so, the stream will be converted to a table, then the output will be a table, and the output of a table is usually a batch concept and cannot be expressed directly and simply by streaming Sink.

At this time, there is a very simple idea is, can we avoid batch processing the full amount of output, only the diff of the table, that is, changelog at a time. This is also how tables are converted into streams: continuously observe the changes in the table and record each change as a log output. Therefore, the transformation of the flow and the table can be represented by the following figure:

The changes of the table can be divided into three categories: INSERT, UPDATE and DELETE, and Flink summarizes the output modes of the three results according to these types of changes.

Whether the mode INSERTUPDATEDELETEAppend supports it or not. Does not support Upsert support Retract support.

Generally speaking, Append is the easiest to implement but the weakest, while Retract is the most difficult to implement and the most powerful. Below, we will talk about the characteristics and application scenarios of the three modes.

Append output mode

Append is the simplest output mode and only supports the operation of appending result records. Because the result will not change once output, the biggest feature of Append output mode is immutability, and the most desirable advantage of immutability is safety, such as thread safety or Event Sourcing recoverability, but it also brings limitations to business operations. Generally speaking, Append mode is used to write to storage systems that are not convenient for recall or delete operations, such as MQ such as Kafka, or print to the console.

In real-time aggregate statistics, the result output of aggregate statistics is determined by Trigger, while Append-Only means that Trigger can only be triggered once for each window instance (Pane, pane), which makes it impossible to refresh the results when the late data arrives. Generally speaking, we can set a large delay tolerance threshold for Watermark to avoid this refresh (any late data will be discarded), but at the cost of introducing a larger delay.

However, the Append output mode is very useful for Table that does not involve aggregation, because this type of Table simply arranges the records of the data flow together in chronological order, and the calculation between each record is independent. It is worth noting that from the perspective of DataFlow Model, streams that do not aggregate should not be called tables, but in the concept of Flink, all streams can be called Dynamic Table. The author believes that there is some truth in this design, because intercepting a segment from the flow can still satisfy the definition of the table, that is, "view at a certain point in time", and we can argue that non-aggregation is also an aggregation function.

Upsert output mode

Upsert is an upgraded version of Append mode that supports Append-Only operations and UPDATE and DELETE operations with primary keys. The Upsert schema relies on business primary keys to update and delete output, so it is very suitable for KV databases, such as

HBase and JDBC's TableSink all use this approach.

At the bottom, the result updates in Upsert mode are translated into (Boolean, ROW) tuples. The first element represents the operation type, true corresponds to the UPSERT operation (INSERT if it does not exist, UPDATE if it exists), false corresponds to the DELETE operation, and the second element is the record corresponding to the operation. If the result table itself is Append-Only, the first element will be all true, and there is no need to provide a business primary key.

Upsert mode is a practical mode at present, because most businesses provide atomic or composite primary keys, and there are many storage systems that support KV, but it is important to be careful not to change the primary key, which will be discussed in the next section.

Retract output mode

Retract is the most powerful but the most complex of the three output modes. It requires the target storage system to track each record, and these records can be recalled at least for a certain period of time, so it usually comes with its own system primary key and does not have to rely on the business primary key. However, because the big data storage system has few update operations that can be accurate to a record, at least in the native TableSink of Flink, there is no one that can meet this requirement in the production environment.

Unlike Upsert mode, which reoutputs the entire record when it is updated, Retract mode divides the update into two messages that represent the amount of increase or decrease, one is the Retract operation of (false, OldRow) and the other is the Accumulate operation of (true, NewRow). The advantage is that in the event of a change in the primary key, the Upsert output mode cannot recall the record of the old primary key, resulting in inaccurate data, which is not a problem in the Retract schema.

For example, suppose we classify the e-commerce orders according to the carrier express company and have the following result table.

Well, if the original order for Zhongtong express delivery is subsequently updated to use SF delivery, the Upsert model will generate such a changelog (true, (SF 4)), but the order number of Zhongtong has not been revised. In contrast, the Retract model outputs two pieces of data (false, (Zhongtong, 1)) and (true, (Shun Feng, 1)), and the data can be updated correctly.

The essence of the three modes of Flink Table Sink is how to monitor the result table and generate changelog, which can be applied to all scenarios where the table needs to be converted into a stream, including the interaction between different tables of the same Flink application. Among the three modes, Append mode only supports table INSERT, which is the simplest; Upsert mode relies on business primary key to provide all three types of changes: INSERT, UPDATE and DELETE, which is more practical; Retract schema also supports three types of changes and does not require business primary key, but UPDATE will be translated into the withdrawal of old data and the accumulation of new data, which is complicated in implementation.

After reading the above, have you mastered what the three Sink modes of Flink Table are? If you want to learn more skills or want to know more about it, you are welcome to follow the industry information channel, thank you for reading!

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

Views: 0

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.

Share To

Internet Technology

Wechat

© 2024 shulou.com SLNews company. All rights reserved.

12
Report