What is the principle of Flink batch and stream integration?

2025-02-22 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)05/31 Report--

This article explains the principle behind Flink's unified batch and stream processing. The content is concise and easy to follow, and I hope you gain something from the detailed introduction below.

There are many technologies for batch processing, from SQL on relational databases to MapReduce, Hive, Spark, and others in the big-data field. These are the classic ways of processing finite data sets. Flink focuses on infinite stream processing, so how does it handle batch processing?

Infinite stream processing: the input data has no end; processing begins at some point in time, current or past, and continues indefinitely.

Another form is finite stream processing, which starts processing data at one point in time and ends at another. The input may itself be finite (the input data set does not grow over time), or it may be artificially bounded for analysis purposes (for example, only events within a certain time period are analyzed).

Obviously, finite stream processing is a special case of infinite stream processing: the stream simply stops at some point in time. Moreover, if results are not produced continuously during execution but only once at the end, we call it batch processing.

Batch processing is thus a very special case of stream processing. In stream processing, we define sliding or tumbling windows over the data and produce results each time a window slides or tumbles. In batch processing, by contrast, we define one global window, and all records belong to that same window. For example, the following code is a simple Flink program that counts a website's visitors every hour, grouped by region.

val counts = visits
  .keyBy("region")
  .timeWindow(Time.hours(1))
  .sum("visits")

If you know that the input data is finite, you can implement batch processing with the following code.

val counts = visits
  .keyBy("region")
  .window(GlobalWindows.create)
  .trigger(EndOfTimeTrigger.create)
  .sum("visits")

Flink is unusual in that it can treat data as either infinite or finite streams. Flink's DataSet API is specifically designed for batch processing, as shown below.

val counts = visits
  .groupBy("region")
  .sum("visits")

If the input data is finite, this code produces the same result as the previous example, but it is friendlier to programmers who are used to batch processors.
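The equivalence on finite input can be illustrated with a plain-Python sketch (hypothetical data; Flink is not involved): summing all records per key inside one global window, whose "trigger" fires only when the input ends, yields exactly the same totals as a batch group-by.

```python
from collections import defaultdict

# Hypothetical visit records: (region, visits) pairs from a finite log.
visits = [("eu", 3), ("us", 5), ("eu", 2), ("asia", 7), ("us", 1)]

def batch_group_sum(records):
    """Batch view: group the whole (finite) data set by key, then sum."""
    totals = defaultdict(int)
    for region, n in records:
        totals[region] += n
    return dict(totals)

def global_window_sum(records):
    """Streaming view: one global window per key; the trigger fires
    only at end of input, emitting a single result per key."""
    window_state = defaultdict(int)   # per-key window contents (pre-aggregated)
    for region, n in records:         # records arrive one at a time
        window_state[region] += n
    # End of input reached -> the trigger fires once and all windows emit.
    return dict(window_state)

assert batch_group_sum(visits) == global_window_sum(visits)
print(global_window_sum(visits))  # {'eu': 5, 'us': 6, 'asia': 7}
```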

Flink batch model

Flink supports both stream and batch processing through a single underlying engine.

On top of the stream processing engine, Flink has the following mechanisms:

Checkpoint mechanism and stateful mechanism: used to implement fault-tolerant, stateful processing;

Watermark mechanism: used to implement event-time clocks;

Windows and triggers: used to limit the scope of calculations and define when results are presented.

On top of the same stream processing engine, Flink has another mechanism for efficient batch processing.

Backtracking for scheduling and recovery: introduced by Microsoft Dryad and now used in almost all batch processors;

Special memory data structures for hashing and sorting: some data can be spilled from memory to hard disk when needed;

Optimizer: Minimizes the time to produce results.

These two sets of mechanisms correspond to two separate APIs (the DataStream API and the DataSet API); they cannot be mixed in a single Flink job, so a job cannot take advantage of all of Flink's capabilities at once.

In the latest version, Flink supports two relational APIs, Table API and SQL. Both APIs are unified APIs for batch and stream processing, meaning that relational APIs execute queries with the same semantics and produce the same results on unbounded real-time data streams and bounded historical data streams. The Table API and SQL leverage Apache Calcite for query parsing, validation, and optimization. They integrate seamlessly with DataStream and DataSet APIs and support user-defined scalar, aggregate, and table-valued functions.

The Table API / SQL is becoming the primary API for analytical use cases, in a batch/stream-unified manner.

DataStream API is the primary API for data-driven applications and data pipelines.

In the long run, the DataStream API should fully encompass the DataSet API via a bounded data stream.

Flink Batch Performance

The following compares the performance of MapReduce, Tez, Spark, and Flink on pure batch-processing tasks. The workloads tested were TeraSort and a distributed hash join.

The first task is TeraSort, which measures the time it takes to sort 1TB of data.

TeraSort is essentially a distributed sorting problem consisting of several stages:

(1) Read phase: read the data partitions from HDFS files;

(2) Local sort phase: sort each partition locally;

(3) Shuffle phase: redistribute the data to processing nodes by key;

(4) Final sort phase: produce the sorted output;

(5) Write phase: write the sorted partitions back to HDFS files.
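The five stages can be sketched in miniature with plain Python on toy data (the real TeraSort operates on HDFS blocks and byte keys; the partition count, key range, and range-partitioning rule below are illustrative assumptions):

```python
import random

random.seed(42)

# (1) Read phase: pretend each partition is a block read from HDFS.
partitions = [[random.randint(0, 999) for _ in range(100)] for _ in range(4)]

# (2) Local sort phase: sort each partition independently.
locally_sorted = [sorted(p) for p in partitions]

# (3) Shuffle phase: redistribute records to nodes by key range
#     (node i owns keys in [i*250, (i+1)*250)).
num_nodes = 4
shuffled = [[] for _ in range(num_nodes)]
for part in locally_sorted:
    for key in part:
        shuffled[min(key // 250, num_nodes - 1)].append(key)

# (4) Final sort phase: each node sorts the keys it received.
node_outputs = [sorted(node) for node in shuffled]

# (5) Write phase: concatenating the node outputs in node order
#     yields a totally ordered result.
result = [k for node in node_outputs for k in node]
assert result == sorted(k for p in partitions for k in p)
```

Because nodes own disjoint, ordered key ranges, no global merge is needed at the end; this is the same trick real TeraSort implementations use.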

The Hadoop distribution includes an implementation for TeraSort, and the same implementation can be used for Tez because Tez can execute programs written through the MapReduce API. TeraSort implementations of Spark and Flink are provided by Dongwon Kim. The cluster used for the measurements consisted of 42 machines, each containing 12 CPU cores, 24GB of memory, and 6 hard disks.

The results show that Flink takes less sorting time than all other systems. MapReduce took 2157 seconds, Tez 1887 seconds, Spark 2171 seconds, and Flink 1480 seconds.

The second task is a distributed hash join between a large data set (240GB) and a small data set (256MB). The results show that Flink is again the fastest system, taking about half the time of Tez and a quarter of the time of Spark.
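A hash join builds an in-memory hash table from the small side and streams the large side past it; a single-node sketch in plain Python (toy data, hypothetical records) shows the core of the technique:

```python
# Small side (256MB in the benchmark): build a hash table on the join key.
small = [(1, "a"), (2, "b"), (3, "c")]
build_table = {}
for key, val in small:
    build_table.setdefault(key, []).append(val)

# Large side (240GB in the benchmark): stream through and probe the table.
large = [(1, "x"), (3, "y"), (4, "z"), (1, "w")]
joined = [
    (key, lval, sval)
    for key, lval in large                 # probe phase: a single pass
    for sval in build_table.get(key, [])   # inner join: misses are skipped
]
print(joined)  # [(1, 'x', 'a'), (3, 'y', 'c'), (1, 'w', 'a')]
```

The large side is never materialized: each record is probed and discarded, which is why the build side should be the small data set.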

The overall reason for these results is that Flink's execution is stream-based: there is more overlap between processing stages, and shuffle operations are pipelined, so there are fewer disk accesses. MapReduce, Tez, and Spark, by contrast, are batch-based, meaning data must be written to disk before it can be sent over the network. The test shows that with Flink, both system idle time and disk accesses are lower.
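The pipelined-versus-blocking distinction can be sketched with Python generators (a deliberately simplified model): in the pipelined version each record flows to the next stage as soon as it is produced, while in the blocking version the whole intermediate result is materialized first, standing in for a write to disk before network transfer.

```python
def produce():
    """Upstream stage: emits records one at a time."""
    for i in range(5):
        yield i * 2

def pipelined(records):
    """Downstream consumes each record as it arrives -> stages overlap,
    and no intermediate collection is ever held in full."""
    for r in records:
        yield r + 1

def blocking(records):
    """Materialize the full intermediate result first (stand-in for
    spilling it to disk), then process it as a whole."""
    materialized = list(records)   # upstream must finish before this returns
    return [r + 1 for r in materialized]

# Both models compute the same answer; they differ in when work overlaps.
assert list(pipelined(produce())) == blocking(produce()) == [1, 3, 5, 7, 9]
```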

It is worth mentioning that the raw values in the performance test results may vary depending on cluster settings, configurations, and software versions.

As a result, Flink can process both unbounded and bounded data streams with the same data processing framework, without sacrificing performance.

The above is the principle behind Flink's batch and stream unification. Did you pick up new knowledge or skills? If you want to learn more, keep an eye on the industry information channel.
