
How to realize the Integration of batch and flow in Flink

2025-03-01 Update. From: SLTechnology News&Howtos


Shulou(Shulou.com)05/31 Report--

This article explains in detail how Flink achieves unified batch and stream processing. The content is shared for your reference; I hope that after reading it you will have a solid understanding of the relevant concepts.

There are many technologies for batch processing, from SQL processing in relational databases to MapReduce, Hive, and Spark in the big data field. These are all classic ways of processing bounded data sets. Flink focuses on unbounded stream processing, so how does it handle batch processing?

Infinite stream processing: the input data has no end; processing starts now or at some point in the past and continues indefinitely.

The other form is finite stream processing: processing starts at one point in time and ends at another. The input data itself may be bounded (the input data set does not grow over time), or it may be artificially bounded for analytical purposes (only events within a certain period are analyzed).

Clearly, finite stream processing is a special case of infinite stream processing that simply stops at some point in time. Furthermore, if results are not produced continuously during execution but only once at the end, we have batch processing.
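To make this concrete, here is a minimal sketch in plain Python (not Flink code; the event data and function names are illustrative) showing that "batch" is just the stream computation with one global window and a single result at the end.

```python
# Concept sketch: the same finite event stream processed "as a stream"
# (one result per hourly window) and "as a batch" (one global result).
from collections import defaultdict

events = [
    # (region, hour, visits)
    ("eu", 0, 3), ("us", 0, 5), ("eu", 1, 2), ("us", 1, 4),
]

def stream_counts(events):
    """Tumbling 1-hour windows: emit one dict of counts per window."""
    windows = defaultdict(lambda: defaultdict(int))
    for region, hour, visits in events:
        windows[hour][region] += visits
    # results are produced window by window as the stream advances
    return [dict(counts) for _hour, counts in sorted(windows.items())]

def batch_counts(events):
    """One global window: a single result when the input ends."""
    totals = defaultdict(int)
    for region, _hour, visits in events:
        totals[region] += visits
    return dict(totals)

print(stream_counts(events))  # [{'eu': 3, 'us': 5}, {'eu': 2, 'us': 4}]
print(batch_counts(events))   # {'eu': 5, 'us': 9}
```

The batch result is exactly what the streaming computation would emit if all records fell into one window that fires once at the end of the input.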

Batch processing is thus a very special case of stream processing. In stream processing we define a sliding or tumbling window over the data and produce a result every time the window slides or tumbles. In batch processing we instead define one global window, and all records belong to it. For example, the following code is a simple Flink program that counts a site's visitors every hour, grouped by region.

val counts = visits.keyBy("region").timeWindow(Time.hours(1)).sum("visits")

If you know that the input data is bounded, you can implement the same batch computation with the following code.

val counts = visits.keyBy("region").window(GlobalWindows.create).trigger(EndOfTimeTrigger.create).sum("visits")

What is unusual about Flink is that it can treat data either as an infinite stream or as a finite stream. Flink's DataSet API was created specifically for batch processing, as shown below.

val counts = visits.groupBy("region").sum("visits")

If the input data is bounded, this code produces the same result as the previous examples, but it looks friendlier to programmers who are used to batch processors.

Flink's batch processing model

Flink supports both stream and batch processing through a single underlying engine.

On top of the streaming engine, Flink has the following mechanisms:

Checkpoint and state mechanisms: for fault-tolerant, stateful processing

Watermark mechanism: implements the event-time clock

Windows and triggers: limit the scope of a computation and define when results are emitted
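The watermark mechanism can be illustrated with a minimal sketch in plain Python (this is a simplified model, not Flink's API; the stream format and constants are illustrative): a watermark advances the event-time clock, and a tumbling window fires only once the watermark passes the end of that window, which tolerates out-of-order events.

```python
from collections import defaultdict

WINDOW = 10  # tumbling window size, in event-time units

def process(stream):
    """stream: list of ('event', ts, value) or ('watermark', ts) tuples.
    Returns fired windows as (window_start, sum_of_values) pairs."""
    buffers = defaultdict(int)
    fired = []
    watermark = float("-inf")
    for item in stream:
        if item[0] == "event":
            _, ts, value = item
            if ts > watermark:  # events behind the watermark are late; drop them here
                buffers[ts // WINDOW * WINDOW] += value
        else:
            _, watermark = item
            for start in sorted(buffers):
                if start + WINDOW <= watermark:  # window is complete: fire it
                    fired.append((start, buffers.pop(start)))
    return fired

stream = [
    ("event", 3, 1), ("event", 12, 1), ("event", 7, 1),  # ts 7 arrives out of order
    ("watermark", 10),                                    # fires window [0, 10)
    ("event", 15, 1),
    ("watermark", 20),                                    # fires window [10, 20)
]
print(process(stream))  # [(0, 2), (10, 2)]
```

Note how the out-of-order event at timestamp 7 is still counted in the first window, because that window waits for the watermark before emitting its result.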

On top of the same streaming engine, Flink has another set of mechanisms for efficient batch processing.

Backtracking for scheduling and recovery: introduced by Microsoft Dryad and now used in almost all batch processors

Special in-memory data structures for hashing and sorting: part of the data can be spilled from memory to disk when needed

Optimizer: minimize the time it takes to generate results.
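The spill-to-disk idea behind those data structures can be sketched with a toy external sort in plain Python (this is an illustration of the general technique, not Flink's implementation; the function names and the run size are made up):

```python
# Toy external merge sort: sort runs that fit in memory, spill each sorted
# run to a temporary file on disk, then k-way merge the runs.
import heapq
import pickle
import tempfile

def external_sort(items, max_in_memory=4):
    runs = []  # open temp files, each holding one sorted run

    def spill(buf):
        f = tempfile.TemporaryFile()
        pickle.dump(sorted(buf), f)  # sort in memory, write to disk
        runs.append(f)

    buf = []
    for x in items:
        buf.append(x)
        if len(buf) >= max_in_memory:  # "memory" is full: spill a run
            spill(buf)
            buf = []
    if buf:
        spill(buf)

    loaded = []
    for f in runs:
        f.seek(0)
        loaded.append(pickle.load(f))
        f.close()
    return list(heapq.merge(*loaded))  # merge the sorted runs

print(external_sort([5, 3, 8, 1, 9, 2, 7, 4, 6]))  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

A real engine streams the merged output instead of materializing it, but the memory/disk trade-off is the same.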

These two sets of mechanisms correspond to their respective APIs (the DataStream API and the DataSet API); a Flink job uses one or the other, so you cannot mix them to take advantage of all of Flink's features at once.

In recent versions, Flink supports two relational APIs: the Table API and SQL. Both are unified APIs for batch and stream processing, which means that on unbounded real-time streams and on bounded historical data, relational queries execute with the same semantics and produce the same results. The Table API and SQL use Apache Calcite to parse, validate, and optimize queries. They integrate seamlessly with the DataStream and DataSet APIs and support user-defined scalar, aggregate, and table-valued functions.

The Table API and SQL are becoming the primary, unified APIs for analytical use cases.

The DataStream API is the primary API for data-driven applications and data pipelines.

In the long run, the DataStream API should fully subsume the DataSet API through bounded streams.

Flink batch performance

Here we compare the performance of MapReduce, Tez, Spark, and Flink on pure batch tasks. The tasks tested are TeraSort and a distributed hash join.

The first task is TeraSort, which measures the time it takes to sort 1 TB of data.

TeraSort is essentially a distributed sorting problem, consisting of the following phases:

(1) Read phase: read data partitions from HDFS files

(2) Local sort phase: partially sort each of those partitions

(3) Shuffle phase: redistribute the data to the processing nodes by key

(4) Final sort phase: produce the sorted output

(5) Write phase: write the sorted partitions to HDFS files
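The phases above can be sketched as a toy single-process model in plain Python (an illustration only; real TeraSort runs these phases across many workers and samples the data to pick balanced key ranges, which this sketch does not do):

```python
# Toy model of the TeraSort pipeline: shuffle (range-partition by key),
# final sort per partition, then "write" by concatenating partition outputs.
def terasort(records, num_workers=3, max_key=100):
    # Shuffle phase: route each key to the worker that owns its key range,
    # so partition i holds only keys smaller than everything in partition i+1.
    span = (max_key + num_workers - 1) // num_workers
    partitions = [[] for _ in range(num_workers)]
    for key in records:
        partitions[min(key // span, num_workers - 1)].append(key)
    # Final sort phase: each worker sorts its own partition independently.
    sorted_parts = [sorted(p) for p in partitions]
    # Write phase: concatenating the outputs yields a globally sorted result.
    return [k for part in sorted_parts for k in part]

data = [42, 7, 99, 13, 58, 3, 77, 21]
print(terasort(data))  # [3, 7, 13, 21, 42, 58, 77, 99]
```

The key property is that range partitioning makes the per-worker sorts independent: no global merge step is needed at the end.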

The Hadoop distribution includes an implementation of TeraSort, and the same implementation can be used with Tez, since Tez can execute programs written against the MapReduce API. The Spark and Flink TeraSort implementations were provided by Dongwon Kim. The cluster used for the measurements consists of 42 machines, each with 12 CPU cores, 24 GB of memory, and 6 hard drives.

The results show that Flink sorts in less time than all the other systems: MapReduce took 2157 seconds, Tez 1887 seconds, Spark 2171 seconds, and Flink 1480 seconds.

The second task is a distributed hash join between a large data set (240 GB) and a small data set (256 MB). The results show that Flink is again the fastest system: its runtime is about 1/2 that of Tez and 1/4 that of Spark.
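When one join input is small, the standard strategy is a broadcast hash join: build an in-memory hash table on the small side and stream the large side through it. Here is a minimal sketch in plain Python (the data and function names are illustrative, and this is not Flink's implementation):

```python
def hash_join(big, small):
    """big, small: iterables of (key, value) pairs.
    Yields (key, big_value, small_value) for every matching pair."""
    # Build phase: the small side fits in memory as a hash table.
    table = {}
    for key, value in small:
        table.setdefault(key, []).append(value)
    # Probe phase: a single streaming pass over the big side.
    for key, value in big:
        for match in table.get(key, ()):
            yield (key, value, match)

big = [(1, "a"), (2, "b"), (1, "c"), (3, "d")]
small = [(1, "x"), (3, "y")]
print(list(hash_join(big, small)))  # [(1, 'a', 'x'), (1, 'c', 'x'), (3, 'd', 'y')]
```

Because the probe side is never materialized, the large input can be arbitrarily big; only the small side must fit in memory (or be spilled, as described above).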

The overall reason for these results is that Flink's execution is stream-based: the processing phases overlap more, and shuffle operations are pipelined, so there are fewer disk accesses. In contrast, MapReduce, Tez, and Spark execute in a batch-based fashion, which means data must be written to disk before it can be sent over the network. The tests show that when using Flink, the system has less idle time and fewer disk accesses.
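The pipelined-versus-staged distinction can be mimicked with Python generators (a loose analogy, not how either engine is implemented): generator stages pass each record downstream as soon as it is produced, while list comprehensions materialize each stage completely before the next one starts.

```python
def pipelined(source):
    # Generator stages: map, filter, and sum run interleaved, record by record,
    # like a pipelined shuffle that never materializes intermediate results.
    mapped = (x * 2 for x in source)
    filtered = (x for x in mapped if x > 4)
    return sum(filtered)

def staged(source):
    # Materialized stages: each list is fully built ("written out") before the
    # next stage starts, like a batch engine writing between phases.
    mapped = [x * 2 for x in source]
    filtered = [x for x in mapped if x > 4]
    return sum(filtered)

data = range(1, 6)
print(pipelined(data), staged(data))  # 24 24
```

Both compute the same answer; the difference is when intermediate data exists, which is exactly where the idle time and disk traffic in the batch engines come from.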

It is worth mentioning that the original values in the performance test results may vary depending on the cluster settings, configuration, and software version.

That concludes this look at how Flink achieves unified batch and stream processing. I hope the content above has been helpful and that you have learned something new. If you found the article worthwhile, feel free to share it so more people can see it.

