How to solve the task split of database and table based on Flink

2025-04-01 Update From: SLTechnology News & Howtos

Shulou (Shulou.com) 06/01 Report

This article explains in detail how FlinkX splits synchronization tasks for a sharded (multi-database, multi-table) data source on top of Flink. After reading it, you should have a working understanding of the relevant concepts.

1. Scene Description

For example, suppose an order database has been sharded into multiple database instances and tables, as shown in the following figure:

The requirement is to synchronize all of this data to an MQ cluster with a single task, rather than creating a separate task per database instance, because apart from the database they connect to, the synchronization tasks share the same table structure and data-mapping rules.

2. FlinkX Solution in Detail

2.1 Flink Stream API development basic process

The general steps for programming with the Flink Stream API are:

1. Obtain a StreamExecutionEnvironment.
2. Add one or more sources via addSource to create a DataStream.
3. Apply transformations (map, filter, keyBy, and so on).
4. Add a sink via addSink.
5. Call execute to submit the job.

Note: the Stream API will be covered in detail in subsequent articles. This article focuses on InputFormatSourceFunction, and in particular on how the data source is split.

2.2 FlinkX Reader (data source) core class diagram

In FlinkX, each data source is encapsulated as a Reader, whose base class is BaseDataReader. The figure above mainly covers the following key class hierarchies:

InputFormat

A core Flink API that abstracts splitting the input source and reading its data. Its core methods are described below:

void configure(Configuration parameters)

Performs additional configuration of the input source. This method is called exactly once during the lifetime of the InputFormat.

BaseStatistics getStatistics(BaseStatistics cachedStatistics)

Returns statistics about the input. If statistics are not needed, the implementation may simply return null.

T[] createInputSplits(int minNumSplits)

Slices the input data into splits to support parallel processing. See InputSplit for the class hierarchy related to data splitting.

InputSplitAssigner getInputSplitAssigner(T[] inputSplits)

Returns the InputSplit assigner, which determines how a running task obtains its next InputSplit; its declaration is shown in the following figure:

void open(T split)

Opens the data channel for the given InputSplit. To better understand this method, consider FlinkX's JDBC and Elasticsearch implementations as examples:

boolean reachedEnd()

Returns whether the end of the data has been reached. InputFormat-based sources in Flink usually represent bounded data (DataSet).

OT nextRecord(OT reuse)

Gets the next record from the channel.

void close()

Closes the input format and releases resources.
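To make the lifecycle concrete, here is a tiny self-contained mimic of the createInputSplits / open / reachedEnd / nextRecord / close flow over an in-memory array. This is purely illustrative and is not Flink's actual InputFormat interface:

```java
// Self-contained mimic of the InputFormat lifecycle (createInputSplits ->
// open -> reachedEnd/nextRecord -> close); illustrative only, not Flink's API.
public class MiniInputFormat {
    private final int[] data;
    private int cursor;
    private int endExclusive;

    public MiniInputFormat(int[] data) { this.data = data; }

    // Slice the data into minNumSplits [start, end) ranges.
    public int[][] createInputSplits(int minNumSplits) {
        int[][] splits = new int[minNumSplits][2];
        int chunk = (data.length + minNumSplits - 1) / minNumSplits;
        for (int i = 0; i < minNumSplits; i++) {
            splits[i][0] = Math.min(i * chunk, data.length);
            splits[i][1] = Math.min((i + 1) * chunk, data.length);
        }
        return splits;
    }

    // Open the "channel" for one split.
    public void open(int[] split) { cursor = split[0]; endExclusive = split[1]; }

    public boolean reachedEnd() { return cursor >= endExclusive; }

    public int nextRecord() { return data[cursor++]; }

    public void close() { /* release resources here */ }

    public static void main(String[] args) {
        MiniInputFormat fmt = new MiniInputFormat(new int[]{1, 2, 3, 4, 5});
        for (int[] split : fmt.createInputSplits(2)) {
            fmt.open(split);
            while (!fmt.reachedEnd()) System.out.print(fmt.nextRecord() + " ");
            fmt.close();
        }
        // prints: 1 2 3 4 5
    }
}
```

Each parallel task instance would run this loop against only the splits assigned to it, which is exactly what the InputSplitAssigner coordinates in Flink.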

InputSplit

The root interface for data splits. It defines only the following method:

int getSplitNumber()

Returns the sequence number (index) of the current split among all splits.

This article briefly introduces its generic subclass, GenericInputSplit, which has two fields:

int partitionNumber

The index of the current split.

int totalNumberOfPartitions

The total number of splits.

To make this concrete, consider a table with tens of millions of rows. We can split the read into 10 slices and process them with 10 parallel threads; each thread then reads only the rows where id % totalNumberOfPartitions = partitionNumber.
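The modulo scheme above can be sketched as follows. The helper names here are hypothetical (not FlinkX code); they only show how each split's filter and row assignment line up:

```java
// Illustrative sketch (not FlinkX source): assign rows to splits by taking the
// split key modulo the total number of partitions.
public class ModuloSplitDemo {

    // Hypothetical helper: the WHERE fragment one split would append to its query.
    static String whereClause(String splitKey, int totalNumberOfPartitions, int partitionNumber) {
        return String.format("%s %% %d = %d", splitKey, totalNumberOfPartitions, partitionNumber);
    }

    // Which split is responsible for a given row id.
    static int splitFor(long id, int totalNumberOfPartitions) {
        return (int) (id % totalNumberOfPartitions);
    }

    public static void main(String[] args) {
        int total = 10;
        System.out.println(whereClause("id", total, 3)); // id % 10 = 3
        System.out.println(splitFor(1234567L, total));   // 7
    }
}
```

Every row lands in exactly one split, so the 10 threads together read the full table with no overlap.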

SourceFunction

The abstract definition of a Flink source.

RichFunction

A rich function that defines lifecycle methods (open/close) and exposes the runtime context.

ParallelSourceFunction

Marks a source function that can be executed in parallel.

RichParallelSourceFunction

A parallel source function with rich-function lifecycle support.

InputFormatSourceFunction

The RichParallelSourceFunction implementation provided by Flink out of the box. It can be regarded as the canonical way to write a RichParallelSourceFunction, with its internal data-reading logic delegated to an InputFormat.

BaseDataReader

The FlinkX base class for data reading; in FlinkX, every data source is wrapped in a Reader.

2.3 flinkx Reader Build DataStream Process

Having sorted out the class diagram above, we now have a general idea of what these classes mean in Flink, but how are they used? Let's trace the readData call flow of FlinkX's DistributedJdbcDataReader (a subclass of BaseDataReader).

The flow basically follows these steps: create an InputFormat, wrap it in a corresponding SourceFunction, and then create a DataStreamSource from that SourceFunction via StreamExecutionEnvironment's addSource method.

2.4 FlinkX task splitting for sharded databases and tables

As described in the opening scenario, suppose an order system is sharded into 4 databases and 8 tables, each database (schema) containing 2 tables. How can we improve the performance of data extraction? The usual solutions are:

First, split by database and table: the 4 databases with 8 tables in total can be split into 8 units, each unit processing 1 table in one instance.

Then split the data of a single table further, for example by taking the modulo of the ID.

FlinkX adopts exactly this strategy; let's look at its concrete implementation.

Step 1: First, split by database instance and table, organizing them into a DataSource list along the table dimension; the subsequent splitting algorithm then runs on this list.
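A minimal sketch of Step 1 for the 4-database, 2-tables-each scenario. The DataSource record and URLs here are hypothetical, not FlinkX's actual classes; the point is flattening into one entry per (instance, table) pair:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of Step 1: organize (jdbcUrl, table) pairs along the
// table dimension. DataSource here is a hypothetical stand-in, not FlinkX code.
public class BuildSourceList {
    record DataSource(String jdbcUrl, String table) {}

    static List<DataSource> buildSourceList(List<String> jdbcUrls, List<String> tables) {
        List<DataSource> sources = new ArrayList<>();
        for (String url : jdbcUrls) {
            for (String table : tables) {
                sources.add(new DataSource(url, table)); // one entry per table
            }
        }
        return sources;
    }

    public static void main(String[] args) {
        List<DataSource> sources = buildSourceList(
                List.of("jdbc:mysql://host1/order_0", "jdbc:mysql://host1/order_1",
                        "jdbc:mysql://host2/order_2", "jdbc:mysql://host2/order_3"),
                List.of("t_order_0", "t_order_1"));
        System.out.println(sources.size()); // 8 table-level sources
    }
}
```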

Next, the actual task splitting is implemented in the InputFormat; in this example, in createInputSplitsInternal of DistributedJdbcInputFormat.

DistributedJdbcInputFormat#createInputSplitsInternal

Step 2: Create the inputSplit array based on numPartitions; a partition here corresponds to the first item of the scheme above. (DistributedJdbcInputFormat#createInputSplitsInternal)

Step 3: If a splitKey is specified, a DistributedJdbcInputSplit (which inherits from GenericInputSplit) is created for each of the numPartitions partitions, and the query parameters are generated. The key part is the SQL WHERE condition splitKey mod totalNumberOfPartitions = partitionNumber, where splitKey is the shard key (such as id), totalNumberOfPartitions is the total number of partitions, and partitionNumber is the index of the current partition; the data is thus split via the SQL modulo function. (DistributedJdbcInputFormat#createInputSplitsInternal)

Step 4: If no table-level split key is specified, the strategy is to split the sourceList itself, i.e., each partition handles a subset of the tables.
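Step 4 amounts to distributing the table-level sources across the partitions. A minimal sketch, assuming a round-robin distribution (FlinkX's exact assignment may differ):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of Step 4: without a split key, each partition takes a
// subset of the table-level sources (round-robin here; FlinkX's exact
// distribution may differ).
public class DistributeTables {

    static List<List<String>> distribute(List<String> sourceList, int numPartitions) {
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) partitions.add(new ArrayList<>());
        for (int i = 0; i < sourceList.size(); i++) {
            partitions.get(i % numPartitions).add(sourceList.get(i)); // round-robin
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<List<String>> parts = distribute(
                List.of("t0", "t1", "t2", "t3", "t4", "t5", "t6", "t7"), 4);
        System.out.println(parts); // [[t0, t4], [t1, t5], [t2, t6], [t3, t7]]
    }
}
```

With 8 tables and 4 partitions, each partition ends up reading 2 tables, matching the scenario at the start of the article.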

So much for task splitting in FlinkX.

3. Summary

This article introduced how FlinkX splits tasks for sharded MySQL databases and tables on top of Flink, and briefly covered Flink's basic programming paradigm along with the InputFormat and SourceFunction class hierarchies.
