This article looks at Flink's runtime through the lens of Spark, analyzing the corresponding concepts in detail in the hope of giving readers a simple, practical way to reason about them.
The Flink runtime has two main roles: the JobManager and the TaskManager. Both are started whether the cluster runs in standalone mode or on YARN. The architecture is somewhat similar to MRv1: the JobManager is mainly responsible for accepting jobs from clients, scheduling them, coordinating checkpoints, and so on, while the TaskManager executes the actual tasks. To isolate resources and increase the number of tasks allowed, the TaskManager introduces the concept of a slot. A slot isolates resources by memory only, and the memory is shared equally: for example, if a TaskManager has 3 GB of managed memory and three slots, each slot gets only 1 GB.
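As an illustration, such a setup might be expressed in flink-conf.yaml roughly as follows (the exact memory key has changed across Flink versions; taskmanager.heap.mb is the legacy key from older releases, so treat these names as an assumption rather than gospel):

taskmanager.numberOfTaskSlots: 3 # three slots per TaskManager
taskmanager.heap.mb: 3072 # ~3 GB total, shared equally: ~1 GB per slot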
As a rule of thumb, a good default for the number of task slots is the number of CPU cores. With hyper-threading, each slot then takes two or more hardware thread contexts.
The role of the Client is mainly to do the preparatory work for job submission, such as building the JobGraph and submitting it to the JobManager; the client may exit immediately after submission, or stay attached to monitor progress.
Communication between the JobManager and the TaskManagers is similar to earlier versions of Spark: both use an actor system (Akka).
Based on the description above, the runtime architecture diagram looks as follows (figure not reproduced here):
So what exactly is a Task?
To answer that, let's first review three key Spark concepts:
1. Shuffle
The number of shuffles in a Spark job determines the number of stages.
2. Partitioning
The number of partitions of an RDD determines the parallelism of the tasks in its stage.
3. Partition propagation
Leaving aside more complex cases such as union and join for now, consider a simple call chain:
rdd.map --> filter --> reduceByKey --> map
Assume the source rdd has six partitions. From map to filter the number of partitions stays the same, while from filter to reduceByKey it can change. reduceByKey has a default formula for its partition count (discussed previously on the author's Planet column); suppose here we explicitly pass a partition count of 12 to reduceByKey.
The partition counts are then: 6 for the first map and the filter, and 12 for reduceByKey and the map after it.
override def getPartitions: Array[Partition] = firstParent[T].partitions
Narrow transformations such as map fully inherit the partitioner and partition count of the parent RDD, as the getPartitions implementation above shows; their parallelism cannot be set directly. A new parallelism can only be passed in at a shuffle.
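To make the partition bookkeeping concrete, here is a minimal Spark sketch in Scala that mirrors the chain above (it assumes an existing SparkContext named sc; the data itself is made up for illustration):

val rdd = sc.parallelize(1 to 100, 6) // 6 partitions
val mapped = rdd.map(x => (x % 10, x)) // narrow: still 6
val filtered = mapped.filter { case (_, v) => v > 5 } // narrow: still 6
val reduced = filtered.reduceByKey(_ + _, 12) // shuffle: now 12 partitions
val out = reduced.map(identity) // narrow: inherits 12
println(out.partitions.length) // prints 12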
The point of this review is to set up two questions about Flink:
What determines Flink's parallelism?
What is a Flink task?
1. What determines Flink's parallelism?
This one is simple: every Flink operator can have its parallelism set individually, and a global default parallelism can be set as well.
Setting it through the API:
.map(new RollingAdditionMapper()).setParallelism(10)
The global default is configured in the flink-conf.yaml file via parallelism.default, which defaults to 1.
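Putting the two together, here is a minimal sketch using Flink's Scala DataStream API (the socket source and the word-count logic are illustrative assumptions, not taken from the original article):

import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(4) // job-wide default; overrides parallelism.default from flink-conf.yaml

env.socketTextStream("localhost", 9999)
  .flatMap(_.split("\\s+"))
  .map(w => (w, 1)).setParallelism(10) // per-operator setting wins over the global one
  .keyBy(_._1)
  .sum(1)
  .print()

env.execute("parallelism-demo")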
2. What is a Flink task?
In principle, each parallel instance of an operator is a subtask; let's call it a subtask for now to keep the terms apart. But this raises a problem: Flink's TaskManager runs each task in a separate thread, so a large number of subtasks means a lot of thread-switching overhead, which in turn hurts throughput.
To mitigate this, Flink applies an optimization: it chains subtasks together, and each chained task is scheduled and executed as a unit in a single thread.
In the figure below (not reproduced here), the source/map operators are chained, keyby/window/apply are chained, and the sink is a separate operator.
Note: the figure assumes that source/map has parallelism 2, keyby/window/apply has parallelism 2, and the sink has parallelism 1 — five subtasks in total, and therefore five threads.
With the understanding so far, the execution graph should look like this:
At this point some readers will object: that is not what I actually observe.
Right — this is yet another Flink optimization.
By default, Flink allows subtasks to share slots as long as they belong to different tasks of the same job.
As a result, one slot may hold an entire pipeline of the job, as in the figure above. The main advantages of this are:
1. A Flink cluster needs exactly as many task slots as the highest parallelism used in the job. In other words, there is no need to calculate the total number of tasks a program generates.
2. Resources are utilized more fully. Without slot sharing, the non-intensive source/flatmap subtasks would tie up as many slots as the intensive keyAggregation/sink subtasks. With slot sharing, raising the base parallelism from 2 to 6 makes full use of the slots while still spreading the heavy subtasks evenly: the keyby/window/apply subtasks are distributed across all the requested slots, balancing the load.
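Slot sharing can also be steered explicitly per operator: operators placed in different slot sharing groups never share a slot, and by default every operator belongs to the group named "default". A minimal sketch (the group name and pipeline are illustrative assumptions):

import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment

env.socketTextStream("localhost", 9999)
  .map(_.toUpperCase) // stays in the "default" slot sharing group
  .filter(_.nonEmpty).slotSharingGroup("filters") // this operator, and downstream ones by inheritance, move to "filters"
  .print()

env.execute("slot-sharing-demo")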
As for the principle of chaining — under what circumstances will operators be chained into one task? A brief outline (a sketch of the user-facing controls follows this list):
The upstream and downstream operators have the same parallelism.
The in-degree of the downstream node is 1 (that is, the downstream node has no input from other nodes).
The upstream and downstream nodes are in the same slot sharing group (explained above).
The chain strategy of the downstream node is ALWAYS (it can chain both upstream and downstream; map, flatmap, filter, etc. default to ALWAYS).
The chain strategy of the upstream node is ALWAYS or HEAD (HEAD can only chain downstream, not upstream; sources default to HEAD).
The data partitioning between the two nodes is forward (see the discussion of data-flow partitioning).
The user has not disabled chaining.
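When any of these conditions needs to be broken deliberately, chaining can be controlled from user code. A minimal sketch (the operator logic is an illustrative assumption):

import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment
// env.disableOperatorChaining() // disables chaining for the whole job

env.socketTextStream("localhost", 9999)
  .map(_.trim)
  .filter(_.nonEmpty).startNewChain() // may chain with downstream operators, but not with the map above
  .map(_.length).disableChaining() // this operator is never chained at all
  .print()

env.execute("chaining-demo")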
That is all for this look at Flink's runtime through the lens of Spark; hopefully it helps clarify how the pieces fit together.