2025-01-19 Update From: SLTechnology News&Howtos
Shulou (Shulou.com) 06/03 Report --
Author: Ren Chunde
As the next-generation big data computing engine, Apache Flink is developing rapidly, and its internal architecture is continually optimized and refactored to support more runtime environments and larger computing scale. Flink Improvement Proposal 6 (FLIP-6) redesigned a unified architecture for resource scheduling across cluster management systems (Standalone, YARN, Kubernetes, etc.). This article introduces the evolution of that resource scheduling architecture and its design features, such as clean layering; the implementation of the Per-Job and Session modes on YARN; and the design, still under discussion, for native integration with Kubernetes.
This article covers:
Apache Flink Standalone Cluster
Native Integration of Apache Flink and YARN
Native Integration of Apache Flink and Kubernetes
Summary

Apache Flink Standalone Cluster
The Standalone cluster deployment of Flink, shown in Figure 1, is a master-slave architecture: the master, the JobManager (JM), is responsible for scheduling a Job's compute units (Tasks), while the TaskManagers (TM) register with the JobManager and execute Tasks in threads.
(Figure 1: Flink Standalone cluster architecture)
It is called Standalone because it does not rely on an underlying resource scheduling system; it is deployed and launched directly on bare-metal nodes. Although it can be deployed and managed easily with automated ops tools, it has the following problems:
Isolation: multiple Jobs run in one cluster, and Tasks of different Jobs may execute on the same TM. There is no control over the resources (CPU/memory) these threads use, so Jobs affect one another; a single Task can even cause an OutOfMemoryError that takes down the whole TM and every Job running on it. Multiple Jobs are also scheduled by the same JM, so a misbehaving Job can likewise disrupt the scheduling of the others.
Multi-tenant resource (quota) management: the total amount of resources a user's Jobs consume cannot be capped, and there is no resource coordination among tenants.
Cluster availability: although the JM can run with a Standby for high availability, JM and TM processes are not supervised, so given the isolation problems above it is easy for enough processes to go down that the whole cluster becomes unavailable.
Cluster operations: version upgrades, capacity expansion, and similar operations require complex manual work.
To solve these problems, we need to run Flink on a popular, mature resource scheduling system such as YARN, Kubernetes, or Mesos. How can this be achieved?
Native Integration of Flink and YARN

Apache Flink Standalone Cluster on YARN
A simple and effective approach is to deploy a Flink Standalone cluster onto a YARN cluster using the features YARN itself supports, as shown in Figure 2 (Apache Flink Standalone Cluster on YARN):
Each Job can be submitted as its own YARN Application, and each application is a standalone cluster that runs independently; relying on isolation mechanisms such as cgroups supported by YARN itself avoids interference between jobs and solves the isolation problem.
Applications of different users can also run in different YARN scheduler queues, solving multi-tenancy through queue quota management.
At the same time, we can rely on YARN's restart/retry and rescheduling of application processes to make the Flink Standalone cluster highly available.
With only minor changes to parameters and configuration files, and by distributing the Flink jars through YARN's distributed cache, upgrades and scaling become easy.
Although this solves the problems above, running one Standalone cluster per (small) Job makes efficient resource utilization difficult, because:
The cluster size (number of TMs) is specified statically when the YARN application starts, and Flink's own compilation and optimization make it hard to estimate resource demand before running. It is therefore hard to choose a sensible number of TMs: too many wastes resources; too few slows the Job down or prevents it from running at all.
The resource size of each TM is also specified statically and is equally hard to estimate. TMs of different sizes cannot be requested dynamically according to per-Task resource requirements; every TM has the same size, so Tasks rarely pack into them exactly and the leftover resources are wasted.
Starting the application (1. Submit YARN App) and submitting the Flink Job (7. Submit Job) are two separate phases, which makes every submission slow and reduces the resource turnover of the cluster.
The more Flink Jobs run on a large YARN cluster, the more resources are wasted and the greater the cost. And these problems are not unique to YARN; Standalone running directly on other resource scheduling systems suffers in the same way. So Alibaba's real-time computing team, drawing on its production experience with YARN, took the lead in improving Flink's resource utilization model, and then worked with the community to design and implement a general architecture suitable for different resource scheduling systems.
FLIP-6: Deployment and Process Model
FLIP-6 fully documents this refactoring of the deployment architecture; the new modules are shown in Figure 3. Much like the upgrade from the MapReduce-1 architecture to YARN + MapReduce-2, resource scheduling and the scheduling of a Job's compute units (Tasks) are split into two layers, so that two modules, the ResourceManager (RM) and the JobManager (JM), each do one job. This reduces coupling with the underlying resource scheduling system (only a pluggable ResourceManager needs to be implemented per system), lowers logical complexity and the cost of development and maintenance, and lets the JM request resources according to actual Task requirements. It not only solves the low resource utilization of Standalone on YARN/K8s but also helps scale up both clusters and Jobs.
Dispatcher: communicates with the Client, accepts Job submissions, and spawns JobManagers. Its lifecycle can span Jobs.
ResourceManager: integrates with different resource scheduling systems to request and release resources and to manage Containers/TaskManagers. Its lifecycle can likewise span Jobs.
JobManager: one instance per Job, responsible for scheduling and executing that Job's computation.
TaskManager: registers with and reports resources to the RM, receives Tasks from the JM, executes them, and reports their status.

Native Integration of Apache Flink and YARN
On top of this architecture, Flink on YARN implements two different deployment and execution modes, Per-Job and Session (see the user documentation Flink on YARN).
Per-Job
In Per-Job mode, a Flink Job is bound to the lifecycle of its YARN Application (App); the execution flow is shown in Figure 4. When the YARN App is submitted, the files/jars of the Flink Job are distributed through YARN's Distributed Cache, so submission completes in a single step. The JM requests slots from the RM based on the actual resource requirements of the Tasks generated from the JobGraph, and the Flink RM in turn dynamically requests and releases YARN Containers. This (almost) perfectly solves all the earlier problems, combining YARN's isolation with efficient resource utilization.
Session
Is Per-Job perfect, then? No, it still has limitations. When a YARN App is submitted, requesting resources and starting TMs takes a noticeable time (seconds). In scenarios such as interactive analysis with short queries, where the Job's compute time is very short, startup time dominates and seriously hurts the end-to-end user experience; the fast Job submission of Standalone mode is lost. The FLIP-6 architecture, however, solves this easily. As shown in Figure 5, a pre-started YARN App runs a Flink Session (the Master and multiple TMs are already up, and, like Standalone, it can run multiple Jobs); Jobs submitted to it can quickly use the existing resources for computation. The Blink branch implementation differs slightly from Master (in whether TMs are pre-launched); the two will later be merged and unified. Development will also continue on resource elasticity for Sessions, automatically growing and shrinking the number of TMs on demand, something Standalone cannot do.
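The slot-request flow described above can be sketched in a few lines. This is a minimal, hypothetical simulation, not Flink's real API: the JM asks the RM for slots, and the RM either reuses a free slot on a running TM (as a Session does) or asks the pluggable cluster backend to start a new TM (as Per-Job does dynamically). All class and method names here are illustrative.

```python
class ClusterBackend:
    """Pluggable layer (YARN/Kubernetes/Standalone) that starts containers."""
    def start_task_manager(self, slots_per_tm):
        return TaskManager(slots_per_tm)

class TaskManager:
    def __init__(self, num_slots):
        self.free_slots = num_slots

class ResourceManager:
    def __init__(self, backend, slots_per_tm=2):
        self.backend = backend
        self.slots_per_tm = slots_per_tm
        self.task_managers = []

    def request_slot(self):
        # Reuse a free slot on a running TM if one exists (Session-style),
        # otherwise ask the backend for a new TM (Per-Job-style).
        for tm in self.task_managers:
            if tm.free_slots > 0:
                tm.free_slots -= 1
                return tm
        tm = self.backend.start_task_manager(self.slots_per_tm)
        tm.free_slots -= 1
        self.task_managers.append(tm)
        return tm

class JobManager:
    """One instance per Job: turns the Job's Tasks into slot requests."""
    def __init__(self, rm):
        self.rm = rm

    def schedule(self, num_tasks):
        return [self.rm.request_slot() for _ in range(num_tasks)]

rm = ResourceManager(ClusterBackend(), slots_per_tm=2)
jm = JobManager(rm)
jm.schedule(5)
print(len(rm.task_managers))  # 5 tasks / 2 slots per TM -> 3 TaskManagers
```

The point of the sketch is the layering: the JobManager never talks to YARN or K8s directly, only to the ResourceManager, which is the single pluggable piece per cluster system.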
Resource Profile
The above covers the architectural changes; to request resources on demand, a protocol API is also needed. This is the Resource Profile, which describes the CPU and memory usage of a single operator (Operator); the RM then requests Containers from the underlying resource management system to run TMs according to these resource requests. For more information, see Task slots and resources.
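To make the idea concrete, here is a toy version of such a per-operator resource description, showing how the CPU/memory requests of operators sharing a slot could be summed into one container-sized request. This is an illustrative sketch only; the field names and `merge` method are assumptions, not Flink's actual ResourceSpec/ResourceProfile API.

```python
from dataclasses import dataclass

@dataclass
class ResourceProfile:
    cpu_cores: float
    memory_mb: int

    def merge(self, other):
        # Operators chained into the same slot add up their requirements.
        return ResourceProfile(self.cpu_cores + other.cpu_cores,
                               self.memory_mb + other.memory_mb)

# Hypothetical per-operator requests for a source -> map -> sink pipeline.
source = ResourceProfile(cpu_cores=0.5, memory_mb=256)
map_op = ResourceProfile(cpu_cores=1.0, memory_mb=512)
sink = ResourceProfile(cpu_cores=0.5, memory_mb=256)

# The RM would request a container of this size from YARN/K8s.
slot_request = source.merge(map_op).merge(sink)
print(slot_request)  # ResourceProfile(cpu_cores=2.0, memory_mb=1024)
```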
Native Integration of Flink and Kubernetes
In recent years, Kubernetes has developed rapidly and become the native operating system of the cloud era. Can integrating the next-generation big data computing engine, Apache Flink, with it open up a new world for big data computing?
Apache Flink Standalone Cluster on Kubernetes
Relying on K8s's strong support for service deployment, a Flink Standalone cluster can be deployed onto a K8s cluster easily via simple K8s Deployment & Service resources or the Flink Helm chart. However, just like Standalone on YARN, it suffers from low resource utilization, so "native integration" is still needed.
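For reference, the "Deployment & Service" approach mentioned above looks roughly like the following. This is a hedged, minimal sketch of just the JobManager half; the image tag, labels, and port numbers are assumptions to adapt to your environment (the official Flink docs and Helm chart carry the complete, maintained manifests, including the TaskManager Deployment).

```yaml
# Minimal sketch: JobManager as a K8s Deployment plus a Service for TMs/clients.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: flink-jobmanager
spec:
  replicas: 1
  selector:
    matchLabels: {app: flink, component: jobmanager}
  template:
    metadata:
      labels: {app: flink, component: jobmanager}
    spec:
      containers:
        - name: jobmanager
          image: flink:latest        # assumed image tag
          args: ["jobmanager"]
          ports:
            - containerPort: 6123    # RPC
            - containerPort: 8081    # Web UI
---
apiVersion: v1
kind: Service
metadata:
  name: flink-jobmanager
spec:
  selector: {app: flink, component: jobmanager}
  ports:
    - {name: rpc, port: 6123}
    - {name: ui, port: 8081}
```

Note that in this mode K8s only keeps the statically sized Standalone cluster alive; it does not size it to the Job, which is exactly the utilization problem native integration addresses.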
Native Integration of Apache Flink and Kubernetes
The "native integration" of Flink and K8s mainly means implementing a Kubernetes ResourceManager on the FLIP-6 architecture to speak Kubernetes' resource scheduling protocol. The current Blink branch implementation architecture is shown in the figure below; work is underway to merge it into the Master branch (see the documentation Flink on K8s).
Summary
Deployment management and resource scheduling are the cornerstones of a big data processing system. Through the abstraction, layering, and refactoring of FLIP-6, Apache Flink has built a solid foundation: it can run "natively" on the major resource scheduling systems (YARN/Kubernetes/Mesos), support larger-scale and higher-concurrency computation, use cluster resources efficiently, and provide a reliable base for continued development.
Optimization and improvement of related features are still ongoing. For example, the difficulty of configuring Resource Profiles deters some developers and seriously hurts Flink's usability; we are exploring auto-configuration and auto-scaling of resources and parallelism to solve such problems. With "serverless" architectures developing rapidly, we look forward to Flink and Kubernetes merging into a powerful cloud-native computing engine (like FaaS) that saves users resources and brings them greater value.
For more information, please visit the Apache Flink Chinese Community website.