What is the execution process of Spark on yarn?

2025-04-10 Update From: SLTechnology News&Howtos > Servers


Shulou(Shulou.com)05/31 Report--

This article introduces the execution process of Spark on YARN. The situations covered come up often in practice, so read carefully and you should come away with a working understanding.

Many companies schedule their workloads through YARN: MapReduce on YARN, Spark on YARN, and even Storm on YARN.

A YARN cluster has two types of nodes:

The ResourceManager is responsible for resource scheduling.

The NodeManager allocates resources on its node and runs application containers.

Jobs are submitted through the spark-submit script. With the yarn-client submission mode, the driver program actually starts on the local machine.

Spark on yarn execution process

The Spark program we write is packaged into a jar and submitted with spark-submit. The command starts a JVM running the jar's main class; that JVM process is our Driver process. Once started, the Driver executes the main function we wrote, beginning with new SparkContext().

The driver starts for us on the client side.

The client goes to the ResourceManager to apply for a container (resources).

The ResourceManager notifies a NodeManager to start the ApplicationMaster in that container.

The ApplicationMaster goes to the ResourceManager to apply for Executors.

The ResourceManager returns the addresses of NodeManagers on which Executors can be started.

The ApplicationMaster goes to those NodeManagers and starts the Executors.

Each Executor process in turn registers back with the Driver.

Finally, once the Driver has received its Executor resources, it can execute our Spark code.
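The startup handshake above can be sketched as a toy message flow. This is a minimal sketch for illustration only: the class names and methods are invented for this example and are not real YARN or Spark APIs.

```python
# Toy model of the yarn-client startup handshake (illustrative names only).

class Executor:
    def __init__(self, address):
        self.address = address

    def register(self, driver):
        # Executors register back with the Driver once they start.
        driver.executors.append(self)

class NodeManager:
    def __init__(self, address):
        self.address = address

    def start_executor(self, driver):
        # Start an Executor in a container and have it register with the Driver.
        executor = Executor(self.address)
        executor.register(driver)
        return executor

class ResourceManager:
    def __init__(self, node_managers):
        self.node_managers = node_managers

    def allocate_container(self):
        # Return a NodeManager that has room to start a container.
        return self.node_managers[0]

class Driver:
    def __init__(self):
        self.executors = []

# Wire it together: the RM hands out a NodeManager, which starts an
# Executor that registers with the locally running Driver.
rm = ResourceManager([NodeManager("nm-1"), NodeManager("nm-2")])
driver = Driver()
nm = rm.allocate_container()
nm.start_executor(driver)
print(len(driver.executors))  # 1 executor registered with the driver
```

The point of the sketch is the direction of each arrow: the ApplicationMaster/client talks to the ResourceManager, the ResourceManager points at NodeManagers, and the resulting Executors call home to the Driver, not the other way around.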

When an action is executed, a job is triggered.

The DAGScheduler divides the job into stages.

The TaskScheduler divides each stage into tasks.

Tasks are sent to Executors for execution.

The Driver schedules the tasks.
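The job-to-stage-to-task breakdown above can be sketched in a few lines of plain Python. The function names and the round-robin placement are simplifications for illustration; they are not Spark internals.

```python
# Toy sketch of action -> job -> stages -> tasks (illustrative, not Spark code).

def split_job_into_stages(num_stages, tasks_per_stage):
    # DAGScheduler-style step: a job becomes an ordered list of stages,
    # each stage holding one task per partition.
    return [[f"stage{s}-task{t}" for t in range(tasks_per_stage)]
            for s in range(num_stages)]

def schedule_stage(tasks, executors):
    # TaskScheduler-style step: send each task of a stage to an executor
    # (round-robin here; real placement also considers data locality).
    assignment = {e: [] for e in executors}
    for i, task in enumerate(tasks):
        assignment[executors[i % len(executors)]].append(task)
    return assignment

stages = split_job_into_stages(num_stages=2, tasks_per_stage=6)
executors = ["exec-0", "exec-1", "exec-2"]
plan = schedule_stage(stages[0], executors)
print([len(plan[e]) for e in executors])  # [2, 2, 2]
```

Stages run one after another: only when every task of stage 0 has finished does the scheduler hand stage 1's tasks to the executors.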

ApplicationMaster

The ApplicationMaster is a core concept in YARN: any job type (MR, Spark) that runs on YARN must have one. Each computing framework that wants to execute its own applications on YARN must implement and provide an ApplicationMaster conforming to the interface YARN defines; Spark's ApplicationMaster is a class developed by Spark itself.

In Spark's yarn-client mode, application registration (applying for executors) and computing-task scheduling are separated. In standalone mode, the driver is responsible for both.

The ApplicationMaster (ExecutorLauncher) is responsible for applying for executors; the driver is responsible for dividing jobs into stages and for creating, allocating, and scheduling tasks.

What kind of problems can arise in yarn-client mode?

Because the driver starts on the local machine and is fully responsible for scheduling all tasks, it has to communicate frequently with the many executors running on the YARN cluster: task launch messages, task execution statistics, task running status, and shuffle output results.

Imagine you have 100 executors, 10 stages, and 1000 tasks per stage. Each stage submits 1000 tasks to the executors, an average of 10 tasks per executor. The driver then communicates frequently with the 1000 tasks running on the executors; the messages are many and the frequency is high. After one stage finishes, the next stage runs, and the frequent communication continues.
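The arithmetic in that example can be spelled out. The figure of four driver-to-task messages per task is an assumption made for this sketch (launch, status, statistics, result), not a number from Spark's source.

```python
# The communication arithmetic from the example above.
executors = 100
stages = 10
tasks_per_stage = 1000

tasks_per_executor_per_stage = tasks_per_stage // executors
total_tasks = stages * tasks_per_stage

print(tasks_per_executor_per_stage)  # 10 tasks per executor in each stage
print(total_tasks)                   # 10000 tasks over the job's lifetime

# Assumption: roughly 4 driver<->task messages per task
# (launch, status update, execution statistics, result).
messages = total_tasks * 4
print(messages)                      # 40000 messages through the local machine
```

Whatever the exact per-task message count, every one of those messages passes through the single machine running the driver, which is why the local network card becomes the bottleneck.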

Throughout the life cycle of the Spark job, communication and scheduling happen constantly, and all of it is sent and received from your local machine. This is the deadly part: for the whole period the job runs (say 30 minutes), the local machine is doing heavy network communication, its network traffic load is very high, and its network card traffic surges.

A surge in network card traffic on your local machine is not a good thing. In large companies, every machine's usage is monitored, and a single machine is not allowed to consume large amounts of network bandwidth; the operations team won't permit it. It may also affect the company's network, and if your machine is a virtual machine sharing a network card with other machines, the surge can have adverse effects on those machines and on the company's entire network environment.

The solution:

The solution is simple: be clear about the circumstances under which yarn-client mode should be used.

yarn-client mode is usually used only in a test environment: you write a Spark job, package it into a jar, and submit it from a test machine in yarn-client mode. Testing is occasional, so you will not be continuously submitting large numbers of Spark jobs for long periods. yarn-client submission also has the benefit that detailed, comprehensive logs can be observed on the local machine; by reading the logs you can troubleshoot reported errors, observe performance, and tune the job.

Once you go live, in the production environment, you should use yarn-cluster mode to submit your Spark jobs.

In yarn-cluster mode, any network card traffic surge has nothing to do with your local machine. If there is a problem, it is between the YARN operations team and the infrastructure operations team: whether each machine in the YARN cluster is a virtual or a physical machine, whether the traffic surge affects anything else, whether network bandwidth should be added to the YARN cluster, and so on. Those are matters between the two teams, not yours.

After switching to yarn-cluster mode

It is no longer your local machine that runs the Driver and does task scheduling; instead, a node inside the YARN cluster runs the driver process and is responsible for task scheduling.

This is the end of "What is the execution process of Spark on yarn". Thank you for reading.
