How should we analyze Spark's Failover mechanism? This article walks through the analysis in detail, in the hope of helping readers who want to understand the problem find a simple and practical answer.
So-called fault tolerance means that when part of the system fails, the system can continue to provide service, and its performance is not seriously degraded or brought down by minor errors.
Machine failures and network problems are normal in a cluster; once a cluster reaches a large scale, it becomes very likely that machines will frequently fail and be unable to provide service, so a distributed cluster must be designed for fault tolerance.
Spark took this into account from the beginning of its design, so it achieves a high degree of fault tolerance. The following describes this from the perspective of exception handling for the Executor, the Worker and the Master.
Executor exception
Spark supports a variety of running modes. In these modes, the cluster manager allocates running resources for the application and starts Executors within those resources; the Executors are responsible for executing tasks and finally reporting each task's running status back to the Driver.
The following analyzes Executor exceptions in stand-alone (independent deployment) mode. The running structure is shown in the figure, where the dashed lines are the message paths during normal operation and the solid lines are the exception-handling steps.
1. First, look at how an Executor is started: after the Master allocates running resources to the application in the cluster, the Worker starts an ExecutorRunner, and the ExecutorRunner starts the CoarseGrainedExecutorBackend process according to the current running mode. This process registers the Executor with the Driver; if registration succeeds, CoarseGrainedExecutorBackend starts an Executor internally. The Executor is managed by the ExecutorRunner. When an exception occurs in the Executor (for example, the containing CoarseGrainedExecutorBackend process exits abnormally), the ExecutorRunner catches the exception and sends an ExecutorStateChanged message to the Worker.
2. When the Worker receives the ExecutorStateChanged message, its handleExecutorStateChanged method updates its local Executor state information and forwards the Executor state to the Master.
3. When the Master receives the Executor state-change message and finds that the Executor exited abnormally, it calls the Master.schedule method to try to obtain an available Worker node and start the Executor again; this Worker is not necessarily the node that ran the Executor before the failure. Up to 10 attempts are made; if the failures exceed 10, the application is marked as failed and removed from the cluster. This limit on the number of failures prevents an application that keeps failing because of a bug from being resubmitted over and over and crowding out the cluster's valuable resources. A simplified sketch of this flow is given below.
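To make the flow above concrete, here is a minimal, self-contained Scala sketch of the ExecutorStateChanged handling and the retry limit. The names used here (ExecutorStateChanged, ApplicationInfo, MaxExecutorRetries, handleExecutorStateChanged) are simplified stand-ins for illustration, not Spark's actual internal classes.
// Minimal, self-contained sketch of the Executor failover flow described above.
// Names and types are simplified stand-ins, not Spark's actual classes.
object ExecutorFailoverSketch {

  sealed trait ExecutorState
  case object Running extends ExecutorState
  case object Exited  extends ExecutorState          // abnormal exit

  // Message an ExecutorRunner-like component sends to its Worker,
  // which the Worker then forwards to the Master.
  final case class ExecutorStateChanged(appId: String, execId: Int, state: ExecutorState)

  final case class ApplicationInfo(id: String, var retryCount: Int = 0)

  // Upper bound on failed Executor launches before the application
  // itself is marked as failed (10 in the text above).
  val MaxExecutorRetries = 10

  // Master-side handling: on an abnormal exit, either try to reschedule
  // the Executor on some available Worker, or give up on the application.
  def handleExecutorStateChanged(app: ApplicationInfo, msg: ExecutorStateChanged): String =
    msg.state match {
      case Exited if app.retryCount < MaxExecutorRetries =>
        app.retryCount += 1
        s"reschedule executor ${msg.execId} (attempt ${app.retryCount})"
      case Exited =>
        s"remove application ${app.id}: too many executor failures"
      case Running =>
        "no action needed"
    }

  def main(args: Array[String]): Unit = {
    val app = ApplicationInfo("app-001")
    // Simulate repeated abnormal exits of executor 0; the 11th attempt gives up.
    (1 to 11).foreach { _ =>
      println(handleExecutorStateChanged(app, ExecutorStateChanged(app.id, 0, Exited)))
    }
  }
}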
Worker exception
In stand-alone mode, Spark adopts a Master/Slave structure, in which the Slave role is played by the Worker. At run time the Worker sends heartbeats to the Master so that the Master knows the Worker's real-time status; the Master in turn checks whether registered Workers have timed out, because during cluster operation a Worker process may exit abnormally due to machine downtime or the process being killed. The following analyzes how a Spark cluster handles this situation, as shown in the figure.
1. First, how does the Master detect a Worker timeout? While the Master receives Worker heartbeats, it also starts, in its startup method onStart, a scheduled task that detects Worker timeouts. The code is as follows:
checkForWorkerTimeOutTask = forwardMessageThread.scheduleAtFixedRate(new Runnable {
  override def run(): Unit = Utils.tryLogNonFatalError {
    // Send the CheckForWorkerTimeOut message to the Master itself;
    // its handler calls the timeOutDeadWorkers method to detect timed-out Workers.
    self.send(CheckForWorkerTimeOut)
  }
}, 0, WORKER_TIMEOUT_MS, TimeUnit.MILLISECONDS)
2. When a Worker times out, the Master calls the timeOutDeadWorkers method to handle it, processing the Executors and Drivers that were running on that Worker separately (a simplified sketch follows the two cases below).
If it is an Executor, the Master first sends an ExecutorUpdated message to the corresponding Driver for each Executor that was running on the Worker, informing it that the Executor has been lost, and removes these Executors from the application's running list. After that, the Executor exceptions are handled as described in the previous section.
If it is a Driver, the Master checks whether the Driver is configured to be restarted. If so, it calls the Master.schedule method to assign a suitable node and restart the Driver; if no restart is required, the application is removed.
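The following is a minimal Scala sketch of what a timeOutDeadWorkers-style check does for the two cases above: it picks out Workers whose last heartbeat is older than the timeout, reports the lost Executors to the affected Drivers, and relaunches or removes Drivers depending on whether supervision is enabled. All types and field names here are simplified assumptions for illustration; only the overall logic follows the description above.
// Simplified sketch of a timeOutDeadWorkers-style check.
// Types and field names are illustrative, not Spark's actual ones.
object WorkerTimeoutSketch {

  final case class DriverInfo(id: String, supervise: Boolean)
  final case class WorkerInfo(id: String,
                              lastHeartbeat: Long,
                              executors: Seq[Int],
                              drivers: Seq[DriverInfo])

  // Assumed timeout; Spark's spark.worker.timeout defaults to 60 seconds.
  val WorkerTimeoutMs = 60 * 1000L

  def timeOutDeadWorkers(workers: Seq[WorkerInfo],
                         now: Long = System.currentTimeMillis()): Seq[String] =
    workers.filter(w => now - w.lastHeartbeat > WorkerTimeoutMs).flatMap { w =>
      // 1) Tell the affected Drivers that their Executors are lost,
      //    and drop those Executors from the application's running list.
      val executorMsgs =
        w.executors.map(e => s"send ExecutorUpdated(lost executor $e on ${w.id}) to driver")
      // 2) For Drivers on the dead Worker: relaunch if supervision is enabled,
      //    otherwise remove them.
      val driverMsgs = w.drivers.map { d =>
        if (d.supervise) s"relaunch driver ${d.id} on another worker (schedule)"
        else s"remove driver ${d.id}"
      }
      executorMsgs ++ driverMsgs
    }

  def main(args: Array[String]): Unit = {
    val now = System.currentTimeMillis()
    val workers = Seq(
      WorkerInfo("worker-1", now, Seq(0), Nil),                       // healthy
      WorkerInfo("worker-2", now - 120 * 1000L, Seq(1, 2),            // timed out
                 Seq(DriverInfo("drv-1", supervise = true)))
    )
    timeOutDeadWorkers(workers, now).foreach(println)
  }
}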
Master exception
The Master is the core of Spark's stand-alone mode; if the Master fails, the jobs and resources of the whole cluster can no longer be managed and the cluster is left "leaderless". Fortunately, Spark takes this situation into account at design time: while the cluster is running, one or more Standby Masters can be started, and when the active Master fails, one of the Standby Masters takes over according to certain rules. In stand-alone mode, Spark supports the following recovery strategies, which can be set in the configuration file spark-env.sh via the configuration item spark.deploy.recoveryMode; the default is NONE. (A configuration sketch follows the list of strategies below.)
ZOOKEEPER: the cluster's metadata is persisted to ZooKeeper. When the Master fails, ZooKeeper elects a new Master through its election mechanism; after taking over, the new Master obtains the persisted information from ZooKeeper and restores the cluster state from it. The specific structure is shown in figure 4-13.
FILESYSTEM: the cluster's metadata is persisted to the local file system. When the Master fails, restarting the Master on the same machine lets the new Master read the persisted information and restore the cluster state from it.
CUSTOM: a user-defined recovery method. The user implements the StandaloneRecoveryModeFactory abstract class and configures it in the system; when the Master fails, the cluster state is restored in the user-defined way.
NONE: the metadata of the cluster is not persisted. When an exception occurs in Master, the newly launched Master does not restore the cluster state, but takes over the cluster directly.
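For reference, below is an example of how these recovery strategies are typically selected in spark-env.sh, by passing spark.deploy.* properties to the Master through SPARK_DAEMON_JAVA_OPTS. The ZooKeeper addresses and directory paths are placeholders and must be replaced with values from your own environment.
# spark-env.sh -- example only; host names and paths are placeholders.
# ZooKeeper-based recovery:
SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
  -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 \
  -Dspark.deploy.zookeeper.dir=/spark"
# Or file-system-based recovery:
# SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=FILESYSTEM \
#   -Dspark.deploy.recoveryDirectory=/path/to/recovery/dir"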
This is the answer to how to analyze Spark's Failover mechanism; I hope the above content has been of some help.