
What is the principle of NameNode HA?


Many beginners are not clear about the principle of NameNode HA. To help you solve this problem, the editor explains it in detail below; anyone who needs it can learn from it, and I hope you will gain something.

Detailed explanation of the NameNode HA principle

The community Hadoop 2.2.0 release introduced support for NameNode HA; this article describes the design and implementation of NameNode HA in detail.

Why NameNode HA?

1. HA stands for High Availability: the goal is to keep the NameNode continuously available.

2. The NameNode is critical: if it goes down, HDFS stops serving, data can be neither read nor written, and computation built on it (MapReduce, Hive, etc.) cannot complete.

How is NameNode HA implemented, and what are the key technical problems?

1. How do we keep the state of the active and standby NameNodes synchronized, and how can the Standby provide service quickly once the Active goes down? Starting a NameNode is time-consuming: it must load the fsimage and editlog (to obtain the file-to-block mapping) and process the first block report from every DataNode (to obtain the block-to-DataNode mapping). So the states of the two NNs must be kept in sync at all times.

2. Split-brain: in a high-availability (HA) system, when the connection between the two nodes is lost, the system splits into two independent nodes that both compete for shared resources, causing system confusion and data corruption.

3. The NameNode switch must be transparent to the outside world: when the active NameNode moves to another machine, connected clients must not fail. This covers the connections from both clients and DataNodes to the NameNode.

What problems does the community NN HA architecture solve, and what are the implementation principles and mechanisms of each part?

1. In the non-HA architecture, an HDFS cluster has only one NN; the DNs report to that single NN, and the NN's editlog is stored in a local directory.

2. The architecture of community NN HA

Figure 1: NN HA architecture (copied from the community)

The community NN HA consists of two NNs (active and standby), plus ZKFC, ZK, and a shared editlog. Process: after the cluster starts, one NN is in the active state; it provides service, processes requests from clients and DataNodes, and writes the editlog both locally and to the shared editlog (which can be NFS, QJM, etc.). The other NN is in the standby state: it loads the fsimage at startup and then periodically fetches the editlog from the shared storage to keep its state in sync with the active. To let the standby provide service quickly after the active goes down, the DNs report to both NNs at the same time, so the standby also holds the block-to-DataNode mapping; this matters because processing all the DataNodes' block reports is the most time-consuming task in NN startup. To achieve hot standby, a FailoverController and ZK are added: the FailoverController communicates with ZK, the master is elected through ZK, and the FailoverController converts its NN to active or standby via RPC.
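Because clients address the cluster through a logical nameservice rather than a physical NN host, the switch is transparent to them. A minimal client-side sketch, assuming a nameservice called "mycluster" with NameNodes "nn1" and "nn2" (the nameservice name and hostnames here are placeholders, not from the article):

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HaClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Clients name a logical nameservice, never a physical NN host.
        conf.set("fs.defaultFS", "hdfs://mycluster");
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "host1:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "host2:8020");
        // Proxy provider that retries the other NN when one is standby or down.
        conf.set("dfs.client.failover.proxy.provider.mycluster",
            "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
        FileSystem fs = FileSystem.get(URI.create("hdfs://mycluster"), conf);
        System.out.println(fs.exists(new Path("/")));
    }
}
```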

2. Key questions:

(1) Keeping the NN states synchronized: the standby periodically obtains the editlog, and the DNs send block reports to the standby as well as to the active.

(2) Preventing split-brain:

Fencing of the shared storage, to ensure that only one NN can write to it successfully. QJM is used to implement this fencing; the principle is described below.

Fencing of the DataNodes, to ensure that only one NN can command the DNs. How the DNs implement fencing is described in detail in HDFS-1972; the steps are as follows (a toy sketch follows this list).

(a) Whenever an NN changes state, it sends its own state and a serial number to the DNs.

(b) Each DN keeps track of this serial number during operation. On failover, the new NN returns the active state together with a larger serial number in its heartbeat reply, and on receiving that reply the DN treats this NN as the new active.

(c) If the original active comes back (for example after a long GC pause) and its heartbeat reply to the DN carries the active state with the old serial number, the DN rejects that NN's commands.

(d) Note that the scheme above is not perfect: some hidden dangers that could lead to blocks being deleted by mistake were also addressed in HDFS-1972. After a failover, the new active should not delete any block until the DNs have sent all of their deletion reports.
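A toy illustration of the serial-number rule on the DN side (the class and method names are invented for this sketch; the real logic lives in the DataNode's heartbeat handling):

```java
// Toy DN-side fencing: the DN obeys only the NN that presents the
// highest serial number it has seen so far.
public class DatanodeFencingSketch {
    private long highestSerialSeen = -1;

    /** Called when a heartbeat reply claims the active state. */
    public synchronized boolean acceptActiveClaim(String nnId, long serial) {
        if (serial >= highestSerialSeen) {
            highestSerialSeen = serial; // new (or current) active: obey it
            return true;
        }
        return false; // stale active, e.g. back from a long GC pause: reject
    }

    public static void main(String[] args) {
        DatanodeFencingSketch dn = new DatanodeFencingSketch();
        System.out.println(dn.acceptActiveClaim("nn1", 1)); // true: first active
        System.out.println(dn.acceptActiveClaim("nn2", 2)); // true: failover to nn2
        System.out.println(dn.acceptActiveClaim("nn1", 1)); // false: old NN fenced off
    }
}
```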

Client fencing, to ensure that only one NN can respond to client requests: a client that reaches the standby NN simply fails. This is wrapped in the RPC layer: the client connects to the NNs through a FailoverProxyProvider that retries, so after several failed attempts against one NN it switches to the other. The effect visible to the client is added latency while retrying; the number of retries and the retry intervals are configurable on the client.
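The retry behavior is tunable on the client side. A hedged sketch of the usual knobs (these key names are the standard HDFS client failover settings, but check them against your Hadoop version):

```java
import org.apache.hadoop.conf.Configuration;

public class FailoverRetryTuning {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // How many failover attempts before the client gives up.
        conf.setInt("dfs.client.failover.max.attempts", 15);
        // Exponential backoff between attempts, bounded above, in milliseconds.
        conf.setLong("dfs.client.failover.sleep.base.millis", 500);
        conf.setLong("dfs.client.failover.sleep.max.millis", 15000);
        System.out.println(conf.get("dfs.client.failover.max.attempts"));
    }
}
```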

The design of ZKFC

1. The FailoverController implements the following functions (a leader-election sketch follows this list):

(a) Monitoring the health status of the NN.

(b) Sending regular heartbeats to ZK so that it can be elected master.

(c) When it wins the ZK election, the FailoverController on the active side converts the corresponding NN to active via an RPC call.
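The election rests on a ZooKeeper ephemeral node: whichever ZKFC creates the lock znode wins, and if its session (kept alive by those heartbeats) lapses, the znode disappears and the other ZKFC can take over. A minimal sketch with the raw ZooKeeper client (the znode path, ensemble address, and timeout are placeholders; the real ZKFC wraps this primitive in its ActiveStandbyElector):

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ElectionSketch {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zk-host:2181", 5000, event -> { });
        String lockPath = "/ha-demo/active-lock"; // parent znode assumed to exist
        try {
            // Ephemeral: vanishes automatically when our session dies,
            // which is what lets the standby take over after a crash.
            zk.create(lockPath, "nn1".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            System.out.println("won election: transition local NN to active");
        } catch (KeeperException.NodeExistsException e) {
            System.out.println("lost election: stay standby and watch the lock");
            zk.exists(lockPath, true); // watch for deletion, then re-contest
        }
    }
}
```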

2. Why run it as a separate daemon process instead of inside the NN?

(1) To prevent the ZK heartbeat from being disrupted by GC pauses in the NN.

(2) To separate the FailoverController code from the application, improving fault tolerance.

(3) To make the active/standby election a pluggable component.

Figure 2: FailoverController architecture (copied from the community)

3. The FailoverController mainly consists of three components:

(1) HealthMonitor: monitors whether the NameNode is unavailable or unhealthy. This is currently done by calling the corresponding method on the NN via RPC.

(2) ActiveStandbyElector: manages and monitors the NN's state in ZK.

(3) ZKFailoverController: subscribes to the events of HealthMonitor and ActiveStandbyElector and manages the NameNode's state.

The design of QJM

The NameNode records HDFS metadata such as directories and files. Every time a client adds, deletes, or modifies a file, the NameNode records a log entry called the editlog, while the metadata itself is stored in the fsimage. To keep the standby's state consistent with the active's, the standby must obtain each editlog entry as close to real time as possible and apply it to its FsImage. This calls for a shared storage: the active writes the editlog to it, and the standby reads the log from it in near real time. There are two key requirements: the shared storage must itself be highly available, and it must prevent the data corruption that would result from both NameNodes writing to it at the same time.

What is it? The Quorum Journal Manager (QJM), based on Paxos (a consistency algorithm built on message passing). The algorithm is hard to follow in full; put simply, Paxos solves how to reach agreement on a value in a distributed environment. (A typical scenario: in a distributed database system, if every node starts from the same initial state and executes the same sequence of operations, they all end up in a consistent state. To guarantee that every node executes the same command sequence, a "consistency algorithm" is run for each instruction so that the instructions seen by every node are consistent.)
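The practical consequence of the quorum idea is simple arithmetic: with 2N+1 JNs, a write counts as committed once a strict majority (N+1) acknowledge it, so any two committed quorums overlap in at least one JN. A tiny sketch of that counting (names invented for illustration):

```java
public class QuorumSketch {
    /** With `total` journal nodes, a strict majority must acknowledge. */
    static boolean isCommitted(int acks, int total) {
        return acks > total / 2; // e.g. 2 of 3, or 3 of 5
    }

    public static void main(String[] args) {
        System.out.println(isCommitted(2, 3)); // true
        System.out.println(isCommitted(1, 3)); // false
        System.out.println(isCommitted(3, 5)); // true
    }
}
```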

Figure 3: QJM architecture

How it is implemented

(1) After initialization, the Active writes each editlog entry to the 2N+1 JNs, and each entry carries a number. A write is considered successful as long as a majority of the JNs return success (that is, at least N+1 of them).

(2) The Standby periodically reads batches of editlog entries from the JNs and applies them to the FsImage in memory.

(3) How fencing is done: every time a NameNode writes the editlog, it passes a number, the Epoch, to the JN. The JN compares it with the Epoch it has saved: if the incoming Epoch is larger than or equal to its own, the write is allowed and the JN updates its Epoch to the newer value; otherwise it rejects the operation. During a switch, when the Standby becomes Active it increments the Epoch by 1, so even if the previous NameNode tries to write logs to the JNs, it fails.
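A toy version of the JN-side check (class and method names invented; the real QJM splits this into a separate new-epoch promise plus per-write validation):

```java
// Toy JournalNode epoch fencing: writes carrying a stale epoch are
// rejected, so a deposed active can no longer append to the log.
public class JournalNodeEpochSketch {
    private long promisedEpoch = 0;

    public synchronized boolean write(long writerEpoch, long txid) {
        if (writerEpoch < promisedEpoch) {
            return false; // fenced: an NN with a newer epoch has taken over
        }
        promisedEpoch = writerEpoch; // remember the newest epoch seen
        // ... append the entry for `txid` to the local editlog here ...
        return true;
    }

    public static void main(String[] args) {
        JournalNodeEpochSketch jn = new JournalNodeEpochSketch();
        System.out.println(jn.write(1, 100)); // true: original active
        System.out.println(jn.write(2, 101)); // true: new active, epoch + 1
        System.out.println(jn.write(1, 102)); // false: old active is fenced
    }
}
```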

(4) Writing the log:

(a) The NN writes the editlog asynchronously to the JNs via RPC; the write is considered successful once a majority of them succeed.

(b) A JN that fails a write is not written to again until the log-roll operation is called; if the JN has recovered by then, logging to it resumes.

(c) Each editlog entry has a number, the txid. The NN writes log entries with consecutive txids; when a JN receives a write, it checks that the txid is consecutive with the previous one, and otherwise the write fails.
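A toy continuity check (names invented; on a real JN this validation is part of handling the journal RPC):

```java
// Toy txid continuity check: a gap means this JN missed entries
// (e.g. while it was down) and must refuse writes until the next
// log roll resynchronizes it.
public class TxidContinuitySketch {
    private long lastWrittenTxId = 99;

    public synchronized boolean journal(long txid) {
        if (txid != lastWrittenTxId + 1) {
            return false; // gap detected: fail the write
        }
        lastWrittenTxId = txid;
        return true;
    }

    public static void main(String[] args) {
        TxidContinuitySketch jn = new TxidContinuitySketch();
        System.out.println(jn.journal(100)); // true: consecutive
        System.out.println(jn.journal(102)); // false: txid 101 is missing
    }
}
```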

(5) Reading the log:

(a) Periodically iterate over all the JNs to fetch the editlog entries that have not yet been applied, sorted by txid.

(b) Apply the editlog entries in txid order.
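A sketch of that tailing loop (all names invented): collect entries from every reachable JN, order them by txid, and apply each txid exactly once:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class StandbyTailerSketch {
    public static void main(String[] args) {
        long lastApplied = 100;
        // txids as fetched from several JNs; duplicates and disorder possible.
        List<Long> fetched = new ArrayList<>(List.of(101L, 103L, 102L, 102L));
        Collections.sort(fetched); // order by txid
        for (long txid : fetched) {
            if (txid == lastApplied + 1) { // apply in order, skipping duplicates
                lastApplied = txid;
                System.out.println("applied txid " + txid);
            }
        }
    }
}
```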

(6) The log recovery mechanism during a switch:

(a) Triggered when a master-slave switch occurs.

(b) prepareRecovery: the standby sends an RPC request to the JNs to obtain their txid information and selects the best JN.

(c) acceptRecovery: the standby sends an RPC to the JNs, and the JNs synchronize the editlog segment from the chosen best JN.

(d) Finalize the log: a segment is finalized when the current editlog output stream is closed or when the log is rolled.

(e) The standby synchronizes its editlog up to the latest entry.

(7) How to choose the best JN (see the comparator sketch after this list):

(a) A JN with a finalized segment is preferred over one with only an in-progress segment.

(b) If several JNs have finalized segments, check whether their txids are equal.

(c) If none are finalized, compare whose epoch is larger.

(d) If the epochs are the same, choose the one with the larger txid.
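Those rules compose naturally into a comparator. A toy sketch (the segment fields are invented for illustration):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Toy "best JN" choice mirroring rules (a)-(d): finalized beats
// in-progress, then the higher epoch, then the higher txid.
public class BestJnSketch {
    record Segment(boolean finalized, long epoch, long lastTxId) {}

    static final Comparator<Segment> BEST_FIRST =
        Comparator.comparing(Segment::finalized)
                  .thenComparingLong(Segment::epoch)
                  .thenComparingLong(Segment::lastTxId)
                  .reversed();

    public static void main(String[] args) {
        List<Segment> jns = new ArrayList<>(List.of(
            new Segment(false, 3, 120),
            new Segment(true, 2, 110),
            new Segment(true, 2, 115)));
        jns.sort(BEST_FIRST);
        System.out.println(jns.get(0)); // finalized segment with txid 115 wins
    }
}
```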

Did reading the above content help you? If you want to learn more about related knowledge or read more related articles, please follow the industry information channel. Thank you for your support.
