This article focuses on the HDFS HA feature and on how to use QJM (Quorum Journal Manager) to implement HDFS HA.
I. Background
By default there is only one Namenode in an HDFS cluster, which introduces a single point of failure: if the Namenode fails, the cluster is unavailable until the Namenode is restarted or brought up elsewhere.
Two kinds of events affect overall cluster availability:
1. Unplanned incidents, such as a physical machine crash; the cluster is unavailable until an administrator restarts the Namenode.
2. Planned maintenance, such as software upgrades, which requires shutting down the Namenode and also makes the cluster temporarily unavailable.
The HDFS HA feature solves this problem by running two redundant Namenodes in the same cluster as an active/passive hot-standby pair. When the Active Namenode fails, the cluster can quickly fail over to the other (passive) Namenode, and during planned maintenance an administrator can initiate a graceful failover.
II. Architecture
In a typical HA architecture, two independent machines act as Namenodes. At any time only one Namenode is in the Active state and the other is in the Standby state. The Active Namenode serves client requests, while the Standby node keeps enough cluster state to allow a fast failover.
To keep the Standby node in sync with the Active node, both communicate with a group of independent daemons called JournalNodes (JNs). When the Active node modifies the namespace, it writes a record of the change to a majority of the JNs. The Standby node watches the JNs for changes, reads the edits from them, and applies them to its own namespace. On failover, the Standby ensures it has read all edits from the JNs before promoting itself to Active; in other words, the Standby's namespace is fully synchronized with the Active's before the failover completes.
To support fast failover, the Standby node must also hold up-to-date block location information. To achieve this, the Datanodes are configured with the addresses of both Namenodes, maintain heartbeats with both, and send block reports to both.
It is vital that only one Namenode be Active at any time; otherwise the two Namenodes would diverge into two different namespace states, which can lead to data loss or other incorrect behavior. This situation is commonly called "split-brain" (for example, a network partition causes different Datanodes to see two Active Namenodes). The JournalNodes therefore allow only one Namenode to act as writer at any time; during a failover, the Namenode being promoted to Active takes over sole responsibility for writing to the JNs, which prevents the former Active from continuing to behave as if it were still Active.
III. Hardware resources
To build an HA cluster you need to prepare the following resources:
1. Namenode machines: two machines with equivalent hardware, running the Active and Standby Namenodes respectively.
2. JournalNode machines: the machines that run the JournalNodes. The JournalNode daemon is relatively lightweight and can be colocated with other Hadoop daemons such as the Namenodes, JobTracker, or ResourceManager. However, at least three JournalNodes are required, because every edit must be written successfully to a majority of them. The number of JNs can of course be larger than 3 and is usually odd (3, 5, 7, ...), which gives better fault tolerance. With N JournalNodes the system tolerates up to (N - 1) / 2 JournalNode failures; for example, with 5 JNs up to 2 may fail without affecting operation.
In addition, in an HA cluster the Standby Namenode also performs checkpoints of the namespace (taking over the role of the Secondary Namenode), so there is no need to run a SecondaryNamenode, CheckpointNode, or BackupNode. In fact, running any of these in an HA cluster is not allowed and will produce an error.
IV. Deployment
1) Configuration
Similar to HDFS Federation, the HA configuration is backward compatible: an existing single-Namenode configuration keeps working without modification. The new configuration is designed so that all nodes in the cluster can share the same configuration file, rather than requiring different files on different nodes.
Like HDFS Federation, an HA cluster uses a "nameservice ID" to identify a single HDFS instance (which may in fact consist of multiple HA Namenodes). HA additionally introduces the concept of a "NameNode ID"; each Namenode in the cluster has a distinct ID. So that one configuration file can describe all Namenodes (and remain applicable in Federation setups), the relevant configuration parameters are suffixed with the nameservice ID and/or the Namenode ID.
Modify hdfs-site.xml and add the following configuration parameters; their order does not matter.
1. dfs.nameservices: the logical name of the nameservice. It can be any readable string; if Federation is also used, the other nameservices should be listed as well, separated by ",":
<property>
  <name>dfs.nameservices</name>
  <value>hadoop-ha,hadoop-federation</value>
</property>
2. dfs.ha.namenodes.[nameservice ID]:
<property>
  <name>dfs.ha.namenodes.hadoop-ha</name>
  <value>nn1,nn2</value>
</property>
Where "hadoop-ha" needs to match the nameservice ID configured in 1), here we define that there are two namenode ID under "hadoop-ha".
3. dfs.namenode.rpc-address.[nameservice ID].[namenode ID]:
<property>
  <name>dfs.namenode.rpc-address.hadoop-ha.nn1</name>
  <value>machine1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.hadoop-ha.nn2</name>
  <value>machine2.example.com:8020</value>
</property>
The nameservice ID must match step 1 and the Namenode ID must match step 2. The value is the hostname and RPC port of the corresponding Namenode (the port clients use for RPC communication with the Namenode); it plays the same role as "dfs.namenode.rpc-address" in non-HA mode. Each Namenode ID must be configured separately.
You can also configure "dfs.namenode.servicerpc-address" if needed, in the same format; this is the address that HDFS services (Datanodes, backup nodes, etc.) use to communicate with the Namenode. An example follows.
4. dfs.namenode.http-address.[nameservice ID].[namenode ID]:
<property>
  <name>dfs.namenode.http-address.hadoop-ha.nn1</name>
  <value>machine1.example.com:50070</value>
</property>
<property>
  <name>dfs.namenode.http-address.hadoop-ha.nn2</name>
  <value>machine2.example.com:50070</value>
</property>
The HTTP address of each Namenode's web UI. It plays the same role as "dfs.namenode.http-address" in non-HA mode.
5. dfs.namenode.shared.edits.dir:
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://node1.example.com:8485;node2.example.com:8485;node3.example.com:8485/hadoop-ha</value>
</property>
This is the URI of the JournalNode group from which the Namenodes read and write edits. It is a shared storage area: the Active Namenode writes to it and the Standby node reads from it. Each nameservice must be configured with enough JournalNode addresses (at least 3, so that a majority can be formed), in the format "qjournal://host1:port1;host2:port2;host3:port3/journalId".
The journalId should match the nameservice ID used in the configuration above.
<property>
  <name>dfs.journalnode.rpc-address</name>
  <value>0.0.0.0:8485</value>
</property>
<property>
  <name>dfs.journalnode.http-address</name>
  <value>0.0.0.0:8480</value>
</property>
In addition, the configuration above must also be present on the JournalNode machines themselves.
6. dfs.client.failover.proxy.provider.[nameservice ID]: the Java class that HDFS clients use to determine which Namenode is currently Active:
<property>
  <name>dfs.client.failover.proxy.provider.hadoop-ha</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
7. dfs.ha.fencing.methods: a list of scripts or Java classes used to fence (isolate) the Active Namenode during failover.
Although the JournalNodes ensure that only one Namenode can write edits at any time (which protects the consistency of the edit log), the previous Active Namenode may still be alive during a failover and may still serve stale reads to clients that hold connections to it. With this setting we can specify a shell script or Java program that, for example, logs in to the Active Namenode over SSH and kills the Namenode process. There are two built-in options (see the official documentation for details):
1) sshfence: SSH to the Active Namenode and kill the process. This requires that the local machine can log in to the remote host via passwordless SSH (an authorized RSA key); an example configuration is shown after this list.
2) shell: run an arbitrary shell command to fence the Active Namenode.
<property>
  <name>dfs.ha.fencing.methods</name>
  <value>shell(/path/to/my/script.sh arg1 arg2 ...)</value>
</property>
Between "()" is the path to the shell script, as well as a list of parameters.
8. fs.defaultFS (in core-site.xml):
In non-HA mode this parameter holds the address of the single Namenode, e.g. "hdfs://namenode:8020"; under the HA architecture the nameservice name is used instead:
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://hadoop-ha</value>
</property>
9. dfs.journalnode.edits.dir:
Specifies the local path where each JournalNode stores its edits files; an example is shown below.
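This snippet is illustrative; the directory path is an arbitrary example, not a required location.
<property>
  <name>dfs.journalnode.edits.dir</name>
  <value>/path/to/journal/node/local/data</value>
</property>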
Finally, this configuration must be deployed on both the server side and the client side so that HA and failover work as intended.
2) Deployment
After the configuration above is in place, the JournalNode daemons can be started. The default "sbin/start-dfs.sh" script starts the JournalNodes on the appropriate machines according to the "dfs.namenode.shared.edits.dir" configuration; alternatively, you can start one on each machine with "sbin/hadoop-daemon.sh start journalnode" (or "hdfs --daemon start journalnode" on newer releases).
Once the JournalNodes have started successfully, the on-disk metadata of the two Namenodes must be synchronized:
1. If you are setting up a fresh HDFS cluster, first run "hdfs namenode -format" on one of the Namenodes.
2. If a Namenode has already been formatted, or you are converting a non-HA cluster to HA, run "hdfs namenode -bootstrapStandby" on the unformatted Namenode; this copies the metadata (the contents of dfs.namenode.name.dir) from the formatted Namenode. Before running it, make sure the JournalNodes contain enough edits.
3. If you are converting a non-HA Namenode to HA, first run "hdfs namenode -initializeSharedEdits", which initializes the JournalNodes with the edits data from the local Namenode's directories.
After that, you can start both HA Namenodes. You can check whether each Namenode is Active or Standby by visiting its configured HTTP address (dfs.namenode.http-address). A consolidated command sketch follows.
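The following is a minimal sketch of the whole sequence for a fresh two-Namenode cluster with manual failover, assuming Hadoop 2.x scripts and the nn1/nn2 IDs and hostnames used above; formatting nn1 first is an arbitrary choice.
# On every JournalNode machine: start the JournalNode daemon
sbin/hadoop-daemon.sh start journalnode

# On nn1 (machine1.example.com): format HDFS and start the first Namenode
bin/hdfs namenode -format
sbin/hadoop-daemon.sh start namenode

# On nn2 (machine2.example.com): copy nn1's metadata, then start the second Namenode
bin/hdfs namenode -bootstrapStandby
sbin/hadoop-daemon.sh start namenode

# Both Namenodes start in Standby; with manual failover, promote one explicitly
bin/hdfs haadmin -transitionToActive nn1

# Verify the resulting states
bin/hdfs haadmin -getServiceState nn1   # expect "active"
bin/hdfs haadmin -getServiceState nn2   # expect "standby"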
3) Administrator commands
After the HA cluster has started, it can be managed with the "bin/hdfs haadmin" command, which supports the following options:
1. -transitionToActive and -transitionToStandby: transition the Namenode with the given ID to Active or Standby. These commands do not trigger any fencing method, so they are rarely used; "hdfs haadmin -failover" is the usual way to switch Namenode state.
2. -failover [--forcefence] [--forceactive]: initiate a failover from the first given Namenode to the second. If the first Namenode is in Standby, the second is simply promoted to Active. If the first is Active, it is gracefully transitioned to Standby; if that fails, the configured fencing methods are tried in order until one succeeds, and only then is the second Namenode promoted to Active. If no fencing method succeeds, the second Namenode is not promoted.
For example: "hdfs haadmin -failover nn1 nn2"
3. -getServiceState: query whether the given Namenode is Active or Standby. It connects to the specified Namenode, obtains its current state, and prints "standby" or "active". This command can be used from a cron job to monitor the state of each Namenode (see the sketch after this list).
4. -checkHealth: check the health of the specified Namenode.
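As a small illustration of these commands, the following sketch (suitable for a cron job, as mentioned above) reports the state and health of the nn1/nn2 Namenodes from the earlier example:
#!/bin/sh
# Report the HA state and health of each Namenode
for nn in nn1 nn2; do
    state=$(hdfs haadmin -getServiceState "$nn")          # prints "active" or "standby"
    echo "$(date) $nn is $state"
    hdfs haadmin -checkHealth "$nn" || echo "$(date) $nn failed the health check"
done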
4) Automatic failover
The steps above only configure manual failover: in that mode the system never triggers a failover on its own, i.e. it will not promote the Standby to Active even if the Active has failed. The following sections describe how to set up automatic failover.
1) Components
Automatic failover adds two new components: a ZooKeeper cluster and the ZKFailoverController process (ZKFC for short).
ZooKeeper is a highly available coordination service that stores small amounts of coordination data, notifies clients when that data changes, and monitors clients for failure. The automatic failover implementation relies on the following ZooKeeper features:
1. Failure detection: each Namenode machine maintains a persistent session with ZooKeeper. If the machine fails, the session expires and ZooKeeper notifies the other Namenode that a failover should be triggered.
2. Active Namenode election: ZooKeeper provides a simple mechanism for electing the Active node. If the current Active fails, the Standby can acquire a special exclusive lock, and the node that holds the lock becomes the new Active.
ZKFailoverController (ZKFC) is a ZooKeeper client that monitors and manages the state of the Namenode. A ZKFC process runs on each Namenode machine, with the following responsibilities:
1. Health monitoring: ZKFC periodically pings the local Namenode with a health-check command, and the Namenode reports its health in response. If the Namenode crashes, becomes unhealthy, or stops responding, ZKFC marks it as unhealthy.
2. ZooKeeper session management: while the local Namenode is healthy, ZKFC keeps a session open with ZooKeeper. If the local Namenode is Active, ZKFC also holds an "exclusive lock" znode; this lock is an ephemeral znode, so if the session expires, the lock znode is automatically deleted (standard ZooKeeper behavior).
3. ZooKeeper-based election: if the local Namenode is healthy and ZKFC sees that no other Namenode currently holds the lock (for example, because the previous Active's session expired and the lock was released), it tries to acquire the lock. If it succeeds, it is responsible for the failover: it first invokes the fencing method against the previous Active and then transitions the local Namenode to Active.
2) Configuration
For automatic failover, an important configuration item must be added to hdfs-site.xml:
<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>
In addition, the following configuration needs to be added to core-site.xml:
<property>
  <name>ha.zookeeper.quorum</name>
  <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
</property>
This assumes the ZooKeeper cluster is already up and running; choose as stable a ZooKeeper cluster as possible.
"dfs.ha.automatic-failover.enabled" can also be set per nameservice, as dfs.ha.automatic-failover.enabled.[nameservice ID]. In addition, ZooKeeper client parameters such as the session timeout can be configured in core-site.xml; those settings use the "ha.zookeeper." prefix, while settings prefixed with "dfs.ha." control the behavior of the fencing methods. An illustrative example follows.
3) Initialize the HA state in ZooKeeper
After the preparation above, the HA state must be initialized in ZooKeeper by running "hdfs zkfc -formatZK"; this command creates a znode in ZooKeeper that holds the automatic failover data.
4) Start the cluster
The convenient "start-dfs.sh" script starts all the daemons HDFS needs, including ZKFC. You can also start the ZKFC daemon manually with "hadoop-daemon.sh start zkfc".
5) Verify failover
Once the automatic failover cluster is running, verify that failover behaves as expected. First, check the state of each Namenode, either with the getServiceState command or on the Namenode web UI, and confirm that one is Active and the other Standby. Then kill the Active Namenode manually (for example with kill -9) and, after confirming the Active node is down, check that the former Standby has been promoted to Active. Because failure detection requires the ZooKeeper session timeout to elapse (configurable via ha.zookeeper.session-timeout.ms, 5 seconds by default), the failover may lag by a few seconds.
If failover does not happen as expected, check that the configuration files are correct and that the ZooKeeper service is healthy. You can also retry using the haadmin commands described above. A verification sketch follows.
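A minimal sketch of this check, assuming the nn1/nn2 IDs from the earlier configuration and that nn1 is currently Active, might look like:
# Confirm the initial states
hdfs haadmin -getServiceState nn1    # expect "active"
hdfs haadmin -getServiceState nn2    # expect "standby"

# On nn1's machine, simulate a crash of the Active Namenode.
# Replace <namenode-pid> with the actual process id (e.g. found via jps).
kill -9 <namenode-pid>

# Wait for the ZooKeeper session timeout to expire, then check again
sleep 10
hdfs haadmin -getServiceState nn2    # should now print "active"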
5) FAQ
1. Is the startup order of the ZKFC and Namenode daemons important?
No. On a given machine you can start ZKFC before or after its Namenode; ZKFC only tracks the liveness of the Namenode. However, if ZKFC is not running, that Namenode cannot participate in automatic failover.
2. Do you need additional monitoring?
Yes, you should add monitoring on each Namenode machine to check that ZKFC is running. In some cases a ZooKeeper failure can cause ZKFC to exit unexpectedly, and it must be restarted promptly so that automatic failover remains available. You should also monitor the health of the ZooKeeper cluster itself; if the ZooKeeper cluster fails, the HA cluster will not be able to fail over.
3. What happens if ZooKeeper fails?
If the ZooKeeper cluster fails, automatic failover will not be triggered even if a Namenode fails, because ZKFC can no longer function. However, as long as the Namenodes are running (even if one of them fails), the HDFS service itself is not affected, because HDFS clients do not depend on ZooKeeper. The ZooKeeper cluster should nonetheless be restored as soon as possible, so that a failure of the current Active can still be handled.
4. Can priority be specified between Namenodes?
No, this is not supported. The Namenode that starts first becomes Active, so the only way to approximate a "priority" is to control the order in which the Namenodes are started.
5. How do I perform a manual failover when automatic failover is configured?
Just as in the manual-failover setup, you can always initiate a manual failover with "hdfs haadmin -failover".