HDFS HA architecture

2025-07-08 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/03 Report--

HA background

Every role in HDFS and YARN runs as a process.

Among the HDFS roles (NN/SNN/DN), the master is the NN.

Among the YARN roles (RM/NM), the master is the RM.

Each of these masters is a single point of failure. If the NN or the RM fails, the service can no longer be offered to clients and the whole cluster becomes unavailable.

Almost every big data component uses a master-slave architecture. For example, all HDFS read and write requests go through the NN first. (HBase read and write requests, by contrast, do not go through the HBase master.)

HDFS consists of NN/SNN/DN. The SNN performs a checkpoint every hour. If the NN dies, the metadata can only be restored to the moment of the last checkpoint, not to the current state. If the SNN were promoted to the same level as the NN, it could take over immediately when the NN dies.

The HDFS HA architecture has two NN nodes: one in the active state and the other in the standby state. The Active NameNode serves clients, for example by processing their RPC requests, while the Standby NameNode does not. The standby only synchronizes the state of the Active NameNode, acting as a real-time backup so that it can take over quickly when the active fails.

HA introduction

HDFS High Availability (HA)

Assume that:

NN1 active ip1

NN2 standby ip2

Suppose hdfs dfs -ls hdfs://ip1:9000/ is written in our code or in a shell script. If NN1 dies and NN2 becomes active, the script still points to ip1, and editing it by hand at that moment is not practical, so the command fails. What should we do about it?

Use a nameservice (a logical name for the namespace) to solve the problem. The nameservice is not a process. For example, let its name be ruozeclusterg7.

You can then write this in the script: hdfs dfs -ls hdfs://ruozeclusterg7/

When the code reaches this line, the client looks in core-site.xml and hdfs-site.xml. In these two configuration files, NN1 and NN2 are registered under the ruozeclusterg7 nameservice. The client first tries to connect to NN1. If NN1 is in the active state, it is used directly; if not, the client tries NN2.
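
The lookup described above can be sketched in a few lines of Python. This is only an illustrative model, not the real HDFS client: the property names follow the hdfs-site.xml naming scheme for HA, while the host names, the port 9000 and the get_state probe function are assumptions made up for the example.

```python
# Illustrative model only: the keys follow the hdfs-site.xml naming scheme,
# while the host names, the port and the get_state probe are assumptions.
CONF = {
    "dfs.nameservices": "ruozeclusterg7",
    "dfs.ha.namenodes.ruozeclusterg7": "nn1,nn2",
    "dfs.namenode.rpc-address.ruozeclusterg7.nn1": "hadoop001:9000",
    "dfs.namenode.rpc-address.ruozeclusterg7.nn2": "hadoop002:9000",
}


def namenode_addresses(conf, nameservice):
    """Return the RPC addresses registered under a nameservice, in config order."""
    nn_ids = conf["dfs.ha.namenodes." + nameservice].split(",")
    return [conf["dfs.namenode.rpc-address.%s.%s" % (nameservice, nn_id.strip())]
            for nn_id in nn_ids]


def find_active(conf, nameservice, get_state):
    """Try each NameNode in turn and return the first one reporting 'active'."""
    for address in namenode_addresses(conf, nameservice):
        try:
            if get_state(address) == "active":
                return address
        except ConnectionError:
            continue  # this NameNode is unreachable, try the next one
    raise RuntimeError("no active NameNode found for " + nameservice)


if __name__ == "__main__":
    # Hypothetical cluster state: NN1 has failed over, NN2 is now active.
    states = {"hadoop001:9000": "standby", "hadoop002:9000": "active"}
    print(find_active(CONF, "ruozeclusterg7", states.get))
```

The point of the design is that scripts only ever mention ruozeclusterg7; which physical machine answers is decided at call time from the configuration.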

HA process (assume we have three machines):

hadoop001: ZK NN ZKFC JN DN

hadoop002: ZK NN ZKFC JN DN

hadoop003: ZK JN DN

The NN keeps two kinds of metadata files, the fsimage and the edit log (a record of operations that modify the namespace). In HA, the shared edit log is managed by a dedicated process, the JN (JournalNode). The JN role is what allows NN1 and NN2 to stay synchronized in real time.

If NN1 dies, NN2 must switch from the standby state to the active state. How does it switch? This requires ZKFC.

ZKFC: a separate process that monitors the health of its NN and sends regular heartbeats to the ZK cluster so that it can take part in the election. When it wins the election in ZK, the ZKFC process switches its NN to the active state through an RPC call. Clients do not notice the switch, so the service continues without interruption.

So, as shown above, ZooKeeper is deployed on all three machines, forming a ZK cluster that is used for elections: it decides which NN is the master (active) and which is standby. The number of ZK nodes in the cluster is 2n+1, so that a vote always yields a majority and a single winner.
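
The arithmetic behind the 2n+1 rule can be checked with a small Python sketch. The function names are made up for illustration; the only assumption is the usual majority rule, under which a decision needs more than half of the nodes.

```python
def quorum_size(nodes):
    """Smallest number of nodes that forms a strict majority."""
    return nodes // 2 + 1


def tolerated_failures(nodes):
    """How many nodes can fail while a majority is still reachable."""
    return nodes - quorum_size(nodes)


if __name__ == "__main__":
    for nodes in (3, 4, 5, 6, 7):
        print(nodes, quorum_size(nodes), tolerated_failures(nodes))
```

An even-sized cluster tolerates no more failures than the odd-sized cluster one node smaller (4 nodes tolerate 1 failure, just like 3), which is why odd counts are the usual choice.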

Production experience with the number of ZooKeeper nodes: if the cluster has 20 nodes, deploy ZK on 5 of them. If it has only seven or eight nodes, still deploy 5 ZK nodes. With 20 to 100 nodes, deploy 7, 9 or 11 ZK nodes. With more than 100, deploy 11. For very large clusters, say tens of thousands of nodes, a few more can be added depending on the situation. More ZK nodes is not always better, however: the more voters there are, the longer it takes to decide who is active and who is standby. A long election delays the service offered to clients, which is unacceptable for real-time services.

Companies that run large clusters, with hundreds or thousands of nodes, dedicate machines to ZK and deploy no other process on them. Here, for learning and with few machines, several processes share one machine. On a cluster with hundreds of nodes and heavy load, other processes on a ZK machine would consume its resources (CPU, memory, file handles, process count) and slow down ZK responses, so ZK is generally deployed on its own machines. On the other hand, if a machine has 256 GB of memory but ZK uses only 32 GB, the rest is wasted, so a machine with 32 GB of memory is enough when buying hardware for ZK.

ZK sits at the bottom of the stack. If ZK is too busy, the standby NN may be unable to switch to the active state, and the cluster may appear hung. So when the cluster hangs and the standby cannot become active, ZK is a likely culprit.

HDFS HA architecture diagram

The official documentation on the HA architecture: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html

Architecture

In a typical HA cluster, two or more separate machines are configured as NameNodes. At any point in time, exactly one of the NameNodes is in an Active state, and the others are in a Standby state. The Active NameNode is responsible for all client operations in the cluster, while the Standbys are simply acting as workers, maintaining enough state to provide a fast failover if necessary.

In order for the Standby node to keep its state synchronized with the Active node, both nodes communicate with a group of separate daemons called "JournalNodes" (JNs). When any namespace modification is performed by the Active node, it durably logs a record of the modification to a majority of these JNs. The Standby node is capable of reading the edits from the JNs, and is constantly watching them for changes to the edit log. As the Standby Node sees the edits, it applies them to its own namespace. In the event of a failover, the Standby will ensure that it has read all of the edits from the JournalNodes before promoting itself to the Active state. This ensures that the namespace state is fully synchronized before a failover occurs.

In order to provide a fast failover, it is also necessary that the Standby node have up-to-date information regarding the location of blocks in the cluster. In order to achieve this, the DataNodes are configured with the location of all NameNodes, and send block location information and heartbeats to all.

It is vital for the correct operation of an HA cluster that only one of the NameNodes be Active at a time. Otherwise, the namespace state would quickly diverge between the two, risking data loss or other incorrect results. In order to ensure this property and prevent the so-called "split-brain scenario," the JournalNodes will only ever allow a single NameNode to be a writer at a time. During a failover, the NameNode which is to become active will simply take over the role of writing to the JournalNodes, which will effectively prevent the other NameNode from continuing in the Active state, allowing the new Active to safely proceed with failover.

In other words:

In a typical HA cluster, NameNodes are configured on two or more independent machines. At any time, exactly one NameNode is active and the others are in the standby state. The active NameNode serves all clients in the cluster, while a standby NameNode only maintains a copy of the state so that a fast failover is possible when necessary.

To keep the Standby node in sync with the Active node, both nodes communicate with a set of independent processes called JournalNodes (JNs). When the namespace is modified on the Active node, it logs the change to a majority of the JNs. The Standby node reads these edits from the JNs and continuously watches the log for changes, applying them to its own namespace. When a failover occurs, the Standby makes sure it has read all edits from the JNs before promoting itself to Active; that is, the namespace held by the Standby is fully synchronized with the Active before the failover takes place.

To support fast failover, the Standby node must also hold up-to-date block locations for the cluster. To achieve this, the DataNodes are configured with the addresses of all NameNodes, and send heartbeats and block location information to each of them.

It is essential that only one NameNode is Active at any time. Otherwise the two NameNodes would hold diverging states, which may lead to data loss or other incorrect results. This situation is known as "split-brain" (for example, when communication between nodes is blocked and different DataNodes in the cluster see two Active NameNodes). To prevent it, the JNs allow only one NameNode to be the writer at a time. During a failover, the original Standby node takes over the role of writing log records to the JNs, which prevents the other NameNode from remaining in the Active state.
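
The single-writer rule can be illustrated with a minimal Python sketch of epoch-based fencing. This is a simplified model under stated assumptions, not the actual JournalNode code: the class, the method names and the in-memory lists are hypothetical, and the only idea borrowed is that each new writer announces a higher epoch number and the JNs reject writes carrying an older one.

```python
class JournalNode:
    """Toy JournalNode: remembers the highest epoch it has promised to."""

    def __init__(self):
        self.promised_epoch = 0
        self.edits = []

    def new_epoch(self, epoch):
        """A NameNode that wants to become the writer announces a new epoch."""
        if epoch <= self.promised_epoch:
            return False
        self.promised_epoch = epoch
        return True

    def journal(self, epoch, edit):
        """Accept an edit only from the writer holding the latest epoch."""
        if epoch < self.promised_epoch:
            return False  # stale writer: fenced off
        self.edits.append(edit)
        return True


def majority(acks):
    """True if more than half of the JournalNodes acknowledged."""
    acks = list(acks)
    return sum(acks) > len(acks) // 2


if __name__ == "__main__":
    jns = [JournalNode() for _ in range(3)]
    majority(jn.new_epoch(1) for jn in jns)            # NN1 becomes the writer
    majority(jn.journal(1, "mkdir /a") for jn in jns)  # NN1 logs an edit
    majority(jn.new_epoch(2) for jn in jns)            # failover: NN2 takes over
    # NN1 still believes it is active, but its write no longer reaches a majority.
    print(majority(jn.journal(1, "mkdir /b") for jn in jns))
```

Because the old writer cannot get a majority of acknowledgements any more, it cannot continue as Active even if it has not yet noticed the failover.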

First deploy three ZK nodes, then two NN nodes, and then three DN nodes. The edit log shared between the two NN nodes is maintained by the JNs, which act as the shared storage.

JournalNode (JN): how many should be deployed? It depends on the HDFS request volume and data volume. With a very large data volume, or a large number of small files and frequent read and write requests, deploy more JournalNodes; if HDFS is lightly loaded, deploy fewer. In practice the count, such as 7 or 9, can roughly follow the ZK deployment described above, depending on the actual situation. (The count is also 2n+1; see the official documentation.)

ZKFC: ZKFailoverController (ZooKeeper Failover Controller)

When a client or application submits a request, it uses the nameservice (the logical namespace name) to find an NN. If the first NN it tries is active, it uses that node. If that NN is standby, it moves on to the other machine.

For example, when the client runs commands that modify the namespace, such as put, mkdir or rm, the active NN records these operations in its own edit log and, at the same time, writes them to the JournalNode cluster. (Read-only commands such as get, ls and cat do not change the namespace and are not recorded in the edit log.)

The standby NN, in turn, reads from the JournalNode cluster in real time and applies the records to itself. The technical term for this in big data is replay: the standby NN re-executes on itself the operations that the active NN recorded.
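
Replay can be pictured with a short Python sketch. It is a toy model under assumptions, not HDFS code: the SharedJournal class stands in for the whole JournalNode cluster, the namespace is reduced to a set of paths, and the operation names "create" and "delete" are invented for the example.

```python
class SharedJournal:
    """Stand-in for the JournalNode cluster: an append-only list of edits."""

    def __init__(self):
        self.entries = []  # each entry is (txid, op, path)

    def append(self, op, path):
        txid = len(self.entries) + 1
        self.entries.append((txid, op, path))
        return txid

    def read_from(self, txid):
        """Return every edit whose transaction id is >= txid."""
        return self.entries[txid - 1:]


class Namespace:
    """Toy NameNode metadata: a set of paths plus the last applied txid."""

    def __init__(self):
        self.paths = set()
        self.last_txid = 0

    def apply(self, txid, op, path):
        if op == "create":
            self.paths.add(path)
        elif op == "delete":
            self.paths.discard(path)
        self.last_txid = txid


def active_write(active, journal, op, path):
    """The active NN logs the edit to the journal and applies it locally."""
    txid = journal.append(op, path)
    active.apply(txid, op, path)


def standby_catch_up(standby, journal):
    """The standby NN replays every edit it has not applied yet."""
    for txid, op, path in journal.read_from(standby.last_txid + 1):
        standby.apply(txid, op, path)


if __name__ == "__main__":
    journal, active, standby = SharedJournal(), Namespace(), Namespace()
    active_write(active, journal, "create", "/data/a.txt")
    active_write(active, journal, "create", "/data/b.txt")
    active_write(active, journal, "delete", "/data/a.txt")
    standby_catch_up(standby, journal)
    print(standby.paths == active.paths, standby.last_txid)
```

Tracking the last applied transaction id is what lets the standby resume from where it stopped instead of replaying the whole log every time.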

JournalNode: a cluster used to synchronize data between the active NN and the standby NN. Each JournalNode is a separate process.

NN and ZKFC are on the same machine.

Description of the whole process: when a client submits a request, whether read or write, it uses the nameservice (the logical namespace name, for example RUOZEG6) to find out which NN is active and submits the request to that machine. The normal HDFS read or write flow follows. For operations that modify the namespace, the active NN writes a record to its own edit log and also writes a copy to the JournalNode cluster. Once the record is there, the standby NN reads it and applies it to itself in real time; the technical term is replay. Meanwhile, every DataNode sends heartbeats and block reports to both NameNodes. (The heartbeat interval, dfs.heartbeat.interval, defaults to 3 seconds. Full block reports are sent far less often and are controlled by dfs.blockreport.intervalMsec, whose default is one hour in older releases and six hours in recent ones. Knowing these parameters is a common interview question.) When the active NN dies, an election takes place through the ZK cluster (which stores the state of the NN nodes), and the ZKFC of the standby NN is notified to switch it to the active state. ZKFC sends heartbeats to ZK regularly.

Ps:

HA is designed to solve the single point of failure problem.

The state, that is, the record of HDFS operations, is shared through the JournalNode cluster.

The active NN is elected through ZKFC and the ZK cluster.

ZKFC monitors status and performs automatic failover.

DN: sends heartbeats and block reports to NN1 and NN2 at the same time.

ACTIVE NN: writes the operation records to its own edit log

Writes a copy to the JN cluster at the same time

Receives heartbeats and block reports from the DNs

STANDBY NN: reads the logs from the JN cluster and replays the logged operations to keep its metadata consistent with the active NN.

Receives heartbeats and block reports from the DNs

JournalNode: used for data synchronization between the active NN and the standby NN, generally deployed as 2n+1 nodes

ZKFC: a separate process

Monitors the health status of the NN

Sends a regular heartbeat to the ZK cluster so that it can be elected

When it is elected as active in ZK, the ZKFC process switches its NN to the active state through an RPC call; only an NN in the active state can serve clients.

The switch is imperceptible to users, so real-time service to the outside continues.
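
The ZKFC behaviour listed above can be sketched as a small Python simulation. Everything here is a hypothetical model: the LockStore class stands in for the ZK cluster (a real deployment relies on a ZooKeeper session and a lock node that disappears when the session is lost), and the tick method condenses health monitoring, election and the RPC state change into one step.

```python
class LockStore:
    """Stand-in for the ZK cluster: one lock, held by at most one owner."""

    def __init__(self):
        self.holder = None

    def try_acquire(self, owner):
        if self.holder is None:
            self.holder = owner
        return self.holder == owner

    def release(self, owner):
        if self.holder == owner:
            self.holder = None


class NameNode:
    def __init__(self, name):
        self.name = name
        self.state = "standby"
        self.healthy = True


class ZKFC:
    """Toy failover controller: one instance runs next to each NameNode."""

    def __init__(self, namenode, lock):
        self.nn = namenode
        self.lock = lock

    def tick(self):
        """One monitoring round: check health, then hold or give up the lock."""
        if not self.nn.healthy:
            self.lock.release(self.nn.name)  # an unhealthy NN must not stay elected
            self.nn.state = "standby"
        elif self.lock.try_acquire(self.nn.name):
            self.nn.state = "active"         # elected: switch the local NN to active
        else:
            self.nn.state = "standby"


if __name__ == "__main__":
    lock = LockStore()
    nn1, nn2 = NameNode("nn1"), NameNode("nn2")
    controllers = [ZKFC(nn1, lock), ZKFC(nn2, lock)]
    for c in controllers:
        c.tick()
    print(nn1.state, nn2.state)
    nn1.healthy = False                      # simulate a failure of NN1
    for c in controllers:
        c.tick()
    print(nn1.state, nn2.state)
```

Since only the lock holder is ever switched to active, the model never ends up with two active NameNodes, which mirrors the reason the election is delegated to ZK.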

Summary

HDFS HA Architecture Diagram taking three machines as an example

HA uses two nodes, an active NN and a standby NN, to solve the single point problem.

The two NN nodes share state through a JN cluster

ZKFC elects the active NN, monitors status, and performs automatic failover.

Each DN sends heartbeats to both NN nodes at the same time

Active NN:

Receives RPC requests from clients and processes them. At the same time it writes to its own edit log and also to the edit log on the JNs' shared storage.

Also receives block reports, block location updates and heartbeats from the DNs

Standby NN:

Reads the edit log from the JNs and replays the logged operations, so that its metadata stays synchronized with the active NN

The standby is a hot backup of the active NN. As soon as it switches to the active state, it can immediately serve clients in the NN role.

Also receives block reports, block location updates and heartbeats from the DNs

JN:

Used to synchronize data between the active NN and the standby NN. It is itself a cluster made up of a group of JN nodes, odd in number (in CDH, starting from three), and it is based on the Paxos protocol.

Ensures high availability

ZKFC function:

1. Monitors the state of the NameNode. ZKFC regularly sends heartbeats to ZK so that it can be elected. When it is elected in ZK, the ZKFC process switches its NN to the active state through an RPC call.
