How to use HDFS Federation 07/12 Update SLTechnology News&Howtos

How to use HDFS Federation

2025-07-12 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Servers >

Shulou(Shulou.com)05/31 Report--

This article mainly introduces how to use HDFS Federation, has a certain reference value, interested friends can refer to, I hope you can learn a lot after reading this article, the following let the editor take you to understand it.

1. Overview of current HDFS architecture and functionality

Let's review the HDFS functionality first. HDFS actually has two functions: namespace management (Namespace management) and block / storage management service (block/storage management).

1.1 Namespace management

The namespace of HDFS contains directories, files, and blocks. Namespace management: namespaces support basic operations such as file system creation, modification, deletion, list files and directories on directories, files and blocks in HDFS.

1.2 blocks per storage management

There are two parts in the block storage service: block management and physical storage. This is a more general storage service. Other applications can be built directly on Block Storage, such as HBase,Foreign Namespaces and so on.

1.2.1 Block Management

A) handle requests for registration from Data Node to Name Node, membership of datanode, and periodic heartbeats from Data Node.

B) process the report information from the block and maintain the location information of the block.

C) handle block-related operations: block creation, deletion, modification and acquisition of block information.

D) manage replica placement (replica placement) and block replication and deletion of redundant blocks.

1.2.2 physical Stora

The so-called physical storage is that Data Node stores blocks in the local file system and reads and writes to the local file system.

1.3 the architecture of the current HDFS

In the current HDFS architecture (before Hadoop v0.23), there is only one namespace in the entire HDFS cluster, and there is only a single Name Node, and this Name Node is responsible for managing this single namespace. This is the hidden danger of single point of failure (Single Point Failure). The HDFS Federation mentioned in this article is an improvement to the defects in the current HDFS architecture. To put it simply, HDFS Federation is to make HDFS support multiple namespaces and allow multiple Name Node to exist in HDFS at the same time.

Take a brief look at the current architecture of HDFS, as shown in the following figure. There is only one Namenode and one Backup Namenode in the entire HDFS cluster. Namenode synchronizes the information of the changed HDFS to the Backup Namenode in real time. As the name implies, Backup Namenode is used to make backups of Namenode. Namespaces in Namenode store the correspondence between file names and BlockID, and between BlockID and specific Block locations in a hierarchical organization. This separate Namenode manages several Datanode,Block distributed in each Datanode, and each Datanode periodically sends a heartbeat message to this Namenode, reporting the usage status of its own Datanode. Block is the smallest unit used to store data, usually a file will be stored in one or more Block, the default Block size is 64MB.

two。 Limitations of HDFS schemas for a single Namenode 2.1 limitations of Namespace (namespaces)

Because Namenode stores all the metadata (metadata) in memory, the number of objects (files + blocks) that a single Namenode can store is limited by the heap size of the JVM where the Namenode resides. The 50-gigabyte heap can store 2 billion (200 million) objects, and these 2 billion objects support 4000 datanode,12PB storage (assuming the average file size is 40MB).

With the rapid growth of data, the demand for storage also increases. A single datanode grows from 4T to 36T, and the size of the cluster grows to 8000 datanode. The demand for storage has grown from 12PB to greater than 100PB.

2.2 performance bottlenecks

Because it is the HDFS architecture of a single Namenode, the throughput of the entire HDFS file system is limited by the throughput of a single Namenode. There is no doubt that this will become the bottleneck of the next generation of MapReduce.

2.3 isolate issu

Because HDFS has only one Namenode, it is impossible to isolate programs, so one experimental program on HDFS is likely to affect programs running on the entire HDFS. Then in HDFS Federation, different Namespace can be used to isolate different user applications, so that the programs in different Namespace Volume do not affect each other.

2.4 availability of clusters

In a HDFS with only one Namenode, the downtime of this Namenode will undoubtedly make the entire cluster unavailable.

2.5 tight coupling of Namespace and Block Management

The tight coupling of the current combination of Namespace and Block Management in Namenode makes it difficult to implement another Namenode scheme and limits other applications that want to use block storage directly.

2.6 Why is it not feasible to scale up the current Namenode? For example, expand the Heap space of Namenode to 512GB.

The first problem with this vertical scaling is the startup problem, which takes too long to start. Currently, it takes about 30 minutes to 2 hours for a HDFS with 50GB Heap Namenode to start, how long does it take for 512GB?

The second potential problem is that when Namenode is in Full GC, an error will lead to the downtime of the entire cluster.

The third problem is that it is difficult to debug a large JVM Heap. The performance-to-price ratio of optimized Namenode memory usage is relatively low.

3. The main reason why Federation is introduced into Federation is simplicity, which is compared with the real distributed Namenode. Federation can quickly solve most of the problems of single Namenode HDFS. Federation is a simple and robust design, because each Namenode in the alliance is independent of each other. Federation's entire core design implementation took about three and a half months. Most of the changes are in Datanode, Config, and Tools, while Namenode itself has very few changes, so that the original robustness of Namenode will not be affected. It is simpler than a distributed Namenode, and although the scalability of this implementation is smaller than that of a true distributed Namenode, it can meet the requirements quickly. Another reason is the good backward compatibility of Federation, and the existing single Namenode deployment configuration can continue to work without any change.

So Federation is one of the options for the future. The configuration in the current single Namenode architecture can be seamlessly supported in the Federation architecture.

4. HDFS Federation

HDFS Federation uses multiple independent Namenode/namespace to enable HDFS naming services to scale horizontally. The Namenode in HDFS Federation is an alliance, they are independent of each other and do not need to coordinate with each other. Namenode in HDFS Federation provides namespace and block management capabilities. The datanode in HDFS Federation is used as a common storage block by all Namenode. Each datanode registers with all Namenode in its cluster, sends heartbeat and block information reports periodically, and processes instructions from Namenode.

4.1 comparison between Federation HDFS and current HDFS

The current HDFS has only one namespace (Namespace), which uses all blocks. In Federation HDFS, there are multiple separate namespaces (Namespace), and each namespace uses a block pool (block pool).

There is only one set of blocks in the current HDFS. On the other hand, there are several sets of independent blocks in Federation HDFS. A block pool is a group of blocks that belong to the same namespace.

The current HDFS consists of a Namenode and a set of datanode. The Federation HDFS consists of multiple Namenode and a set of datanode, each datanode for multiple block pool (block pool) storage blocks.

4.2 Block Pool (Block Pool)

A Block pool (block pool) is a set of block (blocks) that belong to a single namespace. Each datanode is for all block pool storage blocks. Datanode is a physical concept, while block pool is a logical concept that redivides block. Multiple blocks belonging to multiple block pool can be stored in the same datanode. Block pool allows one namespace to create a Block ID for a new block without notifying other namespaces. At the same time, the failure of a Namenode will not affect the services of the datanode under it for other Namenode.

Block pool is automatically established when datanode establishes contact with Namenode and starts a session. Each block has a unique identity, which we call the extended block ID (Extended BlockID) = BlockID+BlockID. This extended block ID is unique between HDFS clusters, which creates conditions for future cluster merging.

The data structures in Datanode are indexed by block pool ID (BlockPoolID), that is, BlockMap,storage and so on in datanode are indexed by BPID.

In HDFS, all updates and rollbacks take place in Namenode and BlockPool units. That is, there is no relationship between different Namenode/BlockPool in the same HDFS Federation.

In Hadoop V0.23, the management function of Block Pool is still placed in Namenode, and the management function of Block Pool will be moved in a new function node in a future version.

4.3 improvements to Datanode

In datanode, there is a corresponding thread for each Namnode. Every datanode registers with every Namenode and periodically sends heartbeat and datanode usage reports to all Namenode. Datanode also sends Namenode the block report (block report) of the block pool in which it is located. Because there are multiple Namenode at the same time, any Namenode can be dynamically added, deleted, and updated at any time.

4.4 improvements in other aspects of Federation

Provides tools for monitoring and managing initialization and decommissioning of Namenode.

Allow load balancing at datanode level or block pool level.

Datanode background daemon, disk and directory scans for Federation.

Provides a Web UI that displays the usage status of Namenode's Block pool.

A UI presentation of the usage status of all cluster storage is also provided.

All the Namenode and its details are listed in Web UI, such as Namenode-BlockPoolID and storage usage status, lost contact, live and dead block information. There are also links to each Namenode Web UI.

Display of Datanode retirement status.

4.5 Management of multiple namespaces

Whether you need a unique namespace or multiple namespaces in a cluster, the core problem is the sharing and access of data in namespaces. Using globally unique namespaces is one way to address data sharing and access. Under multiple namespaces, we can also use Client Side Mount Table to achieve data sharing and access.

As shown in the figure above, each dark triangle represents a separate namespace, and the light triangle above represents access to the lower subnamespace from the customer's point of view. Each dark namespace Mount to a light-colored table, customers can access different mount points to access different namespaces, just as they can access different mount points in Linux systems. This is the basic principle of namespace management in HDFS Federation: by mounting each namespace into a global mount-table, you can do the data to global sharing; the same namespace is mounted to a personal mount-table, which becomes a view of the namespace visible to the application.

4.6 Namespace Volume (Namespace Volume)

A Namespace and its Block Pool are called Namespace Volume. Namespace Volume is an independent and complete snap-in. When a Namenode/Namespace is deleted, the corresponding Block Pool is also deleted. Each Namespace Volume will also be used as a unit when upgrading.

4.7 ClusterID

Cluster ID is added to HDFS Federation to distinguish each node in the cluster. When formatting a Namenode, the ClusterID is generated automatically or provided manually. This ClusterID is used when formatting other Namenode in the same cluster.

4.8 HDFS Federation is compatible with older versions of HDFS

This compatibility allows existing Namenode configurations to continue to work without any changes.

Thank you for reading this article carefully. I hope the article "how to use HDFS Federation" shared by the editor will be helpful to everyone. At the same time, I also hope that you will support and pay attention to the industry information channel. More related knowledge is waiting for you to learn!

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.