

Analysis of HDFS Architecture and Design

2025-03-30 Update From: SLTechnology News&Howtos



Author | Dazun

HDFS is Hadoop's distributed file system (Hadoop Distributed File System). This article walks through the more important points in the design of HDFS so that readers can get a picture of the whole system from a short article. It is aimed at readers who know a little about HDFS but are still hazy on the details, and it mainly follows the official Hadoop 3.0 documentation.

Link: http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html

When a dataset grows beyond the storage capacity of a single physical machine, it has to be partitioned and stored across a number of separate machines. A file system that manages storage across a network of machines is called a distributed file system.

Contents

Scenarios using HDFS

The working mode of HDFS

File system namespace

Data replication

Persistence of file system metadata

Communication protocol

Robustness

Data organization

Accessibility

Storage space recycling

1. Scenarios using HDFS

HDFS is designed for storing very large files with streaming data access: write once, read many times, with long-running analyses over the dataset in which each analysis touches most or even all of the data. For very large files, Hadoop currently supports storage at the petabyte scale.

HDFS is not suitable for applications that require low-latency data access, because HDFS is optimized for high data throughput, which may come at the cost of latency.

The total number of files an HDFS file system can store is limited by the memory capacity of the namenode. As a rule of thumb, one million files, each occupying one block, require at least 300 MB of namenode memory.

Currently an HDFS file may have only one writer at a time, writes always append data to the end of the file, and in-place modification at arbitrary offsets is not supported.

Like ordinary file systems, HDFS has the concept of a block, 128 MB by default. Files on HDFS are split into block-sized chunks that are stored as independent units, but a file smaller than a single block does not occupy a whole block's worth of space. Unless stated otherwise, the blocks mentioned in this article refer specifically to HDFS blocks.

HDFS blocks are this large in order to minimize addressing (seek) overhead. The value should not be set too high either: map tasks in MapReduce usually process one block at a time, so if a file yields too few blocks there are too few tasks and the job runs slowly.
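As a quick client-side illustration of blocks, here is a minimal Java sketch (hedged: it assumes a reachable cluster whose configuration is on the classpath, and the file path is purely illustrative) that prints a file's block size and the datanodes holding each block via the public FileSystem API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();            // reads core-site.xml / hdfs-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/test/data.txt");          // hypothetical path, replace with a real file
        FileStatus status = fs.getFileStatus(file);
        System.out.println("Block size: " + status.getBlockSize());
        // one BlockLocation per block, listing the datanodes that hold its replicas
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset=" + loc.getOffset() + " length=" + loc.getLength()
                    + " hosts=" + String.join(",", loc.getHosts()));
        }
        fs.close();
    }
}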

2. The working mode of HDFS

HDFS uses a master/slave architecture: one namenode (the manager) and multiple datanodes (the workers).

The namenode manages the file system namespace: it maintains the file system tree and the metadata for all the files and directories in the tree. This information is stored in two files, the namespace image file and the edit log file. The namenode also records the datanodes on which each block of each file is located. Datanodes are the working nodes of the file system: they store and retrieve blocks (when told to by clients or the namenode) and periodically report the list of blocks they are storing back to the namenode.

Without the namenode the file system cannot be used, because there is no way to reconstruct files from the blocks on the datanodes, so fault tolerance for the namenode is very important. Hadoop provides two mechanisms for this.

The first mechanism is to back up the files that make up the persistent state of the file system metadata. Typically, the persistent state is written to local disk and to a remotely mounted NFS share at the same time.

The second is to run a secondary namenode, which periodically merges the edit log into the namespace image and keeps a copy of the merged image locally, so it can be brought into service if the namenode fails. However, the secondary's state always lags the primary, so some data loss is almost inevitable when the primary fails. The usual practice is therefore to copy the namenode metadata kept on NFS to the secondary namenode and run it as the new namenode. This touches on the failover mechanism, which is analyzed a little later.

3. File system namespace

HDFS supports the traditional hierarchical file organization structure. The user or application can create directories and save files in those directories.

The hierarchy of the file system namespace is similar to that of most existing file systems: users can create, delete, move, or rename files. HDFS supports user disk quotas and access permissions; it currently supports neither hard links nor soft links, but the HDFS architecture does not preclude implementing these features.

The namenode maintains the file system namespace, and any change to the namespace or its attributes is recorded by the namenode. An application can set the number of replicas HDFS keeps for a file; this number is called the file's replication factor, and it too is stored by the namenode.

4. Data replication

HDFS is designed to store very large files reliably across machines in a large cluster. It stores each file as a series of blocks, all of which are of the same size except the last one.

For fault tolerance, every block of a file has replicas. The block size and replication factor are configurable per file. An application can specify the number of replicas of a file; the replication factor can be set when the file is created and changed later.
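For example, a minimal Java sketch (hedged; it assumes a running cluster and an existing file at a purely illustrative path) that changes a file's replication factor through the FileSystem API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path p = new Path("/user/test/existing-file.txt");   // hypothetical existing file
        // raise the replication factor to 3; the namenode schedules the extra copies asynchronously
        boolean changed = fs.setReplication(p, (short) 3);
        System.out.println("replication change accepted: " + changed);
        fs.close();
    }
}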

Files in HDFS are write-once, and it is strictly required that there is only one writer at any time.

Namenode oversees block replication and periodically receives heartbeats and block status reports (Blockreport) from each Datanode in the cluster. When a Datanode starts, it scans the local file system, produces a list of all HDFS blocks corresponding to these local files, and then sends it to Namenode as a report, which is the block status report. Receiving a heartbeat signal means that the Datanode node is working properly. The block status report contains a list of all the blocks on the Datanode.

To list a file's blocks and check block health: hdfs fsck / -files -blocks, or simply hdfs fsck /.


On a datanode's local disk, HDFS blocks are stored in files whose names begin with blk_, and each block has an associated metadata file with a .meta suffix that contains a header and a series of checksums for sections of the block.

When the number of blocks in a directory grows to a certain size, the datanode creates a subdirectory to hold new blocks and their metadata. Once the current directory holds 64 blocks (set by the dfs.datanode.numblocks property), a new subdirectory is created, the goal being a directory tree with high fan-out.

If the dfs.datanode.data.dir property specifies multiple directories on different disks, blocks are written to the directories in a round-robin fashion.

Each datanode also runs a block scanner that periodically verifies every block stored on the node, so that bad blocks can be detected and repaired before a client reads them. By default, blocks are verified every three weeks, and any problems found are reported so they can be fixed.

Users can view a datanode's block scanner report at http://datanode:50075/blockScannerReport.

Replica placement

Replica storage is the key to the reliability and performance of HDFS. Optimized replica storage strategy is an important feature that distinguishes HDFS from most other distributed file systems. This feature requires a lot of tuning and experience. HDFS uses a strategy called rack awareness (rack-aware) to improve data reliability, availability, and network bandwidth utilization. The currently implemented replica storage strategy is only the first step in this direction.

Through a rack-aware process, Namenode can determine the rack id to which each Datanode belongs. A simple but unoptimized strategy is to store copies on different racks. This effectively prevents the loss of data when the entire rack fails, and allows you to make full use of the bandwidth of multiple racks when reading data. This policy setting can distribute replicas evenly in the cluster, which is beneficial to load balancing in the case of component failure. However, because one write operation of this strategy requires the transfer of data blocks to multiple racks, this increases the cost of writing.

In most cases the replication factor is 3. HDFS's placement policy is to put one replica on the writer's own node (or on a random node in the local rack if the writer is not on a datanode), another replica on a node in a different (remote) rack, and the last one on a different node in that same remote rack. This policy cuts inter-rack write traffic and so improves write performance.

In practice, Hadoop 2.0 offers two strategies for choosing which disk (data directory) on a datanode stores a block replica:

The first is the round-robin policy over the disk directories, carried over from Hadoop 1.0, implemented by the class RoundRobinVolumeChoosingPolicy.java.

The second chooses a disk with enough free space, implemented by the class AvailableSpaceVolumeChoosingPolicy.java.

The configuration item that selects the second policy is shown in the sketch at the end of this subsection.

If nothing is configured, the first method is used by default, i.e. disks are selected in round-robin order to store replicas. Round-robin guarantees that every disk gets used, but the data often ends up unevenly distributed: some disks fill up while others still have plenty of unused space. On a Hadoop 2.0 cluster it is therefore usually better to configure the second policy, which places replicas according to how much space each disk has left; this still uses every disk while keeping them balanced.

When the second policy is used, two further parameters come into play (also shown in the sketch below):

The first defaults to 10737418240, i.e. 10 GB, and the default is usually kept. Officially, the policy first computes two values: the maximum available space across all disks and the minimum available space across all disks. If the difference between the two is below the threshold given by this item, the disks are considered balanced enough and the round-robin policy is used to place the replica.

The second defaults to 0.75f, and the default is usually kept. It specifies what fraction of new replicas should go to the disks with more free space. Its range is 0.0-1.0, usually 0.5-1.0; if it is set too low, the disks with plenty of free space do not actually receive enough replicas while the nearly full disks have to store more, which again leads to uneven disk usage.
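For reference, a hedged hdfs-site.xml-style sketch of the properties involved, as they are commonly named in Hadoop 2.x (verify the exact keys against your version's hdfs-default.xml):

dfs.datanode.fsdataset.volume.choosing.policy = org.apache.hadoop.hdfs.server.datanode.fsdataset.AvailableSpaceVolumeChoosingPolicy
dfs.datanode.available-space-volume-choosing-policy.balanced-space-threshold = 10737418240      (10 GB, the balance threshold described above)
dfs.datanode.available-space-volume-choosing-policy.balanced-space-preference-fraction = 0.75f   (fraction of new replicas sent to the emptier disks)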

Replica selection

To reduce overall bandwidth consumption and read latency, HDFS tries to have readers read from the nearest replica. If there is a replica on the same rack as the reader, that replica is read. If the HDFS cluster spans multiple data centers, the client also prefers a replica in the local data center.

Safety mode

After the namenode starts, it enters a special state called safe mode. A namenode in safe mode does not replicate blocks. The namenode receives heartbeats and block reports (Blockreport) from every datanode; a block report lists all the blocks held by that datanode. Each block has a specified minimum number of replicas.

When the namenode confirms that the number of replicas of a block has reached this minimum (1 by default, set by the dfs.namenode.replication.min property), the block is considered safely replicated. After a configurable percentage of blocks (default 0.999f, set by dfs.safemode.threshold.pct) have been confirmed safe, and after an additional 30-second wait, the namenode exits safe mode. It then determines which blocks still have fewer than the specified number of replicas and copies those blocks to other datanodes.

If the datanodes have lost too large a fraction of their blocks, the namenode stays in safe mode, which effectively means read-only mode.

What should I do when namenode is in safe mode?

Find the problem and fix it (for example, repair a downed datanode).

Or you can force the namenode out of safe mode manually (without actually solving the underlying problem): hdfs dfsadmin -safemode leave.

When an HDFS cluster is cold-started normally, the namenode remains in safe mode for a while. There is nothing to worry about at that point; just wait for it to exit safe mode automatically.

Safe mode can be operated with hdfs dfsadmin -safemode <value>. The values are:

enter - enter safe mode

leave - force the NameNode to leave safe mode

get - report whether safe mode is on

wait - wait until safe mode has been exited before executing the following command

5. Persistence of file system metadata

The namespace of HDFS is saved on Namenode. Namenode records any changes to the file system metadata using a transaction log called EditLog. For example, if you create a file in HDFS, Namenode will insert a record in Editlog to represent it; similarly, changing the copy factor of the file will also insert a record into Editlog. Namenode stores this Editlog in the file system of the local operating system.

The namespace of the entire file system, including block-to-file mapping, file attributes, and so on, is stored in a file called FsImage, which is also placed on the local file system where Namenode resides.

The namenode holds the namespace of the entire file system and the file-to-block mapping (Blockmap), i.e. the FsImage, in memory. This key metadata structure is compact: a namenode with 4 GB of memory is enough to support a very large number of files and directories.

When the namenode starts, or when the checkpoint threshold in the configuration is reached, it reads the Editlog and FsImage from disk, applies all the transactions in the Editlog to the in-memory FsImage, saves this new version of the FsImage from memory back to local disk, and then truncates the old Editlog, since its transactions have already been applied to the FsImage. This process is called a checkpoint.

hdfs dfsadmin -fetchImage fsimage.backup

// manually fetch the latest fsimage file from the namenode and save it as a local file

Because the edit log grows without bound, replaying it at restart takes longer and longer. The solution is to run a secondary namenode, which creates checkpoints of the file system metadata held in the primary namenode's memory; the primary then always has an up-to-date fsimage file and a smaller edits file.

This also explains why the secondary namenode and the primary namenode have similar memory requirements (the secondary namenode also needs to load the fsimage file into memory).

The conditions that trigger the creation of a checkpoint are controlled by two configuration parameters:

dfs.namenode.checkpoint.period - the secondary namenode creates a checkpoint at this regular interval, in seconds.

dfs.namenode.checkpoint.txns - a checkpoint is also created once the edit log has accumulated this many transactions since the previous checkpoint.
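For reference, the usual defaults are as follows (hedged; check hdfs-default.xml for your distribution):

dfs.namenode.checkpoint.period = 3600       (seconds, i.e. one hour)
dfs.namenode.checkpoint.txns = 1000000      (one million edit log transactions)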

In the event of a failure of the primary NameNode (assuming no backup), data can be recovered from the secondary namenode. There are two ways to do this.

Method 1: copy the relevant storage directories to the new namenode.

Method 2: start the namenode daemon with the -importCheckpoint option, so that the secondary namenode becomes the new primary namenode, provided there is no metadata in the directories defined by the dfs.namenode.name.dir property.

6. Communication protocol

All HDFS communication protocols are based on the TCP/IP protocol. The client connects to Namenode through a configurable TCP port and interacts with Namenode through the ClientProtocol protocol. Datanode uses the DatanodeProtocol protocol to interact with Namenode.

A remote procedure call (RPC) abstraction wraps both the ClientProtocol and the DatanodeProtocol. By design, the namenode never initiates an RPC itself; it only responds to RPC requests issued by clients or datanodes.

7. Robustness

The main goal of HDFS is to ensure the reliability of data storage even in the event of errors.

Three common types of errors are: Namenode error, Datanode error, and network fragmentation (network partitions).

Heartbeat detection, disk data errors and re-replication.

Each datanode periodically sends a heartbeat to the namenode. A network partition can cause some datanodes to lose contact with the namenode. The namenode detects this through the absence of heartbeats, marks datanodes that have not sent one recently as dead, and stops sending new IO requests to them. Any data registered on a dead datanode is no longer available to HDFS.

A datanode going down may cause the replication factor of some blocks to drop below the specified value. The namenode continually tracks which blocks need to be re-replicated and starts replication as soon as the need is detected.

Set the appropriate datanode heartbeat timeout to avoid replication storms caused by datanode instability.
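As a point of reference (an assumption to verify against your version), in Hadoop 2.x the namenode's dead-node timeout is usually computed from two properties as:

timeout = 2 * dfs.namenode.heartbeat.recheck-interval + 10 * dfs.heartbeat.interval

With the defaults of 300000 ms (5 minutes) and 3 seconds respectively, a datanode is declared dead only after roughly 10 minutes 30 seconds without a heartbeat, which keeps brief hiccups from triggering replication storms.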

Re-replication may also be needed in other cases: a datanode becomes unavailable, a replica is corrupted, a disk on a datanode fails, or the replication factor of a file is increased.

Cluster balancing (for datanode)

The HDFS architecture supports data balancing policies. If the free space on a datanode falls below a certain threshold, the system automatically moves data from that datanode to other, less full datanodes according to the balancing policy.

If there is a sudden increase in the number of requests for files, it is also possible to start a plan to create a new copy of the file while rebalancing other data in the cluster. This equilibrium strategy has not been implemented yet.

Data integrity (for datanode)

Blocks fetched from a datanode may arrive corrupted, due to storage device faults on the datanode, network errors, or software bugs. The HDFS client software performs checksum verification of the contents of HDFS files.

When the client creates a new HDFS file, it calculates the checksum for each block of data in the file and saves the checksum as a separate hidden file in the same HDFS namespace. When the client acquires the contents of the file, it verifies that the data obtained from the Datanode matches the checksum in the corresponding checksum file, and if not, the client can choose to obtain a copy of the block from another Datanode.
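A small Java sketch of this behaviour (hedged; the path is illustrative): checksum verification happens automatically when a file is opened through FileSystem, and it can be switched off explicitly, for example when trying to salvage what is left of a corrupt file:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ChecksumRead {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path p = new Path("/user/test/data.txt");   // hypothetical path
        // normal read: block data is verified against the stored checksums as it streams in
        try (FSDataInputStream in = fs.open(p)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        // disable verification, e.g. to copy out whatever remains of a corrupt file
        fs.setVerifyChecksum(false);
        try (FSDataInputStream in = fs.open(p)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}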

Metadata disk error (error for namenode)

FsImage and Editlog are the core data structures of HDFS. If these files are corrupted, the whole HDFS instance stops working. For this reason the namenode can be configured to maintain multiple copies of the FsImage and Editlog; any update to either file is synchronized to all copies. This multi-copy synchronization may reduce the number of namespace transactions the namenode can process per second, but the cost is acceptable, because even data-intensive HDFS applications are not metadata-intensive. When the namenode restarts, it selects the latest consistent FsImage and Editlog to use.

Another option is to enhance fault resilience by implementing multiple namenode nodes (HA) through shared storage of NFS or a distributed editing log (also known as journal).

In an HDFS HA implementation, a pair of namenodes is configured in an active-standby arrangement; when the active namenode fails, the standby namenode takes over its duties and continues serving client requests.

Implementing HA requires the following architectural changes:

The namenodes share the edit log via highly available shared storage. When the standby namenode starts up, it reads the shared edit log through to the end to synchronize its state with the active namenode, and then keeps reading new entries as the active namenode writes them.

Datanodes must send block reports to both namenodes, because the block-to-datanode mapping is held in namenode memory, not on disk.

Clients use a mechanism for handling namenode failover that is transparent to the user.

The role of the secondary namenode is subsumed by the standby, which takes periodic checkpoints of the active namenode's namespace.

Snapshot

Snapshots support replication and backup of data at a particular time. Snapshots allow HDFS to recover to a known correct point in time in the event of data corruption.

8. Data organization

Data block

HDFS is designed to support large files and suits applications that need to handle large datasets. These applications write their data once but read it one or more times, and reads must proceed at streaming speeds; HDFS supports write-once-read-many semantics for files. A typical block size is 128 MB, so files in HDFS are split into 128 MB chunks, with each chunk stored on a different datanode where possible.

Pipelined replication

When a client writes data to an HDFS file, the data is first written to a local temporary file. Suppose the file's replication factor is 3: once the local temporary file has accumulated a full block of data, the client obtains from the namenode a list of datanodes that will hold the replicas. The client then starts transferring the data to the first datanode, which receives it in small portions (4 KB), writes each portion to its local store, and at the same time forwards it to the second datanode in the list. The second datanode does the same: it receives each small portion, writes it locally, and passes it on to the third datanode, which finally receives the data and stores it locally.

A datanode can therefore receive data from the previous node and forward it to the next node at the same time, so the data flows from one datanode to the next in a pipeline.
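From the client's point of view the whole pipeline is hidden behind an ordinary output stream. A minimal Java sketch (hedged; the path, replication factor, and block size are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.nio.charset.StandardCharsets;

public class PipelineWrite {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path out = new Path("/user/test/pipeline-demo.txt");   // hypothetical path
        // overwrite = true, 4 KB client buffer, replication factor 3, 128 MB block size
        try (FSDataOutputStream os = fs.create(out, true, 4096, (short) 3, 128L * 1024 * 1024)) {
            os.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }   // close() flushes the last packet and waits for the pipeline to acknowledge it
        fs.close();
    }
}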

9. Accessibility

HDFS offers applications a variety of access methods. Users can access it through the Java API, through a C-language wrapper around that API, or through a web browser; access via the WebDAV protocol is under development.

DFSShell

HDFS organizes user data in the form of files and directories. It provides a command line interface (DFSShell) for users to interact with the data in HDFS. The syntax of the command is similar to other shell tools that users are familiar with (such as bash, csh). Here are some examples of actions / commands:
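For example (a few representative commands; the paths are illustrative):

hadoop fs -mkdir /foodir                 # create a directory
hadoop fs -put localfile.txt /foodir     # copy a local file into HDFS
hadoop fs -cat /foodir/localfile.txt     # print the contents of a file
hadoop fs -ls /foodir                    # list a directory
hadoop fs -rm -r /foodir                 # delete a directory and its contents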

DFSAdmin

The DFSAdmin command set is used to administer an HDFS cluster. These commands are available only to HDFS administrators. Here are some examples of actions / commands:
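For example (illustrative; all are subcommands of hdfs dfsadmin):

hdfs dfsadmin -report            # print basic cluster statistics and datanode status
hdfs dfsadmin -safemode enter    # put the namenode into safe mode
hdfs dfsadmin -refreshNodes      # re-read the allowed/excluded datanode host lists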

Browser interface

A typical HDFS installation opens a Web server to expose the HDFS namespace on a configurable TCP port. Users can use a browser to browse the HDFS namespace and view the contents of the file.

http://ip:50070

10. Storage space recycling

Deletion and recovery of files

When the trash feature is enabled, files deleted through the fs shell are not removed from HDFS immediately. Instead, HDFS renames the file and moves it into the /user/<username>/.Trash directory. As long as the file remains in .Trash, it can be restored quickly. How long files are kept in the trash is configurable; once the period expires, the namenode deletes the file from the namespace, which releases the blocks associated with it. Note that there can be a delay between the moment the user deletes a file and the moment the free space in HDFS actually increases.

As long as the deleted file is still in the .Trash directory, the user can restore the file. If the user wants to recover the deleted file, he / she can browse the .Trash directory to retrieve the file.
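The retention period is usually controlled by fs.trash.interval in core-site.xml (a hedged sketch; the value is in minutes, and 0 disables the trash entirely):

fs.trash.interval = 1440     (keep deleted files in .Trash for one day before they are permanently removed)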

Reducing the replication factor

When the replication factor of a file is reduced, the namenode selects the excess replicas to delete and passes this information to the datanodes at the next heartbeat. The datanodes then remove the corresponding blocks and the free space in the cluster increases. Similarly, there may be a delay between the completion of the setReplication API call and the appearance of the extra free space in the cluster.

