HDFS architecture and what are its advantages and disadvantages

2025-04-06 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)05/31 Report--

This article explains the HDFS architecture and its advantages and disadvantages in detail. The editor considers it quite practical and shares it here for reference; I hope you gain something from reading it.

A Brief Introduction to the HDFS Architecture and Its Advantages and Disadvantages

A Brief Introduction to the Architecture

HDFS has a master/slave (Master/Slave) architecture. From the end user's point of view it behaves like a traditional file system: files can be created, read, updated, and deleted (CRUD) through directory paths. Because storage is distributed, however, an HDFS cluster has one NameNode and multiple DataNodes. The NameNode manages the file system's metadata, while the DataNodes store the actual data. A client accesses the file system by interacting with both: it contacts the NameNode to obtain a file's metadata, while the actual file I/O is performed directly with the DataNodes.
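The interaction described above can be sketched as a toy model. These classes and method names are illustrative only, not the real Hadoop API; the point is the division of labor: the NameNode serves only metadata, and file bytes flow directly between the client and the DataNodes.

```python
# Toy model of the HDFS read path (hypothetical names, not the real Hadoop API).
class NameNode:
    def __init__(self):
        # filename -> ordered list of (block_id, [DataNode addresses])
        self.block_map = {}

    def get_block_locations(self, path):
        """Serve metadata only: block IDs and which DataNodes hold them."""
        return self.block_map[path]

class DataNode:
    def __init__(self):
        self.blocks = {}  # block_id -> bytes

    def read_block(self, block_id):
        return self.blocks[block_id]

def client_read(namenode, datanodes, path):
    """The client asks the NameNode where the blocks live, then streams
    the data directly from the DataNodes -- the NameNode never serves file data."""
    data = b""
    for block_id, locations in namenode.get_block_locations(path):
        dn = datanodes[locations[0]]  # read from the first replica
        data += dn.read_block(block_id)
    return data
```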

NameNode:

NameNode is the management node of the entire file system.

Function:

1. Manages the file system namespace, cluster configuration information, and block replication

2. Maintains the metadata of the entire file system: the file directory tree and the list of data blocks belonging to each file

3. Receives users' operation requests

4. Manages the mapping between files and blocks, and between blocks and DataNodes

The NameNode keeps the file system's metadata (Meta-data) in memory, mainly: file information, the blocks that make up each file, and the DataNodes on which each block resides. Without the NameNode, the file system cannot be used: if the machine running the NameNode service were destroyed, all files on the file system would effectively be lost, because there would be no way to reconstruct the files from the blocks on the DataNodes. Fault tolerance for the NameNode is therefore essential, and Hadoop provides two mechanisms for it:

The first mechanism is to back up the files that make up the persistent state of the file system metadata. Hadoop can be configured so that the NameNode writes this persistent state to multiple file systems; the writes are synchronous and atomic. The usual configuration is to write to the local disk and, at the same time, to a remotely mounted network file system (NFS).

The second mechanism is to run a secondary NameNode, which cannot itself act as a NameNode. Its important role is to periodically merge the namespace image with the edit log, preventing the edit log from growing too large. The secondary NameNode usually runs on a separate physical machine, because performing the merge requires as much memory and CPU time as the NameNode itself. It keeps a copy of the merged namespace image and can be enabled when the NameNode fails. However, the state held by the secondary NameNode always lags behind the primary node, so losing some data is inevitable when the primary node fails completely. In that case the usual practice is to copy the NameNode metadata stored on NFS to the secondary NameNode and run it as the new primary NameNode.

Files in NameNode:

fsimage: the metadata image file. A snapshot of the metadata in NameNode memory at a checkpoint.

edits: the operation (edit) log file, recording changes made since the last checkpoint.

fstime: records the time the last checkpoint was saved.

SecondaryNameNode:

The SecondaryNameNode is not an HA (high availability) solution and is not a hot standby for the NameNode.

Function:

1. Assists the NameNode by sharing part of its workload

2. Periodically merges fsimage and edits and pushes the result to the NameNode

3. Reduces the NameNode's startup time

4. In an emergency, can be used to help restore the NameNode

Execution process:

It downloads the metadata files (fsimage and edits) from the NameNode, merges the two to generate a new fsimage, saves it locally, pushes it to the NameNode, and the NameNode then resets its edits file.
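A minimal sketch of this checkpoint cycle, modelling the namespace as a set of paths and the edit log as a list of operations. All names here are illustrative, not Hadoop internals; the point is that replaying edits into the image yields a fresh fsimage and an empty edit log.

```python
# Toy model of the SecondaryNameNode checkpoint described above.
def apply_edit(namespace, edit):
    """Replay one logged operation against the in-memory namespace."""
    op, path = edit
    if op == "create":
        namespace.add(path)
    elif op == "delete":
        namespace.discard(path)
    return namespace

def checkpoint(fsimage, edits):
    """Merge the edit log into the namespace image, as the SecondaryNameNode
    does, and return (new_fsimage, empty_edit_log)."""
    new_image = set(fsimage)
    for edit in edits:
        apply_edit(new_image, edit)
    return new_image, []
```

After the merge, the NameNode can continue appending new operations to the now-empty edits file, keeping it from growing without bound.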

DataNode:

A DataNode is a storage service that holds the real file data and is the basic unit of file storage. It stores Blocks in the local file system, keeps the metadata (Meta-data) of each Block, and periodically sends the NameNode a report of all the Blocks it holds.

DataNodes are also the worker nodes of the file system: they store and retrieve blocks as needed (on behalf of clients or the NameNode) and periodically send the NameNode a list of the blocks they are storing.

A Block is the most basic storage unit in a DataNode.

The concept of data blocks:

For file storage, a file of a given size is divided, starting from offset zero, into fixed-size, sequentially numbered pieces; each piece is called a Block.

In HDFS the default Block size is 64 MB. Unlike ordinary file systems, in HDFS a file smaller than a data block does not occupy the entire block's storage space.
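The splitting rule above can be illustrated with a short calculation, using the 64 MB default block size. The function name is purely illustrative; note that the last block only holds the remaining bytes.

```python
def split_into_blocks(file_size, block_size=64 * 1024 * 1024):
    """Split a file of file_size bytes, starting at offset zero, into
    fixed-size, sequentially numbered blocks; the last block may be
    smaller and occupies only the bytes it actually holds."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((len(blocks), offset, length))  # (number, offset, length)
        offset += length
    return blocks
```

For example, a 300 MB file yields four full 64 MB blocks plus one final 44 MB block.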

Why are the blocks in HDFS so large?

HDFS blocks are larger than disk blocks in order to minimize seek overhead. If the block is made large enough, the time to transfer the data from disk is significantly longer than the time to seek to the start of the block. The time to transfer a large file made up of multiple blocks then depends mainly on the disk transfer rate.
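A quick back-of-the-envelope calculation makes this concrete. The seek time and transfer rate below are assumed, typical figures for a spinning disk, not numbers from the article:

```python
# Assumed, typical figures for a spinning disk:
seek_time = 0.010        # ~10 ms average seek
transfer_rate = 100e6    # ~100 MB/s sustained transfer
block_size = 128e6       # a 128 MB block

transfer_time = block_size / transfer_rate  # 1.28 s to stream the block
overhead = seek_time / transfer_time        # fraction of time spent seeking
```

With these figures the seek accounts for under 1% of the total, whereas with a 4 KB disk-sized block the same seek would dwarf the transfer time.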

In many deployments HDFS is configured with a 128 MB block size. However, this parameter should not be set too large: a map task in MapReduce usually processes one block at a time, so if there are too few tasks (fewer than the number of nodes in the cluster), the job runs more slowly than it otherwise could.

Each file has multiple replicas; the HDFS default is 3. This can be configured in hdfs-site.xml (the dfs.replication property).
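For reference, the corresponding hdfs-site.xml fragment might look like this (dfs.replication is the property named above; the value 3 matches the default):

```xml
<!-- hdfs-site.xml: default replication factor for new files -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```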

Master in HDFS:

The node configured in the masters file under Hadoop's conf directory plays the following main roles:

1. Manage the namespace of HDFS

2. Manage block mapping information

3. Configures the replication policy

4. Handle client read and write requests

Slave in HDFS:

The nodes configured in the slaves file under Hadoop's conf directory play the following main roles:

1. Store the actual data blocks

2. Perform block read / write

Client in HDFS:

Function:

1. Splits files into blocks; interacts with the NameNode to obtain file location information

2. Interacts with DataNodes to read or write data

3. Manages HDFS

4. Accesses HDFS

This concludes "HDFS architecture and its advantages and disadvantages". I hope the content above was helpful and that you learned something from it; if you think the article is good, please share it so more people can see it.
