What is the basic concept of HDFS

2025-04-06 Update From: SLTechnology News&Howtos


This article explains the basic concepts of HDFS. The material is presented simply and clearly, and should help resolve any doubts you have about how HDFS works.

I. Basic Concepts of HDFS

1. Data block

By default, the most basic storage unit of HDFS is the 64 MB data block, which plays the same role as a block in an ordinary file system, only much larger.
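The splitting of a file into fixed-size blocks can be sketched in a few lines. This is a plain-Python illustration, not Hadoop code; the function name and 64 MB constant are chosen for the example.

```python
# Sketch (not Hadoop code): how a file is split into fixed-size blocks.
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the classic HDFS default


def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE):
    """Return (offset, length) pairs for each block of a file.

    The last block only holds the remaining data, so a file that is not
    a multiple of the block size does not waste a full block.
    """
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


# A 150 MB file needs three blocks: 64 MB + 64 MB + 22 MB.
blocks = split_into_blocks(150 * 1024 * 1024)
```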

2. Metadata nodes and data nodes

The metadata node (namenode) manages the namespace of the file system. It keeps the metadata of all files and folders in a file system tree.

Data nodes (datanode) store the actual file data.

The secondary metadata node (secondarynamenode) is not, as one might imagine, a backup of the metadata node. Its main function is to periodically merge the metadata node's namespace image file with its modification log, to keep the log file from growing too large.

Let us first sort out the relationship between these three nodes. The metadata node stores what amounts to the directory of an ordinary file system: the namespace image and the modification log. The distributed file system, however, spreads the actual data across many machines for storage. The illustration below should make this clear.

(Figure: the checkpoint process between the namenode and the secondary namenode.)

3. Data flow in HDFS

Read the file

1) The client opens the file with the open() function of FileSystem.

2) DistributedFileSystem calls the metadata node via RPC to get the file's block information. For each data block, the metadata node returns the addresses of the data nodes holding a copy of it.

3) DistributedFileSystem returns an FSDataInputStream to the client for reading the data.

4) The client calls the stream's read() function to start reading. DFSInputStream connects to the nearest data node holding the first block of the file, and data flows from that data node to the client.

5) When a block has been read completely, DFSInputStream closes the connection to that data node and connects to the nearest data node holding the next block of the file.

6) When the client has finished reading, it calls the close() function of FSDataInputStream.

The whole process is shown in the figure:
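The read path can be mimicked with a toy simulation. The class names below echo the Hadoop ones, but this is pure Python and talks to no real cluster; the datanode names and block IDs are invented for illustration.

```python
# Toy simulation of the HDFS read path; names mirror the Hadoop classes,
# but nothing here talks to a real cluster.

class FakeNameNode:
    """Answers 'which datanodes hold each block of this file?' (the RPC step)."""
    def __init__(self, block_map):
        self.block_map = block_map  # path -> list of (block_id, [datanode, ...])

    def get_block_locations(self, path):
        return self.block_map[path]


class FakeDFSInputStream:
    """Reads each block from the nearest (here: first-listed) datanode."""
    def __init__(self, namenode, datanodes, path):
        self.blocks = namenode.get_block_locations(path)
        self.datanodes = datanodes  # datanode -> {block_id: bytes}

    def read_all(self):
        data = b""
        for block_id, locations in self.blocks:
            nearest = locations[0]  # pick the closest replica
            data += self.datanodes[nearest][block_id]
            # the connection to this datanode is then closed,
            # and the next block's datanode is contacted
        return data


datanodes = {
    "dn1": {"blk_1": b"hello "},
    "dn2": {"blk_2": b"hdfs"},
}
nn = FakeNameNode({"/demo.txt": [("blk_1", ["dn1", "dn2"]), ("blk_2", ["dn2"])]})
content = FakeDFSInputStream(nn, datanodes, "/demo.txt").read_all()
```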

Write a file

1) The client calls create() to create the file.

2) DistributedFileSystem calls the metadata node via RPC to create a new file in the file system's namespace. The metadata node first checks that the file does not already exist and that the client has permission to create it, then creates the new file.

3) DistributedFileSystem returns a DFSOutputStream, which the client uses to write data.

4) As the client writes, DFSOutputStream splits the data into blocks and puts them on a data queue.

5) The data queue is read by the Data Streamer, which asks the metadata node to allocate data nodes to store the blocks (3 replicas by default). The allocated data nodes form a pipeline.

6) The Data Streamer writes each block to the first data node in the pipeline; the first data node forwards it to the second, and the second forwards it to the third.

7) DFSOutputStream keeps an ack queue of outgoing blocks, waiting for the data nodes in the pipeline to acknowledge that the data has been written successfully.

8) If a data node fails during the write, the pipeline is closed and the blocks in the ack queue are put back at the front of the data queue.

The whole process is shown in the figure:
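The pipeline step above can be sketched as a toy simulation. This is not Hadoop code: the datanode names are invented, and real HDFS forwards data packet by packet rather than whole blocks at once.

```python
# Toy simulation of the HDFS write pipeline: each block is pushed through
# a chain of datanodes, and acknowledgements come back from every replica.

def write_block(block, pipeline, storage):
    """Forward a block through the pipeline of datanodes.

    Each node stores its copy before passing the block on.
    Returns the list of nodes that acknowledged the write.
    """
    acks = []
    for node in pipeline:  # dn1 -> dn2 -> dn3
        storage.setdefault(node, []).append(block)
        acks.append(node)
    return acks


storage = {}
pipeline = ["dn1", "dn2", "dn3"]  # 3 replicas by default
for block in [b"block-0", b"block-1"]:
    acks = write_block(block, pipeline, storage)
    assert acks == pipeline  # every replica acknowledged before moving on
```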

II. Architecture and Design of HDFS

Hadoop is a distributed software framework that can process large amounts of data reliably, efficiently, and scalably. It is reliable because it assumes that computing elements and storage will fail: it maintains multiple copies of working data and can redistribute work away from failed nodes. It is efficient because, following the MapReduce model, it processes tasks in parallel. Its scalability depends on the size of the cluster on which the framework is deployed; Hadoop scales out and can handle PB-scale data.

Hadoop is mainly composed of HDFS (Hadoop Distributed File System) and MapReduce engine. At the bottom is HDFS, which stores files on all storage nodes in the Hadoop cluster. One layer above HDFS is the MapReduce engine, which consists of JobTrackers and TaskTrackers.

HDFS supports operations such as creating, deleting, moving, and renaming files, and its architecture resembles a traditional hierarchical file system. Note that HDFS is built on a specific set of nodes (see figure 2), which is one of its defining characteristics. HDFS has a single NameNode, which provides metadata services, while DataNodes provide the storage blocks. Because the NameNode is unique, it is also a weakness of HDFS: a single point of failure. If the NameNode fails, the whole file system becomes unavailable.

1. HDFS architecture (as shown in figure)

2. The design of HDFS

1) Error detection and fast, automatic recovery are core architectural goals of HDFS.

2) High throughput of data access matters more than low latency of data access.

3) HDFS applications need a write-once-read-many access model for files.

4) Moving computation is cheaper than moving data.

3. Namespace of the file system

The namenode maintains the namespace of the file system and records every change to the namespace and to file properties. Even the number of copies of a file, called its replication factor, is recorded by the namenode.

4. Data replication

The namenode manages all block replication. It periodically receives a heartbeat and a Blockreport from each datanode in the cluster: the heartbeat shows that the datanode is working properly, and the Blockreport lists all the blocks on that datanode. To improve the reliability and availability of data and the utilization of network bandwidth, HDFS uses a strategy called rack awareness when placing replicas.
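The classic rack-aware placement can be sketched as follows. This is a simplification of the default policy (first replica near the writer, the other two together on a different rack); the rack and datanode names are invented, and real HDFS also considers node load and randomizes choices.

```python
# Sketch of rack-aware replica placement: one replica on the writer's rack,
# the other two together on a different rack.

def place_replicas(local_rack, racks):
    """racks: rack name -> list of datanodes. Returns 3 chosen datanodes."""
    remote = next(r for r in racks if r != local_rack)
    return [
        racks[local_rack][0],  # replica 1: stays on the local rack
        racks[remote][0],      # replicas 2 and 3: same remote rack, so only
        racks[remote][1],      # one block transfer has to cross racks
    ]


racks = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
replicas = place_replicas("rack1", racks)
# Losing an entire rack never destroys all three copies.
```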

5. Persistence of file system metadata

The namenode holds an image of the entire file system namespace and the file Blockmap in memory. This critical metadata is designed to be compact, so a namenode with 4 GB of memory is sufficient to support a very large number of files and directories. When the namenode starts, it reads the Editlog and FsImage from disk, applies all the transactions in the Editlog to the in-memory FsImage, flushes this new version of the FsImage to disk, and then truncates the old Editlog, since its transactions are now reflected in the FsImage. This process is called a checkpoint. In the current implementation, a checkpoint occurs only when the namenode starts; periodic checkpoint support will be implemented in the near future.
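The checkpoint described above can be modeled in miniature: replay the log's transactions onto the namespace image, persist the result, and truncate the log. This is a toy model, not the FsImage/Editlog on-disk format; the operation names are invented for the sketch.

```python
# Toy model of the checkpoint: apply the Editlog transactions to the
# FsImage, keep the merged result, and truncate the log.

fsimage = {"/a": "file"}                        # namespace snapshot on disk
editlog = [("create", "/b"), ("delete", "/a")]  # transactions since last checkpoint


def checkpoint(image, log):
    image = dict(image)
    for op, path in log:  # replay every logged transaction, in order
        if op == "create":
            image[path] = "file"
        elif op == "delete":
            image.pop(path, None)
    return image, []  # new FsImage; the old Editlog is truncated


fsimage, editlog = checkpoint(fsimage, editlog)
```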

6. Communication protocol

All HDFS communication protocols are built on top of TCP/IP. The client connects to the namenode through a configurable port and talks to it with ClientProtocol, while datanodes talk to the namenode with DatanodeProtocol. Both protocols are abstractions over remote procedure calls (RPC). By design, the namenode never initiates an RPC; it only responds to RPC requests from clients and datanodes.

That is the whole of "What is the basic concept of HDFS". Thank you for reading; hopefully it has helped you understand the topic.
