How does HDFS work?

2025-01-28 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

This article explains the working principle of HDFS from a professional point of view. We hope you come away with something useful after reading it.

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. HDFS is highly fault-tolerant, is suitable for deployment on inexpensive machines, and provides high-throughput data access, which makes it well suited to applications with very large data sets. To understand the inner workings of HDFS, you must first understand what a distributed file system is.

1. Distributed file system

When multiple computers work together over a network (sometimes called a cluster) to solve a problem as if they were a single system, we call this a distributed system.

A distributed file system is a subset of distributed systems, and the problem it solves is data storage. In other words, it is a storage system that spans multiple computers. Data stored on a distributed file system is automatically distributed across different nodes.

Distributed file systems have broad prospects in the era of big data: they provide the scalability needed to store and process very large data sets from the web and elsewhere.

2. Separate metadata and data: NameNode and DataNode

Each file stored in a file system has associated metadata. The metadata includes the file name, inode number, data block locations, and so on, while the data is the actual content of the file.

In a traditional file system, metadata and data are stored on the same machine, because the file system does not span multiple machines.

To build a distributed file system in which clients are easy to use and do not need to know about the activity of other clients, the metadata must be maintained outside the clients. HDFS's design philosophy is to dedicate one or more machines to holding the metadata and let the remaining machines store the file contents.

NameNode and DataNode are the two main components of HDFS: metadata is stored on the NameNode, and data is stored on the cluster of DataNodes. The NameNode not only manages the metadata stored in HDFS, but also records which nodes are part of the cluster, how many copies of each file exist, and so on. It also decides what the system needs to do when a cluster node goes down or a data copy is lost.

Multiple copies (replicas) of each block stored in HDFS are kept on different servers. In essence, the NameNode is the HDFS master server and the DataNodes are the slave servers.
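The separation of metadata and data can be pictured with a toy in-memory model. This is an illustrative sketch only; the class and attribute names below are hypothetical and are not Hadoop's actual API.

```python
class NameNode:
    """Holds metadata only: which blocks make up a file, and where they live."""
    def __init__(self):
        self.file_to_blocks = {}   # file name -> list of block IDs
        self.block_locations = {}  # block ID  -> set of DataNode IDs

class DataNode:
    """Holds data only: block ID -> raw bytes stored on this node."""
    def __init__(self, node_id):
        self.node_id = node_id
        self.blocks = {}

namenode = NameNode()
datanodes = {n: DataNode(n) for n in "ABD"}

# Record one block of "zhou.log", replicated on nodes A, B, and D.
namenode.file_to_blocks["zhou.log"] = ["blk_0"]
namenode.block_locations["blk_0"] = {"A", "B", "D"}
for n in ("A", "B", "D"):
    datanodes[n].blocks["blk_0"] = b"log contents"

# The NameNode knows *where* the data is, never *what* the data is.
print(sorted(namenode.block_locations["blk_0"]))  # ['A', 'B', 'D']
```

Note that the file's bytes never touch the NameNode object: it only maps names to block IDs and block IDs to locations, which is exactly the division of labor described above.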

3. HDFS writing process

The NameNode is responsible for managing the metadata of all files stored in HDFS. It confirms the client's request and records the file name and the set of DataNodes that will store the file. It keeps this information in an in-memory file allocation table.

For example, suppose the client sends a request to the NameNode saying it wants to write the file "zhou.log" to HDFS. The execution process, shown in Figure 1, is as follows:

Step 1: The client sends a message to the NameNode saying that the file "zhou.log" is to be written. (① in Figure 1)

Step 2: The NameNode replies to the client, telling it to write to DataNodes A, B, and D, and to contact DataNode B. (② in Figure 1)

Step 3: The client sends a message to DataNode B, asking it to save a copy of "zhou.log" and to forward a copy to DataNode A and DataNode D. (③ in Figure 1)

Step 4: DataNode B sends a message to DataNode A, asking it to save a copy of "zhou.log" and to forward a copy to DataNode D. (④ in Figure 1)

Step 5: DataNode A sends a message to DataNode D, asking it to save a copy of "zhou.log". (⑤ in Figure 1)

Step 6: DataNode D sends a confirmation message to DataNode A. (⑤ in Figure 1)

Step 7: DataNode A sends a confirmation message to DataNode B. (④ in Figure 1)

Step 8: DataNode B sends a confirmation message to the client, indicating that the write is complete. (⑥ in Figure 1)

Figure 1. Schematic diagram of the HDFS write process
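The eight steps above form a chained replication pipeline: data flows client → B → A → D, and acknowledgments flow back D → A → B → client. A minimal simulation of that pattern is sketched below; the function and variable names are illustrative, not Hadoop code.

```python
def write_pipeline(filename, data, pipeline, storage):
    """Simulate chained replication: forward the block down the pipeline,
    then pass acknowledgments back up in reverse order."""
    # Forward pass: each node in the chain stores the block, then hands
    # it to the next node (here modeled as a simple loop).
    for node in pipeline:
        storage.setdefault(node, {})[filename] = data
    # Ack pass: confirmations travel from the last node back to the client.
    acks = list(reversed(pipeline))
    return acks  # the write is complete only when every node has acked

storage = {}
acks = write_pipeline("zhou.log", b"hello", ["B", "A", "D"], storage)
print(acks)             # ['D', 'A', 'B'] -- the last node acks first
print(sorted(storage))  # ['A', 'B', 'D'] -- all three replicas exist
```

The client only talks to the first DataNode in the chain; replication to the rest happens node-to-node, which keeps the client's outbound bandwidth requirement constant regardless of the replication factor.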

In the design of distributed file systems, one challenge is ensuring data consistency. In HDFS, data is not considered written until all the DataNodes asked to save it confirm that they have copies of the file. Data consistency is therefore established during the write phase: no matter which DataNode a client chooses to read from, it will get the same data.

4. HDFS reading process

To understand the read process, think of a file as consisting of data blocks stored on DataNodes. The process by which the client reads back the previously written content, shown in Figure 2, is as follows:

Step 1: The client asks the NameNode where it should read the file from. (① in Figure 2)

Step 2: The NameNode sends the block information to the client. (The block information contains the IP addresses of the DataNodes holding copies of the file, plus the block IDs the DataNodes need to locate the blocks on their local disks.) (② in Figure 2)

Step 3: The client checks the block information, contacts the relevant DataNodes, and requests the blocks. (③ in Figure 2)

Step 4: The DataNodes return the file contents to the client and close the connection, completing the read. (④ in Figure 2)

Figure 2. Schematic diagram of the HDFS read process

The client fetches a file's data blocks from different DataNodes in parallel, then joins the blocks together to form the complete file.

5. Quickly recover from hardware failure through copy

When everything is working properly, each DataNode periodically sends a heartbeat to the NameNode (every 3 seconds by default). If the NameNode does not receive a heartbeat within the expected window (10 minutes by default), it assumes the DataNode has failed, removes it from the cluster, and starts a process to recover the data. A DataNode may leave the cluster for many reasons: hard disk failure, motherboard failure, power-supply aging, network failure, and so on.

For HDFS, losing a DataNode means losing the copies of the blocks stored on its disks. As long as more than one copy exists at all times (the default replication factor is 3), the failure does not cause data loss. When a disk fails, HDFS detects that the replica counts of the blocks stored on it have dropped below the required number, and proactively creates new copies until the target count is restored.
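The failure-detection and re-replication logic can be sketched in a few lines, assuming the defaults mentioned above (heartbeats every 3 seconds, a node declared dead after 10 minutes of silence, and a target of 3 replicas per block). All names here are illustrative.

```python
HEARTBEAT_TIMEOUT = 600  # seconds: 10 minutes of silence means "dead"
TARGET_REPLICAS = 3      # default HDFS replication factor

def live_nodes(last_heartbeat, now):
    """Nodes whose most recent heartbeat is within the timeout window."""
    return {n for n, t in last_heartbeat.items() if now - t <= HEARTBEAT_TIMEOUT}

def under_replicated(block_locations, live):
    """Blocks whose live replica count has dropped below the target."""
    return {b for b, nodes in block_locations.items()
            if len(nodes & live) < TARGET_REPLICAS}

# Node D was last heard from 800 seconds ago -- past the timeout.
heartbeats = {"A": 1000, "B": 1000, "C": 1000, "D": 200}
blocks = {"blk_0": {"A", "B", "D"}, "blk_1": {"A", "B", "C"}}

live = live_nodes(heartbeats, now=1000)
print(under_replicated(blocks, live))  # {'blk_0'} -- needs a new copy
```

Once a block lands in the under-replicated set, the NameNode would pick a healthy DataNode holding a surviving replica and instruct it to copy the block to another live node, restoring the target count.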

6. Split files across multiple DataNode

In HDFS, a file is split into blocks, typically 64 MB to 128 MB each, and each block is written to the file system. Different blocks of the same file are not necessarily saved on the same DataNode. The advantage of this is that when operating on such a file, different parts of it can be read and processed in parallel.

When a client is ready to write a file to HDFS and asks the NameNode where to write it, the NameNode tells the client which DataNodes each block can be written to. After writing one batch of blocks, the client goes back to the NameNode for a new DataNode list and writes the next batch of blocks to the DataNodes in that list.
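The splitting step itself is simple fixed-size chunking of the byte stream, with only the final block allowed to be short. The sketch below uses an 8-byte block size purely for demonstration; real HDFS blocks are 64-128 MB as noted above.

```python
def split_into_blocks(data, block_size):
    """Chunk a byte string into fixed-size blocks; the last may be short."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

blocks = split_into_blocks(b"abcdefghijklmnopqrst", 8)
print([len(b) for b in blocks])  # [8, 8, 4] -- only the last block is short
```

Because each block is an independent unit with its own replica set, a 1 GB file split into 128 MB blocks can be read by eight readers at once, one per block, which is what makes the parallel processing described above possible.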

That is how HDFS works. If you have had similar questions, the analysis above may help you understand it; to learn more, you are welcome to follow the industry information channel.
