

How does HDFS work?

2025-02-28 Update From: SLTechnology News&Howtos


This article explains how HDFS works. It is concise and easy to understand, and I hope you will get something out of this detailed introduction.

First: writing data to HDFS from the client.

When we write data to HDFS, the client does two things. First, the data file is split into blocks; the default block size is 64 MB in older Hadoop versions and 128 MB in newer ones, and it is configurable. Second, the client asks the namenode for a set of datanodes (three by default) to store each block. The namenode does not pick three datanodes at random and hand them to the client: it chooses datanodes close to the client, where distance is derived from the network topology (roughly, the number of hops from each node up to their closest common ancestor, such as a shared rack switch). Having chosen the datanodes, the namenode sorts them by distance from the client and returns the sorted list.
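To make the block-splitting and datanode-ranking steps concrete, here is a minimal Python sketch. This is not the real HDFS code: the topology path strings (`"/rackN/hostM"`) and all function names are invented for illustration, and the 128 MB constant simply mirrors the default mentioned above.

```python
# Sketch (not real HDFS code): split a file into fixed-size blocks and
# rank datanodes by network-topology distance from the client.

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the newer default mentioned above

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs covering the whole file."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

def topo_distance(a, b):
    """Hops from a up to the closest common ancestor, plus hops down to b."""
    pa, pb = a.strip("/").split("/"), b.strip("/").split("/")
    common = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        common += 1
    return (len(pa) - common) + (len(pb) - common)

def sort_by_distance(client, datanodes):
    """Nearest-first ordering, as the namenode returns to the client."""
    return sorted(datanodes, key=lambda dn: topo_distance(client, dn))
```

With this metric, a datanode on the same rack as the client is at distance 2, and one on another rack is at distance 4, so same-rack replicas sort first.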

Next, the client transfers the data to the three datanodes returned by the namenode. For each block, it first writes to the nearest datanode. How does the client know the write succeeded? When the client sends data to a datanode, it also computes a checksum of the block, and that checksum is passed along with the data. After storing the data, the datanode computes its own checksum for the block and compares it with the client's; if they match, the data was saved correctly, so the datanode acks the client and also tells the namenode that the block has been stored. The remaining replicas are written in much the same way, except that the data is not all sent by the client: the datanodes forward it to one another in a pipeline. Each datanode that successfully stores a replica acks the client and informs the namenode. Once the client has received acks from all datanodes, it tells the namenode that all blocks have been written. On receiving that message, the namenode maintains two tables: one mapping each block to the datanodes that hold it, and one recording the pipeline through which each replica was stored.
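The checksum handshake along the write pipeline can be sketched like this, using CRC32 as a stand-in for HDFS's real per-chunk checksums. The `FakeDatanode` class and its methods are invented for illustration; each node stores the block, forwards it downstream, and acks only if its own checksum matches the client's.

```python
import zlib

class FakeDatanode:
    """Toy replica in a write pipeline (illustrative, not HDFS internals)."""
    def __init__(self, name, downstream=None):
        self.name = name
        self.downstream = downstream   # next datanode in the pipeline, if any
        self.blocks = {}

    def write_block(self, block_id, data, checksum):
        self.blocks[block_id] = data
        if zlib.crc32(data) != checksum:
            return False               # stored copy is corrupt: no ack
        if self.downstream:            # forward the replica down the pipeline
            if not self.downstream.write_block(block_id, data, checksum):
                return False
        return True                    # ack travels back toward the client

def client_write(pipeline_head, block_id, data):
    """Client computes the checksum once and sends it with the data."""
    checksum = zlib.crc32(data)
    return pipeline_head.write_block(block_id, data, checksum)
```

The key point the sketch captures is that the client pushes the bytes only once, to the nearest datanode; replication to the second and third replicas happens datanode-to-datanode.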

Second: reading data from HDFS.

When a client wants to read data from HDFS, it first needs to know where the data is stored. Who knows that? The namenode, of course: it holds the metadata for every block. The client asks the namenode for the file's block locations, and the namenode returns the relevant information, that is, which datanodes hold each block and which blocks each of those datanodes stores. As before, the datanodes are sorted by distance from the client. With that information in hand, the client downloads each block from the nearest datanode first.
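The read path above can be sketched as follows. The `block_locations` list stands in for the namenode's reply (datanodes already sorted nearest-first), and `fetch` is an assumed callable standing in for the actual block transfer from a datanode.

```python
def read_file(block_locations, fetch):
    """Reassemble a file from its blocks.

    block_locations: list of (block_id, [datanodes, nearest first]),
                     as a toy namenode would return them.
    fetch(datanode, block_id): returns that block's bytes.
    """
    data = b""
    for block_id, datanodes in block_locations:
        data += fetch(datanodes[0], block_id)  # read from nearest replica
    return data
```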

Third: fault tolerance.

Because HDFS is a distributed file system, anything can happen on the network: a datanode can die, a datanode may fail to return the data the client wants, data can be corrupted while a datanode is saving it, or, most seriously of all, the namenode itself can go down.

Let's look at these four problems and how Hadoop handles each of them.

First, a datanode dies. While HDFS is running, every datanode periodically reports its health: each datanode sends a heartbeat to the namenode every three seconds to prove it is still alive. If the namenode stops receiving heartbeats from a datanode for long enough (roughly ten minutes by default), it declares that datanode dead.
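A toy version of the namenode's heartbeat bookkeeping makes the rule concrete. The three-second interval comes from the text; the ~10.5-minute expiry is an assumed default used here only for illustration.

```python
HEARTBEAT_INTERVAL = 3   # seconds between datanode heartbeats (from the text)
DEAD_AFTER = 630         # assumed expiry: namenode declares death after ~10.5 min

class HeartbeatTable:
    """Toy namenode-side liveness tracking (illustrative only)."""
    def __init__(self):
        self.last_seen = {}

    def heartbeat(self, datanode, now):
        """Record that `datanode` reported in at time `now` (seconds)."""
        self.last_seen[datanode] = now

    def dead_nodes(self, now):
        """Datanodes whose last heartbeat is older than the expiry window."""
        return [dn for dn, t in self.last_seen.items() if now - t > DEAD_AFTER]
```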

The next problem: a datanode cannot return the data the client wants. What if a datanode gives no response when the client tries to read from or write to it? If the client does not receive an ack from the datanode within a timeout, it too concludes that the datanode is dead, skips it, and tries the next datanode in the list.
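The client-side fallback might look like this sketch, where `fetch` is an assumed callable that raises on an unresponsive datanode; the function name is invented.

```python
def read_with_failover(datanodes, block_id, fetch):
    """Try replicas in nearest-first order, skipping unresponsive datanodes."""
    for dn in datanodes:
        try:
            return fetch(dn, block_id)
        except (TimeoutError, ConnectionError):
            continue          # treat this datanode as dead, try the next one
    raise IOError(f"no live replica for block {block_id}")
```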

The third problem: the data stored on a datanode is corrupted. What then? Each datanode regularly reports the health of the blocks it stores, judged by checksums. When the namenode receives these block reports, it learns which blocks are damaged and updates the two tables it maintains, that is, which datanodes hold each block and which blocks each datanode stores. If the namenode finds that a block has fewer healthy replicas than required, it instructs another datanode to copy that block from a datanode that still holds a good replica.
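A simplified sketch of that re-replication decision follows. The data structures, function name, and target replica count are illustrative stand-ins, not the real namenode internals: corrupt replicas are dropped from the block map, and copy commands are issued until each block is back at the target count.

```python
TARGET_REPLICAS = 3   # assumed target, matching the default of three replicas

def plan_repairs(block_map, corrupt, available_nodes):
    """block_map: {block_id: set of datanodes holding it}
    corrupt: set of (block_id, datanode) pairs reported as damaged.
    Returns copy commands as (block_id, source_dn, new_dn) tuples."""
    commands = []
    for block_id, holders in block_map.items():
        bad = {dn for b, dn in corrupt if b == block_id}
        holders -= bad                      # forget the corrupt replicas
        while len(holders) < TARGET_REPLICAS:
            candidates = [dn for dn in available_nodes
                          if dn not in holders and dn not in bad]
            if not holders or not candidates:
                break                       # block lost, or nowhere to copy
            new_dn = candidates[0]
            commands.append((block_id, next(iter(holders)), new_dn))
            holders.add(new_dn)
    return commands
```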

The fourth problem, the namenode going down, is the most serious. A common mitigation is to add an auxiliary node, the SecondaryNameNode, which periodically checkpoints the namenode's metadata (merging the edit log into the fsimage snapshot), so that as little metadata as possible is lost if the namenode fails.
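A toy checkpoint shows the idea of merging logged edits into the last metadata snapshot. The edit-log format and the use of a block-to-datanode map as the "fsimage" are invented for illustration; the real fsimage holds the filesystem namespace.

```python
def apply_edit(fsimage, edit):
    """Replay one (op, block_id, datanode) record onto the snapshot."""
    op, block_id, datanode = edit
    if op == "add":
        fsimage.setdefault(block_id, set()).add(datanode)
    elif op == "remove":
        fsimage.get(block_id, set()).discard(datanode)

def checkpoint(fsimage, edit_log):
    """Return a fresh snapshot with all logged edits merged in."""
    merged = {b: set(dns) for b, dns in fsimage.items()}  # copy, don't mutate
    for edit in edit_log:
        apply_edit(merged, edit)
    return merged
```

After a checkpoint, a restarted namenode only has to replay the edits made since the latest merge, which is what bounds the metadata loss.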

That is the working principle of HDFS. Did you pick up some new knowledge or skills? If you want to learn more or enrich your knowledge, you are welcome to follow the industry information channel.
