
How HDFS Reads and Writes Data

2025-01-20 Update From: SLTechnology News&Howtos


Shulou (Shulou.com) 05/31 Report --

This article explains in detail how HDFS reads and writes data. The editor finds it very practical and shares it here for reference; I hope you gain something from reading it.

I. Write operation

Premise: the file size is 200 MB and the block size is 128 MB, so the file is split into two blocks: block1 (128 MB) and block2 (72 MB). A block smaller than the block size does not occupy a full block on disk, only its actual size.
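The split described above is simple arithmetic. A minimal sketch (sizes in MB, not real HDFS code):

```python
BLOCK_SIZE = 128  # default HDFS block size in MB (configurable via dfs.blocksize)

def split_into_blocks(file_size_mb):
    """Split a file into HDFS-style blocks; the last block keeps its actual size."""
    blocks = []
    remaining = file_size_mb
    while remaining > 0:
        blocks.append(min(BLOCK_SIZE, remaining))
        remaining -= BLOCK_SIZE
    return blocks

# A 200 MB file yields block1 = 128 MB and block2 = 72 MB.
print(split_into_blocks(200))  # [128, 72]
```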

Flow chart of write operation

1. The client initiates a write request to the NameNode.

2. The NameNode returns a list of available DataNodes.

If the client is itself a DataNode, block replicas are placed as follows: replica 1 on the client's own node; replica 2 on a node on a different rack; replica 3 on another node on the same rack as replica 2; any further replicas on randomly selected nodes.

If the client is not a DataNode, replica 1 is placed on a randomly selected node; replica 2 on a node on a different rack from replica 1; replica 3 on another node on the same rack as replica 2; any further replicas on randomly selected nodes.
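The two placement rules above can be sketched as one function. This is a simplified illustration, not Hadoop's actual BlockPlacementPolicyDefault; node and rack names are hypothetical:

```python
import random

def choose_replica_nodes(nodes, client_node=None, replicas=3):
    """Pick DataNodes for block replicas following the default placement rule.

    `nodes` maps node name -> rack id. Assumes each rack has at least two nodes.
    """
    # Replica 1: the client's own node if it is a DataNode, otherwise a random node.
    first = client_node if client_node in nodes else random.choice(list(nodes))
    # Replica 2: a node on a different rack from replica 1.
    second = random.choice([n for n in nodes if nodes[n] != nodes[first]])
    # Replica 3: another node on the same rack as replica 2.
    third = random.choice(
        [n for n in nodes if nodes[n] == nodes[second] and n != second]
    )
    chosen = [first, second, third]
    # Any further replicas are placed randomly on the remaining nodes.
    if replicas > 3:
        extra = [n for n in nodes if n not in chosen]
        chosen += random.sample(extra, replicas - 3)
    return chosen
```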

3. The client sends block1 to the first DataNode. The data is first written into a buffer inside the FSDataOutputStream object and then divided into data packets. Each packet is 64 KB and consists of a set of chunks plus their corresponding checksums. The default chunk size is 512 bytes, and each checksum is computed over those 512 bytes of data.
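The chunk-plus-checksum layout can be sketched as follows. This is an illustration of the idea, not HDFS's actual packet format: it assumes a 4-byte CRC32 checksum per 512-byte chunk (HDFS uses CRC32-family checksums) and ignores packet header overhead:

```python
import zlib

CHUNK_SIZE = 512         # bytes of data per chunk (dfs.bytes-per-checksum)
CHECKSUM_SIZE = 4        # a CRC32 checksum occupies 4 bytes
PACKET_SIZE = 64 * 1024  # default packet size

def build_chunks(data):
    """Split data into 512-byte chunks, each paired with its CRC32 checksum."""
    chunks = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        checksum = zlib.crc32(chunk).to_bytes(CHECKSUM_SIZE, "big")
        chunks.append((checksum, chunk))
    return chunks

# Roughly how many full chunks fit in one 64 KB packet (header ignored):
chunks_per_packet = PACKET_SIZE // (CHUNK_SIZE + CHECKSUM_SIZE)
```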

When the byte stream written by the client reaches the size of a packet, the packet is built and placed in the dataQueue. The DataStreamer thread continually takes packets from the dataQueue, sends each one to the first DataNode in the replication pipeline, and moves it from the dataQueue to the ackQueue. The ResponseProcessor thread receives acks from the DataNodes; a successful ack means every DataNode in the pipeline has received the packet, so the ResponseProcessor removes that packet from the ackQueue.

If an error occurs during sending, all outstanding packets are moved from the ackQueue back to the dataQueue, a new pipeline is created that excludes the failed DataNodes, and the DataStreamer thread resumes sending packets from the dataQueue.

The internal process can be described in three steps:

Create Packet

As the client writes data, it caches the byte stream in an internal buffer. Each time the buffer holds a full chunk (512 bytes), the chunk's checksum is computed and the checksum plus the chunk data are written into the current Packet object. This repeats until the packet reaches its full size (64 KB), at which point the packet is placed in the dataQueue to wait for the DataStreamer thread to take it out and send it to a DataNode.

Send Packet

The DataStreamer thread takes a Packet object from the dataQueue, puts it into the ackQueue, and then sends the packet's data to the DataNode.

Receive ack

After a packet is sent, the ResponseProcessor thread waits for its ack. Receiving a successful ack means the packet was delivered to every DataNode in the pipeline, and the ResponseProcessor removes the corresponding packet from the ackQueue.
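The three steps above form a producer/consumer hand-off between two queues. A minimal single-process sketch (the real DataStreamer and ResponseProcessor live inside DFSOutputStream and talk to remote DataNodes; here a list stands in for the network):

```python
import queue
import threading

def run_pipeline(packets):
    """Simulate DataStreamer/ResponseProcessor moving packets between queues."""
    data_queue, ack_queue = queue.Queue(), queue.Queue()
    delivered, acked = [], []

    def data_streamer():
        while True:
            pkt = data_queue.get()
            if pkt is None:            # sentinel: stream is finished
                ack_queue.put(None)    # propagate shutdown to the ack side
                break
            ack_queue.put(pkt)         # move to ackQueue before sending
            delivered.append(pkt)      # "send" to the first DataNode

    def response_processor():
        while True:
            pkt = ack_queue.get()
            if pkt is None:
                break
            acked.append(pkt)          # a successful ack drops it from ackQueue

    t1 = threading.Thread(target=data_streamer)
    t2 = threading.Thread(target=response_processor)
    t1.start(); t2.start()
    for p in packets:
        data_queue.put(p)
    data_queue.put(None)
    t1.join(); t2.join()
    return delivered, acked
```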

4. When DataNode1 receives a packet and writes it successfully, it forwards the packet to DataNode2 while accepting the next packet, and so on down the pipeline.

5. After block1 is fully sent, block2 is sent the same way. When the last block has been written, the DataNodes notify the client, the client tells the NameNode that the write is finished, and the stream is closed with close().

II. Read operation

1. The client sends a read request to the NameNode.

2. The NameNode returns a list of block locations sorted by network distance, nearest first.

Network distance:

If the client is a DataNode, blocks stored on the client's own node are read first.

If the client is not a DataNode, blocks are read in order of preference: a process on the same node -> a different node on the same rack -> a node on a different rack in the same data center -> a node in a different data center.
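The preference order above can be sketched as a distance function used to sort replica locations. The weights 0/2/4/6 mirror Hadoop's NetworkTopology convention; the tuple representation of a location is an assumption for illustration:

```python
def network_distance(client, node):
    """Distance between two (name, rack, datacenter) locations:
    0 = same node, 2 = same rack, 4 = same data center, 6 = different data center."""
    if node[0] == client[0]:
        return 0
    if node[2] != client[2]:
        return 6
    if node[1] == client[1]:
        return 2
    return 4

def sort_block_locations(client, replicas):
    """Order replica locations nearest-first, as the NameNode does for a read."""
    return sorted(replicas, key=lambda n: network_distance(client, n))
```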

3. The client reads the data from the DataNodes according to the block locations.

This concludes the article on how HDFS reads and writes data. I hope the content above is helpful and teaches you something new. If you found the article useful, please share it so more people can see it.
