
What is the reading and writing process of HDFS?


Today, the editor will share with you what the reading and writing process of HDFS looks like. The content is detailed and the logic is clear. I believe most people still don't know much about this, so this article is shared for your reference. I hope you can get something out of it. Let's learn about it together.

1. The process of reading a file

As shown in the figure, the process of reading a file consists of the following six steps:

1. Open the distributed file: the client calls the DistributedFileSystem.open() method.
2. Request block locations: DistributedFileSystem calls the NameNode over RPC, and the NameNode returns the addresses of the DataNodes that hold replicas of the file's blocks; DistributedFileSystem then returns an input stream object (FSDataInputStream), which wraps a DFSInputStream.
3. Connect to a DataNode: the client calls the FSDataInputStream.read() method, and DFSInputStream connects to the DataNode holding the first block.
4. Fetch data from the DataNode: data is transferred from the DataNode to the client by calling read() repeatedly.
5. Read the next DataNode until the end of the file: when the end of a block is reached, DFSInputStream closes the connection to that DataNode and finds the DataNode for the next block.
6. Finish reading and close the connection: the client calls FSDataInputStream.close().

(A minimal client-side sketch of both the read and write paths is given after the write steps below.)

2. The process of writing a file

1. Send a request to create the file: the client calls the DistributedFileSystem.create() method.
2. The NameNode creates a file record: DistributedFileSystem sends an RPC request to the NameNode, which checks permissions and creates the record; the client receives an output stream FSDataOutputStream, which wraps a DFSOutputStream.
3. The client writes data: DFSOutputStream splits the data into packets and writes them to an internal data queue. The DataStreamer asks the NameNode to allocate suitable new blocks for storing the data replicas, and the list of DataNodes returned forms a pipeline (the DataNodes in the pipeline communicate over socket streams).
4. Data is transmitted through the pipeline: the DataStreamer streams each packet to the first DataNode in the pipeline, which forwards it to the second, and so on until the last DataNode has received it.
5. Acknowledgement queue: a DataNode sends an acknowledgement after receiving a packet, and the acknowledgements from the DataNodes in the pipeline form an acknowledgement queue; a packet is removed from the queue only after every DataNode in the pipeline has acknowledged it.
6. Close: the client calls the close() method on the stream, which writes all remaining data to the DataNode pipeline and waits for the acknowledgements before contacting the NameNode to signal that the file is complete.
7. NameNode confirmation.

Fault handling: if a DataNode fails during the write, the pipeline is first closed and all packets in the acknowledgement queue are put back into the data queue so that no packets are lost. The current block on the remaining healthy DataNodes is given a new identity, which is passed to the NameNode so that the incomplete block on the failed DataNode can be deleted once that node recovers. The failed DataNode is removed from the pipeline and the rest of the block's data is written to the remaining healthy DataNodes. When the NameNode notices that the block is under-replicated, it arranges for a new replica to be created on another node.
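To make these two flows concrete, here is a minimal client-side sketch using the Hadoop FileSystem API. It is not taken from the article: the class name HdfsReadWriteDemo and the path /tmp/demo.txt are only examples, and the sketch assumes a reachable HDFS cluster configured through fs.defaultFS (core-site.xml and hdfs-site.xml on the classpath). create() and open() trigger exactly the write and read paths described above; the packet queues, pipeline, and acknowledgements all happen inside the wrapped DFSOutputStream and DFSInputStream.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadWriteDemo {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from core-site.xml; with an hdfs:// URI,
        // FileSystem.get() returns a DistributedFileSystem instance.
        Configuration conf = new Configuration();
        Path file = new Path("/tmp/demo.txt");   // example path only

        try (FileSystem fs = FileSystem.get(conf)) {
            // Write path: create() asks the NameNode to record the new file,
            // and the returned stream pushes packets through the DataNode pipeline.
            try (FSDataOutputStream out = fs.create(file, true /* overwrite */)) {
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }   // close() flushes remaining packets and waits for acknowledgements

            // Read path: open() fetches block locations from the NameNode,
            // and read() then streams the data from the DataNodes block by block.
            try (FSDataInputStream in = fs.open(file)) {
                IOUtils.copyBytes(in, System.out, 4096, false /* keep System.out open */);
            }
        }
    }
}
```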

During reads it is inevitable to encounter network failures, corrupted data, DataNode failures, and other problems, all of which were taken into account in the design of HDFS. Let's take a look at how corrupted data is handled:

When a block is read from a DataNode, a checksum is calculated for the data; if it differs from the checksum recorded when the block was created, the block has been corrupted. The client then reads the block from another DataNode. The NameNode marks the replica as corrupted and copies the block to other DataNodes until the expected number of replicas for the file is restored. A DataNode also validates a block's checksum after the block file is created.
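Checksums are also exposed to clients. The sketch below is again illustrative rather than taken from the article (it reuses the hypothetical /tmp/demo.txt path from the previous example); it asks HDFS for the checksum it stores for a file via FileSystem.getFileChecksum():

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsChecksumDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            // Returns the file-level checksum HDFS derives from the per-block
            // CRCs recorded by the DataNodes (e.g. MD5-of-MD5-of-CRC32C),
            // or null if the underlying file system does not support checksums.
            FileChecksum checksum = fs.getFileChecksum(new Path("/tmp/demo.txt"));
            if (checksum != null) {
                System.out.println(checksum.getAlgorithmName() + " : " + checksum);
            }
        }
    }
}
```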

These are all the contents of the article "What is the reading and writing process of HDFS?". Thank you for reading! I believe you will gain a lot from it. The editor shares new knowledge every day; if you want to learn more, please follow the industry information channel.
