This article explains how HDFS writes a file. The editor finds it very practical, so it is shared here as a reference; follow along for a closer look.
How is the file written to HDFS?
Let's walk through the write flow step by step.
Suppose we have a file test.txt that we want to put into HDFS. Execute the following command:

# hadoop fs -put /usr/bigdata/dataset/input/20130706/test.txt /opt/bigdata/hadoop/dataset/input/20130706

or, equivalently:

# hadoop fs -copyFromLocal /usr/bigdata/dataset/input/20130706/test.txt /opt/bigdata/hadoop/dataset/input/20130706
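The same upload can also be done programmatically through Hadoop's Java FileSystem API. The sketch below is a minimal example, assuming the cluster address (fs.defaultFS) is available from the configuration files on the classpath; the paths simply mirror the shell example above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PutFileExample {
    public static void main(String[] args) throws Exception {
        // Loads core-site.xml / hdfs-site.xml from the classpath (fs.defaultFS, etc.)
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Same effect as "hadoop fs -copyFromLocal <src> <dst>"
        Path src = new Path("/usr/bigdata/dataset/input/20130706/test.txt");
        Path dst = new Path("/opt/bigdata/hadoop/dataset/input/20130706");
        fs.copyFromLocalFile(src, dst);

        fs.close();
    }
}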
The whole writing process is as follows:
In the first step, the client calls the create() method of DistributedFileSystem to start creating a new file: DistributedFileSystem issues an RPC call asking the NameNode to create the new file in the file system's namespace, and creates a DFSOutputStream for the client to write to.
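In client code this step corresponds to calling create() on the FileSystem handle. A minimal sketch follows; the file path and content are illustrative only, and on a real HDFS cluster the FileSystem instance returned is a DistributedFileSystem.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.nio.charset.StandardCharsets;

public class CreateFileExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // create() triggers the RPC to the NameNode and returns an FSDataOutputStream
        // that wraps the DFSOutputStream used for the actual writing
        Path file = new Path("/opt/bigdata/hadoop/dataset/input/20130706/test.txt");
        try (FSDataOutputStream out = fs.create(file)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        } // close() flushes the remaining packets and completes the file (see steps 6 and 7)

        fs.close();
    }
}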
In the second step, after the NameNode receives the RPC request to create the file, it first performs various checks, such as whether the client has the required create permission and whether the file already exists. If the checks pass, the new file is created and the operation is recorded in the edit log; DistributedFileSystem then wraps the DFSOutputStream object in an FSDataOutputStream instance and returns it to the client. Otherwise, file creation fails and an IOException is thrown back to the client.
Third, the client starts writing the file: DFSOutputStream splits the data into packets and places them on an internal queue called the data queue. DFSOutputStream then asks the NameNode for a list of DataNode nodes suitable for storing the replicas, and those DataNodes form a pipeline for the data flow. Assuming the replication factor is set to 3, the pipeline contains three DataNode nodes.
Step 4: DFSOutputStream writes the packets to the first DataNode node in the pipeline; the first DataNode stores each packet and forwards it to the second node in the pipeline; similarly, the second node saves the received data and forwards it to the third DataNode node.
Step 5: DFSOutputStream also maintains a second internal queue of packets awaiting write confirmation, the ack queue. When the third DataNode node in the pipeline has successfully stored a packet, it returns an acknowledgement to the second DataNode; after the second DataNode has stored the packet and received that acknowledgement, it sends its own acknowledgement to the first DataNode node; once the first node has stored the packet and received the acknowledgement, the packet is removed from the ack queue.
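To make the two-queue bookkeeping concrete, here is a deliberately simplified toy model of how packets could move from a data queue to an ack queue and be dropped once the whole pipeline has acknowledged them. This is not the real DFSOutputStream code; the class and method names are invented for illustration, and the failure-handling method anticipates the recovery steps described next.

import java.util.ArrayDeque;
import java.util.Deque;

// Toy model of the data queue / ack queue handoff; not HDFS source code.
public class PipelineQueuesSketch {
    private final Deque<byte[]> dataQueue = new ArrayDeque<>(); // packets waiting to be sent
    private final Deque<byte[]> ackQueue = new ArrayDeque<>();  // packets sent but not yet acknowledged

    // The writer enqueues a packet split from the file being written.
    void enqueuePacket(byte[] packet) {
        dataQueue.addLast(packet);
    }

    // A streamer takes the next packet, sends it to the first DataNode in the
    // pipeline, and parks it on the ack queue until the pipeline confirms it.
    void sendNextPacket() {
        byte[] packet = dataQueue.pollFirst();
        if (packet == null) {
            return;
        }
        // ... transmit the packet to the first DataNode here ...
        ackQueue.addLast(packet);
    }

    // Called when the acknowledgement has propagated back through all DataNodes.
    void onPipelineAck() {
        ackQueue.pollFirst(); // the packet is now stored on every replica
    }

    // On a pipeline failure, unacknowledged packets go back to the front of the
    // data queue so they can be retransmitted through a rebuilt pipeline.
    void onPipelineFailure() {
        while (!ackQueue.isEmpty()) {
            dataQueue.addFirst(ackQueue.pollLast());
        }
    }
}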
What happens if a DataNode node in the pipeline fails while data is being written, and what does HDFS do internally? In that case the following actions are performed:
First, the pipeline is closed and the packets still in the ack queue are added back to the front of the data queue, so that no packets are lost. The block being written on the remaining healthy DataNodes is given a new identity, and that identity is passed to the NameNode so that the partially written block on the failed DataNode can be deleted after that node recovers.
Next, the version (generation stamp) of the saved block on the healthy DataNode nodes is bumped, so that the stale block data on the failed DataNode is deleted once that node returns to normal, and the failed node is removed from the pipeline.
Finally, the rest of the data is written to the other two nodes in the Pipeline data flow pipeline.
If multiple nodes in the pipeline fail while writing, then as long as the number of successfully written replicas reaches dfs.replication.min (default 1), the write is considered successful; the NameNode then replicates the block asynchronously to other nodes until the number of replicas reaches the value configured by dfs.replication.
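Both thresholds are ordinary HDFS configuration keys that can be read from client code. A small sketch follows; note that the property name for the minimum replica count is dfs.namenode.replication.min in recent Hadoop releases and dfs.replication.min in older ones, so the example falls back from one to the other.

import org.apache.hadoop.conf.Configuration;

public class ReplicationConfigExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Target number of replicas per block (default 3)
        int replication = conf.getInt("dfs.replication", 3);

        // Minimum replicas that must be written for the block to count as successful;
        // the key name varies by Hadoop version, default is 1
        int minReplication = conf.getInt("dfs.namenode.replication.min",
                conf.getInt("dfs.replication.min", 1));

        System.out.println("dfs.replication = " + replication);
        System.out.println("minimum replication = " + minReplication);
    }
}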
Step 6: after finishing the writes, the client calls close() on the stream, which flushes the remaining data.
Step 7: once the data has been flushed and acknowledged, the client notifies the NameNode that the file is complete and the output stream is closed. At this point, the whole write operation is finished.
Thank you for reading! That concludes this article on the HDFS file write process. We hope the content above is helpful; if you found the article useful, feel free to share it so more people can see it!