2025-03-01 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/01 Report--
This article explains how HDFS reads and writes data. The walkthrough is fairly detailed and should serve as a useful reference; interested readers are encouraged to follow along.
Write path:
Example: upload a 200 MB file to an HDFS cluster. The process runs as follows:

1. The client sends an upload request to the NameNode, which validates it: does the client have permission, does the upload path exist, does the file already exist, is this an overwrite, and so on.
2. The NameNode responds that the client may upload the file.
3. The client splits the 200 MB of data according to the cluster's block-size setting (128 MB by default), producing one 128 MB block and one 72 MB block, and asks the NameNode for DataNode locations.
   a. The client first requests upload of the first block (bytes 0 through 128 MB). Because the data actually lives on DataNodes, the NameNode returns the nodes that will store it, say dn1, dn2, and dn3, chosen by the replica placement policy (see the NameNode section of the figure). Why return three nodes when the client sends to only one? If only one node were returned and that node died, the client would not know where else to send the data; with three, if dn1 fails the client can fall back to dn2 or dn3.
4. With the DataNode list in hand, the client establishes connections to the DataNodes and requests a block transmission pipeline before writing any data.
   a. The client is also selective here: it picks the node closest to itself (assume dn1) and asks it to open a channel.
   b. Once the client-to-dn1 channel is up, dn1 internally asks dn2 to open a channel, and dn2 in turn asks dn3. dn3 acknowledges dn2, dn2 acknowledges dn1, and dn1 acknowledges the client; only when every link succeeds is the pipeline ready.
5. A ready pipeline does not mean the 128 MB is transferred in one shot: the client streams data to dn1 in units of packets (packet size: 64 KB). Each transfer carries 64 KB; the final packet may hold less.
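The block-splitting arithmetic in step 3 can be sketched as follows. This is an illustrative helper, not the HDFS client API; the 128 MB constant mirrors the default `dfs.blocksize` mentioned above.

```python
# Sketch of HDFS block splitting: a file is cut into fixed-size blocks,
# with the last block holding whatever remains.
BLOCK_SIZE = 128 * 1024 * 1024  # default dfs.blocksize: 128 MB

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the size in bytes of each block the file occupies."""
    blocks = []
    remaining = file_size
    while remaining > 0:
        blocks.append(min(block_size, remaining))
        remaining -= block_size
    return blocks

# A 200 MB upload yields one full 128 MB block and one 72 MB block,
# matching the example in the text.
sizes = split_into_blocks(200 * 1024 * 1024)
```

Each block in the returned list then goes through its own pipeline setup and packet-by-packet transfer.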
   a. The client transfers each 64 KB packet to dn1's memory; dn1 writes it from memory to its local disk and forwards it to dn2's memory; dn2 writes to disk and forwards it to dn3's memory; dn3 writes to disk.
   b. Logically, the smallest unit in the transfer is the chunk (512 bytes). Reading through the 0–128 MB block, the client first fills a 512-byte chunk; when the chunk is full, it computes a checksum over the chunk's data (4 bytes), and the resulting 516 bytes go into the packet. A packet therefore holds many chunks.
How does each node keep track of packets in flight? HDFS maintains an internal buffer queue called the dataQueue. Each packet to be transmitted is placed in the dataQueue; once dn1 has received a packet, it is moved from the dataQueue to a second queue, the ackQueue (reply queue). Only when dn1, dn2, and dn3 have all successfully written the data is the packet removed from the ackQueue, marking that packet's transfer complete.
6. If a node fails while the pipeline is being established or while data is being transferred, say dn2 goes down, the client re-requests a transmission pipeline. The failed dn2 no longer responds, so dn1 and dn3 connect to each other directly and dn2 drops out of the pipeline. The block is now short one replica, and the cluster will later re-replicate it onto another machine in place of dn2, ensuring the actual number of replicas matches the configured replication factor.
7. When the block's data transfer completes, the stream is closed and the write of the second block proceeds with the same steps as above.
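The chunk/packet arithmetic and the dataQueue/ackQueue bookkeeping described above can be sketched in a few lines. This is a simplified, single-threaded model for illustration (the real client runs these queues on separate threads); `make_packet`, `send`, and `ack` are hypothetical helpers, and CRC32 stands in for the chunk checksum.

```python
from collections import deque
import zlib

CHUNK_DATA = 512                 # payload bytes per chunk
CHUNK_CRC = 4                    # checksum bytes per chunk
PACKET_SIZE = 64 * 1024          # one packet: 64 KB

# How many 516-byte (data + checksum) chunks fit in a 64 KB packet.
CHUNKS_PER_PACKET = PACKET_SIZE // (CHUNK_DATA + CHUNK_CRC)

def make_packet(data):
    """Split raw bytes into (chunk, checksum) pairs, as the client does."""
    chunks = []
    for i in range(0, len(data), CHUNK_DATA):
        chunk = data[i:i + CHUNK_DATA]
        chunks.append((chunk, zlib.crc32(chunk)))  # 4-byte CRC per chunk
    return chunks

# Simplified dataQueue / ackQueue bookkeeping.
data_queue, ack_queue = deque(), deque()

def send(packet):
    data_queue.append(packet)    # packet queued for transmission
    pkt = data_queue.popleft()   # dn1 received it: move to the ackQueue
    ack_queue.append(pkt)

def ack():
    ack_queue.popleft()          # dn1, dn2, dn3 all wrote it: packet done
```

With 512 data bytes plus a 4-byte checksum per chunk, roughly 127 chunks fit in each 64 KB packet under these assumptions.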
Read path:
Example: download a 200 MB file from HDFS to the local machine.

1. The client sends a download request to the NameNode, and the NameNode returns the target file's metadata to the client.
2. Using that metadata, the client reads through an FSDataInputStream, establishing a transmission channel with the DataNode nearest to it.
   a. Note: to read data, the client only needs a channel to one node; a single successful read yields the real data.
3. Data is read in packets as well. Note that the first read may return only the first block; the second block may live on other nodes, in which case the client must establish a new channel with those nodes to read the remaining data.
4. Once all blocks have reached the client, they are cached there, then merged and written to the target file, guaranteeing the integrity of the data read.

That covers "how HDFS reads and writes data". Thank you for reading! We hope the content is helpful; for more related knowledge, you are welcome to follow the industry information channel.
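The read path above can be sketched as a small simulation: block locations come from the NameNode metadata, each block is fetched from the first reachable replica (nearest first), and the blocks are merged in order. The node names and toy data here are assumptions for illustration, not a real HDFS client.

```python
# Hypothetical metadata from the NameNode: block id -> replica nodes,
# ordered nearest-first from the client's point of view.
block_locations = {
    0: ["dn1", "dn2", "dn3"],
    1: ["dn2", "dn3", "dn4"],
}

# Toy stand-in for what each DataNode holds: (node, block id) -> bytes.
block_store = {
    ("dn1", 0): b"A" * 4,
    ("dn2", 1): b"B" * 3,
}

def read_file(locations, store):
    """Fetch each block from one replica and merge blocks in order."""
    parts = []
    for block_id in sorted(locations):
        for node in locations[block_id]:       # try nearest replica first
            data = store.get((node, block_id))
            if data is not None:               # one successful read suffices
                parts.append(data)
                break
    return b"".join(parts)                     # merge into the target file
```

Note how block 1 is absent from dn1, so the client "re-establishes a channel" with dn2 for it, mirroring step 3 of the read path.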
© 2024 shulou.com SLNews company. All rights reserved.