When learning Hadoop, two things cannot be bypassed: MapReduce and HDFS. The previous post walked through the processing flow of MapReduce; this one looks at HDFS.
HDFS is a distributed file system: it pools the storage of many machines into a single file system. In big data scenarios the capacity of a single machine is not enough, so a distributed approach is used to scale out and overcome the limits a local file system places on file size, number of files, number of open files, and so on. Let's first look at the architecture of HDFS.
HDFS architecture
As the figure above shows, the main components of HDFS are the Namenode, the Datanodes, and the Client, along with a few terms: Block, Metadata, Replication, and Rack. What do they each mean?
In a distributed file system, data is spread across many machines; the Datanodes are those machines, and they are where the data actually lives. Once data is stored, we need to know which Datanode holds it, and that is the Namenode's job: it records the metadata (Metadata), whose main content is which data block sits in which directory on which Datanode. This is also why HDFS is not suited to storing huge numbers of small files: to respond quickly, the Namenode keeps the file system's metadata in memory, so the number of files the system can hold is bounded by the Namenode's memory. Roughly, each file, directory, and Block occupies about 150 bytes of metadata; storing 1 million small files therefore means at least 2 million objects (one file entry plus one block each), or about 300MB of memory, yet those files hold very little actual data, which wastes memory. With the metadata in place, we can locate any data block through the Namenode, and the component that talks to the Namenode is the Client, which gives users an interface for accessing data.
As for the remaining terms: a Block is a data block; files on HDFS are usually large, so they are split into many blocks for storage. Replication means replica and exists for data reliability: if a block is lost, it can be recovered from a copy, and HDFS stores three replicas of each block by default. A Rack is a server rack, which can be understood as a place that houses multiple Datanodes.
To sum up, data is stored on Datanodes along with replicas, and the Namenode knows which Datanodes hold the data and its replicas. When we want to read or write data, the Client contacts the Namenode, which tells us where to write the data or where to fetch it from.
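To make this division of labour concrete, here is a minimal sketch using the standard Hadoop Java client to ask the Namenode where the blocks of a file live. It assumes a configured Hadoop client on the classpath and an existing file at a hypothetical path; it is only an illustration, not part of the original article.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);        // talks to the Namenode configured in fs.defaultFS

        Path file = new Path("/data/example.txt");   // hypothetical file
        FileStatus status = fs.getFileStatus(file);

        // The Namenode answers from its in-memory metadata:
        // for each block, which Datanodes hold a replica.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```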
HDFS read and write process
Having looked at the architecture, let's see how HDFS accesses data. First, the write process (a minimal client-side sketch follows the steps):
Note: the following steps do not strictly correspond to the actual order in which things happen.
1) The client, using the development library provided by HDFS, sends a request to the remote Namenode.
2) The Namenode checks whether the file to be created already exists and whether the creator has permission to operate on it. If the checks pass, it creates a record for the file; otherwise the client throws an exception.
3) When the client starts writing the file, it splits the file into multiple packets, manages them internally as a "data queue", and asks the Namenode for new blocks, obtaining the list of Datanodes that should store the replicas; the number of Datanodes depends on the replication factor configured on the Namenode.
4) The packets are written to all replicas through a pipeline: a packet is streamed to the first Datanode, which stores it and passes it on to the next Datanode in the pipeline, and so on until the last Datanode.
5) After the last Datanode stores a packet successfully, it returns an ack packet (acknowledgement packet), which is passed back to the client along the pipeline. The client maintains an "ack queue"; when the ack packet is received, the corresponding packet is removed from the ack queue, meaning that packet has been written successfully.
6) If a Datanode fails during transmission, the current pipeline is closed, the failed Datanode is removed from it, and the remaining data continues to be transmitted through the pipeline formed by the remaining Datanodes. At the same time, the Namenode assigns a new Datanode so as to maintain the configured replication count.
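None of the packet and pipeline machinery above is visible to application code; from user code the write path collapses into a few library calls. A minimal sketch using the standard Java FileSystem API, with a hypothetical output path:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.nio.charset.StandardCharsets;

public class HdfsWriteDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Steps 1-2: create() asks the Namenode to register the new file
        // (existence and permission checks happen there).
        Path out = new Path("/tmp/hdfs-write-demo.txt");   // hypothetical path
        try (FSDataOutputStream stream = fs.create(out, /* overwrite = */ true)) {
            // Steps 3-5: the client library splits what we write into packets
            // and pushes them through the Datanode pipeline; we just write bytes.
            stream.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
            stream.hflush();                                // make the data visible to readers
        }
        fs.close();
    }
}
```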
Next, the read process (again, a minimal client-side sketch follows the steps):
Note: the following steps do not strictly correspond to the actual order in which things happen.
1) The client, using the client library provided by HDFS, sends a request to the remote Namenode.
2) The Namenode returns part or all of the file's block list as appropriate, and for each block it returns the addresses of the Datanodes holding a replica of that block.
3) The client selects the Datanode closest to it to read each block from.
4) After finishing the current block, the client closes the connection to that Datanode and finds the best Datanode for the next block.
5) When the blocks in the current list have been read but the file is not yet finished, the client asks the Namenode for the next batch of blocks.
6) A checksum verification is performed after each block is read to detect errors. If an error occurs while reading from a Datanode, the client notifies the Namenode and continues reading from the next Datanode that holds a replica of that block.
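Seen from user code, the read path is again just a few library calls; choosing the nearest Datanode, checksumming, and switching replicas on error all happen inside the stream. A minimal sketch, assuming the hypothetical file written above exists:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class HdfsReadDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Steps 1-2: open() fetches the block list and replica locations
        // from the Namenode; steps 3-6 happen inside the returned stream.
        Path in = new Path("/tmp/hdfs-write-demo.txt");   // hypothetical path
        try (FSDataInputStream stream = fs.open(in);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(stream, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}
```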
That is the HDFS read and write process. With that understood, consider a few questions. For a distributed file system, how do we guarantee data consistency; in other words, what happens if multiple clients write to the same file? Also, the Namenode is clearly critical in HDFS, since it holds the key metadata, yet the architecture diagram shows only one Namenode. If it dies, how can the system keep working?
CAP theory in distributed systems
CAP theory is a classic theory in distributed systems; its content is as follows:
Consistency: whether all replicas of the data in a distributed system hold the same value at the same moment.
Availability: whether the cluster as a whole can still respond to clients' read and write requests after some nodes fail.
Partition tolerance: the system should keep providing service even if messages are lost inside the system (a network partition).
Consistency and availability are easy to understand, so the main thing to explain is partition tolerance. Because of the network, connections may drop, machines may go down, and network delays may prevent data exchange from completing within the expected time. Since network problems are unavoidable, this is something we always have to cope with; that is, partition tolerance must be guaranteed. To guarantee data reliability, HDFS adopts a replica strategy.
For consistency, HDFS offers a simple consistency model: an access pattern of write once, read many times. Append operations are supported, but data that has already been written cannot be changed. Files in HDFS are written once, and at any moment there can only be a single writer. When is a file considered written successfully? HDFS has two relevant parameters: dfs.namenode.replication.min (default 1) and dfs.replication (default 3). Once the number of replicas of a block reaches dfs.namenode.replication.min, the write is marked as successful; if the replica count is still below dfs.replication, the file is marked as under-replicated and HDFS keeps copying replicas to other Datanodes in the background. If we want the data to be considered written only after all target Datanodes confirm they hold a copy, we can set dfs.namenode.replication.min equal to dfs.replication. Data consistency is therefore handled in the write phase, and a client will get the same data no matter which Datanode it reads from.
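As a hedged illustration of the two parameters just mentioned: on a real cluster they are normally set in hdfs-site.xml, and dfs.namenode.replication.min is a Namenode-side setting, so the sketch below only shows the client-side pieces (the file path is the hypothetical one from the earlier examples).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationConfigDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Cluster-wide this would normally be set in hdfs-site.xml;
        // setting it here only affects files created by this client.
        conf.setInt("dfs.replication", 3);   // target number of replicas per block

        // dfs.namenode.replication.min is honoured by the Namenode, not the client;
        // it is mentioned here only to mirror the discussion above.
        // conf.setInt("dfs.namenode.replication.min", 3);

        FileSystem fs = FileSystem.get(conf);

        // The replication factor of an existing file can also be changed later:
        fs.setReplication(new Path("/tmp/hdfs-write-demo.txt"), (short) 3);
        fs.close();
    }
}
```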
For availability, HDFS 2.0 provides high availability (HA) for the Namenode. Note that the Secondary NameNode is not an HA solution. The Namenode must know where every data block is, so it persists the namespace in a file (the fsimage file) and, for fast response, records every file system operation in an edit log (the edits file) while keeping the metadata itself in memory. As operations accumulate, the edits file keeps growing, so the Secondary NameNode periodically merges the edits file into the fsimage file to shorten cluster startup time. When the NameNode fails, the Secondary NameNode cannot take over immediately and cannot even guarantee data integrity: if the NameNode's data is lost, any changes made to the file system after the last merge are lost. The process by which the Secondary NameNode merges the edits and fsimage files is shown in the figure below.
HDFS 2 currently offers two HA schemes: one based on NFS (Network File System) shared storage, and one based on the Quorum Journal Manager (QJM), which builds on the Paxos algorithm. The following figure shows the NFS shared-storage scheme.
As the figure above shows, high availability for the Namenode is achieved by setting up a Standby Namenode: if the current Namenode goes down, the standby takes over. The two Namenodes keep their state synchronized through shared storage. The key points are:
Shared storage is used to synchronize the edits information between the two NameNodes.
DataNodes report block information to both NameNodes simultaneously; this is necessary to keep the Standby NameNode's view of the cluster up to date.
A FailoverController process monitors and controls the NameNode processes (failover is triggered as soon as the active Namenode goes down).
Fencing, to prevent split-brain, i.e. to ensure there is only one active NameNode at any moment, covering three aspects:
Shared storage fencing: ensure that only one NameNode can write to the edits log.
Client fencing: ensure that only one NameNode can respond to client requests.
DataNode fencing: ensure that only one NameNode can issue commands to DataNodes, such as deleting or copying blocks.
The other scheme is QJM:
Simply put, to keep the Standby Node synchronized with the Active Node, both nodes communicate with a group of independent processes called JournalNodes (JNs). The basic principle is to use 2N+1 JournalNodes to store the edits: a write is considered successful once a majority of them (at least N+1) report success, so up to N JournalNode failures can be tolerated. For example, with 3 JournalNodes (N = 1), every write needs acknowledgements from at least 2 of them. The QJM scheme is based on the Paxos algorithm, which cannot be explained in two or three sentences; if you are interested, see the Zhihu column on the Paxos algorithm or the official documentation.
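For completeness, here is a hedged sketch of the client-side settings that let an application address an HA nameservice instead of a single Namenode host. The nameservice name "mycluster" and the host names are hypothetical, and in practice these entries live in core-site.xml / hdfs-site.xml rather than in code.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class HaClientConfigDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Hypothetical nameservice "mycluster" with two NameNodes nn1/nn2.
        conf.set("fs.defaultFS", "hdfs://mycluster");
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "namenode1.example.com:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "namenode2.example.com:8020");
        // The failover proxy provider is what retries against the other NameNode
        // when the active one becomes unreachable.
        conf.set("dfs.client.failover.proxy.provider.mycluster",
                 "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

        // The client now addresses the logical nameservice, not a single host.
        FileSystem fs = FileSystem.get(conf);
        System.out.println("Connected to: " + fs.getUri());
        fs.close();
    }
}
```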