This article introduces the key points of architecture and design in the Hadoop distributed file system (HDFS). Many people run into these issues in real-world work, so the following sections walk through how each mechanism behaves. I hope you read carefully and get something out of it!
I. Robustness
The primary goal of the Hadoop distributed file system (HDFS) is to store data reliably even in the presence of failures. The three common types of failure are Namenode failures, Datanode failures, and network partitions.
1. Disk data errors, heartbeat detection, and re-replication
Each Datanode sends a heartbeat to the Namenode periodically. A network partition may cause some Datanodes to lose contact with the Namenode; the Namenode detects this through missing heartbeats, marks those Datanodes as dead, and stops sending new IO requests to them. Any data stored on a dead Datanode is no longer available. The death of a Datanode may cause the replica count of some blocks to fall below their specified value, so the Namenode keeps track of blocks that need to be replicated and starts replication whenever necessary. Re-replication may be needed when a Datanode fails, a replica becomes corrupted, a disk on a Datanode fails, or the replication factor of a file is increased.
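The heartbeat-based detection described above is driven by configuration settings rather than application code. The following is a minimal sketch of inspecting and tuning those settings through the Hadoop Java Configuration API; the key names assume Hadoop 2.x and may differ in other releases.

import org.apache.hadoop.conf.Configuration;

public class HeartbeatConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // How often each Datanode sends a heartbeat to the Namenode (seconds).
        conf.set("dfs.heartbeat.interval", "3");
        // How often the Namenode re-checks for Datanodes whose heartbeats have stopped (milliseconds).
        conf.set("dfs.namenode.heartbeat.recheck-interval", "300000");
        System.out.println("heartbeat interval = " + conf.get("dfs.heartbeat.interval") + " s");
    }
}

In practice these values are normally placed in hdfs-site.xml on the cluster; the code above only illustrates which knobs control dead-Datanode detection.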
2. Cluster balancing
HDFS supports data balancing schemes: if the free space on a Datanode falls below a certain threshold, a scheme may automatically move data from that Datanode to one with free space. When requests for a particular file suddenly increase, a scheme may also create additional replicas of the file and spread them across the cluster to meet the application's demand. These balancing schemes have not yet been implemented.
3. Data integrity
A data block fetched from a Datanode may arrive corrupted, which can be caused by a storage device error on the Datanode, a network error, or a software bug. The HDFS client software implements checksum verification of HDFS file contents. When a client creates an HDFS file, it computes a checksum for each block of the file and stores these checksums in a separate hidden file under the same HDFS namespace. When the client retrieves file contents, it verifies that the data received from each Datanode matches the checksum stored in the corresponding checksum file; if not, the client can choose to fetch that block's replica from another Datanode.
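The per-block verification above happens automatically during reads. The Java API additionally exposes a whole-file checksum that can be compared between two copies of a file. A minimal sketch follows; the file path is a hypothetical example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChecksumCheck {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/example.txt");   // hypothetical HDFS file
        // Returns an end-to-end checksum of the file; may be null on filesystems
        // that do not support it (e.g. the local filesystem).
        FileChecksum sum = fs.getFileChecksum(file);
        System.out.println(file + " checksum: " + sum);
        fs.close();
    }
}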
4. Metadata disk error
FsImage and EditLog are the core data structures of HDFS. If these files are corrupted, the entire HDFS instance becomes unusable. Therefore, the Namenode can be configured to maintain multiple copies of FsImage and EditLog; any change to FsImage or EditLog is synchronized to all copies. This synchronization may reduce the number of namespace transactions the Namenode can handle per second, but the cost is acceptable because HDFS applications are data-intensive rather than metadata-intensive. When the Namenode restarts, it always selects the most recent consistent FsImage and EditLog to use.
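Keeping redundant metadata copies is done by listing several directories for the Namenode's storage. A minimal sketch of the relevant setting is shown below; the directory paths are placeholders, and the key name dfs.namenode.name.dir assumes Hadoop 2.x (older releases use dfs.name.dir). This value is normally set in hdfs-site.xml.

import org.apache.hadoop.conf.Configuration;

public class NameDirConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Each listed directory receives a full copy of the FsImage and EditLog;
        // putting one on a separate disk or NFS mount guards against a single disk failure.
        conf.set("dfs.namenode.name.dir",
                 "/mnt/disk1/namenode,/mnt/disk2/namenode,/mnt/nfs/namenode");
        System.out.println(conf.get("dfs.namenode.name.dir"));
    }
}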
The Namenode is a single point of failure in HDFS; if the machine hosting the Namenode fails, manual intervention is required. The ability to automatically restart a Namenode that has gone out of service on another machine has not yet been implemented.
5. Snapshot
Snapshots support storing a copy of the data at a particular point in time, so that when HDFS data is corrupted it can be rolled back to a known good point in time. HDFS does not currently support snapshots.
II. Data organization
1. Data block
Applications that run on the Hadoop distributed file system (HDFS) deal with large data sets. These applications write data once but read it one or more times, and require reads at streaming speeds. HDFS supports write-once-read-many semantics for files. A typical block size is 64 MB, so a file is split into 64 MB chunks, with each chunk stored on a different Datanode where possible.
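A per-file block size (and replication factor) can be requested when a file is created through the Java API. The sketch below uses 64 MB to match the block size described above (newer Hadoop releases default to 128 MB); the path and values are illustrative assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CreateWithBlockSize {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        long blockSize = 64L * 1024 * 1024;              // 64 MB blocks
        FSDataOutputStream out = fs.create(
                new Path("/data/big-file.bin"),          // hypothetical path
                true,                                     // overwrite if it exists
                4096,                                     // I/O buffer size
                (short) 3,                                // replication factor
                blockSize);
        out.write("example payload".getBytes("UTF-8"));
        out.close();                                      // the file is committed at close
        fs.close();
    }
}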
2. Staging
A client's request to create a file does not reach the Namenode immediately. Instead, the HDFS client caches the file data in a local temporary file, and application writes are transparently redirected to this temporary file. When the data accumulated in the temporary file exceeds one block size (64 MB by default), the client contacts the Namenode. The Namenode inserts the file name into the file system hierarchy, allocates a data block for it, and returns the identity of a Datanode and the target data block to the client. The client then flushes the local temporary file to the specified Datanode. When the file is closed, any remaining unflushed data in the temporary file is also transferred to the specified Datanode, and the client tells the Namenode that the file has been closed. At this point, the Namenode commits the file creation operation to persistent storage. If the Namenode dies before the file is closed, the file is lost.
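From the application's point of view, this staging is invisible: the program simply writes to an output stream and the data becomes durable when the stream is closed. A minimal sketch of that write path is below; the path is a hypothetical example, and hflush/hsync are available on Hadoop 2.x and later.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StagedWrite {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FSDataOutputStream out = fs.create(new Path("/tmp/staged.txt")); // hypothetical path
        for (int i = 0; i < 1000; i++) {
            out.write(("line " + i + "\n").getBytes("UTF-8")); // buffered on the client side
        }
        out.hflush(); // make the data visible to new readers before close (Hadoop 2.x+)
        out.close();  // the Namenode commits the file creation when the file is closed
        fs.close();
    }
}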
This approach is the result of careful consideration of the target applications that run on HDFS. Without client-side caching, network speed and network congestion would significantly reduce write throughput.
3. Pipeline replication
When a client writes data to an HDFS file, it first writes to a local temporary file as described above. Assuming the file's replication factor is 3, the client obtains from the Namenode a list of Datanodes that will hold the replicas. The client then begins transferring data to the first Datanode, which receives the data in small portions (4 KB), writes each portion to its local repository, and simultaneously forwards it to the second Datanode. The second Datanode does the same, receiving each small portion, storing it locally, and forwarding it to the third Datanode, which simply receives and stores it. This is pipelined replication.
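The pipelining itself is handled inside HDFS; the client only streams bytes. The sketch below copies a local file into HDFS in 4 KB buffers, echoing the portion size mentioned above, while HDFS forwards each portion along the replica pipeline transparently. The file paths are illustrative assumptions.

import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class PipelinedCopy {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        InputStream in = new FileInputStream("/tmp/local-input.dat");         // hypothetical local file
        FSDataOutputStream out = fs.create(new Path("/data/replicated.dat")); // hypothetical HDFS path
        // Copy in 4096-byte buffers and close both streams when done.
        IOUtils.copyBytes(in, out, 4096, true);
        fs.close();
    }
}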
III. Accessibility
The Hadoop distributed file system (HDFS) provides several ways for applications to access it: interactively from the command line through the DFS shell, programmatically through the Java API or an API encapsulated in C, and through a browser. Access via the WebDAV protocol is under development. See the documentation for details on each method.
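As one example of these access methods, the following is a minimal sketch of reading a file through the Java API (the same file could also be printed from the DFS shell). The path is a hypothetical example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FSDataInputStream in = fs.open(new Path("/data/example.txt")); // hypothetical path
        IOUtils.copyBytes(in, System.out, 4096, false);                // print the file contents
        in.close();
        fs.close();
    }
}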
IV. Space reclamation
1. Deletion and recovery of files
When a user or application deletes a file, it is not immediately removed from HDFS. Instead, HDFS renames the file and moves it into the /trash directory. As long as the file remains in /trash, it can be quickly restored. The time a file is kept in /trash is configurable; when that time expires, the Namenode deletes the file from the namespace, which also releases the data blocks associated with the file. Note that there is a delay between a user deleting a file and the corresponding increase in HDFS free space.
While a deleted file is still in the /trash directory, a user who wants to restore it can browse /trash and retrieve the file. The /trash directory keeps only the most recent copy of each deleted file. /trash is no different from any other directory except for one thing: HDFS applies a special policy that automatically deletes the files in it. The current default policy is to delete files that have been kept there for more than 6 hours; this policy is expected to become configurable through a defined interface in the future.
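A minimal sketch of trash-based deletion and restoration through the Java API follows. It assumes trash is enabled (fs.trash.interval greater than 0), and the trash location shown (/user/<name>/.Trash/Current) is what recent HDFS releases use; older documentation, like the text above, refers to it simply as /trash. The file path is a hypothetical example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.Trash;

public class TrashExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/to-delete.txt"); // hypothetical path

        // Move the file into the current user's trash instead of deleting it outright.
        boolean trashed = Trash.moveToAppropriateTrash(fs, file, conf);
        System.out.println("moved to trash: " + trashed);

        // Restoring is just a rename back out of the trash directory.
        Path inTrash = new Path("/user/" + System.getProperty("user.name")
                + "/.Trash/Current/data/to-delete.txt");
        if (fs.exists(inTrash)) {
            fs.rename(inTrash, file);
        }
        fs.close();
    }
}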
2. Decreasing the replication factor
When the replication factor of a file is reduced, the Namenode selects excess replicas to delete and passes this information to the Datanodes on the next heartbeat; each Datanode then removes the corresponding blocks and frees the space. As with file deletion, there is a delay between the call to the setReplication method and the appearance of free space in the cluster.
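A minimal sketch of lowering a file's replication factor through the Java API is shown below; the path and target factor are illustrative assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LowerReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/example.txt");        // hypothetical path
        // Request fewer replicas; the Namenode schedules removal of the excess
        // copies, so free space appears only after the Datanodes act on it.
        boolean ok = fs.setReplication(file, (short) 2);
        System.out.println("replication change requested: " + ok);
        fs.close();
    }
}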
This concludes the introduction to the knowledge points of architecture and design in the Hadoop distributed file system. Thank you for reading!