2025-02-24 Update From: SLTechnology News&Howtos
Shulou (Shulou.com) 06/03 Report --
1. What is HDFS
HDFS (Hadoop Distributed File System) is the core sub-project of the Hadoop project and the foundation of data storage management in distributed computing. It was developed to meet the requirements of streaming access to very large files, and it runs on inexpensive commodity servers. It offers high fault tolerance, high reliability, high scalability, high availability, and high throughput, providing fault-tolerant storage for massive data and greatly simplifying the application and processing of very large data sets (Large Data Sets).
2. The origin of HDFS
HDFS originates from the GFS (Google File System) paper published by Google in October 2003.
3. The advantages and disadvantages of HDFS.
Advantages:
1. High fault tolerance
Multiple copies of each data block are saved automatically; fault tolerance improves as replicas are added.
When a replica is lost, HDFS restores it automatically through its internal re-replication mechanism, with no operator intervention.
2. Suitable for batch processing
It moves computation to the data rather than moving data to the computation.
It exposes data locations to the computing framework so that tasks can be scheduled close to the data.
3. Suitable for big data
It handles data at the GB, TB, and even PB scale.
It can manage more than a million files, a considerable number.
It scales to clusters of around 10,000 nodes.
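To put these scales in perspective, a quick back-of-envelope calculation (assuming the 64 MB default block size this article cites; newer Hadoop versions default to 128 MB):

```python
# Rough arithmetic (illustrative): how many 64 MB blocks does a
# single 1 TB file occupy in HDFS?
BLOCK_SIZE = 64 * 1024 ** 2      # 64 MB, the HDFS 1.x default
FILE_SIZE = 1024 ** 4            # 1 TB

# Ceiling division: the last block may be smaller than BLOCK_SIZE.
num_blocks = -(-FILE_SIZE // BLOCK_SIZE)
print(num_blocks)  # 16384 blocks for one 1 TB file
```

Each of those blocks is an object the NameNode must track, which is why block counts, not just byte counts, determine how far a cluster scales.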
4. Streaming file access
The consistency model is simplified to write-once, read-many: once a file is written it cannot be modified, only appended to.
This guarantees data consistency.
5. Can be built on cheap machines
It improves reliability through a multi-replica mechanism.
It provides fault tolerance and recovery: for example, a lost replica can be restored from the remaining replicas.
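The recovery idea described above can be sketched as follows (a toy model, not actual HDFS code; node names are invented):

```python
# Minimal sketch of re-replication: each block has a target replication
# factor; when a replica is lost, the block is copied from a surviving
# replica onto another live node until the target count is restored.
def re_replicate(replicas, live_nodes, target=3):
    """Return the replica set after restoring the target count."""
    replicas = [n for n in replicas if n in live_nodes]  # drop dead copies
    if not replicas:
        raise RuntimeError("block lost: no surviving replica")
    for node in live_nodes:
        if len(replicas) >= target:
            break
        if node not in replicas:
            replicas.append(node)  # copy the block from a survivor here
    return replicas

# dn2 dies, taking one replica with it; the block is restored to 3 copies.
print(re_replicate(["dn1", "dn2", "dn3"], ["dn1", "dn3", "dn4", "dn5"]))
# → ['dn1', 'dn3', 'dn4']
```

Note that data is only unrecoverable when all replicas are lost at once, which is why the default replication factor of 3 gives strong durability on unreliable hardware.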
Disadvantages:
1. Low-latency data access
* Millisecond-level access to stored data is simply not possible.
* HDFS suits high-throughput scenarios where large volumes of data are written at once, but it performs poorly when low latency is required, such as reading data within milliseconds.
2. Small file storage
* Storing a large number of small files (here, files smaller than the HDFS block size, 64 MB by default) consumes a large amount of NameNode memory for file, directory, and block metadata. This is undesirable because NameNode memory is always limited.
* For small files, seek time exceeds read time, which violates HDFS's design goal.
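The memory cost can be estimated with a common rule of thumb of roughly 150 bytes of NameNode heap per file, directory, or block object (the exact figure varies by Hadoop version):

```python
# Back-of-envelope estimate of NameNode heap consumed by metadata,
# using the oft-quoted ~150 bytes per namespace object (approximate).
BYTES_PER_OBJECT = 150

def namenode_heap(num_files, blocks_per_file=1):
    """Estimated heap in bytes: one file object plus its block objects."""
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT

# 100 million small files, one block each:
print(namenode_heap(100_000_000) / 1024 ** 3)  # roughly 28 GB of heap
```

The same 100 million blocks packed into large multi-block files would describe far more data with the same metadata footprint, which is why HDFS favors a modest number of large files.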
3. Concurrent writes and random file modification
* A file can have only one writer at a time; concurrent writes by multiple threads are not allowed.
* Only appending data (append) is supported; random modification of files is not.
Features of HDFS:
* High fault tolerance, scalability, and configurability
* Cross-platform
* Shell command interface
* Rack awareness
* Load balancing
* Web interface
4. How HDFS stores data
HDFS stores data using a Master/Slave architecture consisting of four main components: HDFS Client, NameNode, DataNode, and Secondary NameNode.
1. Client: the client.
File slicing: when a file is uploaded to HDFS, the Client splits it into blocks for storage.
Interact with NameNode to get the location information of the file.
Interact with DataNode, read or write data.
Client provides commands to manage HDFS, such as starting or shutting down HDFS.
Client can access HDFS through some commands.
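The file-slicing step above can be sketched like this (an illustrative model, not the real HDFS client):

```python
# Sketch of client-side block splitting: a byte stream is cut into
# fixed-size blocks before upload; only the last block may be short.
def split_into_blocks(data: bytes, block_size: int = 64 * 1024 ** 2):
    """Split data into consecutive blocks of at most block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

# Tiny demonstration with a 64-byte "block size": 150 bytes of data
# become two full blocks and one partial block.
blocks = split_into_blocks(b"x" * 150, block_size=64)
print([len(b) for b in blocks])  # [64, 64, 22]
```

In the real system each of these blocks is then written to a pipeline of DataNodes chosen by the NameNode.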
2. NameNode: the master; it acts as supervisor and manager.
Manage the namespaces of HDFS
Manage block (Block) mapping information
Configure replica Policy
Handle client read and write requests.
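A toy model of the two mappings the NameNode maintains (the paths, block IDs, and node names here are invented for illustration):

```python
# Simplified view of NameNode state: the namespace maps file paths to
# ordered block IDs, and the block map records which DataNodes hold
# a replica of each block.
namespace = {"/logs/app.log": ["blk_1", "blk_2"]}
block_map = {
    "blk_1": ["dn1", "dn2", "dn3"],
    "blk_2": ["dn2", "dn3", "dn4"],
}

def locate(path):
    """Answer a client read request: which DataNodes hold each block?"""
    return [(b, block_map[b]) for b in namespace[path]]

print(locate("/logs/app.log"))
# → [('blk_1', ['dn1', 'dn2', 'dn3']), ('blk_2', ['dn2', 'dn3', 'dn4'])]
```

The client receives only these locations from the NameNode; the actual block data flows directly between the client and the DataNodes, keeping the master off the data path.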
3. DataNode: the Slave. The NameNode gives commands, and the DataNode performs the actual operations.
Stores the actual blocks of data.
Perform read / write operations on the data block.
4. Secondary NameNode: not a hot standby for the NameNode. When the NameNode fails, it cannot immediately take over and serve requests.
It assists the NameNode and shares part of its workload.
It periodically merges the fsimage and edits files and pushes the result back to the NameNode.
In an emergency, it can assist in recovering the NameNode.
Workflow:
The Secondary NameNode asks the NameNode to roll over to a new edits file.
The Secondary NameNode fetches fsimage and edits from the NameNode (via HTTP).
The Secondary NameNode loads fsimage into memory and merges in the edits.
The Secondary NameNode sends the new fsimage back to the NameNode.
The NameNode replaces the old fsimage with the new one.
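The merge step at the heart of this checkpoint workflow can be sketched as replaying an operation log onto a namespace snapshot (a simplified model; the real edits log records many more operation types):

```python
# Sketch of the checkpoint merge: fsimage is a snapshot of the namespace,
# edits is a log of operations applied since that snapshot; merging
# replays the log onto the snapshot to produce a fresh fsimage.
def merge_checkpoint(fsimage: dict, edits: list) -> dict:
    """Replay (op, path) entries from edits onto a copy of fsimage."""
    image = dict(fsimage)
    for op, path in edits:
        if op == "create":
            image[path] = []        # new empty file
        elif op == "delete":
            image.pop(path, None)   # remove if present
    return image

fsimage = {"/a": []}
edits = [("create", "/b"), ("delete", "/a")]
print(merge_checkpoint(fsimage, edits))  # {'/b': []}
```

Offloading this replay keeps the NameNode's restart fast: it only has to load the latest fsimage plus a short edits log rather than replay the entire history.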
5. Handling data corruption
* When a DN (DataNode) reads a block, it computes the block's checksum.
* If the computed checksum differs from the value recorded when the block was created, the block has been corrupted.
* The client then reads the block from a replica on another DN; the NN (NameNode) marks the block as corrupted and re-replicates it until the file's configured replication factor is restored.
* Each DN also verifies the checksums of its blocks three weeks after they are created.
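The checksum mechanism can be sketched with CRC32 standing in for HDFS's per-chunk checksums (a minimal illustration, not HDFS internals):

```python
# Corruption detection sketch: the reader recomputes the checksum and
# compares it with the value recorded when the block was written.
import zlib

def verify_block(data: bytes, stored_checksum: int) -> bool:
    """Return True if the block's current contents match its checksum."""
    return zlib.crc32(data) == stored_checksum

block = b"some block data"
stored = zlib.crc32(block)                 # recorded at write time

print(verify_block(block, stored))         # True: block is intact
print(verify_block(block + b"!", stored))  # False: corruption detected
```

When verification fails, the corrupted replica is discarded and replaced from a healthy copy, as described above.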