2025-04-04 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/01 Report--
This article introduces the principles behind HDFS. In practice, many people run into difficulties with these topics, so the sections below walk through the key concepts step by step. I hope you read carefully and learn something useful!
I. The main design goals of HDFS
1. Store very large files
Here, "very large files" means files of hundreds of MB, many GB, or even TB.
2. Write once, read many times (streaming data access)
This is the most efficient access pattern for HDFS. Data sets stored in HDFS are typically the objects of Hadoop analysis: a data set is generated once and then analyzed many times over a long period. Each analysis involves most, or even all, of the data set, so the latency of reading the entire data set matters more than the latency of reading the first record.
3. Run on ordinary, inexpensive servers
HDFS is designed to run on commodity hardware; even when hardware fails, its fault-tolerance policies keep the data highly available.
II. What HDFS is not suited for
1. Scenarios that require low-latency data access
HDFS is designed for high-throughput applications, and that throughput comes at the cost of high latency.
2. Storing a large number of small files
HDFS metadata (the basic information about each file) lives in the NameNode's memory, and the NameNode is a single point. Once the number of small files grows large enough, the NameNode's memory becomes the bottleneck.
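A quick back-of-the-envelope calculation shows why small files hurt. A commonly cited rule of thumb (an assumption here, not an exact HDFS constant) is roughly 150 bytes of NameNode heap per metadata object:

```python
# Rough estimate of NameNode heap consumed by file metadata.
# The ~150 bytes/object figure is a widely quoted rule of thumb,
# not an exact HDFS constant.

BYTES_PER_OBJECT = 150  # assumed average cost of a file or block entry

def namenode_heap_bytes(num_files: int, blocks_per_file: int) -> int:
    """Each file costs one file object plus one object per block."""
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT

# 100 million small files (1 block each) already need ~30 GB of heap,
# regardless of how little data each file actually holds.
print(namenode_heap_bytes(100_000_000, 1) / 1e9)  # ~30.0 (GB)
```

The same amount of data packed into large files would cost orders of magnitude fewer metadata objects, which is why HDFS favors few large files over many small ones.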
III. Basic concepts of HDFS
Block: large files are split into multiple blocks for storage. The default block size is 64 MB (newer Hadoop releases default to 128 MB). Each block is stored as multiple replicas on multiple DataNodes, 3 replicas by default.
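The block splitting described above can be sketched as follows; `split_into_blocks` is an illustrative helper, not a real HDFS API:

```python
# Sketch: how a file's byte range maps onto fixed-size HDFS blocks.
# 64 MB is the default mentioned in the text; only the last block
# may be shorter than the block size.

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB
REPLICATION = 3                # each block is stored on 3 DataNodes

def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE):
    """Return (offset, length) pairs covering the file."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

# A 200 MB file becomes 4 blocks: 64 + 64 + 64 + 8 MB.
blocks = split_into_blocks(200 * 1024 * 1024)
print(len(blocks), blocks[-1][1] // (1024 * 1024))  # 4 8
```

Note that a short final block only occupies its actual size on disk; HDFS does not pad it out to 64 MB.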
NameNode: the NameNode manages the file-system namespace: the directory tree, the file-to-block mapping, and the block-to-DataNode mapping.
DataNode: DataNodes are responsible for actually storing the data; most of the fault-tolerance mechanisms are also implemented on the DataNode side.
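The two mappings the NameNode maintains can be modeled with plain dictionaries; every file, block, and node name below is made up for illustration:

```python
# Toy model of the NameNode's two core mappings:
# file -> blocks, and block -> DataNodes holding its replicas.

file_to_blocks = {
    "/logs/access.log": ["blk_001", "blk_002"],
}
block_to_datanodes = {
    "blk_001": ["dn1", "dn2", "dn3"],
    "blk_002": ["dn2", "dn3", "dn4"],
}

def locate(path: str) -> list:
    """Resolve a path to the DataNodes holding each of its blocks,
    which is essentially what a client asks the NameNode before reading."""
    return [block_to_datanodes[b] for b in file_to_blocks[path]]

print(locate("/logs/access.log"))
```

This is also why the NameNode never sits on the data path: it only hands out locations, and clients then talk to DataNodes directly.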
IV. HDFS Basic Architecture Diagram
There are several concepts in the diagram that need to be introduced.
Rack: a rack (cabinet) of servers. The three replicas of a block are usually spread across servers in two or more racks. The purpose is disaster tolerance: it is fairly common for an entire rack to lose power or for a rack's switch to fail.
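The rack-aware placement described above can be sketched roughly like this. It loosely mirrors HDFS's default policy (first replica on the writer's node, second and third on a different rack, on different nodes); all rack and node names are illustrative:

```python
# Sketch of rack-aware replica placement for a replication factor of 3.
# Loosely follows HDFS's default policy; not real NameNode code.
import random

def place_replicas(nodes_by_rack: dict, writer: str) -> list:
    """Pick 3 nodes: the writer's own node, then two distinct nodes
    on a single remote rack, so replicas span exactly two racks."""
    writer_rack = next(r for r, ns in nodes_by_rack.items() if writer in ns)
    other_racks = [r for r in nodes_by_rack if r != writer_rack]
    remote_rack = random.choice(other_racks)
    second, third = random.sample(nodes_by_rack[remote_rack], 2)
    return [writer, second, third]

racks = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4", "dn5"]}
print(place_replicas(racks, "dn1"))  # e.g. ['dn1', 'dn4', 'dn3']
```

Spanning exactly two racks balances fault tolerance (a whole-rack failure cannot lose all replicas) against write cost (only one cross-rack network hop per block).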
V. HDFS file writing process
Thinking:
After the client calls create file, what policy does the NameNode use to choose DataNodes for it?
The client writes to three DataNodes in a pipeline; if one DataNode dies during the write, how is that tolerated?
If the client itself dies while writing data to the DataNodes, how is that tolerated?
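One common answer to the second question can be simulated: the pipeline is rebuilt from the surviving DataNodes and the write continues, leaving the block under-replicated until the NameNode later schedules re-replication. The code below is a toy simulation under that assumption, not real HDFS code:

```python
# Toy simulation of the write pipeline: packets flow client -> dn1 -> dn2
# -> dn3. If a node dies mid-write, the pipeline is rebuilt from the
# survivors and the write continues. Node dicts and the `alive` flag
# are illustrative.

def pipeline_write(packets: list, datanodes: list) -> list:
    """Stream packets through the chain, dropping dead nodes.
    Returns the names of the nodes that hold the written data."""
    chain = list(datanodes)
    for packet in packets:
        chain = [dn for dn in chain if dn["alive"]]  # rebuild pipeline
        for dn in chain:
            dn["data"] = dn.get("data", b"") + packet
    return [dn["name"] for dn in chain]

dns = [{"name": "dn1", "alive": True},
       {"name": "dn2", "alive": True},
       {"name": "dn3", "alive": True}]
pipeline_write([b"p0"], dns)       # all three nodes receive packet p0
dns[1]["alive"] = False            # dn2 hangs mid-file
survivors = pipeline_write([b"p1"], dns)
print(survivors)  # ['dn1', 'dn3'] -- under-replicated until re-replication
```

The client-failure case works differently: an incompletely written block is detected via its generation stamp and lease recovery, and the partial block is discarded or truncated to a consistent length.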
VI. HDFS file reading process
Thinking: what happens if the NameNode dies? Does HDFS have a corresponding fault-tolerance scheme for that?
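The read path itself can be sketched as follows, including the fallback to another replica when a DataNode is unreachable. The NameNode metadata and DataNode storage are modeled as plain dictionaries; all names are illustrative:

```python
# Sketch of the HDFS read path: ask the NameNode for each block's
# replica locations, then read each block from the first reachable
# DataNode, falling back to the next replica on failure.

def read_file(path, namenode, datanode_store, dead=frozenset()):
    """Reassemble a file from its blocks; `dead` simulates down nodes."""
    data = b""
    for block_id, replicas in namenode[path]:
        for dn in replicas:              # try replicas in order
            if dn not in dead:
                data += datanode_store[(dn, block_id)]
                break
        else:
            raise IOError(f"all replicas of {block_id} are unreachable")
    return data

namenode = {"/f": [("b1", ["dn1", "dn2"]), ("b2", ["dn2", "dn3"])]}
store = {("dn1", "b1"): b"AA", ("dn2", "b1"): b"AA",
         ("dn2", "b2"): b"BB", ("dn3", "b2"): b"BB"}
print(read_file("/f", namenode, store, dead={"dn2"}))  # b'AABB'
```

Note that DataNode failures during a read are transparent to the caller as long as one replica of each block survives; the NameNode, by contrast, is the single point this section's question is really about.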
"What is the principle of HDFS" is introduced here. Thank you for reading it. If you want to know more about industry-related knowledge, you can pay attention to the website. Xiaobian will output more high-quality practical articles for everyone!
Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.
Views: 0
*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.
Continue with the installation of the previous hadoop.First, install zookooper1. Decompress zookoope
"Every 5-10 years, there's a rare product, a really special, very unusual product that's the most un
© 2024 shulou.com SLNews company. All rights reserved.