
Good Programmer Big Data Learning Route Sharing: HDFS Learning Summary


HDFS Introduction

HDFS (Hadoop Distributed File System) is a distributed file system and a core subproject of the Hadoop project.

Design idea: store large files, and large numbers of files, distributed across many servers, so that massive data can be analyzed with a divide-and-conquer approach.

Important Features of HDFS

1. Files in HDFS are physically stored in blocks. The block size can be specified with the configuration parameter `dfs.blocksize`; the default is 128 MB in Hadoop 2.x and 64 MB in older versions (see the client-side sketch after this list).

2. The HDFS file system presents clients with a **unified abstract directory tree**; clients access files through paths.

3. **Management of the directory structure and file block information (metadata)** is handled by the NameNode.

4. Storage management of the individual blocks of each file is handled by the DataNodes.

5. HDFS is designed for write-once, read-many workloads and does not support in-place file modification.
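A minimal client-side sketch of the block-size setting, assuming a reachable cluster (the NameNode address reuses the hypothetical hadoop-server01:9000 from the shell examples below, and the file path is made up):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Point the client at the NameNode (hypothetical address).
        conf.set("fs.defaultFS", "hdfs://hadoop-server01:9000");
        // Ask for a 64 MB block size for files this client creates;
        // the cluster-wide default comes from dfs.blocksize in hdfs-site.xml.
        conf.setLong("dfs.blocksize", 64L * 1024 * 1024);

        try (FileSystem fs = FileSystem.get(conf);
             FSDataOutputStream out = fs.create(new Path("/demo/blocks.txt"))) {
            out.writeUTF("this file is stored in 64 MB blocks");
        }
    }
}
```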

Advantages of HDFS

1. High reliability

Hadoop's bit-level storage and processing of data can be trusted.

2. High scalability

Hadoop distributes data across clusters of available machines and completes computational tasks on them.

3. Efficiency

Hadoop can move data dynamically between nodes and keeps the load on each node in dynamic balance.

4. High fault tolerance

Hadoop automatically keeps multiple copies of data and automatically redistributes tasks that fail.

Disadvantages of HDFS

1. Not suitable for low-latency, fast access

HDFS has a single master, and every file request must go through it, so latency is unavoidable when the request volume is large. HDFS is suited to high-throughput scenarios in which large amounts of data are written at one time.

2. Cannot store large numbers of small files efficiently

Storing a large number of small files consumes a large amount of NameNode memory, since the NameNode holds all file, directory, and block information (metadata) in memory.
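As a rough illustration (assuming the commonly cited estimate of about 150 bytes of NameNode heap per file, directory, or block object): 10 million small files, each fitting in a single block, produce about 20 million metadata objects, i.e. roughly 20,000,000 × 150 B ≈ 3 GB of NameNode heap, no matter how little data the files actually hold.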

3. No support for concurrent writers or arbitrary file modification

Only appending data is supported; modifying existing file contents is not (see the sketch below).
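A minimal sketch of the append-only model, assuming a running HDFS cluster with append enabled (the default in Hadoop 2.x) and a pre-existing file at the hypothetical path /hello.txt:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/hello.txt"); // hypothetical existing file

        // Appending to the end of an existing file is allowed...
        try (FSDataOutputStream out = fs.append(file)) {
            out.writeBytes("one more line\n");
        }
        // ...but there is no API to overwrite bytes in the middle of a file;
        // "modifying" a file means rewriting it in full.
    }
}
```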

Possible improvements for HDFS's shortcomings

1. Multi-master design. GFS II, under development, is also moving to a distributed multi-master design with master failover, and its block size is reduced to 1 MB, deliberately optimizing small-file handling. (Alibaba's DFS is also a multi-master design; it separates the storage and management of metadata, with multiple metadata storage nodes and a single query master node.)

2. Caching or a multi-master design can reduce the pressure of client data requests and thereby reduce latency.

3. Scale out horizontally: a single Hadoop cluster can manage only a limited number of small files, so place several Hadoop clusters behind a virtual server to form one large Hadoop cluster. Google did the same thing.

Shell commands for HDFS

| Command | Description and example |
| :--- | :--- |
| **-help** | Output the manual for a command. |
| **-ls** | Display directory information: `hadoop fs -ls hdfs://hadoop-server01:9000/`. All HDFS paths in these commands can be abbreviated, so `hadoop fs -ls /` is equivalent to the previous command. |
| **-put** | Upload files to HDFS: `hdfs dfs -put <local file path> <HDFS path>`. |
| **-get** | Download files from HDFS: `hdfs dfs -get <HDFS path> <local path>`. Note: `-copyFromLocal` is equivalent to `-put`, and `-copyToLocal` is equivalent to `-get`. |
| **-cat** | View a file's contents: `hdfs dfs -cat <HDFS file path>`. Do not use it on anything that is not a file. |
| **-cp** | Copy within HDFS: `hdfs dfs -cp <source HDFS path> <destination HDFS path>`. |
| **-mv** | Move files within HDFS: `hdfs dfs -mv <source HDFS path> <destination HDFS path>`. Multiple source paths are allowed, in which case the destination must be a directory. Moving files between different file systems is not allowed. |
| **-du** | View file sizes: `hdfs dfs -du <HDFS path>`. |
| **-mkdir** | Create a directory in HDFS; add `-p` to create parent directories recursively. |
| **-rm** | Delete files or directories: `hdfs dfs -rm <HDFS path>` works only on a single file or an empty directory; add `-r` (`hdfs dfs -rm -r <HDFS path>`) when the directory contains files. |
| **-chmod** | Change file permissions: `hdfs dfs -chmod -R <mode> <directory>`. The mode is three octal digits (777 is full rwx permissions); with `-R`, all files and subdirectories underneath are modified as well. |
| **-appendToFile** | Append a local file to the end of an existing HDFS file: `hadoop fs -appendToFile ./hello.txt /hello.txt`. |
| **-getmerge** | Merge and download multiple files: `hadoop fs -getmerge /aaa/log.* ./log.sum`. |
| **-df** | Show the file system's available space: `hadoop fs -df -h /`. |
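The table above covers the command line; the same operations are also available programmatically through the `org.apache.hadoop.fs.FileSystem` API. A minimal sketch (all paths are made up for illustration):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FsShellEquivalents {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        fs.mkdirs(new Path("/demo"));                                  // -mkdir -p
        fs.copyFromLocalFile(new Path("hello.txt"),
                             new Path("/demo/hello.txt"));             // -put
        for (FileStatus st : fs.listStatus(new Path("/demo"))) {       // -ls
            System.out.println(st.getPath() + "  " + st.getLen());
        }
        fs.copyToLocalFile(new Path("/demo/hello.txt"),
                           new Path("hello-copy.txt"));                // -get
        fs.delete(new Path("/demo"), true);                            // -rm -r
    }
}
```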

How HDFS works

Before we understand how it works, let's look at a few important roles:

NameNode

1. Master: it is the manager that maintains the directory tree of the entire file system.

2. Stores file-to-block (Block) mapping information. The metadata it saves includes file ownership, file permissions, file size, timestamps, the block list with block offsets, and block location information.

3. Main responsibilities: handle clients' read/write requests and collect the block reports sent by DataNodes (a client-side view of this metadata is sketched below).
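The kind of metadata the NameNode answers for can be seen from the client side with the standard `FileSystem` API. A minimal sketch (the file path is hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MetadataDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus st = fs.getFileStatus(new Path("/demo/hello.txt")); // hypothetical

        // Ownership, permissions, size: answered from NameNode metadata.
        System.out.println(st.getOwner() + " " + st.getPermission() + " " + st.getLen());

        // Block list with offsets and the DataNodes holding each replica.
        for (BlockLocation loc : fs.getFileBlockLocations(st, 0, st.getLen())) {
            System.out.println("offset=" + loc.getOffset()
                    + " length=" + loc.getLength()
                    + " hosts=" + String.join(",", loc.getHosts()));
        }
    }
}
```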

DataNode

1. Slave: it is a worker node; put simply, it is the NameNode's subordinate.

2. Stores the user's file block data.

3. Main responsibilities: regularly report the blocks it holds to the NameNode (the heartbeat mechanism), and carry out the actual reads and writes of data blocks.

Secondary NameNode

1. Checkpoint node. On the surface, the Secondary NameNode looks like a backup of the NameNode; in fact, backup is not its main function.

2. Main responsibility: periodically merge the fsimage and edit log and push the result to the NameNode.

Problem introduction: imagine a cluster that has been running for ten years. The most recent fsimage (image) dates from when the NameNode was formatted ten years ago, and the edit logs recording ten years of operations have grown to hundreds of terabytes. If we need to restart this cluster, replaying such huge log files would take a very long time, and we do not have that much time. How do we solve this?


To solve this, we introduce the Secondary NameNode (SN): the edit log and fsimage from the primary NameNode (PN) are merged on the SN. While the merge runs, the PN keeps writing new edit logs that record the operations performed during the merge. After the merge, the new fsimage is copied back to the PN, and the cycle repeats. This keeps the edit log small and the fsimage's checkpoint time reasonably recent.
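In Hadoop 2.x, this checkpoint is governed by two configuration parameters: `dfs.namenode.checkpoint.period` (default 3600 seconds) and `dfs.namenode.checkpoint.txns` (default 1,000,000 logged transactions); whichever threshold is reached first triggers a merge.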

**How does the fsimage come about?**

When an HDFS cluster is first set up, the NameNode must be formatted, which produces the first fsimage file (an empty one). On startup, the NameNode loads the fsimage, replays the edit log on top of it in memory, and then immediately writes a new fsimage to disk; this new fsimage holds the up-to-date metadata.

**In an emergency, it can help restore the NameNode**

The working directory layout of the NameNode and the Secondary NameNode is exactly the same, so when the NameNode fails and must be recovered, the fsimage can be copied from the Secondary NameNode's working directory into the NameNode's working directory to restore the NameNode's metadata.

HDFS Read and Write Data Flow

HDFS Read Data Flow

Simple version

The client sends the path of the file it wants to read to the NameNode; the NameNode looks up the file's metadata (mainly the storage locations of its blocks) and returns it to the client. The client then contacts the corresponding DataNodes, fetches the file's blocks one by one, and appends and merges them locally into the complete file.
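The same flow, driven from a client program: `open()` asks the NameNode for block locations, and the returned stream then reads the blocks from DataNodes in order. A minimal sketch (the path is hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // open() asks the NameNode for the file's block locations;
        // the stream then reads each block from a DataNode holding a replica.
        try (FSDataInputStream in = fs.open(new Path("/demo/hello.txt"))) {
            IOUtils.copyBytes(in, System.out, 4096, false); // stream blocks in order
        }
    }
}
```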
