HDFS Learning Summary (Hadoop 2.x) v1.2


Hadoop 2.x cluster setup: http://bigtrash.blog.51cto.com/8966424/1830423

1. HDFS (Hadoop Distributed File System): the Hadoop distributed file system, consisting of a NameNode, which manages file system metadata, and DataNodes, which store the actual data.

Storage capacity grows linearly with added nodes through the distributed storage mechanism

Data redundancy is automatic through replication, with no need for RAID-based backups

Write once, read many times: files cannot be modified once written, which keeps the consistency model simple. (Append was not supported in the 1.x versions.)

Computation is assigned to nodes according to the principle of data locality ("move the computation to the data")

2. NameNode: manages the HDFS namespace

Saves metadata information:

File ownership and permissions

Which blocks each file consists of

Which DataNodes each block is stored on

The NameNode's metadata is stored on disk in the fsimage file and is loaded into memory when HDFS starts.

Operations applied to the in-memory metadata are recorded in the edits log on disk.
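As a side note, the on-disk fsimage and edits files can be inspected with the offline viewers shipped with Hadoop 2.x; a minimal sketch, where the file names are illustrative examples of what appears in a NameNode storage directory:

    hdfs oiv -p XML -i fsimage_0000000000000000042 -o fsimage.xml            # dump a checkpoint image to XML
    hdfs oev -i edits_0000000000000000001-0000000000000000042 -o edits.xml   # dump an edits segment to XML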

When the NameNode starts, it goes through a safe-mode phase (see the commands after this list):

1) No writes to the file system are accepted

2) Block reports are collected from the DataNodes; a data block is considered safe once it reaches its minimum number of replicas. When the proportion of safe blocks reaches a configured threshold, the NameNode exits safe mode after a further delay.

3) Data blocks with fewer than the minimum number of replicas are automatically re-replicated up to that minimum
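For reference, safe mode can be queried and controlled from the command line with the standard dfsadmin options:

    hdfs dfsadmin -safemode get     # ask whether the NameNode is currently in safe mode
    hdfs dfsadmin -safemode wait    # block until the NameNode leaves safe mode on its own
    hdfs dfsadmin -safemode leave   # administratively force an exit from safe mode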

3. DataNode: where the data is actually stored

Data is stored as blocks, 128 MB each by default, which is larger than the block size of most current file systems; a block that is not full does not occupy the whole block size on physical storage

Each block is replicated to three DataNodes by default, and the number of replicas can be adjusted (an example follows this section). Replication improves both reliability and read throughput.

When the DataNode service starts, it scans the local file system, builds a list mapping HDFS blocks to local files, and reports it to the NameNode.
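For example, the replication factor of an existing file can be inspected and changed from the shell; the path here is hypothetical:

    hdfs dfs -stat %r /data/sample.txt        # print the file's current replication factor
    hadoop fs -setrep -w 2 /data/sample.txt   # change it to 2 and wait until re-replication completes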

4. Secondary NameNode:

When the NameNode starts, it must merge the on-disk fsimage (the latest checkpointed state of HDFS) and the edits log (changes made after the fsimage was created), which costs a lot of startup time. The Secondary NameNode periodically downloads the NameNode's fsimage and edits files and merges them, has the NameNode roll over to a new edits log, and uploads the merged fsimage back to replace the NameNode's copy, reducing HDFS startup time.
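On a running 2.x cluster the Secondary NameNode is a separate daemon; as a sketch, it can be started by hand in the same style as the daemon commands in section 11:

    hadoop-daemon.sh start secondarynamenode   # start the Secondary NameNode daemon on this host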

5. Checkpoint Node: probably to avoid confusion over the name, the Checkpoint Node is recommended in place of the Secondary NameNode after version 1.0.4; the function and configuration are basically the same.

Started with bin/hdfs namenode -checkpoint

Starting the NameNode with -importCheckpoint imports the namenode state from a checkpoint

Common configuration options (a query example follows the list):

dfs.namenode.checkpoint.period # interval that triggers a merge of the edits log

fs.checkpoint.size # edits file size threshold that triggers a merge

dfs.namenode.checkpoint.dir # checkpoint (fsimage) save path for the secondary namenode

dfs.namenode.checkpoint.edits.dir # edits save path for the secondary namenode
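The effective value of such an option can be checked on a live cluster; note that fs.checkpoint.size is the older 1.x name and may be unset on a 2.x cluster:

    hdfs getconf -confKey dfs.namenode.checkpoint.period   # print the configured checkpoint interval (in seconds)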

6. Backup Node: a full backup of the NameNode. Besides providing the checkpoint functionality, it also copies the master's in-memory Namespace into its own memory. The Backup Node not only receives the edits stream from the NameNode and saves it to disk, it also applies the edits to its own in-memory copy of the Namespace, maintaining a full backup of the Namespace.

Currently, HDFS supports only one Backup Node. While a Backup Node is in use, a Checkpoint Node cannot be used.

dfs.backup.address # location of the backup node

dfs.backup.http.address # web interface address of the backup node

Run bin/hdfs namenode -backup on the node configured by dfs.backup.address to start it

7. Placement policy for the three replicas of a block (an fsck check follows the list):

Replica 1: on the client's node

Replica 2: on a node in a different rack

Replica 3: on another node in the same rack as replica 2
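The actual placement of a file's replicas can be verified with fsck, using the options described in section 11; the path is hypothetical:

    hadoop fsck /data/sample.txt -files -blocks -locations -racks   # show each block, its replica locations, and their racks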

8. Handling of corrupted blocks on a DataNode (an fsck example follows the list):

1) When a client reads a block from a DataNode, it computes a checksum

2) If the computed checksum differs from the value recorded when the block was created, the block is corrupted

3) The client then reads the block from another DataNode, the NameNode marks the block as corrupted, and the block is re-replicated up to the preset number of replicas. (A DataNode also verifies the checksums of its blocks in the background, by default three weeks after a file is created.)
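Administratively, files with corrupt or missing blocks can be surfaced with fsck; a minimal sketch:

    hdfs fsck / -list-corruptfileblocks   # list the paths whose blocks are corrupt or missing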

9. Ways to access HDFS (a REST example follows the list):

HDFS SHELL command

HDFS JAVA API

HDFS REST API

HDFS FUSE

libhdfs: C/C++ access interface

Thrift

...
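As a concrete taste of the REST API, WebHDFS can be driven with plain HTTP, assuming dfs.webhdfs.enabled is true and the NameNode web port is 50070 (as in the URL at the end of this article); the paths are hypothetical:

    curl -i "http://master:50070/webhdfs/v1/user/test?op=LISTSTATUS"           # list a directory
    curl -i -L "http://master:50070/webhdfs/v1/user/test/sample.txt?op=OPEN"   # read a file (follows the redirect to a DataNode)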

10. Shell commands for interacting between Hadoop and HDFS: hadoop fs -help (a short example session follows the list)

hadoop fs -ls PATH: list the contents of the specified directory

hadoop fs -cat PATH/FILE: print the contents of FILE

hadoop fs -put LOCAL_PATH/FILE HADOOP_PATH: copy a local file into HDFS

hadoop fs -put LOCAL_PATH HADOOP_PATH: copy a local folder into HDFS

hadoop fs -rm PATH/FILE: delete FILE

hadoop fs -rmr PATH: recursively delete everything under PATH

hadoop fs -mkdir PATH: create a new directory

hadoop fs -touchz PATH/FILE: create an empty file

hadoop fs -mv PATH/OLD_FILE PATH/NEW_FILE: rename a file

...
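Putting a few of these together, a short end-to-end session might look like this; directory and file names are hypothetical:

    hadoop fs -mkdir -p /user/test                            # create a working directory
    hadoop fs -put notes.txt /user/test                       # upload a local file
    hadoop fs -ls /user/test                                  # confirm it arrived
    hadoop fs -cat /user/test/notes.txt                       # print its contents
    hadoop fs -mv /user/test/notes.txt /user/test/notes.old   # rename it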

11. HDFS administration

hadoop-daemon.sh start namenode # start the namenode

hadoop-daemon.sh start datanode # start the datanode

hdfs dfsadmin -help: list the available HDFS administration operations (usage examples follow the list)

-report: report basic HDFS statistics

-safemode: enter, leave, or query safe mode

-finalizeUpgrade: remove the backup of the cluster made before the last upgrade

-refreshNodes: reread the hosts and exclude files (specified by the dfs.hosts.exclude option) so the NameNode picks up node changes

-printTopology: display the topology of the cluster

-upgradeProgress status/details/force: show the upgrade status, show its details, or force the upgrade to proceed
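Two of the most common of these in day-to-day use, as a sketch:

    hdfs dfsadmin -report        # print cluster capacity, remaining space, and per-DataNode status
    hdfs dfsadmin -refreshNodes  # apply an edited exclude file, e.g. to decommission a DataNode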

hadoop fsck: file system checking

Checks whether the files under the given path are complete

-move: move damaged files to the /lost+found directory

-delete: delete damaged files

-openforwrite: print files that are currently open for writing

-files: print the names of the files being checked

-blocks: print the block report (used together with the -files option)

-locations: print the location of every block (used together with the -files option)

-racks: print the network topology of the block locations (used together with the -files option)

start-balancer.sh -threshold: redistribute blocks across DataNodes (example below)
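For example, with an illustrative threshold value:

    start-balancer.sh -threshold 5   # rebalance until each DataNode's utilization is within 5% of the cluster average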

The status of the NameNode and DataNodes in the cluster can also be viewed through the web UI.

http://master:50070/: displays the current basic status of the cluster
