2025-01-21 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/03 Report--
1. The working mechanism of datanode
1. Basic process
1) After the datanode starts, it registers with the namenode at the namenode address specified in the configuration file.
2) The namenode returns a successful registration.
3) From then on, the datanode periodically reports all of its block information to the namenode (every 1 hour by default).
4) At the same time, the datanode sends a heartbeat to the namenode every 3 seconds; the heartbeat reply can carry commands from the namenode to the datanode, such as copying a block to another machine or deleting a block. If no heartbeat is received from a datanode for more than 10 minutes (by default), the node is considered unavailable.
5) Because of this, datanode machines can be safely added to or removed from the cluster while it is running.
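The registration and heartbeat bookkeeping above can be sketched in a few lines of Python. This is a minimal illustration of the mechanism, not Hadoop's actual implementation; the class and method names are invented:

```python
import time

HEARTBEAT_INTERVAL = 3     # seconds; dfs.heartbeat.interval default
RECHECK_INTERVAL = 5 * 60  # seconds; dfs.namenode.heartbeat.recheck-interval default

class NameNode:
    """Tracks the last heartbeat time of each registered datanode."""

    def __init__(self):
        self.last_heartbeat = {}

    def register(self, datanode_id):
        # Steps 1-2: the datanode registers; the namenode records it.
        self.last_heartbeat[datanode_id] = time.time()
        return "registered"

    def heartbeat(self, datanode_id, now=None):
        # Step 4: the datanode heartbeats every 3 seconds; the reply
        # may carry commands (copy a block, delete a block, ...).
        self.last_heartbeat[datanode_id] = now if now is not None else time.time()
        return []  # list of commands for the datanode

    def is_alive(self, datanode_id, now=None,
                 timeout=2 * RECHECK_INTERVAL + 10 * HEARTBEAT_INTERVAL):
        # A datanode silent for longer than `timeout` is considered dead.
        now = now if now is not None else time.time()
        return now - self.last_heartbeat[datanode_id] <= timeout

nn = NameNode()
nn.register("dn1")
nn.heartbeat("dn1", now=1000.0)
print(nn.is_alive("dn1", now=1000.0 + 600))  # True: within the 630 s timeout
print(nn.is_alive("dn1", now=1000.0 + 700))  # False: heartbeat too old
```

The 630-second timeout here comes from the formula discussed in section 4 below.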
2. Basic directory structure
The namenode directory structure must be created manually by running `hdfs namenode -format`, whereas the datanode directory structure is created automatically at startup; no manual formatting is needed. Even if you run the namenode format command on a datanode machine, the formatted directories are useless as long as no namenode is started there. By default the datanode directory is ${hadoop.tmp.dir}/dfs/data. Its structure looks like this:
data
├── current
│   ├── BP-473222668-192.168.50.121-1558262787574    (named after the block pool ID)
│   │   ├── current
│   │   │   ├── dfsUsed
│   │   │   ├── finalized
│   │   │   │   └── subdir0
│   │   │   │       └── subdir0
│   │   │   │           ├── blk_1073741825
│   │   │   │           ├── blk_1073741825_1001.meta
│   │   │   │           ├── blk_1073741826
│   │   │   │           ├── blk_1073741826_1002.meta
│   │   │   │           ├── blk_1073741827
│   │   │   │           └── blk_1073741827_1003.meta
│   │   │   ├── rbw
│   │   │   └── VERSION
│   │   ├── scanner.cursor
│   │   └── tmp
│   └── VERSION
└── in_use.lock
(1) The contents of /data/current/VERSION are as follows:
# storage id of the datanode; not globally unique, and rarely of interest
StorageID=DS-0cb8a268-16c9-452b-b1d1-3323a4b0df60
# cluster ID, globally unique
ClusterID=CID-c12b7022-0c51-49c5-942f-edc889d37fee
# creation time; generally 0 here and rarely of interest
CTime=0
# unique identifier of the datanode, globally unique
DatanodeUuid=085a9428-9732-4486-a0ba-d75e6ff28400
# storage type is datanode
StorageType=DATA_NODE
LayoutVersion=-57
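Since VERSION is a plain `key=value` properties file with `#` comments, it is easy to read programmatically. A small sketch (the parser function is invented; the sample values are the ones shown above):

```python
def parse_version(text):
    """Parse a VERSION file: skip blank lines and '#' comments, split on '='."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key] = value
    return props

sample = """\
# id of the datanode
StorageID=DS-0cb8a268-16c9-452b-b1d1-3323a4b0df60
ClusterID=CID-c12b7022-0c51-49c5-942f-edc889d37fee
CTime=0
DatanodeUuid=085a9428-9732-4486-a0ba-d75e6ff28400
StorageType=DATA_NODE
LayoutVersion=-57
"""

props = parse_version(sample)
print(props["ClusterID"])    # CID-c12b7022-0c51-49c5-942f-edc889d37fee
print(props["StorageType"])  # DATA_NODE
```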
(2) The contents of /data/current/POOL_ID/current/VERSION are as follows:
# namespace ID of the namenode this datanode serves
NamespaceID=983105879
# creation timestamp
CTime=1558262787574
# block pool ID in use
BlockpoolID=BP-473222668-192.168.50.121-1558262787574
LayoutVersion=-57
(3) /data/current/POOL_ID/current/finalized/subdir0/subdir0 is the directory that actually stores the blocks. Each block is stored as two files:
blk_${BLOCK-ID}
blk_${BLOCK-ID}_xxx.meta
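Following the naming pattern visible in the directory listing above, the two file names can be derived from the block ID and the generation stamp (the suffix in the `.meta` name). A small sketch; the helper name is invented:

```python
def block_file_names(block_id, gen_stamp):
    """Derive the on-disk names of a block's data file and metadata file,
    following the pattern seen in the finalized/ directory listing."""
    data_file = f"blk_{block_id}"
    meta_file = f"blk_{block_id}_{gen_stamp}.meta"
    return data_file, meta_file

print(block_file_names(1073741825, 1001))
# ('blk_1073741825', 'blk_1073741825_1001.meta')
```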
For directories:
blk_${BLOCK-ID}:
An XML-format file that records operation logs, similar to the edits file, for example:
<EDITS>
  <EDITS_VERSION>-63</EDITS_VERSION>
  <RECORD>
    <OPCODE>OP_START_LOG_SEGMENT</OPCODE>
    <DATA>
      <TXID>22</TXID>
    </DATA>
  </RECORD>
  <RECORD>
    <OPCODE>OP_MKDIR</OPCODE>
    <DATA>
      <TXID>23</TXID>
      <LENGTH>0</LENGTH>
      <INODEID>16386</INODEID>
      <PATH>/input</PATH>
      <TIMESTAMP>1558105166840</TIMESTAMP>
      <PERMISSION_STATUS>
        <USERNAME>root</USERNAME>
        <GROUPNAME>supergroup</GROUPNAME>
        <MODE>493</MODE>
      </PERMISSION_STATUS>
    </DATA>
  </RECORD>
</EDITS>
blk_${BLOCK-ID}_xxx.meta:
A binary file (the `file` command reports it as "raw G3 data, byte-padded") that mainly stores the inode records of the directory.
For files:
blk_${BLOCK-ID}:
Records the actual data of the block.
blk_${BLOCK-ID}_xxx.meta:
A CRC32 checksum file that stores the checksums of the data block.
3. Verify block integrity
1) When a datanode reads a block, it computes the block's checksum. If it differs from the checksum recorded when the block was created, the copy of the block on this datanode is corrupted, and the client will instead read the block from another datanode that stores it.
2) After a block is created, the datanode also periodically checks it for corruption, again by verifying the checksum.
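The per-chunk checksum idea can be sketched with Python's `zlib.crc32`. This is an illustration of the mechanism, not the real `.meta` file format (HDFS checksums every 512 bytes by default, per `dfs.bytes-per-checksum`):

```python
import zlib

BYTES_PER_CHECKSUM = 512  # dfs.bytes-per-checksum default

def chunk_checksums(data):
    """Compute one CRC32 per 512-byte chunk, as recorded at block creation."""
    return [zlib.crc32(data[i:i + BYTES_PER_CHECKSUM])
            for i in range(0, len(data), BYTES_PER_CHECKSUM)]

def verify(data, stored_checksums):
    """On read, recompute the checksums and compare with the stored ones."""
    return chunk_checksums(data) == stored_checksums

block = b"hello hdfs" * 200        # 2000 bytes -> 4 chunks
meta = chunk_checksums(block)      # what the .meta file conceptually stores

print(verify(block, meta))         # True: the block is intact

corrupted = b"X" + block[1:]       # flip one byte
print(verify(corrupted, meta))     # False: client falls back to another replica
```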
4. Set the datanode timeout parameter
If a datanode process dies, or a datanode cannot communicate with the namenode because of a network failure, the namenode does not immediately declare the datanode dead; it does so only after no heartbeat has arrived for a period of time. The timeout is calculated as:
Timeout = 2 * dfs.namenode.heartbeat.recheck-interval + 10 * dfs.heartbeat.interval
dfs.namenode.heartbeat.recheck-interval: the interval at which the namenode checks whether datanodes are alive. Default 5 minutes; the unit is milliseconds.
dfs.heartbeat.interval: the interval at which a datanode sends heartbeats. Default 3 seconds; the unit is seconds.
Both are set in hdfs-site.xml.
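Plugging the defaults into the formula gives a timeout of 10.5 minutes, which is why the earlier "more than 10 minutes" rule of thumb holds:

```python
# Defaults from hdfs-site.xml, converted to seconds
recheck_interval = 5 * 60   # dfs.namenode.heartbeat.recheck-interval: 5 minutes
heartbeat_interval = 3      # dfs.heartbeat.interval: 3 seconds

timeout = 2 * recheck_interval + 10 * heartbeat_interval
print(timeout)        # 630 seconds
print(timeout / 60)   # 10.5 minutes
```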
5. Multi-directory configuration of datanode
The multi-directory configuration of the datanode differs from that of the namenode: the directories do not hold identical copies of the data. Instead, block data is spread across the configured directories (two in the example below). The configuration is as follows:

<!-- hdfs-site.xml -->
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:///${hadoop.tmp.dir}/dfs/data1,file:///${hadoop.tmp.dir}/dfs/data2</value>
</property>
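With multiple data directories, the datanode must pick a volume for each new block; Hadoop's default policy is round-robin. A minimal sketch of that placement idea (the class name is invented; this is not Hadoop's actual `RoundRobinVolumeChoosingPolicy` code):

```python
class RoundRobinVolumes:
    """Place each new block on the next data directory in turn."""

    def __init__(self, dirs):
        self.dirs = dirs
        self.next = 0

    def choose(self):
        # Return the current directory and advance the cursor cyclically.
        chosen = self.dirs[self.next]
        self.next = (self.next + 1) % len(self.dirs)
        return chosen

vols = RoundRobinVolumes(["/dfs/data1", "/dfs/data2"])
print([vols.choose() for _ in range(4)])
# ['/dfs/data1', '/dfs/data2', '/dfs/data1', '/dfs/data2']
```

So consecutive blocks alternate between the two directories rather than being mirrored into both.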
6. About the actual size of a block
Although the block size is 128 MB (Hadoop 2.x), a block is counted as one block in HDFS metadata even if the data it stores is smaller than 128 MB. On disk, however, it occupies only the actual size of the data, not the full 128 MB: the physical disk allocates space in 4 KB blocks by default, so there is no need to reserve 128 MB up front.
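A quick worked example under the sizes stated above (128 MB logical block, 4 KB physical disk block), showing that a 1 MB file uses one HDFS block but only about 1 MB of disk:

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB logical HDFS block size (Hadoop 2.x)
DISK_BLOCK = 4 * 1024           # 4 KB physical disk allocation unit

file_size = 1 * 1024 * 1024     # a 1 MB file

# Number of HDFS blocks (metadata-wise): ceil(file_size / BLOCK_SIZE)
hdfs_blocks = -(-file_size // BLOCK_SIZE)

# Bytes actually occupied on disk: rounded up to whole 4 KB disk blocks
disk_usage = -(-file_size // DISK_BLOCK) * DISK_BLOCK

print(hdfs_blocks)  # 1
print(disk_usage)   # 1048576  (1 MB, not 134217728)
```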