Saturday, 2019-2-16
Basic concepts of HDFS (design ideas, features, working mechanism, upload/download flows, and the namenode metadata storage mechanism)
1. The general design idea of HDFS:
Design goal: improve the efficiency of distributed, concurrent data processing (increase concurrency and move the computation to the data).
Divide and conquer: large files and large numbers of files are distributed across many independent servers, so that massive data can be processed and analyzed in a divide-and-conquer fashion.
Key concepts: file splitting, replica storage, metadata, location lookup, data read/write streams.
2. Shell operation of HDFS // see the corresponding separate document.
3. Some concepts of HDFS
The basic working mechanism and related concepts of the HDFS distributed file system // see drawing
First, it is a file system with a unified namespace, the directory tree; a client accesses a file in HDFS by specifying a path within this directory tree.
Second, it is distributed: its functionality is provided by many servers working together.
The HDFS file system presents clients with a single abstract directory tree, and the files in HDFS are stored as blocks. The block size can be specified by the configuration parameter dfs.blocksize; the default is 128 MB in the hadoop 2.x versions and 64 MB in older versions.
Who actually stores the blocks of the files? They are distributed across the datanode service nodes, and each block can be stored as multiple replicas (the replica count is set by the parameter dfs.replication; the default value is 3).
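As a concrete illustration, here is a minimal Java sketch of a client overriding these two parameters when writing a file; it assumes the Hadoop client libraries and a core-site.xml pointing at your cluster are on the classpath, and the path /demo/a.txt is just an example:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockConfigDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // per-client overrides of the cluster defaults discussed above
            conf.set("dfs.blocksize", "134217728"); // 128 MB, the hadoop 2.x default
            conf.set("dfs.replication", "2");       // override the default replica count of 3
            FileSystem fs = FileSystem.get(conf);
            try (FSDataOutputStream out = fs.create(new Path("/demo/a.txt"))) {
                out.writeBytes("hello hdfs\n");
            }
            fs.close();
        }
    }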
There is one crucially important role in HDFS: the namenode. It maintains the directory tree of the entire HDFS file system, as well as the block information for each path (file): the ids of the blocks and the datanode servers holding them.
HDFS is designed for write-once, read-many scenarios and does not support modifying files in place.
(HDFS is not suitable for network-disk applications: modification is inconvenient, latency is high, and network overhead and cost are large.)
The definition and concept of HDFS slices (input splits)
1: define a slice size: it can be adjusted by parameter; by default it equals the blocksize configured in HDFS, usually 128 MB.
2: get the List of all pending files in the input data directory
3: traverse the file List and slice the files one by one (see the sketch after this list)
for (file : fileList)
cut the file into one slice per 128 MB, starting from offset 0. For example, a.txt (200 MB) is cut into two slices, a.txt: 0-128MB and a.txt: 128MB-200MB,
while b.txt (80 MB) becomes a single slice, b.txt: 0-80MB.
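A minimal, self-contained Java sketch of this slicing rule (an illustrative toy, not the actual MapReduce FileInputFormat code):

    import java.util.ArrayList;
    import java.util.List;

    public class SliceDemo {
        static class Split {
            final String file; final long start; final long length;
            Split(String file, long start, long length) {
                this.file = file; this.start = start; this.length = length;
            }
            public String toString() { return file + ": " + start + "-" + (start + length); }
        }

        // cut a file into slices of splitSize bytes, starting from offset 0;
        // the last slice may be shorter than splitSize
        static List<Split> slice(String file, long fileLen, long splitSize) {
            List<Split> splits = new ArrayList<>();
            for (long off = 0; off < fileLen; off += splitSize) {
                splits.add(new Split(file, off, Math.min(splitSize, fileLen - off)));
            }
            return splits;
        }

        public static void main(String[] args) {
            long MB = 1L << 20;
            System.out.println(slice("a.txt", 200 * MB, 128 * MB)); // two slices: 0-128 MB, 128-200 MB (printed in bytes)
            System.out.println(slice("b.txt", 80 * MB, 128 * MB));  // one slice: 0-80 MB
        }
    }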
HDFS block replication strategy (a toy sketch follows this list)
- the first replica is placed on the node where the client is located; if the client is outside the cluster, a node is picked at random, and the system prefers an idle DataNode
- the second replica is placed on a node on a different rack
- the third replica is placed on a different machine on the same rack as the second replica
- this gives good stability, load balancing, good write bandwidth and read performance, and an even block distribution
- rack awareness: spreading replicas across racks gives the data high fault tolerance
- the node is the unit of replica placement
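The following toy Java sketch illustrates these placement rules. It is purely illustrative (Hadoop's real BlockPlacementPolicyDefault also weighs load, free space and more), assumes at least two racks, and omits the preference for idle nodes:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Random;

    public class PlacementSketch {
        static Map<String, String> rackOf = new HashMap<>(); // node -> rack

        static List<String> choose(String client, List<String> nodes, Random rnd) {
            // 1st replica: the client's node if it is a datanode, else a random node
            String first = nodes.contains(client)
                    ? client : nodes.get(rnd.nextInt(nodes.size()));
            // 2nd replica: any node on a different rack than the first
            List<String> offRack = new ArrayList<>();
            for (String n : nodes)
                if (!rackOf.get(n).equals(rackOf.get(first))) offRack.add(n);
            String second = offRack.get(rnd.nextInt(offRack.size()));
            // 3rd replica: a different node on the same rack as the second
            List<String> sameRack = new ArrayList<>();
            for (String n : nodes)
                if (!n.equals(second) && rackOf.get(n).equals(rackOf.get(second)))
                    sameRack.add(n);
            String third = sameRack.get(rnd.nextInt(sameRack.size()));
            return Arrays.asList(first, second, third);
        }

        public static void main(String[] args) {
            rackOf.put("dn1", "rack1"); rackOf.put("dn2", "rack1");
            rackOf.put("dn3", "rack2"); rackOf.put("dn4", "rack2");
            System.out.println(choose("dn1", new ArrayList<>(rackOf.keySet()), new Random()));
        }
    }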
4. Characteristics:
Capacity can be scaled out linearly.
Data is stored with high reliability.
Distributed computation over the stored data is convenient.
Data access latency is high, and data modification is not supported.
It suits write-once, read-many application scenarios.
5. The working mechanism of HDFS
An HDFS cluster has two major roles: NameNode and DataNode.
The NameNode is responsible for managing the metadata of the entire file system.
The DataNodes are responsible for managing the users' file blocks.
6. The working mechanism of namenode
Namenode responsibilities:
1. Respond to client requests // when a client accesses HDFS, it always goes to the namenode first
2. Maintain the directory tree // when a client reads or writes a file, it specifies a path; that path is an HDFS directory entry, managed by the namenode
3. Manage the metadata (query, modify) *
// what is metadata?
A description of a file: how many blocks does the file under a given path have, on which datanodes is each block stored, and how many replicas does the file have? This information is the metadata. It is extremely important and must not be lost or corrupted, otherwise the data cannot be served when a client requests it.
Tip: a complete copy of the metadata is kept in memory, including the directory tree structure and the mapping from files to data blocks and replica locations (a toy model is sketched below).
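To make the shape of this in-memory metadata concrete, here is a toy Java model; the names BlockMeta and fileToBlocks are hypothetical, and the real namenode uses far more elaborate structures:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class MetadataToyModel {
        // one entry per block: the block id plus the datanodes holding its replicas
        static class BlockMeta {
            final long blockId;
            final List<String> datanodes = new ArrayList<>();
            BlockMeta(long id) { blockId = id; }
            public String toString() { return "blk_" + blockId + "@" + datanodes; }
        }

        public static void main(String[] args) {
            // path -> ordered block list, mirroring "file -> blocks -> replica locations"
            Map<String, List<BlockMeta>> fileToBlocks = new HashMap<>();
            BlockMeta b0 = new BlockMeta(1073741825L);
            b0.datanodes.addAll(Arrays.asList("dn1", "dn3", "dn4")); // 3 replicas
            fileToBlocks.put("/demo/a.txt", Arrays.asList(b0));
            System.out.println(fileToBlocks);
        }
    }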
7. The working mechanism of datanode
Datanode responsibilities:
1. Store and manage the users' file block data
2. Report its block information to the namenode periodically (through heartbeat messages)
Exercise: upload a file and observe the physical storage of its blocks in this directory on each datanode machine (a snippet for querying a file's block locations follows):
/home/hadoop/app/hadoop-2.4.1/tmp/dfs/data/current/BP-193442119-192.168.2.120-1432457733977/current/finalized
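Besides inspecting that directory directly, you can ask the namenode which datanodes hold each block of the uploaded file. A minimal sketch using the standard Hadoop Java client API (assumes the cluster configuration is on the classpath; /demo/a.txt is an example path):

    import java.util.Arrays;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocationsDemo {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus st = fs.getFileStatus(new Path("/demo/a.txt"));
            // one BlockLocation per block: offset, length, and the datanodes holding replicas
            for (BlockLocation b : fs.getFileBlockLocations(st, 0, st.getLen())) {
                System.out.println("offset=" + b.getOffset()
                        + " len=" + b.getLength()
                        + " hosts=" + Arrays.toString(b.getHosts()));
            }
            fs.close();
        }
    }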
Monday, 2019-2-18
HDFS write data flow (put)
1. The client communicates with the namenode, requesting to upload a file; the namenode checks whether the target file already exists and whether its parent directory exists.
2. The namenode answers whether the file can be uploaded.
3. The client asks which datanode servers the first block should be transferred to.
4. The namenode returns 3 datanode servers: A, B and C.
5. The client asks one of the three datanodes, A, to upload the data (essentially an RPC call that establishes a pipeline). On receiving the request, A calls B, and B in turn calls C, completing the setup of the real pipeline; the result is then returned to the client step by step.
6. The client starts uploading the first block to A (first reading the data from disk into a local memory cache), packet by packet. When A receives a packet it passes it on to B, and B passes it on to C; every packet sent is placed in a reply queue to wait for acknowledgment.
7. When one block has been transferred completely, the client again asks the namenode for the servers to upload the second block to. (A minimal client-side example follows.)
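In client code the whole handshake above is hidden behind a single call. A minimal sketch with the standard Java client API (local and HDFS paths are illustrative; assumes the cluster configuration is on the classpath):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PutDemo {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // drives steps 1-7 above: namenode checks, pipeline setup, packet streaming
            fs.copyFromLocalFile(new Path("file:///tmp/a.txt"), new Path("/demo/a.txt"));
            fs.close();
        }
    }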
HDFS read data flow (get)
1. The client communicates with the namenode to query the metadata and find the datanode servers holding the file's blocks.
2. It selects one datanode (nearest first, then at random) and requests a socket stream to it.
3. The datanode starts sending the data (reading it from disk into the stream, verified packet by packet with checksums).
4. The client receives the data packet by packet, caches it locally first, and then writes it to the target file. (A minimal client-side example follows.)
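The corresponding read path in client code, again a minimal sketch under the same assumptions:

    import java.io.FileOutputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class GetDemo {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            try (FSDataInputStream in = fs.open(new Path("/demo/a.txt"));
                 FileOutputStream out = new FileOutputStream("/tmp/a.txt")) {
                // steps 1-4 above happen inside: locate blocks, stream packets, verify
                IOUtils.copyBytes(in, out, 4096, false);
            }
            fs.close();
        }
    }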
Summary:
The read and write flows described here are the smooth, error-free paths; exceptions can occur at every one of the stages above. HDFS handles each of these exceptions thoroughly and its fault tolerance is very high. The logic for handling them is quite complex, so we will not go into the details for now; understanding the normal read and write flows is enough.
The mechanism by which the namenode manages metadata in HDFS // the CheckPoint of metadata
As shown in the figure:
How is HDFS metadata stored?
A. A complete copy of the metadata (in a specific data structure) is kept in memory.
B. The disk holds a "quasi-complete" mirror file of the metadata (the fsimage).
C. When a client adds or modifies files in HDFS, the operation is first logged to the edits file; once the client's operation succeeds, the corresponding metadata is updated in memory. Every so often, the secondary namenode downloads all the edits accumulated on the namenode, together with the latest fsimage, to its local disk, loads them into memory and merges them (this process is called a checkpoint).
D. Configuration parameters for the trigger conditions of the checkpoint operation:
dfs.namenode.checkpoint.check.period=60  # how often to check whether the trigger conditions are met, in seconds
dfs.namenode.checkpoint.dir=file://${hadoop.tmp.dir}/dfs/namesecondary  # the secondary namenode's local working directory used during checkpoints
dfs.namenode.checkpoint.edits.dir=${dfs.namenode.checkpoint.dir}
dfs.namenode.checkpoint.max-retries=3  # maximum number of retries
dfs.namenode.checkpoint.period=3600  # interval between two checkpoints, 3600 seconds
dfs.namenode.checkpoint.txns=1000000  # maximum number of operation records between two checkpoints
E. The working directory storage structures of the namenode and the secondary namenode are exactly the same. So when the namenode fails and its metadata must be recovered, the fsimage can be copied from the secondary namenode's working directory into the namenode's working directory.
F. You can view the information in an edits file with an hdfs tool:
bin/hdfs oev -i edits -o edits.xml