1 HDFS Overview
1.1 HDFS background and definition
As the amount of data grows, a single operating system can no longer hold all of it, so the data is spread across the disks of many machines. Files scattered across machines are inconvenient to manage and maintain, so a system is urgently needed to manage files on multiple machines: the distributed file management system. HDFS is one such distributed file management system.
HDFS (Hadoop Distributed File System) is, first of all, a file system: it stores files and locates them through a directory tree. Secondly, it is distributed: many servers work together to provide its functionality, and each server in the cluster plays its own role.
HDFS usage scenario: it suits workloads that write once and read many times, and it does not support modifying files in place. It is well suited to data analysis, but not to network-disk (cloud drive) applications.
1.2 advantages and disadvantages of HDFS
Advantages:
(1) High fault tolerance: data is automatically saved as multiple replicas, and fault tolerance is improved by adding replicas; when a replica is lost, it can be recovered automatically.
(2) Suitable for processing big data.
(3) Can be built on cheap machines, with reliability improved through the multi-replica mechanism.
Disadvantages:
(1) Not suitable for low-latency data access, for example millisecond-level reads.
(2) Cannot efficiently store a large number of small files.
(3) Does not support concurrent writes or random modification of files.
1.3 HDFS architecture
1.4 HDFS file block size
Files in HDFS are physically stored in blocks (Block). The block size can be specified through the configuration parameter dfs.blocksize; the default is 128 MB in Hadoop 2.x and 64 MB in earlier versions.
Why 128 MB? If the seek (addressing) time is about 10 ms, it takes about 10 ms to locate the target Block.
The optimal state is reached when the seek time is about 1% of the transfer time, so the transfer time should be about 1 s.
Current disk transfer rates are roughly 100 MB/s, so about 100 MB can be transferred in that second, which is why the default block size is set to 128 MB.
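As a concrete illustration, here is a minimal sketch of overriding the block size from the client side. The cluster address hdfs://hadoop102:9000 and the user djm are borrowed from the client code later in this article; the class name and the local file path are only examples.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // dfs.blocksize is a client-side setting applied to files written by this client;
        // 268435456 bytes = 256 MB (must be a multiple of the checksum chunk size)
        conf.set("dfs.blocksize", "268435456");
        // assumption: cluster address and user as in the HdfsClient example below
        FileSystem fs = FileSystem.get(URI.create("hdfs://hadoop102:9000"), conf, "djm");
        // the local path is illustrative only
        fs.copyFromLocalFile(new Path("C:\\Users\\Administrator\\Desktop\\big.log"), new Path("/"));
        fs.close();
    }
}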
2 Shell operation of HDFS
(1) -help: output the parameters of a command
(2) -ls: display directory information
(3) -mkdir: create a directory on HDFS
(4) -moveFromLocal: cut and paste from local to HDFS
(5) -appendToFile: append a file to the end of an existing file
(6) -cat: display the contents of a file
(7) -chgrp, -chmod, -chown: modify a file's group, permissions and owner, used the same way as in a Linux file system
(8) -copyFromLocal: copy a file from the local file system to an HDFS path
(9) -copyToLocal: copy from HDFS to local
(10) -cp: copy from one HDFS path to another HDFS path
(11) -mv: move files within HDFS directories
(12) -get: equivalent to copyToLocal, that is, download a file from HDFS to local
(13) -getmerge: merge and download multiple files. For example, the HDFS directory /user/djm/test contains multiple files: log.1, log.2, log.3,...
(14) -put: equivalent to copyFromLocal
(15) -tail: display the end of a file
(16) -rm: delete a file or folder
(17) -rmdir: delete an empty directory
(18) -du: show the size of a folder
(19) -setrep: set the number of replicas of a file in HDFS
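The same commands can also be driven from Java through FsShell and ToolRunner. The following is only a small sketch; the class name is illustrative, and fs.defaultFS is assumed to point at the hadoop102 cluster used elsewhere in this article.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FsShell;
import org.apache.hadoop.util.ToolRunner;

public class ShellDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // assumption: the default file system is the cluster used in this article
        conf.set("fs.defaultFS", "hdfs://hadoop102:9000");
        // runs the same logic as "hdfs dfs -ls /"
        int exitCode = ToolRunner.run(conf, new FsShell(), new String[]{"-ls", "/"});
        System.exit(exitCode);
    }
}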
3 HDFS client operation
package com.djm.hdfsclient;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

import java.io.IOException;
import java.net.URI;

public class HdfsClient {
    FileSystem fileSystem = null;

    @Before
    public void init() {
        try {
            fileSystem = FileSystem.get(URI.create("hdfs://hadoop102:9000"), new Configuration(), "djm");
        } catch (IOException e) {
            e.printStackTrace();
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }

    /**
     * Upload a file
     */
    @Test
    public void put() {
        try {
            fileSystem.copyFromLocalFile(new Path("C:\\Users\\Administrator\\Desktop\\Hadoop getting started.md"), new Path("/"));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * Download a file
     */
    @Test
    public void download() {
        try {
            // the last parameter, useRawLocalFileSystem, controls whether file checksum verification is used
            fileSystem.copyToLocalFile(false, new Path("/Hadoop getting started.md"), new Path("C:\\Users\\Administrator\\Desktop\\Hadoop getting started 1.md"), true);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * Delete a file
     */
    @Test
    public void delete() {
        try {
            // the second parameter indicates whether to delete recursively
            fileSystem.delete(new Path("/Hadoop getting started.md"), true);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * Rename a file
     */
    @Test
    public void rename() {
        try {
            fileSystem.rename(new Path("/tmp"), new Path("/temp"));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * View file information
     */
    @Test
    public void ls() {
        try {
            RemoteIterator<LocatedFileStatus> listFiles = fileSystem.listFiles(new Path("/etc"), true);
            while (listFiles.hasNext()) {
                LocatedFileStatus fileStatus = listFiles.next();
                if (fileStatus.isFile()) {
                    // only output file information
                    System.out.print(fileStatus.getPath().getName() + " " + fileStatus.getLen() + " " + fileStatus.getPermission() + " " + fileStatus.getGroup() + " ");
                    // get the block information of the file
                    BlockLocation[] blockLocations = fileStatus.getBlockLocations();
                    for (BlockLocation blockLocation : blockLocations) {
                        // get the node information
                        String[] hosts = blockLocation.getHosts();
                        for (String host : hosts) {
                            System.out.print(host + " ");
                        }
                    }
                    System.out.println();
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    @After
    public void exit() {
        try {
            fileSystem.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

4 HDFS data flow
4.1 HDFS write data flow
4.1.1 Anatomy of a file write
1. The client requests NameNode, through the DistributedFileSystem module, to upload a file; NameNode checks whether the target file already exists and whether its parent directory exists.
2. NameNode returns whether the upload is allowed.
3. The client asks NameNode which DataNodes the first Block should be uploaded to.
4. NameNode returns three nodes: dn1, dn2 and dn3.
5. The client requests dn1 to upload data through the FSDataOutputStream module; when dn1 receives the request it calls dn2, and dn2 in turn calls dn3, completing the communication pipeline.
6. dn3, dn2 and dn1 acknowledge step by step in reverse order back to the client.
7. The client starts uploading the first Block to dn1 (data is first read from disk into a local in-memory cache) in units of Packets. When dn1 receives a Packet it passes it to dn2, and dn2 passes it to dn3; for every Packet it sends, dn1 puts it into an acknowledgement queue and waits for the reply.
8. When one Block has been transferred, the client again asks NameNode for the servers to which the second Block should be uploaded, repeating the steps above.
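From the client's point of view, this pipeline is hidden behind the output stream returned by create(). The following is a minimal sketch, assuming the same hdfs://hadoop102:9000 cluster and djm user as the client code above; the class name and file path are only examples.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://hadoop102:9000"), new Configuration(), "djm");
        // create() corresponds to steps 1-6: NameNode checks the path and sets up the DataNode pipeline
        FSDataOutputStream out = fs.create(new Path("/user/djm/write-demo.txt"));
        // write() buffers data locally and ships it to dn1 in Packets (steps 7-8)
        out.write("hello hdfs".getBytes("UTF-8"));
        // close() flushes the last Packet and waits for the pipeline acknowledgements
        out.close();
        fs.close();
    }
}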
4.1.2 Network Topology-Node distance calculation
When HDFS writes data, NameNode selects the DataNodes closest to the client uploading the data to receive it. How is this nearest distance calculated? The distance between two nodes is the sum of their distances to their closest common ancestor in the network topology: two processes on the same node are at distance 0, two nodes on the same rack are at distance 2, and nodes on different racks in the same data center are at distance 4.
4.1.3 rack awareness
4.2 HDFS read data flow
1. The client requests NameNode, through the DistributedFileSystem module, to download a file; NameNode finds the DataNode addresses where the file blocks are stored by querying its metadata.
2. A DataNode is selected according to the nearest-node principle, and the client requests to read the data from it.
3. The DataNode starts transferring data to the client.
4. The client receives the data in units of Packets, caches it locally, and then writes it to the target file.
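The client-side counterpart of the read path is open(). A minimal sketch under the same cluster assumptions as above; the class name and paths are only examples.

import java.io.FileOutputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://hadoop102:9000"), new Configuration(), "djm");
        // open() asks NameNode for the block locations, then reads from the nearest DataNode
        FSDataInputStream in = fs.open(new Path("/user/djm/write-demo.txt"));
        // copy the stream to a local file; 4096 is the buffer size, true closes both streams
        IOUtils.copyBytes(in, new FileOutputStream("write-demo-local.txt"), 4096, true);
        fs.close();
    }
}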
5 Working mechanism of NameNode and SecondaryNameNode
5.1 NN and 2NN
If the metadata were stored only on the NameNode's disk, it would be inefficient, because the metadata is accessed randomly and must be used to respond to client requests, so the metadata has to be kept in memory. As we all know, memory is fast but loses its contents on power failure; once the power is off, the metadata is lost and the whole cluster stops working. Therefore FsImage is used to back up the metadata on disk.
But this raises another question: when the metadata in memory is updated, should FsImage be updated at the same time? Updating it synchronously would be inefficient; not updating it would cause a data consistency problem, and a sudden power failure would lose part of the data. Therefore Edits is introduced (append-only, and therefore very efficient): whenever metadata is added or modified, the change is first appended to Edits and then applied to memory, so that after a power failure the metadata can be reconstructed by merging FsImage and Edits.
However, if changes are appended to Edits for a long time, the file becomes too large, which still hurts efficiency, and recovering the metadata after a power failure takes correspondingly longer. FsImage and Edits therefore need to be merged regularly. If this merge were done by NameNode itself it would be inefficient (write requests cannot be processed during the merge), so SecondaryNameNode is introduced to perform the merge.
NameNode work:
1. When NameNode is started after being formatted for the first time, it creates the Fsimage and Edits files; if it is not the first start, it loads the edit log and the image file directly into memory.
2. The client sends requests to add, delete or modify metadata.
3. NameNode records the operation log and updates the rolling log.
4. NameNode adds, deletes and modifies metadata in memory.
Secondary NameNode work:
1. Secondary NameNode asks NameNode whether a CheckPoint is needed, and NameNode returns the result of the check.
2. Secondary NameNode requests that a CheckPoint be executed.
3. NameNode rolls the Edits log that is currently being written.
4. The edit log and image file from before the roll are copied to Secondary NameNode.
5. Secondary NameNode loads the edit log and the image file into memory and merges them.
6. Generate a new image file fsimage.chkpoint.
7. Copy fsimage.chkpoint to NameNode.
8. NameNode renames fsimage.chkpoint to fsimage.
5.2 Fsimage and Edits parsing
oiv: view an Fsimage file
hdfs oiv -p <file type> -i <fsimage file> -o <output path for the converted file>
oev: view an Edits file
hdfs oev -p <file type> -i <edits file> -o <output path for the converted file>
5.3 CheckPoint time setting
A merge (CheckPoint) is triggered when either of the following conditions is met:
Normally, SecondaryNameNode performs a CheckPoint once every hour.
[hdfs-default.xml]
<property>
  <name>dfs.namenode.checkpoint.period</name>
  <value>3600</value>
</property>
The number of operations is checked once a minute; when it reaches one million, a CheckPoint is triggered.
[hdfs-default.xml]
<property>
  <name>dfs.namenode.checkpoint.txns</name>
  <value>1000000</value>
  <description>number of operations (transactions)</description>
</property>
<property>
  <name>dfs.namenode.checkpoint.check.period</name>
  <value>60</value>
  <description>check the number of operations every minute</description>
</property>
5.4 NameNode fault handling
After a NameNode failure, you can recover data in the following two ways:
Copy the data from 2NN to the directory where NN stores the data.
Start the NN daemon with the -importCheckpoint option, which copies the data from 2NN into the NN directory:
hdfs namenode -importCheckpoint
5.5 cluster safe mode
Basic commands:
hdfs dfsadmin -safemode get: view the safe mode status
hdfs dfsadmin -safemode enter: enter the safe mode state
hdfs dfsadmin -safemode leave: leave the safe mode state
hdfs dfsadmin -safemode wait: wait for safe mode to end (the command blocks until safe mode is turned off)
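The safe mode state can also be queried from Java through DistributedFileSystem. The following is only a sketch, again assuming the hadoop102 cluster used throughout this article; the class name is illustrative.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.HdfsConstants.SafeModeAction;

public class SafeModeDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://hadoop102:9000"), new Configuration(), "djm");
        DistributedFileSystem dfs = (DistributedFileSystem) fs;
        // SAFEMODE_GET only queries the state, like "hdfs dfsadmin -safemode get"
        boolean inSafeMode = dfs.setSafeMode(SafeModeAction.SAFEMODE_GET);
        System.out.println("in safe mode: " + inSafeMode);
        dfs.close();
    }
}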
6 Working mechanism of DataNode
6.1 DataNode
1. A data block is stored on disk as files on the DataNode: one file holds the data itself, and the other holds metadata including the length of the data block, the checksum of the block data, and a timestamp.
2. After a DataNode starts, it registers with NameNode; once registered, it reports all of its block information to NameNode periodically (every hour).
3. A heartbeat is sent every 3 seconds, and the heartbeat response carries any commands NameNode has for that DataNode, such as copying a block to another machine or deleting a block. If no heartbeat is received from a DataNode for more than 10 minutes, the node is considered unavailable.
4. Machines can safely join and leave the cluster while it is running.
6.2 data integrity
1. When DataNode reads Block, it calculates CheckSum.
2. If the calculated CheckSum is different from the value when the Block was created, the Block has been corrupted.
3. The client then reads the Block from another DataNode.
4. DataNode periodically verifies the CheckSum after the file is created.
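From the client side, a file-level checksum built from the per-block checksums can be retrieved through the FileSystem API. The following is only a sketch under the same cluster assumptions, with an illustrative class name and path.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChecksumDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://hadoop102:9000"), new Configuration(), "djm");
        // getFileChecksum() aggregates the per-block checksums stored on the DataNodes
        FileChecksum checksum = fs.getFileChecksum(new Path("/user/djm/write-demo.txt"));
        if (checksum != null) {
            System.out.println(checksum.getAlgorithmName() + " : " + checksum);
        }
        fs.close();
    }
}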
6.3 setting of the offline time limit parameters
NameNode does not declare a DataNode dead as soon as a heartbeat is missed; the timeout is TimeOut = 2 * dfs.namenode.heartbeat.recheck-interval + 10 * dfs.heartbeat.interval, which with the default values below is 10 minutes plus 30 seconds.
[hdfs-site.xml]
<property>
  <name>dfs.namenode.heartbeat.recheck-interval</name>
  <value>300000</value>
  <description>in milliseconds</description>
</property>
<property>
  <name>dfs.heartbeat.interval</name>
  <value>3</value>
  <description>in seconds</description>
</property>
6.4 commissioning a new data node
Copy the Java and Hadoop installations and the profile file from hadoop102 to the new host, source the profile, and start the DataNode directly; it will join the cluster.
6.5 decommissioning old data nodes
6.5.1 blacklist settings
Create a blacklist
[djm@hadoop101 hadoop]$ touch blacklist
Add the hosts to be blacklisted:
hadoop102
Configure hdfs-site.xml
<property>
  <name>dfs.hosts.exclude</name>
  <value>/opt/module/hadoop-2.7.2/etc/hadoop/blacklist</value>
</property>
Refresh NameNode
[djm@hadoop102 hadoop-2.7.2]$ hdfs dfsadmin -refreshNodes
Update the ResourceManager nodes
[djm@hadoop102 hadoop-2.7.2]$ yarn rmadmin -refreshNodes
If the data is not balanced, you can use a command to rebalance the cluster.
[djm@hadoop102 hadoop-2.7.2]$ start-balancer.sh
6.5.2 whitelist setting
Create a whitelist
[djm@hadoop101 hadoop]$ touch whitelist
Add the hosts to be whitelisted:
hadoop102
hadoop103
hadoop104
Configure hdfs-site.xml
<property>
  <name>dfs.hosts</name>
  <value>/opt/module/hadoop-2.7.2/etc/hadoop/whitelist</value>
</property>
Refresh NameNode
[djm@hadoop102 hadoop-2.7.2]$ hdfs dfsadmin -refreshNodes
Update the ResourceManager nodes
[djm@hadoop102 hadoop-2.7.2]$ yarn rmadmin -refreshNodes
If the data is not balanced, you can use a command to rebalance the cluster.
[djm@hadoop102 hadoop-2.7.2]$ start-balancer.sh
The difference between black and white lists:
The whitelist is stricter, while the blacklist is gentler: a host on the blacklist stays in the cluster while its data is synchronized elsewhere, but it no longer serves requests, whereas a host that is not on the whitelist is killed directly.
6.6 Datanode multi-directory configuration
DataNode can also be configured with multiple directories; each directory stores different data, that is, the directories are not replicas of one another.
[hdfs-site.xml]
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:///${hadoop.tmp.dir}/dfs/data1,file:///${hadoop.tmp.dir}/dfs/data2</value>
</property>
7 HDFS 2.x new features
7.1 Inter-cluster data copy
The distcp command performs recursive data copying between two Hadoop clusters:
[djm@hadoop102 hadoop-2.7.2]$ hadoop distcp hdfs://hadoop102:9000/user/djm/hello.txt hdfs://hadoop103:9000/user/djm/hello.txt
7.2 small file archive
Archive files:
[djm@hadoop102 hadoop-2.7.2]$ hadoop archive -archiveName input.har -p /user/djm/input /user/djm/output
View the archive:
[djm@hadoop102 hadoop-2.7.2]$ hadoop fs -lsr har:///user/djm/output/input.har
Unarchive files:
[atguigu@djm hadoop-2.7.2]$ hadoop fs -cp har:///user/djm/output/input.har/* /user/djm
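An archive can also be read from Java by addressing it with a har:// URI, mirroring the shell command above. The following is only a sketch, assuming the archive created above and the hadoop102 cluster; the class name is illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HarDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // assumption: the underlying file system for har:/// is the article's cluster
        conf.set("fs.defaultFS", "hdfs://hadoop102:9000");
        // the har scheme exposes the archive as a read-only file system
        Path harPath = new Path("har:///user/djm/output/input.har");
        FileSystem harFs = harPath.getFileSystem(conf);
        for (FileStatus status : harFs.listStatus(harPath)) {
            System.out.println(status.getPath());
        }
        harFs.close();
    }
}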