This article explains what hadoop distcp is. The explanation is simple and clear, and it should be easy to follow and learn from.
Overview
DistCp (distributed copy) is a tool for copying data within and between large clusters. It uses Map/Reduce for file distribution, error handling and recovery, and report generation. It expands a list of files and directories into the input of map tasks, and each task copies a portion of the files in the source list. Because it is implemented with Map/Reduce, the tool has special behavior in both its semantics and its execution. This document provides a guide to common DistCp operations and describes its working model.
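As a minimal sketch, a basic inter-cluster copy looks like the following (the NameNode hostnames and paths are illustrative):
hadoop distcp hdfs://master1:8020/foo/bar hdfs://master2:8020/bar/foo
This expands the namespace under /foo/bar on master1 into a list of files, partitions the list among a set of map tasks, and each map copies its share of the files to /bar/foo on master2.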
Options (option / description / remarks):
-p[rbugp]  Preserve status: r (replication number), b (block size), u (user), g (group), p (permission). Modification times are not preserved.
-i  Ignore failures. As mentioned in the appendix, this option provides more accurate statistics about the copy than the default, and it also keeps logs of failed copy operations, which can be used for debugging. Finally, if a map fails before all the attempts for its chunk have been exhausted, this does not cause the whole job to fail.
-log <logdir>  Write logs to <logdir>. DistCp logs each attempt to copy each file and records it as map output. If a map fails, its log is not kept when it is re-executed.
-m <num_maps>  Maximum number of simultaneous copies. Specifies the number of maps used to copy data. Note that more maps do not necessarily mean higher throughput.
-overwrite  Overwrite the destination. If a map fails and -i is not used, all files in that split, not only the ones that failed to copy, are copied again. As mentioned below, it also changes the semantics of how destination paths are generated, so users should use it with care.
-update  Overwrite if the source and destination sizes differ. As mentioned earlier, this is not a "synchronization" operation. The only criterion for overwriting is whether the source and destination files have the same size; if they differ, the source file replaces the destination file. As mentioned below, it also changes the semantics of how destination paths are generated, so users should use it with care.
-f <urilist_uri>  Use <urilist_uri> as the source file list. This is equivalent to listing all the file names on the command line. The urilist_uri list should contain complete, legal URIs.
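For illustration, and reusing the hostnames from this document, a copy that ignores per-file failures, writes logs to the destination cluster, and caps the number of maps could look like this (the log directory is illustrative):
hadoop distcp -i -log hdfs://master2:8020/logs/distcp -m 20 hdfs://master1:8020/foo hdfs://master2:8020/bar
Equivalently, the sources can be supplied through -f, where a file such as hdfs://master1:8020/srclist (a hypothetical path) contains one source URI per line:
hadoop distcp -f hdfs://master1:8020/srclist hdfs://master2:8020/bar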
Update and overwrite
Here are some examples of -update and -overwrite. Consider a copy from /foo/a and /foo/b to /bar/foo, where the source paths include:
hdfs://master1:8020/foo/a
hdfs://master1:8020/foo/a/aa
hdfs://master1:8020/foo/a/ab
hdfs://master1:8020/foo/b
hdfs://master1:8020/foo/b/ba
hdfs://master1:8020/foo/b/ab
If either -update or -overwrite is set, the contents of each source directory are compared with the contents of the destination directory, so both sources map an entry to /bar/foo/ab at the destination. DistCp terminates the operation and exits when it encounters such a conflict.
By default (with neither option set), the /bar/foo/a and /bar/foo/b directories are both created, so there is no conflict.
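To make this concrete, a default copy of the paths above (assuming the destination cluster is master2, as in the later examples) would produce:
hdfs://master2:8020/bar/foo/a
hdfs://master2:8020/bar/foo/a/aa
hdfs://master2:8020/bar/foo/a/ab
hdfs://master2:8020/bar/foo/b
hdfs://master2:8020/bar/foo/b/ba
hdfs://master2:8020/bar/foo/b/ab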
Now consider a copy that is legal with -update:
hadoop distcp -update hdfs://master1:8020/foo/a \
hdfs://master1:8020/foo/b \
hdfs://master2:8020/bar
where the source paths and sizes are:
hdfs://master1:8020/foo/a
hdfs://master1:8020/foo/a/aa 32
hdfs://master1:8020/foo/a/ab 32
hdfs://master1:8020/foo/b
hdfs://master1:8020/foo/b/ba 64
hdfs://master1:8020/foo/b/bb 32
and the destination paths and sizes are:
hdfs://master2:8020/bar
hdfs://master2:8020/bar/aa 32
hdfs://master2:8020/bar/ba 32
hdfs://master2:8020/bar/bb 64
The copy will produce:
hdfs://master2:8020/bar
hdfs://master2:8020/bar/aa 32
hdfs://master2:8020/bar/ab 32
hdfs://master2:8020/bar/ba 64
hdfs://master2:8020/bar/bb 32
Only the aa file on master2 is not overwritten. If the -overwrite option is specified, all files are overwritten.
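To spell out the size comparison that -update performs here: aa is 32 bytes on both sides, so it is left alone; ab does not exist at the destination, so it is copied; ba is 64 bytes at the source but 32 at the destination, so it is replaced; bb is 32 bytes at the source but 64 at the destination, so it is replaced as well.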
Appendix
Number of maps
DistCp tries to divide the content to be copied evenly, so that each map copies roughly the same amount of data. However, because files are the smallest unit of copying, increasing the number of simultaneous copiers (that is, maps) does not necessarily increase the number of actual simultaneous copies or the total throughput.
If the -m option is not used, DistCp schedules min(total_bytes / bytes.per.map, 20 * num_task_trackers) maps, where bytes.per.map defaults to 256 MB.
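As a hypothetical example of this formula, copying 100 GB (102400 MB) with the default bytes.per.map of 256 MB on a cluster with 30 TaskTrackers would give min(102400 / 256, 20 * 30) = min(400, 600) = 400 maps.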
For jobs that run for a long time or run regularly, it is recommended to adjust the number of maps according to the size of the source and destination clusters, the size of the copy, and the available bandwidth.
hadoop distcp -Ddistcp.bytes.per.map=1073741824 -Ddfs.client.socket-timeout=240000000 -Dipc.client.connect.timeout=40000000 -i -update hdfs://master1:8020/foo/a hdfs://master1:8020/foo/b hdfs://master2:8020/bar/foo
Copies between different HDFS versions
For copies between different Hadoop versions, users should use HftpFileSystem. This is a read-only file system, so DistCp must run on the destination cluster (more precisely, on TaskTrackers that can write to the destination cluster). A source is specified as hftp://<dfs.http.address>/<path> (the default dfs.http.address port is 50070).
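As an illustrative sketch (reusing the hostnames from this document), such a copy would be run on the destination cluster as:
hadoop distcp hftp://master1:50070/foo/bar hdfs://master2:8020/bar/foo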
Map/Reduce and side effects
As mentioned earlier, when a map fails to copy one of its input files, there are several side effects:
Unless -i is specified, the logs generated by the task are replaced by those of the new attempt.
Unless -overwrite is specified, a file that was successfully copied by a previous map is marked as "ignored" when the copy is executed again.
If a map fails mapred.map.max.attempts times, the remaining map tasks are killed (unless -i is used).
If mapred.speculative.execution is marked final and set to true, the result of the copy is undefined.
Thank you for reading. This concludes the explanation of what hadoop distcp is; after studying this article you should have a deeper understanding of it, and specific usage should be verified in practice.