Distcp distributed copy

2025-03-01 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/03 Report--

(1) DistCp principle

DistCp (Distributed Copy) is a high-performance tool for copying data within a large cluster or between clusters. It plays the same role as cp and scp on Linux: cp copies local files and directories to another location on the same machine, scp copies files or directories from machine A to machine B, and DistCp copies data from HDFS cluster A to HDFS cluster B. During a copy, the DataNodes (DNs) distributed across cluster A can send data to the DataNodes of cluster B at the same time. This removes the network-card rate limit of a single-machine copy and makes the copy more efficient.

DistCp uses MapReduce tasks for file distribution, error handling and recovery, and report generation. It takes the list of files and directories as the input to the map tasks, and each map task copies a portion of the files in the source list. DistCp is a map-only job: it runs no reduce tasks.
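The idea of handing each map task a portion of the source list can be illustrated with a small sketch. This is not DistCp's actual code: the function name `split_into_map_inputs`, the greedy grouping rule and the `target_bytes` threshold are assumptions made only for illustration.

```python
def split_into_map_inputs(files, target_bytes):
    """Greedily group (path, size) pairs so each group holds roughly target_bytes.

    Hypothetical illustration: each returned group stands for the input
    of one map task, which would copy exactly those files.
    """
    groups = []
    current, current_bytes = [], 0
    for path, size in files:
        current.append(path)
        current_bytes += size
        # Close the group once it has reached the per-map byte target.
        if current_bytes >= target_bytes:
            groups.append(current)
            current, current_bytes = [], 0
    if current:
        # Leftover files form the last, smaller group.
        groups.append(current)
    return groups


# Made-up file list: five files, about 300 bytes per map task.
source_list = [("a", 100), ("b", 200), ("c", 50), ("d", 300), ("e", 10)]
for index, group in enumerate(split_into_map_inputs(source_list, 300)):
    print(f"map task {index}: {group}")
```

Because every group is copied independently, a failed map task can be retried without repeating the work of the others.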

(2) Use scenarios

1: Remote disaster recovery backups of data.

2: Data migration when a data center is taken offline.

3: Near-real-time data synchronization.

(3) Advantages of DistCp

1: Bandwidth limiting is supported. The bandwidth parameter caps the throughput of each map task, and the map concurrency setting caps how many map tasks run at once. Together they bound the bandwidth of the entire copy job, so it cannot saturate the network and affect other services.

2: Multiple copy modes that check the source against the destination are supported: overwrite (replace the destination), update (incremental write) and delete (remove destination files that no longer exist in the source). When a large volume of data is copied, it must be verified during the copy to ensure that the source and destination stay consistent.
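Both advantages can be made concrete with a short sketch. It is a simplified model, not DistCp's implementation: all function names are hypothetical, the bandwidth unit is left abstract because the article does not state it, and the update rule compares file sizes only (which corresponds to running with CRC checks skipped).

```python
def aggregate_bandwidth(per_map_limit, concurrent_maps):
    """Upper bound for the whole job, in the same unit as per_map_limit."""
    return per_map_limit * concurrent_maps


def max_maps_for_budget(total_budget, per_map_limit):
    """Largest map concurrency whose aggregate bandwidth fits the budget."""
    return total_budget // per_map_limit


def should_copy(mode, src_size, dst_size):
    """Decide whether one source file is copied.

    dst_size is None when the file does not exist at the destination.
    mode is "overwrite", "update", or anything else for the default.
    """
    if dst_size is None:
        return True                  # missing at the destination: always copy
    if mode == "overwrite":
        return True                  # replace regardless of the existing file
    if mode == "update":
        return src_size != dst_size  # assumption: size-only comparison
    return False                     # default: existing files are skipped


def stale_destination_files(src_names, dst_names):
    """Files a delete pass would remove: present in dst but not in src."""
    return sorted(set(dst_names) - set(src_names))


print(aggregate_bandwidth(10, 20))
print(should_copy("update", 5, 7))
print(stale_destination_files(["a", "b"], ["b", "c"]))
```

The first two functions show why both knobs matter: limiting only the per-map rate still lets the total grow with the number of concurrent maps.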

(4) DistCp command

Command format

hadoop distcp \
  -Dmapred.jobtracker.maxtasks.per.job=1800000 \
  -Dmapred.job.max.map.running=4000 \
  -Ddistcp.bandwidth=150000000 \
  -Ddfs.replication=2 \
  -Ddistcp.skip.dir=$skipPath \
  -Dmapred.map.max.attempts=9 \
  -Dmapred.fairscheduler.pool=distcp \
  -pugp \
  -i \
  -skipcrccheck \
  hdfs://clusterA:9000/AAA/data \
  hdfs://clusterB:9000/BBB/data

Parameter description

-Dmapred.jobtracker.maxtasks.per.job: maximum number of map tasks (the data is divided among multiple map tasks)

-Dmapred.job.max.map.running: maximum map concurrency

-Ddistcp.bandwidth: bandwidth limit

-Ddfs.replication: replication factor (two copies here)

-Ddistcp.skip.dir: directories to filter out (directories that are not copied)

-Dmapred.map.max.attempts: maximum number of attempts per task

-Dmapred.fairscheduler.pool: scheduler pool in which the job runs

-pugp: preserve attributes (user, group, permission)

-i: ignore failed tasks

-skipcrccheck: skip CRC verification (prevents the job from failing when the source and target clusters run different HDFS versions)

hdfs://clusterA:9000/AAA/data: source address

hdfs://clusterB:9000/BBB/data: destination address
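When a command like this is assembled by a script, building the argument list programmatically avoids missing spaces between options. The sketch below is a minimal, hypothetical helper (`build_distcp_command` is not part of Hadoop). It only prints the command and uses a few of the options shown above, with the article's example values.

```python
import shlex


def build_distcp_command(src, dst, properties, flags):
    """Return the argument list for a `hadoop distcp` invocation.

    properties: mapping of -D property names to values.
    flags: plain options such as "-pugp" or "-skipcrccheck".
    """
    argv = ["hadoop", "distcp"]
    for key, value in properties.items():
        # Each property becomes one -Dkey=value argument.
        argv.append(f"-D{key}={value}")
    argv.extend(flags)
    # Source comes first, destination last.
    argv.extend([src, dst])
    return argv


argv = build_distcp_command(
    "hdfs://clusterA:9000/AAA/data",
    "hdfs://clusterB:9000/BBB/data",
    {"distcp.bandwidth": 150000000, "dfs.replication": 2},
    ["-pugp", "-skipcrccheck"],
)
# shlex.join quotes arguments where needed, so the line can be pasted into a shell.
print(shlex.join(argv))
```

Keeping the arguments as a list also means they can be passed directly to `subprocess.run` without going through a shell.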

(5) Execution output

[work@hq distcp]$ hadoop distcp \
  -Dmapred.jobtracker.maxtasks.per.job=1800000 \
  -Dmapred.job.max.map.running=4000 \
  -Ddistcp.bandwidth=150000000 \
  -Ddfs.replication=2 \
  -Dmapred.map.max.attempts=9 \
  -Dmapred.fairscheduler.pool=distcp \
  -pugp -i -skipcrccheck \
  hdfs://clusterA:9000/AAA/data \
  hdfs://clusterB:9000/BBB/data
17/06/03 17:06:38 INFO tools.DistCp: srcPaths=[hdfs://clusterA:9000/AAA/data]
17/06/03 17:06:38 INFO tools.DistCp: destPath=hdfs://clusterB:9000/BBB/data
17/06/03 17:06:39 INFO tools.DistCp: config no skip dir
17/06/03 17:06:40 INFO tools.DistCp: sourcePathsCount=241
17/06/03 17:06:40 INFO tools.DistCp: filesToCopyCount=240
17/06/03 17:06:40 INFO tools.DistCp: bytesToCopyCount=0.0
17/06/03 17:06:40 INFO tools.DistCp: mapTasks: 1
17/06/03 17:06:40 INFO corona.SessionDriver: My serverSocketPort 36822
17/06/03 17:06:40 INFO corona.SessionDriver: My Address 10.160.115.122:36822
17/06/03 17:06:40 INFO corona.SessionDriver: Connecting to cluster manager at jobtracker:8021
17/06/03 17:06:40 INFO corona.SessionDriver: HeartbeatInterval=15000
17/06/03 17:06:40 INFO corona.SessionDriver: Got session ID job_201706031706_267270
17/06/03 17:06:40 INFO tools.DistCp: targetsize=268435456
17/06/03 17:06:40 INFO tools.DistCp: targetfiles=500
17/06/03 17:06:40 INFO corona.SessionDriver: Started session job_201706031706_267270
17/06/03 17:06:45 INFO mapred.JobClient: map 0% reduce 0%
17/06/03 17:06:59 INFO mapred.JobClient: map 3% reduce 0%
17/06/03 17:07:01 INFO mapred.JobClient: map 5% reduce 0%
17/06/03 17:07:05 INFO mapred.JobClient: map 6% reduce 0%
...
17/06/03 17:11:15 INFO mapred.JobClient: map 97% reduce 0%
17/06/03 17:11:17 INFO mapred.JobClient: map 100% reduce 0%
17/06/03 17:11:25 INFO corona.SessionDriver: Stopping session driver
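The `mapred.JobClient` entries in this output report progress as `map N% reduce M%`. For monitoring a long copy, those values can be extracted from captured output with a regular expression. The sketch below is a minimal example; the function name `parse_progress` is hypothetical, and the sample text is a shortened copy of the log above.

```python
import re

# Matches the progress part of a JobClient line, e.g. "map 97% reduce 0%".
PROGRESS = re.compile(r"map\s+(\d+)\s*%\s+reduce\s+(\d+)\s*%")


def parse_progress(log_text):
    """Return a list of (map_percent, reduce_percent) tuples in log order."""
    return [(int(m), int(r)) for m, r in PROGRESS.findall(log_text)]


sample = """\
17/06/03 17:06:45 INFO mapred.JobClient: map 0% reduce 0%
17/06/03 17:06:59 INFO mapred.JobClient: map 3% reduce 0%
17/06/03 17:11:17 INFO mapred.JobClient: map 100% reduce 0%
"""

progress = parse_progress(sample)
# The reduce value stays at 0 because the copy runs only map tasks.
finished = bool(progress) and progress[-1][0] == 100
print(progress, finished)
```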

(6) Main parameters

Hadoop 1 version

distcp [OPTIONS] <srcurl>* <desturl>

Options:

-p[rbugpt]: preserve status

r: replication factor

b: block size

u: user

g: group

p: permission

t: modification and access times

-p alone is equivalent to -prbugpt

-i: ignore failures

-basedir <basedir>: use <basedir> as the base directory when copying files from <srcurl>

-log <logdir>: write logs to <logdir>

-m <num_maps>: maximum number of simultaneous copies

-overwrite: overwrite the destination

-update: overwrite if the src size differs from the dst size

-skipcrccheck: do not use CRC checks to determine whether src differs from dst

-copybychunk: split files into chunks and copy them chunk by chunk

-f <urilist_uri>: use the list at <urilist_uri> as the src list

-filelimit <n>: limit the total number of files to at most n
