[TOC]
I. Reasons why MapReduce runs slowly
1) computer performance
CPU, memory, disk health, network.
The file system can be configured not to update the file access time (atime), which reduces disk overhead.
2) I/O operation issues
(1) Data skew
(2) Unreasonable numbers of map and reduce tasks
(3) Maps run too long, causing reduces to wait too long
(4) Too many small files
(5) Large numbers of very large files that cannot be split into blocks
(6) Too many spills
(7) Too many merges, and so on
II. Optimization scheme
MapReduce optimization is mainly considered from several aspects: data input, the Map phase, the Reduce phase, IO transfer, and data skew.
1. Data input phase
1) Merge small files: merge small files before running the MR job. A large number of small files produces a large number of map tasks, and because each task takes time to load, the extra task startup makes the MR job run slowly.
2) Use CombineTextInputFormat as the input format to handle the case of many small files on the input side; a minimal sketch follows.
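A minimal driver sketch of switching a job to CombineTextInputFormat, assuming defaults for everything else (the class name, job name and the 4 MB split size are illustrative, not from the original article):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombineSmallFilesDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "combine-small-files");

        // Pack many small files into fewer input splits, so fewer map tasks are launched.
        job.setInputFormatClass(CombineTextInputFormat.class);
        CombineTextInputFormat.addInputPath(job, new Path(args[0]));
        CombineTextInputFormat.setMaxInputSplitSize(job, 4L * 1024 * 1024); // 4 MB per split (example value)

        // Mapper/reducer classes are omitted here; a real job would set its own.
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

With this in place, hundreds of small input files are grouped into a handful of splits instead of one split (and one map task) per file.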
2. Map stage
1) Reduce the number of spills: by adjusting the mapreduce.task.io.sort.mb and mapreduce.map.sort.spill.percent parameters, raise the memory threshold that triggers a spill, so fewer spills occur and less disk IO is done.
2) Reduce the number of merges: by adjusting the mapreduce.task.io.sort.factor parameter, increase the number of files merged at the same time, so fewer merge passes are needed and the MR processing time is shortened. In essence this increases the number of open file handles.
3) After the map, and provided the business logic is not affected, apply a combiner first to reduce IO; a sketch of these settings follows the list.
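A minimal sketch of these Map-side settings in a driver, with illustrative values (the combiner assumes a word-count-style job whose reduce logic sums integer counts; IntSumReducer is Hadoop's stock reducer for that case):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class MapSideTuning {
    public static Job configure() throws Exception {
        Configuration conf = new Configuration();
        // Larger sort buffer and higher spill threshold -> fewer spills to disk.
        conf.setInt("mapreduce.task.io.sort.mb", 200);            // default 100 (MB)
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.90f); // default 0.80
        // Merge more spill files per pass -> fewer merge rounds.
        conf.setInt("mapreduce.task.io.sort.factor", 50);         // default 10

        Job job = Job.getInstance(conf, "map-side-tuning");
        // A combiner is only safe when the reduce logic is associative and commutative,
        // e.g. summing counts as IntSumReducer does.
        job.setCombinerClass(IntSumReducer.class);
        return job;
    }
}
```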
3. Reduce stage
1) Set the numbers of map and reduce tasks reasonably: neither should be too small or too large. Too few tasks make tasks wait and lengthen the processing time; too many make the map and reduce tasks compete for resources, causing errors such as processing timeouts.
2) Let map and reduce coexist: adjust the mapreduce.job.reduce.slowstart.completedmaps parameter (default 0.05), which means reduce tasks start once at least 5% of the map tasks have completed. Once the maps have run to a certain point, the reduces start running too, which shortens the time the reduces spend waiting.
3) Avoid using reduce where possible: reduce causes a lot of network traffic when it is used to connect data sets.
4) Set the reduce-side buffer reasonably: by default, when the shuffled data reaches a threshold it is written from the buffer to disk, and reduce then reads all of its input back from disk. In other words, the buffer and reduce are not directly connected; there is a write-to-disk then read-from-disk step in between. Because of this drawback, you can configure mapreduce.reduce.input.buffer.percent (default 0.0) so that part of the data in the buffer is handed to reduce directly, reducing IO overhead: when the value is greater than 0, the specified proportion of memory is kept as a read buffer whose data goes straight to reduce. Since the buffer, the data being read, and the reduce computation all need memory, tune this according to how the job actually runs. A sketch of these two settings follows the list.
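A minimal sketch of the two Reduce-side settings mentioned above, with illustrative values (0.5 and 0.2 are examples, not recommendations from the article):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReduceSideTuning {
    public static Job configure() throws Exception {
        Configuration conf = new Configuration();
        // Start reducers once 50% of the maps have finished instead of 5%.
        conf.setFloat("mapreduce.job.reduce.slowstart.completedmaps", 0.5f); // default 0.05
        // Keep 20% of reduce memory as an input buffer handed directly to reduce,
        // avoiding the write-to-disk / read-from-disk round trip for that portion.
        conf.setFloat("mapreduce.reduce.input.buffer.percent", 0.2f);        // default 0.0
        return Job.getInstance(conf, "reduce-side-tuning");
    }
}
```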
4. IO transfer phase
1) Use data compression to reduce network IO time. Install the Snappy and LZO compression codecs.
2) Use SequenceFile binary files: they are written faster and are more compact than plain text. A sketch of both settings follows.
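A minimal driver sketch combining both ideas: Snappy-compressed map output plus a block-compressed SequenceFile as the final output (values are illustrative, and Snappy must actually be installed on the cluster as noted above):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class IoTransferTuning {
    public static Job configure() throws Exception {
        Configuration conf = new Configuration();
        // Compress intermediate map output to cut shuffle traffic over the network.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                SnappyCodec.class, CompressionCodec.class);

        Job job = Job.getInstance(conf, "io-transfer-tuning");
        // Write the final output as a block-compressed SequenceFile instead of plain text.
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        SequenceFileOutputFormat.setCompressOutput(job, true);
        SequenceFileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
        SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);
        return job;
    }
}
```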
5. Data skew
1) data skew phenomenon
Data frequency skew: the amount of data in one area is much larger than in other areas.
Data size skew: some records are much larger than the average.
2) Methods of reducing data skew
MapReduce data skew usually occurs in the reduce phase, because after the map output is partitioned, each partition is handled by one reduce. If the amount of data varies greatly between partitions, the reduce processing times will differ just as much, and the slowest reduce drags down the whole job (the bucket effect), so data skew needs to be avoided.
Method 1: sampling and range partitioning
The partition boundary values can be preset from a result set obtained by sampling the original data (also mentioned in hive).
Method 2: custom partitioning
Write a custom partitioner based on background knowledge of the output keys. For example, if the words in the map output keys come from a book and some of them are specialized terms, a custom partitioner can send those specialized words to a fixed subset of the reduce instances and the rest of the words to the remaining reduce instances, as in the sketch below.
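A minimal sketch of such a partitioner (the hot-word list and the number of reserved reducers are illustrative placeholders, not values from the article):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hot ("specialized") words go to a small reserved group of reducers;
// everything else is hashed over the remaining reducers.
public class SkewAwarePartitioner extends Partitioner<Text, IntWritable> {
    private static final Set<String> HOT_WORDS =
            new HashSet<>(Arrays.asList("hadoop", "mapreduce", "hdfs")); // example hot keys
    private static final int HOT_REDUCERS = 2; // reducers reserved for hot keys (example)

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        int hash = key.hashCode() & Integer.MAX_VALUE;
        if (numPartitions <= HOT_REDUCERS) {
            return hash % numPartitions;            // too few reducers to reserve any
        }
        if (HOT_WORDS.contains(key.toString())) {
            return hash % HOT_REDUCERS;             // spread hot keys over the reserved group
        }
        return HOT_REDUCERS + hash % (numPartitions - HOT_REDUCERS);
    }
}
```

In the driver, register it with job.setPartitionerClass(SkewAwarePartitioner.class) and set a reduce count larger than the reserved group.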
Method 3: Combine
Using a combiner can greatly reduce data skew. Where possible, the purpose of the combiner is to aggregate and condense the data.
Method 4: use Map Join and avoid Reduce Join as much as possible. Without a reduce phase, reduce-side data skew does not arise; a sketch of a map-side join follows.
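A minimal sketch of a map-side join, assuming the small table fits in memory and using made-up comma-separated layouts (id,name for the cached small table, id,order for the mapped input); none of the class names, paths or formats come from the article:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// The small table is registered in the driver with job.addCacheFile(...) and loaded into
// memory in setup(), so the join happens entirely in map() and no reduce task is needed.
public class MapJoinMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    private final Map<String, String> smallTable = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        URI[] cacheFiles = context.getCacheFiles();   // set via job.addCacheFile(...) in the driver
        FileSystem fs = FileSystem.get(context.getConfiguration());
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                fs.open(new Path(cacheFiles[0])), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split(",");    // assumed layout: id,name
                smallTable.put(fields[0], fields[1]);
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");  // assumed layout: id,order
        String name = smallTable.get(fields[0]);        // join on the id column
        if (name != null) {
            context.write(new Text(fields[0] + "," + name + "," + fields[1]), NullWritable.get());
        }
    }
}
```

In the driver, set job.setNumReduceTasks(0) so the joined output is written directly by the maps.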
III. Common tuning parameters
1. Resource-related parameters
(1) Parameters that an MR program can configure directly through the Configuration object:
mapreduce.map.memory.mb: upper limit of memory (in MB) that a Map Task may use. Default 1024. If a Map Task actually uses more than this, it is forcibly killed.
mapreduce.reduce.memory.mb: upper limit of memory (in MB) that a Reduce Task may use. Default 1024. If a Reduce Task actually uses more than this, it is forcibly killed.
mapreduce.map.cpu.vcores: maximum number of CPU cores each Map Task may use. Default 1.
mapreduce.reduce.cpu.vcores: maximum number of CPU cores each Reduce Task may use. Default 1.
mapreduce.reduce.shuffle.parallelcopies: number of parallel copiers each reduce uses to fetch data from the maps. Default 5.
mapreduce.reduce.shuffle.merge.percent: fill percentage of the shuffle buffer at which its data starts being written to disk. Default 0.66.
mapreduce.reduce.shuffle.input.buffer.percent: shuffle buffer size as a percentage of the memory available to the reduce. Default 0.70.
mapreduce.reduce.input.buffer.percent: proportion of memory kept as a read buffer whose data is handed directly to the reduce, without being written to disk and read back first. Default 0.0.
mapreduce.map.speculative: whether speculative execution of map tasks is enabled, i.e. whether a second attempt of a slow task may run concurrently.
(2) Parameters configured when YARN starts:
yarn.scheduler.minimum-allocation-mb (default 1024): minimum memory allocated to an application container.
yarn.scheduler.maximum-allocation-mb (default 8192): maximum memory allocated to an application container.
yarn.scheduler.minimum-allocation-vcores (default 1): minimum number of CPU cores per container request.
yarn.scheduler.maximum-allocation-vcores (default 32): maximum number of CPU cores per container request.
yarn.nodemanager.resource.memory-mb (default 8192): physical memory that can be allocated to containers on a NodeManager.
(3) Key shuffle performance parameters, configured in the configuration file before startup:
mapreduce.task.io.sort.mb (default 100): size of the shuffle ring buffer, in MB.
mapreduce.map.sort.spill.percent (default 0.80): fill threshold at which the ring buffer spills to disk.
mapreduce.task.io.sort.factor (default 10): number of files merged at the same time; increasing it reduces the number of merge passes.
2. Fault-tolerance-related parameters
mapreduce.map.maxattempts: maximum number of retries for a Map Task; once retries exceed this value, the Map Task is considered to have failed. Default 4.
mapreduce.reduce.maxattempts: maximum number of retries for a Reduce Task; once retries exceed this value, the Reduce Task is considered to have failed. Default 4.
mapreduce.task.timeout: task timeout, a parameter that often needs to be set. If a task makes no progress within this period, that is, it neither reads new data nor writes output, the task is considered blocked and possibly stuck forever. To prevent a user program from blocking indefinitely, a timeout (in milliseconds) is enforced. Default 600000 (10 minutes).
If your program takes a long time to process each input record (for example, accessing a database or pulling data over the network), it is recommended to increase this parameter. The error that typically appears when it is too small is "AttemptID:attempt_14267829456721_123456_m_000224_0 Timed out after 300 secs Container killed by the ApplicationMaster."
IV. Optimization of small files
Every file on HDFS has an index entry on the namenode, and each entry is about 150 bytes. When there are many small files, this produces a huge number of index entries, which on the one hand takes up a lot of namenode memory and on the other hand slows down indexing when the metadata becomes too large.
1. Archiving
Hadoop Archive is a file archiving tool that packs small files into HDFS blocks efficiently. It can pack many small files into a single HAR file, reducing the namenode's memory usage.
2. SequenceFile
A SequenceFile consists of a series of binary key/value pairs. If the key is the file name and the value is the file content, a large number of small files can be merged into one large file.
When merging multiple files into one, an index recording each file's starting position, length and other information within the combined file needs to be provided; a minimal sketch of writing such a SequenceFile follows.
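A minimal sketch of packing a local directory of small files into one SequenceFile, with the file name as key and the raw bytes as value (both paths are placeholders, not from the article):

```java
import java.io.File;
import java.nio.file.Files;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilesToSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path output = new Path("/user/demo/smallfiles.seq");          // example output path

        SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(output),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class));
        try {
            for (File f : new File("/tmp/smallfiles").listFiles()) {  // example input dir
                byte[] content = Files.readAllBytes(f.toPath());
                // Key = original file name, value = file content.
                writer.append(new Text(f.getName()), new BytesWritable(content));
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}
```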
3. CombineFileInputFormat
It inherits from FileInputFormat, and its concrete subclass is CombineTextInputFormat. It is an InputFormat that merges multiple files into a single split, and it takes into account where the data is stored (see the CombineTextInputFormat sketch in section II above).
4. Enable JVM reuse
For jobs with a large number of small files, enabling JVM reuse can cut the running time by roughly 45%.
How JVM reuse works: normally each map runs in its own JVM; with reuse, after one map finishes, the same JVM goes on to run other maps instead of exiting.
Specific setting: set mapreduce.job.jvm.numtasks to a value between 10 and 20, as sketched below.
JVM reuse has a drawback, however: once a JVM has been used by a map or reduce task, it cannot be handed to other MapReduce jobs until the entire current job finishes. That is, even when the JVM is idle it cannot be used by other jobs, which inevitably wastes resources.
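A minimal sketch of setting this in the driver, using the lower end of the suggested range (whether the setting actually takes effect depends on the Hadoop version and scheduler in use):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class JvmReuseTuning {
    public static Job configure() throws Exception {
        Configuration conf = new Configuration();
        // Let each JVM run up to 10 tasks before exiting (suggested range: 10-20).
        conf.setInt("mapreduce.job.jvm.numtasks", 10);
        return Job.getInstance(conf, "jvm-reuse-tuning");
    }
}
```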