1 Reasons why MapReduce runs slowly
2 MapReduce optimization methods
MapReduce optimization mainly considers six aspects: data input, the Map phase, the Reduce phase, IO transmission, the data skew problem, and commonly used tuning parameters.
2.1 Data input
2.2 Map phase
2.3 Reduce phase
2.4 IO transmission
2.5 Data skew problem
2.6 Commonly used tuning parameters
2.6.1 Resource-related parameters
The following parameters take effect when they are configured in the user's own MR application [mapred-default.xml].
mapreduce.map.memory.mb: Upper limit of resources (in MB) that a single MapTask can use. Default: 1024. If the MapTask actually uses more than this, it is forcibly killed.
mapreduce.reduce.memory.mb: Upper limit of resources (in MB) that a single ReduceTask can use. Default: 1024. If the ReduceTask actually uses more than this, it is forcibly killed.
mapreduce.map.cpu.vcores: Maximum number of CPU cores that each MapTask can use. Default: 1.
mapreduce.reduce.cpu.vcores: Maximum number of CPU cores that each ReduceTask can use. Default: 1.
mapreduce.reduce.shuffle.parallelcopies: Number of parallel copiers each Reduce uses to fetch data from the Map outputs. Default: 5.
mapreduce.reduce.shuffle.merge.percent: Fraction of the Buffer at which its data starts to be written to disk. Default: 0.66.
mapreduce.reduce.shuffle.input.buffer.percent: Buffer size as a fraction of the memory available to Reduce. Default: 0.7.
mapreduce.reduce.input.buffer.percent: Fraction of memory used to keep data in the Buffer. Default: 0.0.
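As a rough sketch (not from the original article), these job-level keys can be overridden per job through the Hadoop Configuration API before submitting; the property names are standard Hadoop 2.x/3.x keys, while the values chosen here are arbitrary examples.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ResourceTuningExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Per-task memory limits (MB); tasks exceeding them are killed.
        conf.setInt("mapreduce.map.memory.mb", 2048);
        conf.setInt("mapreduce.reduce.memory.mb", 4096);

        // CPU cores per task.
        conf.setInt("mapreduce.map.cpu.vcores", 2);
        conf.setInt("mapreduce.reduce.cpu.vcores", 2);

        // Number of parallel fetchers each Reduce uses to pull map output.
        conf.setInt("mapreduce.reduce.shuffle.parallelcopies", 10);

        // Reduce-side shuffle buffer tuning (fractions of the reduce heap).
        conf.setFloat("mapreduce.reduce.shuffle.input.buffer.percent", 0.7f);
        conf.setFloat("mapreduce.reduce.shuffle.merge.percent", 0.66f);

        Job job = Job.getInstance(conf, "resource-tuning-example");
        // ... set mapper/reducer/input/output here, then job.waitForCompletion(true);
    }
}
```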
The following parameters should be configured in the server configuration file [yarn-default.xml] before YARN starts in order to take effect.
yarn.scheduler.minimum-allocation-mb: Minimum memory allocated to an application Container. Default: 1024.
yarn.scheduler.maximum-allocation-mb: Maximum memory allocated to an application Container. Default: 8192.
yarn.scheduler.minimum-allocation-vcores: Minimum number of CPU cores that each Container can request. Default: 1.
yarn.scheduler.maximum-allocation-vcores: Maximum number of CPU cores that each Container can request. Default: 32.
yarn.nodemanager.resource.memory-mb: Maximum physical memory that can be allocated to Containers on a NodeManager. Default: 8192.
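These scheduler limits constrain what a job can ask for: YARN clamps each container request to the [minimum, maximum] range and, with the default resource calculator, rounds it up to a multiple of the minimum allocation. The snippet below is only a conceptual sketch of that normalization, not YARN's actual code; the property names and defaults are real, the logic is simplified.

```java
import org.apache.hadoop.conf.Configuration;

public class ContainerRequestNormalization {
    // Conceptual sketch of how a memory request (MB) is normalized against the
    // scheduler limits from yarn-site.xml / yarn-default.xml. Simplified; the
    // real logic lives inside YARN's ResourceCalculator implementations.
    static int normalize(Configuration conf, int requestedMb) {
        int min = conf.getInt("yarn.scheduler.minimum-allocation-mb", 1024);
        int max = conf.getInt("yarn.scheduler.maximum-allocation-mb", 8192);
        // Round the request up to a multiple of the minimum allocation ...
        int normalized = (int) (Math.ceil((double) requestedMb / min) * min);
        // ... and clamp it to the configured range.
        return Math.min(Math.max(normalized, min), max);
    }

    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // A 2500 MB request becomes 3072 MB with the default 1024 MB minimum.
        System.out.println(normalize(conf, 2500));
    }
}
```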
The following key parameters for Shuffle performance optimization [mapred-default.xml] should be configured before YARN starts.
mapreduce.task.io.sort.mb: Size of the Shuffle ring buffer. Default: 100m.
mapreduce.map.sort.spill.percent: Threshold at which the ring buffer spills to disk. Default: 80%.
2.6.2 Fault-tolerance-related parameters (MapReduce performance optimization)
mapreduce.map.maxattempts: Maximum number of retries for each Map Task. Once retries exceed this value, the Map Task is considered to have failed. Default: 4.
mapreduce.reduce.maxattempts: Maximum number of retries for each Reduce Task. Once retries exceed this value, the Reduce Task is considered to have failed. Default: 4.
mapreduce.task.timeout: Task timeout, a parameter that often needs to be set. If a Task neither reads new input data nor writes output data for this period, it is considered blocked and possibly stuck forever. To prevent a user program from hanging in that state indefinitely, this timeout (in milliseconds) is enforced. Default: 600000 (10 minutes). If your program takes a long time to process each input record (for example, accessing a database or pulling data over the network), it is recommended to increase this parameter. A common error when this parameter is too small is: "AttemptID:attempt_14267829456721_123456_m_000224_0 Timed out after 300 secs Container killed by the ApplicationMaster."
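For reference, a hedged sketch (not from the original article) of overriding these shuffle and fault-tolerance parameters per job; the keys are standard Hadoop ones and the values are illustrative only.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ShuffleAndFaultToleranceTuning {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Map-side ring buffer: a larger buffer and higher spill threshold mean fewer spills to disk.
        conf.setInt("mapreduce.task.io.sort.mb", 200);            // default 100 (MB)
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.90f); // default 0.80

        // Retry limits before a task is declared failed.
        conf.setInt("mapreduce.map.maxattempts", 4);
        conf.setInt("mapreduce.reduce.maxattempts", 4);

        // Raise the task timeout (ms) when each record is slow to process,
        // e.g. database access or network pulls; default is 600000 (10 minutes).
        conf.setLong("mapreduce.task.timeout", 1200000L);

        Job job = Job.getInstance(conf, "shuffle-tuning-example");
        // ... configure mapper/reducer/paths, then job.waitForCompletion(true);
    }
}
```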
3 HDFS small file optimization methods
3.1 Drawbacks of HDFS small files
Every file on HDFS needs an index entry on the NameNode, and each entry is roughly 150 bytes. When there are many small files, a large number of index entries are produced: on the one hand they consume a lot of NameNode memory, and on the other hand the index becomes so large that lookups slow down. For example, 100 million small files would occupy roughly 100,000,000 × 150 bytes ≈ 15 GB of NameNode memory for metadata alone.
3.2 HDFS small file solutions
Small files can be optimized in the following ways:
At data collection time, merge small files or small batches of data into large files before uploading to HDFS.
Before business processing, use a MapReduce program to merge the small files already on HDFS.
During MapReduce processing, use CombineTextInputFormat to improve efficiency (see the sketch below).
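As a minimal sketch of the third point, assuming the standard org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat and an otherwise ordinary driver; the 4 MB split ceiling is an arbitrary example value.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombineSmallFilesDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "combine-small-files");
        job.setJarByClass(CombineSmallFilesDriver.class);

        // Pack many small files into each input split instead of one split per file.
        job.setInputFormatClass(CombineTextInputFormat.class);
        // Upper bound on the combined split size (4 MB here, purely illustrative).
        CombineTextInputFormat.setMaxInputSplitSize(job, 4 * 1024 * 1024);

        // Identity map-only job for demonstration; plug in a real mapper/reducer as needed.
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

With a split ceiling like this, many small files are grouped into each split, so far fewer MapTasks are launched than with the default one-split-per-file behaviour.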