What are the entry-level knowledge points of Hadoop

2025-04-05 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/01 Report--

This article introduces entry-level Hadoop knowledge points. Many people run into these situations in real work, so this walkthrough shows how to deal with them. I hope you read it carefully and learn something useful!

Getting Started with Hadoop

Hadoop Enterprise Optimization for Everyday Work

1 Reasons why MapReduce runs slowly

The efficiency bottleneck of a MapReduce program usually lies in one of two areas:

1) Computer performance

CPU, memory, disk health, and network bandwidth

2) I/O operation issues

data skew

Map and reduce numbers are set incorrectly

Reduce waiting too long

Too many small files

Large numbers of oversized files that cannot be split

Too many spills

Too many merges, etc.

2 MapReduce Optimization Methods

MapReduce optimization is mainly considered from the following six aspects:

2.1 Data input

Merge small files: merge small files before running the MR job. Large numbers of small files generate large numbers of map tasks, and since loading each task carries overhead, all that extra task setup makes the MR job run slower.

Use CombineFileInputFormat as the input format to handle the large-numbers-of-small-files scenario on the input side.
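As a minimal sketch (assuming the `org.apache.hadoop.mapreduce` v2 API and a driver class of your own; the class name and split size here are illustrative), switching the input format and capping the combined split size might look like this:

```java
// Sketch of a job-driver input configuration.
// CombineTextInputFormat packs many small files into each input split,
// so fewer map tasks are launched.
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

public class SmallFileJobDriver {
    static void configureInput(Job job) {
        job.setInputFormatClass(CombineTextInputFormat.class);
        // Cap each combined split at 128 MB (value in bytes);
        // tune this against your HDFS block size.
        CombineTextInputFormat.setMaxInputSplitSize(job, 134217728L);
    }
}
```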

2.2 Map phase

Reduce spill count: adjust the io.sort.mb and sort.spill.percent parameters (mapreduce.task.io.sort.mb and mapreduce.map.sort.spill.percent in newer releases) to raise the memory ceiling that triggers a spill, reducing the number of spills and therefore disk I/O.

Reduce merge count: adjust the io.sort.factor parameter (mapreduce.task.io.sort.factor) to merge more files per pass, decreasing the number of merge passes and shortening MR processing time.

Where the logic allows, apply a Combiner after the map to shrink the data before shuffle and reduce I/O.
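As a sketch, the spill and merge knobs above can be set per-job or in mapred-site.xml. The modern property names are assumed and the values are illustrative, not recommendations:

```xml
<!-- Illustrative values only; tune against your own jobs. -->
<property>
  <name>mapreduce.task.io.sort.mb</name>
  <value>200</value> <!-- larger sort buffer: fewer spills -->
</property>
<property>
  <name>mapreduce.map.sort.spill.percent</name>
  <value>0.9</value> <!-- spill later -->
</property>
<property>
  <name>mapreduce.task.io.sort.factor</name>
  <value>50</value> <!-- merge more streams per pass -->
</property>
```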

2.3 Reduce phase

Set the numbers of map and reduce tasks reasonably: neither too few nor too many. Too few causes tasks to queue and prolongs processing; too many causes map and reduce tasks to compete for resources, leading to errors such as processing timeouts.

Let map and reduce coexist: adjust the slowstart.completedmaps parameter so that once the maps have progressed to a certain point, the reducers also start running, cutting reduce wait time.

Avoid the reduce phase where possible: using Reduce to join data sets causes heavy network traffic during shuffle.

Set the reduce-side buffer sensibly. By default, when shuffled data reaches a threshold, the data in the buffer is written to disk, and reduce then reads all of it back from disk. In other words, buffer and reduce are not directly connected; there is a write-disk-then-read-disk round trip in between. Because of this drawback, a parameter lets part of the buffered data be handed to reduce directly, cutting I/O overhead: mapreduce.reduce.input.buffer.percent, default 0.0. When the value is greater than 0, the specified fraction of memory keeps shuffled buffer data resident for reduce to use directly. Memory then has to cover the buffer, data reads, and the reduce computation itself, so adjust it according to how the job actually runs.
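The overlap and reduce-buffer settings above, as an illustrative per-job configuration fragment (values are examples, not recommendations):

```xml
<property>
  <name>mapreduce.job.reduce.slowstart.completedmaps</name>
  <value>0.8</value> <!-- start reducers after 80% of maps have finished -->
</property>
<property>
  <name>mapreduce.reduce.input.buffer.percent</name>
  <value>0.2</value> <!-- keep 20% of heap holding shuffled data for reduce -->
</property>
```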

2.4 IO transmission

Use data compression to reduce network I/O time, for example by installing the Snappy and LZO codecs.

Use SequenceFile binary files.
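Compressing intermediate map output is the most common application of this; a sketch of the relevant mapred-site.xml fragment (assuming the Snappy native libraries are installed):

```xml
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value> <!-- compress map output before shuffle -->
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
```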

2.5 Data skew problem

1) Data skew

Data frequency skew: the amount of data in one region is far larger than in other regions.

Data size skew: some records are far larger than the average.

2) How to collect skewed data

In the reduce method, add logic that records details about the map output keys.

```java
public static final String MAX_VALUES = "skew.maxvalues";
private int maxValueThreshold;

@Override
public void configure(JobConf job) {
    maxValueThreshold = job.getInt(MAX_VALUES, 100);
}

@Override
public void reduce(Text key, Iterator values,
                   OutputCollector output, Reporter reporter) throws IOException {
    // Count the values seen for this key.
    int i = 0;
    while (values.hasNext()) {
        values.next();
        i++;
    }
    // Log keys whose value count exceeds the threshold. Note: i already holds
    // the final count here; the original "++i" double-counted by one.
    if (i > maxValueThreshold) {
        log.info("Received " + i + " values for key " + key);
    }
}
```

3) Methods to reduce data skew

Sampling and range partitioning

By sampling the original data set, partition boundary values can be chosen in advance.

Custom partitioning

Another alternative to sampling and range partitioning is custom partitioning based on background knowledge about the output keys. For example, if the map output keys are words from a book, a large share of them will be stopwords. A custom partitioner can send that subset of stopwords to a fixed subset of reduce instances and the remaining words to the other reducers.
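A minimal sketch of such a partitioner, assuming the `org.apache.hadoop.mapreduce` v2 API; the class name and the STOPWORDS set are hypothetical illustrations:

```java
import java.util.Set;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes known-hot keys (stopwords) to a reserved reducer, and spreads all
// other keys over the remaining reducers by hash. STOPWORDS is illustrative.
public class StopwordPartitioner extends Partitioner<Text, IntWritable> {
    private static final Set<String> STOPWORDS = Set.of("the", "a", "of", "and");

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (STOPWORDS.contains(key.toString())) {
            return 0; // reserved reducer for the hot subset
        }
        // Non-negative hash spread over reducers 1..numPartitions-1.
        int h = key.toString().hashCode() & Integer.MAX_VALUE;
        return 1 + (h % (numPartitions - 1));
    }
}
```

This is registered on the job with `job.setPartitionerClass(StopwordPartitioner.class)`.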

Combine

Using a Combiner can greatly reduce both data frequency skew and data size skew: where possible, its purpose is to aggregate and shrink the data before it is shuffled.

Use map join; avoid reduce join as much as possible.

2.6 Common tuning parameters

1) Resource-related parameters

The following parameters take effect when configured in the user's own MR application (job configuration, overriding mapred-default.xml):

mapreduce.map.memory.mb: upper limit of resources (unit: MB) a Map Task may use; default 1024. If the Map Task actually exceeds this, it is forcibly killed.
mapreduce.reduce.memory.mb: upper limit of resources (unit: MB) a Reduce Task may use; default 1024. If the Reduce Task actually exceeds this, it is forcibly killed.
mapreduce.map.cpu.vcores: maximum number of CPU cores per Map Task; default 1.
mapreduce.reduce.cpu.vcores: maximum number of CPU cores per Reduce Task; default 1.
mapreduce.reduce.shuffle.parallelcopies: number of parallel copiers each Reduce uses to fetch map output; default 5.
mapreduce.reduce.shuffle.merge.percent: buffer fill fraction at which shuffled data starts being written to disk; default 0.66.
mapreduce.reduce.shuffle.input.buffer.percent: shuffle buffer size as a fraction of the reduce task's available memory; default 0.7.
mapreduce.reduce.input.buffer.percent: fraction of memory used to keep buffered data for reduce to read directly; default 0.0.
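For example, a job that needs more map-side headroom might carry a fragment like this (values illustrative):

```xml
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>2048</value> <!-- raise the map container limit from 1 GB to 2 GB -->
</property>
<property>
  <name>mapreduce.map.cpu.vcores</name>
  <value>2</value> <!-- two cores per map task -->
</property>
```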

The following should be configured in the server-side configuration file before YARN starts (yarn-site.xml, overriding yarn-default.xml):

yarn.scheduler.minimum-allocation-mb (default 1024): minimum memory allocated per application container.
yarn.scheduler.maximum-allocation-mb (default 8192): maximum memory allocated per application container.
yarn.scheduler.minimum-allocation-vcores (default 1): minimum CPU cores requested per container.
yarn.scheduler.maximum-allocation-vcores (default 32): maximum CPU cores requested per container.
yarn.nodemanager.resource.memory-mb (default 8192): maximum physical memory the NodeManager hands out to containers.

Key parameters for shuffle performance optimization, also configured before YARN starts (mapred-site.xml, overriding mapred-default.xml):

mapreduce.task.io.sort.mb (default 100): shuffle ring buffer size, in MB.
mapreduce.map.sort.spill.percent (default 0.8): ring buffer spill threshold, i.e. 80%.

2) Fault-tolerance related parameters (mapreduce performance optimization)

mapreduce.map.maxattempts: maximum number of retries per Map Task; beyond this the Map Task is considered to have failed. Default: 4.
mapreduce.reduce.maxattempts: maximum number of retries per Reduce Task; beyond this the Reduce Task is considered to have failed. Default: 4.
mapreduce.task.timeout: task timeout, a parameter that often needs setting. Its meaning: if a task neither reads new input nor produces any output for this long, it is assumed to be blocked, possibly stuck forever; to keep a hung user program from never exiting, the task is forcibly timed out. Unit: milliseconds; default 600000 (10 minutes). If your program takes a long time per input record (accessing a database, pulling data over the network, and so on), raise this value. When it is set too low, a typical error is: "AttemptID:attempt_14267829456721_123456_m_000224_0 Timed out after 300 secs Container killed by the ApplicationMaster."
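As a sketch, raising the timeout for a job with slow per-record processing (value illustrative):

```xml
<property>
  <name>mapreduce.task.timeout</name>
  <value>1200000</value> <!-- 20 minutes, in milliseconds -->
</property>
```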

3 HDFS small file optimization method

3.1 HDFS Small File Disadvantages

Each file on HDFS requires an index entry on the namenode, and each entry occupies roughly 150 bytes of namenode memory. With huge numbers of small files, many index entries are generated: for example, 10 million files cost about 1.5 GB of namenode heap. On the one hand this consumes a lot of namenode memory; on the other, an oversized index slows metadata lookups.

3.2 solutions

Hadoop Archive:

HAR is a file archiving tool that packs small files into HDFS blocks efficiently. It bundles many small files into a single HAR file, reducing namenode memory usage.
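An archive is created with the hadoop archive tool; as a sketch (the archive name and paths here are illustrative):

```shell
# Pack everything under /user/me/input into small-files.har in /user/me/output
hadoop archive -archiveName small-files.har -p /user/me/input /user/me/output
```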

Sequence file:

A SequenceFile consists of a series of binary key/value pairs. If the key is the file name and the value is the file contents, large numbers of small files can be merged into one large file.

CombineFileInputFormat:

CombineFileInputFormat is a newer input format that combines multiple files into a single split, taking into account where the data is stored.

Turn on JVM reuse

For jobs with large numbers of small files, turning on JVM reuse can cut runtime substantially (figures of around 45% are commonly cited).

Understanding JVM reuse: normally each task launches its own JVM, which exits when the task finishes. With reuse enabled, after one task completes, the same JVM stays alive and runs the job's subsequent tasks instead of being torn down and restarted.

Specific setting: mapreduce.job.jvm.numtasks, with a value between 10 and 20.
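The corresponding configuration fragment (value illustrative; this property applies to the classic MapReduce runtime):

```xml
<property>
  <name>mapreduce.job.jvm.numtasks</name>
  <value>10</value> <!-- each JVM runs up to 10 tasks before exiting -->
</property>
```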

That wraps up the introduction to entry-level Hadoop knowledge points. Thanks for reading. If you want to learn more industry knowledge, keep an eye on this site for more practical articles!
