This article introduces the key knowledge points of Hadoop MapReduce performance tuning, covering application programming practices as well as job-level, task-level, and administrator-level parameter tuning.
I. Application programming tips
1. Set a Combiner
For many MapReduce programs, setting a Combiner is very helpful for improving job performance. A Combiner aggregates the intermediate output of each Map Task, which reduces the amount of data each Reduce Task must copy remotely and thereby shortens the execution times of both Map Tasks and Reduce Tasks.
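As a minimal sketch (class names and paths are illustrative, not from the original article), a word-count-style job can reuse its reducer as a Combiner, which is valid here only because summation is associative and commutative:

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountWithCombiner {

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context ctx)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                ctx.write(word, ONE);
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            result.set(sum);
            ctx.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "wordcount with combiner");
        job.setJarByClass(WordCountWithCombiner.class);
        job.setMapperClass(TokenizerMapper.class);
        // Summation is associative and commutative, so the reducer can double
        // as a combiner: partial sums are computed on the map side, shrinking
        // the data each Reduce Task must copy over the network.
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```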
2. Choose a reasonable Writable type
In the MapReduce model, the input and output types of both Map Task and Reduce Task are Writable. Hadoop itself provides many Writable implementations, including IntWritable and FloatWritable. Choosing the appropriate Writable type for the data an application processes can greatly improve performance. For example, when handling integer data, using IntWritable directly is more efficient than reading the value as Text and then converting it to an integer. If most of the output integers fit in one or two bytes, use VIntWritable or VLongWritable instead; their variable-length integer encoding can greatly reduce the amount of output data.
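For instance, a hypothetical mapper can declare its output value type as VIntWritable so that small counts serialize to one or two bytes rather than a fixed four:

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.VIntWritable;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative mapper emitting variable-length integers: a value of 1
// serializes to a single byte as a VInt instead of a fixed 4 bytes.
public class CompactCountMapper extends Mapper<Object, Text, Text, VIntWritable> {
    private static final VIntWritable ONE = new VIntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context ctx)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (token.isEmpty()) continue;
            word.set(token);
            ctx.write(word, ONE);
        }
    }
}
```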
II. Job-level parameter tuning
1. Plan a reasonable number of tasks
In Hadoop, each Map Task processes one Input Split. How Input Splits are computed is determined by the user-specified InputFormat; by default it depends on the following three parameters:
mapred.min.split.size: the minimum Input Split size; the default value is 1.
mapred.max.split.size: the maximum Input Split size.
dfs.block.size: the HDFS block size; the default is 64 MB.
In addition, goalSize is the Input Split size the user expects, computed as totalSize / numSplits, where totalSize is the total input size and numSplits is the number of Map Tasks set by the user (1 by default).
The split size is then splitSize = max{minSize, min{goalSize, blockSize}}. If you want the Input Split size to be larger than the block size, simply increase the configuration parameter mapred.min.split.size.
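A sketch of forcing larger splits, using the Hadoop 1.x parameter name with an illustrative value:

```java
import org.apache.hadoop.conf.Configuration;

public class SplitSizeExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // splitSize = max(minSize, min(goalSize, blockSize)): raising minSize
        // above the 64 MB block size yields 128 MB splits, halving the number
        // of Map Tasks for the same input.
        conf.setLong("mapred.min.split.size", 128L * 1024 * 1024);
    }
}
```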
2. Increase the replication factor of the input file
If a job runs a large number of tasks in parallel, the input file shared by those tasks can become a bottleneck. To keep many tasks from contending over the same file's replicas, users can increase the replication factor of the input file as needed.
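A minimal sketch using the FileSystem API (the file path and replication factor are hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RaiseReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Spread reads of a hot shared input across more DataNodes.
        fs.setReplication(new Path("/data/shared-input.dat"), (short) 10);
    }
}
```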
3. Enable the speculative execution mechanism
Speculative execution is Hadoop's optimization mechanism for "slow" tasks. When some tasks of a job run significantly slower than the job's other tasks, Hadoop starts a backup task for the straggler on another node, so that two attempts process the same data simultaneously; Hadoop takes the result of whichever attempt finishes first as the final result and kills the other.
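A sketch of enabling it per job via the old-API JobConf (both settings already default to true on Hadoop 1.x):

```java
import org.apache.hadoop.mapred.JobConf;

public class SpeculativeExecutionExample {
    public static void main(String[] args) {
        JobConf conf = new JobConf();
        // Launch backup attempts for straggling tasks; whichever attempt
        // finishes first wins, and the other attempt is killed.
        conf.setMapSpeculativeExecution(true);
        conf.setReduceSpeculativeExecution(true);
    }
}
```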
4. Set the failure tolerance
Hadoop lets users set task-level and job-level failure tolerance. Job-level failure tolerance means that Hadoop allows a certain proportion of tasks in each job to fail; the input data corresponding to those tasks is simply ignored.
Task-level failure tolerance means that Hadoop allows a failed task to be retried on another node; only when a task still fails after several attempts does Hadoop finally consider it failed.
Users should set reasonable failure tolerances according to the characteristics of the application, so that jobs finish as quickly as possible while avoiding unnecessary waste of resources (see the sketch below).
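A sketch of both levels via JobConf setters (the percentages and attempt counts are illustrative):

```java
import org.apache.hadoop.mapred.JobConf;

public class FailureToleranceExample {
    public static void main(String[] args) {
        JobConf conf = new JobConf();
        // Job-level: the job still succeeds if up to 5% of map tasks
        // (and 1% of reduce tasks) fail for good; their input is ignored.
        conf.setMaxMapTaskFailuresPercent(5);
        conf.setMaxReduceTaskFailuresPercent(1);
        // Task-level: retry each failed task up to 4 times on other nodes
        // before declaring the task (and possibly the job) failed.
        conf.setMaxMapAttempts(4);
        conf.setMaxReduceAttempts(4);
    }
}
```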
5. Enable JVM reuse appropriately
To isolate tasks from one another, Hadoop runs each task in a separate JVM. For tasks with short execution times, JVM startup and shutdown take up a large proportion of the total time, so users can enable JVM reuse, allowing one JVM to run multiple tasks of the same type in succession.
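A sketch using Hadoop 1.x's JobConf (the task count is illustrative; -1 would mean unlimited reuse within a job):

```java
import org.apache.hadoop.mapred.JobConf;

public class JvmReuseExample {
    public static void main(String[] args) {
        JobConf conf = new JobConf();
        // Run up to 5 tasks of the same type sequentially in one JVM,
        // amortizing JVM startup and shutdown cost across them.
        conf.setNumTasksToExecutePerJvm(5);
    }
}
```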
6. Set task timeout
If a task does not report progress within a certain period of time, the TaskTracker actively kills it, and the task is re-executed on another node. Users can configure the task timeout according to their actual needs.
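A sketch using the Hadoop 1.x property name (the 20-minute value is illustrative; the default is 600000 ms, i.e. 10 minutes):

```java
import org.apache.hadoop.conf.Configuration;

public class TaskTimeoutExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Milliseconds without a progress report before the TaskTracker
        // kills the task attempt.
        conf.setLong("mapred.task.timeout", 20 * 60 * 1000L);
    }
}
```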
7. Use DistributedCache wisely
In general, there are two ways to make an external file available to tasks. One is to package the external file on the client side together with the application jar; when the job is submitted, the client uploads it to a directory in HDFS, and it is then distributed to each node through DistributedCache. The other is to place the external file on HDFS in advance, which is more efficient: it not only saves the time the client would spend uploading the file, but also implicitly tells DistributedCache, "please download this file to the public-level shared directory of each node," so that all subsequent jobs can reuse the downloaded file instead of downloading it repeatedly.
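A sketch of the second method, assuming the file has already been placed on HDFS (the path is hypothetical):

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;

public class CacheFileExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The dictionary file is assumed to be on HDFS already; DistributedCache
        // pulls it to each node's shared directory, where later jobs can reuse it.
        DistributedCache.addCacheFile(new URI("/shared/dictionary.txt"), conf);
    }
}
```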
8. Skip bad records
Hadoop gives users the ability to skip bad records: when one or more bad data records cause a task to fail, Hadoop can automatically identify and skip them.
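A sketch using the SkipBadRecords helper class from the old mapred API (the thresholds are illustrative):

```java
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SkipBadRecords;

public class SkipBadRecordsExample {
    public static void main(String[] args) {
        JobConf conf = new JobConf();
        // Enter skip mode only after a task attempt has already failed twice...
        SkipBadRecords.setAttemptsToStartSkipping(conf, 2);
        // ...then tolerate up to 100 skipped records around each failure point.
        SkipBadRecords.setMapperMaxSkipRecords(conf, 100);
    }
}
```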
9. Increase job priority
All Hadoop job schedulers take job priority into account when scheduling tasks: the higher a job's priority, the more resources (slots) it can obtain. Hadoop provides five job priorities: VERY_HIGH, HIGH, NORMAL, LOW, and VERY_LOW.
Note: in production environments, administrators have usually graded jobs by importance, with jobs of different importance assigned different priorities; users should not adjust these priorities without authorization.
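A sketch of setting a job's priority through the old-API JobConf:

```java
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobPriority;

public class JobPriorityExample {
    public static void main(String[] args) {
        JobConf conf = new JobConf();
        // One of VERY_HIGH, HIGH, NORMAL (the default), LOW, VERY_LOW.
        conf.setJobPriority(JobPriority.HIGH);
    }
}
```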
10. Reasonably control the starting time of Reduce Task
If Reduce Tasks start too early, they may hold Reduce slot resources for a long time while waiting for map output (the "slot hoarding" phenomenon), reducing resource utilization; conversely, if Reduce Tasks start too late, they obtain resources late and the job's running time increases.
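A sketch using the Hadoop 1.x property name (0.80 is an illustrative value; the default is 0.05):

```java
import org.apache.hadoop.conf.Configuration;

public class ReduceSlowStartExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Do not schedule reduces until 80% of the job's map tasks have
        // completed, so reduces do not hoard slots waiting for map output.
        conf.setFloat("mapred.reduce.slowstart.completed.maps", 0.80f);
    }
}
```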
III. Task-level parameter tuning
Hadoop task-level parameter tuning is divided into two aspects: Map Task and Reduce Task.
1. Map Task tuning
Map Task execution is divided into five phases: Read, Map, Collect, Spill, and Merge.
A Map Task produces intermediate data, but these intermediate results are not written straight to disk; they are first stored in an in-memory buffer, where some pre-sorting is performed to optimize overall map performance. The default buffer size for map intermediate data is 100 MB, specified by the io.sort.mb parameter, and it can be adjusted as needed. When a map task produces very large intermediate output, or when disk I/O speed is the system's performance bottleneck, increasing this parameter appropriately lets the buffer hold more intermediate data instead of forcing high-frequency disk I/O.
Because the intermediate results are first stored in the buffer, by default they are written to disk once buffer usage reaches 80% (0.8). This process is called spill (also known as overflow). The spill threshold can be adjusted via the io.sort.spill.percent parameter, which affects spill frequency and, in turn, I/O frequency.
When the map computation completes, a map task with output will have generated multiple spill files, which map must then merge; this process is called merge. The merge processes spill files in parallel, and the number merged per pass is specified by the io.sort.factor parameter, with a default of 10. When the number of spill files is very large, merging only 10 at a time still causes frequent I/O, so appropriately increasing the number merged per pass reduces the number of merge passes and thus improves map performance.
Compression can also be configured when map outputs intermediate results.
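A sketch collecting the map-side knobs discussed above (all values are illustrative; property names are the Hadoop 1.x ones):

```java
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.JobConf;

public class MapSideTuningExample {
    public static void main(String[] args) {
        JobConf conf = new JobConf();
        conf.setInt("io.sort.mb", 200);                // sort buffer (default 100 MB)
        conf.setFloat("io.sort.spill.percent", 0.85f); // spill threshold (default 0.80)
        conf.setInt("io.sort.factor", 50);             // spill files merged per pass (default 10)
        // Compress intermediate map output to cut shuffle traffic.
        conf.setCompressMapOutput(true);
        conf.setMapOutputCompressorClass(GzipCodec.class);
    }
}
```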
2. Reduce Task tuning
Reduce Task execution is divided into five phases: Shuffle (copy), Merge, Sort, Reduce, and Write.
In the shuffle phase, a Reduce Task copies the intermediate results of completed map tasks; if the map tasks compressed their output, reduce decompresses each copied result first. All of this happens in the reduce-side buffer and also consumes some CPU. To shorten overall execution time, a reduce does not wait until all map output has been copied before starting work; copying begins as soon as the job's first map task finishes. Because there are typically many map tasks, the copy process is parallel: each reduce fetches output from the various completed maps with multiple threads, and the number of parallel copier threads is specified by the mapred.reduce.parallel.copies parameter. The default is 5, so no matter how many map tasks there are, by default each reduce uses at most 5 threads at a time to copy map task results. When there are a large number of map tasks, you can raise this parameter appropriately so that reduce obtains its input data quickly and completes sooner.
While a reduce thread is downloading map data, the node holding that data may fail for various reasons (network problems, system problems, and so on). In that case the reduce task cannot obtain the data from that node, and the download thread tries another node instead. The allowed download time can be adjusted through mapred.reduce.copy.backoff (the default is 30 seconds). On a cluster with a poor network, increasing this value lengthens the permitted download time, so reduce does not judge a thread a download failure merely because the download took too long.
Once the reduce download threads have fetched the map results locally, the downloaded data must also be merged, because it was downloaded in parallel by multiple threads; therefore the io.sort.factor setting from the map phase affects this reduce-side merge as well.
As on the map side, the reduce buffer does not wait until it is completely full before writing to disk. By default, the write to disk begins when the buffer is 66% full (0.66), as specified by mapred.job.shuffle.merge.percent.
When reduce begins computing, mapred.job.reduce.input.buffer.percent specifies what percentage of memory is kept as a buffer for the sorted data that reduce reads. It defaults to 0: Hadoop assumes the user's reduce() function needs all of the JVM's memory, so all buffered data is flushed to disk before reduce() executes. If you raise this value, some of the files can be kept in memory rather than written to disk.
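A sketch of the reduce-side knobs discussed above (values are illustrative; property names are the Hadoop 1.x ones):

```java
import org.apache.hadoop.conf.Configuration;

public class ReduceSideTuningExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Parallel copier threads per Reduce Task (default 5).
        conf.setInt("mapred.reduce.parallel.copies", 20);
        // Begin merging shuffle buffers to disk at 66% fill (the default).
        conf.setFloat("mapred.job.shuffle.merge.percent", 0.66f);
        // Keep up to half the heap's worth of map output in memory for
        // reduce() instead of flushing everything to disk (default 0).
        conf.setFloat("mapred.job.reduce.input.buffer.percent", 0.5f);
    }
}
```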
In short, the guiding principles of Map Task and Reduce Task tuning are: reduce the amount of data transferred, use memory as much as possible, reduce the number of disk I/O operations, increase task parallelism, and adjust according to the actual conditions of your own cluster and network.
IV. Tuning from the administrator's perspective
The administrator is responsible for providing an efficient running environment for user jobs. Administrators need to improve the system's throughput and performance by adjusting key parameters from a global point of view. Generally speaking, administrators provide an efficient job-running environment for Hadoop users from four angles: hardware selection, operating system parameter tuning, JVM parameter tuning, and Hadoop parameter tuning.
Hardware selection
The basic characteristics of Hadoop's architecture determine its hardware options. Hadoop uses a Master/Slave architecture in which the master maintains global metadata and is far more important than any single slave. In older versions of Hadoop, the master is also a single point of failure, so the master's hardware configuration should be much better than that of an individual slave.
Operating system parameter tuning
1. Increase the upper limit of file descriptors and network connections that are open at the same time
Use the ulimit command to raise the maximum number of file descriptors allowed to be open simultaneously to an appropriate value, and adjust the kernel parameter net.core.somaxconn (the network connection backlog limit) to a sufficiently large value.
Supplement: the role of net.core.somaxconn
net.core.somaxconn is a Linux kernel parameter that sets the upper limit of the backlog for socket listening (listen). What is the backlog? It is the socket's listening queue: a request enters the backlog while it has not yet been processed or established, and leaves the queue once the server has handled it. When the server processes requests so slowly that the listening queue fills up, new requests are rejected. In Hadoop 1.0, the parameter ipc.server.listen.queue.size controls the length of the server-side socket's listening queue, i.e. the backlog length, with a default value of 128. The Linux default for net.core.somaxconn is also 128. For a busy server such as a NameNode or JobTracker, 128 is far from enough, so the backlog must be increased. For example, our 3000-node cluster sets ipc.server.listen.queue.size to 32768; for that setting to take full effect, the kernel parameter net.core.somaxconn must also be set to a value greater than or equal to 32768.
2. Disable the swap partition
Avoid using the swap partition in order to improve program execution efficiency.
In addition, set a reasonable read-ahead buffer size, choose and configure the file system appropriately, select a suitable I/O scheduler, and so on.
JVM parameter tuning
Because every Hadoop service and task runs in a separate JVM, important JVM parameters also affect Hadoop performance. Administrators can improve performance by tuning JVM flags and the JVM garbage collection mechanism.
Hadoop parameter tuning
1. Rational planning of resources
Set a reasonable number of slots
In Hadoop, computing resources are represented by slots. There are two types: Map Slots and Reduce Slots. Each slot represents a certain amount of resources, and slots of the same type are homogeneous, i.e. every slot of a given type represents the same amount of resources. Administrators need to configure an appropriate number of Map Slots and Reduce Slots for each TaskTracker according to actual needs, thereby limiting the number of Map Tasks and Reduce Tasks executed concurrently on each TaskTracker.
Write a health monitoring script
Hadoop allows administrators to configure a node health monitoring script for each TaskTracker. TaskTracker contains a dedicated thread that periodically executes the script and reports the results of the script execution to JobTracker through a heartbeat mechanism. Once JobTracker discovers that a TaskTracker is currently in an "unhealthy" state, it is blacklisted and no more tasks are assigned to it.
2. Adjust the heartbeat configuration
Adjust the heartbeat interval appropriately according to the size of the cluster.
Enable the out-of-band heartbeat. To reduce task-assignment delays, Hadoop introduced the out-of-band heartbeat. Unlike the regular periodic heartbeat, it is triggered when a task finishes or fails, so the JobTracker is notified the moment resources become free and can quickly assign new tasks to them.
Other administrator-side items include disk block configuration, setting a reasonable number of RPC Handlers and HTTP threads, using the blacklist mechanism with care, enabling batch task scheduling, selecting an appropriate compression algorithm, enabling the read-ahead mechanism, and so on.
Note: when the size of a cluster is small, if a certain number of nodes are frequently added to the system blacklist, the throughput and computing power of the cluster will be greatly reduced.