How to optimize Hadoop system parameters


This article explains how to optimize Hadoop system parameters. The editor thinks it is very practical, so it is shared here as a reference; follow along for the details.

Parameter optimization of the Hadoop system

To improve data-processing performance, many people set out to optimize Hadoop. Broadly speaking, there are three main optimization approaches:

(1) Optimize at the application level. Because MapReduce parses data files iteratively, line by line, writing applications that are efficient under this iteration model is one optimization approach.

(2) Tune Hadoop's parameters. The Hadoop system currently has more than 190 configuration parameters; adjusting them so that Hadoop jobs run as fast as possible is a second approach.

(3) Optimize the system implementation itself. This kind of optimization is the most difficult: starting from Hadoop's implementation mechanisms, it identifies shortcomings in the current design and implementation and then fixes them at the source-code level. Although this method is difficult, it is often effective.

3.2.1 Linux file system parameter tuning

(1) The noatime and nodiratime mount options

Setting these two options when the file system is mounted can significantly improve performance. By default, Linux ext2/ext3 file systems record timestamps when a file is accessed, created, or modified, such as the creation time, last modification time, and last access time. If the system accesses a large number of files while running, disabling the access-time updates improves file system performance. Linux provides the noatime mount option to disable recording of the last access timestamp (nodiratime does the same for directories).
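
As a minimal sketch of how these options are applied (the device /dev/sdb1 and mount point /data are assumptions, not from the original text), set them in /etc/fstab or apply them to an already-mounted file system with a remount:

# /etc/fstab entry for a hypothetical Hadoop data disk
/dev/sdb1  /data  ext3  defaults,noatime,nodiratime  0  0

# Remount a live file system without rebooting
mount -o remount,noatime,nodiratime /data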

(2) readahead buffer

Adjusting the size of the read-ahead buffer in the Linux file system can significantly improve the performance of sequential file reads. The default buffer size is 256 sectors; it can be increased to 1024 or 2048 sectors (note that bigger is not always better). Use the blockdev command to make the adjustment.

Commands:

View:   blockdev --report

Change: blockdev --setra 1024 /dev/sda
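
Note that values set with blockdev do not persist across reboots, so the command is typically re-applied from a boot script (for example rc.local). To query a single device (reusing /dev/sda from the example above):

blockdev --getra /dev/sda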

(3) avoid RAID and LVM operations

Avoid using RAID and LVM on TaskTracker and DataNode machines; they usually degrade performance.

3.2.2 Hadoop general parameter adjustment

(1) dfs.namenode.handler.count (hdfs-default) or mapred.job.tracker.handler.count (mapred-default)

The number of server threads that handle RPC requests in the NameNode or JobTracker. The default is 10; for a larger cluster it can be raised to a larger value, such as 64.
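
A minimal sketch of the corresponding entry in hdfs-site.xml (the value 64 follows the suggestion above; tune it for your cluster):

<property>
  <name>dfs.namenode.handler.count</name>
  <value>64</value>  <!-- default is 10 -->
</property>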

(2) dfs.datanode.handler.count (hdfs-default)

The number of server threads that handle RPC requests on each DataNode. The default is 3; for a larger cluster it can be raised, for example to 8. Note that each additional thread increases the amount of memory required.
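
The hdfs-site.xml entry follows the same pattern (8 is the example value from the text):

<property>
  <name>dfs.datanode.handler.count</name>
  <value>8</value>  <!-- default is 3; each extra thread costs memory -->
</property>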

(3) mapreduce.tasktracker.http.threads (mapred-default)

The number of worker threads in the HTTP server that runs on each TaskTracker and serves map task output to reducers. For a large cluster, it can be set to 40-50.
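
A sketch of the mapred-site.xml entry (50 is the upper end of the range suggested above):

<property>
  <name>mapreduce.tasktracker.http.threads</name>
  <value>50</value>  <!-- serves map output to reducers during shuffle -->
</property>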

3.2.3 HDFS related configuration

(1) dfs.replication (hdfs-default)

The number of replicas of each file block. It is usually set to 3, and modifying the default is not recommended.
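
If a particular file genuinely needs a different replication factor, it can be changed per path with the standard HDFS shell rather than cluster-wide (the path below is hypothetical):

hadoop fs -setrep -w 2 /user/data/scratch-output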

(2) dfs.block.size (hdfs-default)

The data block size in HDFS. The default is 64MB; for larger clusters it can be set to 128MB or 256MB. (The input split size can also be influenced through the parameter mapred.min.split.size.)
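
A sketch of the hdfs-site.xml entry for a 128MB block size (the value is in bytes):

<property>
  <name>dfs.block.size</name>
  <value>134217728</value>  <!-- 128MB; default is 67108864 (64MB) -->
</property>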

(3) mapred.local.dir (mapred-default) and dfs.data.dir (hdfs-default)

The values of mapred.local.dir and dfs.data.dir should be comma-separated lists of directories spread across each of the node's disks, so that the node's full I/O read/write capacity is used. Running iostat -dx 5 from the Linux sysstat package shows the utilization of each disk.
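
A sketch of the multi-disk layout (the mount points /disk1 through /disk3 are assumptions):

<!-- hdfs-site.xml -->
<property>
  <name>dfs.data.dir</name>
  <value>/disk1/hdfs/data,/disk2/hdfs/data,/disk3/hdfs/data</value>
</property>

<!-- mapred-site.xml -->
<property>
  <name>mapred.local.dir</name>
  <value>/disk1/mapred/local,/disk2/mapred/local,/disk3/mapred/local</value>
</property>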

3.2.4 map/reduce related configuration

(1) mapred.tasktracker.{map/reduce}.tasks.maximum (mapred-default)

The maximum number of map/reduce tasks that run simultaneously on a TaskTracker; it is generally set between (cores_per_node)/2 and 2*(cores_per_node).
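
As a sketch for a hypothetical 8-core node (so the suggested range is 4 to 16 concurrent tasks of each type), the mapred-site.xml entries might be:

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>8</value>  <!-- assumed 8-core node; range is cores/2 to 2*cores -->
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>8</value>
</property>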

(2) io.sort.factor (mapred-default)

While a map task executes, it produces several spill files on the local disk (under mapred.local.dir); the last thing the map task does is run a merge sort to combine these spill files into a single (partitioned) file. This parameter determines how many spill files are opened at the same time during the merge sort. Opening more files does not necessarily make the merge sort faster, so adjust it appropriately for your data.
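
A sketch of the setting (the value 100 is an illustrative increase over the stock default of 10; validate it against your own workload):

<property>
  <name>io.sort.factor</name>
  <value>100</value>  <!-- spill files merged at once; default is 10 -->
</property>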

(3) mapred.child.java.opts (mapred-default)

Sets the maximum available JVM heap for task processes; it needs to be configured according to the application's needs.
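
A sketch with an assumed 512MB heap (size this to your application's actual memory needs):

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx512m</value>  <!-- max JVM heap per task; stock default is -Xmx200m -->
</property>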

3.2.5 map task related configuration

(1) io.sort.mb (mapred-default)

The total in-memory buffer size for map task output and its metadata. The default is 100MB; for large clusters it can be set to 200MB. When the buffer reaches a certain threshold, a background thread starts sorting the buffer's contents and writing them to the local disk (producing a spill file).

(2) io.sort.spill.percent (mapred-default)

This value is the threshold for the buffer described above. The default is 0.8, that is, 80%; when the data in the buffer reaches this threshold, the background thread sorts the existing buffer contents and writes them to disk.

(3) io.sort.record.percent (mapred-default)

The fraction of io.sort.mb allocated to metadata, which defaults to 0.05. This needs to be adjusted according to the application.
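
A combined sketch of the three buffer parameters above in mapred-site.xml (values follow the suggestions in the text):

<property>
  <name>io.sort.mb</name>
  <value>200</value>  <!-- buffer for map output plus metadata, in MB; default 100 -->
</property>
<property>
  <name>io.sort.spill.percent</name>
  <value>0.8</value>  <!-- start spilling at 80% full (the default) -->
</property>
<property>
  <name>io.sort.record.percent</name>
  <value>0.05</value>  <!-- share of io.sort.mb reserved for metadata (the default) -->
</property>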

(4) mapred.compress.map.output / mapred.output.compress (mapred-default)

Whether to compress the intermediate (map) output and the final job output; if so, the compression codec is specified separately (mapred.map.output.compression.codec / mapred.output.compression.codec). LZO compression is recommended: Intel internal tests showed that TeraSort jobs run with LZO compression took 60% less time than uncompressed runs, and were significantly faster than with zlib compression.
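
A sketch that enables LZO for the intermediate map output (the LzoCodec class comes from the third-party hadoop-lzo library, which must be installed separately; treat the class name as an assumption about your environment):

<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>  <!-- requires hadoop-lzo -->
</property>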

3.2.6 reduce task related configuration

(1) mapred.reduce.parallel.copies (mapred-default)

The number of copier threads a reduce task uses during the shuffle phase to fetch map output. The default is 5; for larger clusters it can be raised to 16-25.
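
A sketch with a value in the suggested range:

<property>
  <name>mapred.reduce.parallel.copies</name>
  <value>20</value>  <!-- shuffle copier threads per reduce task; default 5 -->
</property>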

This kind of parameter-based tuning is rather "static", because a given parameter configuration is optimal for only one type of job. By studying these parameters, however, we can uncover the relationship between parameter configurations and the characteristics of different jobs.

Thank you for reading! That concludes this article on "how to optimize Hadoop system parameters". I hope the content above has been helpful. If you found the article useful, feel free to share it so more people can see it!
