
How to optimize the performance of the Hadoop layer


This article mainly shows you "how to optimize the performance of the Hadoop layer". The content is easy to understand and clearly organized, and we hope it helps resolve your doubts; below, the editor will lead you through the details.

Hadoop layer performance tuning

1. Memory tuning for the daemons

A) NameNode and DataNode memory adjustments in the hadoop-env.sh file

NameNode:

export HADOOP_NAMENODE_OPTS="-Xmx512m -Xms512m -Dhadoop.security.logger=${HADOOP_SECURITY_LOGGER:-INFO,RFAS} -Dhdfs.audit.logger=${HDFS_AUDIT_LOGGER:-INFO,NullAppender} $HADOOP_NAMENODE_OPTS"

DataNode:

export HADOOP_DATANODE_OPTS="-Xmx256m -Xms256m -Dhadoop.security.logger=ERROR,RFAS $HADOOP_DATANODE_OPTS"

The two parameters -Xmx and -Xms are generally set to the same value, so that the JVM does not have to reallocate (resize) the heap after each garbage collection.

B) ResourceManager and NodeManager memory adjustments in the yarn-env.sh file

ResourceManager:

export YARN_RESOURCEMANAGER_HEAPSIZE=1000  # defaults to 1000 (MB)
export YARN_RESOURCEMANAGER_OPTS=""        # JVM options set here can override the heap size above

NodeManager:

export YARN_NODEMANAGER_HEAPSIZE=1000  # defaults to 1000 (MB)
export YARN_NODEMANAGER_OPTS=""        # JVM options set here can override the heap size above

Rule-of-thumb resident memory configuration:

NameNode: 16 G

DataNode: 2-4 G

ResourceManager: 4 G

NodeManager: 2 G

ZooKeeper: 4 G

Hive Server: 2 G

2. Configure multiple MR intermediate directories to spread the I/O load

http://hadoop.apache.org/docs/r2.6.0/

In the yarn-default.xml configuration file (spreads the I/O load; a configuration sketch follows this list):

yarn.nodemanager.local-dirs

yarn.nodemanager.log-dirs

In the mapred-default.xml configuration file:

mapreduce.cluster.local.dir

In the hdfs-default.xml configuration file (improves reliability):

dfs.namenode.name.dir

dfs.namenode.edits.dir

dfs.datanode.data.dir
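A minimal yarn-site.xml sketch of the idea, assuming hypothetical mount points /data1 and /data2; each comma-separated path should sit on a different physical disk so the I/O load is spread across spindles:

<!-- yarn-site.xml: hypothetical paths, one per physical disk -->
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>/data1/yarn/local,/data2/yarn/local</value>
</property>
<property>
  <name>yarn.nodemanager.log-dirs</name>
  <value>/data1/yarn/logs,/data2/yarn/logs</value>
</property>

Note that the semantics differ in hdfs-site.xml: multiple dfs.datanode.data.dir entries spread blocks across disks, while multiple dfs.namenode.name.dir entries hold redundant copies of the metadata, which is what improves reliability.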

3. Compress MR intermediate results

A) Configure in the mapred-site.xml file:

<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>

Or specify the parameters when the program is run:

hadoop jar /home/hadoop/tv/tv.jar MediaIndex -Dmapreduce.map.output.compress=true -Dmapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec /tvdata /media

B) Use a reasonable compression algorithm (trade off CPU against disk). CPU: if the CPU is the bottleneck, switch to a faster compression algorithm. Disk: if the disk is the bottleneck, switch to an algorithm with a higher compression ratio. In general, snappy is used as a balanced choice, with lzo as a common alternative.

4. Avoid having a large number of small files in the HDFS file system.

5. Where the job allows it, use a Combiner on the map side to reduce the amount of map output (see the sketch below).
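A minimal sketch of wiring a Combiner into a job driver, borrowing the classic WordCount classes (TokenizerMapper is sketched under item 6 below; IntSumReducer is its matching reducer). A Combiner is only safe when the reduce logic is associative and commutative, as a sum is:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);   // mapper, sketched under item 6
    job.setCombinerClass(IntSumReducer.class);   // pre-aggregates map output locally
    job.setReducerClass(IntSumReducer.class);    // same class reused as the final reducer
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}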

6. Reuse Writable types

For example, declare an object once, Text word = new Text(), and reuse it across calls in the map() or reduce() method instead of allocating a new object for every record (see the sketch below).
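A minimal sketch of the pattern, again borrowing the classic WordCount mapper: word and one are allocated once per task and refilled with set() for each record, rather than constructed anew inside the loop:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
  // Allocated once per task and reused for every record.
  private final static IntWritable one = new IntWritable(1);
  private final Text word = new Text();

  @Override
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());  // refill the same Text object
      context.write(word, one);   // the framework serializes the value immediately
    }
  }
}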

7. Adjust task parallelism according to the actual capacity of the cluster nodes

In the mapred-default.xml configuration file:

Set the maximum number of concurrent map and reduce tasks per node:

mapreduce.tasktracker.map.tasks.maximum

mapreduce.tasktracker.reduce.tasks.maximum

Set the memory size of a single map or reduce task (a configuration sketch follows):

mapreduce.map.memory.mb (default 1 G)

mapreduce.reduce.memory.mb (default 1 G)
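A minimal mapred-site.xml sketch with illustrative values; the right numbers depend on the cores and memory of each node, as the memory allocation example under cluster planning below shows:

<!-- mapred-site.xml: illustrative values, size to your nodes -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>2048</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>4096</value>
</property>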

8. Deploy effective monitoring (e.g. nmon or ganglia) to collect metrics, analyze the metrics to find bottlenecks, and then take targeted measures.

Hardware-level performance tuning:

Spread the nodes evenly across separate racks.

Performance tuning at the operating system level:

Multiple network cards: bond multiple NICs for load balancing or an active/standby setup.

Disks: mount multiple disks to different directories. Disks that store computation data should not use RAID.

Cluster planning:

Cluster node memory allocation:

For example, on a data node with a task parallelism of 8: DataNode (2-4 G) + NodeManager (2 G) + ZooKeeper (4 G) + 1 G (default size of a single task) x 8 = 16 G~18 G.

Cluster size: suppose 1 T of data arrives per day and is kept for one month (enterprises commonly keep data for 7 or 15 days; if the data is more important, one month). Total storage: 1 T x 3 (replicas) x 30 (days) = 90 T. If each node has a 2 T hard disk, of which 60~70% is usable, the cluster needs about 90 T / (2 T x 70%) ≈ 60+ nodes.

These are all the contents of the article "how to optimize the performance of the Hadoop layer". Thank you for reading! We hope the content shared here helps you.
