Thursday, 2019-3-21
I. Operating system level optimization
1. File system: choose the XFS file system for Linux.
2. Read-ahead buffer
Read-ahead can effectively reduce the number of disk seeks and application I/O wait time. Increasing the size of the Linux file system read-ahead buffer (the default is 256 sectors, i.e. 128 KB) significantly improves the read performance of sequential files; it is recommended to raise it to 1024 or 2048 sectors. The read-ahead buffer is set with the blockdev command.
[root@NewCDH-0--141 ~]# df -Th
Filesystem              Type      Size  Used Avail Use% Mounted on
/dev/mapper/centos-root xfs        50G   45G  5.7G  89% /
devtmpfs                devtmpfs  7.8G     0  7.8G   0% /dev
tmpfs                   tmpfs     7.8G     0  7.8G   0% /dev/shm
tmpfs                   tmpfs     7.8G   49M  7.8G   1% /run
tmpfs                   tmpfs     7.8G     0  7.8G   0% /sys/fs/cgroup
/dev/mapper/centos-home xfs        46G  342M   46G   1% /home
/dev/sda1               xfs       497M  121M  377M  25% /boot
tmpfs                   tmpfs     1.6G     0  1.6G   0% /run/user/0
cm_processes            tmpfs     7.8G   58M  7.7G   1% /run/cloudera-scm-agent/process
tmpfs                   tmpfs     1.6G     0  1.6G   0% /run/user/997
[root@NewCDH-0--141]# blockdev --getra /dev/mapper/centos-root
8192
[root@NewCDH-0--141]# blockdev --getra /dev/mapper/centos-home
8192
The command to change it is:
blockdev --setra 2048 /dev/mapper/centos-home
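Note that blockdev settings do not survive a reboot. A common way to make them persistent (this is an assumption about the boot setup; adapt it to your own environment) is to append the same commands to /etc/rc.local, for example:
blockdev --setra 2048 /dev/mapper/centos-root
blockdev --setra 2048 /dev/mapper/centos-home
After a reboot, blockdev --getra can be run again to confirm the value took effect.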
3. Abandon RAID and LVM disk management and choose JBOD.
JBOD
JBOD is a storage enclosure with multiple disk drives mounted on a backplane. JBOD has no front-end logic to manage the disk data; each disk can be addressed independently and in parallel. Deploying DataNodes on servers configured with JBOD devices improves DataNode performance.
4. Memory tuning: swap
5. Adjust the memory allocation policy
6. Network parameter tuning (a brief sketch covering items 4-6 follows below)
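Items 4 to 6 are only listed as headings above. As a rough sketch, typical kernel settings on a Hadoop node look like the following; the exact values are assumptions commonly seen in Hadoop tuning guides and should be validated for your own cluster:
sysctl -w vm.swappiness=1          # keep the kernel from swapping Hadoop JVMs out
sysctl -w vm.overcommit_memory=1   # one possible memory allocation (overcommit) policy
sysctl -w net.core.somaxconn=1024  # larger listen backlog for busy DataNode/NameNode ports
To make them permanent, put the same keys in /etc/sysctl.conf and run sysctl -p.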
II. Performance optimization of HDFS cluster
Archiving
View the archive file // HAR archives are suitable for managing large numbers of small files in HDFS
[root@NewCDH-0--141]# sudo -u hdfs hadoop fs -ls har:///newdata.har
Found 1 items
drwxr-xr-x   - hdfs supergroup          0 2018-03-19 18:37 har:///newdata.har/mjh
[root@NewCDH-0--141]# sudo -u hdfs hadoop fs -ls har:///newdata.har/mjh
Found 3 items
drwxr-xr-x   - hdfs supergroup          0 2018-03-19 18:37 har:///newdata.har/mjh/shiyanshuju
-rw-r--r--   3 hdfs supergroup         17 2018-03-19 18:37 har:///newdata.har/mjh/test.txt
-rw-r--r--   3 hdfs supergroup         12 2018-03-19 18:37 har:///newdata.har/mjh/test2.txt
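For reference, an archive such as newdata.har is created with the hadoop archive tool, which runs a MapReduce job. A minimal sketch, where the source parent directory is hypothetical, is:
sudo -u hdfs hadoop archive -archiveName newdata.har -p /user/hdfs mjh /
This would pack the mjh directory into /newdata.har at the HDFS root, which matches the har:///newdata.har URIs listed above.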
Compress
III. HDFS cluster configuration optimization
1. dfs.namenode.handler.count: the number of NameNode server threads.
The number of threads used to process RPC calls in NameNode. The default is 32. For larger clusters and better configured servers, you can appropriately increase this value to improve the concurrency of NameNode RPC services.
A common rule of thumb is to set it to 20 times the natural logarithm of the cluster size, i.e. 20 × ln(N), where N is the number of servers in the cluster.
My HDFS cluster configuration is 128 TB of disk, 32 cores and 128 GB of memory per node; based on this rule the value was set to 120.
For 4 nodes this was initially taken as 20 × log(4) = 40 (recomputed with the natural logarithm below).
// The official CDH documentation now gives updated guidance on this setting.
In HDFS:
dfs.namenode.service.handler.count and dfs.namenode.handler.count - for each NameNode, set to ln(number of DataNodes in this HDFS service) × 20.
// We have 4 DataNodes, so:
NameNode service handler count
The original text is: dfs.namenode.service.handler.count and dfs.namenode.handler.count - For each NameNode, set to ln(number of DataNodes in this HDFS service) * 20.
Reference link: https://www.cloudera.com/documentation/enterprise/5-13-x/topics/cm_mc_autoconfig.html
This rule appears under the HDFS "General Rules" section of the Cloudera autoconfiguration page above.
Reference link: on the general rule for the dfs.namenode.handler.count configuration parameter:
https://blog.csdn.net/turk/article/details/79723963
The formula is:
python -c 'import math; print int(math.log(N) * 20)'
# N is the number of servers in the cluster
// For 4 nodes the result is 27; the calculation is:
[root@cdh-master-130-201 conf]# python -c 'import math; print int(math.log(4) * 20)'
27
We have 44 HDFS nodes online.
[root@cdh-master-130-201 conf]# python -c 'import math; print int(math.log(44) * 20)'
75
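To confirm the value currently in effect on the cluster, hdfs getconf can read the key back (assuming the client picks up the cluster configuration):
hdfs getconf -confKey dfs.namenode.handler.count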
2. dfs.datanode.handler.count: the number of DataNode server threads (default 3).
The number of threads used to process RPC calls in the DataNode. The default is 3; you can increase this value appropriately to increase the concurrency of the DataNode RPC service (recommended value: 20). Note: increasing the number of threads also increases the DataNode's memory requirements.
3. dfs.replication: the replication factor, default 3.
4. dfs.block.size: the HDFS block size, default 128 MB.
Setting the block size too small increases the load on the NameNode; setting it too large increases the time it takes to locate data. The value is also related to disk speed (I explained in an earlier post why the default is 128 MB; it has to do with disk throughput). When customizing the block size, consider two factors. First, what size range do most of your cluster's files fall into? If files are mostly around 64 MB to 128 MB, it is best not to change it; if most files are between 200 MB and 256 MB, you can raise the block size to 256 MB. Second, also take your disks' read and write performance into account.
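As a quick check of the block size a particular file actually uses (the path below is a placeholder), hadoop fs -stat can print it together with the replication factor:
hadoop fs -stat "blocksize: %o  replication: %r" /path/to/some/file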
5. dfs.datanode.data.dir remains unchanged.
These are the HDFS data storage directories. Spreading the data across every disk makes full use of the node's read and write performance, which is also why, in a real production environment, we choose JBOD rather than RAID and LVM. It is recommended to configure multiple disk directories, separated by commas, to increase disk I/O performance.
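A typical value, assuming one data directory per physical disk (the mount points below are hypothetical), would look like:
dfs.datanode.data.dir = /data1/dfs/dn,/data2/dfs/dn,/data3/dfs/dn,/data4/dfs/dn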
6. io.file.buffer.size (modified under the YARN service in CDH)
The HDFS file buffer size, which defaults to 4096 (4 KB). Recommended value: 131072 (128 KB). You need to edit the core-site.xml configuration file; if you use CDH, you can modify it directly in the YARN service.
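As with the other keys, the value in effect can be read back with hdfs getconf, assuming the client sees the cluster's core-site.xml:
hdfs getconf -confKey io.file.buffer.size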
7. fs.trash.interval: the filesystem trash interval
The period, in minutes, after which HDFS empties the trash. The default is 0, which means the trash feature is disabled. It is recommended to enable it; you can choose the interval yourself, for example 4 or 7 days (5760 or 10080 minutes).
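With the trash enabled, a deleted file is first moved into the user's .Trash directory instead of being removed immediately; the path below is only an example:
hadoop fs -rm /user/hdfs/test.txt              # moved under /user/hdfs/.Trash/Current/
hadoop fs -rm -skipTrash /user/hdfs/test.txt   # bypasses the trash and deletes at once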
8. dfs.datanode.du.reserved: space reserved, in bytes per volume, for non-DFS use.
The amount of disk space, in bytes, that a DataNode keeps in reserve. By default the DataNode can use all available disk space; this setting lets it leave some space for other applications. How much depends on the application, but it is advisable to reserve a little, 5 GB to 10 GB is enough.
The default here is 5 GB.
9. Rack awareness
10. dfs.datanode.max.xcievers (maximum transfer threads): the maximum number of threads used to transfer data in and out of the DataNode.
This value is the maximum number of files a DataNode can serve at the same time. It is recommended to increase it; the default is 256 and it can be raised as high as 65535.
11. Avoid reading from and writing to stale DataNodes // enable both
dfs.namenode.avoid.read.stale.datanode
dfs.namenode.avoid.write.stale.datanode
12. In the list of configuration properties used by the Service Monitor, search for "dfs.datanode.socket". The default is 3 seconds; here I changed "dfs.socket.timeout" and "dfs.datanode.socket.write.timeout" to 3000s.
13. DataNode balancing bandwidth
dfs.balance.bandwidthPerSec / dfs.datanode.balance.bandwidthPerSec: the maximum bandwidth each DataNode may use for balancing, in bytes per second.
// Set this according to each company's cluster network, since it governs transfers between DataNodes.
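Besides the configuration property, the balancer bandwidth can also be adjusted at runtime with dfsadmin; the 100 MB/s value below is only an example:
hdfs dfsadmin -setBalancerBandwidth 104857600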
14. Replication work multiplier per iteration
dfs.namenode.replication.work.multiplier.per.iteration
Increase this value (the default is 2; the recommended value is 10).
15. Maximum replication threads on a DataNode
dfs.namenode.replication.max-streams: recommended value 50
Hard limit on the number of replication threads on a DataNode
dfs.namenode.replication.max-streams-hard-limit: recommended value 100
The recommended values are 50 and 100 respectively.
16. fs.trash.checkpoint.interval
The interval, in minutes, between trash checkpoints. It should be less than or equal to fs.trash.interval. If set to 0, the value defaults to fs.trash.interval. Each time the checkpointer runs, it creates a new checkpoint.
The recommended value is 1 hour
// Specify the Filesystem Trash Interval property, which controls the number of minutes after which trash checkpoint directories are deleted and the number of minutes between trash checkpoints. For example, to have the trash purge deleted files after 24 hours, set the Filesystem Trash Interval property to 1440.
Note: the trash interval is measured from the moment a file is moved to the trash, not from when the file was last modified.
17. HDFS High Availability fencing methods remain unchanged
dfs.ha.fencing.methods
Explanation: a list of fencing methods to use for service fencing. shell(./cloudera_manager_agent_fencer.py) is a fencing mechanism designed to use the Cloudera Manager Agent. The sshfence method uses SSH. To use custom fencing (which might communicate with shared storage, power units, or network switches), use shell to invoke it.
[root@NewCDH-0--141 ~]# ls -l /run/cloudera-scm-agent/process/2428-hdfs-NAMENODE/cloudera_manager_agent_fencer.py
-rwxr- 1 hdfs hdfs 2149 Mar 21 15:51 /run/cloudera-scm-agent/process/2428-hdfs-NAMENODE/cloudera_manager_agent_fencer.py
Reference link:
https://blog.csdn.net/fromthewind/article/details/84106341
18. dfs.datanode.hdfs-blocks-metadata.enabled - for each HDFS service, set to true if there is an Impala service in the cluster. This rule is unscoped; it can fire even if the HDFS service is out of scope.
19. dfs.client.read.shortcircuit - for each HDFS service, set to true if there is an Impala service in the cluster. This rule is unscoped; it can fire even if the HDFS service is out of scope.
// That is, if an HDFS block is stored on the local machine, it is read locally instead of over the network, which improves efficiency; this helps HBase random reads and Impala performance.
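Short-circuit reads also require a UNIX domain socket path to be configured on the DataNodes. A minimal sketch of the relevant keys; the socket path shown is the one commonly used by CDH and should be treated as an assumption for other setups:
dfs.client.read.shortcircuit = true
dfs.domain.socket.path = /var/run/hdfs-sockets/dn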
20. dfs.datanode.data.dir.perm - for each DataNode, set to 755 if there is an Impala service in the cluster and the cluster is not kerberized. This rule is unscoped; it can fire even if the HDFS service is out of scope.
21. fs.trash.interval - for each HDFS service, set to 1 day (a filesystem trash interval of 1 day).
22. Setting the maximum file descriptors for a service
Minimum required role: Configurator (also provided by Cluster Administrator and Full Administrator).
You can set the maximum file descriptor parameter for all daemon roles. When not specified, a role uses whatever value it inherits from its supervisor. When specified, both the soft and hard limits are set to the configured value.
Go to the service.
Click the Configuration tab.
In the search box, type rlimit_fds.
Set the Maximum Process File Descriptors property for one or more roles.
Click Save Changes to commit the changes.
Restart the affected role instances.
// This is very important, and not only for HDFS: for all services managed by Cloudera we can, and should, set the maximum file descriptors to 65535.
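To verify what a running daemon actually received (the DataNode here is just an example), check the shell limit and the process limits:
ulimit -n
cat /proc/$(pgrep -f DataNode | head -1)/limits | grep 'open files'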
Reference link:
HDFS Cluster Optimization Section https://www.cnblogs.com/yinzhengjie/p/10006880.html
[Configuring CDH and Managed Services] Tuning HDFS before shutting down a DataNode: https://blog.csdn.net/a118170653/article/details/42774599
Reference link: https://www.cloudera.com/documentation/enterprise/5-13-x/topics/cm_mc_autoconfig.html
Reference link: https://www.cloudera.com/documentation/enterprise/5-13-x/topics/cm_mc_max_fd.html