
What are the important monitoring indicators of HDFS?


This article shares the important monitoring indicators of HDFS. The content is practical and is shared here for reference.

I. HDFS Monitoring Challenges

HDFS is part of the Hadoop ecosystem, so the monitoring scheme should apply not only to HDFS but also to other components such as Yarn, HBase, and Hive.

The HDFS API exposes many indicators. Some of them do not need to be collected in real time, but they must be obtainable quickly when a failure occurs.

Logs of Hadoop-related components also need to be covered, for purposes such as problem location and auditing.

In short, the monitoring scheme must serve not only monitoring itself but also the indicators involved in fault location.

II. Hadoop monitoring scheme

Hadoop monitoring data is collected through the HTTP API or through JMX. In practice, the main products used are CDH and Ambari; in addition, there are tools such as Jmxtrans and HadoopExporter (for Prometheus).

CDH is an open source Hadoop ecosystem management tool that integrates deployment, monitoring, and operations. Beyond the free version, it offers a paid edition with more features, such as data backup and recovery and fault location. The HDFS monitoring interface CDH provides is excellent in terms of experience; it is clearly a distillation of in-depth exploration of HDFS monitoring metrics, covering HDFS capacity, read/write traffic and latency, Datanode disk flush time, and so on.

Figure 1 HDFS monitoring interface provided by CDH

Like CDH, Ambari is an open source tool, but with better extensibility. Its information can also be displayed along different dimensions, such as machines, components, and clusters, which is closer to the habits of operations engineers.

Figure 2 HDFS monitoring interface provided by Ambari

Using CDH or Ambari for HDFS monitoring still raises practical problems:

The versions of Hadoop and related components cannot be freely customized.

They cannot fully meet the actual monitoring needs of large-scale HDFS clusters.

Other tools, such as Jmxtrans, do not currently adapt well to Hadoop. The actual monitoring solution was therefore chosen as follows:

Collection: HadoopExporter and the Hadoop HTTP API (note: for HDFS this mainly means calling http://{domain}:{port}/jmx; see the sketch after this list)

Logs: collected and analyzed through ELK

Storage: Prometheus

Display: Grafana, HDFS UI, Hue

Alarm: integrated with JD Cloud's alarm system
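For illustration, here is a minimal Python sketch of this collection path, pulling two NameNode metrics straight from the /jmx endpoint. The host and port are placeholders, and while FSNamesystem, MissingBlocks, and UnderReplicatedBlocks are standard Hadoop JMX names, verify them against your own deployment:

```python
# Minimal sketch: pull NameNode metrics from the JMX HTTP endpoint.
# Host/port are placeholders (the NameNode web port is 50070 before
# Hadoop 3.x, 9870 from 3.x on).
import json
import urllib.request

NAMENODE_JMX = "http://namenode.example.com:9870/jmx"  # placeholder

def fetch_bean(query):
    """Fetch a single JMX bean as a dict via the servlet's ?qry= filter."""
    with urllib.request.urlopen(f"{NAMENODE_JMX}?qry={query}") as resp:
        beans = json.load(resp)["beans"]
    return beans[0] if beans else {}

fs = fetch_bean("Hadoop:service=NameNode,name=FSNamesystem")
print("MissingBlocks:", fs.get("MissingBlocks"))
print("UnderReplicatedBlocks:", fs.get("UnderReplicatedBlocks"))
```

HadoopExporter does essentially this at scale, translating the JSON beans into Prometheus metrics.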

III. HDFS monitoring indicators

1. Overview of key indicators

Table 1 Overview of main monitoring metrics of HDFS

2. Black-box monitoring indicators

Basic functionality

Monitor whether any functional abnormality occurs across the whole life cycle of a file, mainly the actions of creating, viewing, modifying, and deleting.

When viewing, the content needs to be verified. One approach is to write a timestamp into the file and check that timestamp when reading it back; the time difference also tells you whether the write timed out (see the sketch below).

Remember to complete the full life cycle, deletion included; otherwise the temporary files generated by monitoring can pile up and bring the HDFS cluster down.
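A minimal sketch of such a probe, driven through the standard hdfs dfs commands; the probe path, local temp file, and timeout are placeholder choices:

```python
# Minimal black-box probe: write a timestamp, read it back, verify, delete.
import subprocess
import time

PROBE_DIR = "/tmp/hdfs_probe"               # placeholder HDFS path
PROBE_PATH = PROBE_DIR + "/heartbeat.txt"
LOCAL = "/tmp/hdfs_probe_heartbeat.txt"     # placeholder local path

def hdfs(*args):
    return subprocess.run(["hdfs", "dfs", *args],
                          capture_output=True, text=True, timeout=60)

ts = str(time.time())
with open(LOCAL, "w") as f:
    f.write(ts)

try:
    hdfs("-mkdir", "-p", PROBE_DIR)
    hdfs("-put", "-f", LOCAL, PROBE_PATH)                # create
    read_back = hdfs("-cat", PROBE_PATH).stdout.strip()  # view + verify
    assert read_back == ts, "content mismatch: write or read failed"
    print("probe round-trip: %.1fs" % (time.time() - float(ts)))
finally:
    hdfs("-rm", "-r", "-skipTrash", PROBE_DIR)           # complete the life cycle
```

The finally block is the point made above: the probe must always delete what it creates.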

3. White-box monitoring indicators

1) Errors

Number of lost blocks

Collection item: MissingBlocks

A lost block means that its file is already corrupted, so the risk of block loss needs to be anticipated before it happens (by monitoring UnderReplicatedBlocks).

Proportion of unavailable data nodes

Collection items:

isGoodTarget in BlockPlacementPolicyDefault.java defines the conditions for selecting a Datanode; two of them are whether the node is offline and whether it has enough storage space. If too many nodes are unavailable, unhealthy Datanodes may end up being selected, so a sufficient number of healthy Datanodes must be maintained.

Figure 4 Some of the conditions checked when selecting an available Datanode

Error log keyword monitoring

Monitor common errors in the logs (mainly Exception/ERROR lines), with keywords such as:

IOException, NoRouteToHostException, SafeModeException, UnknownHostException.
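In this setup the keyword monitoring runs through ELK, but as a self-contained illustration, a minimal scan of a local Namenode log for these keywords might look like this (the log path is a placeholder):

```python
# Minimal keyword scan over a Hadoop log file; in production this logic
# lives in ELK as a saved search/alert.
KEYWORDS = ("IOException", "NoRouteToHostException",
            "SafeModeException", "UnknownHostException")

def scan(log_path):
    with open(log_path, errors="replace") as f:
        for line in f:
            if any(k in line for k in KEYWORDS):
                yield line.rstrip()

for hit in scan("/var/log/hadoop/hadoop-hdfs-namenode.log"):  # placeholder
    print(hit)
```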

Number of under-replicated blocks

Collection item: UnderReplicatedBlocks

Large numbers of under-replicated blocks appear when Datanodes go offline, Datanodes fail, and so on.

FGC monitoring

Collection item: FGC (full garbage collection count)

Success rate of reading and writing

Collection items:

monitor_write.status / monitor_read.status

Aggregated from actual block read/write traffic; this is an important basis for external SLA indicators.

Data disk failure

Collection item: NumFailedVolumes

If a cluster has 1,000 hosts, each with 12 data disks (a standard configuration for storage machines), it has 12,000 data disks. Based on the average quarterly failure rate of 1.65% for mechanical disks (statistics from the data storage service provider Backblaze), about 66 disks fail per month on average. If the cluster grows further, operations engineers will spend a great deal of energy handling failed disks and restoring service. Clearly, an automated mechanism for data disk fault detection, repair, and service recovery becomes a hard requirement.
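Spelled out, using the figures cited above:

```python
# The disk-failure arithmetic from the paragraph above.
disks = 1000 * 12                    # hosts * data disks per host = 12,000
quarterly_failure_rate = 0.0165      # Backblaze's figure for mechanical disks
failed_per_month = disks * quarterly_failure_rate / 3
print(f"expected failed disks per month: {failed_per_month:.0f}")  # ~66
```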

Beyond monitoring failed disks themselves, there should be an overall solution for handling failed data disks. In practice, this is handled with scenario-based self-service tasks.

Figure 5 Scenario-based self-service tasks implemented in Jenkins

2) Traffic

Number of block reads and writes

Collection items:

Collected from each Datanode, then aggregated.

Network inbound and outgoing traffic

Collection item: node_network_receive_bytes_total/ node_network_transmit_bytes_total

There is no ready-made figure that can be used directly; the traffic has to be derived from ReceivedBytes (total bytes received) and SentBytes (total bytes sent), as sketched below.
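A minimal sketch of deriving a per-second rate from two samples of such a cumulative counter; the commented usage reuses the fetch_bean sketch above, and the RPC bean name and port there are assumptions to verify:

```python
# Minimal sketch: turn a cumulative byte counter (ReceivedBytes/SentBytes)
# into per-second traffic by sampling it twice.
import time

def bytes_per_second(read_counter, interval=30.0):
    """read_counter() must return the current cumulative byte count."""
    first = read_counter()
    time.sleep(interval)
    second = read_counter()
    return max(second - first, 0) / interval  # guard against counter resets

# Hypothetical usage, reusing fetch_bean from the earlier sketch:
# rpc_in = lambda: fetch_bean(
#     "Hadoop:service=NameNode,name=RpcActivityForPort8020")["ReceivedBytes"]
# print("inbound bytes/s:", bytes_per_second(rpc_in))
```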

Disk I/O

Collection item: node_disk_written_bytes_total/ node_disk_read_bytes_total

3) Latency

Average RPC processing time

Collection item: RpcQueueTimeAvgTime

Collect RpcQueueTimeAvgTime (average time an RPC spends queued) and SyncsAvgTime (average Journalnode sync time).

Number of slow nodes

Collection item: SlowPeerReports

The main characteristic of a slow node is that reads and writes landing on it take noticeably longer than the average, yet it can still return correct results given enough time. Besides machine hardware and the network, heavy load on the node is the other main cause of slow nodes. In practice, monitor not only the read/write latency on each node but also its load.

Depending on actual needs, you can flexibly adjust the Datanode reporting interval, or enable stale-node detection, so that the Namenode can accurately identify faulty instances. The configuration items involved are:

dfs.namenode.heartbeat.recheck-interval

dfs.heartbeat.interval

dfs.namenode.avoid.read.stale.datanode

dfs.namenode.avoid.write.stale.datanode

dfs.namenode.stale.datanode.interval

4) Capacity

Total space and space utilization of cluster

Collection item: PercentUsed

The HDFS UI devotes considerable space to storage-related metrics, which shows how important they are.

There is a trap in the space usage calculation: it includes the space of nodes that are going offline. If that space is still counted in the total while too many nodes are offline, you get a strange phenomenon: the cluster appears to have plenty of free space, yet nothing can be written.

In addition, Datanode space planning needs to reserve some space. The reserved space may be used by other programs, or occupied by files that have been deleted but are still referenced. If "Non DFS Used" keeps increasing, trace the specific cause and optimize it. The reserved space can be set with the following parameters:

dfs.datanode.du.reserved.calculator

dfs.datanode.du.reserved

dfs.datanode.du.reserved.pct

As an HDFS operations developer, you need to know this formula: Configured Capacity = Total Disk Space - Reserved Space = Remaining Space + DFS Used + Non DFS Used.
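As a sanity check of the formula, a small sketch reading the NameNodeInfo bean, reusing fetch_bean from the earlier sketch; the attribute names Total, Used, Free, and NonDfsUsedSpace are assumptions to verify on your Hadoop version:

```python
# Minimal sketch: verify the capacity identity against live JMX values.
# Assumes NameNodeInfo exposes Total, Used, Free, NonDfsUsedSpace (bytes).
info = fetch_bean("Hadoop:service=NameNode,name=NameNodeInfo")
configured = info["Total"]
residual = configured - (info["Free"] + info["Used"] + info["NonDfsUsedSpace"])
print("Configured Capacity (bytes):", configured)
print("identity residual (should be ~0):", residual)
```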

Namenode heap memory utilization

Collection items:

HeapMemoryUsage.used/HeapMemoryUsage.committed

It is no exaggeration to treat this as the core indicator of HDFS. Metadata and the block map occupy most of the Namenode heap, which is one reason HDFS is not suited to storing huge numbers of small files. Excessive heap usage causes slow Namenode startup and brings potential full GC risk, so heap memory usage must be monitored.

In practice, growth in heap memory usage is inevitable. Several effective mitigations:

Adjust the heap memory allocation

Establish a file life-cycle management mechanism to clean up useless files in time

Merge small files

Scale out with HDFS Federation

These measures can keep the risk down for a long time, but advance capacity planning for the cluster is still necessary.

Degree of data balance

Collection items:

For HDFS, how evenly data is stored across instances determines its safety to a certain extent. In practice, take the space utilization of every storage instance as a data set and compute its standard deviation, which reflects the degree of data balance between instances (see the sketch below).
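A minimal sketch of that computation; the utilization numbers are made up for illustration:

```python
# Minimal sketch: data balance as the standard deviation of per-instance
# space utilization. The values are illustrative only.
from statistics import pstdev

utilization_pct = [71.2, 68.9, 70.4, 93.7, 69.8]  # hypothetical per-node %
print("utilization stddev: %.2f" % pstdev(utilization_pct))
# A rising stddev means data is drifting out of balance; alert on a threshold.
```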

When the data volume is large, rebalancing is time-consuming, and even adjusting the concurrency and bandwidth makes it hard to finish quickly. In that situation, you can try decommissioning the instances whose space is exhausted first, then expand capacity to restore balance.

Note also that before Hadoop 3.0, the balancer could only balance data across nodes, not across the different data disks within a node.
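Rebalancing itself is driven by the standard balancer tool; a minimal invocation sketch, where -threshold is the allowed deviation (in percentage points) from mean utilization:

```python
# Minimal sketch: kick off the standard HDFS balancer from a script.
# -threshold 10 allows each node to deviate up to 10 percentage points
# from the cluster's mean utilization before data is moved.
import subprocess

subprocess.run(["hdfs", "balancer", "-threshold", "10"], check=False)
```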

Length of RPC request queue

Collection item: CallQueueLength (length of RPC request queue).

Number of files

Collection item: FilesTotal

Use this together with heap memory usage. Each file system object (file, directory, or block) occupies at least roughly 150 bytes of heap memory, from which you can estimate how many files a Namenode can hold (see the estimate below). The block size can also be tuned based on the ratio of files to blocks.
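A rough, illustrative estimate along those lines; the heap size and the one-block-per-file assumption are hypothetical:

```python
# Rough estimate: filesystem objects that fit in a Namenode heap at
# ~150 bytes per object. The heap size is a hypothetical example.
BYTES_PER_OBJECT = 150
heap_gib = 64

max_objects = heap_gib * 1024**3 // BYTES_PER_OBJECT
print(f"~{max_objects / 1e6:.0f}M objects in a {heap_gib} GiB heap")
# At one block per file (one file object + one block object), that is
# roughly max_objects / 2 files.
```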

Number of instances being decommissioned

Collection item: NumDecommissioningDataNodes

When the HDFS cluster is large, tracking this count in real time, repairing faulty nodes regularly, and bringing them back online promptly can save the company a real amount of cost.

5) Others

In addition to the main indicators above, general monitoring also needs to cover the servers, the process JVMs, and dependent services (Zookeeper, DNS).

IV. HDFS monitoring in practice

Grafana dashboards: mainly used for service inspection and fault location (note: the official Grafana HDFS monitoring templates expose relatively few indicators).

Figure 6 HDFS partial cluster Grafana dashboard

ELK-Hadoop: mainly used for global log retrieval and error log keyword monitoring.

Figure 7 Searching HDFS cluster logs in ES

Figure 8 Searching HDFS cluster logs in the log service

Hue, HDFS UI: mainly used for HDFS problem troubleshooting and daily maintenance.

V. HDFS cases

Case 1

DNS produced dirty data, causing Namenode HA switchover to fail.

Discovered by: functional monitoring; abnormal SLA indicators

Cause: the DNS server produced dirty data, which resulted in a wrong Namenode hostname, and the HA switchover failed because the wrong host could not be found

Suggestion: DNS, as the most basic of services, must keep its data correct and stable. Beyond a certain scale, do not fix hostname problems by editing /etc/hosts. If there is no highly available internal DNS service, it is recommended to build one with DNSMasq.

Case 2

Improper rack grouping caused HDFS writes to fail.

Discovered by: occasional alarms for abnormal writes from functional monitoring

Cause: rack awareness was enabled, machine resources were allocated unevenly across the rack groups, storage in some groups was exhausted, and no available node could be found when selecting a Datanode

Suggestion: allocate a reasonable number of instances to each rack and monitor them by group. For small clusters, consider turning off rack awareness

Thank you for reading! This concludes the article on the important monitoring indicators of HDFS. I hope the content above has been helpful; if you found it good, share it so more people can see it!
