2025-01-16 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/01 Report--
This article describes the important monitoring indicators of HDFS. It is intended as a practical reference.
I. HDFS Monitoring Challenge
HDFS is part of the Hadoop ecosystem, so the monitoring scheme should apply not only to HDFS but also to other components such as Yarn, Hbase and Hive.
The HDFS API exposes many indicators. Some of them do not need to be collected in real time, but they must be quickly obtainable when a failure occurs.
Logs of Hadoop-related components also need to be collected, for purposes such as troubleshooting and auditing.
The monitoring scheme must not only satisfy monitoring itself, but also cover the indicators needed for fault location.
II. Hadoop monitoring scheme
Hadoop monitoring data is collected through the HTTP API or JMX. In practice, the main products used are CDH and Ambari. There are also tools such as Jmxtrans and HadoopExporter (for Prometheus).
CDH is an open source management tool for Hadoop ecosystem components that integrates deployment, monitoring and operation. It also offers a paid version with more features than the free version, such as data backup and recovery and fault location. The HDFS monitoring interface provided by CDH offers an excellent experience and distills an in-depth study of HDFS monitoring metrics, such as HDFS capacity, read and write traffic and latency, and Datanode disk flush time.
Figure 1 HDFS monitoring interface provided by CDH
Like CDH, Ambari is also an open source tool, but it is more extensible. It can also present information along different dimensions, such as machine, component and cluster, which matches the working habits of operation and maintenance engineers.
Figure 2 HDFS monitoring interface provided by Ambari
If you use CDH or Ambari for HDFS monitoring, you also run into practical problems:
The versions of Hadoop and related components cannot be customized.
They do not meet the actual monitoring needs of large-scale HDFS clusters well.
Other tools, such as Jmxtrans, do not currently adapt well to Hadoop. Therefore, the monitoring solution actually selected is:
Collection: HadoopExporter, Hadoop HTTP API (note: for HDFS this mainly means calling http://{domain}:{port}/jmx)
Logs: collected and analyzed through ELK
Storage: Prometheus
Display: Grafana, HDFS UI, Hue
Alerting: integrated with the JD Cloud alarm system
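As a rough illustration of the collection step, the /jmx endpoint returns a JSON document with a top-level "beans" list, and the qry parameter narrows it to one bean. The sketch below is a minimal example, not the collector used in this scheme: the helper names fetch_jmx and find_bean and the sample values are assumptions made for illustration.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def fetch_jmx(base_url, query=None, timeout=5):
    """Fetch the /jmx document from a Hadoop daemon and decode the JSON.

    base_url is something like "http://{domain}:{port}" (no trailing slash).
    """
    url = f"{base_url}/jmx"
    if query:
        # qry filters the response to beans matching the given name.
        url += "?qry=" + quote(query, safe=":=,*")
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def find_bean(jmx_doc, bean_name):
    """Return the first bean whose "name" equals bean_name, or None."""
    for bean in jmx_doc.get("beans", []):
        if bean.get("name") == bean_name:
            return bean
    return None


# Sample payload shaped like a Namenode /jmx response (values are made up).
SAMPLE = {
    "beans": [
        {
            "name": "Hadoop:service=NameNode,name=FSNamesystem",
            "MissingBlocks": 0,
            "UnderReplicatedBlocks": 12,
            "FilesTotal": 1500000,
        }
    ]
}

fs_bean = find_bean(SAMPLE, "Hadoop:service=NameNode,name=FSNamesystem")
print(fs_bean["UnderReplicatedBlocks"])
```

Against a live cluster, fetch_jmx would replace SAMPLE as the source of the document; parsing stays the same.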
III. HDFS monitoring indicators
1. Overview of key indicators
Table 1 Overview of main monitoring metrics of HDFS
2. Black box monitoring indicators
Basic function
Monitor whether any functional abnormality occurs across the whole life cycle of a file, mainly the create, view, modify and delete operations.
When viewing, the content needs to be verified. One approach is to write a timestamp into the file and check that timestamp when reading it back. The time difference then tells you whether the write timed out.
Be sure to complete the whole life cycle, including deletion. Otherwise, the large number of temporary files generated by monitoring may bring down the HDFS cluster.
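The probe described above can be sketched as follows. This is only an outline under stated assumptions: InMemoryFS is a hypothetical stand-in for whatever HDFS client is actually used, and the probe path and max_delay threshold are illustrative values.

```python
import time


class InMemoryFS:
    """Hypothetical stand-in for an HDFS client; swap in a real client."""

    def __init__(self):
        self._files = {}

    def write(self, path, data):
        self._files[path] = data

    def read(self, path):
        return self._files[path]

    def delete(self, path):
        self._files.pop(path, None)

    def exists(self, path):
        return path in self._files


def probe(fs, path="/tmp/monitor/probe.txt", max_delay=5.0, clock=time.time):
    """Run one write/read/delete cycle and return True if it succeeded."""
    written = f"{clock():.6f}"
    ok = False
    try:
        fs.write(path, written)
        content = fs.read(path)
        # Verify the content: the timestamp read back must be the one written.
        same = content == written
        # The gap between writing and reading back reveals a write timeout.
        delay = clock() - float(content)
        ok = same and delay <= max_delay
    except Exception:
        ok = False
    finally:
        # Always delete the probe file so temporary files never pile up.
        try:
            fs.delete(path)
        except Exception:
            pass
    return ok


fs = InMemoryFS()
result = probe(fs)
print(result)
```

The finally block is the important design choice here: the file is removed even when the read or the verification fails, which keeps the life cycle complete.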
3. White box monitoring indicators
1) Errors
Number of lost Blocks
Collection item: MissingBlocks
A lost block means that a file has been corrupted, so the risk needs to be anticipated before any block is actually lost, by monitoring UnderReplicatedBlocks.
Proportion of unavailable data nodes
Collection items:
isGoodTarget in BlockPlacementPolicyDefault.java defines the strategy for selecting Datanode nodes. Two of its conditions are "whether the node is being taken offline" and "whether there is enough storage space". If too many nodes are unavailable, an unhealthy Datanode may end up being selected. Therefore, a certain number of healthy Datanodes must be guaranteed.
Figure 4 Some of the conditions checked when selecting an available Datanode
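The collection items for this indicator are not listed above. One possible way to compute the proportion, assuming the Namenode's NumLiveDataNodes, NumDeadDataNodes and NumStaleDataNodes metrics are used as inputs (an assumption of this sketch, not something the scheme specifies), is:

```python
def unavailable_ratio(live, dead, stale=0):
    """Share of Datanodes that are dead or stale.

    Stale nodes are still counted among the live nodes, so the total
    number of nodes is live + dead.
    """
    total = live + dead
    if total == 0:
        return 0.0
    return (dead + stale) / total


# Made-up example: 90 live nodes (5 of them stale) and 10 dead nodes.
ratio = unavailable_ratio(live=90, dead=10, stale=5)
print(f"{ratio:.0%}")
```

An alert threshold on this ratio would depend on the replication factor and the spare capacity of the cluster.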
Error log keyword monitoring
Monitoring of common errors (mainly Exception/ERROR entries), using the following keywords:
IOException, NoRouteToHostException, SafeModeException, UnknownHostException.
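In this scheme the keyword monitoring runs on ELK. Purely to illustrate the idea, the following sketch counts the same keywords in a handful of log lines. The sample lines are invented and count_keywords is a hypothetical helper.

```python
import re
from collections import Counter

KEYWORDS = (
    "IOException",
    "NoRouteToHostException",
    "SafeModeException",
    "UnknownHostException",
)
# \b keeps "IOException" from matching inside a longer class name.
PATTERN = re.compile(r"\b(" + "|".join(KEYWORDS) + r")\b")


def count_keywords(lines):
    """Count occurrences of each monitored keyword across log lines."""
    counts = Counter()
    for line in lines:
        for match in PATTERN.findall(line):
            counts[match] += 1
    return counts


# Invented sample log lines, for illustration only.
SAMPLE_LOG = [
    "ERROR datanode.DataNode: java.io.IOException: Connection reset by peer",
    "WARN ipc.Client: java.net.NoRouteToHostException: No route to host",
    "INFO namenode.FSNamesystem: Roll Edit Log",
    "ERROR datanode.DataNode: java.io.IOException: Premature EOF",
]

counts = count_keywords(SAMPLE_LOG)
print(dict(counts))
```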
Number of under-replicated Blocks
Collection item: UnderReplicatedBlocks
When a data node goes offline or fails, a large number of blocks become under-replicated and have to be re-replicated.
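The two block metrics can be combined into a simple alert rule: any missing block is critical, while a large under-replicated count is an early warning. The sketch below assumes an illustrative threshold of 1000 blocks. A real threshold has to be tuned to the cluster.

```python
def block_alert(missing_blocks, under_replicated_blocks, under_replicated_warn=1000):
    """Map MissingBlocks and UnderReplicatedBlocks to an alert level.

    under_replicated_warn is an assumed, illustrative threshold.
    """
    if missing_blocks > 0:
        # Data has already been lost or is unreadable.
        return "critical"
    if under_replicated_blocks >= under_replicated_warn:
        # Blocks are at risk until re-replication catches up.
        return "warning"
    return "ok"


print(block_alert(0, 25))
print(block_alert(0, 5000))
print(block_alert(3, 0))
```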
FGC monitoring
Collection item: FGC
Read and write success rate
Collection items:
monitor_write.status/monitor_read.status
Aggregated from actual Block read and write traffic, this is an important basis for externally facing SLA indicators.
Data disk failure
Collection item: NumFailedVolumes
If a cluster has 1000 hosts, each with 12 disks (the standard configuration for general storage machines), that makes 12,000 data disks. Based on an average quarterly failure rate of 1.65% for mechanical disks (statistics from data storage service provider Backblaze), that works out to about 198 failed disks per quarter, or roughly 66 per month. If the cluster grows further, the operation and maintenance engineers will spend a great deal of effort on handling faulty disks and restoring service. Clearly, a mechanism for automatic fault detection, automatic repair and automatic service recovery for data disks becomes a hard requirement.
Besides monitoring faulty disks, there should be an overall solution for handling them. In practice, this is handled scenario by scenario through self-service tasks.
Figure 5 Jenkins self-service task based on scenario implementation
2) Traffic
The number of Block reads and writes
Collection items:
Collect Datanode data for aggregation calculation.
Network inbound and outbound traffic
Collection item: node_network_receive_bytes_total/ node_network_transmit_bytes_total
No ready-made rate is available for direct use, so it has to be calculated from ReceivedBytes (total number of bytes received) and SentBytes (total number of bytes sent).
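Because these values are cumulative totals, throughput has to be derived from the difference between two samples. A minimal sketch of that calculation follows. The helper name and sample values are assumptions, and in a Prometheus setup the same result would normally come from a rate query rather than hand-written code.

```python
def rate_per_second(prev, curr):
    """Compute bytes per second from two samples of a cumulative counter.

    prev and curr are (timestamp_seconds, counter_value) tuples.
    """
    (t0, v0), (t1, v1) = prev, curr
    elapsed = t1 - t0
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    if v1 < v0:
        # The counter was reset (for example after a process restart):
        # count only what has accumulated since the reset.
        return v1 / elapsed
    return (v1 - v0) / elapsed


# Made-up samples taken 10 seconds apart.
received = rate_per_second((0, 1_000), (10, 6_000))
print(received)
```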
Disk I/O
Collection item: node_disk_written_bytes_total/ node_disk_read_bytes_total
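3) Latency
Average RPC processing time
Collection item: RpcQueueTimeAvgTime
Collect RpcQueueTimeAvgTime (the average time an RPC waits in the call queue; the related RpcProcessingTimeAvgTime covers the processing time itself) and SyncsAvgTime (the average Journal synchronization time).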
Number of slow nodes
Collection item: SlowPeerReports
The main feature of a slow node is that reads and writes landing on it take noticeably longer than average, yet still return correct results if given enough time. Besides machine hardware and network problems, heavy load on the node is another common cause of slow nodes. In actual monitoring, the load on the node needs to be monitored in addition to its read and write latency.
Depending on actual needs, you can adjust the Datanode reporting interval, or turn on stale node detection, so that the Namenode can accurately identify faulty instances. The configuration items involved include:
dfs.namenode.heartbeat.recheck-interval
dfs.heartbeat.interval
dfs.namenode.avoid.read.stale.datanode
dfs.namenode.avoid.write.stale.datanode
dfs.namenode.stale.datanode.interval
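The first two settings together determine how long the Namenode waits before declaring a Datanode dead: twice the recheck interval plus ten heartbeat intervals. The sketch below works through that arithmetic using the stock default values (300000 ms and 3 s). The function name is an assumption for illustration.

```python
def dead_node_timeout_seconds(recheck_interval_ms=300_000, heartbeat_interval_s=3):
    """Seconds without a heartbeat before a Datanode is marked dead.

    recheck_interval_ms: dfs.namenode.heartbeat.recheck-interval (milliseconds)
    heartbeat_interval_s: dfs.heartbeat.interval (seconds)
    """
    return 2 * recheck_interval_ms / 1000 + 10 * heartbeat_interval_s


default_timeout = dead_node_timeout_seconds()
print(default_timeout)  # 630.0 seconds, i.e. 10.5 minutes

# Lowering the recheck interval to 60 s shortens detection considerably.
print(dead_node_timeout_seconds(recheck_interval_ms=60_000))
```

Stale detection works on a much shorter scale: dfs.namenode.stale.datanode.interval defaults to 30000 ms, so a node can be avoided for reads and writes long before it is declared dead.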
4) Capacity
Total space and space utilization of cluster
Collection item: PercentUsed
The HDFS UI devotes a lot of space to storage-related metrics, which shows how important they are.
The space usage calculation includes the space of nodes that are offline, which is a trap. If some nodes are offline but the space they represent is still counted in the total, then with too many offline nodes a strange phenomenon appears: the cluster seems to have plenty of space left, yet there is no space to write to.
In addition, some space needs to be reserved when planning Datanode space. Space on the data disks may be taken up by other programs, or by files that have been deleted but are still being referenced. If "Non DFS Used" keeps increasing, you need to track down the specific cause and optimize it. The reserved space can be set with the following parameters:
dfs.datanode.du.reserved.calculator
dfs.datanode.du.reserved
dfs.datanode.du.reserved.pct
As an HDFS operation and maintenance developer, you need to be aware of this formula: Configured Capacity = Total Disk Space - Reserved Space = Remaining Space + DFS Used + Non DFS Used.
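A small worked example may make the formula concrete. The function below is only a sketch with made-up numbers. It derives the remaining space from the other quantities and also reports usage as a percentage of configured capacity.

```python
def capacity_report(total_disk, reserved, dfs_used, non_dfs_used):
    """Apply: Configured Capacity = Total Disk Space - Reserved Space
                                   = Remaining + DFS Used + Non DFS Used.

    All arguments use the same unit (for example GB).
    """
    configured = total_disk - reserved
    remaining = configured - dfs_used - non_dfs_used
    percent_used = 100 * dfs_used / configured
    return {
        "configured": configured,
        "remaining": remaining,
        "percent_used": percent_used,
    }


# Made-up example in GB: 1000 total, 100 reserved, 450 DFS, 50 non-DFS.
report = capacity_report(total_disk=1000, reserved=100, dfs_used=450, non_dfs_used=50)
print(report)
```

Note how growth in Non DFS Used reduces the remaining space even though DFS Used is unchanged.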
Namenode heap memory utilization
Collection items:
HeapMemoryUsage.used/HeapMemoryUsage.committed
It is no exaggeration to treat this as the core indicator of HDFS. Metadata and the Block mapping take up most of the Namenode heap, which is one of the reasons why HDFS is not suitable for storing a large number of small files. Excessive heap usage may lead to slow Namenode startup and a risk of FGC, so heap memory usage needs to be monitored.
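The ratio can be read from the java.lang:type=Memory bean in the Namenode's JMX output, whose HeapMemoryUsage attribute carries used, committed and max values. A minimal sketch, with an invented sample bean and a hypothetical helper name:

```python
def heap_utilization(memory_bean, denominator="committed"):
    """Return HeapMemoryUsage.used divided by committed (or max)."""
    usage = memory_bean["HeapMemoryUsage"]
    return usage["used"] / usage[denominator]


# Invented sample shaped like the java.lang:type=Memory bean.
SAMPLE_MEMORY_BEAN = {
    "name": "java.lang:type=Memory",
    "HeapMemoryUsage": {
        "committed": 8_000_000_000,
        "init": 8_000_000_000,
        "max": 16_000_000_000,
        "used": 6_000_000_000,
    },
}

utilization = heap_utilization(SAMPLE_MEMORY_BEAN)
print(f"{utilization:.0%}")
```

Dividing by max instead of committed gives a more conservative view when the heap is allowed to grow.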
In practice, growth in heap memory usage is inevitable. Several effective measures are:
Adjust the heap memory allocation
Establish a file life cycle management mechanism to clean up useless files in time
Merge small files
Scale out with HDFS Federation
Although these measures can reduce the risk over the long term, it is still necessary to plan the cluster in advance.
Degree of data balance
Collection items:
For HDFS, how evenly data is stored determines its safety to a certain extent. In practice, the standard deviation of the space utilization of all storage instances is calculated and used to reflect the degree of data balance between instances.
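The standard deviation calculation can be sketched in a few lines. The node names and utilization percentages below are made up, and the population standard deviation is used because every instance is included.

```python
from statistics import pstdev


def balance_stddev(used_percent_by_node):
    """Population standard deviation of per-node space utilization (%)."""
    return pstdev(list(used_percent_by_node.values()))


# Made-up utilization figures per Datanode.
balanced = {"dn1": 70, "dn2": 70, "dn3": 70}
skewed = {"dn1": 60, "dn2": 80}

print(balance_stddev(balanced))
print(balance_stddev(skewed))
```

A value near zero means the instances are evenly filled. The larger the value, the more skewed the data placement.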
When the data volume is large, rebalancing is time-consuming, and it is hard to finish quickly even by tuning concurrency and speed. In this situation, you can try taking instances that have run out of space offline first, and then expanding capacity to restore balance.
Also note that before version 3.0, data can only be balanced between nodes, not between the different data disks within a node.
Length of RPC request queue
Collection item: CallQueueLength (length of RPC request queue).
Number of files
Collection item: FilesTotal
Used together with heap memory usage. Each file system object (files, directories and Blocks) occupies at least 150 bytes of heap memory, from which you can roughly estimate how many files a Namenode can hold. The block size can also be tuned according to the ratio of files to blocks.
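The 150-byte rule of thumb lends itself to a quick estimate. The sketch below treats 150 bytes as a lower bound per object, so the results are rough minimums. The function names and example counts are assumptions.

```python
BYTES_PER_OBJECT = 150  # rule-of-thumb lower bound per file, directory or block


def estimated_heap_bytes(files_and_dirs, blocks):
    """Rough minimum Namenode heap needed for the given object counts."""
    return (files_and_dirs + blocks) * BYTES_PER_OBJECT


def max_objects(heap_bytes):
    """Rough upper bound on objects that fit in the given heap."""
    return heap_bytes // BYTES_PER_OBJECT


# Made-up example: 100 million files and directories, one block each.
needed = estimated_heap_bytes(100_000_000, 100_000_000)
print(needed / 1e9)  # 30.0 (GB, decimal)

print(max_objects(15_000_000_000))
```

Because each small file costs at least one file object plus one block object, many small files consume heap far faster than the same data stored in fewer large files.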
Number of offline instances
Collection item: NumDecommissioningDataNodes
When the HDFS cluster is large, tracking healthy instances in real time, repairing faulty nodes regularly and bringing them back online promptly can save the company a certain amount of cost.
5) Other
In addition to the main indicators above, general monitoring policies for servers, the process JVM and dependent services (Zookeeper, DNS) also need to be added.
IV. HDFS monitoring in practice
Grafana dashboard: mainly used for service inspection and fault location (note: the official Grafana HDFS monitoring template has relatively few data indicators).
Figure 6 Grafana dashboard for part of an HDFS cluster
ELK-Hadoop: mainly used for global log retrieval and error log keyword monitoring.
Figure 7 Searching HDFS cluster logs in ES
Figure 8 Searching HDFS cluster logs in the log service
Hue, HDFS UI: mainly used for HDFS troubleshooting and daily maintenance.
V. HDFS case
Case 1
DNS produced dirty data, causing Namenode HA failover to fail.
How it was discovered: functional monitoring and abnormal SLA indicators
Cause of failure: the DNS server produced dirty data, which made the Namenode hostname resolve incorrectly, so the host could not be found during HA failover
Optimization suggestion: as the most basic service, DNS must keep its data correct and stable. Beyond a certain scale, do not solve hostname problems by modifying /etc/hosts. If there is no highly available internal DNS service, it is recommended to use DNSMasq to build a DNS server.
Case 2
Racks were grouped unreasonably, causing HDFS writes to fail.
How it was discovered: occasional alarms from functional monitoring of writes
Cause of failure: HDFS had rack awareness enabled and machine resources were unevenly allocated across groups, so the storage of some groups was exhausted and no available node could be found when selecting a Datanode
Optimization suggestion: allocate the number of instances on each rack reasonably and monitor them by group. For small clusters, consider turning off rack awareness.
Thank you for reading. That concludes this article on the important monitoring indicators of HDFS. Hopefully it is of some help.
© 2024 shulou.com SLNews company. All rights reserved.