This article describes how to use Arthas to diagnose an abnormal HBase process. It is quite practical and is shared here as a reference; read on for the details.
1. The anomaly appears
The CPU utilization of one RegionServer in the HBase cluster suddenly spiked to 100%. After restarting that RegionServer alone, the CPU load would still gradually climb back to its peak, and even after restarting the cluster several times, the CPU saturation reappeared and stayed high. Little by little the RegionServer would go down, and little by little the HBase cluster would be finished.
2. Symptoms of the anomaly
On the CDH monitoring page, almost every core metric except CPU looked normal: disk and network IO were low, memory was sufficient, and the compaction and flush queues were normal.
The Prometheus dashboards told a similar story, so they are not reproduced here.
The numbers in the monitoring metrics can only show us the symptoms; they cannot explain the cause of the anomaly. So our second step was to read the logs.
At the same time, the logs contained a lot of noisy output like the following.
This output later turned out to be irrelevant; it did not help the analysis and even got in the way of pinpointing the problem.
However, the large number of scan responseTooSlow warnings in the logs did seem to be telling us that many time-consuming scan operations were happening inside the HBase server, and that they might be the culprit behind the high CPU load. Due to various factors, we did not focus on this at the time, because we had frequently seen such messages in the past as well.
3. A first look at Arthas
Neither the monitoring nor the logs could tell us with certainty which operations were causing the high CPU load. With the top command we could only see that the HBase process was consuming a lot of CPU, as shown in the figure below.
top showed that the abnormal HBase process ID was 1214, which was the HRegionServer process. After starting Arthas and entering serial number 1, we were attached to that process and dropped into its command-line interface.
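For reference, a minimal sketch of attaching Arthas to the RegionServer (assuming arthas-boot.jar is downloaded into the current directory and the command is run by a user allowed to attach to the HBase process):

curl -O https://arthas.aliyun.com/arthas-boot.jar
java -jar arthas-boot.jar
# arthas-boot lists the local Java processes; typing the serial number shown
# next to the HRegionServer process (1 in our case) attaches Arthas to it
# and opens the interactive console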
4.3 dashboard
The dashboard command gives a real-time overview of the attached process: threads, memory, GC and runtime information.
4.4 thread
Enter the thread command to view the state of all threads under the process.
4.5 thread -n 3
Shows the three threads that consumed the most resources within the 5-second sampling window.
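A quick sketch of the thread subcommands used in this investigation (run inside the Arthas console; the thread ID below is a placeholder):

thread          # list all threads of the process with their state and CPU usage
thread -n 3     # show the three busiest threads together with their stack traces
thread 123      # print the stack trace of a single thread by its ID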
4.7 Using async-profiler to generate a flame graph
The simplest command sequence to generate a flame graph:
profiler start
(wait for a while, roughly 30 seconds)
profiler stop
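For reference, the same sequence with a couple of comments (run inside the Arthas console attached to the RegionServer; the exact output location and file format depend on the Arthas and async-profiler versions):

profiler start     # begin CPU sampling with the bundled async-profiler
# ... let it sample for roughly 30 seconds while the load is high ...
profiler status    # optionally confirm that the profiler is still running
profiler stop      # stop sampling; Arthas prints the path of the generated flame graph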
Introductory reading on flame graphs:
A tool for viewing the CPU flame graph of a JVM process.
In the flame graph it is clear that the threads consuming the most CPU time are the ones with the widest frames, and they correspond to scan operations.
5. High CPU load caused by scan operations
Through the analysis above we could finally confirm that scan operations were responsible for the high CPU load. Our HBase query API is a wrapper built on top of happybase.
In fact, ordinary scan operations return results normally, and the tables involved in the abnormal queries are not particularly large, so we ruled out hot-spotting. The business side's query logic boils down to:
from happybase.connection import Connection
import time

start = time.time()
con = Connection(host='ip', port=9090, timeout=3000)
table = con.table("table_name")
try:
    res = list(table.scan(filter="PrefixFilter('273810955|')",
                          row_start='\x0f\x10roomR\xca\xdf\x96\xcb\xe2\xad7$\xad9khE\x19\xfd\xaa\xa5\xdd\xf7\x85\x1c\x81ku^\x92k',
                          limit=3))
except Exception as e:
    pass
end = time.time()
print 'timeout: %d' % (end - start)
The combination of PrefixFilter and row_start is there to support paged queries. The pile of garbled characters in row_start is an encrypted user_id that contains special characters. In the logs, every time-consuming query had been passed parameters with such garbled characters, so we suspected the abnormal queries were related to them.
However, repeated follow-up tests showed the following:
# will time out
res = list(table.scan(filter="PrefixFilter('273810955|')", row_start='27', limit=3))
# will not time out
res = list(table.scan(filter="PrefixFilter('273810955|')", row_start='27381095', limit=3))
In other words, even without the garbled parameters, an inconsistent combination of filter and row_start triggers the abnormally high CPU usage. If the value of row_start is smaller than the filter prefix, the range of data to scan presumably becomes much larger, close to a full-table scan, and the CPU load is bound to rise.
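For reference, a minimal sketch of two safer ways to write this query (the connection parameters and row key prefix mirror the test code above and are illustrative, not production values): either keep row_start consistent with the filter prefix, or use happybase's row_prefix argument so the scan range is bounded by the prefix itself.

from happybase.connection import Connection

con = Connection(host='ip', port=9090, timeout=3000)
table = con.table("table_name")

# Option 1: start the scan inside the prefix range; for paging, row_start would
# normally be the last row key returned by the previous page, which already
# starts with the prefix, so the RegionServer never sweeps unrelated rows
res = list(table.scan(filter="PrefixFilter('273810955|')",
                      row_start='273810955|',
                      limit=3))

# Option 2: let happybase derive the start and stop rows from the prefix directly
res = list(table.scan(row_prefix='273810955|', limit=3))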
6. Frequently creating connections, or using a connection pool, causes the number of scan threads to keep growing
Our common HBase access code is wrapped around happybase and also uses happybase's connection pool. In further tests we found another phenomenon: whether we repeatedly created connections inside a loop or went through the connection pool, monitoring the threads with Arthas showed the number of scan threads growing badly. The test code is as follows:
6.1 The connection is created outside the loop and reused

from happybase.connection import Connection
import time

con = Connection(host='ip', port=9090, timeout=2000)
table = con.table("table")
for i in range(10000):  # loop count not preserved in the original; any large number reproduces the effect
    try:
        start = time.time()
        res = list(table.scan(filter="PrefixFilter('273810955|')",
                              row_start='\x0f\x10roomR\xca\xdf\x96\xcb\xe2\xad7$\xad9khE\x19\xfd\xaa\x87\xa5\xdd\xf7\x85\x1c\x81ku^\x92k',
                              limit=3))
    except Exception as e:
        pass
    end = time.time()
    print 'timeout: %d' % (end - start)
Once the program starts running, open Arthas to monitor the HRegionServer process, run the thread command, and look at the thread usage at that moment:
A few of the threads are RUNNING and most are WAITING. The CPU load at this point:
6.2 The connection is created inside the loop each time and then used
The code is as follows:
from happybase.connection import Connection
import time

for i in range(10000):  # loop count not preserved in the original; any large number reproduces the effect
    try:
        start = time.time()
        con = Connection(host='ip', port=9090, timeout=2000)
        table = con.table("table")
        res = list(table.scan(filter="PrefixFilter('273810955|')",
                              row_start='\x0f\x10roomR\xca\xdf\x96\xcb\xe2\xad7$\xad9khE\x19\xfd\xaa\x87\xa5\xdd\xf7\x85\x1c\x81ku^\x92k',
                              limit=3))
    except Exception as e:
        pass
    end = time.time()
    print 'timeout: %d' % (end - start)
In the figure below you can see more and more threads entering the RUNNING state, and CPU consumption rising along with them.
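As a hedged aside, recent Arthas versions can filter the thread listing by state, which makes it easier to count the scan handler threads (the --state option may not exist in very old Arthas releases):

thread --state RUNNABLE   # list only threads that are currently runnable
thread --state WAITING    # list only threads that are waiting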
6.3 Accessing HBase through a connection pool
The previous experiments had driven the CPU up, so the cluster was restarted to bring the CPU back to its earlier stable state. Then we continued testing with the code below.
No timeout specified:
from happybase import ConnectionPool
import time

pool = ConnectionPool(size=1, host='ip', port=9090)
for i in range(10000):  # loop count not preserved in the original; any large number reproduces the effect
    start = time.time()
    try:
        with pool.connection(2000) as con:
            table = con.table("table")
            res = list(table.scan(filter="PrefixFilter('273810955|')",
                                  row_start='\x0f\x10roomR\xca\xdf\x96\xcb\xe2\xad7$\xad9khE\x19\xfd\xaa\x87\xa5\xdd\xf7\x85\x1c\x81ku^\x92k',
                                  limit=3))
    except Exception as e:
        pass
    end = time.time()
    print 'timeout: %d' % (end - start)
Without a timeout, only one thread keeps running, because the connection pool size is set to 1.
With a timeout specified:
from happybase import ConnectionPool
import time

pool = ConnectionPool(size=1, host='ip', port=9090, timeout=2000)
for i in range(10000):  # loop count not preserved in the original; any large number reproduces the effect
    start = time.time()
    try:
        with pool.connection(2000) as con:
            table = con.table("table")
            res = list(table.scan(filter="PrefixFilter('273810955|')",
                                  row_start='\x0f\x10roomR\xca\xdf\x96\xcb\xe2\xad7$\xad9khE\x19\xfd\xaa\x87\xa5\xdd\xf7\x85\x1c\x81ku^\x92k',
                                  limit=3))
    except Exception as e:
        pass
    end = time.time()
    print 'timeout: %d' % (end - start)
In this test I specified a timeout on the connection pool, expecting the connection to time out, disconnect promptly, and move on to the next time-consuming query. The server-side threads handling the scan requests at that point:
7. hbase.regionserver.handler.count
Based on an expert's blog post and my own understanding of this parameter: every RPC request (read or write) a client sends to the server is handled by a dedicated server-side thread pool. That pool guarantees that up to 30 threads (the default) can run at the same time; the remaining requests are either blocked or queued, waiting to be processed. When scan requests fill up the server's handler pool with a large number of time-consuming operations, CPU resources are exhausted, other ordinary read and write requests are inevitably affected, and the cluster gradually grinds to a halt.
8. Confining scan requests to a small number of queues
First of all, hbase.regionserver.handler.count cannot simply be reduced. If it is too small, read and write latency will shoot up whenever cluster concurrency is high, because most requests end up queuing. Ideally, reads and writes would use separate thread pools, and within read handling, scan and get would use separate pools, so that thread pool resources are isolated. My first instinct would also be the blunt approach: a write pool, a get pool and a scan pool, with the scan pool given very few core threads so that it occupies little and cannot expand without limit. But is that how it actually works? I have not studied the source code carefully yet. HBase provides the following parameters to separate read and write resources.
hbase.regionserver.handler.count
The number of RPC listener instances spun up on a RegionServer. The Master uses the same property for the count of master handlers. Too many handlers can be counter-productive; make it a multiple of the CPU count. If the workload is mostly read-only, a handler count close to the CPU count does well. Start with twice the CPU count and tune from there. Default: 30.
hbase.ipc.server.callqueue.handler.factor
Factor that determines the number of call queues. A value of 0 means a single queue shared by all handlers; a value of 1 means each handler gets its own queue. Default: 0.1.
hbase.ipc.server.callqueue.read.ratio
Splits the call queues into read and write queues. The specified value (which should be between 0.0 and 1.0) is multiplied by the number of call queues. A value of 0 means the call queues are not split, so read and write requests are pushed to the same set of queues. A value lower than 0.5 means there are fewer read queues than write queues; 0.5 means equal numbers of read and write queues; a value greater than 0.5 means more read queues than write queues; and 1.0 means all queues except one are used to dispatch read requests. Example: given 10 call queues in total, a read.ratio of 0 means the 10 queues hold both read and write requests; 0.3 means 3 queues hold only read requests and 7 hold only write requests; 0.5 means 5 read-only queues and 5 write-only queues; 0.8 means 8 read-only queues and 2 write-only queues; 1 means 9 read-only queues and 1 write-only queue. Default: 0.
hbase.ipc.server.callqueue.scan.ratio
Given the number of read call queues (the total number of call queues multiplied by callqueue.read.ratio), the scan.ratio property splits the read call queues into short-read and long-read queues. A value lower than 0.5 means fewer long-read queues than short-read queues; 0.5 means equal numbers of short-read and long-read queues; a value greater than 0.5 means more long-read queues than short-read queues; and a value of 0 or 1 means the same set of queues is used for gets and scans. Example: given 8 read call queues in total, a scan.ratio of 0 or 1 means the 8 queues hold both long and short read requests; 0.3 means 2 queues hold only long-read requests and 6 hold only short-read requests; 0.5 means 4 long-read queues and 4 short-read queues; 0.8 means 6 long-read queues and 2 short-read queues. Default: 0.
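Putting these together, a hedged example configuration (the values below are illustrative, not a recommendation; the resulting queue counts follow the arithmetic described above):

hbase.regionserver.handler.count = 30            # 30 RPC handler threads per RegionServer
hbase.ipc.server.callqueue.handler.factor = 1.0  # one call queue per handler -> 30 call queues
hbase.ipc.server.callqueue.read.ratio = 0.5      # 15 read queues, 15 write queues
hbase.ipc.server.callqueue.scan.ratio = 0.2      # of the 15 read queues, roughly 3 serve long reads (scans) and 12 serve short reads (gets)

These would normally be set as property entries in hbase-site.xml on the RegionServers and take effect after a restart.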
The official documentation explains what these parameters do in detail. By configuring suitable ratios you can, in principle, separate the read and write queues, and the get and scan queues. However, after deploying the parameters and repeating the tests above, we found that the number of RUNNING threads still could not be kept under control; it did not seem to help at all.
This leaves a question: what exactly is the relationship between these queues and the thread pool as I understand it? Are they the same thing? To answer that, we will have to read the source code and look at what is really going on underneath.
Thank you for reading! This concludes the article on how to use Arthas to diagnose an abnormal HBase process. I hope it has been helpful; if you found it useful, feel free to share it.