Recently I ran into a rather strange problem: a regionserver lost its ZooKeeper connection because of a long GC pause and was eventually kicked out of the cluster. What happened next, however, was where the nightmare began.
1. A regionserver was kicked out of the cluster because a GC pause made its connection to ZooKeeper time out.
~~~ HBase regionserver log ~~~
2018-05-31 11:42:17,739 INFO  [MemStoreFlusher.0] regionserver.HRegion: Started memstore flush for cn_kong_groups,\x00\x00\xBB\xE9\x03\x03\x00D\xDF,1527701650816.a177e358544ffe3157a4c0531feb8e5a., current region memstore size 123.40 MB, and 1/1 column families' memstores are being flushed.
2018-05-31 11:42:17,740 WARN  [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 52612ms
GC pool 'ParNew' had collection(s): count=1 time=45897ms
GC pool 'ConcurrentMarkSweep' had collection(s): count=2 time=6814ms
2018-05-31 11:42:17,741 WARN  [B.defaultRpcServer.handler=0,queue=0,port=16020] ipc.RpcServer: (responseTooSlow): {"processingtimems":52721,"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)","client":"10.21.23.232:55676","starttimems":1527738085020,"queuetimems":0,"class":"HRegionServer","responsesize":15,"method":"Scan"}
2018-05-31 11:42:17,745 INFO  [regionserver/regionserver1.bigdata.com/172.16.11.66:16020-SendThread(ip-10-21-14-154.bigdata.com:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 61597ms for sessionid 0x15f3454790276db, closing socket connection and attempting reconnect
2018-05-31 11:42:17,745 INFO  [main-SendThread(ip-10-21-14-154.bigdata.com:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 61595ms for sessionid 0x15f3454790276da, closing socket connection and attempting reconnect
2018-05-31 11:42:17,893 INFO  [sync.3] wal.FSHLog: Slow sync cost: 148ms, current pipeline: [DatanodeInfoWithStorage[172.16.11.66:50010,DS-ac448b6f-...-2964-4900-aeda-4547f2d956b8,DISK], DatanodeInfoWithStorage[172.16.11.67:50010,DS-e8b1727a-81d5-4b65-9854-6b6ad6749b64,DISK], DatanodeInfoWithStorage[10.21.23.41:50010,DS-56e0e28-5b3c-4047-...,DISK]]
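As a side note, a quick way to sanity-check this kind of eviction is to compare the configured ZooKeeper session timeout with the pauses reported by JvmPauseMonitor. The shell sketch below is only a rough illustration; the config and log paths (/etc/hbase/conf, /var/log/hbase) are assumptions that depend on your deployment.

# Effective session timeout (zookeeper.session.timeout, in ms) configured on the HBase side:
grep -A1 'zookeeper.session.timeout' /etc/hbase/conf/hbase-site.xml
# JvmPauseMonitor warnings; pauses approaching or exceeding the negotiated session
# timeout get the regionserver evicted, exactly as in the log above:
grep 'Detected pause in JVM' /var/log/hbase/hbase-*-regionserver-*.log | tail -n 20

Keep in mind that the session timeout actually negotiated with ZooKeeper can be lower than the HBase setting, because the ZooKeeper server caps it with its own maxSessionTimeout.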
2. Setting the Full GC issue aside, a stranger problem appeared: shortly after the regionserver was started again, it died again, and then the other regionservers began to die one after another.
2018-05-31 12:20:12,923 ERROR [RS_OPEN_REGION-regionserver1:16020-2] coprocessor.CoprocessorHost: The coprocessor org.apache.dfs.storage.hbase.cube.v1.coprocessor.observer.AggregateRegionObserver threw org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87)
at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1774)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1313)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:3856)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:1006)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:843)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
2018-05-31 12:20:12,923 ERROR [RS_OPEN_REGION-regionserver1:16020-2] regionserver.HRegionServer: ABORTING region server regionserver1.bigdata.com,16020,1527740222472: The coprocessor org.apache.dfs.storage.hbase.cube.v1.coprocessor.observer.AggregateRegionObserver threw org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87)
at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1774)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1313)
3. Here is the log from another regionserver. You can again see "Operation category READ is not supported in state standby", followed by "ABORTING region server".
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87)
at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1774)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1313)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:3856)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:1006)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:843)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
2018-05-31 12:20:... ERROR [RS_OPEN_REGION-regionserver2:16020-2] regionserver.HRegionServer: ABORTING region server regionserver2.bigdata.com,16020,1519906845598: The coprocessor
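The StandbyException above simply means that the request was sent to a NameNode that is currently in standby state. If you want to confirm what the HA state looks like at that moment, something like the following shell sketch works; the nameservice and NameNode IDs (dfs, nn1, nn2) are assumptions, so substitute the values from your hdfs-site.xml.

hdfs getconf -confKey dfs.nameservices        # e.g. "dfs"
hdfs getconf -confKey dfs.ha.namenodes.dfs    # e.g. "nn1,nn2"
hdfs haadmin -getServiceState nn1             # prints "active" or "standby"
hdfs haadmin -getServiceState nn2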
4. Why did this happen? It felt very strange.
After some searching, and with a pointer from a more experienced colleague, the cause was finally found: some of the HBase tables use the coprocessor feature. But that by itself is not the key point.
The table descriptor looks like this. Note this part: hdfs://dfs/dfs/dfs_metadata/coprocessor/* — the coprocessor jar is referenced through the HA nameservice, which is fine.
'dfs_ZZQSWFZ4VW', {TABLE_ATTRIBUTES => {coprocessor$1 => 'hdfs://dfs/dfs/dfs_metadata/coprocessor/dfs-coprocessor-1.5.4.1-0.jar|org.apache.dfs.storage.hbase.cube.v2.coprocessor.endpoint.CubeVisitService|1001|', coprocessor$2 => 'hdfs://dfs/dfs/dfs_metadata/coprocessor/dfs-coprocessor-1.5.4.1-0.jar|org.apache.dfs.storage.hbase.cube.v1.coprocessor.observer.AggregateRegionObserver|1002|', METADATA => {'CREATION_TIME' => '1478338090339', 'GIT_COMMIT' => 'c4e31c1b3a664f598352061ae8703812e9d9cef7', 'dfs_HOST' => 'dfs_metadata', 'OWNER' => 'xxxx.owner@bigdata.com', 'SEGMENT' => 'WindGreenwichOffline_de[20161105070130_20161105092132]', 'SPLIT_POLICY' => 'org.apache.hadoop.hbase.regionserver.DisabledRegionSplitPolicy', 'USER' => 'ADMIN'}}, {NAME => 'F1', DATA_BLOCK_ENCODING => 'FAST_DIFF', BLOOMFILTER => 'NONE', COMPRESSION => 'SNAPPY'}
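The descriptor above is simply what hbase shell prints for the table. If you want to check the coprocessor attributes of a table yourself, a describe is enough; a minimal shell sketch, with the table name from this article (replace it with your own):

echo "describe 'dfs_ZZQSWFZ4VW'" | hbase shell 2>/dev/null | grep 'coprocessor'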
5. The key point: the HDFS NameNode now runs in HA mode, but this cluster originally had a single NameNode, and some HBase tables created back then with coprocessors turned out to be the trigger of the problem.
Those coprocessor definitions reference the jar through the old NameNode's address, so they can only be loaded while that original NameNode happens to be active. Once an active/standby switch happens, these tables can no longer be loaded or accessed, and in the end the whole regionserver aborts. A sketch of a possible fix follows the table descriptor below.
The descriptor of such a table looks like this. Note this part: hdfs://old_namenode_host.bigdata.com:9000/dfs/dfs_metadata/* — a specific hostname is used here.
'dfs_1KT8V5FL1C', {TABLE_ATTRIBUTES => {coprocessor$1 => '|org.apache.hadoop.hbase.security.access.SecureBulkLoadEndpoint|107374663|', coprocessor$2 => 'hdfs://old_namenode_host.bigdata.com:9000/dfs/dfs_metadata/coprocessor/dfs-coprocessor-1.5.4.1-1.jar|org.apache.dfs.storage.hbase.cube.v2.coprocessor.endpoint.CubeVisitService|1001|', coprocessor$3 => 'hdfs://old_namenode_host.bigdata.com:9000/dfs/dfs_metadata/coprocessor/dfs-coprocessor-1.5.4.1-1.jar|org.apache.dfs.storage.hbase.cube.v1.coprocessor.observer.AggregateRegionObserver|1002|', METADATA => {'CREATION_TIME' => '1493632152602', 'GIT_COMMIT' => 'c4e31c1b3a664f598352061ae8703812e9d9cef7', 'dfs_HOST' => 'dfs_metadata', 'OWNER' => 'xxxx.owner@bigdata.com', 'SEGMENT' => 'WindCGN2_Clone[20160501000000_20170501000000]', 'SPLIT_POLICY' => 'org.apache.hadoop.hbase.regionserver.DisabledRegionSplitPolicy', 'USER' => 'ADMIN'}}, {NAME => 'F1', DATA_BLOCK_ENCODING => 'FAST_DIFF', BLOOMFILTER => 'NONE', COMPRESSION => 'SNAPPY'}
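One way to repair such a table is to repoint its coprocessor attributes at the HA nameservice (hdfs://dfs/...) instead of the retired hostname. The hbase shell sketch below is only an illustration using the standard table_att_unset / alter commands; the attribute numbers, jar path and priorities are copied from the descriptor above, and it assumes the same jar actually exists under the nameservice path (verify with hdfs dfs -ls before trying this).

hbase shell <<'EOF'
disable 'dfs_1KT8V5FL1C'
# drop the coprocessor entries that carry the hard-coded NameNode hostname
alter 'dfs_1KT8V5FL1C', METHOD => 'table_att_unset', NAME => 'coprocessor$2'
alter 'dfs_1KT8V5FL1C', METHOD => 'table_att_unset', NAME => 'coprocessor$3'
# re-add them through the nameservice, keeping the same class and priority
alter 'dfs_1KT8V5FL1C', 'coprocessor' => 'hdfs://dfs/dfs/dfs_metadata/coprocessor/dfs-coprocessor-1.5.4.1-1.jar|org.apache.dfs.storage.hbase.cube.v2.coprocessor.endpoint.CubeVisitService|1001|'
alter 'dfs_1KT8V5FL1C', 'coprocessor' => 'hdfs://dfs/dfs/dfs_metadata/coprocessor/dfs-coprocessor-1.5.4.1-1.jar|org.apache.dfs.storage.hbase.cube.v1.coprocessor.observer.AggregateRegionObserver|1002|'
enable 'dfs_1KT8V5FL1C'
describe 'dfs_1KT8V5FL1C'
EOF

It is also worth grepping the descriptors of every table for the old hostname, since any table still carrying it will abort whichever regionserver tries to open its regions.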
Finally, why did the other regionservers die one after another? Why did a NameNode failover that happened long ago, plus one regionserver dying from GC, end up in a chain reaction that took the regionservers down one by one?
1. The GC pause killed the first regionserver, and the master then had to reassign its regions to other regionservers. Those regionservers could not open the affected regions, hitting the same error, so a second regionserver aborted, then a third, and so on; in theory every regionserver would eventually be dead.
2. The regionservers had been started, and these regions opened, before the NameNode active/standby switch happened, so everything looked fine at the time. But once a regionserver was restarted after the switch, those regions could no longer be opened. (A possible mitigation is sketched below.)
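It is also worth knowing that HBase exposes hbase.coprocessor.abortonerror (default true). Setting it to false makes a regionserver unload a table coprocessor that fails to load instead of aborting, which would have broken this chain reaction, at the price of the coprocessor silently not running on some servers. A quick shell check of what the cluster currently uses (the config path is an assumption):

grep -B1 -A2 'hbase.coprocessor.abortonerror' /etc/hbase/conf/hbase-site.xml \
  || echo "not set explicitly; HBase defaults to true (abort on coprocessor error)"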
For more on the coprocessor feature, see:
http://www.zhyea.com/2017/04/13/using-hbase-coprocessor.html
https://www.3pillarglobal.com/insights/hbase-coprocessors