
How to solve the problem of RegionServer crashes in HBase when large volumes of data are written


This article explains how to solve the problem of RegionServer crashes in HBase when large volumes of data are written. The content is straightforward and easy to follow; please work through the analysis below.

The error message in the log is as follows:

WARN hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_xxxx java.net.SocketTimeoutException: 66000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel

In fact, this problem is not caused by the Replication feature, but by a client-side timeout during intensive data writes.

=== The following analysis comes from the network ===

Normally, the process for DFSClient to write block data is:

1. DFSClient side

A) DFSOutputStream is responsible for receiving and writing data. It receives data through the (synchronized) write method inherited from FSOutputSummer, while sync (whose main code is inside a synchronized (this) block) builds a packet from the buffer via flushBuffer and appends it to dataQueue through enqueuePacket.

B) DataStreamer (a daemon thread) in DFSOutputStream is responsible for sending the data to the DataNodes. Before sending, it checks whether there is any data in dataQueue and waits if there is none.

C) When DataStreamer sets up a pipeline to transmit the data, it also starts a ResponseProcessor (thread) for that pipeline to receive the acks from the DataNodes and to decide whether an error has occurred, whether recoverBlock is needed, and so on. (A simplified sketch of this client-side structure is given after point 3 below.)

2. DataNode side

A) During each packet transmission, the nodes of the established pipeline forward the data downstream in turn, and the acks are sent back upstream in turn.

B) On the last node of the pipeline (numTargets=0), PacketResponder keeps running the lastDataNodeRun() method, which means that, starting about 1/2 × dfs.socket.timeout after the last ack has been sent (ackQueue.size() = 0), it keeps sending heartbeats to the client along the pipeline.

3. HBase side

HBase writes data to HDFS through the writer in HLog and calls sync after every write. In addition, HLog has a logSyncer, which by default calls sync once per second, regardless of whether any data has been written.
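As a rough illustration of the client-side structure described in point 1 (and of the once-per-second sync habit from point 3), here is a simplified sketch in Java. It is not Hadoop source code: the class SimpleOutputStream and everything in it are stand-ins for DFSOutputStream, the synchronized write inherited from FSOutputSummer, sync with flushBuffer/enqueuePacket, and the DataStreamer daemon.

import java.util.ArrayDeque;
import java.util.Queue;

// Simplified stand-in for the DFSOutputStream / DataStreamer pair described in point 1.
// This is NOT Hadoop source code; it only illustrates the shared lock and the
// producer/consumer hand-off through dataQueue.
public class SimpleOutputStream {
    private final Queue<byte[]> dataQueue = new ArrayDeque<>();
    private byte[] buffer = new byte[0];

    // Stands in for FSOutputSummer.write(): synchronized, appends data to the current buffer.
    public synchronized void write(byte[] chunk) {
        byte[] merged = new byte[buffer.length + chunk.length];
        System.arraycopy(buffer, 0, merged, 0, buffer.length);
        System.arraycopy(chunk, 0, merged, buffer.length, chunk.length);
        buffer = merged;
    }

    // Stands in for sync(): takes the same lock; "flushBuffer" turns the buffer into a
    // packet and "enqueuePacket" puts it on dataQueue for the streamer thread.
    public synchronized void sync() {
        if (buffer.length > 0) {
            dataQueue.add(buffer);
            buffer = new byte[0];
        }
        notifyAll(); // wake the streamer
    }

    // Stands in for the DataStreamer daemon: waits until dataQueue has data, then "sends" it.
    public void startStreamer() {
        Thread streamer = new Thread(() -> {
            while (true) {
                byte[] packet;
                synchronized (this) {
                    while (dataQueue.isEmpty()) {
                        try { wait(); } catch (InterruptedException e) { return; }
                    }
                    packet = dataQueue.poll();
                }
                // In real HDFS the packet goes down the DataNode pipeline here, and a
                // ResponseProcessor thread waits for the ack coming back up the pipeline.
                System.out.println("sent packet of " + packet.length + " bytes");
            }
        });
        streamer.setDaemon(true);
        streamer.start();
    }

    public static void main(String[] args) throws Exception {
        SimpleOutputStream out = new SimpleOutputStream();
        out.startStreamer();
        out.write("row-1".getBytes()); // HLog's writer appends an edit ...
        out.sync();                    // ... and calls sync after every write
        Thread.sleep(200);             // in HBase, logSyncer would also call sync about once per second
    }
}

The point to notice is that write and sync are guarded by the same monitor, which becomes important in the analysis below.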

The immediate trigger of this problem is the timeout, so let's first analyze what happened on the DFSClient and the DataNode before and after the timeout.

1. Problem reproduction

A) The client-side ResponseProcessor reported a socket timeout (69 seconds); the error occurred in PipelineAck.readFields(). After the error, the exception is caught directly, hasError=true and closed=true are set, and the thread stops running.

B) DataStreamer, in its polling loop, calls processDatanodeError to handle hasError=true. At this point errorIndex=0 (the default value), so it first logs a "Recovery for Block" exception, then closes the blockstream and re-runs recoverBlock against the pipeline formed by the remaining two nodes.

C) On the DataNode side, the closing of the blockstream by processDatanodeError() causes the packetResponder in the pipeline to be interrupted and terminated.

D) Also on the DataNode side, the same close of the blockstream causes readToBuf in BlockReceiver's readNextPacket() to read no data and throw an EOFException. This exception propagates all the way up to the run method of DataXceiver, which causes DataXceiver to stop running and a DataNode.dnRegistration error to be logged.

E) recoverBlock proceeds normally and completes first on two nodes (the second and the third). The NameNode then finds that there are not enough replicas and issues a transfer-block command to a DataNode, which is an asynchronous process. However, when the hlog check runs, the transfer is likely still unfinished, so "pipeline error detected. Found 2 replicas but expecting 3 replicas" is reported and the hlog is closed.

The above is the error sequence that can be reconstructed from the logs.

2. Problem analysis

A) Why did the timeout occur, and why was no heartbeat received?

According to the above analysis, the 69-second socket timeout in ResponseProcessor is the cause of the subsequent series of exceptions and the hlog shutdown. So why did the socket time out? The ResponseProcessor should have received a heartbeat packet within 1/2 × dfs.socket.timeout.

After adding log output, we found that the dfs.socket.timeout configured on the DataNodes is 180 seconds, while HBase uses the default of 60 seconds when creating its DFSClient. The DFSClient therefore considers the timeout to be 3 × nodes.length + 60 = 69 seconds, while the DataNode side sends heartbeats at an interval of 1/2 × 180 seconds = 90 seconds. So if no data is being written, the DataNode only sends a heartbeat packet after 90 seconds, by which time the DFSClient has already hit its socket timeout, leading to the series of subsequent phenomena.
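To make the mismatch concrete, here is a tiny Java sketch that just redoes the arithmetic above. The 3-seconds-per-node extension and the half-timeout heartbeat rule are taken from this article, and the constants are the ones reported in the logs, not values read from a real cluster.

// Reproduces the timeout arithmetic described above. Illustration only.
public class TimeoutMismatch {
    public static void main(String[] args) {
        int clientSocketTimeoutSec = 60;    // dfs.socket.timeout default used by HBase's DFSClient
        int datanodeSocketTimeoutSec = 180; // dfs.socket.timeout configured on the DataNodes
        int pipelineNodes = 3;              // length of the write pipeline (replication factor)

        // Client side: the ResponseProcessor read timeout grows by about 3 s per pipeline node.
        int clientReadTimeoutSec = 3 * pipelineNodes + clientSocketTimeoutSec; // 69 s

        // DataNode side: the last PacketResponder sends a heartbeat after about half its timeout.
        int heartbeatIntervalSec = datanodeSocketTimeoutSec / 2;               // 90 s

        System.out.printf("client waits %d s, first heartbeat after %d s -> %s%n",
                clientReadTimeoutSec, heartbeatIntervalSec,
                heartbeatIntervalSec > clientReadTimeoutSec ? "SocketTimeoutException" : "ok");
    }
}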

B) Why was no new packet sent within 69 seconds?

Let's first look at the synchronization between DFSOutputStream's write and sync. DFSOutputStream inherits from FSOutputSummer and receives data through FSOutputSummer's write method, which is synchronized. The flushBuffer() and enqueuePacket() calls in the sync method are likewise inside a synchronized (this) block. That is, for a given DFSOutputStream, if sync and write are called concurrently, they wait on the same lock. In the HBase scenario the sync frequency is very high, so sync has a good chance of grabbing the lock; the result can be a continuous stream of sync and flushBuffer calls while write stays blocked and no new data gets in. This is why no packet was sent during the timeout window.
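The lock contention itself is easy to reproduce in isolation. The toy program below only illustrates the locking pattern (none of these names exist in HBase or HDFS): a thread that grabs the shared lock at a very high rate, as the frequent sync calls do, can make a concurrent write wait noticeably long, although the exact delay depends on scheduling.

// Toy demonstration of the contention described above; illustration only, not HBase/HDFS code.
public class SyncWriteContention {
    private final Object lock = new Object();

    void sync() {               // stands in for DFSOutputStream.sync()
        synchronized (lock) {
            busyWork();         // flushBuffer() + enqueuePacket()
        }
    }

    void write() {              // stands in for FSOutputSummer.write()
        synchronized (lock) {
            busyWork();
        }
    }

    private void busyWork() {
        long end = System.nanoTime() + 1_000_000; // hold the lock for about 1 ms
        while (System.nanoTime() < end) { /* spin */ }
    }

    public static void main(String[] args) throws InterruptedException {
        SyncWriteContention s = new SyncWriteContention();
        Thread syncer = new Thread(() -> { while (true) s.sync(); }); // very frequent sync
        syncer.setDaemon(true);
        syncer.start();

        Thread.sleep(100);                 // let the syncer saturate the lock
        long t0 = System.nanoTime();
        s.write();                         // measure how long one write waits for the lock
        System.out.printf("write waited about %.1f ms for the lock%n",
                (System.nanoTime() - t0) / 1e6);
    }
}

Since the intrinsic lock is not fair, there is no guarantee about which caller wins, which matches the observation that write can lose out to a rapid stream of sync calls.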

To sum up, given the way HBase calls sync and the synchronized blocks in DFSOutputStream, it is quite possible for no packet to be written for 69 seconds. Even so, that alone should not lead to a socket timeout: the socket timeout is the root cause of this problem, and the timeout in turn is caused by the inconsistent configuration.

3. Problem solving

On both the HDFS side and the HBase side, set dfs.socket.timeout to a larger value, such as 300000 (300 s). Note that the values set in the two places must be equal.
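For illustration only, a client-side override might look like the sketch below. The property name and value come from this article; in a real deployment you would put the same value into hdfs-site.xml on the HDFS side and hbase-site.xml on the HBase side and restart the services, rather than setting it in code.

import org.apache.hadoop.conf.Configuration;

// Sketch of raising dfs.socket.timeout on the client side. In practice the same value
// goes into hdfs-site.xml (DataNodes) and hbase-site.xml (RegionServers) so that both
// sides agree, as recommended above.
public class SocketTimeoutConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.setInt("dfs.socket.timeout", 300000); // 300 s, as suggested in this article
        System.out.println("dfs.socket.timeout = "
                + conf.getInt("dfs.socket.timeout", 60000) + " ms");
    }
}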

Thank you for reading. That concludes "how to solve the problem of RegionServer crashes in HBase when large volumes of data are written". After reading this article, you should have a deeper understanding of the problem; the specific settings still need to be verified in practice.
