How to design an elegant heartbeat mechanism

2025-02-24 Update From: SLTechnology News&Howtos


1 preface

In the previous article, "Talking about TCP Persistent Connections and Heartbeats", we covered TCP's KeepAlive and why an application-layer heartbeat matters, but did not go into the design of heartbeats for long-lived connections in detail. Designing a good heartbeat mechanism is in fact not easy: among the RPC frameworks I am familiar with, the heartbeat mechanisms differ considerably. In this article I will explore how to design an elegant heartbeat mechanism, mainly through Dubbo's existing scheme and an improved one.

2 preliminary knowledge

Since we will be working at the source-code level below, a few details of service-governance frameworks need explaining first to make the rest easier to follow.

2.1 how does the client know that the request failed?

Almost all high-performance RPC frameworks choose Netty for the communication layer, and the efficiency of non-blocking I/O needs no introduction here. But precisely because of the non-blocking model, sending and receiving data is asynchronous: when the server fails or the network breaks, the client simply never receives a response. So how do we decide that an RPC call has failed?

Myth 1: isn't a Dubbo call synchronous by default?

Dubbo is asynchronous at the communication layer; the illusion of synchrony comes from blocking internally to turn the asynchronous call into a synchronous one.

Myth 2: Channel.writeAndFlush returns a ChannelFuture, so I just check channelFuture.isSuccess() to know whether the request succeeded.

Note that a successful writeAndFlush does not mean the peer received the request. isSuccess() returning true only guarantees the data was written into the network buffer, not that it was delivered.

With those two misconceptions out of the way, back to the question in this section's title: how does the client know that a request failed? The correct answer is that the client judges failure from a failure response it receives. But wait — since the server will never send one in these cases, the client has to fabricate it itself.

A common design: when the client initiates an RPC request, it sets a timeout client_timeout and starts a timer that counts down client_timeout.

When a normal response arrives, the timer is removed.

If the timer finishes its countdown without being removed, the request is considered timed out, and a failure response is constructed and handed back to the client.

Timeout determination logic in Dubbo:

```java
public static DefaultFuture newFuture(Channel channel, Request request, int timeout) {
    final DefaultFuture future = new DefaultFuture(channel, request, timeout);
    // timeout check
    timeoutCheck(future);
    return future;
}

private static void timeoutCheck(DefaultFuture future) {
    TimeoutCheckTask task = new TimeoutCheckTask(future);
    TIME_OUT_TIMER.newTimeout(task, future.getTimeout(), TimeUnit.MILLISECONDS);
}

private static class TimeoutCheckTask implements TimerTask {

    private DefaultFuture future;

    TimeoutCheckTask(DefaultFuture future) {
        this.future = future;
    }

    @Override
    public void run(Timeout timeout) {
        if (future == null || future.isDone()) {
            return;
        }
        // create exception response.
        Response timeoutResponse = new Response(future.getId());
        // set timeout status.
        timeoutResponse.setStatus(future.isSent() ? Response.SERVER_TIMEOUT : Response.CLIENT_TIMEOUT);
        timeoutResponse.setErrorMessage(future.getTimeoutMessage(true));
        // handle response.
        DefaultFuture.received(future.getChannel(), timeoutResponse);
    }
}
```

The main classes involved are DubboInvoker, HeaderExchangeChannel, and DefaultFuture. The code above reveals a detail: every call, of whatever kind, is watched by this timer, and a timed-out call produces a fabricated failure response. So the failure of an RPC call is always established by a failure response received on the client side.
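The race between a real response and the timeout timer can be sketched with plain JDK primitives. This is a minimal illustration of the idea, not Dubbo's actual code; the class, field, and status names below are invented:

```java
import java.util.concurrent.*;

// Minimal sketch of client-side timeout handling: if no response arrives
// within clientTimeout, a fabricated failure response completes the future.
// All names here are illustrative, not Dubbo's API.
class PendingRequest {
    private static final ScheduledExecutorService TIMER =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "timeout-timer");
                t.setDaemon(true);
                return t;
            });

    final CompletableFuture<String> future = new CompletableFuture<>();

    PendingRequest(long clientTimeoutMillis) {
        // Start the countdown at the moment the request is sent.
        TIMER.schedule(() -> {
            // If no normal response won the race, fabricate a failure response.
            future.complete("CLIENT_TIMEOUT");
        }, clientTimeoutMillis, TimeUnit.MILLISECONDS);
    }

    void onResponse(String response) {
        // A normal response "removes the timer" by winning the race:
        // only the first complete() call takes effect.
        future.complete(response);
    }
}
```

Whichever side completes the future first wins, which is exactly why the caller always ends up with some response, real or fabricated.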

2.2 heartbeat detection requires fault tolerance

Network programming should always assume the worst case. A single failed heartbeat must not be treated as a broken connection; only after several consecutive heartbeat failures should countermeasures be taken.

2.3 heartbeat detection does not require busy detection

The opposite of busy detection is idle detection. The original purpose of a heartbeat is to confirm the connection is alive, so that measures such as disconnecting and reconnecting can be taken in time. If a channel is already busy with frequent RPC calls, we should not burden it with extra heartbeats. A heartbeat should hand out umbrellas on rainy days, not collect them on sunny ones: it works only when the connection is otherwise quiet.

3 existing schemes for Dubbo

The source code in this article corresponds to Dubbo 2.7.x, whose heartbeat mechanism was enhanced during the project's incubation at Apache.

After those basic concepts, let's look at how Dubbo designs its application-layer heartbeat. Dubbo's heartbeat is bidirectional: the client sends heartbeats to the server, and the server likewise sends heartbeats to the client.

3.1 create the timer when the connection is established

```java
public class HeaderExchangeClient implements ExchangeClient {

    private int heartbeat;
    private int heartbeatTimeout;
    private HashedWheelTimer heartbeatTimer;

    public HeaderExchangeClient(Client client, boolean needHeartbeat) {
        this.client = client;
        this.channel = new HeaderExchangeChannel(client);
        this.heartbeat = client.getUrl().getParameter(Constants.HEARTBEAT_KEY,
                dubbo != null && dubbo.startsWith("1.0.") ? Constants.DEFAULT_HEARTBEAT : 0);
        this.heartbeatTimeout = client.getUrl().getParameter(Constants.HEARTBEAT_TIMEOUT_KEY, heartbeat * 3);
        if (needHeartbeat) {
            long tickDuration = calculateLeastDuration(heartbeat);
            heartbeatTimer = new HashedWheelTimer(new NamedThreadFactory("dubbo-client-heartbeat", true),
                    tickDuration, TimeUnit.MILLISECONDS, Constants.TICKS_PER_WHEEL);
            startHeartbeatTimer();
        }
    }
}
```

Not only does the client (HeaderExchangeClient) start a timer; the server (HeaderExchangeServer) starts one too. Since the server logic is almost identical to the client's, I will not paste the server code again below.

Early Dubbo versions used a schedule-based scheme, which was replaced with HashedWheelTimer in 2.7.x.

3.2 start two scheduled tasks

```java
private void startHeartbeatTimer() {
    long heartbeatTick = calculateLeastDuration(heartbeat);
    long heartbeatTimeoutTick = calculateLeastDuration(heartbeatTimeout);
    HeartbeatTimerTask heartBeatTimerTask = new HeartbeatTimerTask(cp, heartbeatTick, heartbeat);
    ReconnectTimerTask reconnectTimerTask = new ReconnectTimerTask(cp, heartbeatTimeoutTick, heartbeatTimeout);
    heartbeatTimer.newTimeout(heartBeatTimerTask, heartbeatTick, TimeUnit.MILLISECONDS);
    heartbeatTimer.newTimeout(reconnectTimerTask, heartbeatTimeoutTick, TimeUnit.MILLISECONDS);
}
```

In startHeartbeatTimer, Dubbo mainly schedules two timer tasks: HeartbeatTimerTask and ReconnectTimerTask.

The remaining code in this method is also important to our analysis; I will keep you in suspense for now and come back to it later.

3.3 scheduled Task 1: send a heartbeat request

Let's analyze the logic of the heartbeat task, HeartbeatTimerTask#doTask, in detail:

```java
@Override
protected void doTask(Channel channel) {
    Long lastRead = lastRead(channel);
    Long lastWrite = lastWrite(channel);
    if ((lastRead != null && now() - lastRead > heartbeat)
            || (lastWrite != null && now() - lastWrite > heartbeat)) {
        Request req = new Request();
        req.setVersion(Version.getProtocolVersion());
        req.setTwoWay(true);
        req.setEvent(Request.HEARTBEAT_EVENT);
        channel.send(req);
    }
}
```

As mentioned, Dubbo's heartbeat is bidirectional: the server sends heartbeats to the client and the client sends heartbeats to the server. The receiver of any message updates lastRead and the sender updates lastWrite; once more than a heartbeat period has elapsed, a heartbeat request is sent to the peer. By tracking these two fields, heartbeats are only sent while the connection is idle, in line with the principle from the preliminaries.

Note: not only heartbeat requests update lastRead and lastWrite — ordinary requests on the same channel do too. This corresponds to the idle-detection principle in our preliminary knowledge.
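The lastRead/lastWrite bookkeeping can be sketched in isolation. This is an illustrative model of the check inside doTask, not Dubbo's code; the class and method names are invented, and the or-condition mirrors the snippet above:

```java
// Sketch of idle detection: a heartbeat is due once either direction of the
// channel has been quiet for a full heartbeat period. Names are illustrative.
class IdleChecker {
    private final long heartbeatMillis;
    private volatile long lastRead;
    private volatile long lastWrite;

    IdleChecker(long heartbeatMillis, long now) {
        this.heartbeatMillis = heartbeatMillis;
        this.lastRead = now;
        this.lastWrite = now;
    }

    // Called for every inbound message, heartbeat or ordinary request.
    void onRead(long now)  { lastRead = now; }

    // Called for every outbound message, heartbeat or ordinary request.
    void onWrite(long now) { lastWrite = now; }

    // Mirrors the doTask condition: fire when either timestamp is stale.
    boolean shouldSendHeartbeat(long now) {
        return now - lastRead > heartbeatMillis
            || now - lastWrite > heartbeatMillis;
    }
}
```

Because ordinary traffic also calls onRead/onWrite, a busy channel never triggers a heartbeat, which is exactly the "idle detection, not busy detection" behavior.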

3.4 scheduled task 2: deal with reconnection and disconnection

Next, let's look at what the reconnect/disconnect task, ReconnectTimerTask#doTask, implements.

```java
@Override
protected void doTask(Channel channel) {
    Long lastRead = lastRead(channel);
    Long now = now();
    if (lastRead != null && now - lastRead > heartbeatTimeout) {
        if (channel instanceof Client) {
            ((Client) channel).reconnect();
        } else {
            channel.close();
        }
    }
}
```

The second task handles the connection differently depending on whether the endpoint is a client or a server: once the heartbeat timeout is exceeded, a client chooses to reconnect, while a server simply disconnects. This is reasonable: client calls strongly depend on a usable connection, whereas the server can afford to wait for the client to re-establish it.

Careful readers will notice that the name ReconnectTimerTask is not quite accurate, since it handles both reconnection and disconnection.

3.5 inaccurate timing

Someone once reported in a Dubbo issue that the timing was inaccurate. Let's see what was going on.

The default heartbeat period in Dubbo is 60s. Imagine the following sequence:

At second 0, heartbeat detection finds the connection active.

At second 1, the connection actually breaks.

At second 60, heartbeat detection finds the connection inactive.

Because of this time window, a dead link cannot be detected promptly; in the worst case the detection delay is a full heartbeat period.

To solve this problem, let's go back to the startHeartbeatTimer() method above:

```java
long heartbeatTick = calculateLeastDuration(heartbeat);
long heartbeatTimeoutTick = calculateLeastDuration(heartbeatTimeout);
```

calculateLeastDuration computes a tick from the heartbeat interval and the timeout: in effect it divides each value by 3, and the shrunken values are passed on as the scheduling delay for the HashedWheelTimer tasks.

```java
heartbeatTimer.newTimeout(heartBeatTimerTask, heartbeatTick, TimeUnit.MILLISECONDS);
heartbeatTimer.newTimeout(reconnectTimerTask, heartbeatTimeoutTick, TimeUnit.MILLISECONDS);
```

The tick is the frequency at which the scheduled task runs. By shrinking the detection interval, the chance of catching a dead link promptly goes up: what used to be a 60s worst case becomes 20s. The frequency could be raised further, but resource consumption has to be weighed against it.

Both of Dubbo's scheduled tasks suffer from this timing inaccuracy, so both apply the tick trick. In fact, any polling-based detection logic has the same problem.
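The tick calculation can be sketched as follows. The constant names and the lower bound here are illustrative (Dubbo's own values may differ), but the idea — divide the detection period by 3, with a sane floor — is the one described above:

```java
// Sketch of the tick calculation: run the check at one third of the
// detection period, with a floor so the tick never degenerates to zero.
// Constant values are illustrative.
class HeartbeatTick {
    static final int CHECK_TICK = 3;            // checks per detection period
    static final long LEAST_DURATION_MS = 1000; // lower bound for the tick

    static long calculateLeastDuration(int timeMs) {
        long tick = timeMs / CHECK_TICK;
        return tick <= 0 ? LEAST_DURATION_MS : tick;
    }
}
```

With the default 60s heartbeat this yields a 20s tick, matching the worst-case figure quoted above.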

3.6 Dubbo heartbeat summary

For every connection it establishes, Dubbo starts two timers on both the client side and the server side: one that periodically sends heartbeats and one that periodically handles reconnection and disconnection, each running at one third of its detection period. The heartbeat task sends a heartbeat to the peer when the connection is idle. The reconnect/disconnect task checks whether lastRead has gone un-updated for the whole timeout period; if so, the client reconnects while the server disconnects.

Let's not rush to judge whether this scheme is good; first, let's look at how an improved scheme is designed.

4 Dubbo improvement scheme

We can actually implement the heartbeat mechanism more elegantly. In this section I will introduce a different heartbeat mechanism.

4.1 IdleStateHandler introduction

Netty has built-in support for idle-connection detection; with IdleStateHandler the idle-detection logic is easy to implement.

```java
public IdleStateHandler(long readerIdleTime, long writerIdleTime, long allIdleTime, TimeUnit unit) {}
```

readerIdleTime: read timeout

writerIdleTime: write timeout

allIdleTime: timeout for all types of traffic

Based on the configured timeouts, IdleStateHandler periodically checks how long it has been since channelRead or write was last called. Once an IdleStateHandler is added to a pipeline, any handler in that pipeline can catch the IdleStateEvent in its userEventTriggered method:

```java
@Override
public void userEventTriggered(ChannelHandlerContext ctx, Object evt) throws Exception {
    if (evt instanceof IdleStateEvent) {
        // do something
    }
    ctx.fireUserEventTriggered(evt);
}
```

Why introduce IdleStateHandler? Idle detection plus timing — isn't that a natural fit for a heartbeat mechanism? Many service-governance frameworks do choose IdleStateHandler to implement their heartbeats.

Internally, IdleStateHandler uses eventLoop.schedule(task) for its scheduled tasks. Running them on the eventLoop thread has the added benefit of guaranteeing thread safety — a nice small detail.

4.2 client and server configuration

The first step is to add IdleStateHandler to the pipeline.

Client:

```java
bootstrap.handler(new ChannelInitializer<NioSocketChannel>() {
    @Override
    protected void initChannel(NioSocketChannel ch) throws Exception {
        ch.pipeline().addLast("clientIdleHandler", new IdleStateHandler(60, 0, 0));
    }
});
```

Server:

```java
serverBootstrap.childHandler(new ChannelInitializer<NioSocketChannel>() {
    @Override
    protected void initChannel(NioSocketChannel ch) throws Exception {
        ch.pipeline().addLast("serverIdleHandler", new IdleStateHandler(0, 0, 200));
    }
});
```

The client is configured with a 60s read timeout, and the server with a 200s all-idle (read and write) timeout. Two questions to keep in mind:

Why is the timeout configured by the client and the server inconsistent?

Why does the client detect a read timeout while the server detects a read and write timeout?

4.3 Idle timeout logic - client

The idle-timeout handling differs between client and server. First, the client:

```java
@Override
public void userEventTriggered(ChannelHandlerContext ctx, Object evt) throws Exception {
    if (evt instanceof IdleStateEvent) {
        // send heartbeat
        sendHeartBeat();
    } else {
        super.userEventTriggered(ctx, evt);
    }
}
```

When an idle timeout is detected, the client's action is to send a heartbeat packet to the server. How exactly is it sent, and how is the response handled? The pseudo-code is as follows:

```java
public void sendHeartBeat() {
    Invocation invocation = new Invocation();
    invocation.setInvocationType(InvocationType.HEART_BEAT);
    channel.writeAndFlush(invocation).addListener(new CallbackFuture() {
        @Override
        public void callback(Future future) {
            RPCResult result = future.get();
            // timeout or write failure
            if (result.isError()) {
                channel.addFailedHeartBeatTimes();
                if (channel.getFailedHeartBeatTimes() >= channel.getMaxHeartBeatFailedTimes()) {
                    channel.reconnect();
                }
            } else {
                channel.clearHeartBeatFailedTimes();
            }
        }
    });
}
```

The behavior is not complicated: build a heartbeat packet, send it to the server, and handle the result.

On a successful response, clear the failure counter.

On a failed response, increment the heartbeat-failure counter; once it exceeds the configured number of failures, reconnect.

Not only heartbeats: a successful response to an ordinary request also clears the counter.
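The failure counting described above can be sketched as a small class. The names are invented and the threshold is a parameter; this is a model of the scheme, not the framework's code:

```java
// Sketch of the client-side failure counter from the improved scheme:
// n consecutive heartbeat failures trigger a reconnect, and any successful
// response (heartbeat or ordinary request) clears the counter.
class HeartbeatFailureCounter {
    private final int maxFailures;
    private int failures;

    HeartbeatFailureCounter(int maxFailures) {
        this.maxFailures = maxFailures;
    }

    // Record one failed heartbeat; returns true when the caller should reconnect.
    synchronized boolean onFailure() {
        failures++;
        return failures >= maxFailures;
    }

    // Any successful response clears the mark.
    synchronized void onSuccess() {
        failures = 0;
    }
}
```

Keeping the counter per connection and clearing it on every successful response is what makes a single lost heartbeat harmless, which is the fault tolerance called for in section 2.2.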

4.4 Idle timeout logic - server

```java
@Override
public void userEventTriggered(ChannelHandlerContext ctx, Object evt) throws Exception {
    if (evt instanceof IdleStateEvent) {
        channel.close();
    } else {
        super.userEventTriggered(ctx, evt);
    }
}
```

The server handles idle connections in a very simple, blunt way: it closes the connection outright.

4.5 heartbeat summary of the improvement scheme

Why is the timeout configured by the client and the server inconsistent?

Because the client has retry logic: it only treats the connection as dead after n failed heartbeats, whereas the server disconnects outright, so the server is given a longer window — hence 60 * 3 < 200. It also shows that while both sides are capable of disconnecting, the connection is created at the client's initiative, so the client holds the right to disconnect proactively.

Why does the client detect a read timeout while the server detects a read and write timeout?

This is actually a common understanding in heartbeat design. Think it through: the timing is initiated by the client, so the only points in the whole link that can stall are: the server receiving, the server sending, and the client receiving. In other words, only the client's reads (the pong coming back) and the server's reads and writes (the ping in, the pong out) are meaningful; the client's own writes prove nothing.

You were the one who pursued them in the first place, so you are the one who gets to break it off.

Implementing the heartbeat mechanism with IdleStateHandler is quite elegant. Relying on Netty's idle detection, the client maintains a one-way heartbeat and, after three consecutive failed heartbeat responses, disconnects and reconnects on an asynchronous thread — reconnection is thus a client-side responsibility. The server, once the connection has been idle long enough, disconnects proactively to avoid wasting resources.

5 comparison of heartbeat design schemes

I privately asked Yu Chao (Flash), who is in charge of persistent connections at Meituan-Dianping: the heartbeat scheme they use is almost identical to the improved scheme above, which suggests this scheme is a fairly standard design.

6 suggestions for actual changes to Dubbo

Given that Dubbo also carries other communication-layer implementations, the existing logic of periodically sending heartbeats can be retained.

It is suggested to change point 1:

The bidirectional heartbeat design is unnecessary. Staying compatible with the existing logic, let the client send a one-way heartbeat when the connection is idle, and let the server periodically check the connection's availability, choosing the periods so that, as far as possible, client timeout * 3 ≈ server timeout.

It is suggested to change point 2:

Drop the scheduled task that handles reconnection and disconnection. Dubbo can already determine that the response to a heartbeat request failed, so, borrowing from the improved scheme, keep a per-connection mark of the number of heartbeat failures and clear it on any successful response; after n consecutive failures, the client initiates a reconnect. This removes an unnecessary timer — polling is never elegant.

Finally, a word on extensibility. I would actually suggest handing the timing down to the lower-level Netty, that is, relying entirely on IdleStateHandler, with the other communication-layer components each implementing their own idle detection — but Dubbo's compatibility with mina and grizzly ties my hands. Then again, how many people are still using mina or grizzly in 2019? Constraining the optimization of the mainstream path for the sake of features that will rarely be used is certainly not a good trade. Abstraction, features, and extensibility are not the-more-the-better: an open-source project's manpower is limited, and so is its users' capacity to understand it; a design that solves most people's problems is a good design. Sorry, mina and grizzly.
