Example Analysis of Graceful Service Offline under the Serverless Architecture


This article shares an example analysis of graceful service offline under the Serverless architecture. The editor finds it quite practical and shares it here, in the hope that you will take something away after reading it.

Application releases and service upgrades have always been both exciting and worrying for developers and operations engineers.

The excitement comes from launching new features: the product can deliver more capability and value to users. The worry is that something unexpected during the rollout will affect business stability. Indeed, problems are especially likely to occur while an application is being released or a service upgraded. In this article, we discuss how to take services offline gracefully during a release under the Serverless architecture, using the Serverless App Engine (hereinafter SAE) as the example.

In a typical release process, have you run into problems like these:

In-flight requests are interrupted during the release.

A downstream service node has already gone offline, but upstream services keep calling it, causing request errors and, in turn, business exceptions.

The release introduces data inconsistencies, and dirty data has to be repaired afterwards.

Sometimes we schedule the release for two or three o'clock in the morning, when traffic is low: nerve-racking, short on sleep, and miserable. So how do we solve these problems, and how do we make the release process stable and efficient with no loss to the business? First, let's sort out the causes of these problems.

Scenario analysis

Consider a common scenario in which we develop applications with a microservice architecture. Let's first look at the service invocation relationships in this scenario:

Services B and C register themselves with the registry, and services A and B discover the services they need to call from the registry (a minimal registration sketch follows this overview).

Business traffic reaches service A through the load balancer (SLB), and health checks for service A's instances are configured on the SLB; when an instance of service A goes down, it is removed from the SLB. Service A calls service B, and service B calls service C.

There are two types of traffic here: north-south traffic (business traffic forwarded to back-end servers through the SLB, e.g. the business traffic -> SLB -> A call path) and east-west traffic (calls made through registry-based service discovery, e.g. the A -> B call path). The two are analyzed below.
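To make the registration step concrete, here is a minimal sketch using the Nacos Java client (Nacos is one of the registries mentioned later in this article); the service name, registry address, IP, and port are placeholders, not values from the article.

```java
import com.alibaba.nacos.api.exception.NacosException;
import com.alibaba.nacos.api.naming.NamingFactory;
import com.alibaba.nacos.api.naming.NamingService;

public class ProviderRegistration {
    public static void main(String[] args) throws NacosException {
        // Connect to the registry; the address is a placeholder.
        NamingService naming = NamingFactory.createNamingService("127.0.0.1:8848");

        // Register this instance of service B so that consumers (service A)
        // can discover it through the registry.
        naming.registerInstance("service-b", "192.168.0.10", 8080);
    }
}
```

In practice Spring Cloud or Dubbo does this registration for you at startup; the sketch only makes explicit what "register with the registry" means for the discussion that follows.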

Problems with north-south traffic

When service A is released and instance A1 is stopped, the SLB only notices that A1 is offline through its health check and then removes the instance. Relying on the SLB health check to remove A1 usually takes several seconds to more than ten seconds. If the SLB keeps receiving traffic during this window, some requests continue to be routed to instance A1 and fail.

So during service A's release, how do we make sure that traffic going through the SLB does not hit errors? Let's look at how SAE handles it.

Graceful upgrade scheme for north-south traffic

As noted above, requests fail because the back-end instance is stopped before it is removed from the SLB. Can we remove the instance from the SLB first, and only then upgrade it?

Following this idea, SAE provides a solution based on the Kubernetes (K8s) Service capability. When a user binds an SLB to an application through SAE, SAE creates a Service resource in the cluster and associates the application instances with it. The CCM component is responsible for purchasing the SLB, creating the SLB virtual server group, and adding the ENI network interfaces of the application instances to that group; users then access the application instances through the SLB. When the application is released, CCM first removes the ENI of the instance being upgraded from the virtual server group, and only then upgrades the instance, so no traffic is lost.
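SAE and CCM perform these steps inside the platform, so there is nothing the application itself has to code. The sketch below only illustrates the ordering; SlbClient, removeBackend, and the Instance methods are hypothetical placeholders, not real Alibaba Cloud SDK calls.

```java
/**
 * Illustrative sketch of the "remove from SLB first, then upgrade" ordering.
 * All types and methods here are hypothetical; SAE/CCM performs the real work.
 */
public class NorthSouthUpgrade {

    interface SlbClient {
        void removeBackend(String vServerGroupId, String eniId);
    }

    interface Instance {
        String eniId();
        void waitForInFlightRequests();  // let requests already routed here finish
        void stop();                     // stop the old version
        void startNewVersion();          // start the upgraded version
    }

    static void upgrade(SlbClient slb, String vServerGroupId, Instance instance) {
        // 1. Take the instance's ENI out of the SLB virtual server group first,
        //    so no new north-south traffic is routed to it.
        slb.removeBackend(vServerGroupId, instance.eniId());

        // 2. Drain requests that are already in flight.
        instance.waitForInFlightRequests();

        // 3. Only then stop the old version and bring up the new one.
        instance.stop();
        instance.startNewVersion();
    }
}
```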

This is how SAE protects north-south traffic during an application upgrade.

Problems with east-west traffic

Having covered the solution for north-south traffic, let's look at east-west traffic. In a traditional release, the service provider is stopped and restarted, and the service consumer perceives the provider node going offline as follows:

1. Before the release, the consumer calls the service provider according to its load-balancing rules, and the business runs normally.

2. Service provider B needs to release a new version. It starts with one of its nodes and first stops the Java process.

3. Stopping the service involves either active deregistration or passive deregistration. Active deregistration is near real-time; passive deregistration time depends on the registry and can take up to 1 minute in the worst case.

1) If the application is stopped normally, the shutdown hooks of the Spring Cloud and Dubbo frameworks run as expected, and the time taken by this step is negligible (a minimal sketch of such a hook follows this list).

2) If the application is stopped abnormally, for example with kill -9, or if the Docker image was built so that the Java application is not PID 1 and the termination signal never reaches it, then the provider does not actively deregister its node; the registry removes it passively only after the heartbeat times out, some time later.

4. The registry notifies consumers that one of the provider's nodes has gone offline, either by push or by polling. Push is near real-time; polling depends on the consumer's polling interval and can take up to 1 minute in the worst case.

5. The consumer refreshes its service list and notices that the provider has taken a node offline. This step does not exist for Dubbo, but Ribbon, Spring Cloud's load-balancing component, refreshes every 30 seconds by default, so in the worst case it takes 30 seconds.

6. The consumer stops calling the node that has gone offline.

From step 2 to step 6, Eureka takes up to 2 minutes in the worst case and Nacos up to 50 seconds. During this window requests can fail, so all kinds of errors show up during the release, the user experience suffers, and dirty data has to be repaired afterwards. In the end every version has to be released at two or three o'clock in the morning: nerve-racking, short on sleep, and miserable.
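As a concrete illustration of the shutdown hook in step 3.1, here is a minimal sketch using a plain JDK Runtime shutdown hook together with the Nacos client; Spring Cloud and Dubbo register an equivalent hook for you, and the service name and addresses are placeholders.

```java
import com.alibaba.nacos.api.exception.NacosException;
import com.alibaba.nacos.api.naming.NamingFactory;
import com.alibaba.nacos.api.naming.NamingService;

public class GracefulShutdown {
    public static void main(String[] args) throws NacosException {
        NamingService naming = NamingFactory.createNamingService("127.0.0.1:8848");
        naming.registerInstance("service-b", "192.168.0.10", 8080);

        // Active deregistration: when the JVM receives a normal stop signal,
        // this hook runs and removes the instance from the registry immediately.
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            try {
                naming.deregisterInstance("service-b", "192.168.0.10", 8080);
            } catch (NacosException e) {
                e.printStackTrace();
            }
        }));
    }
}
```

This also makes the failure mode in step 3.2 concrete: with kill -9, or when the signal never reaches the JVM because it is not PID 1 in the container, the hook never runs, and deregistration falls back to the registry's heartbeat timeout.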

Graceful upgrade scheme for east-west traffic

The analysis above shows that in a traditional release, the client goes through a window of call errors because it does not learn in time that a server instance has gone offline. Traditionally, the provider list is updated by notifying consumers through the registry. Could the provider notify consumers directly, bypassing the registry? The answer is yes. We mainly did two things:

The service provider application actively deregisters itself from the registry before the release and marks itself as offline; deregistration, which originally happened during the stop-process phase, is moved to the preStop phase.
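A minimal sketch of what moving deregistration into the preStop phase could look like, assuming a plain JDK HTTP endpoint that the platform's preStop hook calls; the /prestop path, the port, and the Nacos addresses are assumptions made for illustration, not SAE's actual mechanism.

```java
import com.alibaba.nacos.api.naming.NamingFactory;
import com.alibaba.nacos.api.naming.NamingService;
import com.sun.net.httpserver.HttpServer;
import java.net.InetSocketAddress;

public class PreStopEndpoint {
    public static void main(String[] args) throws Exception {
        NamingService naming = NamingFactory.createNamingService("127.0.0.1:8848");
        naming.registerInstance("service-b", "192.168.0.10", 8080);

        // Hypothetical /prestop endpoint: the platform's preStop hook calls it
        // before the container is stopped, so the instance is already out of the
        // registry (and marked offline) by the time the process is signalled.
        HttpServer server = HttpServer.create(new InetSocketAddress(8081), 0);
        server.createContext("/prestop", exchange -> {
            try {
                naming.deregisterInstance("service-b", "192.168.0.10", 8080);
            } catch (Exception e) {
                e.printStackTrace();
            }
            exchange.sendResponseHeaders(200, -1);
            exchange.close();
        });
        server.start();
    }
}
```

The point of doing this in preStop is that deregistration completes before the stop signal arrives, so no new east-west traffic is being routed to the instance by the time the JVM shuts down.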

When the provider receives a request from a consumer, it still processes the call normally, but it also notifies the consumer that this node has gone offline; the consumer immediately removes the node from its invocation list and no longer calls it. What used to depend on a push from the registry now happens directly: the provider tells the consumer to remove it from the call list.
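Below is a heavily simplified consumer-side sketch of this direct notification. The X-Offline response header and the in-memory node list are purely hypothetical; real implementations differ, but the idea is the same: drop the node as soon as the provider itself says it is offline, instead of waiting for the registry.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

public class OfflineAwareCaller {
    // Local call list of provider nodes (placeholder addresses).
    private final List<String> providerNodes = new CopyOnWriteArrayList<>(
            List.of("http://192.168.0.10:8080", "http://192.168.0.11:8080"));
    private final HttpClient client = HttpClient.newHttpClient();
    private int next = 0;

    String call(String path) throws Exception {
        // Simple round-robin over the local call list.
        String node = providerNodes.get(next++ % providerNodes.size());
        HttpResponse<String> resp = client.send(
                HttpRequest.newBuilder(URI.create(node + path)).build(),
                HttpResponse.BodyHandlers.ofString());

        // The call itself is still handled normally, but the provider also marks
        // the response to say "I am going offline"; drop the node at once.
        if ("true".equals(resp.headers().firstValue("X-Offline").orElse(""))) {
            providerNodes.remove(node);
        }
        return resp.body();
    }
}
```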

With this solution, the time it takes consumers to perceive that a node is offline drops dramatically, from minutes to essentially real-time, so taking an application instance offline causes no business loss.

Batch release and grayscale release

The above describes some of SAE's capabilities for graceful offline. During an application upgrade, gracefully taking instances offline is not enough on its own; we also need a matching release strategy to make sure the new version of the business actually works. SAE provides batch release and grayscale release, which make the release process far less painful and labor-intensive.

Let's start with grayscale release. Suppose an application has 10 instances, each running version Ver.1, and every instance needs to be upgraded to Ver.2.

During the release, two instances are upgraded first as the grayscale batch; once the business is confirmed to be normal, the remaining instances are released in batches. There are always instances running throughout the release, and each instance goes through the graceful offline process described above when it is upgraded, so the business suffers no loss.

Now let's look at batch release, which supports both manual and automatic batching. Taking the same 10 application instances, suppose they are all deployed in 3 batches according to the batch release strategy.
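Purely for illustration, here is a tiny sketch of splitting the 10 instances into 3 release batches; the exact split and pacing SAE uses are not described in this article.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: spread N instances over K batches. Each batch is upgraded,
 *  with every instance in it going through the graceful offline steps above,
 *  before the next batch starts (or waits for manual confirmation). */
public class BatchPlan {
    static List<List<Integer>> plan(int instances, int batches) {
        List<List<Integer>> result = new ArrayList<>();
        int done = 0;
        for (int b = 0; b < batches; b++) {
            int size = (instances - done) / (batches - b);  // spread evenly
            List<Integer> batch = new ArrayList<>();
            for (int i = 0; i < size; i++) {
                batch.add(done++);
            }
            result.add(batch);
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints [[0, 1, 2], [3, 4, 5], [6, 7, 8, 9]] for 10 instances in 3 batches.
        System.out.println(plan(10, 3));
    }
}
```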

That is the example analysis of graceful service offline under the Serverless architecture. Some of these points are likely to come up in day-to-day work, and hopefully this article has helped you learn a bit more about them.
