How to improve the governance efficiency of Java micro-services through Serverless

2025-01-30 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

This article explains in detail how to improve the governance efficiency of Java microservices through Serverless. It is shared here as a reference; I hope you gain some understanding of the topic after reading it.

Challenges in microservice governance

At the beginning of a business, with limited manpower and pressure to develop and launch products quickly, many teams adopt a monolithic architecture. As the company grows, however, new business functions keep being added, the system gets larger, demand increases, more people join the development team, and the code base expands ever faster. Gradually the monolithic application becomes bloated, maintainability and flexibility decline, and maintenance costs keep rising.

At this point, many teams migrate from the monolithic architecture to a microservice architecture to solve these problems. But as the number of microservices grows, the operations investment grows with it: dozens or even hundreds of services must run and cooperate correctly, which poses great challenges to operations. Let's analyze these challenges from the perspective of the software lifecycle:

Development and testing: how to isolate the development, testing, and production environments? How to quickly debug local changes? How to quickly deploy local changes?

Release: how to design a service release strategy? How to take an old version of a service offline without losing traffic? How to canary-test the new version of a service?

Runtime: how to troubleshoot online problems, and with which tools? How to deal with nodes whose service quality is poor? How to recover instances that no longer work at all?

Facing these problems, what does the Serverless application engine do?

Serverless application engine

As shown in the figure above, Serverless App Engine (SAE) builds a Kubernetes cluster on IaaS resources such as X-Dragon bare-metal servers, ECI, VPC, SLB, and NAS, and on top of it provides application management and microservice governance capabilities. It can host different application types, such as Spring Cloud, Dubbo, HSF, Web, and multi-language applications, and supports developer tools such as the Cloud Toolkit plug-in, Yunxiao (RDC), and Jenkins. With SAE, Java microservice applications can be migrated to Serverless with zero code changes.

In general, SAE provides a one-stop application hosting solution with lower cost and higher efficiency: with zero threshold, zero modification, and zero container knowledge, you can enjoy the technical dividend of Serverless + Kubernetes + microservices.

Microservice governance practice

1. Development practice

1) Multi-environment management

Multiple tenants share one registry, with traffic isolated by tenant; environments can be further isolated at the network level through VPCs.

Environment-level operations are provided, such as stopping or starting an entire environment with one click.

Environment-level configuration management is provided.

Environment-level gateway routing and traffic management is provided.
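The tenant and network isolation above can be pictured as a simple mapping from environment to registry namespace. The sketch below is a toy illustration (the class, namespace names, and lookup method are assumptions, not SAE's actual API) of the idea that dev, test, and prod discovery traffic should never mix:

```java
import java.util.Map;

// Toy illustration of environment isolation: each environment maps to its own
// registry namespace so that service discovery traffic never crosses environments.
public class EnvNamespaces {
    private static final Map<String, String> NAMESPACE_BY_ENV = Map.of(
            "dev",  "ns-dev",
            "test", "ns-test",
            "prod", "ns-prod");

    /** Resolve the registry namespace for an environment, failing fast on typos. */
    public static String namespaceFor(String env) {
        String ns = NAMESPACE_BY_ENV.get(env);
        if (ns == null) {
            throw new IllegalArgumentException("unknown environment: " + env);
        }
        return ns;
    }
}
```

A service registering itself would resolve its namespace once at startup, so a misconfigured environment name fails immediately rather than silently polluting another environment's service list.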

2) Cloud joint debugging

Based on the Alibaba Cloud Toolkit plug-in plus a jump host, Serverless App Engine (SAE) enables:

Local services subscribe to and register with the cloud SAE built-in registry

Local services and cloud SAE services can invoke each other.

As shown in the figure above, the user needs an ECS proxy server, which is the machine actually registered with the SAE registry. After the Cloud Toolkit plug-in is installed in IDEA, starting the process also starts a local channel service that connects to the ECS proxy server. All local requests are forwarded to the ECS proxy server, and calls to cloud services are relayed back to the local machine through it. This way you can set breakpoints and debug the latest code locally, which is how cloud joint debugging is implemented.

3) Building a fast development system

After local joint debugging, the code can be deployed to the cloud development environment with one click through the Maven plug-in and the IDEA plug-in.

2. Release practice

1) The three essentials of application release

Grayscale: the release platform must support release strategies such as single-batch, multi-batch, and canary releases; it must also support traffic grayscale; and automatic or manual confirmation between batches should be allowed.

Observability: the release process can be monitored, release logs and results can be viewed in real time on the console, and problems can be located promptly.

Rollback: manual intervention can control the release process: abort on exception, roll back with one click.

With these three capabilities, application releases become grayscale, observable, and rollbackable.

2) Lossless offline of micro-service

During version replacement, how does SAE ensure that traffic to the old version of a microservice drains without loss?

The figure above shows the whole process of microservice registration and discovery: the service provider has instances B1 and B2, and the service consumer has instances A1 and A2.

B1 and B2 register themselves with the registry; the consumer refreshes its service list from the registry and discovers providers B1 and B2; under normal circumstances the consumer calls B1 or B2. When provider B releases a new version, one node is handled first, say B1: its Java process is stopped. Service deregistration can be active or passive: active deregistration is near real-time, while passive deregistration takes a time determined by the registry, in the worst case up to one minute.

If the application stops normally, the ShutdownHook of the Spring Cloud or Dubbo framework runs and the time cost of this step is basically negligible. If the application stops abnormally, for example through a direct `kill -9`, or because the Java process in the Docker image is not PID 1 and the kill signal is never delivered to the application, then the provider does not actively deregister its node, and the registry must passively detect that the service has gone offline.
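The active-deregistration path above can be sketched with a JVM shutdown hook. This is a minimal illustration, assuming a hypothetical `Registry` client with a `deregister(instanceId)` method (real code would call the Nacos or Eureka SDK); note that such a hook runs on SIGTERM or `System.exit`, but never on `kill -9`, which is exactly why the registry's passive expiry is still needed as a fallback:

```java
// Minimal sketch of active deregistration in a JVM shutdown hook.
// Registry is a hypothetical client interface, not a real SDK type.
public class GracefulShutdown {

    /** Hypothetical registry client; in practice this is a Nacos/Eureka SDK call. */
    public interface Registry {
        void deregister(String instanceId);
    }

    private final Registry registry;
    private final String instanceId;
    private boolean deregistered = false;

    public GracefulShutdown(Registry registry, String instanceId) {
        this.registry = registry;
        this.instanceId = instanceId;
        // The hook fires once when the JVM begins an orderly shutdown
        // (SIGTERM, System.exit) -- but NOT on kill -9.
        Runtime.getRuntime().addShutdownHook(new Thread(this::deregister));
    }

    /** Idempotent: safe to call from both a preStop step and the shutdown hook. */
    public synchronized void deregister() {
        if (!deregistered) {
            registry.deregister(instanceId);
            deregistered = true;
        }
    }
}
```

Making the method idempotent matters because, as the text later describes, SAE moves deregistration into the preStop phase, so the same logic may be triggered twice.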

When the registry detects that the service is offline, it notifies the consumer that one of the service nodes has gone down, either by registry push or by consumer polling. The consumer then refreshes its service list and learns that the provider has taken a node offline; for the Dubbo framework this refresh step does not exist, but for Spring Cloud the worst-case refresh time is 30 seconds. Once the consumer's service list is updated, the offline node B1 is no longer called. From step 2 to step 6, the worst case takes two minutes with Eureka as the registry, and 50 seconds with Nacos.

During this window requests may fail, which is why all kinds of errors appear during releases.

According to the above analysis, in the traditional release process there is a period during which the client's calls to the server fail, because the client does not learn in time that the server instance has gone offline. The root cause is that the provider relies on the registry to notify consumers to update the service list.

Can the service provider notify the consumer directly, bypassing the registry? The answer is yes. SAE does two things. First, before the application is released, the provider actively deregisters itself from the registry and marks itself offline, moving the deregistration from the stop phase to the preStop phase.

Second, when the provider receives a request from a consumer during this phase, it processes the request normally but also notifies the consumer that this node has gone offline. On receiving the notification, the consumer immediately refreshes its service list and stops sending requests to provider instance B1.

With this solution, offline perception time is greatly shortened, from minutes to near real-time, ensuring that your application suffers no business loss when going offline.

3) Tag-based grayscale release

Release strategies are divided into batch release and grayscale release; how is traffic grayscale achieved? As the architecture diagram above shows, before the application is released you configure a grayscale rule, for example taking requests where the remainder of uid equals 20 as grayscale traffic. When the application is released, the newly published nodes are tagged as the grayscale version. When traffic comes in, both the microservice gateway and the consumer obtain the grayscale rule configured in the governance center through the configuration center.

The consumer's Agent also pulls information about the services it depends on from the registry. When a request reaches the consumer, it is matched against the grayscale rule: grayscale traffic is forwarded to a grayscale machine, and normal traffic to a normal machine. This is the concrete logic of tag-based grayscale release.
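The routing logic above can be sketched as follows. The class, the `gray` tag, and the uid-remainder rule (here, uid mod 100 below a configured percentage) are illustrative assumptions, not SAE's actual rule syntax:

```java
import java.util.List;

// Illustrative sketch of tag-based routing: nodes from the current release
// carry a "gray" tag, and a uid-remainder rule decides which traffic they get.
public class GrayRouter {

    /** A service node is either a grayscale (new-version) instance or a normal one. */
    public record Node(String address, boolean gray) {}

    /** Example rule: uids whose value mod 100 falls below grayPercent are gray traffic. */
    public static boolean isGrayTraffic(long uid, int grayPercent) {
        return uid % 100 < grayPercent;
    }

    /** Pick the subset of nodes matching this uid's traffic class. */
    public static List<Node> select(List<Node> nodes, long uid, int grayPercent) {
        boolean gray = isGrayTraffic(uid, grayPercent);
        return nodes.stream().filter(n -> n.gray() == gray).toList();
    }
}
```

A remainder-based rule has the useful property that the same uid is always routed the same way, so one user sees a consistent version throughout the grayscale period.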

3. Runtime practice

1) Strong application monitoring & diagnosis capability

How to troubleshoot and solve the various problems that arise while a service is running?

The prerequisite for troubleshooting is strong application monitoring and diagnosis capability. SAE integrates the ARMS cloud product, so a Java microservice running on it gets an application call-topology map and can locate the call stack of a slow method (for example, one blocked on a slow MySQL query), pinpointing problems at the code level.

For example, if a request responds slowly and the business is affected, you can locate which request, which service, and which line of code has the problem, which greatly simplifies troubleshooting. In general, monitoring and alerting capabilities are needed before problems in a running service can be diagnosed well.

2) Fault isolation and service recovery

As mentioned above, monitoring and alerting let us troubleshoot and solve problems; but can the system take the initiative itself? As a Serverless platform, SAE has many self-operating capabilities. The figure below shows two scenarios:

Scenario 1: while an application is running, a few machines suffer high load or poor network conditions due to full disks or contention for host resources, causing client-side call timeouts or errors.

For this situation, SAE provides a configurable service-governance capability: outlier ejection. When network timeouts are severe or the proportion of 5xx errors from a backend instance reaches a threshold, the node can be removed from the consumer's service list so that the problematic machine no longer serves business requests, protecting the business SLA.
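The ejection decision can be sketched as a per-instance error counter. This is a simplified illustration (the window size and error-ratio threshold are assumptions; SAE's real implementation is configurable and more elaborate):

```java
// Simplified sketch of outlier ejection: track calls per backend instance
// and eject it once the 5xx error ratio crosses a configured limit.
public class OutlierEjector {
    private final int minSamples;
    private final double maxErrorRatio;
    private int total = 0;
    private int errors = 0;

    public OutlierEjector(int minSamples, double maxErrorRatio) {
        this.minSamples = minSamples;
        this.maxErrorRatio = maxErrorRatio;
    }

    /** Record one call result; any HTTP status >= 500 counts as an error. */
    public void record(int httpStatus) {
        total++;
        if (httpStatus >= 500) {
            errors++;
        }
    }

    /** Eject only after enough samples, to avoid reacting to a single failure. */
    public boolean shouldEject() {
        return total >= minSamples && (double) errors / total >= maxErrorRatio;
    }
}
```

Requiring a minimum sample count before ejecting is the key design choice: without it, one transient timeout would be enough to remove a healthy node from rotation.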

Scenario 2: sudden traffic makes an application run out of memory and triggers an OOM.

In this case, with a Serverless application engine such as SAE, once a health check is configured for the node, the container on the node can be pulled up again and the process recovers quickly.
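A health check like this boils down to the application exposing a probe endpoint that the platform polls, restarting the container when it stops answering. The sketch below uses only the JDK's built-in HTTP server; the `/healthz` path is an assumption, since the platform simply probes whatever URL you configure:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

// Minimal liveness-style endpoint using the JDK's built-in HTTP server.
public class HealthEndpoint {

    /** Start a server answering 200 "ok" on /healthz; port 0 picks a free port. */
    public static HttpServer start(int port) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/healthz", exchange -> {
            byte[] body = "ok".getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
        return server;
    }
}
```

If the JVM is stuck in OOM churn, the probe stops responding and the platform's health check fails, which is exactly the signal that triggers the container restart described above.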

3) Precise capacity + rate limiting and degradation + extreme elasticity

Through the interaction between SAE and other products on the Serverless PaaS platform, a closed loop over the whole runtime state is achieved.

In practice, users can construct scenarios with the PTS stress-testing tool to obtain thresholds: for example, estimate the resources consumed at peak traffic, and then design elasticity policies based on those thresholds. When the business system reaches the configured request threshold, it scales its machines in and out according to the elasticity policy.
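Threshold-driven scaling of this kind is commonly computed with the standard HPA-style formula, desired = ceil(current × observed / target). The sketch below is an illustration of that formula, not SAE's actual policy engine; the target utilization is assumed to come from the PTS load tests just described:

```java
// Sketch of threshold-driven scaling using the standard HPA-style formula:
// desired = ceil(current * observed / target), clamped to [min, max].
public class ScalePolicy {

    /** Compute the desired replica count from observed vs. target CPU utilization. */
    public static int desiredReplicas(int current, double observedCpu,
                                      double targetCpu, int min, int max) {
        int desired = (int) Math.ceil(current * observedCpu / targetCpu);
        return Math.max(min, Math.min(max, desired));
    }
}
```

For example, 4 replicas at 90% observed CPU against a 70% target scale out to 6 replicas, while the same 4 replicas at 20% scale back in, bounded by the configured minimum.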

Scaling out may not keep up with a sudden flood of requests. For this case, you can configure rate limiting and degradation through integration with AHAS: when sudden heavy traffic arrives, AHAS first blocks part of the traffic while the scaling policy on SAE expands the instances; once the new instances are up, the average load of the machines drops and the blocked traffic is let back in. From sudden heavy traffic, to rate limiting and degradation, to scale-out, and finally back to normal traffic: this is the best-practice model of "precise capacity + rate limiting and degradation + extreme elasticity".

To summarize, following the order in which problems were raised and solved, this article described how to solve problems in the development, release, and runtime of microservices, and then introduced how to achieve precise capacity, rate limiting and degradation, and extreme elasticity through the interaction of Serverless products with other products.

Development and testing: isolate environments through registry multi-tenancy and network isolation, and provide environment-level capabilities; quickly debug local changes through cloud joint debugging; quickly deploy local changes through IDE plug-ins.

Release: the platform makes application releases grayscale, observable, and rollbackable; achieves lossless service offline through the MSE agent capability; and provides online traffic grayscale testing through tag routing.

Runtime: establish strong application monitoring and diagnosis capability; eject nodes with poor service quality; restore instances that no longer work by restarting them via configured health checks; and provide the precise capacity + rate limiting and degradation + extreme elasticity model.

That is all for this share on improving the governance efficiency of Java microservices through Serverless. I hope the content above is helpful to you. If you found the article useful, feel free to share it so more people can see it.


© 2024 shulou.com SLNews company. All rights reserved.
