How did Luoji Siwei ("Logical Thinking") rebuild its services as Go microservices? This article walks through the analysis and the solutions in detail, in the hope of giving readers facing the same problem a simpler, workable path.
1. The background of transformation
The earliest app was a single PHP monolith: the large yellow block in the diagram, with the blue blocks in the middle representing its different modules. The yellow blocks below are the passport and payment systems, which predated the app because the company ran a WeChat e-commerce business in its early days.
Later we found that some business logic, and some data-format conversion work, did not need to be coupled to the main application, so we added a PHP gateway layer, the V3 part in the figure below. But this had problems of its own. The PHP backend runs on FPM: once a backend interface responds slowly, you must spawn a large number of FPM workers to sustain concurrency, driving up system load. From that point it was clear PHP was the wrong tool for this layer.
1.1 When it rains, it pours
1.2 Transformation goal
High performance
First, high performance: if a single machine can only serve a few dozen QPS, it is hard to meet demand simply by stacking machines.
Service-orientation
Service-orientation actually began before the outage; since our different business teams were already responsible for different businesses, we needed to keep moving in that direction.
Resource split isolation
Along with service-orientation, resources must be split: each service exposes its own interface, and no service may directly access another service's database or cache.
High availability
The goal at the time was 99.9% availability.
1.3 Why choose Go
Go has many benefits, but the decisive ones for us were that PHP programmers find it easy to pick up, and the performance is far better.
2. The process of transformation
2.1 First, a system architecture diagram
To transform the system we first need to know what it should become, so we need an architectural blueprint, shown here. At the top, in yellow, is a unified external API gateway. The lilac part in the middle is the externally facing business services. Light green is basic resource services, such as audio and text content and encryption. Red at the bottom is common services such as payment and passport; on the far right are shared frameworks and middleware, and at the very bottom the infrastructure.
Our framework evolved hand in hand with infrastructure improvements and system refactoring. It is not that the initial design was flawless: as the business changed, many new features were added along the way.
2.2 improvement of framework and infrastructure
I'm not going to talk about how to split the application system, because every company's business is different; I'll focus on our framework and middleware.
API gateway
The API gateway was developed by our team together with Chen Hao (well known online as "Left-Ear Mouse"). His team contributed a great deal to getting us through the New Year traffic, and I'd like to thank them here.
Rate limiting
The primary goal of the API gateway is rate limiting. During the migration we had more than 400 interfaces live, with new ones added constantly. We can guarantee the performance of new interfaces, but some old interfaces inevitably get neglected along the way. Rate limiting at the API gateway ensures that when traffic surges, at least some users of the old interfaces can still be served.
Upgrade API
Most API upgrades are handled in the client, but we don't force users to upgrade, so old interfaces stay online for a long time. We therefore do some work at the API gateway layer to convert new-interface data formats into the old formats.
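As a sketch of that conversion, assuming invented old and new response schemas (the article does not describe the real formats), the gateway can rewrite a new-format payload into the shape an old client expects:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// newResp and oldResp are hypothetical schemas for illustration only.
type newResp struct {
	Items []struct {
		ID    int64  `json:"id"`
		Title string `json:"title"`
	} `json:"items"`
}

type oldResp struct {
	List []map[string]interface{} `json:"list"`
}

// downgrade converts a new-format response body into the old format,
// so unupgraded clients keep working without touching backend services.
func downgrade(raw []byte) ([]byte, error) {
	var n newResp
	if err := json.Unmarshal(raw, &n); err != nil {
		return nil, err
	}
	var o oldResp
	for _, it := range n.Items {
		// pretend the old clients expected string ids under different key names
		o.List = append(o.List, map[string]interface{}{
			"article_id": fmt.Sprint(it.ID),
			"name":       it.Title,
		})
	}
	return json.Marshal(o)
}

func main() {
	out, _ := downgrade([]byte(`{"items":[{"id":7,"title":"hi"}]}`))
	fmt.Println(string(out))
}
```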
Authentication
After the split of the service, we need to uniformly authenticate and control the interface. The practice of the industry is usually done at the gateway level, and we are no exception.
Let's take a look at the architecture of API gateway.
The API gateway consists of one write node and multiple read nodes, which communicate with each other via the gossip protocol. Each node has a CLI on top that can call the gateway's API. Beneath that, components such as the HTTP server are plugins; multiple plugins compose different pipelines to handle different requests (more on this design below). Each node also has a statistics module tracking metrics such as per-interface average response time and QPS. When configuration changes, the write node syncs the configuration to the read nodes and persists it to local disk through the model module.
A request passes through two pipelines. The first is selected by the request URL; different pipelines are composed from different plugins, so if an interface needs no rate limiting, its configuration simply omits the limiter plugin. The second pipeline is built from the backend server configuration and handles load balancing.
Next, let's take a look at the whole process and scheduling aspects of API gateway startup.
Startup is straightforward: load the plugins, then load the configuration files, and assemble plugins into pipelines according to the configuration. The scheduler (top right) comes in static and dynamic forms. A static scheduler allocates, say, five goroutines, and those same five goroutines always process the corresponding requests. A dynamic scheduler varies the goroutine count between a configured minimum and maximum, depending on how busy the requests are.
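The static case above can be pictured as a fixed worker pool. This is a minimal sketch, not the gateway's code; `processAll` and its counter are invented for illustration, and the dynamic variant (growing and shrinking the pool between a min and max) is omitted:

```go
package main

import (
	"fmt"
	"sync"
)

// processAll drains n requests with a fixed pool of `workers` goroutines:
// static scheduling, where the same fixed set of goroutines always
// handles the incoming requests.
func processAll(workers, n int) int {
	requests := make(chan int)
	var wg sync.WaitGroup
	var mu sync.Mutex
	handled := 0

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for range requests { // each worker pulls from the shared queue
				mu.Lock()
				handled++
				mu.Unlock()
			}
		}()
	}
	for i := 0; i < n; i++ {
		requests <- i
	}
	close(requests)
	wg.Wait()
	return handled
}

func main() {
	fmt.Println(processAll(5, 20)) // all 20 requests handled by 5 goroutines
}
```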
Authentication at the gateway is simple. When a client calls the login API, passport hands the token and user ID to the API gateway, and the gateway passes the token on to the app. The client carries the token on subsequent requests; if token verification fails, the gateway returns an error to the client directly. If verification succeeds, the gateway calls the appropriate backend services and finally returns the result to the client.
Finally, I want to highlight how rate limiting in the API gateway works.
We built two rate-limiting strategies into the gateway:
1. Sliding-window rate limiting
Why rate-limit with a sliding window? With so many interfaces online, we don't know whether the right cap for each is 100, 200, or 10,000 without stress-testing every one. Instead, a sliding window tallies response times and success/failure counts within a time window, and those statistics decide whether the next window should be limited.
2. QPS rate limiting
Why keep a QPS limit as well? Promotions. A sliding window is, after all, a time window: during a promotion, users pick up their phones, scan a QR code, and the traffic arrives instantaneously. A sliding window can hardly react in time in that case.
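The QPS cap can be as simple as a per-second counter. This sketch is illustrative, not the gateway's implementation:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// qpsLimiter caps requests per second with a counter that resets each
// second: a hard cap that reacts instantly to a traffic spike, unlike
// the sliding-window statistics.
type qpsLimiter struct {
	mu     sync.Mutex
	limit  int
	count  int
	second int64 // unix second the counter belongs to
}

func (l *qpsLimiter) Allow() bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	now := time.Now().Unix()
	if now != l.second {
		l.second, l.count = now, 0 // new second, reset the counter
	}
	if l.count >= l.limit {
		return false
	}
	l.count++
	return true
}

func main() {
	l := &qpsLimiter{limit: 3}
	for i := 0; i < 5; i++ {
		fmt.Println(l.Allow()) // only the first 3 in a second pass
	}
}
```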
Service framework
Goals: simplify application development, provide unified service registration, and make configuration management convenient.
The first approach is a library: compile the relevant functionality into the service itself. This has two problems. First, we support several languages, so the development burden is large. Second, once a client library ships to production with the service caller, upgrading it requires the caller to change code and redeploy, so upgrades and rollouts meet a lot of resistance. In the industry, Spring Cloud, Dubbo, and Motan use this mechanism.
The other approach moves load balancing into an agent that runs alongside the consumer. Each consumer request asks the agent for a Service Provider address, then calls that provider. The advantages: the service caller is simplified, no per-language client library is needed, and upgrading the load balancer doesn't require callers to change code. The disadvantages are equally clear: deployment is more complex, availability testing is more troublesome, and the agent itself can fail; if the agent goes down, every service on that host goes with it. Baidu's internal BNS and Airbnb's SmartStack service-discovery frameworks work this way. Because we use many languages internally, we chose this second approach.
In a Consul cluster, a Consul agent runs on every node that provides services, and the set of all running agents forms the cluster. An agent runs in one of two modes, Server or Client; the distinction exists only at the Consul cluster level and has nothing to do with the application services built on top. Agents in Server mode maintain the cluster's state, and Consul officially recommends at least three Server-mode agents per cluster; Client-mode agents are relatively stateless and forward requests to the Servers.
In DDNS (our service-discovery system) there is no strict Client/Server role distinction: a process is a client when it requests a service and a server when it provides one.
DDNS ships as an SDK that is easy to integrate, and it can also be extended into a stand-alone service with more features. With the agent approach, an agent is installed on every server, and requests can be made over HTTP or gRPC.
Once a service has started and can serve traffic, it calls the agent interface v1/service/register to register itself into DDNS.
If registration succeeds, other clients can obtain the app's node information through the DDNS discovery API.
If registration fails, the app retries; after three failed retries it raises an alarm.
Suppose service A needs to call service B, whose service name is bbb. A requests the local agent interface v1/service/getservice to obtain bbb's node information.
On the agent side, if this is the first request for bbb, the agent queries the Consul cluster, caches bbb's data locally, sets a watch on bbb's nodes, and refreshes the local service information periodically.
If the lookup fails, the agent reports the reason, and raises an alarm if it is a system error.
This is the basic interface of the service framework.
This is the encapsulation of client calls, supporting both HTTP and gRPC. On top of this we also added RBAC-based permission control, so we can govern which services are allowed to call which.
Multi-level cache
When a client request reaches the server, the server looks in the cache first and returns on a hit; on a miss it queries the database, writes the result back to the cache, and then returns it to the client. That is the simple single-level model, but one level may not be enough. For example, in stress tests we found that for our workload a single redis instance sustains roughly 10,000 QPS per interface. What if we need more? We introduced multi-level caching.
The closer a cache sits to the top, the smaller it is. Level one is the service-local cache: on a hit, return the data; on a miss, check L1. If L1 hits, update the local cache and return the data. If L1 misses, check L2; if L2 hits, update L1 and the local cache and return the data.
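A toy version of that lookup order, with maps standing in for the real stores (e.g. a process-local cache and two redis tiers):

```go
package main

import "fmt"

// multiCache looks up a key through local → L1 → L2 in order,
// backfilling the nearer levels on a hit; a miss at every level means
// the caller falls through to the database.
type multiCache struct {
	local, l1, l2 map[string]string
}

func (c *multiCache) Get(key string) (string, bool) {
	if v, ok := c.local[key]; ok {
		return v, true
	}
	if v, ok := c.l1[key]; ok {
		c.local[key] = v // backfill local
		return v, true
	}
	if v, ok := c.l2[key]; ok {
		c.l1[key] = v // backfill L1 and local
		c.local[key] = v
		return v, true
	}
	return "", false // caller queries the database and sets the caches
}

func main() {
	c := &multiCache{
		local: map[string]string{},
		l1:    map[string]string{},
		l2:    map[string]string{"article:1": "hello"},
	}
	v, _ := c.Get("article:1")
	fmt.Println(v) // served from L2 on first access
	_, hit := c.local["article:1"]
	fmt.Println("local backfilled:", hit)
}
```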
The above is caching for a single piece of content. Across the whole stack, the gateway can also cache part of the data so that requests need not penetrate further. What does dotted line 5 mean? Data must be updated after modification, and the application layer sometimes fails to do so, so we read the database binlog to backfill the misses and reduce data inconsistency.
I have always felt this code would be much easier to write with generics; lacking them, the general-purpose framework is full of reflection instead.
QPS by implementation:

                          PHP + redis   Go + redis   Go + big cache   Go + object cache
Free column home page     100+          600+         2000+            12000+
Paid column home page     200+          900+         2400+            14000+
This is the overall performance comparison after adding multi-level caching. The earliest PHP version managed one to two hundred QPS; switching to Go alone was not dramatically stronger. Go with big cache reached about two thousand, though with some problems we'll discuss later. Finally, with object caching, the test machine (eight cores) reached the figures above, which we consider acceptable.
Circuit breaking and degradation
The API layer calls internal services concurrently. Services 7, 8, and 9 behave differently: service 5 is effectively dead, yet it is still being called on every external request. We need to cut the call volume and give service 5 room to recover.
A circuit breaker starts in the closed state. When failures reach a threshold, it trips to open; when the fuse window ends, it moves to half-open and admits part of the traffic. If failures still exceed the threshold, it returns to the open state; otherwise it closes again. The statistics use the sliding-window algorithm mentioned earlier.
Here we use a Go port of Java's Hystrix library; Java has many well-built frameworks and libraries worth borrowing from.
3. Experience summary
3.1 Common basic libraries are very important
In the performance improvements just shown, it took us only one day to raise QPS from 600 to 12,000, mainly because the optimizations live in shared base libraries: every service benefits from an improvement made there.
3.2 make good use of tools
Generate + framework improves development efficiency
Pprof+trace+go-torch identifies performance issues
For example, we lean heavily on generate + templates to produce large amounts of code. pprof, trace, and go-torch save a lot of effort when chasing performance problems; go-torch renders flame graphs, and newer versions of Go have flame graphs built in.
From our table structures we generate the corresponding database-access code. The multi-level cache abstracts all access into K-V, K-list, and similar access patterns; writing these by hand every time is tedious, so we built a tool: tell it which table you use and it generates the code, which you only need to assemble.
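The generation step can be pictured with text/template. The accessor shape and the names here are invented to show the mechanism, not the real tool's output:

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// accessorTmpl emits a typed accessor for a table; the real tool generates
// full database plus multi-level-cache code per table.
var accessorTmpl = template.Must(template.New("accessor").Parse(
	"func Get{{.Type}}ByID(id int64) (*{{.Type}}, error) {\n" +
		"\treturn {{.Table}}Store.Load(id)\n" +
		"}"))

// generate renders the accessor source for one table.
func generate(typ, table string) string {
	var buf bytes.Buffer
	accessorTmpl.Execute(&buf, map[string]string{"Type": typ, "Table": table})
	return buf.String()
}

func main() {
	fmt.Println(generate("Article", "article"))
}
```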
When locating performance problems, flame graphs are indispensable. You look for the widest frames, the hot spots, and focus optimization effort there. During stress testing at the 600 and 900 QPS stages we found problems in the flame graph; after optimization the result is shown in the following figure.
3.3 Summary of other experiences
Optimize for hot code
Reasonable reuse of objects
Try to avoid reflection
Reasonable serialization and deserialization
GC overhead
For example, one of our services fetched large ID lists from the cache, stored in JSON, and we found that JSON serialization and deserialization cost more than 50% of the overhead. This morning Didi talked about their JSON library, which claims a 10x speedup; in our scenario it could only double performance, although doubling from a one-line change is still a big win. Worse, the garbage generated by JSON serialization is also a serious GC problem, reaching 20% of CPU at peak, even though Go's collector is very good. The final solution was to replace JSON with PB here: PB deserialization (in our case) really is about 10x faster than JSON and allocates far fewer temporary objects, reducing GC overhead.
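To illustrate replacing generic JSON on a hot ID-list path (the project's actual fix was protobuf; this hand-rolled comma-separated codec just shows the idea of avoiding reflective encoding and temporary allocations):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// encodeIDs writes an ID list as "1,2,3" instead of JSON, avoiding
// encoding/json's reflection on this hot path.
func encodeIDs(ids []int64) string {
	parts := make([]string, len(ids))
	for i, id := range ids {
		parts[i] = strconv.FormatInt(id, 10)
	}
	return strings.Join(parts, ",")
}

// decodeIDs parses the compact form back into a slice.
func decodeIDs(s string) ([]int64, error) {
	if s == "" {
		return nil, nil
	}
	fields := strings.Split(s, ",")
	ids := make([]int64, len(fields))
	for i, f := range fields {
		id, err := strconv.ParseInt(f, 10, 64)
		if err != nil {
			return nil, err
		}
		ids[i] = id
	}
	return ids, nil
}

func main() {
	s := encodeIDs([]int64{1, 2, 3})
	ids, _ := decodeIDs(s)
	fmt.Println(s, ids)
}
```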
Why avoid reflection? We cache whole objects in the local cache, which requires that an object never be modified outside the cache, yet the business sometimes genuinely needs to modify it. After hitting this, we first used reflection to do deep copies. In Java that works, because the JVM compiles reflective access into ordinary code that is then called directly, but not in Go: code whose performance was originally close to C dropped toward Python speed once it leaned heavily on reflection. So we defined a Cloneable interface and had programmers implement the copy by hand.
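A sketch of the Cloneable approach; the Article type is invented for illustration:

```go
package main

import "fmt"

// Cloneable replaces reflection-based deep copy: each cached type
// implements Clone by hand, so reads from the local cache get a private
// copy without paying reflect's cost.
type Cloneable interface {
	Clone() Cloneable
}

type Article struct {
	ID   int64
	Tags []string
}

// Clone makes a deep copy, including the slice a shallow copy would share.
func (a *Article) Clone() Cloneable {
	cp := *a
	cp.Tags = append([]string(nil), a.Tags...)
	return &cp
}

func main() {
	orig := &Article{ID: 1, Tags: []string{"go"}}
	cp := orig.Clone().(*Article)
	cp.Tags[0] = "php" // mutating the copy must not touch the cached original
	fmt.Println(orig.Tags[0], cp.Tags[0])
}
```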
Pressure testing
We mainly use ab and Siege, which stress-test a single system. But during real use, any link in the call chain can fail, so under microservices, single-system stress testing, important as it is, cannot flush out every problem in the system.
For example, when Boss Luo gives things away for the New Year, the user first claims the gift through one interface, then usually opens the purchased list to check it arrived, and finally confirms the item is correct. So you need to stress the entire path, not just the claim interface that seems likely to have problems. Suppose the purchased-list interface is slow: users who claim a gift and don't see it will keep refreshing, so the slower that interface gets, the more likely it becomes the bottleneck. You must plan realistic access paths and stress-test every service on the chain, not a single service.
We simply bought Aliyun's PTS service: it issues requests from CDN nodes and can simulate the entire access path.
4. What are we doing next?
4.1 Sharding (splitting databases and tables) and distributed transactions
Choosing a database depends on your company's operations. Distributed transactions matter more to me: we have many purchase flows, and once a flow is split across microservices, a single failure means rolling the whole thing back. Today we control that manually, but as the business grows it is impossible to control everything by hand, so we need a distributed-transaction framework; we are now building our own based on TCC (Try-Confirm-Cancel).
Sharding is also a hard requirement. The main reason we haven't adopted TiDB is that our DBA team isn't familiar with it. Previously programmers handled sharding themselves; now we are building a framework that supports both database and table splitting, with both hash and range strategies.
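The two routing strategies might be sketched like this, with the key formats, shard counts, and table-name suffixes all invented for illustration:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// hashShard spreads keys evenly across a fixed number of shards.
func hashShard(key string, shards uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(key))
	return h.Sum32() % shards
}

// rangeShard routes ordered IDs in contiguous blocks:
// ids 0..perShard-1 → shard 0, the next block → shard 1, and so on.
func rangeShard(id, perShard int64) int64 {
	return id / perShard
}

func main() {
	fmt.Printf("user_%d\n", hashShard("user:42", 16))
	fmt.Printf("order_%d\n", rangeShard(123456, 100000))
}
```

Hash sharding balances load but makes range scans awkward; range sharding keeps adjacent IDs together but can hot-spot the newest shard, which is why the framework aims to support both.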
4.2 API gateway
There is still a lot to do in the API gateway; we have covered circuit breaking and degradation. Much of what service meshes do today overlaps with an internal API gateway: control-plane work that should not be the concern of business logic. We are also thinking about how to combine the API gateway with a service mesh.
4.3 APM
After splitting into microservices, the biggest pain is that locating a specific problem is no longer convenient. When something breaks, I have to ask several people to check whether the systems they are responsible for are fine before anyone can see where the fault lies, which is a painful way to work. With APM + tracing in place, it is much easier to track down where a problem occurred.
4.4 containerization
Our production environment still runs on virtual machines; the staging and test environments are already containerized. Containers have many advantages I won't list one by one. This is key work for us in the second half of the year.
4.5 cache serviceability
We have a multi-level cache today, but it ships as a library. We want to extract the cache into a stand-alone service speaking the memcached or redis protocol, so that business systems iterating on top of it no longer have to worry about the cache's own scaling strategy.