2025-03-26 Update. Source: SLTechnology News & Howtos, Shulou (shulou.com).
Today I will talk about how we used K8s to build test environments for 1000 applications. The topic may be unfamiliar to many readers, so I have summarized the key points below; I hope you find something useful in this article.
Problems and status quo
Almost half a year ago, the infrastructure team was handed a task: environment governance. The development teams were plagued by test-environment problems that hurt R&D efficiency and schedules, and they urgently needed a solution.
Let's look at the situation at the time. Xinya Technology is a financial Internet company with more than 10 years of history, many business products, and 1000+ applications, so the overall system is huge. Some colleagues vividly compared it to a railway from Xinjiang to Shanghai: users board in Xinjiang, pass through the stations (teams) of more than 10 provinces, and finally get off in Shanghai.
If any segment of that line breaks, the whole line may become impassable. At the time the company had three relatively complete environments: FAT, UAT, and PRE. UAT and PRE were not directly available to developers, so all development and testing happened in the single FAT environment. The result was constant conflict: new feature tests, old bug fixes, and day-to-day component failures all collided. Communication basically relied on shouting, and when something broke, different components and different teams sometimes had to troubleshoot together, which badly hurt efficiency.
Core issues
Given this status quo, what is the core problem? Our analysis identified two conflicting forces.
A stable system requires stable components. With only one FAT environment, everyone made changes on it, so some modules of the system were never fully tested; unstable modules made the whole system unstable. At the same time, rapid business growth requires rapid component iteration and constant updates: requirements change, bugs get fixed, new features need testing. Without changes the business cannot move forward, so components must keep changing.
In short, stability requires components to stop changing, but the business requires components to keep changing. The two demands contradict each other.
What is the crux of the problem?
Resource conflict. There is only one complete environment, so resources became the bottleneck. Would copying a few more environments solve the problem?
But then: how do we copy an environment? How do we control the resource cost and maintenance cost of each copy? We wanted the cost of replication and upkeep to be small. Moreover, multiple environments mean the application versions in each environment must be synchronized constantly; otherwise every environment ages, degrades, and gradually loses its reason to exist, and we are back to a single test environment.
So this is how we designed it in the end.
Small goals
Matching the two core problems, we set two small goals.
Provide one stable test environment and keep it stable. It is synchronized with production so that it always matches what is online.
Provide multiple private test environments, one per team, so that each team's testing does not affect the others.
Before pursuing these goals we already had solid technical reserves and development conventions. Containers and K8s had landed, applications were basically fully containerized, the container release process had been turned into a platform, and application configuration used a central configuration service (Apollo). Components such as MQ exposed RESTful interfaces, and our containers supported static IP allocation. These conditions were the technical cornerstone of the project's eventual success.
Out of this effort the Nebula system was born. Through it we drive K8s to create, manage, and destroy environments, and it met our goals well. (Note: in our company most applications interact via RESTful RPC, address each other by domain name, and generally forward messages through Nginx; the whole design takes this as its premise.)
Four advantages
Test environments can be created quickly.
Applications can be reused across environments.
The system requires zero configuration to use.
Applications need no modification.
How do you do all this?
Overall design
How do we use K8s to do all this? The overall components and workflow look like this.
We built the Nebula multi-environment management platform, on which users create test environments by themselves. Once a creation request is issued, Nebula converts it into a batch of container-creation instructions and hands them to the existing container release platform (reusing stable, proven services). That platform then has K8s start the specified images and applications in the specified order and apply the corresponding configuration.
Once all the applications in an environment are up, the test environment is ready, and testers or developers can use this independent environment for development and testing. Each environment is self-contained and does not affect the others.
Note that we only use K8s to manage containers: it manages Pods and nothing else, so our dependence on it is kept to a minimum.
Structure and principle
From the user's point of view, Nebula's design and implementation look like this. Environments are completely isolated.
First of all, each environment must work normally without affecting the others. How is that achieved?
Here we borrowed from the architecture of K8s itself: how does K8s make its own system work?
K8s has two very important components, CoreDNS and Ingress, which handle internal addressing and external traffic respectively. Nebula borrows this design idea.
We pre-install two important components into every new environment: a DNS server and a traffic gateway (Nginx).
They serve only their own environment and are its core. The DNS component starts first; it talks to the Nebula multi-environment management platform to obtain the instance data of its test environment. The gateway is an ordinary Nginx; it also talks to Nebula to obtain its forwarding rules and works according to them.
When an ordinary business application in the environment starts, it first resolves addresses through the environment's DNS component, so applications can locate each other and exchange messages through the gateway. When a user tests against the environment from outside, the test traffic is sent to the gateway container, whose default forwarding policy distributes it to the appropriate containers so the whole test works properly.
DNS implementation
The DNS implementation itself is not complicated; the figure below sketches the principle. The Nebula multi-environment platform synchronizes instance data to each environment's DNS component according to that environment's instance status, and records for the same name are prioritized. In this way we can customize the resolution results we want.
An application in environment 1 that looks up A.com gets 192.168.2.200; looking up B.com returns 192.168.2.101, and C.com returns 172.168.2.4.
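A minimal sketch of this priority rule, assuming a simple in-memory record table. The domain names and IPs come from the example above; the lookup function and table layout are illustrative, not Nebula's actual code.

```python
# Illustrative sketch: a per-environment DNS table where environment-local
# records take priority over the records synced from the Default environment.

DEFAULT_RECORDS = {            # records synced from the stable Default environment
    "A.com": "192.168.2.200",
    "B.com": "192.168.2.100",
    "C.com": "172.168.2.4",
}

ENV_OVERRIDES = {              # higher-priority records for this test environment
    "B.com": "192.168.2.101",  # environment 1 runs its own instance of B
}

def resolve(domain: str) -> str:
    """Environment-local records win; everything else falls back to Default."""
    return ENV_OVERRIDES.get(domain, DEFAULT_RECORDS[domain])
```

With this table, only B.com is overridden for environment 1, while A.com and C.com fall through to the Default records, matching the figure's results.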
Because records are prioritized, automatically synchronized data coexists with records that users set freely, so we can shape each environment's DNS at will from the Nebula control center.
To sum up, the DNS records in every environment are controlled by Nebula: most are maintained automatically, but you can also override them.
Gateway implementation
The gateway works on a similar principle. The Nebula control center computes the gateway rules for each environment and pushes them to the gateway server. Through the cooperation of the gateway and the DNS, Nebula operates at the control plane while the gateway and DNS operate at the data plane.
RPC implementation
With DNS and the gateway in place we have only finished the preliminary construction of an environment; many details still need optimizing, the most important being RPC.
We have 1000+ applications. If we naively started full copies, one environment would need 1000 instances and 10 environments would need 10,000+. That is a huge resource load, and maintaining that many environments would be very expensive. So we needed applications to be reusable across environments.
As the saying goes, a picture is worth a thousand words; the whole system is essentially the scene in the following picture.
The shared base environment is the Default environment, which is kept consistent with production and kept stable. Project environment 1, 2, and 3 are generated by Nebula using K8s container management. Teams test new requirements and fix bugs there without affecting each other, and without affecting the stability of the base environment.
What we finally achieved is something like this.
The colored arrows represent RPC message flows in the different environments.
RPC retrofit design
First, we referred to Google's paper "Dapper, a Large-Scale Distributed Systems Tracing Infrastructure", on which most APM (Application Performance Management) systems are built. We extended this design to meet our needs.
APM has the concept of a TraceID, which uniquely identifies a call. It is unique, carries no other meaning, and is held by a request and all of its subrequests. We need a similar capability: in addition to a TraceID, each request also carries an EnvID. As a request and its subrequests travel through different systems, they carry both the unique TraceID and the EnvID, and we rely on the EnvID to route traffic across multiple environments.
The arrows in the image above represent RPC messages; different colors indicate different Env-IDs carried in the messages. For example, red arrows carry Env-ID FAT1, green arrows carry FAT2, and blue arrows carry Default.
At this point, we need to solve three problems:
1. How is the EnvID generated, and what does it look like?
2. How is the EnvID transmitted through the system?
3. When and how does the EnvID take effect, and why does that solve the problem?
To solve these three problems we use three techniques: dyeing, transparent transmission, and intelligent routing. The details follow.
Question 1: how is the EnvID generated, and what does it look like?
At present every flow is initiated by a client, such as an H5 page or a mobile App, which interacts with the back-end system through standard HTTP requests to start the business. The client uses requests like this to interact with the server.
All we need to do is attach a tag that marks the source environment as traffic enters each environment. Something like this.
We use this tag to mark which environment a message came from, a process we call dyeing. The tag exists for the whole request life cycle and never changes in the request or its subrequests.
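A minimal sketch of the dyeing step, assuming the tag travels as an HTTP header. The header name X-Env-ID, the environment name, and the dict-based interface are illustrative assumptions, not the project's actual implementation.

```python
# Illustrative sketch of dyeing: the environment's entry gateway stamps
# every incoming request with a header identifying the environment.

ENV_ID = "FAT1"  # this gateway serves test environment FAT1

def dye(headers: dict) -> dict:
    """Add the environment tag only if the request is not already dyed,
    so the Env-ID stays constant for the whole request life cycle."""
    if "X-Env-ID" not in headers:
        headers = {**headers, "X-Env-ID": ENV_ID}
    return headers
```

Requests entering FAT1 undyed get the FAT1 tag; requests that already carry a tag pass through unchanged.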
Comparison of dyeing schemes: there are two candidate schemes, client dyeing and gateway dyeing.
Client dyeing
In this scheme the client is modified so that its requests carry the environment identity directly. The advantage is that the tag is customizable and under our control. The drawbacks are obvious: it intrudes on front-end business code, is inflexible, may not be supported by H5, Web, and other clients, and if the environment tag ever needs to be extended, every client has to be rebuilt.
Gateway dyeing
Gateway dyeing is much more flexible: it requires no client changes, does not intrude on business code, supports H5 and Web, and is easy to upgrade. So we chose gateway dyeing.
Question 2: how is the EnvID transmitted through the system?
This is essentially a piece of technical debt: what we need is the information-propagation capability that APM tools such as Pinpoint, Zipkin, Jaeger, and SkyWalking provide.
Given the characteristics of our project, a code-intrusive scheme would require rebuilding a huge number of existing systems at too high a cost, so only a non-intrusive scheme was viable. In the end we implemented a Java Agent with a capability similar to SkyWalking/Pinpoint: the Env-ID propagates through an application onto its subrequests.
In the example below, after application 1 receives a request, it carries the same Env-ID on all of its subrequests.
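The propagation behavior can be sketched as follows. The real mechanism is bytecode instrumentation in a Java Agent; this Python sketch with a context variable and the X-Env-ID header name only illustrates the idea under those assumptions.

```python
# Illustrative sketch of transparent transmission: whatever Env-ID arrives
# on an inbound request is remembered and copied onto every outbound
# subrequest made while handling it.

import contextvars

_env_id = contextvars.ContextVar("env_id", default="Default")

def on_inbound(headers: dict) -> None:
    """Called when a request arrives: remember its environment tag."""
    _env_id.set(headers.get("X-Env-ID", "Default"))

def outbound_headers(headers: dict) -> dict:
    """Called for each subrequest: stamp it with the remembered tag."""
    return {**headers, "X-Env-ID": _env_id.get()}
```

A context variable (rather than a global) keeps concurrent requests from leaking each other's Env-ID, which mirrors how APM agents bind trace context to the handling thread.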
Question 3: when and how does the EnvID take effect?
We now know how the EnvID is generated, transmitted, and propagated, but not yet how it lets us reuse applications across environments. Below we explain in detail how the EnvID takes effect, a process we call intelligent routing.
There are two situations. We demonstrate with application A calling application B, and B calling application C.
The general case
This case is simple: every environment has its own version of each application, calls never cross environments, and each application only calls other applications inside its own environment.
In reality each request is forwarded through the gateway rather than sent over a direct connection.
The Nebula multi-environment platform controls the DNS and the gateway forwarding rules so that every application sends its requests to the gateway, which forwards them according to the rules. In this case that seems pointless, since direct connections would work just as well. Now look at the next scenario.
The special case
In this scenario we achieve the following effect.
It covers essentially every case of application reuse.
In test environment FAT1, the version-1 instance of application A wants to call application B, but FAT1 has no instance of B, so the request goes to B's stable instance in the Default environment. After processing, B's stable instance wants to call application C; FAT1 has no C either, so the request continues on to C's stable instance in Default.
In test environment FAT2, the version-2 instance of application A wants to call B; FAT2 has no B, so the request goes to B's stable instance in Default. After processing, B wants to call C, and FAT2 does have a C instance, so B's stable instance in Default sends the request to C version 2 in FAT2. Messages thus shuttle across environments, and we get the application reuse we set out for.
Messages inside the Default environment flow in the same way. How on earth is this achieved? The real process is roughly as follows.
The gateway rules of each environment are generated automatically by Nebula, which knows the status of every instance in every environment, so it writes rules like this: if an application has an instance in this environment, direct requests for it to that local instance; if not, direct them to the Default gateway.
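The per-environment rule can be sketched as a simple routing table. The instance and gateway addresses are hypothetical, and the real rules are gateway configuration generated by Nebula rather than Python.

```python
# Illustrative sketch of one test environment's gateway rule: route to the
# local instance if the application is deployed here, otherwise send the
# request upstream to the Default environment's gateway.

DEFAULT_GATEWAY = "default-gw:80"             # the Default environment's gateway

FAT1_INSTANCES = {"A.com": "10.0.1.11:8080"}  # apps actually deployed in FAT1

def route(target_domain: str) -> str:
    """Local instance if deployed in this environment, else Default gateway."""
    return FAT1_INSTANCES.get(target_domain, DEFAULT_GATEWAY)
```

With this table, a request for A.com stays inside FAT1, while a request for B.com falls through to the Default gateway, matching the scenario above.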
So when the FAT1 or FAT2 gateway receives a request for application B and its environment has no B instance, the request is sent straight to the Default gateway.
How does the Default gateway work? The Default gateway is an OpenResty server whose capabilities we have slightly extended. Its workflow is as follows:
On receiving a request, it first parses out the Env-ID and the target domain name, i.e. which environment the request came from and which application it wants to reach.
It then uses the target domain name and the Env-ID to look up the instance information in Nebula.
If Nebula says an instance exists for that domain name + Env-ID, it forwards the message to the gateway of that environment; if not, it forwards the message to the corresponding stable instance in the Default environment.
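The three steps above can be sketched as follows. The registry contents, gateway addresses, and instance addresses are made-up examples, and the real logic lives in OpenResty/Lua rather than Python.

```python
# Illustrative sketch of the Default gateway's decision: look up
# (domain, Env-ID) in Nebula; if the test environment has its own instance,
# bounce the request back to that environment's gateway, otherwise serve it
# from the stable instance in Default.

NEBULA_REGISTRY = {                 # (domain, env_id) pairs with a live instance
    ("C.com", "FAT2"): True,
}
ENV_GATEWAYS = {"FAT1": "fat1-gw:80", "FAT2": "fat2-gw:80"}
DEFAULT_INSTANCES = {"B.com": "10.0.0.21:8080", "C.com": "10.0.0.31:8080"}

def default_gateway_route(domain: str, env_id: str) -> str:
    if NEBULA_REGISTRY.get((domain, env_id)):
        return ENV_GATEWAYS[env_id]   # the environment has its own instance
    return DEFAULT_INSTANCES[domain]  # fall back to the stable instance
```

In this toy registry only C has a FAT2 instance, so B-bound traffic from FAT2 stays in Default while C-bound traffic is bounced back to the FAT2 gateway.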
With these rules in hand, look again at the flow above, taking test environment FAT2 as an example:
The Default gateway asks Nebula whether the target application has an instance with Env-ID FAT2. For application B the answer is no, so the request is served by B's stable instance in Default; for application C the answer is yes, so the Default gateway forwards the request to the FAT2 gateway, which delivers it to the version-2 instance of application C.
With this design we reached our goal without modifying any application: whether or not an environment has a corresponding instance, messages flow along the paths we designed, which saves us a large number of instances.
Cost of RPC transformation
Before the project started I studied the experience of many other companies, and I think one of the more challenging aspects is transformation cost. Many large companies designed dyeing and label-based routing into their RPC framework from the start and enforced it across all applications. Our project has a large number of legacy systems without that foundation; retrofitting the RPC layer would mean reworking, repackaging, and retesting every project, at too high a cost in time and labor. So we adopted the current low-cost scheme and treat application-level transformation as a separate long-term plan. The result was a fast, low-cost rollout.
If this groundwork is built into the RPC layer from the start, the goal is easy to reach. There are also less intrusive options, such as a Service Mesh that implements the gateway function in a sidecar, but message propagation still has to be solved.
Redis optimization
Besides RPC, Redis was another technical point we had to conquer. A new environment needs its own Redis data to avoid conflicts, and the usual practice is to create a new Redis instance per environment. But we hit a problem: Redis usage across projects was not uniform. Some used IP + port, some used domain names, some used CacheCloud. Standardizing Redis usage is the long-term answer, but how do we solve the problem that exists today? The common approach is to replace the configuration, which inevitably intrudes on business configuration. To reduce that intrusion we again adopted the Java Agent approach: intercept the application's calls to the Redis interface and add a prefix to every Redis key. This achieves data isolation and saves resources at the same time. The general principle is as follows.
All data from different environments carries a prefix, so environments do not interfere with each other and one Redis instance can be shared by several environments at once. (Had a Redis key-prefix convention been built into the development standards from the start, this problem would have been easy to solve.)
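A hedged sketch of the interception idea, in Python rather than the project's actual Java Agent: the wrapper, the prefix format, and the fake client standing in for Redis are all illustrative assumptions.

```python
# Illustrative sketch of key-prefix interception: every key is transparently
# prefixed with the environment ID, so several environments can share one
# Redis instance without data conflicts.

class PrefixedRedis:
    """Wraps a Redis-like client and namespaces all keys by environment."""
    def __init__(self, client, env_id):
        self._client = client
        self._prefix = env_id + ":"

    def set(self, key, value):
        return self._client.set(self._prefix + key, value)

    def get(self, key):
        return self._client.get(self._prefix + key)

class FakeRedis:
    """Stand-in for a real Redis client; a shared dict plays the instance."""
    def __init__(self, store):
        self._store = store

    def set(self, key, value):
        self._store[key] = value

    def get(self, key):
        return self._store.get(key)

shared = {}                       # one "Redis instance" shared by two environments
fat1 = PrefixedRedis(FakeRedis(shared), "FAT1")
fat2 = PrefixedRedis(FakeRedis(shared), "FAT2")
fat1.set("user:1", "alice")
fat2.set("user:1", "bob")         # same logical key, isolated by prefix
```

Both environments write the same logical key, yet the shared store ends up holding two distinct entries, FAT1:user:1 and FAT2:user:1, which is exactly the isolation the agent provides.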
Front-end optimization
Under the normal plan we would have needed to start 40+ containers for the front end, with low resource utilization. To save resources we finally packed all of a project's static resources into a single image, then used DNS and the gateway to proxy the front-end sites, so that one application can serve dozens of static sites at the same time. This not only saves resources but also makes site management simple and flexible.
Database
At present all environments share one database system, because the database is simply too large: data initialization, reuse, and merging remain major challenges. User testing therefore relies on logical isolation by region and account, which meets current needs, though there is still plenty of room for improvement in data management.
After reading the above, do you have a better understanding of how to use K8s to build a test environment for 1000 applications? Thank you for reading.