

Design Practice of UCloud's 100G Physical Cloud Gateway Cluster

2025-02-24 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/03 Report--

A physical CVM is a dedicated physical server provided by UCloud. It offers excellent compute performance, meets the high-performance and high-stability requirements of core application scenarios, and can be flexibly combined with other products. Physical cloud gateways carry the private-network communication between the physical cloud and public cloud products. Because they must be deployed in many locations, the gateway clusters face cross-region and cross-cluster traffic pressure.

We solved the traffic-overload problem caused by Hash polarization by spreading traffic across multiple tunnels, and we contain elephant flows through capacity management and lossless migration to an isolation zone. Since the new scheme went live, the cluster's capacity has grown from tens of gigabits to hundreds of gigabits of traffic, helping users such as Dada ride out the Singles Day traffic peak smoothly. The following is a share of our practical experience.

I. Physical cloud with overloaded traffic

To ensure high availability of their business on the cloud, users usually deploy across multiple regions. Their physical clouds then need to reach one another through the physical cloud gateways, so the gateways inevitably carry a large amount of cross-cluster access traffic from physical CVMs.

At the same time, to keep different users' network traffic isolated and to prevent arbitrary mutual access inside the data center, the physical cloud gateway encapsulates users' packets in a tunnel before sending them on to the receiver.

1. The problem: Hash polarization and an overloaded physical cloud

As shown in the figure below, we found that the bandwidth of gateway device e in physical cloud cluster 2 was overloaded, affecting all services accessing cluster 2. Monitoring further showed that traffic in cluster 2 was distributed very unevenly: some devices had their bandwidth saturated while the remaining devices carried very little traffic. Packet-capture analysis showed that almost all of device e's traffic came from physical cloud cluster 1.

Figure: encapsulation tunnel for cross-cluster access

Combined with business analysis, we determined that the cause of the overload was Hash polarization of the mutual-access traffic between physical cloud cluster 1 and cluster 2, which left the traffic unevenly distributed.

So what is Hash polarization?

Because a single tunnel is used for transmission between clusters, tunnel encapsulation hides the user's original header information such as IP and MAC, leaving only the tunnel header visible, and that tunnel uses one fixed SIP and DIP. Every device along the path therefore runs the same Hash algorithm over the same inputs and computes the same result, so all the traffic lands on a single device in the cluster and its load spikes. In extreme cases the device is overwhelmed, affecting every user in the cluster. This is Hash polarization, and it typically appears when hashing happens multiple times across devices.
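The collapse described above can be sketched in a few lines. The hash below is an illustrative stand-in for a switch's ECMP hardware hash, not UCloud's actual algorithm; the flow counts and device count are made up for the demonstration:

```python
import hashlib

def ecmp_pick(packet_fields, n_devices):
    """Pick an egress device by hashing the visible header fields (ECMP-style).
    Illustrative stand-in for a switch's hardware hash."""
    digest = hashlib.md5("|".join(packet_fields).encode()).hexdigest()
    return int(digest, 16) % n_devices

# Diverse user flows: distinct inner headers spread across the 8 devices.
inner_flows = [("10.0.0.%d" % i, "10.1.0.%d" % (i * 7 % 250), "tcp") for i in range(100)]
spread = {ecmp_pick(f, 8) for f in inner_flows}

# After single-tunnel encapsulation, every packet shows the same outer SIP/DIP,
# so the hash collapses onto exactly one device: Hash polarization.
outer = ("172.16.0.1", "172.16.1.1", "gre")
collapsed = {ecmp_pick(outer, 8) for _ in inner_flows}

print(len(spread), len(collapsed))  # several devices used vs. exactly 1
```

Any deterministic hash shows the same behavior: once the only varying inputs are hidden inside the tunnel, the output can no longer vary.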

Given this situation, we tried to attack the problem from two angles:

① If user traffic can be spread out, how do we avoid Hash polarization after tunnel encapsulation?

② If user traffic cannot be spread out, how do we keep an "elephant flow" from overwhelming the physical cloud network?

Next, we examine the solutions to these two questions in turn.

2. How to avoid Hash polarization after tunnel encapsulation?

At first, we proposed several solutions to this problem:

① Scheme 1: the switch round-robins user traffic across the devices in the cluster. The advantage is that traffic is fully dispersed and no Hash polarization can occur; the drawback is that packet ordering within a flow is disturbed, which may affect the user's business.

② Scheme 2: the switch hashes on the tunnel's inner packet. Its advantage is that user flows are spread quite evenly across the devices in the cluster. The problem is that a user packet may be fragmented again after tunnel encapsulation; the later fragments carry no inner-header information, so fragments of the same packet hash to different devices.

③ Scheme 3: assign each device in the cluster its own tunnel source IP. This disperses traffic effectively, but because the number of tunnels is limited, Hash imbalance remains evident on the live network.

All three methods fall short to varying degrees and cannot fully solve the Hash polarization problem. After further study we arrived at a multi-tunnel solution: break the gateway's single-tunnel mode by binding a whole network segment of tunnel IPs to every gateway, hash on the user's inner packet header, and select the tunnel SIP and DIP from the pre-allocated segment. Different flows are thus carried in different tunnels as far as possible, which spreads the user traffic out.
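The selection step can be sketched as follows. The segment sizes, addresses, and hash are illustrative assumptions, not UCloud's real addressing plan; the point is only that the inner-header hash picks a (SIP, DIP) pair from the pre-allocated ranges, so downstream ECMP sees many distinct tunnels:

```python
import hashlib
import ipaddress

# Hypothetical pre-allocated tunnel segments (illustrative sizes).
SRC_IPS = [str(ip) for ip in ipaddress.ip_network("172.16.0.0/28").hosts()]
DST_IPS = [str(ip) for ip in ipaddress.ip_network("172.16.1.0/28").hosts()]

def pick_tunnel(inner_src, inner_dst, inner_proto):
    """Choose the tunnel SIP/DIP from the pre-allocated segments by hashing
    the inner (user) packet header, so different flows take different tunnels
    while one flow always stays on one tunnel (preserving packet order)."""
    h = int(hashlib.md5(f"{inner_src}|{inner_dst}|{inner_proto}".encode()).hexdigest(), 16)
    return SRC_IPS[h % len(SRC_IPS)], DST_IPS[(h // len(SRC_IPS)) % len(DST_IPS)]

flows = [("10.0.0.%d" % i, "10.1.0.%d" % i, "tcp") for i in range(200)]
tunnels = {pick_tunnel(*f) for f in flows}
print(len(tunnels))  # many distinct (SIP, DIP) pairs for downstream ECMP to spread
```

Because the mapping is a pure function of the inner header, a given flow is pinned to one tunnel, avoiding the reordering problem of Scheme 1 while still defeating polarization.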

Figure: multi-tunnel solution schematic

3. How to prevent an "elephant flow" from overwhelming the physical cloud network?

The multi-tunnel solution presupposes that user traffic can be spread out, but what about an "elephant flow"? Even multiple tunnels cannot keep one huge flow from saturating a single device. Facing users' elephant flows, technical means alone are not enough; we also need to prevent and mitigate in advance on the hardware-capacity side.

■ Single-machine capacity management

First, we apply sensible capacity management to the physical cloud gateway: each gateway's bandwidth must exceed that of any user's physical CVM, and the carrying capacity of the whole cluster must meet aggregate user demand.
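The sizing rule above is simple arithmetic. The sketch below uses made-up device counts and per-user bandwidths purely to show the headroom check, including the common practice of discounting one device for redundancy (an assumption on our part, not a stated UCloud policy):

```python
def cluster_headroom(device_gbps, n_devices, user_gbps_list, redundancy=1):
    """Gbps of spare capacity after serving all users, assuming `redundancy`
    devices may be lost. Purely illustrative arithmetic."""
    usable = device_gbps * (n_devices - redundancy)
    demand = sum(user_gbps_list)
    return usable - demand

# Raising single-machine capacity from 10G to 25G (as in the figure):
print(cluster_headroom(10, 8, [15, 20, 25]))  # 10G devices: 70 - 60 = 10, thin margin
print(cluster_headroom(25, 8, [15, 20, 25]))  # 25G devices: 175 - 60 = 115
```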

Figure: example - raising single-machine capacity from 10G to 25G

In practice this comes down to the cloud vendor's capability. At present the carrying capacity of a single UCloud gateway cluster is far greater than any single user's traffic, so even while carrying the aggregated traffic of many users, a sudden elephant flow from an individual user will not break the gateway.

■ Lossless migration via the isolation zone

Raising single-machine capacity alone is not enough. As a further safeguard, UCloud also provisions an isolation zone, which normally carries no traffic.

Figure: lossless migration via the isolation zone

As shown in the figure above, once excessive traffic is detected and the cluster is at risk of being overwhelmed, the cluster's automatic migration system modifies the database records of the physical machines that need to move and automatically updates the corresponding forwarding rules, so part of the business traffic is taken over by the isolation zone. At the same time, we automatically verify the migration result with strong verification techniques to ensure the migration is lossless and reliable.
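The migration flow described above (update record, push rule, verify, roll back on failure) can be outlined as follows. The database, forwarding table, and verification hook here are hypothetical stand-ins for UCloud's internal systems:

```python
def migrate_to_isolation_zone(db, forwarding, hosts, verify):
    """Move the given hosts' forwarding to the isolation zone, then verify
    each migrated host; roll back any host that fails verification so
    traffic is never silently dropped. Illustrative sketch only."""
    for host in hosts:
        db[host]["gateway"] = "isolation-zone"   # 1. update the database record
        forwarding[host] = "isolation-zone"      # 2. push the new forwarding rule
    failed = [h for h in hosts if not verify(h)]  # 3. strong verification
    for host in failed:                           # 4. roll back anything that failed
        db[host]["gateway"] = "main-cluster"
        forwarding[host] = "main-cluster"
    return failed

db = {"pm-1": {"gateway": "main-cluster"}, "pm-2": {"gateway": "main-cluster"}}
forwarding = {"pm-1": "main-cluster", "pm-2": "main-cluster"}
failed = migrate_to_isolation_zone(
    db, forwarding, ["pm-1"],
    verify=lambda h: forwarding[h] == "isolation-zone")
print(failed)  # [] - the migration verified cleanly
```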

4. Example: user applications under the old and new schemes

Before the new scheme went live, Hash polarization usually limited a cluster to a few tens of gigabits of traffic, and overloads occurred from time to time.

After the new scheme went live, the monitoring chart below shows traffic spread essentially evenly across the cluster, finally realizing the cluster's advantages: it now carries hundreds of gigabits of traffic and can fully absorb sudden surges in user business. Take Dada's Singles Day (Double 11) traffic as an example: around 60G was typical, and when bursts reached 100G the cluster still forwarded traffic normally, with no impact on the business.

Figure: traffic monitoring diagram

In addition to improving performance, the high availability design has also been optimized in this cluster upgrade.

II. High availability optimization after cluster upgrade

For a cluster upgrade, the usual practice is to deploy a new grayscale cluster first and migrate user business to it gradually. The benefit is that if the new cluster version has defects, the blast radius stays controlled: when a failure occurs, the affected user business can be moved back to the old cluster in time.

Figure: expected result-new Manager takes over grayscale cluster

A problem surfaced during the grayscale rollout.

After the new cluster's Manager was deployed, a configuration error caused the grayscale Manager to take over the old cluster. The Manager takes control of a cluster automatically based on the cluster information in its configuration file and pushes configuration directly, so the old cluster accepted the wrong configuration. Because the old and new cluster configurations differ greatly, the old cluster misinterpreted the new configuration, causing a high-availability exception.

Figure: the grayscale Manager mistakenly takes over the old cluster

1. Risk analysis

To avoid this class of problem systematically, we reviewed the configuration process retrospectively and summarized the risks:

① Too much human intervention in deployment increases the probability of failure.

② The programs' exception protection is insufficient.

③ Isolation between clusters is insufficient, so a single fault has a wide blast radius.

2. Optimization: automatic operation and maintenance & program optimization & isolation of influence

■ Automated operations

Automated operations replace manual steps and thereby effectively avoid human error. We optimized the cluster deployment process and split it into two stages, configuration storage and deployment: operations staff only enter the essential configuration items, and everything else is generated and deployed automatically.

■ Improved validation and alarms

In addition, we hardened the programs with checks for abnormal configuration. For example, before configuration is loaded it must first pass a whitelist filter; if the configuration is found to be abnormal, loading is terminated and an alarm is raised for manual intervention.
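The whitelist gate described above can be sketched as below. The manager names, config shape, and alarm hook are all illustrative assumptions; the point is the order of operations: check the source against the whitelist first, and on mismatch raise an alarm and abort the load rather than apply anything:

```python
# Hypothetical whitelist: the set of Managers this device accepts config from.
ALLOWED_MANAGERS = {"manager-cluster-2"}

def load_config(source_manager, config, alarm):
    """Apply `config` only if it was pushed by a whitelisted Manager;
    otherwise raise an alarm and terminate the load. Illustrative sketch."""
    if source_manager not in ALLOWED_MANAGERS:
        alarm(f"rejected config from {source_manager}")  # alert for manual follow-up
        return False                                     # loading terminated
    # ... apply `config` to the device here ...
    return True

alerts = []
print(load_config("manager-grayscale", {}, alerts.append))  # False: not whitelisted
print(load_config("manager-cluster-2", {}, alerts.append))  # True: correct Manager
```

Had such a gate been in place, the grayscale Manager's erroneous push would have been rejected by the old cluster instead of being applied.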

Figure: the whitelist restricts the program to accepting configuration only from the correct control plane

■ Isolating the impact

Finally, no matter how sophisticated the automated operations and the programs themselves become, we always assume that exceptions are possible. On that premise, we must also minimize the scope and duration of any failure. Our measures are as follows:

► Remove common dependencies

The earlier problem arose mainly because every device in the cluster depended on the same faulty Manager, so all of them were lost at once. Common dependencies among cluster devices must therefore be removed to shrink the blast radius; for example, binding different clusters to different Managers effectively limits the scope of impact. Of course, a cluster's common dependency is not necessarily a Manager; it may also be an IP, a rack, and so on, which we must identify carefully in real projects.

► Set up an isolation zone

Once the blast radius is contained, a Manager exception affects only some devices in the cluster. In that case, the abnormal devices should be removed quickly, or all users under the cluster migrated to the isolation zone as soon as possible.

Summary

As technology develops and business expands, system architectures grow ever more complex and interdependent, and the demands on engineers keep rising. In the evolution of the physical cloud gateway cluster there were inevitably many pitfalls, but one principle holds at all times: all technology serves the business. We share this design experience in the hope that it gives you more to think about and take away.
