
UCloud High performance RoCE Network Design


E-commerce, live streaming, and similar services must complete request-response cycles extremely quickly, and the rapid improvement of compute and storage is driving the adoption of new workloads such as HPC, distributed training clusters, and hyper-converged infrastructure. The network has become one of the main factors limiting performance. For this reason, we designed a low-overhead, high-performance RoCE network and built a large, low-latency, lossless Ethernet data center. As the underlying foundation for RDMA and related technologies, it also lays a solid groundwork for UCloud's future physical network construction.

I. Low-overhead, high-performance lossless network selection

When an ordinary intranet exchanges packets, it usually relies on the kernel TCP/IP stack or on DPDK. Both schemes depend on software to encapsulate and decapsulate the protocol stack, which consumes a great deal of host CPU. There is an alternative: RDMA, which lets the NIC handle protocol encapsulation and decapsulation directly, consuming no host CPU and effectively reducing data-processing latency.

RDMA does not prescribe the entire protocol stack, i.e., exactly what every field of the physical link layer, network layer, and transport layer looks like and how it is used, but it places high demands on a lossless network:

-packets must not be dropped; the latency caused by retransmission is enormous.

-throughput must be high, ideally running at full line rate.

-the lower the latency, the better; 100 µs is already too long.

According to the above requirements, there are three mainstream network schemes:

Figure: mainstream RDMA network solutions

① InfiniBand: this scheme redesigns the physical link layer, network layer, and transport layer. It is the original deployment scheme for RDMA, so dedicated InfiniBand switches are used to build a physically isolated private network. It is expensive, but offers the best performance.

② iWARP: this scheme aims to let mainstream Ethernet support RDMA by porting InfiniBand semantics onto the TCP/IP stack and using TCP to guarantee no packet loss. The drawback is that TCP's overhead is high and its algorithms are complex, so performance is poor.

③ RoCEv2: this scheme also aims to let mainstream Ethernet support RDMA (the RoCEv1 version is rarely mentioned nowadays). On the network side, PFC guarantees zero packet loss during congestion; on the NIC side, the DCQCN congestion control algorithm further alleviates congestion (this algorithm requires ECN marking support on the network side). By adding PFC and ECN, traditional Ethernet evolves into lossless Ethernet, and RDMA running on lossless Ethernet performs far better.

There are many mature deployments of the RoCEv2 scheme (hereafter simply RoCE), and it is the scheme we chose to pursue. However, RoCE still has problems that need to be solved, such as the unfairness of PFC throttling, the deadlock risk introduced by PFC propagation, the large number of parameters to tune, and the lag of ECN marking (ECN probabilistic marking is a software polling mechanism).

II. Network design goals

Porting RoCE onto a classic data center network is not easy.

Today's data centers use the common CLOS architecture, in which LCS is the aggregation switch and LAS is the ToR switch. If RoCE runs directly on top of this, the problem is obvious: when an incast event occurs, packets that cannot be forwarded are held in the switch cache, but the cache is not infinite. Once it fills up, packets are dropped, and RDMA clearly cannot tolerate packet loss at that frequency.

Figure: schematic CLOS architecture

The above is only a simple example; the real problem is somewhat more complicated. Before designing, we need to be clear about what we are aiming for.

To put it simply, our goals are:

-under all kinds of traffic models, the network bandwidth should be fully utilized.

-cache usage should be as low as possible.

-in extreme cases, packets must not be dropped even when the cache fills up.

To sum up, for RoCE to run on the existing network, we need to work on three aspects:

① QoS design: a series of relatively independent forwarding actions such as enqueueing, scheduling, and shaping.

② Lossless design: one of RDMA's requirements, implemented with PFC. Losslessness is a baseline guarantee: it ensures availability even under the worst congestion, so upper-layer applications can send data without worrying about packet loss (PFC is therefore not a means of slowing traffic down).

③ Congestion control design: implemented with DCQCN. Congestion control is a further optimization on top of the baseline guarantee: at the onset of congestion, the servers at both ends are told to slow down at the source, which solves the problem at its root.

One more point about the downside of congestion: when congestion occurs, the cache must be used. Although using the cache avoids packet loss, the result is increased latency, and throughput cannot grow at all. There are many potential congestion points in the network; every hop may become one. In the network pictured above, there can be up to three congestion points.

How much latency can the use of the cache cause?

At 25 Gbps, draining 25 Mb of cached data takes about 1 ms to send, and 25 Mb is only about 3.1 MB, while a common Broadcom Trident 3 chip has a 32 MB cache.
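To make the arithmetic concrete, here is a minimal sketch (our own illustration, not part of the original design) that converts an amount of buffered data into drain latency on a given link:

```python
def drain_latency_ms(buffered_bytes: float, link_gbps: float) -> float:
    """Time in milliseconds needed to drain `buffered_bytes` of cached data over a link."""
    bits = buffered_bytes * 8
    return bits / (link_gbps * 1e9) * 1e3

# 3.1 MB of buffered data on a 25 Gbps link takes roughly 1 ms to drain,
# while a fully used Trident 3 class 32 MB buffer corresponds to ~10 ms of queueing delay.
print(drain_latency_ms(3.1e6, 25))   # ~0.99 ms
print(drain_latency_ms(32e6, 25))    # ~10.2 ms
```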

With these three aspects understood, we can break the problem down and tackle each part in turn.

III. QoS design

QoS design comes down to enqueueing (classification), scheduling, policing, and shaping.

Enqueueing can be done by trusting a tag such as DSCP, ToS, or CoS, or by using policies that match other packet fields. The strategy we finally chose is: at the IDC boundary, use policies that match packet fields to enqueue and rewrite the DSCP; inside the IDC, enqueue purely according to DSCP (minimizing policy use inside the IDC to preserve high-speed forwarding). This both preserves trust in DSCP tags and reduces policy complexity inside the IDC. Following this idea, we set the corresponding policies:

-for ToR downlink ports and Border uplink ports: match specific packets and place them into specific queues.

-for the remaining devices and ports: trust DSCP and enqueue by mapping.

Expressed as tables:

■ Enqueueing at the IDC border

Order | Match                        | Action
1     | udp_dport==4791 && dscp==48  | Queue 6
2     | udp_dport==4791 && dscp==46  | Queue 5
Other | Other                        | Rewrite DSCP to a predefined value

* This is the existing marking strategy: services inside the IDC are classified and marked with specific DSCP values.

* Rules 1 and 2 are deployed only on the ToR switches of the RoCE network.

■ DSCP mapping inside the IDC

DSCP  | Queue
48    | 6
46    | 5
Other | 2
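Read purely as an illustration (this is not switch configuration syntax, and the rewrite value used for "other" traffic below is an assumed placeholder), the two tables translate into roughly this classification logic:

```python
ROCE_UDP_DPORT = 4791          # RoCEv2 traffic is carried over UDP destination port 4791
DEFAULT_DSCP = 0               # assumed placeholder for the "predefined" rewrite value

def internal_enqueue(dscp):
    """Inside the IDC: trust DSCP only and enqueue by mapping."""
    return {48: 6, 46: 5}.get(dscp, 2)

def border_classify(udp_dport, dscp):
    """IDC-border (ToR/Border) policy: match packet fields, return (queue, dscp)."""
    if udp_dport == ROCE_UDP_DPORT and dscp == 48:
        return 6, dscp                   # queue 6: later used as the strict-priority class
    if udp_dport == ROCE_UDP_DPORT and dscp == 46:
        return 5, dscp                   # queue 5: RoCE data traffic
    # Everything else: rewrite DSCP, then fall back to the internal mapping.
    return internal_enqueue(DEFAULT_DSCP), DEFAULT_DSCP
```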

Next comes the scheduling design. The objects being scheduled are the data already in the cache; that is, scheduling only takes effect under congestion, and once it takes effect it determines how much traffic each queue gets.

With that understanding, we started the scheduling design. A typical RoCE network carries the following queues (or traffic classes):

① protocol signalling, which currently means only CNP traffic (other protocols do not cross hops, so they are not considered);

② RoCE traffic

③ business / management traffic.

These three major classes of traffic can be subdivided further. Following the scheduling model recommended by ETC, we chose SP+WDRR scheduling: class 1 traffic has absolute priority and is scheduled first whenever the cache backs up, until its queue is empty; class 2 and class 3 traffic come next, and their weights can be flexibly assigned under WDRR scheduling. This ensures a CNP message is forwarded back to the traffic source server within 3 µs (the single-hop latency of an uncongested network is under 1 µs). A simplified sketch of this scheduling discipline follows below.
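The sketch below is our own minimal model of SP+WDRR, not vendor code; the quantum and the weights are assumed values used only to show the mechanism:

```python
from collections import deque

class SpWdrrScheduler:
    """Minimal sketch of SP+WDRR: one strict-priority queue is always drained first;
    the remaining queues share bandwidth by weighted deficit round robin."""

    QUANTUM = 1500  # bytes of credit added per weight unit each round (assumed value)

    def __init__(self, sp_queue, weights):
        self.sp_queue = sp_queue              # e.g. queue 6 carrying CNP
        self.weights = dict(weights)          # e.g. {5: 80, 2: 20} -- assumed weights
        self.queues = {q: deque() for q in [sp_queue, *weights]}
        self.deficit = {q: 0 for q in weights}

    def enqueue(self, q, pkt_len):
        self.queues[q].append(pkt_len)

    def next_packet(self):
        # Strict priority: CNP-class traffic preempts everything else.
        if self.queues[self.sp_queue]:
            return self.sp_queue, self.queues[self.sp_queue].popleft()
        # Weighted deficit round robin over the remaining queues.
        for q, w in self.weights.items():
            if not self.queues[q]:
                self.deficit[q] = 0           # idle queues do not accumulate credit
                continue
            self.deficit[q] += w * self.QUANTUM
            if self.queues[q][0] <= self.deficit[q]:
                pkt = self.queues[q].popleft()
                self.deficit[q] -= pkt
                return q, pkt
        return None
```

The point of the deficit counter is that a queue may only transmit once it has accumulated enough credit, so long-term bandwidth follows the configured weights even with mixed packet sizes.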

The scheduling design above has a loophole: if queue 6 carries too much traffic, the low-priority queues may be starved (that is, never scheduled for a long time). Although queue 6 traffic is in theory only tens to hundreds of Mbps, we still need to guard against malicious server behavior, so we rate-limit the SP queue. This is what policing and shaping are for.

IV. Lossless design and analysis

RoCE traffic must be guaranteed to run in a lossless queue, which is implemented with PFC: Pause frames are sent for a specific queue, forcing the upstream device to stop transmitting.

In Broadcom's XGS-series chips there is a cache management unit, the MMU (referred to below simply as the cache), which stores packets that have been received but not yet forwarded and does its accounting on both ingress and egress: a cell used by a packet entering on port 0/1 and leaving on port 0/2 is charged both to the ingress account of 0/1 and to the egress account of 0/2 (the cell is the smallest unit of cache resources).

The cache sets an upper limit for each ingress and egress; beyond that limit no more cells can be used to cache packets. Below this upper limit, many other waterlines are drawn, and each ingress and egress is further subdivided so that accounting can be done per queue. In the ingress direction, the cache is subdivided into PG-Guaranteed, PG-Share, and Headroom; in the egress direction, into Queue-Guaranteed and Queue-Share (as shown in the figure below; we ignore per-port accounting and consider only queues).

Figure: ingress and egress cache partitioning

When the cache is used, it is always claimed from the bottom block upward, so we prefer to call the size of each block a "waterline". When a given block is in use, we say the cache level has reached that waterline; for example, when the PG-Share block is in use, the ingress cache level has reached the PG-Share waterline. If all blocks are used up, packets are dropped, which is called no-buffer packet loss.

Each of these sizes has its own purpose. Let us first briefly review what each one does, and then discuss how to set the five waterlines for a lossless queue.

► PG-Guaranteed and Queue-Guaranteed are guaranteed caches: they are exclusive and cannot be preempted by other queues even when unused.

► PG-Share and Queue-Share are shared caches. Their size is not fixed because the waterline is dynamic: if many queues are using the shared cache, the share of each queue is small. PG-Share also plays another important role: it is the trigger point for PFC, also called the xoff waterline. As soon as the cache level reaches this waterline, PFC is sent out of the port; once the level drops back a little, transmission returns to normal.

► Headroom is a special waterline that only functions in lossless queues. Think about it: after a PFC frame is sent, can the traffic really stop instantly? The answer is no, because there is still data in flight on the cable, and the peer's forwarding and processing time must also be accounted for. That is exactly what the Headroom space is for (a simplified sketch of how these blocks are consumed follows below).
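The sketch below is our own simplification, not the Broadcom MMU/SDK logic; it shows the order in which the ingress blocks are consumed and where PFC and Headroom come in:

```python
def ingress_admit(pg, cells_needed=1, pfc_out=None):
    """Simplified ingress admission for one priority group (PG) of a lossless queue.
    `pg` is a dict tracking per-block usage; `pfc_out` collects PFC pause events."""
    pfc_out = pfc_out if pfc_out is not None else []

    # 1. Guaranteed block: exclusive, always usable by this PG.
    if pg["guaranteed_used"] + cells_needed <= pg["guaranteed_size"]:
        pg["guaranteed_used"] += cells_needed
        return "accepted", pfc_out

    # 2. Shared block: bounded by the dynamic PG-Share (xoff) waterline.
    if pg["share_used"] + cells_needed <= pg["share_limit"]:
        pg["share_used"] += cells_needed
        return "accepted", pfc_out

    # 3. xoff reached: emit PFC for this priority and absorb in-flight data in Headroom.
    pfc_out.append(("PAUSE", pg["priority"]))
    if pg["headroom_used"] + cells_needed <= pg["headroom_size"]:
        pg["headroom_used"] += cells_needed
        return "accepted_in_headroom", pfc_out

    # 4. Headroom exhausted: no-buffer packet loss (should not happen if sized correctly).
    return "dropped", pfc_out
```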

1. PG-Guaranteed and Queue-Guaranteed

With the basics covered, let us return to the network design. First, the PG-Guaranteed and Queue-Guaranteed waterlines, which have little to do with "losslessness". This guaranteed cache only needs to cover the switch's basic store-and-forward function, so it can be configured to one packet's worth, calculated for the worst case, i.e., a jumbo frame with MTU=9216.

In practice, though, we do not need to worry about it: thanks to the dynamic waterline, there is always some remaining shared cache available, so we simply keep the vendor's default configuration.

2. Queue-Share

Next is the Queue-Share waterline. In a lossless queue we want PFC back-pressure to be triggered before any cache packet loss, so under all circumstances the ingress PG-Share waterline must be reached first and the egress Queue-Share waterline must never be reached (reaching PG-Share triggers PFC; reaching Queue-Share causes packet loss).

As mentioned earlier, the MMU charges each cell to both an ingress and an egress account. From this perspective, the worst case is many-to-one (incast): all the egress accounting lands on a single queue while the ingress accounting is spread across many different ports. To guarantee that the egress waterline is never reached, we simply configure it as unlimited. This turns out to be safe, because the ingress PG-Share waterline is dynamic and will always be triggered before the buffer goes bankrupt.

With that, Queue-Share seems settled, but not quite: what if TCP traffic is mixed in? This is a serious problem. TCP's lossy queues will eat a lot of cache, so the Queue-Share waterline of the lossy queues must also be capped.

3. PG-Share

As long as the PG-Share waterline is configured as a dynamic waterline, its size can be adjusted fairly freely, but one inequality must hold: (PG-Share + PG-Guaranteed + Headroom) × [number of ingresses] ≤ Queue-Share + Queue-Guaranteed

The formula describes a many-to-one scenario. The number of ingresses should be chosen as a suitably large value based on the actual situation (for a ToR with 32 25G downlinks and 8 100G uplinks, the worst case is 39-to-1).

PG-Share here is a dynamic waterline, expressed by a simple formula: PG-Share = [remaining buffer] × α

α here is the scaling factor, which the user can adjust freely; it determines the size of the PG-Share waterline. Given the inequality above, we only need to set the Queue-Share waterline to the static maximum and PG-Share to dynamic, and the ingress α can in principle be arbitrary. Of course, the ingress α must not be set too small: in a many-to-one scenario, a very low ingress level, spread across the egresses, produces an even lower egress level. When the egress level is too low, the existing ECN configuration may effectively stop working (for example, the egress level may never reach half of Kmax). In our experience, the α of PG-Share in a lossless queue can be configured to 1/8, 1/4, or 1/2; the exact value should be decided together with the ECN parameters in the congestion control design.
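As a sanity check only, the dynamic waterline and the inequality above can be written out as follows; all numbers in the example are illustrative, not our production values:

```python
def pg_share_waterline(remaining_buffer_cells: int, alpha: float) -> float:
    """Dynamic ingress waterline: PG-Share = remaining buffer * alpha."""
    return remaining_buffer_cells * alpha

def waterlines_safe(pg_share, pg_guaranteed, headroom,
                    n_ingress, queue_share, queue_guaranteed) -> bool:
    """(PG-Share + PG-Guaranteed + Headroom) * n_ingress <= Queue-Share + Queue-Guaranteed."""
    return (pg_share + pg_guaranteed + headroom) * n_ingress <= queue_share + queue_guaranteed

# Illustrative 39-to-1 incast on a ToR (32x25G down + 8x100G up), alpha = 1/8,
# with the egress waterline set to a very large static value as described above.
pg_share = pg_share_waterline(remaining_buffer_cells=100_000, alpha=1/8)
print(waterlines_safe(pg_share, pg_guaranteed=40, headroom=408,
                      n_ingress=39, queue_share=1_000_000, queue_guaranteed=40))
```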

4. Headroom

The Headroom waterline is very important, but a reasonable configuration can be obtained by experiment plus derivation. Start from this equation: [Headroom size] = [time from PFC construction to traffic stopping] × [port rate] / [number of bits occupied by a 64-byte packet]

64-byte packets are used because they have the lowest cache utilization: a single cell is more than 200 bytes but can only be occupied by a single packet. Only [time from PFC construction to traffic stopping] needs further decomposition: t = Tm1 + Tr1 + Tm2 + Tr2

* Tm1: the time from the downstream PG detecting that xoff has been reached to the PFC frame being constructed and sent.

* Tr1: the time for the PFC frame to travel from downstream to upstream.

* Tm2: the time from the peer receiving the PFC frame to the queue actually stopping.

* Tr2: the time for the packets already on the cable to finish transmitting after the queue stops.

Of these four times, only the cable length is a variable. After further simplification we get: [Headroom size] = (Tm1 + Tm2 + 2 × [cable length] / [signal propagation speed]) × [port rate] / [number of bits occupied by a 64-byte packet]

Tm1 + Tm2 is a constant that can be measured experimentally; the rest are known quantities. Plugging into the formula, we get H = 408 cells for a 100G port with 100 m of fiber, and H = 98 cells for a 25G port with a 15 m AOC. In actual deployment you should of course add some margin, since these are critical values.
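A minimal calculator along the lines of the formula above; the reaction time Tm1+Tm2, the propagation speed, and the per-packet wire overhead are assumptions that must be measured or taken from your own hardware, and the article's 408-cell and 98-cell results depend on the Tm1+Tm2 actually measured:

```python
import math

def headroom_cells(port_gbps: float, cable_m: float,
                   tm1_plus_tm2_s: float = 2e-6,       # assumed switch+NIC reaction time; measure on your hardware
                   propagation_mps: float = 2e8,       # ~2/3 of c in fiber/copper
                   wire_bits_per_64B_pkt: int = (64 + 8 + 12) * 8  # frame + preamble + inter-packet gap
                   ) -> int:
    """Headroom = (Tm1 + Tm2 + 2*L/v) * rate / bits-per-64B-packet, rounded up.
    Each 64-byte packet occupies one whole cell, which is the worst case for cache usage."""
    t = tm1_plus_tm2_s + 2 * cable_m / propagation_mps
    return math.ceil(t * port_gbps * 1e9 / wire_bits_per_64B_pkt)

print(headroom_cells(100, 100))  # 100G port, 100 m fiber
print(headroom_cells(25, 15))    # 25G port, 15 m AOC
```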

5. Deadlock analysis and resolution

No discussion of PFC is complete without deadlock. Deadlock is very harmful: because PFC propagates, a deadlock can quickly spread across the whole network and stop every lossless queue. There is a lot of research on deadlock; one of the more detailed treatments is the Microsoft paper "Deadlocks in Datacenter Networks: Why Do They Form, and How to Avoid Them".

One of the necessary conditions for deadlock is a CBD (cyclic buffer dependency). Our environment is a typical CLOS network, so in steady state there is no CBD and no deadlock risk. Moreover, routes inside the whole POD are not filtered, every device has full routing detail of the others, and the aggregation layer has 4-8 sets of redundancy; even with two simultaneous failures, the converged topology has no CBD, i.e., no deadlock risk.

Figure: CBD and deadlock

At this point the steady-state deadlock is solved, but we still have to ask: can a CBD appear during route convergence? Careful analysis shows that it can. We examined many convergence scenarios, and some of them do produce transient micro-loops; wherever there is a micro-loop, there is a CBD. We have in fact genuinely reproduced a deadlock caused by a micro-loop.

The deadlock problem still has to be solved. We use three methods:

1. For the various micro-loop scenarios, design the network protocol to control the convergence order so that micro-loops never appear.

2. For other unknown deadlock risks, use the switch's deadlock-detection function to release the cache (releasing the cache causes packet loss, but the convergence process itself already involves reordering and loss).

3. Appropriately raise the PG-Share waterline and rely on DCQCN congestion control to throttle traffic as much as possible.

V. Congestion control design and analysis

Network congestion control is a very complex topic; here we only cover some basic design ideas.

The congestion control algorithm used by RoCE is DCQCN; the paper "Congestion Control for Large-Scale RDMA Deployments" describes the algorithm in detail.

Briefly: the nodes that run the algorithm are the servers at both ends of the flow, while the switches in the middle act as transit nodes that advertise whether they are congested. The sender is called the Reaction Point (RP), the receiver the Notification Point (NP), and the intermediate switch the Congestion Point (CP). The sender (RP) starts transmitting at full speed; if there is congestion along the path, packets are ECN-marked to signal it. When a marked packet reaches the receiver (NP), the NP responds with a CNP message to notify the sender (RP). An RP that receives a CNP slows down; when no CNPs arrive, it starts speeding up again.

That is the basic idea of DCQCN. The full algorithm is quite complex, but it all builds on this idea and refines the details (below are the NP state machine and the RP algorithm). There are also many tunable parameters: how much should the rate be reduced? How aggressively should it be increased? How is the congestion state maintained? How long is the congestion update period? How sensitive should the response to CNP messages be? All of these need reasonable values found by modeling the traffic.

Figure: NP (receiver) state machine

Figure: RP (sender) algorithm
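The following is a rough sketch of the RP (sender) side of DCQCN as described in the paper; the parameter defaults shown are illustrative, not the NIC's or UCloud's tuned values:

```python
class DcqcnReactionPoint:
    """Simplified sketch of the DCQCN sender (RP) rate-control loop, following the
    description in "Congestion Control for Large-Scale RDMA Deployments"."""

    def __init__(self, line_rate_gbps=25.0, g=1/256, rate_ai_gbps=0.04, recovery_rounds=5):
        self.rc = line_rate_gbps      # current rate: start at full line rate
        self.rt = line_rate_gbps      # target rate
        self.alpha = 1.0              # congestion estimate
        self.g = g                    # alpha averaging gain (illustrative default)
        self.rate_ai = rate_ai_gbps   # additive-increase step (illustrative default)
        self.recovery_rounds = recovery_rounds
        self.rounds_since_cnp = 0

    def on_cnp(self):
        """Receiver saw ECN marks and sent a CNP: cut the rate."""
        self.rt = self.rc
        self.rc = self.rc * (1 - self.alpha / 2)
        self.alpha = (1 - self.g) * self.alpha + self.g
        self.rounds_since_cnp = 0

    def on_increase_timer(self):
        """Periodic rate increase when no CNP has arrived recently."""
        self.alpha = (1 - self.g) * self.alpha
        self.rounds_since_cnp += 1
        if self.rounds_since_cnp > self.recovery_rounds:
            self.rt += self.rate_ai           # additive increase after fast recovery
        self.rc = (self.rt + self.rc) / 2     # move the current rate toward the target
```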

In DCQCN there are many tunable parameters at the RP, NP, and CP. The RP and NP sit on the server, or more precisely on the NIC, whose initialization parameters are already close to optimal and need no adjustment, so it is the parameters on the CP that need tuning.

The three parameters on the CP are actually the three WRED-ECN parameters: Kmin, Kmax, and Pmax. Their relationship is shown in the figure below: the horizontal axis is the egress queue length, and the vertical axis is the probability that a packet is marked. As the figure shows, once the queue length exceeds Kmax, the marking probability jumps from Pmax straight to 100%.
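For reference, the idealized marking curve can be written as the function below; the Kmin/Kmax/Pmax values in the usage line are placeholders, not recommendations, and real chips deviate from this ideal curve for the reasons discussed later:

```python
import random

def ecn_mark_probability(queue_len_cells: int, kmin: int, kmax: int, pmax: float) -> float:
    """Idealized WRED-ECN marking curve: 0 below Kmin, linear ramp from 0 to Pmax
    between Kmin and Kmax, and 100% above Kmax."""
    if queue_len_cells <= kmin:
        return 0.0
    if queue_len_cells > kmax:
        return 1.0
    return pmax * (queue_len_cells - kmin) / (kmax - kmin)

def should_mark(queue_len_cells, kmin=400, kmax=1400, pmax=0.01):
    # Kmin/Kmax/Pmax here are placeholders only.
    return random.random() < ecn_mark_probability(queue_len_cells, kmin, kmax, pmax)
```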

Based on the theoretical analysis above, we can approach the optimal settings step by step through experimental verification and trial and error.

Now picture a congestion scenario: when the egress queue length is below Kmin, no packets are marked and the queue length may grow steadily; once it exceeds Kmin, DCQCN starts slowing the senders down.

The size of Kmin therefore determines the baseline latency of the RoCE network. The packets sitting in these caches have been sent but not yet acknowledged by the receiver; we call them in-flight bytes, and their amount is roughly the bandwidth-delay product. So the rule for configuring Kmin is: keep it below the expected bandwidth-delay product. With this theoretical basis, measurement matches theory, and the value can be further tuned according to the measured latency.
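A small helper makes the Kmin rule concrete; the 40 µs latency budget and the 256-byte cell size are assumptions chosen only for illustration:

```python
def kmin_upper_bound_cells(link_gbps: float, target_delay_us: float, cell_bytes: int = 256) -> int:
    """Upper bound for Kmin: the bandwidth-delay product expressed in cells.
    cell_bytes is an assumption (a cell is 'more than 200 bytes' on this chip family)."""
    bdp_bytes = link_gbps * 1e9 / 8 * target_delay_us * 1e-6
    return int(bdp_bytes // cell_bytes)

# e.g. a 25 Gbps port with a 40 us latency budget gives a BDP of 125 KB, i.e. a few hundred cells.
print(kmin_upper_bound_cells(25, 40))
```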

We reason about Kmax the same way: the rule for configuring Kmax is to keep it at or below the tolerable bandwidth-delay product. But it is not that simple this time, because Kmax also determines the slope of the curve, and the slope is also determined by Pmax. Before discussing Kmax and Pmax, we have to talk about the ideal and the reality of ECN.

Ideally, the marking probability changes continuously over the domain Kmin to Kmax, and the queue length is measured accurately. In reality, the Broadcom chip SDK uses software polling to measure the queue length, exponentially averages the instantaneous value with the historical one, and computes the marking probability from that. The consequence of software polling is that the marking probability changes discontinuously over Kmin to Kmax; the consequence of exponential averaging is that the measured queue length lags (exponential averaging also has benefits, which we will not expand on here).

The impact is that the theoretical derivation of Pmax, and even of Kmin and Kmax, gets overturned. Consider: ideally, what is the maximum effective Pmax for a 25G port with a single QP session?

According to the NP algorithm in DCQCN, if multiple CE-marked packets are received within 50 µs, only one is treated as valid, so the highest useful CE marking rate is 20,000 packets per second (one packet per 50 µs). From this we can calculate the highest effective Pmax, i.e., the largest Pmax value worth setting, for a 25G port with a single QP session, as shown in the following table; the highest effective Pmax in the last column can be computed from columns 4 and 5.
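The arithmetic behind the "highest effective Pmax" can be sketched as follows, using the 50 µs CNP coalescing window from the NP algorithm and the pps measured in the experiment below:

```python
def max_effective_pmax(port_pps: float, cnp_window_us: float = 50.0) -> float:
    """Maximum useful ECN marking probability: marking more often than one packet per
    CNP coalescing window (50 us) produces no additional CNPs at the receiver."""
    max_useful_marks_per_s = 1e6 / cnp_window_us          # 20,000 marked packets per second
    return max_useful_marks_per_s / port_pps

# With the measured ~2,227,007 pps of a congested 25G port, anything above ~0.9% is wasted.
print(f"{max_effective_pmax(2_227_007):.2%}")             # ~0.90%
```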

Back to reality: we validated the last row of the table using the derived data.

We simulated congestion with a port speed limit and measured a stable RoCE flow of pps = 2,227,007, then chose one set of ECN parameters: Kmin=1 cell, Kmax=1400 cells, Pmax=1%. In theory, this Pmax already exceeds the maximum effective value, so even under congestion the egress level should never reach 1400 cells. We therefore also set up a monitoring item that triggers an alarm (rather than polling, so nothing gets missed) if the egress level exceeds 1400 cells. That was the first experiment.

For comparison, the second experiment used a different set of ECN parameters: Kmin=800 cells, Kmax=1400 cells, Pmax=1%. According to the earlier analysis, the egress level should again never exceed 1400 cells, because at a level of 1400 cells a Pmax of 1% already exceeds the highest effective marking probability.

However, the results did not match expectations: the first experiment never triggered the alarm and passed, but the second one did. That means at certain moments the cache level exceeded 1400 cells; the level fluctuates instead of stabilizing at a fixed value. Our best guess at the reason: too many steps on the path from queue backlog to relief take time, including queue-length polling, the exponential averaging algorithm, CNP generation and forwarding, and even the data still in the cables after the sender slows down.

To get around this, we took a different route: first set a few small goals, then explore and validate a safe, reliable configuration through a large number of experiments. The method is crude, but it is very effective.

► Small goal 1: server port throughput must stay above 95%.

► Small goal 2: in all traffic scenarios, the PFC frame rate must stay below 5 pps for 99% of the time.

► Small goal 3: server end-to-end latency must not exceed 80 µs in any scenario (and must stay below 40 µs in 90% of scenarios).

For the traffic model, after design and filtering we selected more than 50 traffic patterns, and finally obtained a set of parameters that satisfies all three small goals at the same time.

It has to be said that DCQCN is hard to get right: there are many parameters and they all interact. This article only offers some practical rules of thumb; in-depth discussion is welcome.

VI. Summary

To give the physical network the ability to carry RDMA traffic, we chose the RoCE scheme and, through QoS, lossless, and congestion control design, guaranteed lossless forwarding on the physical network. The RoCE lossless network provides strong support for high-performance services such as Kuaijie CVM, enabling the RSSD cloud disk to reach 1.2 million IOPS with 25 Gbps of line-rate private network forwarding bandwidth.
