Troubleshooting a virtual network is hard: traditional tools such as traceroute are of little use, and in most cases engineers have to log in to hosts and hybrid cloud gateways to capture packets, which is time-consuming and laborious. In some scenarios the delivery path of a packet is long (cross-region, hybrid cloud, and so on) and there are many points where it may be dropped, which further increases the difficulty of troubleshooting.
For this reason we designed BigBrother, an internal network connectivity detection system that supports large-scale, full-link detection. Coloring based on TCP packets lets it distinguish probe packets from user traffic; it supports physical cloud and complex cross-region scenarios and provides a complete detection framework, helping OPS colleagues locate the point of failure directly, or determine with one click whether the virtual network is at fault.
Since its launch, BigBrother has been used to verify the connectivity of CVMs immediately before and after migration, so that any exception can be alerted on and rolled back in time. In the two months since the beginning of August, more than 2,000 hosts have been migrated and nearly 10 abnormal migrations were caught in time.
I. The shortcomings of the first-generation network connectivity tool
Before designing BigBrother we already had a first-generation connectivity checking tool. Its principle: SSH to the target host, use the OVS packet-out command to send a constructed packet, and finally run tcpdump on the peer's host machine to verify connectivity between the two ends. From this principle its shortcomings are easy to see:
Detection is inefficient: neither SSH, packet-out, nor tcpdump supports large-scale, rapid inspection.
The applicable scenarios are limited: for DPDK- and P4-based gateway products, packets cannot be captured with tcpdump.
A connectivity detection system that supports large-scale, full-link checks was therefore clearly necessary. Our goal is to let operations and NOC colleagues quickly discover and resolve disconnections, and at the same time to safeguard our own virtual network service changes.
II. The design principles of BigBrother
The name BigBrother (hereafter BB) comes from George Orwell's novel 1984; calling the detection system BigBrother signals that the connectivity of every resource on the network can be watched in real time. The whole BB system is a cooperation of several components: mafia provides the console for creating tasks and displaying their results, minitrue converts the parameters passed by users into the packet injection range, and telescreen constructs, sends, and receives the probe packets.
1. Entrypoint and Endpoint
Before describing how BB works in detail, let's define two concepts. In our virtual network, every instance (uhost, umem, udb, etc.) reaches the network through an access point, which consists of two parts:
Entrypoint: the point through which inbound and outbound packets are received and sent.
Endpoint: the point that connects the instance; the Endpoint is the network element closest to the instance.
For example, in the public cloud scenario both the entrypoint and the endpoint are openvswitch, while in the physical cloud scenario the entrypoint is our physical cloud forwarding gateway (vpcgw, hybridgw) and the endpoint is the uplink ToR of the physical cloud host.
These are the access points in the various scenarios. The reason for spelling out these two concepts is that in the BB system we use the Entrypoint as the injection point, sending GRE probe packets to it, and the Endpoint as the sampling point: the Endpoint recognizes the special probe packets and mirrors them to BB.
2. Detection process
The detection scheme is shown in the figure and consists of two parts; the flows in the diagram are colored orange and purple.
Take the orange flow (SRC -> DST) as an example:
1) BigBrother, impersonating DST, sends a probe packet to SRC's Entrypoint;
2) SRC's Entrypoint receives the probe packet and forwards it to the Endpoint;
3) the Endpoint mirrors the packet to BigBrother;
4) the Endpoint forwards the packet to the instance as normal;
5) the instance replies to the Endpoint;
6) the Endpoint receives the reply, GRE-encapsulates it, and mirrors it to BigBrother;
7) the Endpoint forwards the reply to the Entrypoint as normal;
8) SRC's Entrypoint sends the reply to DST's Entrypoint;
9) DST's Entrypoint receives the reply and sends it to DST's Endpoint;
10) DST's Endpoint mirrors the reply to BigBrother.
At this point the one-way test is complete. During it, BigBrother sent one probe packet and received three mirrored packets; by analyzing these three sampling points we can confirm whether communication in the SRC -> DST direction is normal.
The purple flow works the same way in the opposite direction. After both runs, BigBrother should have received a total of 6 mirrored packets; if all 6 arrive, connectivity is normal.
3. Probe packet design
The detection flow of BB is described above; now let's look at the design of the probe packet and of the forwarding plane. Public cloud and hybrid cloud differ considerably here. On the public cloud forwarding plane, the request and the reply have to be hooked at the global hook point (table_1), colored, and mirrored to BB; on the hybrid cloud side, the ToR and PE switches must have the ERSPAN feature enabled so that colored packets are mirrored to BB.
The overall packet interaction is shown in the following figure:
A qualified probe packet must first have the following characteristics:
The coloring information is independent of the host and of the guest OS.
OVS 2.3 and OVS 2.6 (the main versions in the current production network) can recognize and process the coloring information.
We therefore compared the following two candidate schemes in detail.
1) ICMP + ToS scheme
The first scheme uses ICMP packets as the carrier: the ToS field colors the icmp_request, and ICMP packets carrying that ToS are mirrored to BB at the sampling point. The flow hooking icmp_request can be simplified to the following logic:

cookie=0x20008,table=1,priority=40000,metadata=0x1,icmp,icmp_type=8,icmp_code=0,nw_tos=0x40 actions=Send_BB(),Learn(),Back_0()

The action part consists of three steps:
Send_BB(): mirrors the packet to BB.
Learn(): from the icmp_request, learns a flow that will match the corresponding icmp_reply; that flow's main actions are coloring the reply and mirroring it to BB:
# 1. Mark REG3 with 0x6420 (the global hook point)
load:0x6420->NXM_NX_REG3[],
# 2. Learn a flow in table 31 that matches the reversed icmp_reply, colors it, and mirrors it to BB
learn(table=31,idle_timeout=2,hard_timeout=4,priority=30000,dl_type=0x0800,ip_proto=1,icmp_type=0,icmp_code=0,NXM_OF_IP_SRC[]=NXM_OF_IP_DST[],NXM_OF_IP_DST[]=NXM_OF_IP_SRC[],Stain(),Send_BB()),
# 3. Clear the REG3 mark
load:0->NXM_NX_REG3[]
Back_0(): sends the packet back to table_0 for normal forwarding.
The flow hooking icmp_reply can be simplified to the following logic:

cookie=0x20008,table=1,priority=40000,metadata=0x1,icmp,icmp_type=0,icmp_code=0,nw_tos=0x40 actions=Save(),Resubmit(table=31),Restore(),Back_0()

The action part consists of four steps: Save(in_port, tun_src) saves the in_port and tun_src of the packet; Resubmit(table=31) jumps to table 31 and matches the flow generated by the icmp_request's learn action; Restore(in_port, tun_src) restores the in_port and tun_src; Back_0() sends the packet back to table_0 for normal forwarding.

The above covers the coloring and mirroring done by OVS on the public cloud side; on the hybrid cloud side, the switches must support ERSPAN instead. For the ERSPAN rule to mirror the ToS-colored packets, the ToS in the outer IP header of the GRE tunnel must inherit the ToS marked in the overlay (inner) IP header, so the GRE tunnels across the whole network need the tunnel attribute that inherits the inner ToS. The command is as follows:
ovs-vsctl set interface <gre_tunnel_port> options:tos=inherit
Although this scheme achieves coloring and mirroring, the flows embedded at the hook points are too complex to maintain easily. More importantly, in hybrid cloud networks this scheme cannot support the learn flow, so it cannot color the reverse traffic.
2) TCP scheme
The second scheme uses TCP packets as the carrier: a specific port pair serves as the coloring condition, and TCP packets with that source and destination port are mirrored to BB at the sampling point. The flow hooking the colored TCP packets can be simplified to the following logic (port 11 is the coloring port chosen below):

cookie=0x20008,table=1,priority=40000,tcp,metadata=0x1,tp_src=11,tp_dst=11 actions=Send_BB(),Back_0()

The action part consists of two steps: Send_BB() mirrors the packet to BB; Back_0() sends the packet back to table_0 for normal forwarding.
Comparing the two schemes, it is clear that the first has more dependencies and a narrower range of applicable scenarios, so BB adopts the second. The TCP scheme has its own difficulty, though: choosing the coloring port so that probe traffic can be distinguished from user traffic. After stepping on several pitfalls and analyzing the options, we settled on TCP source and destination port 11 for coloring (users have been informed that TCP port 11 is used for scanning; for details see https://docs.ucloud.cn/network/unet/faq).
The probe packet is shown in the following figure.
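For concreteness, here is a minimal sketch of how such a colored probe could be built with scapy. This is not the production telescreen code; the Entrypoint address and instance IPs are placeholders, and only the carrier (TCP source/destination port 11 inside a GRE tunnel toward the injection point) follows the scheme described above.

    # Hypothetical sketch of the colored probe: TCP src/dst port 11 inside GRE.
    from scapy.all import GRE, IP, TCP, send

    ENTRYPOINT_IP = "198.51.100.1"   # placeholder injection point (OVS host / vpcgw)
    SRC_INSTANCE = "10.9.88.160"     # instance being probed
    DST_INSTANCE = "10.8.17.169"     # instance BigBrother impersonates

    def build_probe():
        # The inner packet pretends to come from DST so that SRC's instance
        # replies; ports 11/11 are the coloring condition the hook flow
        # (tp_src=11, tp_dst=11) matches on.
        inner = IP(src=DST_INSTANCE, dst=SRC_INSTANCE) / TCP(sport=11, dport=11, flags="S")
        # The outer GRE header delivers the probe to the Entrypoint.
        return IP(dst=ENTRYPOINT_IP) / GRE() / inner

    send(build_probe())  # requires root privileges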
4. The life cycle of probe packets in each scenario
BB is designed to support many network scenarios, including the complexity of physical cloud and cross-region interworking. In this section we take the physical cloud and cross-region cases as examples and trace the life cycle of BB probe packets in detail.
Physical cloud
In the scenario where the public cloud interconnects with the physical cloud, the life cycle of the probe packet is as follows:
Public cloud -> physical cloud
1) BigBrother sends a probe packet to the public cloud host;
2) after receiving the packet, ovs mirrors it to BigBrother;
3) ovs sends the packet to the instance;
4) the instance replies;
5) ovs mirrors the reply to BigBrother;
6) the physical cloud core switch receives the packet and sends it to the aggregation switch;
7) 8) 9) 10) the physical cloud aggregation switch sends the packet to vpcgw, which processes it and sends it back to the aggregation switch;
11) ERSPAN configured on the physical cloud aggregation switch mirrors the packet to BigBrother.
Physical cloud -> public cloud
1) BigBrother sends a probe packet to vpcgw;
2) 3) vpcgw processes the packet and sends it back to the aggregation switch;
4) ERSPAN configured on the physical cloud aggregation switch mirrors the packet to BigBrother;
5) the aggregation switch sends the packet to the phost's uplink ToR;
6) the ToR sends the packet to the phost;
7) the phost echoes a reply;
8) ERSPAN configured on the phost's uplink ToR mirrors the reply to BigBrother;
9) the reply is forwarded to the public cloud host's ovs;
10) after receiving the reply, ovs mirrors it to BigBrother.
Cross-region gateway
In the public cloud cross-region interconnection scenario, the life cycle of the probe packet is as follows:
Local region -> region B
1) BigBrother sends a probe packet to the local host;
2) ovs mirrors the packet to BigBrother;
3) ovs sends the packet to the instance;
4) the instance replies;
5) ovs mirrors the reply to BigBrother;
6) ovs sends the packet to sdngw;
7) sdngw mirrors the packet to BigBrother.
Region B -> local region
1) BigBrother sends a probe packet to the local sdngw;
2) sdngw mirrors the packet to BigBrother;
3) sdngw sends the packet to the peer sdngw for forwarding;
4) the local sdngw receives the peer's reply;
5) sdngw mirrors the reply to BigBrother;
6) sdngw sends the packet to the local-region host;
7) ovs mirrors the packet to BigBrother.
III. The BigBrother service framework
The whole BB detection system is completed by several cooperating components: minitrue converts the parameters passed by users into the packet injection range, and telescreen constructs the probe packets and sends and receives them.
1. Service framework diagram
API: the HTTP interface provided by the FE service, used to create tasks and query task progress;
Logic: the business layer, which parses the parameters into a number of source/destination host pairs and pushes them into the Disruptor;
Disruptor: an open-source high-performance queue;
Sender: pops pairs from the Disruptor, assembles them into GRE probe packets, and sends them to the Entrypoint;
Receiver: receives the GRE packets reported (mirrored) from the Endpoint;
Analysis: stores the received packets in memory and analyzes them.
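To illustrate the Logic -> Disruptor -> Sender hand-off, here is a minimal sketch; it assumes Python's standard queue.Queue in place of the Disruptor and a print stub in place of the real GRE sender, whereas the actual components are separate services.

    import queue
    import threading

    # Stand-in for the Disruptor: a bounded queue of (src, dst) probe pairs.
    pair_queue: "queue.Queue[tuple[str, str]]" = queue.Queue(maxsize=65536)

    def send_gre_probe(src: str, dst: str) -> None:
        # Stub for the real Sender logic (see the probe sketch in section II).
        print(f"probe for pair {src} -> {dst} injected at the Entrypoint")

    def logic(pairs):
        # Logic layer: expand user parameters into host pairs and enqueue them.
        for pair in pairs:
            pair_queue.put(pair)

    def sender():
        # Sender: pop pairs and emit GRE-encapsulated probes.
        while True:
            src, dst = pair_queue.get()
            send_gre_probe(src, dst)
            pair_queue.task_done()

    threading.Thread(target=sender, daemon=True).start()
    logic([("10.9.88.160", "10.8.17.169")])
    pair_queue.join()  # wait until all probes have been sent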
2. Task implementation and result analysis
1) Task design
We described the design and life cycle of the BB probe packet above, but one problem remains: BB's concurrency. As described so far, each BB instance can run only one probe at a time, since only sequential execution keeps the results unambiguous. We therefore use the sequence number in the TCP header to raise concurrency.
The following is the TCP header structure:
The 32-bit Seq sequence number is what we use. During a BB probe, each Seq value identifies exactly one probe pair. We split the Seq into two parts:
task_id: identifies a Task; with only 5 bits, at most 32 Tasks can run at a time.
pair_id: identifies a probe pair within a Task.
This raises the number of concurrent BB tasks to 32, and each task supports up to 2^27 probe pairs, which means a single task can run a full-mesh check over a VPC of 10,000 CVMs; that is enough to cover the network scale of our existing users.
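A sketch of this Seq split, assuming the task_id occupies the high 5 bits and the pair_id the low 27 bits (the bit order is our assumption; the text only gives the sizes):

    TASK_BITS, PAIR_BITS = 5, 27  # 5 + 27 = 32 bits of TCP Seq

    def pack_seq(task_id: int, pair_id: int) -> int:
        assert 0 <= task_id < (1 << TASK_BITS)   # at most 32 concurrent tasks
        assert 0 <= pair_id < (1 << PAIR_BITS)   # up to 2**27 pairs per task
        return (task_id << PAIR_BITS) | pair_id

    def unpack_seq(seq: int) -> tuple[int, int]:
        return seq >> PAIR_BITS, seq & ((1 << PAIR_BITS) - 1)

    # A full mesh over a 10,000-CVM VPC needs 10,000**2 = 1e8 pairs,
    # which fits within 2**27 (about 1.34e8).
    assert 10_000 ** 2 < (1 << PAIR_BITS)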
2) Task execution
When an OPS colleague clicks to create a BB task on mafia (the task console) to check connectivity, the following happens:
the request is sent to the minitrue service, which determines the detection range from the input parameters;
minitrue sends the computed detection range to the telescreen service as a list of source and destination nodes;
telescreen builds the GRE packets and pushes them into the high-performance queue for sending;
at the same time, telescreen listens on the NIC, captures the mirrored packets, and stores them in memory;
the minitrue analysis program periodically fetches telescreen's receive results and analyzes them;
finally, the OPS colleague sees the final detection results on mafia.
3) Task result analysis
After a task finishes, OPS colleagues can look up the final report on mafia, including the total number of pairs probed, the number of pairs for which packets were received, and the success and failure counts. The source/destination details of failed probes are also displayed, presented as a bitmap in which 0 means a packet was not received and 1 means it was.
Take the result in the figure below as an example: it tests the bidirectional connectivity of the IP pair (10.9.88.160, 10.8.17.169).
Recall the BigBrother detection flow from section II. First, BigBrother impersonates 10.9.88.160 and sends a probe packet to the host of 10.8.17.169 (the packet content is shown in the figure). If one-way connectivity 10.8.17.169 -> 10.9.88.160 is normal, BigBrother eventually receives three mirrored packets, labeled (1), (2), and (3) in the original captures.
The last three bits of the bitmap above read 111, meaning all three packets were received, i.e. 10.8.17.169 -> 10.9.88.160 one-way connectivity is normal.
Conversely, the first three bits describe 10.9.88.160 -> 10.8.17.169 one-way connectivity. Here the result is 100: packets (2) and (3) were not received, so the 10.9.88.160 -> 10.8.17.169 direction is abnormal. Since the virtual machine 10.9.88.160 never replied to the probe, we can conclude that either the guest itself is faulty or iptables rules inside it are filtering the probe packet.
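The verdict logic can be sketched as follows, assuming (as in this example) that the first three bits cover 10.9.88.160 -> 10.8.17.169, the last three cover the reverse direction, and within each half the bits follow the (1)(2)(3) order of the mirrored packets:

    def analyze(bitmap: str, forward: str, backward: str) -> None:
        # bitmap: 6 characters; '1' = mirrored packet received, '0' = missing.
        assert len(bitmap) == 6
        for label, bits in ((forward, bitmap[:3]), (backward, bitmap[3:])):
            if bits == "111":
                print(f"{label}: normal")
            elif bits.startswith("10"):
                # The probe reached the instance but no reply was mirrored:
                # a guest-side fault, e.g. an iptables rule dropping the probe.
                print(f"{label}: abnormal, instance did not reply")
            else:
                print(f"{label}: abnormal (bitmap {bits})")

    # Reproduces the verdict above: first half 100, second half 111.
    analyze("100111", "10.9.88.160 -> 10.8.17.169", "10.8.17.169 -> 10.9.88.160")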
3. Connectivity check based on active flow
As mentioned above, OPS colleagues can create a BB task on mafia and determine the detection range by passing in a mac, subnet id, or VPC id, then run a full verification. In most scenarios, however, a full-mesh check is unnecessary: it wastes time and puts pressure on the control plane. Often we only need to verify connectivity for the active flows within a given range, so we introduced an active-flow service called river. River is an analysis system for the hundreds of millions of active flows in the virtual network; with it, BB can obtain the communication sources a user is actively using, similar to the hot data in a cache, and thus verify changes quickly and accurately.
Unlike the full-mesh probing above, minitrue no longer computes the source/destination list itself; it only specifies a range, obtains the active list from river, and hands it to telescreen, after which the usual detection process sends the packets.
IV. Deployment and future plans
Since its launch, BigBrother has taken part in the resource consolidation project, verifying the connectivity of CVMs before and after migration so that any exception is alerted on and rolled back in time. In the two months since the beginning of August, more than 2,000 hosts have been migrated and nearly 10 abnormal migrations were caught in time.
At the same time, we also have some plans for the subsequent versions of BigBrother, such as:
Beyond connectivity, probe for average latency, maximum-latency pairs, and packet loss rate.
Build round-the-clock private network monitoring for specified users on top of BigBrother.