2025-04-04 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/03 Report--
The UCloud public network gateway carries inbound and outbound traffic for public network IPs, load balancing and other products. The current implementation, built on OVS, GRE tunnels, netns and iptables in the Linux kernel, supports the existing business very well. At the same time, we continuously track new technologies in the open source community and apply them to the design of the next-generation external network gateway. These new features will take system performance and manageability to the next level and meet the needs of the next few years. During design and development we found a number of defects and bugs in the new features. We therefore contributed more than 10 patches to the open source community, which were merged into kernel version 5.0, to help improve kernel functionality and stability.
At present, many multi-tenant public network gateways in the industry are based on the OpenFlow-driven OpenvSwitch (OVS) scheme. However, as kernel routing and forwarding keep improving, it has become feasible to design a multi-tenant public network gateway on native kernel route forwarding. This lets us use the traditional iproute2 routing tools and Firewall tools such as iptables and nftables, and with the rise of SwitchDev it also opens the way to migrating the gateway system to a Linux Switch in the future.
Shortcomings of the existing 3.x kernel
At present, the most widely used kernel versions are in the 3.x series. For example, the standard kernel across the CentOS 7 series is 3.10, and 3.x kernels are also widely used in distributions such as Fedora and Ubuntu. The 3.x series has several problems, including complex IP tunnel management and performance loss from tenant isolation.
1. IP tunnel management is complex
The Linux kernel creates IP tunnel devices to establish point-to-point tunnel connections, and the tunnel dst and tunnel key are specified when the device is created. Because the gateway connects to hosts and there are a large number of host destination addresses, thousands of tunnel devices must be created on the gateway node. In a large-scale business environment, managing these tunnels becomes very complex.
2. Performance degradation caused by multi-tenant isolation
a. Public clouds must isolate tenants to ensure security and privacy among users. Because the intranet addresses of different tenants in a VPC network can overlap, their routes can overlap too, so tenant routing rules have to be isolated through a large number of policy routes. Policy routing rules are stored as a linked list, so performance drops sharply as the list grows.
b. Firewall and NAT are implemented with iptables, which is also chain-based, so the performance loss there is considerable as well.
3. netns brings performance overhead
Tenant routing and Firewall rules are isolated through netns, but netns introduces the overhead of virtual network cards and of re-entering the protocol stack, which reduces overall performance by about 20%.
Three new kernel technologies
To solve the problems of the original scheme, we surveyed a large number of mainstream solutions in the industry and new developments in the upstream kernel. We found that three new kernel technologies can help avoid the shortcomings of the original scheme: lightweight tunneling (lwtunnel), Virtual Routing Forwarding (VRF), and nftables with netfilter flow offload.
1. Lightweight tunneling
The Linux kernel introduced lightweight tunneling in version 4.3. It allows tunnel properties to be set through routes, which avoids managing a large number of tunnel devices.
A tunnel device is created in external mode, and packets sent through that device use the lightweight tunnel parameters set on the matching route.
2. Virtual Routing Forwarding
The Linux kernel introduced preliminary support for VRF in version 4.3, and the feature was complete by version 4.8. Virtual Routing Forwarding lets one Linux box act as multiple virtual routers, which solves the problem of tenant route isolation without using policy routing directly. The network cards of each tenant are added to the virtual router that the tenant belongs to, which gives every tenant its own virtual routing.
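The isolation property described above can be illustrated with a small model. This is only a sketch in Python, not kernel code: the `Vrf` class, the `dev_to_vrf` binding and the device names `eth1`/`eth2` are assumptions made for illustration, while `user1`, `user2`, `tun1`, `tun2` and the overlapping address 10.0.0.7 are taken from the prototype later in this article.

```python
import ipaddress


class Vrf:
    """One virtual router: a private routing table (illustrative model only)."""

    def __init__(self, name):
        self.name = name
        self.routes = []  # list of (network, output device)

    def add_route(self, prefix, out_dev):
        self.routes.append((ipaddress.ip_network(prefix), out_dev))

    def lookup(self, dst):
        addr = ipaddress.ip_address(dst)
        matches = [(net, dev) for net, dev in self.routes if addr in net]
        if not matches:
            return None
        # Longest-prefix match inside this VRF's own table.
        return max(matches, key=lambda m: m[0].prefixlen)[1]


# Two tenants with the same (overlapping) internal prefix.
user1, user2 = Vrf("user1"), Vrf("user2")
user1.add_route("10.0.0.0/24", "tun1")
user2.add_route("10.0.0.0/24", "tun2")

# Hypothetical binding of receiving devices to VRFs, in the spirit of
# enslaving a network card to a VRF device.
dev_to_vrf = {"eth1": user1, "eth2": user2}


def route(in_dev, dst):
    """Pick the table from the receiving device, then do a normal lookup."""
    return dev_to_vrf[in_dev].lookup(dst)
```

The same destination 10.0.0.7 resolves to a different tunnel depending only on which device received the packet, with one table selection step instead of a walk through a list of policy rules.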
3. Flow offload
nftables is a new packet classification framework designed to replace the existing {ip,ip6,arp,eb}_tables. In nftables most of the work is done in user mode, and the kernel only knows a few basic instructions (filtering is implemented with a pseudo-state machine). One of the advanced features of nftables is maps, which can associate values of different data types. For example, we can map the iif device to a dedicated set of rules stored in a previously created chain. Because the map is a hash lookup, the overhead of jumping through chained rules is avoided.
The Linux kernel introduced the flow offload feature in version 4.16, which provides flow-based offload for IP forwarding. A new connection first completes one round of packets in the original direction and in the reply direction, during which routing, Firewall and NAT are processed as usual. After the forward hook has processed the first packet in the reply direction, an offloadable flow is created on the ingress hook of the receiving network card from the routing, NAT and other information of the packets. Subsequent packets are forwarded directly at the ingress hook of the receiving device without entering the IP stack. In addition, flow offload will support a hardware offload mode in the future, which will greatly improve the forwarding performance of the system.
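The slow-path and fast-path split described above can be sketched as a toy model. Everything here is an assumption made for illustration: the `Gateway` class, its counters and the 5-tuple layout are hypothetical, NAT rewriting is left out, and the real feature lives in the kernel's netfilter flowtable rather than in user-space code like this.

```python
class Gateway:
    """Toy model: the first packet of each direction takes the slow path,
    later packets hit a flow table checked before the IP stack."""

    def __init__(self):
        self.conntrack = {}   # connection key -> set of directions seen
        self.flowtable = {}   # 5-tuple -> cached forwarding decision
        self.slow_path_hits = 0
        self.fast_path_hits = 0

    @staticmethod
    def _reverse(tup):
        src, sport, dst, dport, proto = tup
        return (dst, dport, src, sport, proto)

    def _slow_path(self, tup):
        # Stand-in for routing + Firewall + NAT processing in the IP stack.
        self.slow_path_hits += 1
        return "forward"

    def receive(self, tup):
        # "Ingress hook": consult the flow table first.
        if tup in self.flowtable:
            self.fast_path_hits += 1
            return self.flowtable[tup]
        action = self._slow_path(tup)
        rev = self._reverse(tup)
        key = min(tup, rev)  # direction-independent connection key
        seen = self.conntrack.setdefault(key, set())
        seen.add(tup)
        # Once both directions have been seen, install the flow for both.
        if len(seen) == 2:
            self.flowtable[tup] = action
            self.flowtable[rev] = action
        return action
```

Feeding the model a packet in the original direction, its reply, and then further packets in both directions results in exactly two slow-path traversals; everything after that is served from the flow table.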
Scheme design and optimization practice
After studying these three technologies, we concluded that a routing-based design could implement the external network gateway for a multi-tenant overlay network. During the design we ran into several problems, such as missing functionality in lwtunnel and flow offload, and VRF and flow offload failing to work together. In the end we solved all of them and submitted patches to the Linux open source community for these kernel shortcomings.
1. tunnel_key missing from packets sent through lwtunnel
Problem description: to send packets with lwtunnel routing, we create a gretap tunnel device of external type. The route command sets id to 1000, but the packets that are actually sent carry no tunnel_key field.
Problem location: we studied the iproute2 code and found that the TUNNEL_KEY flag was not exposed to user mode, so the iproute2 tool did not set TUNNEL_KEY on lwtunnel routes and the packets were built without a tunnel_key field.
Submit patch: we submitted patches to the kernel and to user-mode iproute2 to solve this problem:
iptunnel: make TUNNEL_FLAGS available in uapi
https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=1875a9ab01dfa96b06cb6649cb1ce56efa86c7cb
iproute: Set ip/ip6 lwtunnel flags
https://git.kernel.org/pub/scm/network/iproute2/iproute2.git/commit/?id=3d65cefbefc86a53877f1e6461a9461e5b8fd7b3
With these patches applied, the route can be set as follows:
ip r r 2.2.2.11 via 1.1.1.11 dev tun encap ip id 1000 dst 172.168.0.1 key
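Why the missing flag matters can be seen from the GRE header layout itself. The sketch below is illustrative Python, not the kernel's code path: `build_gre_header` is a hypothetical helper, while the K bit value 0x2000 and the 0x6558 protocol type for bridged Ethernet (used by gretap) come from the GRE specifications. The key field is only emitted when the key flag is set, which mirrors the symptom described above: an id of 1000 on the route but no key in the packet.

```python
import struct

GRE_KEY_PRESENT = 0x2000  # K bit in the GRE flags field (RFC 2890)
ETH_P_TEB = 0x6558        # Transparent Ethernet Bridging, carried by gretap


def build_gre_header(tunnel_id, key_flag):
    """Build a minimal GRE header; the 4-byte key exists only if the flag is set."""
    flags = GRE_KEY_PRESENT if key_flag else 0
    header = struct.pack("!HH", flags, ETH_P_TEB)
    if key_flag:
        header += struct.pack("!I", tunnel_id)
    return header


# Route carries id 1000 but no key flag: the id never reaches the wire.
without_key = build_gre_header(1000, key_flag=False)
# Route carries id 1000 and the key flag: the header includes the key.
with_key = build_gre_header(1000, key_flag=True)
```

In this model the header without the flag is 4 bytes long and the tunnel id is silently dropped, while the header with the flag is 8 bytes and ends with the key.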
2. lwtunnel does not work on an IP tunnel with a specified key
Problem found: to isolate tenant routing effectively, we create a tunnel_key-based gretap tunnel device for each tenant. As shown in the figure below, we create a gretap tunnel device with tunnel_key 1000 and add it to the tenant's VRF. The tunnel device can receive packets correctly but cannot send them.
Problem location: studying the kernel, we found that when an IP tunnel in non-external mode has a lightweight tunnel route configured, the route is not used for sending, so the packets are misrouted and discarded.
Submit patch:
ip_tunnel: Make none-tunnel-dst tunnel port work with lwtunnel
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d71b57532d70c03f4671dd04e84157ac6bf021b0
With this patch applied, a non-external mode IP tunnel that has no tunnel_dst specified can use a lightweight tunnel route to send packets.
3. ARP does not work properly on an external IP tunnel
Problem description: the peer sends an ARP request through the IP tunnel, but the tunnel header of the local ARP reply does not contain the tunnel_key field.
Problem location: studying the code, we found that when the tunnel receives an ARP request from the peer and sends the ARP reply, it copies the tunnel information of the request but omits all tun_flags.
Submit patch:
iptunnel: Set tun_flags in the iptunnel_metadata_reply from src
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7bdca378b2301b1fc6a95c60d6d428408ae4e39e
4. Flow offload cannot work effectively with DNAT
Problem description: a Firewall rule is created so that packets received on eth0 with destination address 2.2.2.11 are DNATed to 10.0.0.7, and flow offload does not work.
Problem location: analysis showed that when client 1.1.1.7 -> 2.2.2.7 is DNATed to server 10.0.0.7, the first reply packet in the reverse direction (SYN+ACK) uses the wrong destination address to look up the reverse route:
daddr = ct->tuplehash[!dir].tuple.dst.u3.ip
At this point dir is the reply direction, so daddr gets the destination address of the original direction, which is 2.2.2.7. Because the connection has been DNATed, the route must not be looked up by 2.2.2.7 but by 10.0.0.7:
daddr = ct->tuplehash[dir].tuple.src.u3.ip
Submit patch:
netfilter: nft_flow_offload: Fix reverse route lookup
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a799aea0988ea0d1b1f263e996fdad2f6133c680
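The difference between the two lookups in item 4 can be checked with a small model of the conntrack tuples. This is a Python sketch, not the kernel code: `ConnTuple`, `tuplehash` as a dict and the two helper functions are hypothetical names, and the addresses are the ones from the problem location above (client 1.1.1.7, public address 2.2.2.7, server 10.0.0.7).

```python
from collections import namedtuple

ConnTuple = namedtuple("ConnTuple", "src dst")
ORIGINAL, REPLY = 0, 1

# Conntrack entry for client 1.1.1.7 -> 2.2.2.7, DNATed to server 10.0.0.7.
# The reply tuple already reflects the NAT: the server answers from 10.0.0.7.
tuplehash = {
    ORIGINAL: ConnTuple(src="1.1.1.7", dst="2.2.2.7"),
    REPLY: ConnTuple(src="10.0.0.7", dst="1.1.1.7"),
}


def other_route_daddr_buggy(direction):
    """Before the fix: destination of the opposite direction's tuple."""
    return tuplehash[1 - direction].dst


def other_route_daddr_fixed(direction):
    """After the fix: source of the current direction's tuple."""
    return tuplehash[direction].src
```

For the SYN+ACK (reply direction) the old expression yields the pre-NAT address 2.2.2.7, while the fixed expression yields 10.0.0.7, the address that packets in the original direction are actually routed to after DNAT. Without NAT both expressions return the same address, which is why the problem only shows up with DNAT.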
5. Flow offload does not work effectively with VRF
Problem description: after network cards eth0 and eth2 are added to a VRF, flow offload does not work.
Problem location: looking at the code, we found that after the first packets in the original and reply directions enter the protocol stack, skb->dev is set to the vrf device user1, so the flow offload rule is created with user1 as its iif. However, the offload rules are installed on the ingress hooks of eth0 and eth2, so subsequent packets cannot match the flow rules there.
Submit patch:
netfilter: nft_flow_offload: fix interaction with vrf slave device
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=10f4e765879e514e1ce7f52ed26603047af196e2
In the end, we set the iif and oif of the flow offload rules from the results of the route lookups in both directions, which solves this problem.
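The mismatch can be reduced to a lookup key that includes the input interface. The following Python sketch is a deliberately simplified assumption: `flowtable`, `install` and `ingress_lookup` are hypothetical names, and the real flow tuple holds more fields than the three modeled here.

```python
flowtable = {}


def install(iif, src, dst):
    """Install a flow entry keyed by input interface and addresses."""
    flowtable[(iif, src, dst)] = "forward"


def ingress_lookup(phys_dev, src, dst):
    """Lookup as performed on the ingress hook of a physical device."""
    return flowtable.get((phys_dev, src, dst))


# Before the fix: the entry records the VRF device as iif.
install("user1", "1.1.1.7", "10.0.0.7")
miss = ingress_lookup("eth0", "1.1.1.7", "10.0.0.7")

# After the fix: the entry records the device found by the route lookup.
flowtable.clear()
install("eth0", "1.1.1.7", "10.0.0.7")
hit = ingress_lookup("eth0", "1.1.1.7", "10.0.0.7")
```

An entry keyed on `user1` is never found by a lookup made on `eth0`, so every packet falls back to the slow path; keying the entry on the physical device restores the match.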
6. VRF PREROUTING hook re-entry problem
Problem description: the network card is added to a VRF. The firewall ingress rule accepts packets with destination address 2.2.2.11 and TCP destination port 22, and the egress rule discards packets with TCP destination port 22. Abnormal result: packets received for destination address 2.2.2.11 and TCP destination port 22 are discarded.
Problem location: we found that packets received after the network card joins the VRF pass through the PREROUTING hook twice. They enter the PREROUTING hook once when they enter the IP stack, and again after the VRF device takes them over. On the first pass, the rule in the rule-1000-ingress chain DNATs the packet to 10.0.0.7. On the second pass, because the packet has already been DNATed, it wrongly enters rule-1000-egress and is discarded.
Submit patch: we added a match item to the kernel that identifies the type of network device, so that user-mode rules can skip the second, known-invalid re-entry. The following patches were submitted to the kernel and to user-mode nftables respectively:
netfilter: nft_meta: Add NFT_META_I/OIFKIND meta type
https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=0fb4d21956f4a9af225594a46857ccf29bd747bc
meta: add iifkind and oifkind support
http://git.netfilter.org/nftables/commit/?id=512795a673f999fb04b84dbbbe41174e9c581430
How to use it:
nft add rule firewall rules-all meta iifkind "vrf" counter accept
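The double traversal and the effect of the iifkind rule can be traced with a toy ruleset. This Python sketch rests on assumptions: the rule order, the `prerouting` and `receive` helpers and the packet dictionary are hypothetical, and only the addresses, the port and the accept-on-vrf idea are taken from the description above.

```python
def prerouting(pkt, guard_enabled):
    """One pass through a hypothetical PREROUTING ruleset for this tenant."""
    # Guard in the spirit of: meta iifkind "vrf" accept
    if guard_enabled and pkt["iifkind"] == "vrf":
        return "accept"
    # Ingress rule: accept and DNAT 2.2.2.11:22 to 10.0.0.7.
    if pkt["dst"] == "2.2.2.11" and pkt["dport"] == 22:
        pkt["dst"] = "10.0.0.7"
        return "accept"
    # Egress rule: drop anything else heading to TCP port 22.
    if pkt["dport"] == 22:
        return "drop"
    return "accept"


def receive(guard_enabled):
    """A packet enters on a physical NIC, then re-enters via the VRF device."""
    pkt = {"iifkind": None, "dst": "2.2.2.11", "dport": 22}
    if prerouting(pkt, guard_enabled) == "drop":
        return "drop"
    pkt["iifkind"] = "vrf"  # second pass: input device is now the VRF
    return prerouting(pkt, guard_enabled)
```

Without the guard the second pass sees a packet for 10.0.0.7 port 22, misses the ingress rule and is dropped by the egress rule; with the guard the second pass is accepted before any tenant rule is evaluated.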
Prototype verification
Finally, we successfully used lwtunnel, VRF and flow offload to build and verify a prototype of the multi-tenant external network gateway. The verification process is as follows:
1. First create a prototype environment.
a. netns cl simulates the public network client, with address 1.1.1.7 and tunnel src 172.168.0.7, and configures the sending route.
b. netns ns1 simulates tenant 1, with internal network address 10.0.0.7, public network address 2.2.2.11, tunnel src 172.168.0.11 and tunnel_key 1000, and configures the sending route.
c. netns ns2 simulates tenant 2, with internal network address 10.0.0.7, public network address 2.2.2.12, tunnel src 172.168.0.12 and tunnel_key 2000, and configures the sending route.
d. The host simulates the public network gateway, with tunnel src 172.168.0.1. It creates the tenant VRFs user1 and user2, creates the tenant IP tunnels tun1 and tun2, and configures the forwarding routes.
The figure of the prototype environment is as follows:
2. Create the firewall rules:
a. Tenant 1 ingress allows TCP destination port 22 and ICMP access, and egress forbids access to external TCP destination port 22.
b. Tenant 2 ingress allows TCP destination port 23 and ICMP access, and egress forbids access to external TCP destination port 23.
c. Flow offload is enabled on the tenant devices tun1 and tun2.
In the end, the client can access the user1 TCP port 22 service through 2.2.2.11, and user1 cannot access the client's TCP port 22 service. The client can access the user2 TCP port 23 service through 2.2.2.12, and user2 cannot access the client's TCP port 23 service.
Once the hardware offload function matures and network card vendors support it, we will carry out further development and verification.
Closing remarks
These are some of the core issues involved in this project, and the features from these patches are available in Linux kernel version 5.0. We have compiled a list of the patches we contributed to the Linux kernel community during this period, hoping it will help developers. Readers can click "read the original" to see the complete patch list.
As a mature open source suite, Linux has always been the mainstream operating system used by cloud vendors, but as the technology iterates, some new features show stability and compatibility problems in practical use. While studying and using upstream technology, we have been actively exploring and extending open source functionality and helping to improve its stability. We continue to contribute our results back to the community and to work with the community to build a thriving open source ecosystem.