2025-04-05 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)05/31 Report--
This article explains a gateway design method based on new features of the Linux kernel. The content is detailed and the logic is clear, and it is shared here for your reference.
The UCloud public network gateway carries the inbound and outbound traffic of public IPs, load balancers and other products. The current implementation, built on OVS, GRE tunnels, netns and iptables in the Linux kernel, supports the existing business well. At the same time, we keep tracking new technologies in the Linux kernel community and applying them to the design of the next-generation public network gateway. These new features take system performance and manageability to the next level and meet the needs of the next few years. During design and development we found a number of defects and bugs in the new features, so we contributed more than 10 patches to the Linux kernel community. They were merged into Linux kernel 5.0, improving both functionality and stability.
At present, many multi-tenant public network gateways in the industry are built on the OpenFlow-based Open vSwitch (OVS). However, as the kernel's routing and forwarding functions keep improving, it has become possible to design a multi-tenant public network gateway on the kernel's native route forwarding. This lets us use the traditional iproute2 routing tools and firewall tools such as iptables and nftables. With the rise of SwitchDev, it also opens the way to migrating the gateway system to a Linux switch in the future.
Shortcomings of the existing Linux 3.x kernel
The most widely used kernel versions today are in the 3.x series. For example, the standard kernel across the whole CentOS 7 line is 3.10, and 3.x kernels are also widely used on distributions such as Fedora and Ubuntu. The 3.x kernels have several problems, including complex IP tunnel management and the performance cost of tenant isolation.
IP tunnel management is complex
To establish a point-to-point tunnel, the Linux kernel creates an IP tunnel device that specifies the tunnel destination and the tunnel key. Because hosts are connected pairwise and there are many destination hosts, a gateway node ends up with thousands of tunnel devices. In a large-scale environment, tunnel management becomes very complex.
Performance degradation caused by multi-tenant isolation
Public clouds need multi-tenant isolation to ensure security and privacy among users.
Because the intranet addresses of different tenants in a VPC network can overlap, their routes can overlap too, so a large number of policy routes are needed to isolate each tenant's routing rules. Policy routing rules are kept in a linked list, so performance declines sharply as the list grows.
Because the firewall and NAT are implemented with iptables, which is likewise chain-based, the performance loss there is also considerable.
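To illustrate why a linked list of per-tenant rules scales poorly, here is a minimal Python sketch. It is not kernel code: the classes `LinearRuleList` and `HashedRuleMap`, the helper `build`, and the table numbering are all hypothetical, and the model only counts comparisons rather than measuring real forwarding cost.

```python
# Illustrative model only: these classes are hypothetical, not kernel code.


class LinearRuleList:
    """Rules are checked one by one, like a linked list of policy rules."""

    def __init__(self):
        self.rules = []  # list of (tenant_id, table_id)

    def add(self, tenant_id, table_id):
        self.rules.append((tenant_id, table_id))

    def lookup(self, tenant_id):
        """Return (table_id, comparisons) for the tenant."""
        comparisons = 0
        for rule_tenant, table_id in self.rules:
            comparisons += 1
            if rule_tenant == tenant_id:
                return table_id, comparisons
        return None, comparisons


class HashedRuleMap:
    """Rules are indexed by tenant, so a single probe finds the table."""

    def __init__(self):
        self.rules = {}

    def add(self, tenant_id, table_id):
        self.rules[tenant_id] = table_id

    def lookup(self, tenant_id):
        return self.rules.get(tenant_id), 1


def build(n_tenants):
    """Create both structures with one rule per tenant (made-up table IDs)."""
    linear, hashed = LinearRuleList(), HashedRuleMap()
    for tenant in range(n_tenants):
        linear.add(tenant, 1000 + tenant)
        hashed.add(tenant, 1000 + tenant)
    return linear, hashed


if __name__ == "__main__":
    linear, hashed = build(5000)
    print(linear.lookup(4999))  # the last tenant needs 5000 comparisons
    print(hashed.lookup(4999))  # the hashed lookup needs one probe
```

In this model the cost of the linear lookup grows with the number of tenants, while the hashed lookup stays constant, which is the motivation for the VRF and nftables map approaches discussed later.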
Netns brings performance overhead
Tenant routing and firewall rules are isolated with netns, but netns adds the overhead of virtual NICs and of re-entering the protocol stack, which reduces overall performance by about 20%.
Three new kernel technologies
To solve the problems of the original scheme, we surveyed many mainstream industry solutions and the latest upstream kernel developments. We found that three new kernel technologies, lightweight tunneling (lwtunnel), virtual routing and forwarding (VRF), and nftables with netfilter flow offload, can avoid the shortcomings of the original scheme.
1. Lightweight tunnel
The Linux kernel introduced lightweight tunneling in version 4.3, which provides a way to set tunnel properties through routing, which avoids managing a large number of tunnel devices.
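As a rough illustration of the management benefit, the following Python sketch generates the commands for one shared tunnel device plus one route per remote endpoint. It only builds strings in the `ip` syntax used in this article and executes nothing. The function names and the second endpoint (2.2.2.12 with tunnel destination 172.168.0.2) are assumptions made for the example.

```python
# Hypothetical helper: builds command strings only, nothing is executed.


def device_commands(dev, local_cidr):
    """One external-mode gretap device serves every remote endpoint."""
    return [
        f"ip l add dev {dev} type gretap external",
        f"ifconfig {dev} {local_cidr} up",
    ]


def lwtunnel_route(dest, via, dev, tunnel_id, tunnel_dst):
    """Per-destination tunnel parameters live on the route, not on a device."""
    return (
        f"ip r r {dest} via {via} dev {dev} "
        f"encap ip id {tunnel_id} dst {tunnel_dst} key"
    )


def gateway_commands(dev, local_cidr, via, endpoints):
    """endpoints: iterable of (dest, tunnel_id, tunnel_dst) tuples."""
    commands = device_commands(dev, local_cidr)
    for dest, tunnel_id, tunnel_dst in endpoints:
        commands.append(lwtunnel_route(dest, via, dev, tunnel_id, tunnel_dst))
    return commands


if __name__ == "__main__":
    endpoints = [
        ("2.2.2.11", 1000, "172.168.0.1"),  # values from the article's example
        ("2.2.2.12", 2000, "172.168.0.2"),  # made-up second endpoint
    ]
    for line in gateway_commands("tun", "1.1.1.7/24", "1.1.1.11", endpoints):
        print(line)
```

However many endpoints are added, the sketch creates a single tunnel device, and each new endpoint costs only one more route.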
The tunnel device is created in external mode, and packets sent through the tun device use the lightweight tunnel parameters set on the route.
# ip l add dev tun type gretap external
# ifconfig tun 1.1.1.7/24 up
# ip r r 2.2.2.11 via 1.1.1.11 dev tun encap ip id 1000 dst 172.168.0.1 key
2. Virtual route forwarding
The Linux kernel introduced preliminary VRF support in version 4.3 and completed it in version 4.8. VRF lets one physical Linux box act as multiple virtual routers, which solves tenant route isolation without using policy routing directly. The network interfaces of each tenant are added to that tenant's virtual router, giving multi-tenant virtual routing.
# ip link add user1 type vrf table 1
# ip link add user2 type vrf table 2
# ip l set user1 up
# ip l set user2 up
# ip l set dev eth2 master user1
# ip l set dev eth2 master user2
# ip r a default via 192.168.0.1 dev eth2 table 1 onlink
# ip r a default via 192.168.0.1 dev eth2 table 2 onlink
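The isolation idea can be sketched in Python as follows. This is a toy model, not the kernel's VRF implementation: the classes `Vrf` and `Gateway`, the device names tun1 and tun2, and the next-hop labels are assumptions chosen for illustration. It shows two tenants using the same 10.0.0.0/24 prefix without conflict, because the ingress device selects the routing table.

```python
import ipaddress


class Vrf:
    """A virtual router with its own routing table (illustrative only)."""

    def __init__(self, name, table_id):
        self.name = name
        self.table_id = table_id
        self.routes = []  # list of (network, next_hop)

    def add_route(self, prefix, next_hop):
        self.routes.append((ipaddress.ip_network(prefix), next_hop))

    def lookup(self, addr):
        """Longest-prefix match inside this VRF only."""
        ip = ipaddress.ip_address(addr)
        matches = [(net, hop) for net, hop in self.routes if ip in net]
        if not matches:
            return None
        return max(matches, key=lambda m: m[0].prefixlen)[1]


class Gateway:
    def __init__(self):
        self.vrfs = {}
        self.master_of = {}  # device name -> Vrf

    def add_vrf(self, name, table_id):
        self.vrfs[name] = Vrf(name, table_id)
        return self.vrfs[name]

    def set_master(self, dev, vrf_name):
        """Similar in spirit to `ip l set dev <dev> master <vrf>`."""
        self.master_of[dev] = self.vrfs[vrf_name]

    def route(self, in_dev, dst):
        """The ingress device selects the VRF; no policy rule scan is needed."""
        return self.master_of[in_dev].lookup(dst)


if __name__ == "__main__":
    gw = Gateway()
    gw.add_vrf("user1", 1).add_route("10.0.0.0/24", "tenant1-host")
    gw.add_vrf("user2", 2).add_route("10.0.0.0/24", "tenant2-host")
    gw.set_master("tun1", "user1")
    gw.set_master("tun2", "user2")
    print(gw.route("tun1", "10.0.0.7"))
    print(gw.route("tun2", "10.0.0.7"))
```

The same destination address resolves to a different next hop depending on which tenant device the packet arrived on.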
3. Flow offload
Nftables is a new packet classification framework designed to replace the existing {ip,ip6,arp,eb}_tables. In nftables most of the work is done in user space, and the kernel understands only a small set of basic instructions (filtering is implemented with a pseudo-state machine). One advanced feature of nftables is maps, which associate data of different types. For example, we can map the input interface (iif) to a dedicated set of rules stored in a previously created chain. Because the map is a hash lookup, the overhead of jumping through rules chain by chain is avoided.
The Linux kernel introduced flow offload in version 4.16, providing flow-based offload for IP forwarding. Once a new connection has completed its first round of packets in both the original and the reply direction, and routing, firewall and NAT processing are done, the forward hook that handles the first reply-direction packet creates an offloadable flow on the ingress hook of the receiving NIC, built from the packet's routing and NAT information. Subsequent packets are forwarded directly at the ingress hook without entering the IP stack. In the future, flow offload will also support a hardware offload mode, which will greatly improve forwarding performance.
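The slow-path and fast-path split can be modeled with a short Python sketch. It is a simplification under stated assumptions: flows are keyed by (source, destination) pairs rather than full 5-tuples, NAT is left out, and the class `FlowOffloadModel` and its routing stand-in are hypothetical rather than taken from the kernel.

```python
class FlowOffloadModel:
    """Toy model: keys are (src, dst) pairs instead of full 5-tuples."""

    def __init__(self):
        self.pending = {}    # connection -> {direction key: out_dev}
        self.flowtable = {}  # direction key -> out_dev (the fast path)

    def slow_path_route(self, dst):
        # Stand-in for the routing, firewall and NAT work of the IP stack.
        return "eth2" if dst.startswith("10.") else "eth0"

    def receive(self, src, dst):
        key = (src, dst)
        if key in self.flowtable:
            # Ingress-hook hit: forward without entering the IP stack.
            return "fast", self.flowtable[key]

        out_dev = self.slow_path_route(dst)
        conn = frozenset(key)
        seen = self.pending.setdefault(conn, {})
        seen[key] = out_dev
        if len(seen) == 2:
            # Both directions have been seen once: offload the flow.
            self.flowtable.update(seen)
            del self.pending[conn]
        return "slow", out_dev


if __name__ == "__main__":
    gw = FlowOffloadModel()
    packets = [
        ("1.1.1.7", "10.0.0.7"),  # first packet, original direction
        ("10.0.0.7", "1.1.1.7"),  # first packet, reply direction
        ("1.1.1.7", "10.0.0.7"),  # later packets hit the flowtable
        ("10.0.0.7", "1.1.1.7"),
    ]
    for src, dst in packets:
        print(src, dst, gw.receive(src, dst))
```

Only the first packet in each direction pays the full stack cost in this model; everything after that is a single table lookup.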
# nft add table firewall
# nft add flowtable firewall fb1 { hook ingress priority 0 \; devices = { eth0, eth2 } \; }
# nft add chain firewall ftb-all { type filter hook forward priority 0 \; policy accept \; }
# nft add rule firewall ftb-all ct zone 1 ip protocol tcp flow offload @fb1
Scheme design and optimization practice
After studying these three technologies, we concluded that a routing-based design could implement a public network gateway for a multi-tenant overlay network. During the design we ran into several problems: lwtunnel and flow offload lacked some capabilities, and VRF and flow offload did not work well together. In the end we solved all of them and submitted patches to the Linux kernel community for these kernel deficiencies.
1. tunnel_key missing from packets sent through lwtunnel
Problem description: to send packets through an lwtunnel route, we create a gretap tunnel in external mode. The route sets id 1000, but the packets that are sent carry no tunnel_key field.
# ip l add dev tun type gretap external
# ifconfig tun 1.1.1.7/24 up
# ip r r 2.2.2.11 via 1.1.1.11 dev tun encap ip id 1000 dst 172.168.0.1
Problem location: studying the iproute2 code, we found that the TUNNEL_KEY flag was not exposed to user space, so the iproute2 tool could not set TUNNEL_KEY on an lwtunnel route and the packets were built without a tunnel_key field.
Submit patches: we submitted patches to the kernel and to user-space iproute2 respectively to solve this problem:
iptunnel: make TUNNEL_FLAGS available in uapi
iproute: Set ip/ip6 lwtunnel flags
With the patches applied, the route can be set up as follows:
# ip r r 2.2.2.11 via 1.1.1.11 dev tun encap ip id 1000 dst 172.168.0.1 key
2. lwtunnel does not work for IP tunnels with a specified key
Problem found: to isolate tenant routing effectively, we create a tunnel_key-based gretap tunnel device for each tenant. As shown below, we create a gretap tunnel device for tunnel_key 1000 and add it to the tenant's VRF. The tunnel device receives packets correctly but cannot send them.
# ip l add dev tun type gretap key 1000
# ifconfig tun 1.1.1.7/24 up
# ip r r 2.2.2.11 via 1.1.1.11 dev tun encap ip id 1000 dst 172.168.0.1 key
Problem location: studying the kernel, we found that when an IP tunnel is in non-external mode, a lightweight tunnel route is not used for sending even if one is specified, so packets hit a routing error and are dropped.
Submit patches:
ip_tunnel: Make none-tunnel-dst tunnel port work with lwtunnel
With the patch applied, a non-external-mode IP tunnel that has no tunnel_dst specified can send packets using the lightweight tunnel route.
3. ARP does not work properly on an external-mode IP tunnel
Problem description: the peer IP tunnel sends an ARP request, but the tunnel header of the local ARP reply carries no tunnel_key field.
# ip l add dev tun type gretap external
# ifconfig tun 1.1.1.7/24 up
# ip r r 2.2.2.11 via 1.1.1.11 dev tun encap ip id 1000 dst 172.168.0.1 key
Problem location: studying the code, we found that when the tunnel receives the peer's ARP request and sends the ARP reply, it copies the tunnel information from the request but omits all of the tun_flags.
Submit patches:
iptunnel: Set tun_flags in the iptunnel_metadata_reply from src
4. Flow offload does not work with DNAT
Problem description: a firewall rule is created so that packets received on eth0 with destination address 2.2.2.11 are DNATed to 10.0.0.7. Flow offload does not work.
Problem location: analysis showed that, for the connection from client 1.1.1.7 to 2.2.2.7 that is DNATed to server 10.0.0.7, the reply packet (SYN+ACK) from the server used the wrong destination address to look up the reverse route:
daddr = ct->tuplehash[!dir].tuple.dst.u3.ip
Here dir is the reply direction, so daddr is the destination address of the original direction, 2.2.2.7. Because of DNAT, however, the route must be looked up with 10.0.0.7 rather than 2.2.2.7, so the lookup should use:
daddr = ct->tuplehash[dir].tuple.src.u3.ip
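The difference between the two expressions can be checked with a small Python model of a conntrack entry. This is only a sketch: the names `Tuple`, `make_dnat_conntrack` and the two lookup helpers are hypothetical, and the entry holds just source and destination addresses for the original and reply directions.

```python
from collections import namedtuple

# Simplified conntrack tuple: only the addresses matter for this example.
Tuple = namedtuple("Tuple", ["src", "dst"])

IP_CT_DIR_ORIGINAL = 0
IP_CT_DIR_REPLY = 1


def make_dnat_conntrack(client, public_ip, server):
    """Return [original_tuple, reply_tuple] for a DNATed connection."""
    original = Tuple(src=client, dst=public_ip)  # as sent by the client
    reply = Tuple(src=server, dst=client)        # as sent back by the server
    return [original, reply]


def daddr_before_fix(tuplehash, direction):
    """Models ct->tuplehash[!dir].tuple.dst: the pre-NAT destination."""
    return tuplehash[1 - direction].dst


def daddr_after_fix(tuplehash, direction):
    """Models ct->tuplehash[dir].tuple.src: the real, post-NAT address."""
    return tuplehash[direction].src


if __name__ == "__main__":
    ct = make_dnat_conntrack("1.1.1.7", "2.2.2.7", "10.0.0.7")
    print(daddr_before_fix(ct, IP_CT_DIR_REPLY))  # pre-DNAT public address
    print(daddr_after_fix(ct, IP_CT_DIR_REPLY))   # DNATed server address
```

Without NAT the two expressions return the same address, which is presumably why the problem only appeared once DNAT was configured.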
Submit patches:
netfilter: nft_flow_offload: Fix reverse route lookup
5. Flow offload does not work with VRF
Problem description: after the NICs eth0 and eth2 are added to a VRF, flow offload does not work.
# ip addr add dev eth0 1.1.1.1/24
# ip addr add dev eth2 1.1.1.1/24
# ip link add user1 type vrf table 1
# ip l set user1 up
# ip l set dev eth0 master user1
# ip l set dev eth2 master user1
Problem location: reading the code, we found that after the first packets in the original and reply directions enter the protocol stack, skb->dev is set to the VRF device user1, so the flow offload rule is created with user1 as its iif. However, the offload rules are installed on the ingress hooks of eth0 and eth2, so subsequent packets never match a flow rule on those hooks.
Submit patches:
netfilter: nft_flow_offload: fix interaction with vrf slave device
With this patch, the iif and oif of the flow offload rule are set from the results of the route lookups in both directions, which solves the problem.
6. VRF PREROUTING hook reentry problem
Problem description: a NIC is added to a VRF. The firewall's ingress rule accepts packets with destination address 2.2.2.11 and TCP destination port 22, and the egress rule drops packets with TCP destination port 22. Abnormal result: packets received for destination address 2.2.2.11 and TCP destination port 22 are dropped.
Problem location: we found that once a NIC joins a VRF, the packets it receives enter the PREROUTING hook twice. They enter it once when they enter the IP stack, and again after the VRF device takes them over. With the rules above, the first pass DNATs the destination to 10.0.0.7 in the rule-1000-ingress chain. On the second pass, because the packet has already been DNATed, it wrongly enters the rule-1000-egress chain and is dropped.
Submit patches: we added a match item to the kernel that identifies the kind of network device, so that user-space rules can skip the second, known-redundant pass. The following patches were submitted to the kernel and to user-space nftables respectively:
netfilter: nft_meta: Add NFT_META_I/OIFKIND meta type
meta: add iifkind and oifkind support
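The double pass can be reproduced in a small Python model. Note the assumptions: the article does not show the full rule set, so this sketch assumes that packets addressed to the public IP go to the ingress chain and everything else goes to the egress chain. The function names and the "gretap" device kind for the first pass are likewise illustrative.

```python
PUBLIC_IP = "2.2.2.11"
PRIVATE_IP = "10.0.0.7"


def prerouting(packet, iifkind, skip_vrf_pass):
    """One pass through the PREROUTING hook; may rewrite packet['dst']."""
    if skip_vrf_pass and iifkind == "vrf":
        # Models the rule: meta iifkind "vrf" counter accept
        return "accept"
    if packet["dst"] == PUBLIC_IP:
        # Assumed rule-1000-ingress: allow TCP 22 and DNAT to the tenant host.
        if packet["dport"] == 22:
            packet["dst"] = PRIVATE_IP
            return "accept"
        return "drop"
    # Assumed rule-1000-egress: all other traffic, with TCP 22 dropped.
    return "drop" if packet["dport"] == 22 else "accept"


def receive_on_vrf_port(packet, skip_vrf_pass):
    """A packet on a VRF slave port sees the hook twice: port, then VRF."""
    for iifkind in ("gretap", "vrf"):
        if prerouting(packet, iifkind, skip_vrf_pass) == "drop":
            return "drop"
    return "accept"


if __name__ == "__main__":
    before = {"dst": PUBLIC_IP, "dport": 22}
    after = {"dst": PUBLIC_IP, "dport": 22}
    print(receive_on_vrf_port(before, skip_vrf_pass=False))
    print(receive_on_vrf_port(after, skip_vrf_pass=True))
```

In this model the packet is dropped on the second pass unless the VRF pass is skipped, which mirrors the abnormal result described above.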
How to use it:
# nft add rule firewall rules-all meta iifkind "vrf" counter accept
Prototype verification
Finally, we used lwtunnel, VRF and flow offload to build and verify a prototype of the multi-tenant public network gateway. The verification process is as follows:
1. First create a prototype environment
a. netns cl simulates the public network client, with address 1.1.1.7 and tunnel source address 172.168.0.7, and configures the sending route.
b. netns ns1 simulates tenant 1, with internal address 10.0.0.7, public address 2.2.2.11, tunnel source address 172.168.0.11 and tunnel_key 1000, and configures the sending route.
c. netns ns2 simulates tenant 2, with internal address 10.0.0.7, public address 2.2.2.12, tunnel source address 172.168.0.12 and tunnel_key 2000, and configures the sending route.
d. The host simulates the public network gateway, with tunnel source address 172.168.0.1. It creates the tenant VRFs user1 and user2 and the tenant IP tunnels tun1 and tun2, and configures the forwarding routes.
The figure of the prototype environment is as follows:
2. Create firewall rules
a. Tenant 1: ingress allows TCP destination port 22 and ICMP, and egress forbids access to external TCP destination port 22.
b. Tenant 2: ingress allows TCP destination port 23 and ICMP, and egress forbids access to external TCP destination port 23.
c. Flow offload is enabled on the tenant devices tun1 and tun2.
As a result, the client can access the TCP port 22 service of user1 through 2.2.2.11, but user1 cannot access the client's TCP port 22 service. The client can access the TCP port 23 service of user2 through 2.2.2.12, but user2 cannot access the client's TCP port 23 service.
Once hardware offload capabilities mature and NIC vendors provide support, we will carry out further development and verification.