Why do we need ServiceMesh?
UCloud App Engine on Kubernetes (hereinafter "UAEK") is a Kubernetes-based compute resource delivery platform built inside UCloud, featuring high availability, cross-datacenter disaster recovery, automatic scaling, multi-dimensional monitoring, log collection, and simple operation and maintenance. Its aim is to use container technology to raise the efficiency of internal R&D and operations, so that developers can devote more energy to business development itself, while operations staff can handle daily work such as resource scaling, grayscale releases, version changes, monitoring, and alerting with greater ease.
Kubernetes was built for automated deployment, scaling, and container management, and once the UCloud UAEK team completed the research, design, and implementation of IPv6 networking, a mature container management platform was soon launched officially across multiple availability zones in the Beijing Region 2. Compared with the old practice of requesting virtual machines and deploying services onto them, Kubernetes brings real convenience: flexible automatic scaling, and a microservice architecture that achieves cross-availability-zone disaster recovery with simple configuration.
However, microservices bring many new problems to a system architecture: service discovery, monitoring, grayscale control, overload protection, request tracing, and so on. Teams accustomed to operating a Zookeeper cluster for service discovery and client-side load balancing may ask: can UAEK eliminate the burden of maintaining Zookeeper? To monitor a service's running state, everyone has had to add bypass reporting logic to their code: can UAEK provide monitoring and reporting with zero intrusion and zero coupling?
Moreover, many system modules have historically lacked circuit-breaker protection and were paralyzed the moment traffic peaked: can UAEK spare the business side a large-scale rewrite? Troubleshooting, especially pinpointing slow hops in a call chain, has always been time-consuming and laborious: can UAEK offer a convenient tool for locating bottlenecks?
Clearly, a stable Kubernetes platform alone is not enough to solve these problems. So from the very start of the UAEK project, the team treated ServiceMesh as a goal that had to be achieved: any TCP backend service deployed on UAEK should enjoy the following ServiceMesh capabilities:
SideCar-mode deployment with zero intrusion: microservice governance code and business code are completely decoupled
Service discovery and load-balancing scheduling integrated with the Kubernetes platform
Flexible, real-time, restart-free grayscale traffic management based on layer-7 business information
A unified abstract data-reporting API layer for monitoring and access policy control
Distributed request tracing to quickly chase bugs and locate system performance bottlenecks
Overload protection that automatically triggers circuit breaking when requests exceed the system's designed capacity
Fault-injection drill scripts before a service goes live, so fault handling can be rehearsed in advance
This way, after deploying a service on UAEK, you can start small, release in grayscale by account, and through continuous monitoring and observation comfortably manage rolling back an abnormal version, widening the grayscale range, full release, overload protection, and locating and tracing abnormal requests.
Why Istio?
For the ServiceMesh implementation, we focused on Istio. Through earlier research and testing, we found several characteristics of Istio that fit UAEK's needs well:
Perfect support for Kubernetes platform
Separation of control plane and data forwarding plane
Sidecar deployment that takes over all inter-service call traffic, with unrestricted control
Envoy as the Sidecar implementation: Envoy is written in C++11 on an event-driven, multithreaded model, with performance and concurrency comparable to NGINX
Zero intrusion into business code and configuration files
Simple configuration, convenient operation, and a complete API
[Figure: Istio architecture (image: cdn.xitu.io/2018/8/23/16565d6f6dfeaed9?w=646&h=507&f=png&s=55105)]
The whole service mesh is divided into two parts: the control plane and the data plane. The data plane is the Envoy container injected into each application Pod, responsible for scheduling all traffic between modules. The control plane consists of three modules, Pilot, Mixer, and Citadel, with the following functions:
Pilot obtains and watches the entire cluster's service discovery information from the Kubernetes API and pushes it, together with user-defined routing rules and policies, down to Envoy.
Mixer has two submodules: Policy provides Envoy with admission policy control, blacklist/whitelist control, and QPS rate-limiting; Telemetry provides Envoy with data reporting and log collection services for monitoring, alerting, and log queries.
Citadel provides authentication and authorization, credential management, and RBAC for services and users.
In addition, Istio gives operators a command-line tool called istioctl, similar to Kubernetes's kubectl. After writing a routing-rule YAML file, operations staff can submit it to the cluster with istioctl.
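For illustration, a grayscale routing rule submitted this way might look as follows. This is only a sketch in the v1alpha3 API discussed later in this article; the service name reviews, the v1/v2 subsets, and the 90/10 weights are hypothetical, and a DestinationRule defining those subsets is assumed to exist.

```yaml
# Hypothetical VirtualService: send 90% of traffic to subset v1
# and 10% to subset v2 of a service named "reviews".
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: reviews-canary
spec:
  hosts:
  - reviews
  http:
  - route:
    - destination:
        host: reviews
        subset: v1
      weight: 90
    - destination:
        host: reviews
        subset: v2
      weight: 10
```

An operator would submit this with istioctl (for example, istioctl create -f reviews-canary.yaml in releases of that era) and gradually raise the v2 weight as the grayscale proves out.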
The principles and inner workings of Istio are quite complex, and the technology stack involved has real depth and breadth; here is only a summary of the overall flow:
Operations staff use istioctl or call the API to create or modify routing rules and policies at the control layer.
Pilot obtains and watches the cluster's service discovery information from the Kube APIServer.
When an application is deployed, Istio injects an Envoy container into the Pod's deployment configuration, and Envoy hijacks and proxies all TCP traffic in the Pod via iptables nat redirect.
Envoy receives real-time updates of the cluster's service discovery information, routing rules, and policies from Pilot, and intelligently schedules traffic within the cluster accordingly.
Before sending each request, Envoy issues a Check request to Mixer Policy to verify whether the request is subject to a policy or quota limit; after each request it reports the request's basic information to Mixer Telemetry, such as whether the call succeeded, the returned status code, and the latency.
Citadel implements mutual TLS, including client certificate generation and injection, server-side key and certificate injection, and K8S RBAC access control.
Adapting Istio to the UAEK environment
After the research above and a series of tests, the UAEK team fully embraced Istio's design philosophy and potential value, and hoped that Istio's rich, powerful microservice governance features would attract more internal teams to migrate their services to the UAEK environment.
In reality, however, adopting Istio on UAEK was not plain sailing. When we first began investigating Istio it was still at version 0.6: its features were incomplete, and it could not be used out of the box in the UAEK environment.
Solving the IPv6 problem
The first problem we hit: UAEK is a pure IPv6 network environment, while Istio's support for IPv6 traffic was incomplete, and some components could not even be deployed in an IPv6 environment.
Before introducing the specific changes, let's look at how an Istio Sidecar takes over the traffic of the business process.
Istio injects two containers into the application Pod: a proxy-init container and an envoy container. The proxy-init container initializes iptables, redirecting all TCP-layer traffic via nat REDIRECT to port 15001, where Envoy listens. Taking inbound traffic as an example: after Envoy's service port receives the redirected TCP connection, it uses the getsockopt(2) system call with the SO_ORIGINAL_DST option to recover the connection's real destination IP, and forwards the request there.
However, we found that in the IPv6 environment Envoy could not hijack Pod traffic at all. Packet captures and source tracing showed that when a Pod starts, it first runs an iptables initialization script that sets up nat redirect inside the Pod, hijacking all inbound and outbound TCP traffic in the container to Envoy's listening port. But this initialization script had no corresponding ip6tables operations and dropped all IPv6 traffic outright, so we modified it to hijack IPv6 traffic as well; the sketch below shows the essence of the change.
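Conceptually, the redirect and the fix boil down to rules like the following. This is a simplified sketch, not the actual proxy-init script (which, among other things, also excludes Envoy's own traffic by UID to avoid redirect loops); port 15001 is the Envoy listener mentioned above.

```sh
# IPv4 redirect as the original init script set it up (simplified):
# send all inbound and outbound TCP to Envoy's listener on 15001.
iptables  -t nat -A PREROUTING -p tcp -j REDIRECT --to-ports 15001
iptables  -t nat -A OUTPUT     -p tcp -j REDIRECT --to-ports 15001

# The original script had no ip6tables counterpart, so IPv6 traffic
# was never redirected; the fix adds the equivalent rules.
ip6tables -t nat -A PREROUTING -p tcp -j REDIRECT --to-ports 15001
ip6tables -t nat -A OUTPUT     -p tcp -j REDIRECT --to-ports 15001
```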
No sooner had one wave subsided than another rose. With IPv6 hijacking in place, we found that all TCP traffic to the business service port was reset by Envoy, and port 15001 was not even open inside the Envoy container. Tracing through the Envoy and Pilot source, we found that the listen address Pilot pushed to Envoy was the IPv4 wildcard 0.0.0.0, while we needed Envoy to listen on [::0], so we went on to modify the Pilot source.
After these efforts, the application server Pod could finally accept the TCP connections we initiated. But our connections were soon closed by the server: the client received a TCP FIN segment immediately after connecting, and requests still failed. Envoy's runtime logs showed that after receiving a TCP request, Envoy could not find a corresponding layer-4 traffic filter (Filter).
Digging deeper into the source, we found that Envoy relies on the getsockopt(2) system call to obtain the real destination address of a hijacked request, but the relevant implementation had a bug in IPv6 environments: because it never checked the socket fd's address family, it always called getsockopt(2) with the IPv4 parameters, so Envoy could not find the request's real destination address, reported an error, and immediately closed the client connection. A sketch of the family-aware lookup follows.
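Here is a minimal sketch of what a family-aware original-destination lookup has to do. It is illustrative, not Envoy's actual code, and assumes Linux's netfilter socket options SO_ORIGINAL_DST (IPv4) and IP6T_SO_ORIGINAL_DST (IPv6).

```cpp
#include <netinet/in.h>
#include <sys/socket.h>
#include <linux/netfilter_ipv4.h>             // SO_ORIGINAL_DST
#include <linux/netfilter_ipv6/ip6_tables.h>  // IP6T_SO_ORIGINAL_DST

// Recover the pre-redirect destination of a connection hijacked by
// iptables/ip6tables REDIRECT. Returns 0 on success, -1 on failure.
static int original_dst(int fd, sockaddr_storage* orig) {
  sockaddr_storage local{};
  socklen_t len = sizeof(local);
  // Check the socket's address family instead of assuming IPv4 --
  // the missing step behind the bug described above.
  if (getsockname(fd, reinterpret_cast<sockaddr*>(&local), &len) != 0) {
    return -1;
  }
  socklen_t orig_len = sizeof(*orig);
  if (local.ss_family == AF_INET6) {
    return getsockopt(fd, SOL_IPV6, IP6T_SO_ORIGINAL_DST, orig, &orig_len);
  }
  return getsockopt(fd, SOL_IP, SO_ORIGINAL_DST, orig, &orig_len);
}
```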
Once the problem was found, the UAEK team immediately modified the Envoy source to make the SO_ORIGINAL_DST path of getsockopt(2) IPv6-compatible, then submitted the change to the Envoy open source community, where it was merged into the master branch and shipped in Istio 1.0's Envoy image.
At this point, the Istio SideCar could finally schedule traffic between services in UAEK's IPv6 environment.
In addition, we found array out-of-bounds crashes when Pilot, Mixer, and other modules handled IPv6-format addresses, and fixed them one by one.
Performance evaluation
Before the Istio 1.0 release, performance had always been the focus of industry criticism. We first examined whether inserting Envoy adds an extra copy of each request's traffic, and whether the Check request issued to Mixer Policy before every request would add unacceptable latency for the business. After extensive testing, we found that latency in the UAEK environment is about 5ms higher than without Istio, which is perfectly acceptable for most internal services.
Next we examined the architecture of the whole Istio mesh and concluded that Mixer Policy and Mixer Telemetry could easily become the performance weak point of the entire cluster. Because Envoy must issue a Check request to the Policy service before every request, this both adds latency to the business request itself and piles load onto Policy as a single point. Testing with HTTP/1.1 requests as a sample, we found that once mesh-wide QPS reached 2000-3000, Policy hit a serious load bottleneck: the latency of all Check requests rose sharply, from the normal 2-3ms to 100-150ms, severely inflating the latency of every business request. This result was clearly unacceptable.
Worse, in Istio 0.8 and earlier Policy was a stateful service: features such as global QPS rate-limit quotas required a single Policy process to keep real-time data for the entire mesh, which means Policy could not relieve its bottleneck by scaling out horizontally. After weighing the trade-offs, we have for now turned the Policy service off and trimmed some features, such as global QPS quota limits; a sketch of the switch follows.
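For reference, in Istio of that era the global switch lives in the mesh configuration. The fragment below is a sketch assuming the default istio ConfigMap in the istio-system namespace; the option name matches the 0.8/1.0 releases and has moved in later versions.

```yaml
# Sketch: stop Envoy from calling Mixer Policy before every request.
apiVersion: v1
kind: ConfigMap
metadata:
  name: istio
  namespace: istio-system
data:
  mesh: |
    # 0.8/1.0-era mesh option; relocated in later releases.
    disablePolicyChecks: true
```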
As mentioned earlier, Mixer Telemetry is mainly responsible for collecting per-request call information reported by Envoy. Telemetry in 0.8 also had serious performance problems: in stress tests, once cluster QPS exceeded 2000, the memory usage of Telemetry instances soared.
Analysis showed that Telemetry's memory growth occurs because the various backend Adapters consume data more slowly than Envoy reports it, so data not yet processed by an Adapter piles up rapidly in memory. We immediately removed Istio's bundled but impractical stdio log collection feature (see the sketch below), which greatly relieved the problem. Happily, the Istio 1.0 release solved Telemetry's in-memory backlog; under the same test conditions, a single Telemetry instance can comfortably handle collection and reporting at 35,000 QPS.
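For reference, in the default install of that era the stdio pipeline was wired up through Mixer rules in istio-system. Removing it looks roughly like this; the rule names stdio and stdiotcp are taken from the 0.8/1.0-era default install, so treat them as an assumption.

```sh
# Delete the default Mixer rules that route every request's log entry
# through the stdio adapter (names from the default install of that era).
kubectl -n istio-system delete rule stdio stdiotcp
```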
Problems, hope, and the future
After weathering one problem after another, a production-ready ServiceMesh has finally come online in the UAEK environment. Along the way, influenced by the UAEK team, other teams in the department began studying Istio's concepts and trying Istio in their projects. Still, there remains a gap between the current state and our original vision.
Istio is iterating rapidly: both Istio itself and Envoy are evolving and updating daily. Each version brings more powerful features, cleaner API definitions, and a more complex deployment architecture. From 0.7.1 to 0.8, the new v1alpha3 routing rules were completely incompatible with the previous API, and the new VirtualService is entirely different from the original RouteRule, causing plenty of trouble for every user.
Officially there is still no complete, smooth upgrade plan that avoids any negative impact on a live deployment when upgrading Istio. And although every component's performance improved markedly from 0.8 to 1.0, industry feedback suggests it does not yet satisfy everyone, and it remains to be seen how far Mixer's Check caching can relieve the pressure on Policy.
It is worth mentioning that many of the bugs we found were also being discovered by other developers in the community and solved one by one. To our delight, the UAEK team is not an information island: we can feel the Istio community iterating at high speed, constantly working on the issues developers care about, and the issues we filed were answered within hours. All of this convinces us that Istio is a promising project that will be as successful as Kubernetes.
From the experience of UAEK's early users, it is currently impossible to use Istio correctly without studying the Istio documentation in depth. UAEK needs to focus on simplifying this process; our next vision is to let users customize their routing rules through a foolproof, UI-driven experience.
The UAEK team has always been committed to reforming UCloud's internal R&D process, making R&D more efficient and operations less stressful, so that everyone works happily. Besides continuing to improve ServiceMesh features, UAEK will open more regions and availability zones in the second half of the year, provide a more feature-rich console, and release automated code management and packaging as well as continuous integration (CI/CD) features.
About the author
Chen Sui, senior R&D engineer at UCloud, has been responsible for the development of monitoring systems, Serverless products, and the PaaS platform ServiceMesh, and has rich experience in distributed systems development.