What are the 10 common exceptions in Istio

2025-07-01 Update. From: SLTechnology News&Howtos (Shulou.com), 05/31 Report

This article explains 10 common anomalies encountered when using Istio, together with their causes and recommended practices.

1. Service port naming constraint

Istio supports multiple platforms, but its compatibility with K8s is the best, in terms of design philosophy, core team, and community. However, Istio's adaptation to K8s is not entirely conflict-free. A typical problem is that Istio requires the ports of a K8s Service to be named according to their protocol.

Traffic anomalies caused by port names that do not satisfy this constraint are the most common problem when using the mesh. The symptom is that protocol-related traffic control rules do not take effect. This can usually be located by checking the filter type of the port's listener in LDS.

Reason

The K8s network is not aware of the application layer. The main traffic forwarding logic of K8s runs on the node and is implemented by iptables/ipvs. These rules do not care which protocol is used at the application layer.

The core capability of Istio is managing layer-7 traffic, but the prerequisite is that Istio knows which protocol each managed service speaks. Istio delivers different traffic control functions (Envoy filters) depending on the port protocol. K8s resource definitions do not carry layer-7 protocol information, so users must provide it to Istio explicitly.

Istio's solution: Protocol sniffing

Protocol sniffing summary:

Detect the TLS CLIENT_HELLO to extract SNI, ALPN, NPN, and other information.

Based on the known typical structure of common protocols, try to detect the protocol from the application-layer plaintext:

a. Based on the HTTP/2 spec (Connection Preface), judge whether it is HTTP/2.

b. Based on the HTTP header structure, judge whether it is HTTP/1.x.

A timeout and a limit on the number of inspected bytes apply during sniffing. If the protocol cannot be determined, the traffic is handled as TCP by default.

Best practice

Protocol sniffing reduces the configuration that beginners need in order to use Istio, but it may lead to non-deterministic behavior. Non-deterministic behavior should be avoided as much as possible in a production environment.

Some examples of sniffing failures:

The client and server use a non-standard variant of a layer-7 protocol that both ends can parse correctly, but there is no guarantee that Istio's automatic sniffing logic recognizes it. For example, the standard line separator in HTTP is CRLF (0x0d 0x0a), but most HTTP libraries also use and recognize LF (0x0a) as the separator.

For some custom proprietary protocols, the beginning of the data stream resembles an HTTP message, but the subsequent data is in a custom format. With sniffing disabled, the data stream is routed as L4 TCP, which matches user expectations. With sniffing enabled, the data stream is initially identified as L7 HTTP, but the subsequent data does not conform to the HTTP format, and the traffic is interrupted.
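The sniffing steps and the first failure case above can be sketched as follows. This is a simplified illustration, not Envoy's actual implementation: the function name `sniff_protocol` and the byte limit `MAX_SNIFF_BYTES` are assumptions made for the example, and the HTTP/1.x check is deliberately strict about CRLF.

```python
# Simplified sketch of protocol sniffing; NOT Envoy's real detection logic.
HTTP2_PREFACE = b"PRI * HTTP/2.0\r\n\r\nSM\r\n\r\n"  # HTTP/2 connection preface
HTTP1_METHODS = (b"GET", b"HEAD", b"POST", b"PUT", b"DELETE",
                 b"CONNECT", b"OPTIONS", b"TRACE", b"PATCH")
MAX_SNIFF_BYTES = 1024  # hypothetical limit on inspected bytes


def sniff_protocol(first_bytes: bytes) -> str:
    """Guess the protocol from the first bytes of a connection."""
    data = first_bytes[:MAX_SNIFF_BYTES]

    # TLS record header: content type 0x16 (handshake), major version 0x03,
    # then the handshake type 0x01 (ClientHello) at offset 5.
    if len(data) >= 6 and data[0] == 0x16 and data[1] == 0x03 and data[5] == 0x01:
        return "tls"

    if data.startswith(HTTP2_PREFACE):
        return "http2"

    # Strict HTTP/1.x request line: "<METHOD> <target> HTTP/1.x" ending in CRLF.
    line, crlf, _ = data.partition(b"\r\n")
    parts = line.split(b" ")
    if (crlf and len(parts) == 3 and parts[0] in HTTP1_METHODS
            and parts[2].startswith(b"HTTP/1.")):
        return "http"

    # Nothing matched: fall back to plain TCP, as the summary describes.
    return "tcp"
```

With this sketch, `sniff_protocol(b"GET / HTTP/1.1\r\nHost: a\r\n\r\n")` yields `"http"`, while the same request written with bare LF separators falls back to `"tcp"`, which is the kind of mismatch the first failure example describes.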

It is recommended not to rely on protocol sniffing in a production environment. Services connected to the mesh should name their ports with the protocol prefix, as the convention requires.
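A minimal sketch of checking this naming convention before a Service joins the mesh is shown below. It assumes the port-name form `<protocol>` or `<protocol>-<suffix>` and a protocol list taken from Istio's protocol-selection documentation; verify the list against the Istio version in use. The helper names are hypothetical.

```python
# Protocol prefixes assumed from Istio's protocol-selection documentation.
# "grpc-web" is listed before "grpc" so the longer prefix is matched first.
KNOWN_PROTOCOLS = ("grpc-web", "grpc", "http2", "https", "http",
                   "mongo", "mysql", "redis", "tcp", "tls", "udp")


def protocol_from_port_name(name):
    """Return the protocol for '<protocol>' or '<protocol>-<suffix>', else None."""
    for proto in KNOWN_PROTOCOLS:
        if name == proto or name.startswith(proto + "-"):
            return proto
    return None


def misnamed_ports(service):
    """List port numbers of a Service (as a dict) whose name carries no protocol."""
    bad = []
    for port in service.get("spec", {}).get("ports", []):
        if protocol_from_port_name(port.get("name", "")) is None:
            bad.append(port["port"])
    return bad
```

For a Service with ports named `http-api` (80), `api` (8080), and an unnamed port 9090, `misnamed_ports` reports 8080 and 9090.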

2. Delivery order of traffic control rules

Exception description

When traffic rules are updated in batches, traffic anomalies occasionally occur. RESPONSE_FLAGS in the Envoy log contains the "NR" flag (No route configured). The anomaly does not last long and recovers automatically.

Cause analysis

When a user runs kubectl apply -f multiple-virtualservice-destinationrule.yaml, the order in which these objects propagate and take effect is not guaranteed; this is so-called eventual consistency. For example, a VirtualService references a subset defined in a DestinationRule, but the propagation and activation of that DestinationRule may lag behind the VirtualService.

Best practice: make before break

Split the update from a single batch step into multiple steps, so that a non-existent subset is never referenced at any point in the process:

When adding a DestinationRule subset, first apply the DestinationRule subset, wait for the subset to take effect, and then apply the VirtualService that references the subset.

When deleting a DestinationRule subset, first delete the reference to the subset in the VirtualService, wait for the VirtualService change to take effect, and then delete the DestinationRule subset.
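The make-before-break rule can be checked mechanically before each step. The sketch below works on manifests loaded as Python dicts and uses the VirtualService and DestinationRule field names (`spec.http[].route[].destination.subset`, `spec.subsets[].name`). For brevity it assumes both objects target the same host and only looks at HTTP routes; the function names are hypothetical.

```python
def referenced_subsets(virtual_service):
    """Collect the subset names referenced by a VirtualService's HTTP routes."""
    subsets = set()
    for http_route in virtual_service.get("spec", {}).get("http", []):
        for route in http_route.get("route", []):
            subset = route.get("destination", {}).get("subset")
            if subset:
                subsets.add(subset)
    return subsets


def defined_subsets(destination_rule):
    """Collect the subset names defined by a DestinationRule."""
    spec = destination_rule.get("spec", {})
    return {s["name"] for s in spec.get("subsets", [])}


def dangling_subsets(virtual_service, destination_rule):
    """Subsets the VirtualService references but the DestinationRule lacks."""
    return referenced_subsets(virtual_service) - defined_subsets(destination_rule)
```

If `dangling_subsets` returns a non-empty set for the state that would exist after a step, that step should be postponed until the DestinationRule change has taken effect.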

3. Request interrupt analysis

Is a request exception caused by Istio traffic control rules or returned by the business application, and at which specific pod does the traffic break?

This is the most common dilemma when using the mesh. After Envoy is introduced as a proxy between microservices, when traffic does not behave as expected, it is difficult for users to quickly determine where the problem lies. An abnormal response received by the client, such as 403, 404, 503, or a broken connection, may be the result of traffic control performed by any sidecar on the path, but it may also be a legitimate response from a service.

Envoy traffic model

Traffic that Envoy receives is called Downstream, and traffic that Envoy sends out is called Upstream. Handling Downstream and Upstream traffic each involves two traffic endpoints, that is, the originator and the receiver of the request.

In this process, Envoy computes, according to the user's rules, the set of eligible destination hosts for forwarding, which is called UPSTREAM_CLUSTER. According to the load-balancing rules, it then selects one host from this set as the receiving endpoint, and this host is UPSTREAM_HOST.

The following five fields form the five-tuple of the traffic that Envoy handles, which is the most important part of the Envoy log. With this five-tuple, we can see exactly where traffic comes from and where it goes.

UPSTREAM_CLUSTER

DOWNSTREAM_REMOTE_ADDRESS

DOWNSTREAM_LOCAL_ADDRESS

UPSTREAM_LOCAL_ADDRESS

UPSTREAM_HOST

Log analysis examples

Focus on two pieces of information in the log:

Where is the breakpoint?

What is the reason?

Example 1: a normal client-server request

The logs on both sides contain the same request ID, so the traffic on the two sides can be correlated.

Example 2: no healthy upstream, for example when the target deployment has 0 healthy replicas

The flag "UH" in the log indicates that there is no healthy host in the upstream cluster.

Example 3: no route configured, for example when the DestinationRule lacks the corresponding subset

The flag "NR" in the log indicates that no route was found.

Example 4: upstream connection failure, for example when the service is not listening on the port properly

The flag "UF" in the log indicates that the upstream connection failed, which pinpoints the location of the traffic breakpoint.
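As a small aid for reading such logs, the three flags from the examples can be mapped to their meanings. The sketch below assumes the RESPONSE_FLAGS field has already been extracted from the log line as a string, with "-" meaning no flags and multiple flags separated by commas; the function name is hypothetical and only the flags discussed in this article are covered.

```python
# Meanings of the response flags discussed in the examples above.
FLAG_MEANINGS = {
    "UH": "no healthy host in the upstream cluster",
    "NR": "no route configured",
    "UF": "upstream connection failure",
}


def explain_response_flags(flags):
    """Translate a RESPONSE_FLAGS value such as 'UF' or 'UF,URX' into text."""
    if flags in ("", "-"):
        return []  # no flags: the proxy recorded nothing abnormal
    return [(flag, FLAG_MEANINGS.get(flag, "not covered in this article"))
            for flag in flags.split(",")]
```

For example, `explain_response_flags("NR")` returns `[("NR", "no route configured")]`.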

4. Startup order of sidecar and user container

Exception description

The sidecar pattern is very popular in the Kubernetes world, but Kubernetes at the time of writing (v1.17) has no built-in concept of a sidecar; the role of a sidecar container is assigned subjectively by the user.

A common problem for Istio users is the startup order of the sidecar and user containers:

The startup order of the sidecar (Envoy) and the user container is not deterministic. If the user container starts first and sends a request before Envoy has started, the request is still intercepted and redirected to the not-yet-started Envoy, and the request fails.

Similar exceptions occur during the termination phase of a Pod. They are likewise rooted in the undetermined life-cycle ordering between the sidecar and normal containers.

Solution

At present, the usual workarounds are:

Delay the start of the business container by a few seconds, or retry on failure.

Actively check in the startup script whether Envoy is ready, for example via 127.0.0.1:15020/healthz/ready.
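The second workaround can be sketched as a small polling loop that runs before the business process starts. The URL is the readiness endpoint quoted above; the retry count, the delay, and the function names are assumptions chosen for illustration.

```python
import time
import urllib.request

READY_URL = "http://127.0.0.1:15020/healthz/ready"  # endpoint quoted in the text


def envoy_ready(url=READY_URL, timeout=1.0):
    """Return True if the readiness endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers refused connections, timeouts, and urllib's URLError/HTTPError.
        return False


def wait_for_sidecar(probe=envoy_ready, attempts=30, delay=1.0, sleep=time.sleep):
    """Poll until the probe succeeds; give up after the given number of attempts."""
    for _ in range(attempts):
        if probe():
            return True
        sleep(delay)
    return False
```

A startup script would call `wait_for_sidecar()` and launch the application only if it returns `True`. The `probe` and `sleep` parameters exist so the loop can be exercised without a running sidecar.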

Both workarounds are clumsy. To solve these pain points thoroughly, a built-in Sidecar feature was proposed for Kubernetes, originally targeted at version 1.18 (it did not ship in that release; native sidecar containers arrived later, as an alpha feature in Kubernetes 1.28). The design ensures that the sidecar starts and is running before the normal business process starts: the Pod startup life cycle is changed so that sidecar containers start after the init containers complete, and business containers start only after the sidecar containers are ready. In the Pod termination phase, SIGTERM is sent to the sidecar containers only after all ordinary containers have terminated.

5. Ingress Gateway and Service port linkage

A common reason why Ingress Gateway rules do not take effect is that the listening port of the Gateway is not opened on the corresponding K8s Service. First, we need to understand the relationship between the Istio Ingress Gateway and the K8s Service.

In the illustrated example, although the Gateway defines ports b and c as the ports it expects to control, its corresponding Service (via Tencent Cloud CLB) only opens ports a and b, so only the inbound traffic arriving on port b of the LB can ultimately be controlled by the Istio Gateway.

There is no direct association between an Istio Gateway and a K8s Service. Both bind to pods through a selector, which associates them indirectly.

The Istio Gateway CRD only delivers the user's traffic control rules to the mesh edge nodes. Traffic still has to enter the mesh through the LB.
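The port mismatch described above amounts to a set comparison. The sketch below takes a Gateway and a Service as dicts and compares `spec.servers[].port.number` with `spec.ports[].port`. It assumes the two objects select the same gateway pods and that Gateway port numbers correspond to Service ports; the function name and the sample port numbers are hypothetical.

```python
def gateway_port_coverage(gateway, service):
    """Return (ports usable end to end, Gateway ports not opened on the Service)."""
    gateway_ports = {server["port"]["number"]
                     for server in gateway["spec"]["servers"]}
    service_ports = {port["port"] for port in service["spec"]["ports"]}
    return gateway_ports & service_ports, gateway_ports - service_ports
```

With a Gateway listening on 443 and 8443 and a Service exposing 80 and 443, the function returns `({443}, {8443})`: only port 443 is effective, and 8443 is the port that still needs to be opened on the Service.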

Tencent Cloud TKE Mesh implements dynamic port linkage between the Gateway and Service definitions, so that users can focus on the configuration inside the mesh.

6. VirtualService scope

VirtualService contains most of the traffic rules on the outbound side. It can be applied not only to the data-plane proxies inside the mesh, but also to the proxies at the edge of the mesh.

The gateways property of VirtualService specifies the scope in which the VirtualService takes effect:

If VirtualService.gateways is empty, Istio assigns it the default value mesh, which means the scope is the inside of the mesh.

If you want the VirtualService to apply to specific edge gateways, you need to explicitly assign a value: gateway-name1,gateway-name2...

If you want the VirtualService to apply both inside the mesh and at edge gateways, you need to explicitly add the value mesh to VirtualService.gateways, for example: mesh,gateway-name1,gateway-name2...

A common problem arises in the third case. A VirtualService initially takes effect inside the mesh. Later, to extend its rules to an edge gateway, the user adds only the specific gateway name and omits mesh, so the rules stop applying inside the mesh.

Istio sets a default value for VirtualService.gateways automatically in order to simplify user configuration, but this often leads to misuse, and a feature ends up being perceived as a bug.
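The defaulting rule for `gateways` can be written down in a few lines. The sketch below operates on a VirtualService loaded as a dict; the function names are hypothetical.

```python
def effective_gateways(virtual_service):
    """Return the gateways a VirtualService applies to, with the 'mesh' default."""
    gateways = virtual_service.get("spec", {}).get("gateways") or []
    return list(gateways) if gateways else ["mesh"]


def applies_to_sidecars(virtual_service):
    """True if the rules take effect on sidecars inside the mesh."""
    return "mesh" in effective_gateways(virtual_service)
```

A VirtualService without `gateways` applies to sidecars. Once `gateways: [gateway-name1]` is set, `applies_to_sidecars` becomes `False` unless `mesh` is listed as well, which is exactly the pitfall described above.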

7. VirtualService does not support host fragmentation

Exception case

After a VirtualService is added or modified for a host, its rules never take effect. Other VirtualService objects already apply rules to the same host. The rules may not conflict with each other, yet some of them still do not take effect.

Background

The rules in VirtualService are aggregated by host.

As the business grows, the content of a VirtualService grows rapidly, and the traffic control rules of one host may be maintained by different teams. For example, security rules are separated from business rules, and different businesses are separated by sub-path.

Current Istio support for spreading rules across VirtualService resources:

At the edge of the mesh (gateway), the traffic control rules of the same host may be distributed over multiple VirtualService objects, and Istio aggregates them automatically. The result depends on the definition order, and users must avoid conflicts themselves.

Inside the mesh (for sidecars), the traffic control rules of the same host cannot be distributed over multiple VirtualService objects. If multiple VirtualService objects exist for the same host, only the first one takes effect, and there is no conflict detection.
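Since there is no built-in conflict detection inside the mesh, a simple pre-check can flag hosts that are claimed by more than one mesh-scoped VirtualService. The sketch below works on manifests loaded as dicts and compares host strings literally; it ignores namespaces and short-name resolution, which is a simplifying assumption. The function name is hypothetical.

```python
from collections import defaultdict


def mesh_hosts_with_multiple_virtualservices(virtual_services):
    """Map each host to the mesh-scoped VirtualServices that claim it, if > 1."""
    by_host = defaultdict(list)
    for vs in virtual_services:
        spec = vs.get("spec", {})
        gateways = spec.get("gateways") or ["mesh"]  # empty means 'mesh'
        if "mesh" not in gateways:
            continue  # gateway-only rules are aggregated by Istio at the edge
        for host in spec.get("hosts", []):
            by_host[host].append(vs["metadata"]["name"])
    return {host: names for host, names in by_host.items() if len(names) > 1}
```

A non-empty result lists the hosts for which only the first VirtualService would take effect on sidecars.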

VirtualService does not support host rule fragmentation well, so the maintenance responsibilities of teams cannot be cleanly decoupled. Whoever changes the configuration needs to know all the traffic control rules of the target host before they can modify a VirtualService with confidence.

Istio solution: Virtual Service chaining (planned for 1.6)

Istio plans to support Virtual Service proxy chains in 1.6:

Virtual Service supports fragmented definitions plus a proxy chain.

Teams can split the Virtual Service of the same host flexibly, for example by SecOps/NetOps/Business concerns, with each team maintaining its own independent Virtual Service.

8. Distributed tracing is not fully transparent

Exception case

After microservices are connected to the service mesh, the trace data is not linked together.

Reason

In the service mesh telemetry system, call-chain tracing is not completely non-intrusive. It requires a small change in the user's business code. Specifically, when the application issues an outgoing (http/grpc) RPC, it must actively copy the B3 trace headers present in the inbound request into the headers of the outgoing RPC request. These headers include x-request-id, x-b3-traceid, x-b3-spanid, x-b3-parentspanid, x-b3-sampled, and x-b3-flags.

Some users find this hard to understand: since inbound and outbound traffic is already fully intercepted by Envoy, and Envoy can fully control and modify traffic, why does the application have to propagate headers explicitly?

For Envoy, inbound requests and outbound requests are completely independent, and Envoy cannot perceive the correlation between them. In fact, whether these requests are related is entirely up to the application. Take a special business scenario: Pod X receives request A, and the triggered business logic is to send a request to Pod Y every 10 seconds, such as B1, B2, and B3. What is the relationship between these fanned-out requests Bx and request A? The business may decide either way: treat A as the parent request of Bx, or treat each Bx as an independent top-level request.
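The header propagation described in this section can be sketched as a helper that the application calls when it builds an outgoing request. The header names follow the list in Istio's distributed tracing documentation (x-request-id plus the x-b3-* family); the function name is hypothetical, and real services would usually do this in a middleware or through a tracing client library.

```python
# Trace headers to forward, as listed in Istio's distributed tracing docs.
TRACE_HEADERS = (
    "x-request-id",
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
    "x-b3-flags",
)


def propagate_trace_headers(inbound_headers, outbound_headers=None):
    """Copy trace headers from the inbound request into the outgoing headers."""
    outgoing = dict(outbound_headers or {})
    # HTTP header names are case-insensitive, so normalize before the lookup.
    lowered = {name.lower(): value for name, value in inbound_headers.items()}
    for name in TRACE_HEADERS:
        if name in lowered:
            outgoing[name] = lowered[name]
    return outgoing
```

Whether to call this helper for a given outgoing request is the business decision discussed above: forwarding the headers makes the outgoing request a child of the inbound one, and omitting them makes it an independent top-level request.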

9. mTLS causes connection interruption

In user scenarios where Istio mTLS is enabled, connection termination during access is a high-frequency exception.

The cause of this exception is related to the mTLS configuration in DestinationRule, which is a weak point in Istio's interface design.

When global mTLS is enabled via MeshPolicy and no DestinationRule is defined in the mesh, mTLS works properly.

If a DestinationRule is later added to the mesh, the DestinationRule can override the mTLS setting for its subsets (the default is not enabled!). When using DestinationRule, users tend to pay little attention to the mTLS attribute and leave it blank. As a result, after the DestinationRule is added, mTLS becomes disabled for that traffic, and connections are terminated.

To fix this problem, the user has to add the mTLS property to every DestinationRule and set it to enabled.
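A minimal sketch of auditing DestinationRule objects for this problem is shown below. It assumes that "enabled" corresponds to `trafficPolicy.tls.mode: ISTIO_MUTUAL` in the DestinationRule API, and it checks only the top-level traffic policy, not per-subset overrides. The function name is hypothetical.

```python
def destination_rules_without_mtls(destination_rules):
    """Names of DestinationRules whose top-level TLS mode is not ISTIO_MUTUAL."""
    missing = []
    for rule in destination_rules:
        tls = rule.get("spec", {}).get("trafficPolicy", {}).get("tls", {})
        if tls.get("mode") != "ISTIO_MUTUAL":
            missing.append(rule["metadata"]["name"])
    return missing
```

In a mesh with global mTLS enabled in the way this section describes, every name returned by this function is a candidate cause of the connection terminations.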

This Istio mTLS user interface is extremely unfriendly. Although mTLS is globally transparent by default and the business is not aware of its existence, once the business defines a DestinationRule, it must know whether mTLS is currently enabled and adjust accordingly. Imagine that the mTLS configuration is owned by the security team while each business team owns its own DestinationRule: the coupling between the teams becomes very tight.

10. Restriction on the user service's listening address

Exception description

If the business process in the user container listens on a specific IP (the pod IP) instead of 0.0.0.0, the user container cannot be connected to Istio properly, and traffic routing fails. This is another scenario that challenges Istio's design goal of maximum transparency (Maximize Transparency).

Cause analysis

The cause lies in the iptables rules installed for istio-proxy.

In these rules, ISTIO_IN_REDIRECT is virtualInbound, port 15006; ISTIO_REDIRECT is virtualOutbound, port 15001.

The key point is rule 2: if the destination is not 127.0.0.1/32, the traffic is redirected to 15006 (virtualInbound, where Envoy listens), which causes traffic addressed to the pod IP to always return to Envoy.
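The effect of rule 2 as described here can be modeled with a single comparison. The sketch below covers only that one rule, not the full iptables chain, and the function name and sample addresses are hypothetical.

```python
import ipaddress

LOCALHOST = ipaddress.ip_network("127.0.0.1/32")
VIRTUAL_INBOUND_PORT = 15006  # ISTIO_IN_REDIRECT target, per the text


def rule2_target(destination_ip):
    """Model rule 2: anything not addressed to 127.0.0.1/32 goes back to Envoy."""
    if ipaddress.ip_address(destination_ip) in LOCALHOST:
        return "unchanged"
    return f"redirect to {VIRTUAL_INBOUND_PORT}"
```

Under this model, traffic to a pod IP such as `10.0.0.5` is sent to port 15006 again, while traffic to `127.0.0.1` is left alone, which is why an application listening only on the pod IP is affected.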

Interpretation of the rule:

# Redirect app calls back to itself via Envoy when using the service VIP or endpoint
# address, e.g. appN => Envoy (client) => Envoy (server) => appN.

This rule is intended for the following scenario: assume the current Pod a belongs to service A, and the user container in the Pod accesses service A by service name. The load-balancing logic in Envoy may forward this request to the current Pod's own IP. Istio wants the server side to retain traffic control capability in this scenario.

Recommended change

It is recommended to adjust the service's listening address to 0.0.0.0 instead of a specific IP before connecting it to Istio. If the business side finds this change difficult, refer to a solution shared earlier: "service monitoring pod ip routing exception analysis in istio".
