How to solve the problem of startup dependence of application container on Envoy Sidecar

2025-07-02 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)05/31 Report--

This article explains how to solve the application container's startup dependency on the Envoy sidecar: what the problem looks like, why it happens, and how to deal with it.

It draws on the experience of users migrating from traditional microservice frameworks such as Spring Cloud and Dubbo to the Istio service mesh, and on solutions to common problems encountered when using Istio.

Symptoms

For a short period after startup, an application with an injected sidecar proxy cannot access services outside its pod, such as external HTTP, MySQL, or Redis services. If the application does not handle the failure of the services it depends on, this often causes the application to fail to start. Below, we walk through the analysis of a typical fault caused by this problem to explain its cause.

Typical case: an operations engineer reported that last night the heartbeat check of an application in the Istio environment failed with a connection reset error, after which the service restarted. The engineer suspected that network instability in the Istio environment caused the restart.

Fault analysis

According to the operations engineer, the pod had restarted several times. We therefore first use the kubectl logs --previous command to query the log of the awesome-app container from before its last restart, to find out why it restarted.

kubectl logs --previous awesome-app-cd1234567-gzgwg -c awesome-app

The last error message in the log before the restart is:

Logging system failed to initialize using configuration from 'http://log-config-server:12345/******/logback-spring.xml'
java.net.ConnectException: Connection refused (Connection refused)
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)

From the error message, we can know that when the application process starts, it tries to pull the configuration information of logback from the configuration center through the HTTP protocol, but the operation fails due to the network exception, which leads to the failure of the application process and finally leads to the restart of the container.

What caused the network failure? Let's use the kubectl get pod command to query the status of the pod and look for more clues:

kubectl get pod awesome-app-cd1234567-gzgwg -o yaml

The command prints the details of the pod. The YAML fragment below omits unrelated fields and shows only the lastState and state parts of the container status.

containerStatuses:
- containerID:
  lastState:
    terminated:
      containerID:
      exitCode: 1
      finishedAt: 2020-09-01T13:16:23Z
      reason: Error
      startedAt: 2020-09-01T13:16:22Z
  name: awesome-app
  ready: true
  restartCount: 2
  state:
    running:
      startedAt: 2020-09-01T13:16:36Z
- containerID:
  lastState: {}
  name: istio-proxy
  ready: true
  restartCount: 0
  state:
    running:
      startedAt: 2020-09-01T13:16:20Z
hostIP: 10.0.6.161

From this output, you can see that the application container awesome-app in pod has been restarted twice. Sort out the timing of the startup and termination of the awesome-app application container and the istio-proxy sidecar container in the pod, and you can get the following timeline:

2020-09-01T13:16:20Z istio-proxy starts

2020-09-01T13:16:22Z awesome-app starts (the attempt that later fails)

2020-09-01T13:16:23Z awesome-app exits abnormally

2020-09-01T13:16:36Z awesome-app starts for the last time and has been running normally ever since

You can see that 2 seconds after istio-proxy starts, awesome-app starts and exits abnormally 1 second later. Combined with the previous log information, we know that the direct cause of this startup failure is the failure of the application to access the configuration center. Sixteen seconds after istio-proxy starts, awesome-app starts again, this time successfully, and has been running normally ever since.
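The intervals quoted above can be checked with a few lines of arithmetic. The following is only a small sketch: the timestamps are copied from the containerStatuses output, while the variable and function names are made up for illustration.

```python
from datetime import datetime


def parse(ts: str) -> datetime:
    # Timestamps use the format shown in the containerStatuses output.
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")


def seconds_between(earlier: datetime, later: datetime) -> int:
    return int((later - earlier).total_seconds())


proxy_start = parse("2020-09-01T13:16:20Z")       # istio-proxy state.running.startedAt
app_failed_start = parse("2020-09-01T13:16:22Z")  # awesome-app lastState.terminated.startedAt
app_failed_exit = parse("2020-09-01T13:16:23Z")   # awesome-app lastState.terminated.finishedAt
app_final_start = parse("2020-09-01T13:16:36Z")   # awesome-app state.running.startedAt

print("proxy start -> app start:", seconds_between(proxy_start, app_failed_start), "s")
print("app start -> app exit:", seconds_between(app_failed_start, app_failed_exit), "s")
print("proxy start -> app exit:", seconds_between(proxy_start, app_failed_exit), "s")
print("proxy start -> final app start:", seconds_between(proxy_start, app_final_start), "s")
```

The failed attempt therefore ran entirely within the first 3 seconds of the sidecar's life, while the successful one began 16 seconds after the sidecar started.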

awesome-app failed only 3 seconds after istio-proxy started (it started at the 2-second mark and exited 1 second later), so we can reasonably conclude that istio-proxy had not finished initializing at that point. As a result, awesome-app could not connect to the external service through istio-proxy, and its startup failed. When awesome-app started again at 2020-09-01T13:16:36Z, istio-proxy had been running for 16 seconds and had finished obtaining its dynamic configuration from Pilot, so awesome-app's network access outside the pod worked normally.

After Envoy starts, it requests service and routing configuration from Pilot through the xDS protocol. On receiving the request, Pilot assembles the configuration for the node (pod or VM) where Envoy resides, including Listeners, Routes, Clusters, and so on, and sends it to Envoy through xDS. Depending on the size of the mesh and on network conditions, distributing this configuration takes from several seconds to tens of seconds. Because the init container has already created iptables rules in the pod, network traffic sent by the application is redirected to Envoy during this period. At this point Envoy has no listeners or routing rules to handle these requests, so the requests fail. (For more information about the Envoy sidecar initialization process and the principles of Istio traffic management, see the article "In-depth analysis of Istio traffic management implementation mechanism".)

Solution

Check the initialization status of Envoy in the application startup command

From the previous analysis, we can see that the root cause of this problem is due to the application process's dependence on Envoy sidecar configuration initialization. Therefore, the most direct solution is to judge the initialization state of Envoy sidecar when the application process starts, and then start the application process after its initialization is completed.

The health check API of the Envoy sidecar, localhost:15020/healthz/ready, returns 200 only after the xDS configuration has been initialized; until then it returns 503. You can therefore use this API to determine whether Envoy's configuration has been initialized, and start the application process only after it has. To do so, add a script that calls the Envoy health check to the startup command of the application container, as shown in the following configuration snippet. When using it for other applications, replace start-awesome-app-cmd with the application's own startup command.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: awesome-app-deployment
spec:
  selector:
    matchLabels:
      app: awesome-app
  replicas: 1
  template:
    metadata:
      labels:
        app: awesome-app
    spec:
      containers:
      - name: awesome-app
        image: awesome-app
        ports:
        - containerPort: 80
        command: ["/bin/bash", "-c"]
        args: ["while [[ \"$(curl -s -o /dev/null -w '%{http_code}' localhost:15020/healthz/ready)\" != '200' ]]; do echo Waiting for Sidecar; sleep 1; done; echo Sidecar available; start-awesome-app-cmd"]

The order in which the process is executed is as follows:

Kubernetes starts the application container.

The application container's startup script queries the Envoy sidecar status with curl against localhost:15020/healthz/ready. Because the Envoy sidecar is not ready yet, the script keeps retrying.

Kubernetes starts Envoy sidecar.

Envoy sidecar connects to Pilot through xDS for configuration initialization.

The startup script of the application container determines that its initialization has been completed through the health check interface of Envoy sidecar, and starts the application process.
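For applications whose entry point is a program rather than a shell command, the same wait loop can be written in code. The sketch below is one possible version, not part of Istio: the function names, the max_attempts limit (the shell loop in the snippet above retries forever), and the injectable probe and sleep parameters are assumptions added for illustration and testing. Only the URL and the 200/503 behaviour come from the text above.

```python
import time
import urllib.error
import urllib.request
from typing import Callable

# Readiness endpoint named in the article.
READY_URL = "http://localhost:15020/healthz/ready"


def probe_sidecar(url: str = READY_URL, timeout: float = 1.0) -> int:
    """Return the HTTP status of the readiness endpoint, or 0 if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # for example 503 while the xDS configuration is loading
    except OSError:
        return 0  # connection refused or timed out: nothing is listening yet


def wait_for_sidecar(
    probe: Callable[[], int] = probe_sidecar,
    interval: float = 1.0,
    max_attempts: int = 60,  # assumption: give up eventually instead of looping forever
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the probe reports 200; return False if attempts run out."""
    for _ in range(max_attempts):
        if probe() == 200:
            return True
        sleep(interval)
    return False
```

An entry point would call wait_for_sidecar() first and start the real application only if it returns True. Passing the probe and sleep functions as parameters keeps the loop testable without a running sidecar.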

Although this approach avoids the startup-order dependency, it requires modifying the startup script of the application container to check the health of Envoy. A better solution would leave the application unaware of the Envoy sidecar.

Control the startup order of the containers in the pod

By reading the Kubernetes source code, we can see that when there are multiple containers in pod, Kubernetes starts them in turn in a thread, as shown in the following code snippet:

// Step 7: start containers in podContainerChanges.ContainersToStart.
for _, idx := range podContainerChanges.ContainersToStart {
	start("container", containerStartSpec(&pod.Spec.Containers[idx]))
}

So when injecting the Envoy sidecar into the pod, we can place it before the application container, so that Kubernetes starts the Envoy sidecar first and the application container second. There is still another problem, however: the application container must not start immediately after Envoy starts, because we still need to wait for the xDS configuration to be initialized. Here we can use the container's postStart lifecycle hook. Kubernetes calls a container's postStart hook after the container starts, and the hook blocks the startup of the next container in the pod until it finishes executing. Therefore, if the postStart hook of the Envoy sidecar checks the initialization status of the Envoy configuration and returns only once initialization is complete, Kubernetes is guaranteed to start the application container after the Envoy sidecar configuration has been initialized. The order in which the process is executed is as follows:

Kubernetes starts Envoy sidecar.

Kubernetes executes postStart hook.

The postStart hook polls the Envoy health check interface and returns only once Envoy has finished initializing its configuration.

Kubernetes starts the application container.
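The ordering guarantee in these steps can be illustrated with a toy model. This is not kubelet code: the Container class, start_containers, and make_wait_hook are hypothetical names, and the hook only simulates a readiness poll. The sketch shows just one thing: when containers start one after another and a postStart hook blocks, the next container cannot start before the hook returns.

```python
from typing import Callable, List, Optional


class Container:
    """Toy stand-in for a container spec with an optional postStart hook."""

    def __init__(self, name: str, post_start: Optional[Callable[[], None]] = None):
        self.name = name
        self.post_start = post_start


def start_containers(containers: List[Container], log: List[str]) -> None:
    """Start containers in list order; a postStart hook blocks the next start."""
    for c in containers:
        log.append(f"start {c.name}")
        if c.post_start is not None:
            c.post_start()  # does not return until the hook has finished
            log.append(f"postStart done {c.name}")


def make_wait_hook(ready_after_polls: int, log: List[str]) -> Callable[[], None]:
    """Simulated readiness wait: reports 'not ready' a fixed number of times."""
    state = {"polls": 0}

    def hook() -> None:
        while state["polls"] < ready_after_polls:
            state["polls"] += 1
            log.append(f"poll {state['polls']}: not ready")
        log.append("envoy ready")

    return hook


events: List[str] = []
start_containers(
    [Container("istio-proxy", make_wait_hook(2, events)), Container("application")],
    events,
)
print("\n".join(events))
```

In the printed event list, "start application" always appears after "envoy ready", which is the property this solution relies on.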

Istio incorporated this fix in version 1.7; see "Allow users to delay application start until proxy is ready" (#24737).

The pod spec after sidecar injection is shown in the YAML snippet below. The pilot-agent wait command configured in the postStart hook keeps calling Envoy's health check interface /healthz/ready until Envoy has finished initializing its configuration. More details about this solution are given in the article "Delaying application start until sidecar is ready".

apiVersion: v1
kind: Pod
metadata:
  name: sidecar-starts-first
spec:
  containers:
  - name: istio-proxy
    image:
    lifecycle:
      postStart:
        exec:
          command:
          - pilot-agent
          - wait
  - name: application
    image: my-application

This approach removes the application container's dependency on Envoy sidecar initialization without modifying the application. However, it relies on two implicit behaviours of Kubernetes: Kubernetes starts the containers of a pod sequentially, in their defined order, in a single thread, and it starts the next container only after the postStart hook of the previous container has finished. Both conditions hold in the current Kubernetes implementation, but they are not part of the Kubernetes API specification, so a future Kubernetes upgrade may break them and cause the problem to reappear.

Kubernetes supports defining dependencies between containers in pod

To solve this problem completely and prevent it from recurring after changes to the Kubernetes code, it would be more reasonable for Kubernetes to support explicitly declaring that the startup of one container in a pod depends on the health of another container. The Kubernetes issue "Support startup dependencies between containers on the same Pod" (#65502) tracks this request. If Kubernetes supported this feature, the order in which the process is executed would be as follows:

Kubernetes starts the Envoy sidecar container.

Kubernetes checks the status of the Envoy sidecar container through its readiness probe until the probe reports that the Envoy sidecar is ready, that is, that it has been initialized.

Kubernetes starts the application container.

Decouple startup dependencies between application services

The solutions above all control the startup order of the containers in the pod and start the application container only after the Envoy sidecar has been initialized, so that other services can be reached over the network when the application container starts. However, these are only stop-gap measures that treat the symptoms rather than the root cause. Even if network access from the pod works, the other services that the application depends on may still be unable to serve requests, because they have not started yet or have problems of their own. To solve the problem completely, we need to decouple the startup dependencies between application services, so that the startup of an application container no longer depends strongly on other services.

In a microservice system, the modules of the original monolithic application are split into multiple independent processes (services). These services start in no particular order and communicate with each other over an unreliable network. Because of this multi-process deployment and cross-process network communication, failed calls between services are a normal occurrence. For this reason, one of the basic design principles of microservices is "design for failure", that is, handling possible failures gracefully. When a microservice cannot reach an external service it depends on, it should handle the failure with strategies such as retry, degradation, timeout, and circuit breaking, to keep the system running normally as far as possible.
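As one illustration of these strategies, the sketch below combines circuit breaking with degradation to a fallback value. It is a minimal, single-threaded sketch written for this article, not the API of any particular resilience library: the CircuitBreaker class, its parameters, and the default thresholds are all assumptions.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Minimal circuit breaker: fail fast after repeated errors, retry later."""

    def __init__(
        self,
        failure_threshold: int = 3,   # consecutive failures before opening
        reset_after: float = 30.0,    # seconds to wait before a trial call
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, func: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()   # open: skip the remote call and degrade
            self.opened_at = None   # half-open: allow one trial call
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0           # success closes the circuit again
        return result
```

A caller wraps each remote call, for example breaker.call(fetch_remote_value, lambda: cached_value), where both arguments are placeholders for the application's own functions. While the dependency is down the application keeps running on the fallback, and it resumes real calls once the dependency recovers.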

The temporary loss of network access during Envoy sidecar initialization only magnifies an existing problem: the microservice system does not handle service dependencies correctly. Even if the ordering dependency on the Envoy sidecar is solved, that problem remains. In this case, for example, the configuration center is itself a separate microservice, and when a microservice that depends on it starts, the configuration center may not have started or finished initializing. If the code does not handle this failure, the dependent microservice will again fail to start. In a more complex system, there may be a web of dependencies among many microservice processes. If the microservices are not made fault-tolerant according to the "design for failure" principle, merely starting the whole system becomes a major challenge. For this example, a simple fault-tolerance strategy would be: start the application process with a default logback configuration, keep retrying the configuration center after startup, and once connected, use the configuration it issues to set up logback.
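The default-then-refresh strategy described above can be sketched as follows. The application in this case is a Java service using logback, so this Python version only illustrates the control flow. Every name in it (DEFAULT_LOG_CONFIG, LogConfigHolder, refresh_until_success, start_background_refresh) is hypothetical, and fetch stands for whatever call retrieves the configuration from the configuration center.

```python
import threading
import time
from typing import Callable, Dict

# Assumption: a built-in configuration that lets the process start on its own.
DEFAULT_LOG_CONFIG: Dict[str, str] = {"level": "INFO", "source": "built-in default"}


class LogConfigHolder:
    """Holds the active logging configuration; starts with the default."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._config = dict(DEFAULT_LOG_CONFIG)

    def get(self) -> Dict[str, str]:
        with self._lock:
            return dict(self._config)

    def set(self, config: Dict[str, str]) -> None:
        with self._lock:
            self._config = dict(config)


def refresh_until_success(
    holder: LogConfigHolder,
    fetch: Callable[[], Dict[str, str]],
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Retry fetch() until it succeeds, then apply it. Returns the attempt count."""
    attempts = 0
    while True:
        attempts += 1
        try:
            holder.set(fetch())
            return attempts
        except OSError:  # connection refused, timeout, and similar network errors
            sleep(interval)


def start_background_refresh(
    holder: LogConfigHolder, fetch: Callable[[], Dict[str, str]]
) -> threading.Thread:
    """Run the retry loop in a daemon thread so startup is never blocked."""
    thread = threading.Thread(
        target=refresh_until_success, args=(holder, fetch), daemon=True
    )
    thread.start()
    return thread
```

With this structure, a refused connection to the configuration center during the first seconds after startup no longer terminates the process. The application logs with the default settings and switches to the central configuration once the configuration center becomes reachable.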

To summarize: the typical symptom of the application container's startup dependency on the Envoy sidecar is that the application container fails to call external services for a short period just after it starts. The reason is that the Envoy sidecar has not yet finished initializing its xDS configuration, so it cannot forward network requests for the application container. The failed calls may prevent the application container from starting properly. The root cause is that the microservice application lacks reasonable fault tolerance for failed calls to the services it depends on. For legacy systems, to limit the impact on the application, the problem can be mitigated by checking the initialization status of Envoy in the application startup command, or by upgrading to Istio 1.7. To eliminate errors caused by service dependencies, however, it is better to follow the "design for failure" principle: decouple the strong dependencies between microservices, and when a dependent external service is temporarily unavailable, handle the failure with strategies such as retry, degradation, timeout, and circuit breaking, to keep the system running normally as far as possible.
