
How to solve the problems in the use of Istio


This article shares some knowledge about how to solve problems encountered when using Istio. The content is detailed and the logic is clear; many readers may not be very familiar with this topic, so the article is shared for your reference. I hope you gain something from reading it.

Failed Eureka heartbeat notification

In the previous article, we introduced the difference between a Headless Service and a normal Service. Because Headless Services are special, Istio generates different configuration for them in the config it pushes to the Envoy sidecar. Besides the mTLS failure we encountered last time, these differences can lead to other unexpected behavior in applications.

The problem encountered this time is that after the Spring Cloud application is migrated to Istio, the service provider fails to send a heartbeat to Eureka Server.

Note: Eureka Server uses a heartbeat mechanism to determine the health of a service. After startup, a service provider periodically (every 30 seconds by default) sends a heartbeat to Eureka Server to prove that it is still available. If Eureka Server does not receive a heartbeat from a client within a certain period (90 seconds by default), the service is considered down and the instance is deregistered.
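
As a concrete illustration, these two intervals correspond to standard Spring Cloud Netflix Eureka client settings. The snippet below is only a minimal sketch of an application.yml with the default values spelled out; the registry URL is a placeholder and the property names assume the Spring Cloud Netflix Eureka client, not the exact configuration used in this case:

eureka:
  client:
    service-url:
      # Placeholder registry address; replace with your Eureka Server endpoint
      defaultZone: http://eureka-server:8761/eureka/
  instance:
    # How often the client sends a heartbeat to Eureka Server (default 30s)
    lease-renewal-interval-in-seconds: 30
    # How long Eureka Server waits without a heartbeat before evicting the instance (default 90s)
    lease-expiration-duration-in-seconds: 90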

Looking at the application log, we can see the Eureka client logging heartbeat failures:

2020-09-24 13:32:46.533 ERROR 1 --- [tbeatExecutor-0] com.netflix.discovery.DiscoveryClient   : DiscoveryClient_EUREKA-TEST-CLIENT/eureka-client-544b94f967-gcx2f:eureka-test-client - was unable to send heartbeat!

com.netflix.discovery.shared.transport.TransportException: Cannot execute request on any known server
    at com.netflix.discovery.shared.transport.decorator.RetryableEurekaHttpClient.execute(RetryableEurekaHttpClient.java:112) ~[eureka-client-1.9.13.jar!/:1.9.13]
    at com.netflix.discovery.shared.transport.decorator.EurekaHttpClientDecorator.sendHeartBeat(EurekaHttpClientDecorator.java:89) ~[eureka-client-1.9.13.jar!/:1.9.13]
    at com.netflix.discovery.shared.transport.decorator.EurekaHttpClientDecorator$3.execute(EurekaHttpClientDecorator.java:92) ~[eureka-client-1.9.13.jar!/:1.9.13]
    at com.netflix.discovery.shared.transport.decorator.SessionedEurekaHttpClient.execute(SessionedEurekaHttpClient.java:77) ~[eureka-client-1.9.13.jar!/:1.9.13]
    at com.netflix.discovery.shared.transport.decorator.EurekaHttpClientDecorator.sendHeartBeat(EurekaHttpClientDecorator.java:89) ~[eureka-client-1.9.13.jar!/:1.9.13]
    at com.netflix.discovery.DiscoveryClient.renew(DiscoveryClient.java:864) ~[eureka-client-1.9.13.jar!/:1.9.13]
    at com.netflix.discovery.DiscoveryClient$HeartbeatThread.run(DiscoveryClient.java:1423) ~[eureka-client-1.9.13.jar!/:1.9.13]
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[na:na]
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[na:na]
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) ~[na:na]
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) ~[na:na]
    at java.base/java.lang.Thread.run(Thread.java:832) ~[na:na]

Expired IP address

For request failures like this, we can first look at Envoy's access log to find the cause. View the log of the client's Envoy sidecar with the following command:

k logs -f eureka-client-66f748f84f-vvvmz -c eureka-client -n eureka

The Envoy log shows the heartbeat request the client sent to the server via HTTP PUT. The response flags of the request are "UF,URX", indicating an Upstream Failure, that is, Envoy failed to connect to the upstream service (and exhausted its retries). The log also shows that after the connection failure, Envoy returned a 503 HTTP error code to the client application.

[2020-09-24T13:31:37.980Z] "PUT /eureka/apps/EUREKA-TEST-CLIENT/eureka-client-544b94f967-gcx2f:eureka-test-client?status=UP&lastDirtyTimestamp=1600954114925 HTTP/1.1" 503 UF,URX "-" 0 91 3037 - "-" "Java-EurekaClient/v1.9.13" "1cd54507-3f93-4ff3-a93e-35ead11da70f" "eureka-server:8761" "172.16.0.198:8761" outbound|8761||eureka-server.eureka.svc.cluster.local - 172.16.0.198:8761 172.16.0.169 - default

You can see from the log that the Upstream Cluster being accessed is outbound|8761||eureka-server.eureka.svc.cluster.local, and that Envoy forwarded the request to the Upstream Host with IP address 172.16.0.198.

Looking at the services deployed in the cluster, you can see that eureka-server is a Headless Service.

HUABINGZHAO-MB0:eureka-istio-test huabingzhao$ k get svc -n eureka -o wide
NAME            TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)    AGE   SELECTOR
eureka-server   ClusterIP   None         <none>        8761/TCP   17m   app=eureka-server
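
For reference, what makes a Service headless is clusterIP: None in its spec. The manifest below is only a sketch reconstructed from the kubectl output above; the port name and targetPort are assumptions rather than values taken from the actual deployment:

apiVersion: v1
kind: Service
metadata:
  name: eureka-server
  namespace: eureka
spec:
  clusterIP: None          # No Cluster IP is allocated, which makes this a Headless Service
  selector:
    app: eureka-server
  ports:
  - name: http             # assumed port name
    port: 8761
    targetPort: 8761       # assumed to match the container port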

In the previous article in this series, "Istio Operation and Maintenance Practice Series (2): Headless Services", we learned that a Headless Service has no Cluster IP; DNS resolves the Service name directly to the IPs of the multiple Pods behind the Service. The Envoy log shows that the failed connection was to the Eureka Server address 172.16.0.198, so let's check which Eureka Server Pod this IP belongs to.

HUABINGZHAO-MB0:eureka-istio-test huabingzhao$ k get pod -n eureka -o wide | grep eureka-server
NAME              READY   STATUS    RESTARTS   AGE     IP             NODE        NOMINATED NODE   READINESS GATES
eureka-server-0   1/1     Running   0          6h65m   172.16.0.59    10.0.0.15   <none>           <none>
eureka-server-1   1/1     Running   0          6m1s    172.16.0.200   10.0.0.7    <none>           <none>
eureka-server-2   1/1     Running   0          6h66m   172.16.1.3     10.0.0.14   <none>           <none>

From the command output above, you can see that there are three servers in the Eureka cluster, but none of their Pod IPs is the 172.16.0.198 shown in the Envoy log. Further analysis shows that the eureka-server-1 Pod was started much later than the client, so it is suspected that Envoy is using the IP of a Eureka Server Pod that has since been destroyed, which causes the access failure.

This suspicion is further confirmed by looking at the configuration of outbound|8761||eureka-server.eureka.svc.cluster.local in the Envoy config dump. The snippet below shows that the Cluster is of type "ORIGINAL_DST".

{"version_info": "2020-09-23T03:57:03Z/27", "cluster": {"@ type": "type.googleapis.com/envoy.api.v2.Cluster", "name": "outbound | 8761 | | eureka-server.eureka.svc.cluster.local", "type": "ORIGINAL_DST", # this option indicates that Enovy will directly use the address in the original downstream request when forwarding the request. "connect_timeout": "1s", "lb_policy": "CLUSTER_PROVIDED",...}

Envoy's documentation describes "ORIGINAL_DST" as follows:

In these cases requests routed to an original destination cluster are forwarded to upstream hosts as addressed by the redirection metadata, without any explicit host configuration or upstream host discovery.

That is, for a Cluster of type "ORIGINAL_DST", Envoy directly uses the original destination IP address in the downstream request when forwarding, instead of relying on a service discovery mechanism. Istio's Envoy sidecar thus handles this the same way Kubernetes handles a Headless Service: the client picks a backend Pod IP directly based on DNS, and no load balancing algorithm redirects the client's request. But the puzzling question is: why did the client fail to access the Pod address 172.16.0.198 that it obtained from its DNS query? The reason is that the address the client got from DNS no longer existed at the time of access. The following figure explains the cause of the problem:

Client queries DNS to get three IP addresses of eureka-server.

Client selects the IP 172.16.0.198 of Server-1 and initiates a connection request, which is intercepted by the iptables rules and redirected to port 15001, the VirtualOutbound listener of the Envoy sidecar in the client Pod.

After receiving the connection request from the Client, Envoy connects to Server-1 using the original destination address 172.16.0.198 in the request, according to the Cluster configuration. At this point the Pod behind that IP still exists, so the connection from Envoy to Server-1 is established successfully, and the connection between Client and Envoy is established as well. The Client uses the HTTP Keep-Alive option when creating the connection, so it keeps the connection open and continuously sends HTTP PUT service heartbeat notifications over it at 30-second intervals.

For some reason, the Server-1 Pod is rebuilt by K8s as a new Pod, Server-1', and its IP changes.

When Server-1's IP changes, Envoy does not immediately and actively close the connection to the Client side. From the Client's point of view, the TCP connection to 172.16.0.198 is still healthy, so the Client keeps using it to send HTTP requests. And because the Cluster type is "ORIGINAL_DST", Envoy keeps trying to connect to the original destination address 172.16.0.198 in the Client's request, as shown by the blue arrow in the figure. However, because the Pod at that IP has been destroyed, Envoy fails to connect and then returns an error such as "upstream connect error or disconnect/reset before headers. reset reason: connection failure HTTP/1.1 503" to the Client. If the Client does not close and re-create the connection immediately after receiving the error, it will not re-query DNS to get the correct address of the rebuilt Pod until the connection times out.

Enable EDS for Headless Service

From the previous analysis, we already know that the cause of the error is that the IP address held by the client's long-lived HTTP connection has expired. The most direct idea, then, is to make Envoy use a correct IP address to connect to the Upstream Host. How can this be achieved without modifying the client code or rebuilding the client connection?

If you compare this with the Cluster configuration of another, normal service, you can see that in the configuration issued by Istio the Cluster type is normally EDS (Endpoint Discovery Service), as shown in the following snippet:

{"version_info": "2020-09-23T03:02:01Z/2", "cluster": {"@ type": "type.googleapis.com/envoy.config.cluster.v3.Cluster", "name": "outbound | 8080 | | http-server.default.svc.cluster.local", "type": "EDS", # General service adopts EDS service discovery According to the LB algorithm, select one of the endpoint issued by EDS to connect "eds_cluster_config": {"eds_config": {"ads": {}, "resource_api_version": "V3"}, "service_name": "outbound | 8080 | | http-server.default.svc.cluster.local"},.}

With EDS, Envoy obtains all the available Endpoints in the Cluster through EDS and distributes downstream requests to the different Endpoints according to the load balancing algorithm (Round Robin by default). So as long as the Cluster type is changed to EDS, Envoy will no longer use the stale original destination IP address in the request when forwarding, but will instead use an Endpoint address automatically discovered through EDS. With EDS, the access flow in this example is as follows:

By looking at the Istio source code, we can see that Istio defaults to a Cluster of type "ORIGINAL_DST" for a Headless Service, but we can force EDS to be enabled for Headless Services by setting the Istiod environment variable PILOT_ENABLE_EDS_FOR_HEADLESS_SERVICES.

func convertResolution(proxy *model.Proxy, service *model.Service) cluster.Cluster_DiscoveryType {
	switch service.Resolution {
	case model.ClientSideLB:
		return cluster.Cluster_EDS
	case model.DNSLB:
		return cluster.Cluster_STRICT_DNS
	case model.Passthrough: // Headless Service is model.Passthrough
		if proxy.Type == model.SidecarProxy {
			// For a SidecarProxy, enable EDS if PILOT_ENABLE_EDS_FOR_HEADLESS_SERVICES is set to true,
			// otherwise use ORIGINAL_DST
			if service.Attributes.ServiceRegistry == string(serviceregistry.Kubernetes) && features.EnableEDSForHeadless {
				return cluster.Cluster_EDS
			}
			return cluster.Cluster_ORIGINAL_DST
		}
		return cluster.Cluster_EDS
	default:
		return cluster.Cluster_EDS
	}
}
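
For completeness, here is one possible way to set this variable. The sketch below assumes Istio is installed and managed with istioctl via an IstioOperator manifest; alternatively, the environment variable can be set directly on the istiod Deployment in the istio-system namespace.

apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  components:
    pilot:
      k8s:
        env:
        # Force EDS (instead of ORIGINAL_DST) for Headless Services
        - name: PILOT_ENABLE_EDS_FOR_HEADLESS_SERVICES
          value: "true"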

After setting the Istiod environment variable PILOT_ENABLE_EDS_FOR_HEADLESS_SERVICES to true and looking at the Envoy log again, you can see that although the original destination IP address of the request is still 172.16.0.198, Envoy now distributes the requests across the IPs of the three actually available Servers.

[2020-09-24T13:35:28.790Z] "PUT /eureka/apps/EUREKA-TEST-CLIENT/eureka-client-544b94f967-gcx2f:eureka-test-client?status=UP&lastDirtyTimestamp=1600954114925 HTTP/1.1" 200 - "-" 0 0 4 4 "-" "Java-EurekaClient/v1.9.13" "d98fd3ab-778d-42d4-a361-d27c2491eff0" "eureka-server:8761" "172.16.1.3:8761" outbound|8761||eureka-server.eureka.svc.cluster.local 172.16.0.169:39934 172.16.0.198:8761 172.16.0.169 - default
[2020-09-24T13:35:58.797Z] "PUT /eureka/apps/EUREKA-TEST-CLIENT/eureka-client-544b94f967-gcx2f:eureka-test-client?status=UP&lastDirtyTimestamp=1600954114925 HTTP/1.1" 200 - "-" 0 0 1 1 "-" "Java-EurekaClient/v1.9.13" "7799a9a0-06a6-44bc-99f1-a928d8576b7c" "eureka-server:8761" "172.16.0.59:8761" outbound|8761||eureka-server.eureka.svc.cluster.local 172.16.0.169:45582 172.16.0.198:8761 172.16.0.169 - default
[2020-09-24T13:36:28.801Z] "PUT /eureka/apps/EUREKA-TEST-CLIENT/eureka-client-544b94f967-gcx2f:eureka-test-client?status=UP&lastDirtyTimestamp=1600954114925 HTTP/1.1" 200 - "-" 0 0 2 1 "-" "Java-EurekaClient/v1.9.13" "aefb383f-a86d-4c96-845c-99d6927c722e" "eureka-server:8761" "172.16.0.200:8761" outbound|8761||eureka-server.eureka.svc.cluster.local 172.16.0.169:60794 172.16.0.198:8761 172.16.0.169:53890 - default

The mysteriously disappearing services

After changing the type of the Eureka Server Cluster from ORIGINAL_DST to EDS, the services that previously failed to send heartbeats returned to normal. However, after a while we found that some services registered in Eureka were being taken offline, so that services could no longer access each other normally. Querying the Eureka Server log, we found the following errors:

2020-09-24 14:07:35.511  WARN 6 --- [eureka-server-3] c.netflix.eureka.cluster.PeerEurekaNode  : EUREKA-SERVER-2/eureka-server-2.eureka-server.eureka.svc.cluster.local:eureka-server-2:8761:Heartbeat@eureka-server-0.eureka-server: missing entry.
2020-09-24 14:07:35.511  WARN 6 --- [eureka-server-3] c.netflix.eureka.cluster.PeerEurekaNode  : EUREKA-SERVER-2/eureka-server-2.eureka-server.eureka.svc.cluster.local:eureka-server-2:8761:Heartbeat@eureka-server-0.eureka-server: cannot find instance

From the log we can see that errors occurred in the data synchronization between the Eureka Servers. When deployed in cluster mode, the instances in a Eureka cluster synchronize registration data with each other. In this case there are three instances in the Eureka cluster, and the data synchronization between them is shown in the figure below:

After switching to EDS, when each Eureka Server in the cluster initiates data synchronization to the other Eureka Servers, these requests are randomly distributed via Round Robin by the Envoy sidecar handling them, which scrambles the synchronization messages and makes the service registration data inconsistent across the servers in the cluster, so some services are wrongly judged to be offline. The failure is fairly random: after many tests we found it is more likely to occur when many services are registered in Eureka, and hard to reproduce with only a few services.

Once the cause is found, the problem is easy to solve: it can be avoided by disabling sidecar injection for the Eureka Server, as shown in the following yaml snippet:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: eureka-server
spec:
  selector:
    matchLabels:
      app: eureka-server
  serviceName: "eureka-server"
  replicas: 3
  template:
    metadata:
      labels:
        app: eureka-server
      annotations:
        sidecar.istio.io/inject: "false"  # Do not inject the Envoy sidecar into the eureka-server Pod
    spec:
      containers:
      - name: eureka-server
        image: zhaohuabing/eureka-test-service:latest
        ports:
        - containerPort: 8761
          name: http

Reflection

For Cluster with "ORIGINAL_DST" type by default for Headless Service,Istio, it is reasonable to require Envoy Sidecar to request the original destination IP address when forwarding. As we introduced in our previous article in this series, "Istio Operations Series (2): headless Services", Headless Service is generally used to define stateful services. For stateful services, it is up to the client to decide which backend Pod to access according to the application-specific algorithm, so a load balancer should not be added in front of these Pod.

In this example, because the client heartbeat notifications received by one node are synchronized among the nodes of the Eureka cluster, it does not matter which Eureka node a client sends its heartbeat to; from the point of view of external clients, the Eureka cluster can be treated as stateless. Therefore, setting the PILOT_ENABLE_EDS_FOR_HEADLESS_SERVICES environment variable so that the client's Envoy sidecar load-balances requests sent to Eureka Server causes no problem for clients. However, the nodes within the Eureka cluster are stateful with respect to each other, so the change affected the data synchronization among the Eureka nodes, which led to the later problem of services being wrongly taken offline. We circumvented this by removing the sidecar injection from Eureka Server.

A more reasonable fix for this problem is for the Envoy sidecar to actively close the connection to the client side after a certain number of failed attempts to connect to the Upstream Host; the client then re-queries DNS, obtains the correct Pod IP, and creates a new connection. After testing and verification, in Istio 1.6 and later versions Envoy actively closes the long-lived downstream connection after the upstream connection is broken. It is recommended to upgrade to 1.6 or later as soon as possible to completely solve this problem. You can also directly use TCM (Tencent Cloud Mesh), the cloud-native Service Mesh service on Tencent Cloud, to quickly bring the traffic management and service governance capabilities of a Service Mesh to microservice applications, without having to worry about installing, maintaining, and upgrading the Service Mesh infrastructure.

This concludes "How to solve the problems in the use of Istio". Thank you for reading; I hope you gained something from this article.
