What can we learn from a Consul failure, and how should it be analyzed and optimized? This article walks through a real failure in detail and describes the analysis and the resulting optimizations, in the hope of helping readers facing the same problem find a simple, feasible approach.
Registry background and use of Consul
From the point of view of the microservice platform, we want to provide a unified service registry so that any business or team using this infrastructure only needs to agree on a service name; it also needs to support multi-DC deployment and failover for the business. Thanks to its good scalability and multi-DC support, we chose Consul and adopted the architecture recommended by Consul: each DC contains its own Consul Servers and Consul Agents, and the DCs are connected in WAN mode as peers of each other. The structure is shown in the following figure:
Note: only four DCs are shown in the figure. Given the company's data center construction and access to third-party clouds, there are more than ten DCs in the actual production environment.
Integration with QAE container application platform
iQIYI's internal container application platform QAE is integrated with Consul. Since its early development was based on the Mesos/Marathon system, there was no Pod container group concept and a sidecar could not be injected into the container gracefully, so we chose the third-party registration mode of the microservice patterns: the QAE system synchronizes registration information to Consul in real time, as shown in the following figure. In addition, Consul's external service mode is used to avoid failures when the states of the two systems are inconsistent; for example, Consul may have determined that a node or service instance is unhealthy while QAE is unaware of it, so it will not restart or reschedule the container, leaving no healthy instance available.
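For concreteness, the sketch below shows how a third-party registrar such as QAE could write an instance into Consul's catalog in external service mode, using the official github.com/hashicorp/consul/api client. The node, service, address, and check names are illustrative placeholders, not QAE's actual implementation.

```go
package main

import (
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Register a container instance under a virtual node (external service
	// mode): no agent runs inside the container, the registrar writes directly
	// to the catalog and also owns the health-check state.
	reg := &api.CatalogRegistration{
		Node:           "qae-sync",  // virtual node owned by the registrar
		Address:        "10.0.0.12", // placeholder container IP
		SkipNodeUpdate: true,
		Service: &api.AgentService{
			ID:      "user-service-0", // placeholder instance ID
			Service: "user-service",   // placeholder service name
			Address: "10.0.0.12",
			Port:    8080,
		},
		Check: &api.AgentCheck{
			Node:      "qae-sync",
			CheckID:   "service:user-service-0",
			Name:      "qae-managed",
			Status:    api.HealthPassing, // health is decided by the platform, not Consul
			ServiceID: "user-service-0",
		},
	}

	if _, err := client.Catalog().Register(reg, nil); err != nil {
		log.Fatal(err)
	}

	// Deregistration on container exit would use Catalog().Deregister with
	// the same node and service ID.
}
```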
An example of the relationship between QAE applications and services is as follows:
Each QAE application represents a group of containers, and the mapping between applications and services is loosely coupled: an application only needs to be associated with the appropriate Consul DC according to where it actually runs. Subsequent state changes of the application's containers, such as updates, scaling, and failure restarts, are reflected in Consul's registration data in real time.
Integration with API Gateway
The microservice platform's API gateway is one of the most important users of the service registry. Gateways are deployed as multiple clusters according to region, carrier, and other factors; each gateway cluster corresponds to one Consul cluster according to its private-network location and queries the nearest service instances from Consul, as shown below:
Here we use Consul's PreparedQuery feature: for every service it returns instances from the local DC first, and if the local DC has none available it queries other DCs from near to far according to the inter-DC RTT.
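As an illustration of this setup (not our exact production definition), the following sketch creates and executes a PreparedQuery with nearest-DC failover through the official Go client; the query name, service name, and NearestN value are placeholders.

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Prefer healthy instances in the local DC; if none are available,
	// fail over to up to 3 other DCs ordered by inter-DC RTT.
	def := &api.PreparedQueryDefinition{
		Name: "nearest-user-service", // placeholder query name
		Service: api.ServiceQuery{
			Service:     "user-service", // placeholder service name
			OnlyPassing: true,
			Failover: api.QueryDatacenterOptions{
				NearestN: 3,
			},
		},
	}
	id, _, err := client.PreparedQuery().Create(def, nil)
	if err != nil {
		log.Fatal(err)
	}

	// Execute the query the way a gateway would: the response reports which
	// DC actually answered and how many failovers were attempted.
	resp, _, err := client.PreparedQuery().Execute(id, nil)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("answered by DC %s after %d failovers, %d instances\n",
		resp.Datacenter, resp.Failovers, len(resp.Nodes))
	for _, e := range resp.Nodes {
		fmt.Printf("  %s:%d\n", e.Service.Address, e.Service.Port)
	}
}
```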
Failure, analysis, and optimization
Consul failure
Consul had been running stably for more than three years since it went live at the end of 2016, but we recently encountered a failure: we received alarms that multiple Consul Servers in one DC were not responding to requests and that a large number of Consul Agents could not connect to the Servers, with no automatic recovery. The main phenomena observed on the Server side were:
The Raft protocol keeps losing elections and cannot elect a leader.
The HTTP and DNS query APIs time out heavily; some requests were observed to return only after tens of seconds (normally they return at the millisecond level).
The number of goroutines grows linearly and memory rises with it, eventually triggering the system OOM killer. Nothing conclusive was found in the logs, but the monitoring metrics showed that the execution time of PreparedQuery increased abnormally, as shown in the following figure:
At this point, the API gateway's queries for service information also timed out. We switched the corresponding gateway clusters to other DCs and then restarted the Consul processes, after which things returned to normal.
Failure analysis
Checking the logs showed that inter-DC network jitter (increased RTT accompanied by packet loss) occurred before the failure and lasted about one minute. Our preliminary analysis was that, due to the inter-DC network jitter, PreparedQuery requests that are normally returned quickly became backlogged in the Server; as time went on, more and more goroutines and memory were consumed, eventually leading to the Server exception.
Following this idea, we tried to reproduce the problem in a test environment with 4 DCs: the PreparedQuery QPS on a single Server was 1.5K, each PreparedQuery triggered three cross-DC queries, and we used the tc-netem tool to simulate an increase in inter-DC RTT. We got the following results:
When the inter-DC RTT changes from the normal 2ms to 800ms, the goroutines and memory of the Consul Server do grow linearly, and so does the execution time of PreparedQuery, as shown in the following figure:
Although goroutines and memory keep growing, before the OOM the other functions of the Consul Server are not affected: the Raft protocol works normally, and data queries within the DC are still answered normally.
The moment the inter-DC RTT reverts to 2ms, the Consul Servers lose their leader, after which Raft keeps losing elections and cannot recover.
The above steps reproduce the fault stably, which gave the analysis a direction. First, we confirmed that the growth of goroutines and memory was caused by the backlog of PreparedQuery requests, and that the initial backlog was caused by blocked network requests; why the backlog persisted after the network recovered was still unknown, so the whole process must have been in some abnormal state. The question that puzzled us most was why Consul failed only after the network recovered: Raft only communicates over the network within the DC, so why did it misbehave at that point?
At first we focused on the Raft issue. By tracking community issues we found hashicorp/raft#6852, which describes a possible raft deadlock in our version under high load and network jitter, very similar to our symptoms. However, after updating the Raft library and the related Consul code according to the issue, the failure could still be reproduced in the test environment.
We then added logging to the Raft library to see the details of Raft's work. This time we found a 10-second gap in the logs between a Raft member entering the Candidate state and it requesting votes from its peer nodes, even though the only code executed between those two log lines is a single metrics update, as shown in the following figure:
We therefore suspected that the metrics call was blocking and causing the whole system to misbehave, and we then found the relevant optimization in the release history of armon/go-metrics. Older versions use a global sync.Mutex in their Prometheus implementation, so every metrics update must first acquire that lock, while v0.3.3 uses a sync.Map instead: each metric is a key in the map, the global lock is only needed when a key is first initialized, different metrics no longer contend with each other afterwards, and updates to the same metric use sync/atomic for atomicity, which is more efficient overall. After updating the dependency and repeating the network-jitter test, the Consul Servers recovered on their own.
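The sketch below illustrates the pattern behind that optimization rather than the actual go-metrics code: a single global mutex serializes every metric update, whereas a sync.Map keyed by metric name only pays for synchronization when a metric is first created and then relies on atomic operations.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Pre-v0.3.3 style: every update of every metric has to take one global
// mutex, so all goroutines that touch any metric contend on the same lock.
type globalLockCounters struct {
	mu sync.Mutex
	m  map[string]uint64
}

func (c *globalLockCounters) Inc(name string) {
	c.mu.Lock()
	c.m[name]++
	c.mu.Unlock()
}

// v0.3.3 style: a sync.Map keyed by metric name. The map is only written
// when a metric is first seen; afterwards updates to the same metric are a
// single atomic add and different metrics never contend at all.
type shardedCounters struct {
	m sync.Map // metric name -> *uint64
}

func (c *shardedCounters) Inc(name string) {
	v, ok := c.m.Load(name)
	if !ok {
		v, _ = c.m.LoadOrStore(name, new(uint64))
	}
	atomic.AddUint64(v.(*uint64), 1)
}

func (c *shardedCounters) Get(name string) uint64 {
	if v, ok := c.m.Load(name); ok {
		return atomic.LoadUint64(v.(*uint64))
	}
	return 0
}

func main() {
	c := &shardedCounters{}
	c.Inc("consul.prepared_query.execute") // illustrative metric name
	c.Inc("consul.prepared_query.execute")
	fmt.Println(c.Get("consul.prepared_query.execute")) // 2
}
```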
This did look like metrics code blocking and causing an overall system failure. Still, we had doubts: the PreparedQuery QPS of a single Server in the reproduction environment was 1.5K, while in a stable network environment a single Server works fine in stress tests at up to 2.8K QPS. In other words, the original code met the performance requirements under normal conditions, and the performance problem appeared only during the failure.
The subsequent troubleshooting got stuck for a while, until after some trial and error we found an interesting phenomenon: the binary compiled with go 1.9 (which is also what the production environment uses) reproduces the failure, while the same code compiled with go 1.14 does not. Looking closer, we found the following two records in Go's release history:
Following that lead, we found user reports that in go 1.9 through 1.13, when a large number of goroutines compete for the same sync.Mutex, performance drops sharply, which explains our problem well. Since the Consul code relies on standard-library features added in go 1.9, we could not compile with an older version, so we removed the sync.Mutex-related optimizations from go 1.14, as shown in the following figure, compiled Consul with that modified toolchain, and, sure enough, could reproduce our failure.
Reviewing the language's update history: go 1.9 added a fairness mechanism to sync.Mutex, introducing a starvation mode alongside the original normal mode to avoid long-tail lock waits. In normal mode, a newly arriving goroutine has a good chance of winning the lock directly, which avoids a goroutine hand-off and is efficient overall. In starvation mode, a newly arriving goroutine does not compete for the lock; it queues at the tail of the wait queue, sleeps, and waits to be woken, and the lock is handed over in FIFO order, so the goroutine that receives the lock must be scheduled before it can run, which increases scheduling and hand-off costs. go 1.14 improved this: in starvation mode, a goroutine performing the unlock operation hands its CPU time directly to the next goroutine waiting for the lock, which speeds up execution of the lock-protected code overall.
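To get a feel for the contention scenario, the toy program below hammers one sync.Mutex from many goroutines around a critical section no bigger than a metrics update. Per the release-history changes described above, building it with go 1.13 versus go 1.14 should show noticeably different wall-clock times; the goroutine and iteration counts are arbitrary, and actual results depend on the machine.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	const goroutines = 10000
	const iterations = 1000

	var mu sync.Mutex
	var counter int64
	var wg sync.WaitGroup

	start := time.Now()
	for i := 0; i < goroutines; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < iterations; j++ {
				// Tiny critical section, like a metrics update. Once waiters
				// queue for more than ~1ms the mutex enters starvation mode;
				// on go 1.9-1.13 throughput collapses under this load, while
				// go 1.14 hands the lock (and CPU) directly to the next waiter.
				mu.Lock()
				counter++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	fmt.Printf("%d lock/unlock pairs in %v\n", counter, time.Since(start))
}
```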
The cause of the failure is now clear. First, network jitter caused a large backlog of PreparedQuery requests in the Server, along with heavy goroutine and memory usage. After the network recovered, the backlogged PreparedQuery requests continued to execute; in our reproduction scenario the backlog exceeded 150K goroutines. As these goroutines execute they update metrics, and therefore acquire the global sync.Mutex, which switches into starvation mode; performance degrades, most of the time is spent waiting on the sync.Mutex, and requests block until they time out. On top of the backlog, new PreparedQuery requests keep arriving and also block on the lock, so the sync.Mutex stays in starvation mode and cannot recover on its own. Meanwhile, the Raft code depends on timers, timeouts, and timely delivery and processing of messages between nodes, with timeouts at the millisecond-to-second level, but the blocked metrics calls hold things up far longer than that, which directly breaks the timing-related logic.
We then rolled fixes for all of the problems found into the production environment, upgrading to go 1.14, armon/go-metrics v0.3.3, and hashicorp/raft v1.1.2, which brought Consul back to a stable state. In addition, we sorted out and improved the monitoring metrics; the core monitoring covers the following dimensions:
Process: CPU, memory, goroutines, connections
Raft: membership changes, commit rate, commit latency, replication heartbeats, replication lag
RPC: connection count, cross-DC request count
Write load: registration and deregistration rates
Read load: Catalog/Health/PreparedQuery request counts and execution latency
Redundant registration
Based on what we observed during the Consul failure, we re-examined the architecture of the service registry.
In the Consul architecture, if the Consul Servers of a DC all fail, the whole DC is considered failed, and disaster recovery has to rely on other DCs. In reality, however, many services that are off the critical path or have modest SLA requirements are deployed in only one DC; if that DC's Consul fails, the whole service becomes unavailable.
For services without a multi-DC deployment, if they are redundantly registered in another DC, their instances can still be discovered from that DC when a single DC's Consul fails. We therefore modified the QAE registration relationship table: for services deployed in only one DC, the system automatically adds a registration in another DC, as shown below:
QAE's redundant registration amounts to overwriting data in the layer above Consul. Consul itself does not synchronize service registration data between DCs, so services registered directly through a Consul Agent have no good way to be redundantly registered and must instead rely on multi-DC deployment of the service itself.
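As a rough sketch of the idea (the real QAE sync logic is internal), the same kind of catalog registration shown earlier can simply be written into a second DC by setting the Datacenter write option; the DC name and service details below are placeholders.

```go
package main

import (
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	reg := &api.CatalogRegistration{ // same shape as the earlier registration sketch
		Node:    "qae-sync",
		Address: "10.0.0.12",
		Service: &api.AgentService{
			ID:      "user-service-0",
			Service: "user-service",
			Address: "10.0.0.12",
			Port:    8080,
		},
	}

	// WriteOptions.Datacenter routes the write to another DC's servers,
	// giving a single-DC service a redundant copy of its registration.
	if _, err := client.Catalog().Register(reg, &api.WriteOptions{Datacenter: "dc-backup"}); err != nil {
		log.Fatal(err)
	}
}
```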
Safeguarding the API gateway
Currently, the API gateway's normal operation depends on a local cache of Consul PreparedQuery results, and this interaction mode has two problems:
The gateway cache is lazily loaded: a service is only queried from Consul the first time it is used, so if Consul happens to be failing at that moment, the query fails and request forwarding fails.
A PreparedQuery may involve multiple cross-DC queries and is a relatively complex, time-consuming operation. Because every gateway node builds its own cache, and cache entries have a TTL, the same PreparedQuery is executed many times, and the query QPS grows linearly with the size of the gateway cluster.
To improve the stability and efficiency of the gateway's Consul queries, we chose to deploy a separate Consul cluster for each gateway cluster, as shown in the following figure:
The red cluster in the figure is the original Consul cluster, and the green one is the Consul cluster deployed separately for the gateway, which only works within a single DC. We developed a Gateway-Consul-Sync component that periodically reads the PreparedQuery results for each service from the public Consul cluster and writes them into the green Consul cluster, while the gateway queries the green Consul directly for its data (a sketch of such a sync loop follows the list of advantages below). This transformation brings the following advantages:
From the point of view of supporting the gateways, the load on the public cluster used to grow linearly with the number of gateway nodes; after the transformation it grows linearly with the number of services, and each service triggers only one PreparedQuery execution per sync cycle, so the overall load is reduced.
The green Consul in the figure is used only by the gateway. When it executes a PreparedQuery, all data is local and no cross-DC query is involved, so complexity is reduced, it is unaffected by the cross-DC network, and the cluster's overall read and write load is more controllable and stable.
When the public cluster fails, Gateway-Consul-Sync cannot work normally, but the green Consul can still return previously synchronized data and the gateway can keep working.
Since the interface and data format the gateway uses to query Consul are exactly the same before and after the transformation, if the green Consul cluster in the figure fails, the gateway can be switched back to the public Consul cluster as a fallback.
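A minimal sketch of what such a sync loop could look like is shown below. The real Gateway-Consul-Sync component is internal to iQIYI; the query name, interval, and cluster addresses here are assumptions.

```go
package main

import (
	"log"
	"time"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Public multi-DC Consul cluster (source of truth) and the gateway's own
	// single-DC Consul cluster; both addresses are placeholders.
	public, err := api.NewClient(&api.Config{Address: "public-consul.example:8500"})
	if err != nil {
		log.Fatal(err)
	}
	local, err := api.NewClient(&api.Config{Address: "gateway-consul.example:8500"})
	if err != nil {
		log.Fatal(err)
	}

	for range time.Tick(30 * time.Second) {
		// Run the (possibly cross-DC) PreparedQuery once per cycle against the
		// public cluster, instead of once per gateway node per cache miss.
		resp, _, err := public.PreparedQuery().Execute("nearest-user-service", nil)
		if err != nil {
			log.Printf("sync: prepared query failed: %v", err)
			continue // the gateway keeps serving the last data synced into its local cluster
		}

		// Flatten the result into plain catalog registrations in the local
		// cluster, so gateway lookups never have to leave the DC.
		for _, e := range resp.Nodes {
			reg := &api.CatalogRegistration{
				Node:    e.Node.Node,
				Address: e.Node.Address,
				Service: &api.AgentService{
					ID:      e.Service.ID,
					Service: e.Service.Service,
					Address: e.Service.Address,
					Port:    e.Service.Port,
				},
			}
			if _, err := local.Catalog().Register(reg, nil); err != nil {
				log.Printf("sync: register %s failed: %v", e.Service.ID, err)
			}
		}
		// A real implementation would also deregister instances that have
		// disappeared from the public cluster since the last cycle.
	}
}
```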
That concludes our analysis and optimization of this Consul failure; we hope the above content is of some help to readers facing similar problems.