How to ensure stability and optimize performance of etcd at the scale of ten thousand K8s clusters

This article introduces how to ensure etcd stability and optimize its performance at the scale of ten thousand K8s clusters, walking through real cases and the lessons learned from them.

With ten thousand K8s clusters to manage, how do we efficiently guarantee the stability of their etcd clusters?

Where do the stability risks of an etcd cluster come from?

We analyzed the stability risk model based on business scenarios, historical incidents and production operation experience. The risks mainly come from the unreasonable design of the old TKE etcd architecture, etcd's own stability bugs, etcd performance failing to meet business requirements in some scenarios, insufficient test coverage, lax change management, incomplete monitoring, lack of automated inspection of hidden risks, and missing guarantees of data safety in extreme disaster scenarios.

Our etcd platform has already addressed the scalability, operability, observability and data security of the etcd clusters behind all the container services we manage, through architecture design, change management, monitoring and inspection, and data migration and backup. This article therefore focuses on the etcd kernel stability and performance challenges we face in the ten-thousand-cluster K8s scenario, such as:

Data inconsistency

Memory leak

Deadlock

Process Crash

Large packet requests lead to etcd OOM and packet loss

Slow startup in scenarios with large data volume

Poor performance of the authentication interface and of the interfaces for querying the number of keys and for querying a specified number of records

This article briefly describes how we discovered, analyzed, reproduced and solved these problems and challenges, the experience and lessons we gained in the process, and how we applied them to guaranteeing the storage stability of our container services.

At the same time, we have contributed all of these fixes to the etcd open source community; so far, all 30+ of our PRs have been merged. The Tencent Cloud TKE etcd team was one of the most active contributors to the etcd community in the first half of 2020. In the process, we are especially grateful for the support and help of the community maintainers from AWS, Google, Alibaba and others.

Case Analysis of Stability Optimization

From GitLab losing part of its primary database to an erroneous deletion, to GitHub's 24-hour outage caused by data inconsistency, to the hours-long failure of AWS S3, the "unsinkable aircraft carrier", all of these incidents happened in storage services. Stability is crucial to the reputation of a storage service, and even of a company; it can determine whether a product lives or dies. In this section on stability optimization, we look at the severity of data inconsistency, two etcd data-inconsistency bugs, a lease memory leak, an mvcc deadlock and a wal crash, explain how we discovered, analyzed, reproduced and solved each case, and share the lessons we took away from each one, so that problems can be prevented before they occur.

Data Inconsistency

Speaking of major failures caused by data inconsistency, we have to mention in detail the 2018 GitHub incident, in which routine maintenance on network equipment disconnected the US East Coast network hub from GitHub's primary East Coast data center. Although connectivity was restored within 43 seconds, the brief outage triggered a chain of events that degraded GitHub's service for 24 hours and 11 minutes, leaving some features unavailable.

GitHub uses a large number of MySQL clusters to store its metadata, such as issues, PRs and pages, with cross-city disaster recovery between the East and West Coasts. The core cause of the failure was that, when the network failed, GitHub's MySQL arbitration service Orchestrator performed a failover and directed writes to the MySQL cluster on the US West Coast (before the failure the primary was on the East Coast). However, the East Coast MySQL still contained a small window of writes that had not yet been replicated to the West Coast cluster, so after the failover each data center contained writes that did not exist in the other, and it was therefore not possible to safely fail the primary database back to the East Coast.

In the end, to ensure that no user data was lost, GitHub chose to repair data consistency at the cost of more than 24 hours of degraded service.

The severity of data inconsistency is self-evident. etcd, however, is a distributed, highly reliable storage system based on the raft protocol, and we do not do cross-city disaster recovery, so in theory it should be hard for us to run into such serious data-inconsistency bugs. But the dream is beautiful and reality is cruel: not only did we hit incredible data-inconsistency bugs, we hit two of them. One is triggered with low probability when restarting etcd; the other is triggered with fairly high probability in K8s scenarios when upgrading the etcd version with authentication enabled. Before discussing the two bugs in detail, let's look at what problems etcd data inconsistency causes in a K8s scenario.

The most frightening thing about data inconsistency is that the client's write appears to succeed, but reads on some nodes may return empty or stale data; the client cannot detect that the write failed on some nodes and may read that stale data.

Reading empty data may cause business Nodes and Pods to disappear, and Service routing rules on nodes to disappear. Generally, only workloads that the user has changed are affected.

Reading stale data causes business changes not to take effect, for example service scaling failing, Service rs replacement failing, or image updates behaving abnormally. Again, generally only workloads that the user has changed are affected.

In an etcd platform migration scenario, the client cannot perceive the write failure. If the data consistency check shows no anomaly (because the check happens to connect to a healthy node), the whole cluster can fail completely after migration (when the apiserver connects to an abnormal node): the user's Nodes, deployed workloads, load balancers and so on may be deleted, seriously affecting the user's business.

The first inconsistency bug was encountered while restarting etcd, and manual attempts to reproduce it failed many times. Analyzing, locating, reproducing and fixing this bug took several twists and turns, and the process was very interesting and challenging. In the end we added debug logs at key points and wrote a chaos monkey to simulate various abnormal scenarios and boundary conditions, and finally reproduced it. The real culprit turned out to be that an authorization request was replayed after restart, making the auth revision inconsistent across nodes, which then escalated into inconsistent mvcc data across nodes and left some nodes unable to write new data; the bug had existed for three years and affected all etcd v3 versions.

We then submitted several related PRs to the community, all of which were merged; the latest etcd v3.4.9 [1] and v3.3.22 [2] fix this problem, and Google's jingyih also opened a K8s issue and PR [3] to upgrade the etcd client and server of K8s 1.19 to the latest v3.4.9. For details of this bug, please refer to the in-depth analysis article on this three-year-old etcd3 data inconsistency bug.

The second inconsistency bug was encountered while upgrading etcd. Because etcd lacked key error logs, there was little useful information at the failure site and it was hard to locate; it could only be solved by analyzing the code and reproducing the problem. Manual attempts to reproduce it failed many times, so we simulated client behaviour with our chaos monkey, routed the etcd requests of all K8s clusters in the test environment to our reproduction cluster, and compared the differences between versions 3.2 and 3.3. We also added extensive logging at suspicious points such as the lease and txn modules, and printed error logs for etcd apply-request failure scenarios.

With these measures we quickly reproduced the problem. Through the code and logs we found that versions 3.2 and 3.3 differ in the permission checks for revoking leases: when a lease expires, if the leader runs 3.2, the revoke request fails on 3.3 nodes for lack of permission, leading to inconsistent key counts, inconsistent mvcc revisions, and failures of some txn transactions. The latest 3.2 branch has merged our fix. We also added error logs for etcd core-process failures to make data inconsistencies easier to locate, and improved the upgrade documentation to spell out that lease revocation can cause data inconsistency in this scenario, so that others do not step into the same pit.

From these two data inconsistency bugs, we took away the following lessons and best practices:

Data consistency of the algorithm in theory does not mean the service as a whole can guarantee data consistency. For distributed storage systems built on log-replicated state machines, there is currently no core mechanism that guarantees correct cooperation between raft, wal, mvcc, snapshot and the other modules: raft can only guarantee the consistency of the replicated log, not that applying the commands in that log will succeed at the application layer.

Upgrading an etcd version carries real risk. You need to carefully review the code and evaluate whether there are incompatible features; anything that affects the auth revision or mvcc revision may lead to data inconsistency during the upgrade. Production clusters must always be upgraded gradually (gray release).

Add consistency inspection alerts to all etcd clusters, such as monitoring revision differences and key-count differences between nodes (a small monitoring sketch follows this list of practices).

Back up etcd cluster data regularly; by Murphy's law, even a low-probability failure will eventually happen. Although etcd itself has fairly complete automated tests (unit tests, integration tests, e2e tests, fault-injection tests and so on), there are still many scenarios the test cases cannot cover. We need to prepare for the worst case (for example, the wal, snap and db files of all 3 nodes being corrupted at the same time), reduce losses in extreme situations, and be able to restore quickly from usable backup data.

Gradually enable the data corruption detection feature available after etcd v3.4.4 across clusters, so that when a cluster becomes inconsistent it rejects reads and writes, stopping the loss in time and limiting the range of inconsistent data.

Keep improving our chaos monkey and use etcd's own fault-injection testing framework, functional, to help us verify and stress the stability of new versions over long runs, reproduce deeply hidden bugs, and reduce the chance of stepping on pitfalls in production.
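As a hedged illustration of the consistency inspection alerts above, the sketch below polls every endpoint of a cluster and reports when the revision spread exceeds a threshold. It uses the public clientv3 Status API; the endpoints, threshold and alert action are hypothetical placeholders, not our production tooling.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"go.etcd.io/etcd/clientv3" // v3.4-era import path; v3.5+ uses go.etcd.io/etcd/client/v3
)

// checkRevisionSpread queries each endpoint's current revision and reports
// whether the gap between the highest and lowest revision exceeds maxDiff.
func checkRevisionSpread(cli *clientv3.Client, endpoints []string, maxDiff int64) error {
	var min, max int64
	for i, ep := range endpoints {
		ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
		st, err := cli.Status(ctx, ep)
		cancel()
		if err != nil {
			return fmt.Errorf("status %s: %w", ep, err)
		}
		rev := st.Header.Revision
		if i == 0 || rev < min {
			min = rev
		}
		if rev > max {
			max = rev
		}
	}
	if max-min > maxDiff {
		// In production this would fire an alert instead of printing.
		fmt.Printf("ALERT: revision spread %d exceeds %d\n", max-min, maxDiff)
	}
	return nil
}

func main() {
	endpoints := []string{"127.0.0.1:2379", "127.0.0.1:22379", "127.0.0.1:32379"} // placeholders
	cli, err := clientv3.New(clientv3.Config{Endpoints: endpoints, DialTimeout: 5 * time.Second})
	if err != nil {
		panic(err)
	}
	defer cli.Close()
	if err := checkRevisionSpread(cli, endpoints, 1000); err != nil {
		fmt.Println("inspection failed:", err)
	}
}
```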

Memory leak (OOM)

It is well known that etcd is written in Go. Can a language with built-in garbage collection still leak memory? First we need to understand how Go's garbage collector works: it runs in the background, tracks the status of each object, and reclaims objects that are no longer referenced so their memory can be reused. If your program keeps objects referenced for a long time, however, the garbage collector cannot help you. For example, the following scenarios can all cause memory leaks (a short Go sketch illustrating two of them follows the list):

Goroutine leakage

Misplaced defer calls (for example, calling defer directly inside a for loop without wrapping the body in an anonymous function, so resources are only released when the whole loop finishes rather than at the end of each iteration)

Taking a sub-slice of a large string/slice, which keeps the whole underlying array alive because both share the same backing memory

Poor management of in-memory data structures (for example, failing to clean up expired or invalid entries in time)
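A minimal sketch of two of these patterns, written purely for illustration (the file paths and buffer sizes are made up):

```go
package main

import (
	"fmt"
	"os"
)

// leakyDefers opens many files in one loop; every defer runs only when the
// function returns, so all descriptors stay open until the loop is done.
func leakyDefers(paths []string) {
	for _, p := range paths {
		f, err := os.Open(p)
		if err != nil {
			continue
		}
		defer f.Close() // accumulates: nothing is closed until leakyDefers returns
		// ... read from f ...
	}
}

// fixedDefers wraps each iteration in a function so the defer fires per file.
func fixedDefers(paths []string) {
	for _, p := range paths {
		func() {
			f, err := os.Open(p)
			if err != nil {
				return
			}
			defer f.Close() // released at the end of each iteration
			// ... read from f ...
		}()
	}
}

// header keeps a tiny prefix of a huge buffer, but the sub-slice still pins
// the entire 64 MB backing array in memory.
func header(huge []byte) []byte {
	return huge[:16]
}

// headerCopy copies the prefix so the large buffer can be collected.
func headerCopy(huge []byte) []byte {
	h := make([]byte, 16)
	copy(h, huge[:16])
	return h
}

func main() {
	buf := make([]byte, 64<<20)
	fmt.Println(len(header(buf)), len(headerCopy(buf)))
	leakyDefers(nil)
	fixedDefers(nil)
}
```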

Now let's look at the etcd memory leak we actually hit. It started on a weekend morning at the end of March, when I woke up to an alert that the memory usage of a production 3.4 cluster far exceeded the safety threshold. I immediately investigated and found the following:

QPS and traffic monitoring showed low values, ruling out high load and slow queries.

Of the three nodes in the cluster, only the two follower nodes were abnormal: the leader used about 4 GB of memory, while the follower nodes were as high as 23 GB.

Goroutines, file descriptors and other resources showed no leak.

The Go runtime memstats metrics showed that each node allocated a similar amount of memory, but go_memstats_heap_released_bytes on the follower nodes was much lower than on the leader, indicating that some data structure was probably not being released for a long time.

pprof was disabled by default on the production cluster, so we enabled it and waited for the problem to recur, and meanwhile searched the community for similar cases. It turned out that several users had reported the same phenomenon back in January, with the same usage scenario as ours, but it had not attracted the community's attention.

With the community's heap profile we quickly located the cause: etcd manages lease state with a min-heap, and an expired lease should be removed from the heap, but follower nodes never did so, which made the followers leak memory. The bug affects all 3.4 versions.

Once the problem was clear, the fix I submitted was: follower nodes do not need to maintain the lease heap at all; when leadership changes, the node promoted to leader rebuilds the lease heap, and the old leader clears its own.
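For intuition, here is a heavily simplified sketch of that idea, not etcd's actual lessor code: lease expirations are tracked in a min-heap, the heap is only maintained on the leader, and it is rebuilt when a node is promoted.

```go
package main

import (
	"container/heap"
	"fmt"
	"time"
)

// leaseItem is one lease with its expiry time.
type leaseItem struct {
	id     int64
	expiry time.Time
}

// leaseHeap orders leases by soonest expiry (a min-heap).
type leaseHeap []leaseItem

func (h leaseHeap) Len() int            { return len(h) }
func (h leaseHeap) Less(i, j int) bool  { return h[i].expiry.Before(h[j].expiry) }
func (h leaseHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *leaseHeap) Push(x interface{}) { *h = append(*h, x.(leaseItem)) }
func (h *leaseHeap) Pop() interface{} {
	old := *h
	it := old[len(old)-1]
	*h = old[:len(old)-1]
	return it
}

// lessor tracks leases; only the leader keeps the expiry heap populated.
type lessor struct {
	isLeader bool
	leases   map[int64]leaseItem
	expiries leaseHeap
}

func (l *lessor) grant(id int64, ttl time.Duration) {
	it := leaseItem{id: id, expiry: time.Now().Add(ttl)}
	l.leases[id] = it
	if l.isLeader { // followers skip the heap, so it cannot grow without bound
		heap.Push(&l.expiries, it)
	}
}

// promote rebuilds the heap when this node becomes leader.
func (l *lessor) promote() {
	l.isLeader = true
	l.expiries = l.expiries[:0]
	for _, it := range l.leases {
		heap.Push(&l.expiries, it)
	}
}

// demote clears the heap when this node steps down.
func (l *lessor) demote() {
	l.isLeader = false
	l.expiries = nil
}

func main() {
	l := &lessor{leases: map[int64]leaseItem{}}
	l.grant(1, time.Minute) // as a follower: map only, no heap entry
	l.promote()             // becoming leader rebuilds the heap
	fmt.Println("heap size after promote:", l.expiries.Len())
	l.demote()
}
```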

This memory leak bug was caused by poor management of an in-memory data structure. As soon as the problem was fixed, the etcd community released a new version (v3.4.6+) and K8s immediately updated its etcd version.

From this memory leak bug, we took away the following lessons and best practices:

Keep following community issues and PRs; someone else's problem today may well be ours tomorrow.

etcd's own tests cannot cover resource-leak bugs that take time to trigger, so we need to strengthen internal testing and long-running stress tests for such scenarios.

Keep improving and enriching the etcd platform's monitoring and alerting, and reserve enough memory headroom on each machine to absorb unexpected factors.

Storage layer deadlock (Mvcc Deadlock)

A deadlock occurs when two or more goroutines wait on each other during execution, usually because of competition for resources (typically locks) or because of communication (via channels), leaving the program stuck and unable to serve requests. Because deadlocks usually arise from resource contention under concurrency, they are generally hard to locate and reproduce; their nature means we must preserve the scene of the incident, otherwise analysis and reproduction become very difficult.

So how did we find and fix this deadlock bug? The problem surfaced when an internal team was testing an etcd cluster and found that a node suddenly hung and could not recover, and could no longer return information such as the number of keys. After receiving the report, I analyzed the stuck etcd process and checked the monitoring, and reached the following conclusions:

RPCs that do not go through the raft and mvcc modules, such as member list, returned normally, while all RPCs that do go through them failed with context timeouts.

The etcd health check returned 503, and its error-reporting logic also goes through raft and mvcc.

tcpdump and netstat ruled out problems in the raft network module, narrowing the suspect down to mvcc.

Analyzing the logs showed that, because the node's data lagged behind the leader, it had received a data snapshot; it then got stuck while applying the snapshot, and the "snapshot loaded" log line was never printed. We also confirmed that no logs had been lost.

We reviewed the snapshot-loading code, identified several suspicious locks and the goroutines involved, and prepared to capture the stacks of the stuck goroutines.

We obtained the goroutine stacks via kill or pprof. Based on how long the goroutines had been stuck and the code logic around the suspicious points, we found the two goroutines competing for resources. One is the main goroutine that loads the snapshot and rebuilds the db: it holds an mvcc lock while waiting for all asynchronous tasks to finish. The other executes the historical-key compaction task: when it receives the stop signal, it exits immediately and invokes a compactBarrier, and that logic needs to acquire the same mvcc lock, hence the deadlock. The stack is as follows.
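The general shape of this kind of deadlock can be shown with a deliberately broken sketch (not etcd's code): goroutine A holds a mutex while waiting for a worker to acknowledge a stop signal, but the worker's shutdown path needs that same mutex.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type store struct {
	mu   sync.Mutex
	stop chan struct{}
	done chan struct{}
}

// compactLoop simulates the background compaction worker: on stop it runs a
// barrier that needs the store lock before it can signal completion.
func (s *store) compactLoop() {
	<-s.stop
	s.mu.Lock() // blocks forever: restore() already holds the lock
	s.mu.Unlock()
	close(s.done)
}

// restore simulates applying a snapshot: it takes the store lock, stops the
// worker, then waits for it to finish -- while still holding the lock.
func (s *store) restore() {
	s.mu.Lock()
	defer s.mu.Unlock()
	close(s.stop)
	<-s.done // deadlock: compactLoop can never reach close(s.done)
}

func main() {
	s := &store{stop: make(chan struct{}), done: make(chan struct{})}
	go s.compactLoop()
	go s.restore()
	time.Sleep(2 * time.Second)
	fmt.Println("both goroutines are now stuck; the fix is to signal the barrier",
		"without re-acquiring the lock, or to release the lock before waiting")
}
```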

This bug had also been hidden for a long time and affects all etcd v3 versions. It is triggered when a cluster has a large volume of writes and a lagging node is rebuilding from a snapshot while a historical-version compaction runs at the same time. The fix PR I submitted has been merged into the 3.3 and 3.4 branches and released in new versions (v3.3.21+ and v3.4.8+).

From this deadlock bug, we took away the following lessons and best practices:

etcd's automated tests do not cover combinations of multiple concurrent scenarios, and such cases are hard to build, so bugs slip through easily. Are there other similar scenarios with the same problem? We need to keep working with the community to improve etcd's test coverage (the etcd project has stated that more than half of its code is already test code) to avoid this class of problem.

Monitoring can detect that an abnormal node has stopped serving, but before this fix we did not restart etcd automatically on deadlock. We need to improve our health-detection mechanism (for example, curling /health to decide whether the service is healthy), so that when a deadlock happens we can both preserve the stacks and restart the service automatically (a small watchdog sketch follows this list).

For read-heavy scenarios, you need to evaluate whether the QPS capacity of the remaining two nodes of a 3-node cluster can still support the business after one node goes down; if not, consider a 5-node cluster.
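A minimal watchdog sketch for the health-check idea above, assuming an etcd /health endpoint on localhost and that sending SIGQUIT (which makes a Go process dump all goroutine stacks before exiting) is an acceptable way to preserve the scene; the port, thresholds and restart command are placeholders, not our production tooling.

```go
package main

import (
	"log"
	"net/http"
	"os/exec"
	"time"
)

const (
	healthURL   = "http://127.0.0.1:2379/health" // placeholder endpoint
	maxFailures = 3
)

func healthy(client *http.Client) bool {
	resp, err := client.Get(healthURL)
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

func main() {
	client := &http.Client{Timeout: 2 * time.Second}
	failures := 0
	for range time.Tick(10 * time.Second) {
		if healthy(client) {
			failures = 0
			continue
		}
		failures++
		log.Printf("health check failed %d time(s)", failures)
		if failures < maxFailures {
			continue
		}
		// SIGQUIT makes the Go runtime dump every goroutine stack to stderr
		// before exiting, preserving the deadlock scene in the etcd log.
		if out, err := exec.Command("pkill", "-SIGQUIT", "etcd").CombinedOutput(); err != nil {
			log.Printf("pkill: %v %s", err, out)
		}
		time.Sleep(5 * time.Second)
		// The restart command is environment-specific; systemd is shown as an example.
		if out, err := exec.Command("systemctl", "restart", "etcd").CombinedOutput(); err != nil {
			log.Printf("restart: %v %s", err, out)
		}
		failures = 0
	}
}
```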

Wal crash (Panic)

A panic is a severe runtime or business-logic error that causes the whole process to exit. Panics are no strangers to us; we have hit them several times in production, and the earliest instability we encountered was a panic during cluster operation.

Although a 3-node etcd cluster can tolerate the failure of one node, a crash still has an instantaneous impact on users, and sometimes cluster connections fail outright.

The first crash bug we encountered showed up as a certain probability of crashing when a cluster had a large number of connections; looking at the stack, someone in the community had already reported this grpc crash (issue) [4], caused by the grpc-go component that etcd depends on (pr) [5]. The more recent crash bug [6], however, came with the release of v3.4.8/v3.3.21, a version that has a lot to do with us: we contributed 3 PRs to it, more than half of the release. So how was this crash bug introduced, and how do we reproduce it? Could it be our own fault?

First, the crash error was walpb: crc mismatch, and we had not submitted any code touching the wal logic, which ruled out our own changes.

Second, by reviewing the new version's PRs, we found that the problem was introduced by a Google contributor while fixing a crash caused by a wal write succeeding while the subsequent snapshot write failed.

But how exactly was it introduced? The PR contains several test cases to verify the new logic, and we could not reproduce the problem either with freshly created empty clusters or with existing clusters holding relatively little data.

The error log contained too little information to tell which function reported the error, so the first step was to add logging. After adding error logs at every suspicious point, we picked a random old node in our test cluster, replaced its binary with the new version, and reproduced the crash easily. We determined that the error came from the newly added validation of snapshot files. So why does a crc mismatch appear? First, a brief look at the wal file.

In etcd, every request coming out of the raft module is persisted to the wal file before being written to the etcd mvcc db. If the process is killed while applying a command, the data can be recovered on restart by replaying the wal file, so no data is lost. The wal file contains all kinds of request records, such as member change information and the various key operations. To guarantee integrity and detect corruption, each wal record carries a crc32 checksum that is written to the file. When the wal file is parsed after a restart, each record's integrity is checked, and a crc mismatch appears if the data is corrupted or the crc32 calculation changes.
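A simplified sketch of per-record CRC checking, for intuition only; it is not etcd's exact wal encoding, and the record layout here is invented:

```go
package main

import (
	"fmt"
	"hash/crc32"
)

// record is a toy wal record: a payload plus the crc32 stored alongside it.
type record struct {
	data []byte
	crc  uint32
}

// appendRecord computes the checksum at write time and stores it with the data.
func appendRecord(wal []record, data []byte) []record {
	return append(wal, record{data: data, crc: crc32.ChecksumIEEE(data)})
}

// replay re-checks every record on startup; any mismatch aborts the replay,
// which is the condition that surfaces as "walpb: crc mismatch" in etcd.
func replay(wal []record) error {
	for i, r := range wal {
		if crc32.ChecksumIEEE(r.data) != r.crc {
			return fmt.Errorf("record %d: crc mismatch", i)
		}
	}
	return nil
}

func main() {
	var wal []record
	wal = appendRecord(wal, []byte("put /registry/pods/p1"))
	wal = appendRecord(wal, []byte("member add node-4"))
	wal[1].data[0] ^= 0xff // simulate on-disk corruption of the second record
	fmt.Println(replay(wal))
}
```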

The disk and filesystem showed no anomalies, which ruled out data corruption. Digging into the crc32 calculation, we found that the new logic did not handle wal records of the crc type, and those records affect the running crc value, producing a mismatch. The issue is only triggered after the first wal file is recycled following cluster creation, so any cluster that has been running for a while reproduces it 100% of the time.

The fix was to add handling for crc-type records in the crc32 logic, plus unit tests covering the wal-file recycling scenario. The community has merged it and released new 3.4 and 3.3 versions (v3.4.9/v3.3.22).

Although this bug was reported by community users, we took away the following lessons and best practices from it:

Unit test cases are very valuable, but writing complete ones is not easy; all kinds of scenarios need to be considered.

The etcd community has almost no test cases for upgrading existing clusters or for compatibility between versions, so we need to work together to make the test suite cover more scenarios.

The internal release process for new versions should be standardized and automated: stress testing in the test environment, chaos testing, performance comparison between versions, rolling out first in non-core scenarios (such as event clusters), and gradual (gray) release are all essential steps.

Quota and speed limit (Quota&QoS)

etcd consumes a lot of CPU, memory and bandwidth when facing expensive read and expensive write operations, such as full keyspace fetches, large numbers of event queries, listing all pods, and configmap writes, which can easily lead to overload or even an avalanche.

However, etcd only has a very crude rate-limiting protection: when the gap between etcd's committed index and applied index exceeds a threshold of 5000, it rejects all requests and returns Too Many Requests. Its flaws are obvious: it cannot precisely limit expensive reads/writes and cannot effectively prevent the cluster from becoming overloaded and unavailable.
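A sketch of that built-in check, reduced to its essence for illustration; the constant and error text mirror the idea described above, not etcd's exact source:

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// maxGapBetweenApplyAndCommitIndex mirrors the 5000-entry limit described above.
const maxGapBetweenApplyAndCommitIndex = 5000

var errTooManyRequests = errors.New("etcdserver: too many requests")

type server struct {
	committedIndex uint64
	appliedIndex   uint64
}

// checkBackpressure rejects every request once the apply loop lags too far
// behind the raft commit index -- etcd's only built-in overload protection.
func (s *server) checkBackpressure() error {
	ci := atomic.LoadUint64(&s.committedIndex)
	ai := atomic.LoadUint64(&s.appliedIndex)
	if ci > ai && ci-ai > maxGapBetweenApplyAndCommitIndex {
		return errTooManyRequests
	}
	return nil
}

func main() {
	s := &server{committedIndex: 12000, appliedIndex: 5000}
	fmt.Println(s.checkBackpressure()) // rejected: gap of 7000 > 5000
	s.appliedIndex = 11000
	fmt.Println(s.checkBackpressure()) // allowed: gap of 1000
}
```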

In order to solve the above challenges and avoid cluster overload, we use the following solutions to ensure cluster stability:

Use the rate-limiting ability of the layer above etcd, the K8s apiserver; for example, the apiserver defaults to roughly 100 writes/s and 200 reads/s.

Use K8s resource quotas to keep the number of Pods/configmaps/CRDs reasonable.

Use the kube-controller-manager --terminated-pod-gc-threshold parameter to control the number of terminated Pods kept around (the default is as high as 12500, leaving plenty of room for optimization).

Use the apiserver's ability to store different resource types independently, putting event/configmap and other such data into separate etcd clusters; this both improves storage performance and reduces the failure factors affecting the core main etcd.

Rate-limit apiserver reads and writes of events with an event admission webhook (a hedged sketch follows this list of measures).

Flexibly adjust the event-ttl according to business conditions to minimize the number of events.

Develop a QoS feature on the etcd side; a preliminary design has been submitted to the community, supporting QoS rules keyed on multiple object types (such as gRPC method, gRPC method plus request key prefix path, traffic, CPU-intensive requests, and latency).

Use multi-dimensional cluster alerting (incoming/outgoing traffic alerts for the etcd cluster LB and the nodes themselves, memory alerts, per-K8s-cluster alerts on abnormal resource-capacity growth, and alerts on abnormal growth of cluster read/write QPS) to anticipate and avoid possible cluster stability problems.
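As a hedged illustration of the event admission webhook idea, here is a minimal validating-webhook handler that rejects Event writes once a global rate limit is exceeded; the limits, URL path and TLS file names are placeholders, and this is a sketch rather than our production component:

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"

	"golang.org/x/time/rate"
	admissionv1 "k8s.io/api/admission/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// limiter caps how many Event writes per second the webhook admits (placeholder numbers).
var limiter = rate.NewLimiter(rate.Limit(50), 100)

func serveEvents(w http.ResponseWriter, r *http.Request) {
	var review admissionv1.AdmissionReview
	if err := json.NewDecoder(r.Body).Decode(&review); err != nil || review.Request == nil {
		http.Error(w, "malformed admission review", http.StatusBadRequest)
		return
	}
	resp := &admissionv1.AdmissionResponse{UID: review.Request.UID, Allowed: true}
	if !limiter.Allow() {
		// Over budget: tell the apiserver to reject this Event write.
		resp.Allowed = false
		resp.Result = &metav1.Status{
			Code:    http.StatusTooManyRequests,
			Message: "event writes are currently rate limited",
		}
	}
	review.Response = resp
	review.Request = nil
	if err := json.NewEncoder(w).Encode(&review); err != nil {
		log.Printf("encode admission response: %v", err)
	}
}

func main() {
	http.HandleFunc("/validate-events", serveEvents)
	// The apiserver requires HTTPS for webhooks; cert/key paths are placeholders.
	log.Fatal(http.ListenAndServeTLS(":8443", "tls.crt", "tls.key", nil))
}
```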

Multi-dimensional cluster alerting plays an important role in keeping our etcd stable, and has helped us catch problems in both user workloads and our own cluster components many times. A user-side example: a bug in an internal K8s platform wrote a large number of CRD resources, and client CRD read/write QPS was clearly too high. An example on our side: as cluster size grew, an old logging component made unreasonably frequent list-pod calls, pushing etcd cluster traffic up to 3 Gbps while the apiserver itself also returned 5XX errors.

Although the measures above greatly reduce stability problems caused by expensive reads, practice shows that we still rely on cluster alerts to help us locate abnormal client calls; we cannot yet automatically apply precise, intelligent rate limits to an abnormal client. The etcd layer cannot tell which client a call comes from, and rate limiting on the etcd side would wrongly kill requests from well-behaved clients, so fine-grained rate limiting has to happen in the apiserver. The community introduced API Priority and Fairness [7] in 1.18; it is currently alpha, and we hope this feature becomes stable as soon as possible.

Case Analysis of Performance Optimization

etcd read/write performance determines how large a cluster we can support and how many concurrent client calls we can handle; startup time determines how long it takes to serve again after a node restarts or rebuilds from a leader snapshot because it has fallen too far behind. In this section on performance optimization, we briefly describe how we cut startup time in half, improved password authentication performance 12x, and improved key-count query performance 3x.

Optimizing startup time, key-count queries and limited-record queries

With a db size of about 4 GB and a million keys, we found that restarting a cluster took as long as 5 minutes, and key-count queries timed out; after raising the timeout, the query still took as long as 21 seconds and memory spiked by 6 GB. Queries returning only a limited number of records were also slow and memory-hungry (for example, when a business uses etcd grpc-proxy to reduce the number of watchers, the proxy by default issues a limited read on the watch path when creating a watcher). So in my spare time over the weekend I dug into these questions: where does the startup time actually go? Is there room for optimization? Why are key-count queries so slow and memory-intensive?

With these questions in mind, I analyzed the source code in depth. First, the time and memory cost of counting keys and of returning only a specified number of records; the conclusions are as follows:

etcd's previous implementation of counting keys traversed the entire in-memory btree and stored each key's revision in a slice.

The problem is that with a large number of keys, growing the slice involves repeated data copies, and the slice itself needs a lot of memory.

The optimization is therefore to add a CountRevision that counts keys during traversal without using a slice at all (sketched after these conclusions). After the optimization, the query time dropped from 21s to 7s with no extra memory overhead.

For the time and memory cost of queries returning a specified number of records, analysis showed that the limit was not pushed down to the index layer. By pushing the query's limit parameter down to the index layer, limited queries on large datasets became a hundred times faster, with no extra memory overhead.
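A hedged sketch of the count-only idea, using the google/btree library (the same family etcd's in-memory index is built on); the keyIndex type here is a simplified stand-in, not etcd's actual structure:

```go
package main

import (
	"fmt"

	"github.com/google/btree"
)

// keyIndex is a simplified stand-in for an entry in etcd's in-memory index.
type keyIndex struct {
	key      string
	revision int64
}

func (a *keyIndex) Less(b btree.Item) bool { return a.key < b.(*keyIndex).key }

// countKeys walks the index and only counts, allocating nothing per key.
func countKeys(t *btree.BTree) (n int64) {
	t.Ascend(func(i btree.Item) bool { n++; return true })
	return n
}

// collectRevisions mirrors the old approach: materialize every revision in a
// slice, which repeatedly reallocates and copies when the key count is large.
func collectRevisions(t *btree.BTree) []int64 {
	revs := make([]int64, 0)
	t.Ascend(func(i btree.Item) bool {
		revs = append(revs, i.(*keyIndex).revision)
		return true
	})
	return revs
}

func main() {
	t := btree.New(32)
	for i := 0; i < 1000; i++ {
		t.ReplaceOrInsert(&keyIndex{key: fmt.Sprintf("/registry/pods/%06d", i), revision: int64(i + 1)})
	}
	fmt.Println("count only:", countKeys(t))
	fmt.Println("collected :", len(collectRevisions(t)))
}
```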

Turning to the excessive startup time, by adding logs to each stage of startup we drew the following conclusions:

The etcd process does not fully use the machine's CPU during startup.

About 9% of the time is spent opening the backend db, for example mmap-ing the entire db file into memory.

About 91% of the time is spent rebuilding the in-memory index btree. When etcd receives a Get Key request, the request passes down to the mvcc layer, which first looks up the key's revision in the in-memory index btree, then fetches the corresponding value from boltdb by revision, and returns it to the client. Rebuilding the in-memory index btree is exactly the reverse: it traverses boltdb from revision 0 up to the maximum revision, parses key, revision and other fields from each value, and rebuilds the btree. Because this is a serial operation, it is extremely time-consuming.
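A simplified sketch of that reverse traversal using the bbolt library; the bucket name mirrors etcd's "key" bucket, but the value decoding is omitted and the db path is a placeholder, so this is only an illustration of the serial scan, not etcd's real rebuild code:

```go
package main

import (
	"fmt"
	"log"

	bolt "go.etcd.io/bbolt"
)

// rebuildIndex walks every revision stored in the "key" bucket in order and
// hands each record to the index-building callback, mirroring the serial scan
// etcd does at startup to reconstruct the in-memory btree.
func rebuildIndex(path string, insert func(rev, value []byte)) error {
	db, err := bolt.Open(path, 0600, &bolt.Options{ReadOnly: true})
	if err != nil {
		return err
	}
	defer db.Close()

	return db.View(func(tx *bolt.Tx) error {
		b := tx.Bucket([]byte("key"))
		if b == nil {
			return fmt.Errorf("bucket %q not found", "key")
		}
		c := b.Cursor()
		// Keys in this bucket are revisions; iterating the cursor visits them
		// from the lowest revision to the highest.
		for k, v := c.First(); k != nil; k, v = c.Next() {
			insert(k, v)
		}
		return nil
	})
}

func main() {
	count := 0
	err := rebuildIndex("member/snap/db", func(rev, value []byte) {
		// In etcd this would decode the KeyValue and insert it into the btree;
		// here we only count records to keep the sketch self-contained.
		count++
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("records scanned:", count)
}
```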

We tried to parallelize the serial btree build into a highly concurrent one to use all the machine's compute power, but after compiling and testing the new version the effect was minimal. So we compiled another version that printed a detailed time breakdown of each stage of rebuilding the in-memory index, and found the bottleneck was inserting into the in-memory btree, which is protected by a global lock, leaving almost no room for optimization there.

Continuing to analyze the 91%, we found that the in-memory index was being rebuilt twice. The first rebuild exists only to obtain mvcc's consistent index, a key data structure that guarantees etcd commands are not applied twice; the data inconsistency bug mentioned earlier is also closely related to the consistent index.

The consistent index implementation was unreasonable: it was encapsulated inside the mvcc layer. I therefore submitted a PR to refactor this feature into a separate package that exposes methods for etcdserver, mvcc, auth, lease and other modules to call.

After the refactor, obtaining the consistent index at startup no longer requires rebuilding the in-memory index; it is fetched quickly through the new cindex package, which cut the total startup time from 5 minutes to 2 minutes 30 seconds. Because the optimization depends on the consistent index refactor and the change is large, it was not backported to 3.4/3.3; from version 3.5 onwards, clusters with large data volumes will see a significant improvement in startup time.

12-fold improvement in password authentication performance

An internal business service had been running well; one day, after a modest increase in clients, the production etcd cluster started timing out in large numbers. All kinds of attempts, such as switching cloud disk types, switching deployment environments and adjusting parameters, made no difference. After they asked us for help and we collected metrics and logs, the investigation produced the following conclusions:

The symptoms were genuinely strange: db-latency-related metrics showed nothing abnormal, and the logs held no useful information.

The business reported a large number of read timeouts, which we could even reproduce simply with the etcdctl client tool, yet the corresponding read-request metrics showed a count of 0.

We guided the user to enable trace logging and to switch metrics to extensive mode. There were still no trace logs after enabling tracing, but with extensive metrics enabled we found that all the time was being spent in the Authenticate interface, and the business confirmed that they use password authentication rather than certificate-based authentication.

We asked the business to temporarily turn off authentication to see whether things recovered. After they disabled authentication on one node, that node immediately returned to normal, so they chose to restore the production service for the time being by turning authentication off.

So why does authentication take so long? We added logs at the suspicious points and printed the time spent in each authentication step, and found timeouts while waiting for a lock. But why was the lock held so long? It turned out that, while holding the lock, the code calls the bcrypt hashing function to compute the password hash, which costs about 60ms per call, and the longest wait for the lock was more than 5s.

We therefore built a new version that shrinks the scope of the lock and reduces the time it blocks. After the user switched to the new version and re-enabled authentication, the business no longer timed out and returned to normal.
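A hedged sketch of the before/after shape of that fix, using the golang.org/x/crypto/bcrypt package; this illustrates shrinking the critical section, not etcd's actual auth store code:

```go
package main

import (
	"errors"
	"fmt"
	"sync"

	"golang.org/x/crypto/bcrypt"
)

type authStore struct {
	mu        sync.Mutex
	passwords map[string][]byte // user -> bcrypt hash
}

// authenticateSlow keeps the lock held across the ~60ms bcrypt comparison,
// so concurrent logins serialize behind it.
func (s *authStore) authenticateSlow(user, pass string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	hash, ok := s.passwords[user]
	if !ok {
		return errors.New("auth: user not found")
	}
	return bcrypt.CompareHashAndPassword(hash, []byte(pass))
}

// authenticateFast only holds the lock while reading shared state; the
// expensive bcrypt work happens outside the critical section.
func (s *authStore) authenticateFast(user, pass string) error {
	s.mu.Lock()
	hash, ok := s.passwords[user]
	s.mu.Unlock()
	if !ok {
		return errors.New("auth: user not found")
	}
	return bcrypt.CompareHashAndPassword(hash, []byte(pass))
}

func main() {
	hash, _ := bcrypt.GenerateFromPassword([]byte("secret"), bcrypt.DefaultCost)
	s := &authStore{passwords: map[string][]byte{"alice": hash}}
	fmt.Println(s.authenticateSlow("alice", "secret"))
	fmt.Println(s.authenticateFast("alice", "secret"))
}
```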

We then submitted the fix to the community and wrote a stress-test tool; the improved version is nearly 12 times faster (on an 8-core 32 GB machine, from about 18/s to about 202/s), though still slow, since the authentication path is dominated by the password hash computation, and other community users have also reported slow password authentication. The optimization is included in the latest v3.4.9, and performance can be improved further by tuning the bcrypt-cost parameter.
