2025-04-04 Update From: SLTechnology News&Howtos
This article explains the main methods of tuning etcd: time parameters, snapshots, disk priority, and network traffic.
Tuning
The default etcd settings are suited to networks with low latency. On high-latency networks, such as deployments spanning data centers, the heartbeat interval and election timeout need to be tuned.
Network slowness is not only a matter of latency; it can also be caused by slow disk I/O on the leader and followers. Each timeout should cover the full time from sending a request to receiving a successful response.
Time parameter
When a node stalls, slows down, or goes offline, the distributed consensus protocol relies on two time parameters to handle leadership changes.
The first parameter is the heartbeat interval. It is the frequency with which the leader notifies all followers that it is still the leader. As a best practice, it should be set based on the round-trip time (RTT) between nodes. The default heartbeat interval for etcd is 100ms.
The second parameter is the election timeout. It is how long a follower waits without receiving a heartbeat from the leader before attempting to become leader itself. The default election timeout for etcd is 1000ms.
Adjusting these parameters involves tradeoffs. It is recommended to set the heartbeat interval based on the maximum RTT between nodes, typically 0.5-1.5 times the RTT. If the heartbeat interval is too short, etcd sends unnecessary heartbeats and increases CPU and network usage. Conversely, a long heartbeat interval forces a longer election timeout, and an excessively long election timeout takes longer to detect a leader failure. The easiest way to measure RTT is with the ping tool.
The election timeout should be set based on the heartbeat interval and the average RTT between nodes. It should be at least 10 times the RTT to account for variance in network latency. For example, if the RTT between nodes is 10ms, the election timeout should be at least 100ms.
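The rules of thumb above (heartbeat interval roughly equal to the RTT, election timeout at least 10x the RTT) can be sketched as a small shell calculation. The RTT value here is a placeholder; measure your own with ping between the etcd peers:

```shell
# Derive suggested etcd timing settings from a measured peer RTT (in ms).
# RTT=50 is a placeholder; measure yours with: ping -c 5 <peer-host>
RTT=50
HEARTBEAT_INTERVAL=$RTT           # 0.5-1.5x RTT; 1x used here
ELECTION_TIMEOUT=$((RTT * 10))    # at least 10x RTT
echo "--heartbeat-interval=${HEARTBEAT_INTERVAL} --election-timeout=${ELECTION_TIMEOUT}"
```

With an RTT of 50ms this prints `--heartbeat-interval=50 --election-timeout=500`, well under the 50s ceiling discussed below.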
The upper limit on the election timeout is 50000ms (50s), which should be used only for globally distributed etcd deployments. A round trip within the continental United States takes about 130ms; between the United States and Japan, roughly 350-400ms. On networks with uneven performance or regular delay and loss, delivering a packet may take multiple retries, so 5s is a safe upper bound on a single global round trip. With the election timeout set to about 10 times that upper bound, 50s becomes the maximum.
All nodes in a cluster should set the same heartbeat interval and election timeout. If the settings are different, the cluster may be unstable.
The defaults can be overridden on the command line or via environment variables, in milliseconds.
# Command line parameters:
$ etcd --heartbeat-interval=100 --election-timeout=500
# Environment parameters:
$ ETCD_HEARTBEAT_INTERVAL=100 ETCD_ELECTION_TIMEOUT=500 etcd
Snapshots
etcd appends key changes to a log. Each entry in this log records a change to a key, and the log grows continuously. With light usage the growth is not a problem, but in heavily used clusters the log becomes large.
To avoid a huge log, etcd takes snapshots on a regular basis. A snapshot compacts the log by persisting the current state and discarding the old entries.
Snapshot optimization
Creating a snapshot is expensive for the v2 backend, so snapshots are taken only after a certain number of changes have been recorded. By default, a snapshot is taken every 10000 changes. If etcd's memory and disk usage are too high, you can lower this threshold.
# Command line arguments:
$ etcd --snapshot-count=5000
# Environment variables:
$ ETCD_SNAPSHOT_COUNT=5000 etcd
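For etcd instances managed by systemd, the environment variable can be persisted in a drop-in file. The unit name `etcd.service` and the local demo path are assumptions; on a real host the drop-in belongs under `/etc/systemd/system/etcd.service.d/` followed by `systemctl daemon-reload`:

```shell
# Write a hypothetical systemd drop-in that sets the snapshot threshold.
# A local directory is used here for demonstration; a real system would use
# /etc/systemd/system/etcd.service.d/ instead.
mkdir -p ./etcd.service.d
cat > ./etcd.service.d/snapshot.conf <<'EOF'
[Service]
Environment="ETCD_SNAPSHOT_COUNT=5000"
EOF
cat ./etcd.service.d/snapshot.conf
```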
Disk
etcd clusters are very sensitive to disk latency. Because etcd must persist its change log to disk, other processes competing for the disk can cause long fsync latencies. This may cause etcd to miss heartbeats, time out requests, or temporarily lose its leader. It can be mitigated by raising the disk priority of the etcd process.
On Linux, the disk priority of etcd can be raised with ionice:
# best effort, highest priority
$ sudo ionice -c2 -n0 -p `pgrep etcd`
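Before adjusting priorities, it can help to get a rough sense of synchronous write latency on the disk etcd uses. This dd invocation is only a crude stand-in for an fsync benchmark and assumes GNU dd on Linux (`oflag=dsync` is not portable everywhere); run it in the etcd data directory:

```shell
# Write 1000 x 512-byte blocks, syncing each write to disk (GNU dd, Linux).
# The reported rate is a rough proxy for the disk's sync-write latency.
dd if=/dev/zero of=./dd-sync-test bs=512 count=1000 oflag=dsync
ls -l ./dd-sync-test
```

If each synced write takes more than a few milliseconds, etcd's WAL fsyncs will likely be slow as well; remove `./dd-sync-test` when done.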
Network
If etcd's leader serves a large number of concurrent client requests, follower requests can be delayed by network congestion. You may see errors like the following about full send buffers in the follower's logs:
Dropped MsgProp to 247ae21ff9436b2d since streamMsg's sending buffer is full
Dropped MsgAppResp to 247ae21ff9436b2d since streamMsg's sending buffer is full
These errors can be mitigated by giving etcd's peer traffic higher network priority than its client traffic, which improves the leader's responsiveness. On Linux, this can be done with traffic control (tc):
tc qdisc add dev eth0 root handle 1: prio bands 3
tc filter add dev eth0 parent 1: protocol ip prio 1 u32 match ip sport 2380 0xffff flowid 1:1
tc filter add dev eth0 parent 1: protocol ip prio 1 u32 match ip dport 2380 0xffff flowid 1:1
tc filter add dev eth0 parent 1: protocol ip prio 2 u32 match ip sport 2379 0xffff flowid 1:1
tc filter add dev eth0 parent 1: protocol ip prio 2 u32 match ip dport 2379 0xffff flowid 1:1