
How Envoy maps connections to threads



One of the most common technical questions I get about Envoy is a request for a low-level description of the threading model it uses. This post describes how Envoy maps connections to threads, as well as the thread local storage (TLS) system used internally to keep the code extremely parallel and highly performant.

Threading overview

Envoy uses three different types of threads:

Main: this thread owns server startup and shutdown, all xDS API handling (including DNS, health checking, and general cluster management), runtime, stat flushing, admin, and general process management (signals, hot restart, etc.). Everything that happens on this thread is asynchronous and "non-blocking." In general, the main thread coordinates all critical process functionality that does not require a large amount of CPU to accomplish. This allows the majority of management code to be written as if it were single threaded.

Worker: by default, Envoy spawns one worker thread for every hardware thread in the system. (This is controllable via the --concurrency option.) Each worker thread runs a "non-blocking" event loop that is responsible for listening on every listener (there is currently no listener sharding), accepting new connections, instantiating a filter stack for each connection, and processing all IO for the lifetime of the connection. Again, this allows the majority of connection handling code to be written as if it were single threaded.

File flusher: every file that Envoy writes (primarily access logs) currently has an independent blocking flush thread. This is because writing to filesystem-cached files can sometimes block, even when using O_NONBLOCK (sigh). When worker threads need to write to a file, the data is actually moved into an in-memory buffer, where it is eventually flushed by the file flush thread. This is one area of the code in which technically all workers can block on the same lock while trying to fill the memory buffer. There are a few others that will be discussed further below.
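To make that pattern concrete, here is a minimal sketch of "workers append to a lock-protected memory buffer, a dedicated thread flushes it to disk." The class name, flush interval, and structure are illustrative assumptions, not Envoy's actual code.

```cpp
#include <atomic>
#include <chrono>
#include <fstream>
#include <mutex>
#include <string>
#include <thread>

// Hypothetical sketch of the file flush pattern described above.
class AccessLogFile {
public:
  explicit AccessLogFile(const std::string& path)
      : file_(path, std::ios::app), flusher_([this] { flushLoop(); }) {}

  ~AccessLogFile() {
    running_ = false;
    flusher_.join();
  }

  // Called from worker threads: only appends to the in-memory buffer under a
  // short-lived lock; it never performs file IO on the worker thread.
  void write(const std::string& entry) {
    std::lock_guard<std::mutex> lock(mutex_);
    buffer_ += entry;
  }

private:
  // Runs on the dedicated flush thread; file IO here may block, but no worker
  // event loop is ever waiting on it.
  void flushLoop() {
    while (running_) {
      std::this_thread::sleep_for(std::chrono::milliseconds(100));
      std::string to_flush;
      {
        std::lock_guard<std::mutex> lock(mutex_);
        to_flush.swap(buffer_);
      }
      file_ << to_flush;
      file_.flush();
    }
  }

  std::ofstream file_;
  std::mutex mutex_;
  std::string buffer_;
  std::atomic<bool> running_{true};
  std::thread flusher_;
};
```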

Connection processing

As discussed above, all worker threads listen on all listeners without any sharding. Thus, the kernel is used to intelligently dispatch accepted sockets to worker threads. Modern kernels are in general very good at this; they use features such as IO priority boosting to attempt to fill a thread's work before starting to use other threads that are also listening on the same socket, as well as not using a single spin lock to process each accept.
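A deliberately simplified sketch of this idea follows: several threads all waiting on the same listening socket, with the kernel deciding which one receives each new connection. Envoy itself uses non-blocking event loops rather than blocking accept(), so this is an assumption-laden illustration of the kernel-side balancing only; port and worker count are arbitrary.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#include <cstdio>
#include <thread>
#include <vector>

int main() {
  int listen_fd = socket(AF_INET, SOCK_STREAM, 0);
  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_ANY);
  addr.sin_port = htons(8080);
  bind(listen_fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
  listen(listen_fd, SOMAXCONN);

  const unsigned num_workers = std::thread::hardware_concurrency();
  std::vector<std::thread> workers;
  for (unsigned i = 0; i < num_workers; ++i) {
    workers.emplace_back([listen_fd, i] {
      while (true) {
        // All workers wait on the same listening socket; the kernel chooses
        // which thread is handed each incoming connection.
        int conn_fd = accept(listen_fd, nullptr, nullptr);
        if (conn_fd < 0) {
          continue;
        }
        std::printf("worker %u accepted fd %d\n", i, conn_fd);
        close(conn_fd); // A real proxy would now own this connection's full lifecycle.
      }
    });
  }
  for (auto& w : workers) {
    w.join();
  }
}
```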

Once a connection is accepted on a worker, it never leaves that worker. All further connection handling is processed entirely within the worker thread, including any forwarding behavior. This has a few important implications:

All connection pools in Envoy are per worker thread. So, although an HTTP/2 connection pool only makes a single connection to each upstream host at a time, if there are four workers, there will be four HTTP/2 connections per upstream host at steady state.

The reason Envoy works this way is that by keeping everything within a single worker thread, almost all code can be written without locks, as if it were single threaded. This design makes most code easier to write and scales incredibly well to an almost unlimited number of workers.

One major gotcha, however, is that tuning the --concurrency option is actually very important from a memory and connection pool efficiency standpoint. Having more workers than needed wastes memory, creates more idle connections, and leads to a lower connection pool hit rate. At Lyft, our sidecar Envoys run with very low concurrency so that their performance roughly matches the services they sit next to. We only run our edge Envoys at maximum concurrency.
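As a rough illustration with made-up numbers: an Envoy running with --concurrency 32 that routes to 50 upstream hosts over HTTP/2 would hold on the order of 32 × 50 = 1,600 upstream connections at steady state, while the same workload at --concurrency 4 would hold only 4 × 50 = 200, with each worker's pool seeing proportionally more traffic and therefore a better hit rate.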

What does "non-blocking" mean?

So far, the term "non-blocking" has been used several times when discussing how the main and worker threads operate. All code is written on the assumption that nothing ever blocks. However, this is not entirely true (is anything ever entirely true?). Envoy does employ a few process-wide locks:

As already discussed, if access logs are being written, all workers acquire the same lock before filling the in-memory access log buffer. Lock hold time should be very low, but it is possible for this lock to become contended at high concurrency and high throughput.

Envoy employs a very sophisticated system for handling thread-local statistics. That will be the topic of a separate post. However, I will briefly mention that as part of thread-local stat handling, it is sometimes necessary to acquire a lock on the central "stat store." This lock should not ever be highly contended.

The main thread periodically needs to coordinate with all worker threads. This is done by "posting" from the main thread to the worker threads (and sometimes from a worker thread back to the main thread). Posting requires taking a lock so that the posted message can be put into a queue for later delivery (a minimal sketch of this mechanism follows this list). These locks should never be highly contended, but they can still technically block.

When Envoy logs itself to standard error, it acquires a process-wide lock. In general, Envoy's local logging is considered terrible for performance anyway, so not a lot of thought has been given to improving this.

There are a few other random locks, but none of them are in performance-critical paths and they should never be contended.
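Here is a minimal sketch of the cross-thread "posting" mechanism mentioned above: the sender takes a short-lived lock only to enqueue a closure, and the target thread's event loop later drains the queue. The class and method names are illustrative assumptions, not Envoy's real API, and a real implementation would also wake the target event loop (e.g. via an eventfd or pipe it is polling).

```cpp
#include <functional>
#include <mutex>
#include <vector>

// Hypothetical sketch of cross-thread posting.
class Dispatcher {
public:
  // Callable from any thread, e.g. the main thread posting to a worker.
  void post(std::function<void()> callback) {
    std::lock_guard<std::mutex> lock(mutex_);
    pending_.push_back(std::move(callback));
    // A real implementation would also wake the target event loop here.
  }

  // Called on the owning thread as part of its event loop iteration.
  void runPostedCallbacks() {
    std::vector<std::function<void()>> to_run;
    {
      std::lock_guard<std::mutex> lock(mutex_);
      to_run.swap(pending_); // hold the lock only long enough to swap
    }
    for (auto& cb : to_run) {
      cb();
    }
  }

private:
  std::mutex mutex_;
  std::vector<std::function<void()>> pending_;
};
```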

Thread local storage

Because Envoy splits main thread responsibilities from worker thread responsibilities, complex processing needs to be done on the main thread and then made available to each worker thread in a highly concurrent way. This section describes Envoy's thread local storage (TLS) system at a high level.

As already described, the main thread handles essentially all management / control plane functionality in the Envoy process. ("Control plane" is a bit of an overloaded term here, but within the Envoy process itself, and compared to the forwarding that the workers do, it seems appropriate.) It is a common pattern for the main thread to do some work and then need to update each worker thread with the result of that work, without the worker threads having to acquire a lock on every access.

Envoy's TLS system works as follows:

Code running on the main thread can allocate a process-wide TLS slot. Although abstracted, in practice this is an index into a vector that allows O(1) access.

The main thread can set arbitrary data into its slot. When this is done, the data is posted into each worker as a normal event loop event.

Worker threads can read from their TLS slot and will retrieve whatever thread-local data is available there.

Although very simple, this is an extremely powerful paradigm that is very similar to the concept of RCU locking. In essence, worker threads never see any change to the data in a TLS slot while they are doing work; change only happens during the quiescent period between work events. Envoy uses this in two different ways (a minimal sketch of the pattern follows this list):

By storing varied per-worker data that is accessed without any locking.

By storing a shared pointer to read-only global data per worker. Thus, each worker holds a reference count to the data that cannot be decremented while work is being performed. Only when all workers have quiesced and loaded new shared data is the old data destroyed. This is identical to RCU.
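The following sketch shows the slot pattern just described. The types and function names are invented for illustration and do not reflect Envoy's real ThreadLocal API; in the real system the per-worker assignment happens via posting to each worker's event loop.

```cpp
#include <memory>
#include <string>
#include <vector>

// Each worker owns one of these; the vector is indexed by slot number, so a
// slot allocated on the main thread maps to the same index on every worker.
struct WorkerSlots {
  std::vector<std::shared_ptr<const std::string>> slots;
};

// Main-thread side: install a new read-only value into a slot on every worker.
// In the real system this assignment is "posted" so that it runs on each
// worker's own event loop, between work events; readers never see a torn update.
void setSlotOnAllWorkers(std::vector<WorkerSlots>& workers, size_t slot_index,
                         std::shared_ptr<const std::string> new_value) {
  for (WorkerSlots& worker : workers) {
    worker.slots[slot_index] = new_value; // conceptually runs on that worker's thread
  }
}

// Worker-thread side: O(1) indexed read with no locking. Holding the returned
// shared_ptr keeps the old snapshot alive (RCU-style) even if the main thread
// publishes a replacement while this worker is still using it.
std::shared_ptr<const std::string> getSlot(const WorkerSlots& worker, size_t slot_index) {
  return worker.slots[slot_index];
}
```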

Cluster update threading

In this section I will describe how TLS is used for cluster management. Cluster management includes xDS API handling and/or DNS as well as health checking.

The overall flow involves the following components and steps:

The cluster manager is the component within Envoy that manages all known upstream clusters, the CDS API, the SDS/EDS APIs, DNS, and active (out-of-band) health checking. It is responsible for creating an eventually consistent view of each upstream cluster, including the discovered hosts and their health status.

The health checker performs active health checking and reports health state changes back to the cluster manager.

CDS / SDS / EDS / DNS are performed to determine cluster membership. State changes are reported back to the cluster manager.

Each worker thread is constantly running an event loop.

When the cluster manager determines that the state of a cluster has changed, it creates a new read-only snapshot of the cluster state and posts it to every worker thread.

During the next quiescent period, the worker thread updates the snapshot in the allocated TLS slot.

During an IO event in which a host needs to be selected for load balancing, the load balancer queries the TLS slot for host information. No locks are acquired to do this. (Note also that TLS can fire events on update so that load balancers and other components can recompute caches, data structures, and so on. This is beyond the scope of this post, but it is used in various places in the code.)
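Below is a hedged sketch of that flow with invented types (not Envoy's real classes): the cluster manager builds an immutable snapshot of cluster membership and publishes it via the TLS mechanism above, and each worker's load balancer reads its own copy with no locks.

```cpp
#include <cstdint>
#include <memory>
#include <string>
#include <vector>

// Immutable, read-only view of a cluster's healthy hosts.
struct HostSnapshot {
  std::vector<std::string> healthy_hosts;
};

// On the main thread: membership or health has changed, so build a fresh
// read-only snapshot. The old snapshot is never mutated, only replaced.
std::shared_ptr<const HostSnapshot> buildSnapshot(std::vector<std::string> healthy_hosts) {
  auto snapshot = std::make_shared<HostSnapshot>();
  snapshot->healthy_hosts = std::move(healthy_hosts);
  return snapshot;
}

// On a worker thread, inside an IO event: pick a host from the worker's own
// TLS copy of the snapshot. No lock is taken; round-robin state is per worker.
const std::string* pickHost(const HostSnapshot& snapshot, uint64_t& rr_index) {
  if (snapshot.healthy_hosts.empty()) {
    return nullptr;
  }
  return &snapshot.healthy_hosts[rr_index++ % snapshot.healthy_hosts.size()];
}
```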

Using the procedure just described, Envoy is able to process every request without taking any locks (other than those described earlier). Aside from the complexity of the TLS code itself, most code does not need to understand how threading works and can be written as if it were single threaded. This makes most code easier to write in addition to yielding excellent performance.

Other subsystems that use TLS

TLS and RCU are widely used in Envoy. Other examples include:

Runtime (feature flag) override lookup: the current feature flag override map is computed on the main thread. A read-only snapshot is then provided to each worker using RCU semantics.

Route table swapping: for route tables provided by RDS, the route table is instantiated on the main thread. A read-only snapshot is then provided to each worker using RCU semantics. This makes route table swaps effectively atomic.

HTTP date header caching: it turns out that computing the HTTP date header on every request (when each core is doing ~25K+ RPS) is quite expensive. Envoy computes the date header roughly every half second and provides it to each worker via TLS and RCU (see the sketch after this list).

There are other cases, but the previous examples should give a good sense of the kinds of things TLS is used for.
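As an illustration of the date header caching idea, here is a sketch with invented names; an atomic shared_ptr swap stands in for the TLS/RCU publication described above, and the refresh interval is the roughly half-second cadence mentioned in the text.

```cpp
#include <atomic>
#include <ctime>
#include <memory>
#include <string>

class CachedDateHeader {
public:
  // Called roughly every 500ms from a main-thread timer: format the header
  // once and publish a new read-only copy.
  void refresh() {
    char buf[64];
    std::time_t now = std::time(nullptr);
    std::tm tm_utc{};
    gmtime_r(&now, &tm_utc);
    std::strftime(buf, sizeof(buf), "%a, %d %b %Y %H:%M:%S GMT", &tm_utc);
    std::atomic_store(&current_, std::make_shared<const std::string>(buf));
  }

  // Called on workers for every response: grabs the current snapshot without
  // any formatting work or locking.
  std::shared_ptr<const std::string> get() const { return std::atomic_load(&current_); }

private:
  std::shared_ptr<const std::string> current_{std::make_shared<const std::string>("")};
};
```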

Known performance traps

Although Envoy performs quite well overall, there are a few known areas that need attention when it is used at very high concurrency and throughput:

As already described in this post, all workers currently acquire a lock when writing to the access log's memory buffer. At high concurrency and high throughput, it will be necessary to batch the access logs per worker, at the cost of out-of-order delivery when writing to the final file. Alternatively, access logs could become per worker thread.

Although stats are very heavily optimized, at very high concurrency and throughput there will likely be atomic contention on individual stats. The solution to this is per-worker counters that are flushed periodically to the central counters (a minimal sketch follows this list). This will be discussed in a follow-up post.

The existing architecture will not work well if Envoy is deployed in a scenario in which there are very few connections that each require substantial resources to handle. This is because there is no guarantee that connections will be spread evenly between workers. This can be solved by implementing worker connection balancing, in which a worker is able to forward a connection to another worker for handling.
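The following sketch shows the per-worker counter mitigation mentioned above, using invented names: each worker increments a plain, uncontended local counter on the hot path and only periodically folds the accumulated delta into a shared atomic counter, which is the only point where contention can occur.

```cpp
#include <atomic>
#include <cstdint>

struct CentralCounter {
  std::atomic<uint64_t> value{0};
};

struct WorkerLocalCounter {
  uint64_t pending{0}; // touched only by the owning worker thread

  // Hot path: no atomic operation, no cross-core contention.
  void inc() { ++pending; }

  // Called occasionally (e.g. on a timer or between work events) to publish
  // the accumulated count to the shared counter.
  void flush(CentralCounter& central) {
    if (pending != 0) {
      central.value.fetch_add(pending, std::memory_order_relaxed);
      pending = 0;
    }
  }
};
```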

Envoy's threading model is designed to favor programming simplicity and massive parallelism, at the cost of potentially wasteful memory and connection usage if not tuned correctly. This model allows it to perform very well at very high worker counts and throughput.

As I briefly mentioned on Twitter, the design is also amenable to running on top of a full user-mode networking stack such as DPDK, which could lead to commodity servers handling millions of requests per second while doing full L7 processing. It will be very interesting to see what gets built over the next few years.
