Futex (Fast Userspace muTex) has been around for a long time as a fast synchronization (mutex) mechanism under Linux (since Linux 2.5.7). What are its advantages? What functionality does it provide?
Before the birth of futex
Before futex was born, synchronization mechanisms under Linux fell into two categories: user-mode mechanisms and kernel mechanisms. User-mode synchronization was basically the spinlock, implemented with atomic instructions. The simplest implementation uses an integer, with 0 meaning unlocked and 1 meaning locked. The trylock operation uses an atomic instruction to try to change 0 to 1:
    bool trylock(int &lockval) {
        int old;
        atomic { old = lockval; lockval = 1; }   // e.g. the xchg instruction on x86
        return old == 0;
    }
Whether or not the spinlock was already held, it is definitely locked after trylock, so the lock variable must be set to 1. Whether trylock succeeds depends on whether the spinlock was locked beforehand (old == 1) or was locked by this very trylock (old == 0). The atomic instruction prevents several processes from all seeing old == 0 at the same time and each believing it was the one that changed it to 1.
The spinlock's lock operation is then just a loop that keeps retrying trylock until it succeeds, as in the sketch below.
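As a concrete rendering of the pseudocode, here is a minimal sketch using C11 atomics (atomic_exchange typically compiles to xchg on x86); the function names are this article's, not any library's:

    #include <stdatomic.h>
    #include <stdbool.h>

    bool trylock(atomic_int *lockval) {
        return atomic_exchange(lockval, 1) == 0;  // old value 0 means we took the lock
    }

    void lock(atomic_int *lockval) {
        while (!trylock(lockval))
            ;                                     // busy-wait (spin)
    }

    void unlock(atomic_int *lockval) {
        atomic_store(lockval, 0);
    }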
For very small critical sections, spinlock is very efficient: when trylock fails, the thread (process) holding the lock can be expected to leave the critical section (release the lock) quickly, so a busy-waiting loop is likely cheaper than suspending the process and waiting.
However, spinlock's applicable scenarios are limited. Busy-waiting is a scary prospect for large critical sections, especially when the synchronization mechanism is used to wait for an event (such as a server worker thread waiting for a client request). So in many cases a process needs to suspend and wait.
The synchronization mechanisms provided by the kernel, such as the semaphore, are in essence also spinlocks implemented with atomic instructions, on top of which the kernel implements putting processes to sleep and waking them up.
Locks of this kind support suspend-and-wait well, but their biggest drawback is that every lock and unlock is a system call: even when there is no contention, the lock state can only be determined after entering the kernel through the system call. (On the cost of system calls, see: "Looking at the time cost of system calls from read".)
The ideal synchronization mechanism would resolve the no-contention case with atomic instructions in user mode, and fall back on the kernel's system calls for sleeping and waking only when it actually has to wait. In other words: can a user-mode spinlock suspend the process when trylock fails, and have the lock holder wake it up during unlock?
If you haven't thought about it deeply, you might take it for granted that it looks something like this:
    void lock(int &lockval) {
        while (!trylock(lockval)) {
            wait();   // e.g. raise(SIGSTOP)
        }
    }
But if you do this, there is a window between the trylock that checks the lock and the wait that suspends the process. If the lock changes within that window (for example, its holder releases it), the caller enters a pointless wait, and once it is waiting, nobody will ever wake it up. (For details, see the discussion in "Linux thread synchronization analysis".)
Before futex was born, implementing our ideal lock was rather awkward. For example, consider using the sigsuspend system call to suspend the process:
    class mutex {
    private:
        int lockval;
        spinlocked_set waiters;   // a set protected by a spinlock

    public:
        void lock() {
            pid_t mypid = getpid();
            waiters.insert(mypid);            // first add yourself to the mutex's wait queue
            while (!trylock(lockval)) {       // then try to lock
                // at process initialization SIGUSR1 must be masked;
                // sigsuspend atomically unmasks it and waits
                sigsuspend(MASK_WITHOUT_SIGUSR1);
            }
            waiters.remove(mypid);            // after locking succeeds, remove yourself from the wait queue
        }

        void unlock() {
            lockval = 0;                      // release the lock first
            pid_t waiter = waiters.first();   // then check the wait queue
            if (waiter != 0) {                // if someone is waiting, send SIGUSR1 to wake it
                kill(waiter, SIGUSR1);
            }
        }
    };
Note that sigsuspend here differs from a naive wait such as raise(SIGSTOP): even if the kill used for wake-up in unlock happens before sigsuspend, sigsuspend is still woken, because the masked signal stays pending until sigsuspend unmasks it. (For details, see the discussion in "Linux thread synchronization analysis".)
This implementation is somewhat like old versions of pthread_cond and should still work, but some things about it feel off. For instance, the sigsuspend system call is global and is not tied to one particular lock: lockA's unlock can wake a process that is waiting for lockB. Since the process retries trylock after waking, correctness is not affected; and in most cases lockA.unlock will not try to wake a process waiting for lockB anyway (barring some races), because the latter is normally not in lockA's wait queue.
The wait queue implemented in user mode is not great either. It knows nothing about process life cycles: a process may die while its pid stays in the queue (and after a while an unrelated process may even reuse that pid and receive an inexplicable signal). So if unlock signals only one process in the queue, it may well fail to wake any real waiter. The only safe approach is to wake everyone, producing a "thundering herd". That said, this does not matter if the lock is used only among the threads of one process: a thread cannot die alone (if a thread dies, the whole process dies), and a thread that exits voluntarily can take care to clean up the wait queue before exiting.
Futex is here.
In hindsight, to implement the lock we want, the kernel needs to provide two things: 1. sleep and wake-up operations at the granularity of a single lock; 2. management of the wait queue while processes are suspended.
Thus futex was born. Futex mainly provides two operations, futex_wait and futex_wake:
    // suspend and wait on the lock variable pointed to by uaddr (only if *uaddr == val)
    int futex_wait(int *uaddr, int val);
    // wake up n processes suspended on the lock variable pointed to by uaddr
    int futex_wake(int *uaddr, int n);
The kernel dynamically maintains a wait queue associated with the lock variable that uaddr points to.
Note the second parameter of futex_wait. There is a window between the user-mode trylock and the futex_wait call, during which *uaddr may change (for example, someone happens to unlock). So user mode passes the value of *uaddr that it saw as the second parameter, and futex_wait must check whether the value has changed before actually suspending the process; the check and the suspension must sit in the same critical section. (See the discussion in "Linux thread synchronization analysis".) If futex_wait sees that the value has changed, it returns immediately and user mode retries trylock.
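Putting futex_wait and futex_wake together, here is a minimal sketch of a futex-based mutex, assuming Linux with C11 atomics and the raw syscall (glibc provides no futex() wrapper). It uses the 0/1/2 convention that appears again later in this article (0: unlocked, 1: locked, 2: locked with waiters):

    #include <linux/futex.h>
    #include <stdatomic.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static long sys_futex_wait(atomic_int *uaddr, int val) {
        return syscall(SYS_futex, uaddr, FUTEX_WAIT, val, NULL, NULL, 0);
    }
    static long sys_futex_wake(atomic_int *uaddr, int n) {
        return syscall(SYS_futex, uaddr, FUTEX_WAKE, n, NULL, NULL, 0);
    }

    void lock(atomic_int *lockval) {
        int c = 0;
        if (atomic_compare_exchange_strong(lockval, &c, 1))
            return;                            // fast path: took the lock, no syscall
        if (c != 2)
            c = atomic_exchange(lockval, 2);   // mark the lock as contended
        while (c != 0) {
            sys_futex_wait(lockval, 2);        // kernel rechecks *lockval == 2 before sleeping
            c = atomic_exchange(lockval, 2);
        }
    }

    void unlock(atomic_int *lockval) {
        if (atomic_exchange(lockval, 0) == 2)  // enter the kernel only if waiters may exist
            sys_futex_wake(lockval, 1);
    }

Without contention, lock and unlock stay entirely in user mode; only the "contended" value 2 ever forces a system call.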
Futex implements wait queues at the granularity of a lock, and they need no prior declaration to the kernel: whenever a user-mode futex_wait passes in some uaddr, the kernel maintains the corresponding wait queue.
This sounds complicated but is actually simple. The kernel does not really keep a separate queue per uaddr; futex maintains one overall queue holding all suspended processes, and each node in the queue records which uaddr its process is waiting on. When user mode calls futex_wake, the kernel just walks the wait queue and wakes processes whose node carries the same uaddr.
As an optimization, the wait queue futex maintains actually consists of a number of linked lists, each with its own spinlock. A process suspended by futex_wait is hashed by uaddr onto one particular list. This both spreads contention on the wait queue and shortens each individual list, making futex_wake's search cheaper. Each list's spinlock protects the "*uaddr == val comparison" and the "enqueueing of the process" within one critical section, roughly as modeled below.
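As a toy model of that layout (not the kernel's real futex_hash_bucket/futex_q structures, just the idea):

    #include <stdint.h>

    #define NBUCKETS 256

    struct waiter {                   // one suspended process
        void          *uaddr;         // which futex word it is waiting on
        int            pid;
        struct waiter *next;
    };

    struct bucket {
        // in the kernel each bucket's spinlock protects the "*uaddr == val"
        // check together with the enqueue, in one critical section
        struct waiter *head;
    };

    static struct bucket buckets[NBUCKETS];

    static struct bucket *bucket_of(void *uaddr) {
        // hash the address so waiters spread over several short lists
        return &buckets[((uintptr_t)uaddr >> 2) % NBUCKETS];
    }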
Another problem is how the uaddr parameters are compared. Futex supports multiple processes, so it must deal with the fact that the same physical memory cell can have different virtual addresses in different processes. Deciding whether the uaddr values passed in by different processes are "equal" is therefore not a simple numeric comparison: equal uaddr values do not necessarily mean the same memory, and vice versa.
There are only two ways for two processes (or threads) to share memory: file mapping (mapping a real file or a memory-backed file such as ipc shmem; related processes that share memory through a MAP_SHARED anonymous mapping also end up here, see below) and anonymous mapping (for example the threads of one process, which share the entire address space).
So futex must support uaddr comparison for both cases. Under anonymous mapping, it compares the address space (mm) that uaddr belongs to together with the value of uaddr; under file mapping, it compares the inode of the file backing uaddr together with uaddr's offset within that file. Note the special sharing method mentioned above: related processes sharing memory through a MAP_SHARED anonymous mapping. On the surface this is an anonymous mapping, but the kernel quietly converts it into a file mapping of the special file /dev/zero; otherwise, since every process has its own address space, such uaddr values could never be considered equal by futex.
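As a toy illustration of the two kinds of keys (the kernel's real struct futex_key differs in layout and detail):

    #include <stdint.h>

    struct futex_key_model {
        int shared;                   // file-backed (shared) or anonymous (private)?
        union {
            struct {                  // private: same address space required
                uintptr_t mm;         // identifies the address space
                uintptr_t uaddr;      // the virtual address itself
            } priv;
            struct {                  // shared: identified independently of any
                                      // single process's address space
                uintptr_t inode;      // identifies the backing file
                uintptr_t offset;     // offset of the futex word within that file
            } shrd;
        } k;
    };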
Futex and its siblings
futex_wait and futex_wake are the basics of futex. Later, to optimize other synchronization primitives, futex gained several variants.
The wait-style futex calls generally accept a timeout parameter and support waking on timeout (sketched below); that logic is fairly independent and is not expanded on in this article.
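For completeness, a timed wait might look like this (for plain FUTEX_WAIT the timeout is relative; FUTEX_WAIT_BITSET takes an absolute one):

    #include <errno.h>
    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <time.h>
    #include <unistd.h>

    // returns 1 if the wait timed out, 0 otherwise
    int futex_wait_timed(int *uaddr, int val, long ms) {
        struct timespec ts = { .tv_sec  = ms / 1000,
                               .tv_nsec = (ms % 1000) * 1000000L };
        long r = syscall(SYS_futex, uaddr, FUTEX_WAIT, val, &ts, NULL, 0);
        return (r == -1 && errno == ETIMEDOUT);
    }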
Bitset series

    int futex_wait_bitset(int *uaddr, int val, int bitset);
    int futex_wake_bitset(int *uaddr, int n, int bitset);
An extra bitset parameter is passed: a process that waits with a particular bitset is woken only by a wake call whose bitset overlaps it (the kernel checks that the two bitsets share at least one bit).
This works nicely for read-write locks: when a process suspends, the bitset marks whether it is waiting to read or to write, and unlock can then decide whether to wake one write waiter or all read waiters (see the sketch below).
Without the bitset facility, either unlock wakes waiters without distinguishing read waits from write waits, or you must create two uaddr words, one for readers and one for writers, and protect the coordination between them with a spinlock.
(see: http://locklessinc.com/articles/sleeping_rwlocks/)
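A sketch of the tagged waits, with hypothetical tag bits (the FUTEX_WAIT_BITSET/FUTEX_WAKE_BITSET ops are real; the rwlock protocol around them is not spelled out here):

    #include <limits.h>
    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #define BIT_READERS 0x1   // hypothetical tags for one futex word
    #define BIT_WRITERS 0x2

    static long wait_tagged(int *uaddr, int val, int bits) {
        // the last argument (val3) carries the bitset; the timeout is NULL
        return syscall(SYS_futex, uaddr, FUTEX_WAIT_BITSET, val, NULL, NULL, bits);
    }
    static long wake_tagged(int *uaddr, int n, int bits) {
        return syscall(SYS_futex, uaddr, FUTEX_WAKE_BITSET, n, NULL, NULL, bits);
    }

    // a reader suspends with:    wait_tagged(&lockval, seen, BIT_READERS);
    // a writer suspends with:    wait_tagged(&lockval, seen, BIT_WRITERS);
    // unlock wakes one writer:   wake_tagged(&lockval, 1, BIT_WRITERS);
    // ... or all readers:        wake_tagged(&lockval, INT_MAX, BIT_READERS);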
Requeue series

    int futex_requeue(int *uaddr, int n, int *uaddr2, int n2);
    int futex_cmp_requeue(int *uaddr, int n, int *uaddr2, int n2, int val);
These work somewhat like futex_wake, but instead of merely waking n processes waiting on uaddr, they go one step further: they also move up to n2 processes waiting on uaddr over to the wait queue of uaddr2 (equivalent to futex_wake-ing them and forcing them to futex_wait on uaddr2, except that they are moved without being woken).
On top of futex_requeue, futex_cmp_requeue adds a check: the operation is performed only if *uaddr equals val; otherwise it returns immediately and lets user mode retry.
This exists for the sake of pthread_cond_broadcast. First, a review of pthread_cond's logic (listed here because the step numbers will be referenced several times later):
    pthread_cond_wait(mutex, cond):
        value = cond->value;                   /* 1 */
        pthread_mutex_unlock(mutex);           /* 2 */
    retry:
        pthread_mutex_lock(cond->mutex);       /* 10 */
        if (value == cond->value) {            /* 11 */
            me->next_cond = cond->waiter;
            cond->waiter = me;
            pthread_mutex_unlock(cond->mutex);
            unable_to_run(me);
            goto retry;
        } else
            pthread_mutex_unlock(cond->mutex); /* 12 */
        pthread_mutex_lock(mutex);             /* 13 */

    pthread_cond_signal(cond):
        pthread_mutex_lock(cond->mutex);       /* 3 */
        cond->value++;                         /* 4 */
        if (cond->waiter) {                    /* 5 */
            sleeper = cond->waiter;            /* 6 */
            cond->waiter = sleeper->next_cond; /* 7 */
            able_to_run(sleeper);              /* 8 */
        }
        pthread_mutex_unlock(cond->mutex);     /* 9 */
pthread_cond_broadcast is like pthread_cond_signal, except that it wakes all waiters rather than one. Note that the first thing pthread_cond_wait does after being woken is lock(mutex) (step 13). If pthread_cond_broadcast wakes N waiters at once, they all scramble for the mutex the moment they wake: a thundering herd.
As an optimization, instead of waking every waiter with futex_wake, pthread_cond_broadcast can use futex_requeue to wake just one waiter and transfer all the others onto the mutex's wait queue (they are then woken one by one by the mutex's unlock).
Why is there also a futex_cmp_requeue? Because futex_requeue is actually problematic: it drags a batch of processes straight onto uaddr2's wait queue without performing a state check inside the critical section (recall the importance of the *uaddr == val check in futex_wait). There is a window between deciding to call futex_requeue and the processes actually arriving at uaddr2; another thread may futex_wake(uaddr2) in that gap, fail to wake the processes that are about to be moved but have not moved yet, and leave them never to be woken again.
futex_requeue is thus not rigorous, but the pthread_cond_broadcast case happens to be fine: nobody can futex_wake(uaddr2) while pthread_cond_broadcast is moving the waiters, because the internal lock is held by pthread_cond_broadcast and is not released until the wake-up is finished (step 9). That is why futex_requeue, problem and all, was openly released anyway.
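For reference, a thin wrapper over the raw syscall (the FUTEX_CMP_REQUEUE op is real; for the requeue ops the fourth syscall slot, normally the timeout, carries val2, the number of waiters to move):

    #include <limits.h>
    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static long futex_cmp_requeue(int *uaddr, int n_wake, int *uaddr2,
                                  int n_requeue, int val) {
        return syscall(SYS_futex, uaddr, FUTEX_CMP_REQUEUE, n_wake,
                       (void *)(long)n_requeue, uaddr2, val);
    }

    // broadcast: wake one waiter and move all the others to the mutex's
    // queue, but only if the cond word still holds the value we saw:
    //     futex_cmp_requeue(&cond_word, 1, &mutex_word, INT_MAX, seen);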
Wake & op series

    int futex_wake_op(int *uaddr1, int *uaddr2, int n1, int n2, int op);
This system call has a somewhat CISC flavor: many actions packed into one call. It tries to wake n1 processes in uaddr1's wait queue, then modifies the value at uaddr2, and if the old value of uaddr2 satisfies a condition, also wakes n2 processes in uaddr2's queue. How is *uaddr2 modified? What condition must hold for uaddr2's waiters to be woken? All that logic is packed into the op parameter.
The op parameter, though typed int, is really a packed struct:
    struct op {
        // how to modify *uaddr2: SET (*uaddr2 = OPARG), ADD (*uaddr2 += OPARG),
        // OR (*uaddr2 |= OPARG), ANDN (*uaddr2 &= ~OPARG), XOR (*uaddr2 ^= OPARG)
        int OP : 4;
        // how to test whether (the old) *uaddr2 satisfies the condition:
        // EQ (==), NE (!=), LT (<), LE (<=), GT (>), GE (>=)
        int CMP : 4;
        int OPARG : 12;    // argument for modifying *uaddr2
        int CMPARG : 12;   // argument for the condition on *uaddr2
    };
For all this elaborate logic, futex_wake_op only wants to handle two locks in one system call, equivalent to calling futex_wake twice from user mode.
Suppose user mode needs to release both the uaddr1 lock and the uaddr2 lock (0: unlocked, 1: locked, 2: locked with processes suspended). Without futex_wake_op you would write:
    int old1, old2;
    atomic { old1 = *uaddr1; *uaddr1 = 0; }
    if (old1 == 2) {
        futex_wake(uaddr1, n1);
    }
    atomic { old2 = *uaddr2; *uaddr2 = 0; }
    if (old2 == 2) {
        futex_wake(uaddr2, n2);
    }
With futex_wake_op, you only need to do this:
    int old1;
    atomic { old1 = *uaddr1; *uaddr1 = 0; }
    if (old1 == 2) {
        // op's meaning: set *uaddr2 = 0, and wake only if the old *uaddr2 == 2,
        // i.e. OP=SET, OPARG=0, CMP=EQ, CMPARG=2
        futex_wake_op(uaddr1, uaddr2, n1, n2, op);
    } else {
        ...   // handle uaddr2 separately
    }
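With the real kernel header, that op word can be built with the FUTEX_OP macro from <linux/futex.h> (the macro and constants are real; the wrapper name is this article's):

    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static long unlock_both(int *uaddr1, int n1, int *uaddr2, int n2) {
        // *uaddr2 = 0 (SET, oparg 0); wake on uaddr2 only if old *uaddr2 == 2
        return syscall(SYS_futex, uaddr1, FUTEX_WAKE_OP, n1,
                       (void *)(long)n2,   // val2: wake count for uaddr2
                       uaddr2,
                       FUTEX_OP(FUTEX_OP_SET, 0, FUTEX_OP_CMP_EQ, 2));
    }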
It is not just about saving one system call; that alone would not justify the complexity. The point is that after unlock(uaddr1), the woken process may immediately try lock(uaddr2). If uaddr2 has not been unlocked yet, the freshly woken process is suspended again on the spot, only to be woken a second time by unlock(uaddr2). Isn't that a nuisance?
Exactly this scenario can occur between pthread_cond_wait and pthread_cond_signal. When pthread_cond_signal wakes a waiter, it releases the internal lock (step 9), and the woken pthread_cond_wait immediately tries to acquire the internal lock to recheck the state (step 10). If futex_wake_op did not fuse the wake-up with the lock release, there would be heavy contention.
Of course, the futex_cmp_requeue mentioned earlier could also avoid the excess contention: instead of waking the waiter directly, pthread_cond_signal would requeue it onto the internal lock's wait queue, so that it is only truly woken once that lock is released. But since pthread_cond_signal releases the internal lock right away, requeue-then-wake is somewhat roundabout.
Priority Inheritance series

    int futex_lock_pi(int *uaddr);
    int futex_trylock_pi(int *uaddr);
    int futex_unlock_pi(int *uaddr);
    int futex_wait_requeue_pi(int *uaddr1, int val1, int *uaddr2);
    int futex_cmp_requeue_pi(int *uaddr, int n1, int *uaddr2, int n2, int val);
Priority inheritance is one way of solving the priority-inversion problem.
futex_lock_pi, futex_trylock_pi and futex_unlock_pi are futex lock operations with priority inheritance, sketched below.
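A minimal sketch of the PI convention, assuming Linux: the futex word holds the owner's TID, the uncontended paths are a user-mode compare-and-swap, and the kernel is entered only on contention (the real protocol also involves a FUTEX_WAITERS bit that the kernel sets, which is what makes the unlock cmpxchg fail when waiters exist):

    #include <linux/futex.h>
    #include <stdatomic.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    void pi_lock(atomic_int *lockval) {
        int expected = 0;
        int tid = syscall(SYS_gettid);
        if (atomic_compare_exchange_strong(lockval, &expected, tid))
            return;   // uncontended: the word now holds our TID
        // contended: the kernel queues us by priority and boosts the owner
        syscall(SYS_futex, lockval, FUTEX_LOCK_PI, 0, NULL, NULL, 0);
    }

    void pi_unlock(atomic_int *lockval) {
        int expected = syscall(SYS_gettid);
        if (atomic_compare_exchange_strong(lockval, &expected, 0))
            return;   // no waiters: plain user-mode release
        syscall(SYS_futex, lockval, FUTEX_UNLOCK_PI, 0, NULL, NULL, 0);
    }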
futex_cmp_requeue_pi is the priority-inheritance version of futex_cmp_requeue; futex_wait_requeue_pi is used together with it, in place of the plain futex_wait.