How Linux system calls are implemented

2025-07-19 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

Today I will walk through how Linux system calls are implemented, a topic that many people find hard to follow. To make it easier to understand, I have summarized the key points below, and I hope you get something out of this article.

I drew this picture some time ago, mainly so that you can look at the implementation of system calls in the Linux kernel from a global point of view.

Before going into the details, let's take a look at the implementation of the system call as a whole according to the figure above.

On x86-64, the implementation of system calls is based on two assembly instructions, syscall and sysret.

The syscall instruction switches execution from user mode to kernel mode. On entry to kernel mode, the CPU obtains the starting address of the kernel's system call code from the MSR_LSTAR register, that is, the entry_SYSCALL_64 shown above.

When executing the entry_SYSCALL_64 function, the kernel code first gets the number of the system call you want to execute from the rax register, and then finds the corresponding system call function from the sys_call_table array according to that number.

Next, it reads the parameters needed by the system call function from the rdi, rsi, rdx, r10, r8, and r9 registers, and then calls the function with those parameters.

After the execution of the system call function is completed, the execution result is put into the rax register.

Finally, the sysret assembly instruction is executed to switch from kernel mode to user mode, and the user program continues to execute.

If the user program needs the return value of the system call, it reads it from rax.
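The user-mode half of this flow can be sketched in C with a little inline assembly. This is a minimal sketch, not the article's original code: raw_syscall3 and say_hi are hypothetical names, and on anything other than x86-64 Linux the helper falls back to the libc syscall() function, which reports errors through errno rather than as a raw negative value.

```c
#define _GNU_SOURCE        /* for the syscall() declaration used by the fallback */
#include <assert.h>
#include <stddef.h>
#include <sys/syscall.h>   /* SYS_write */
#include <unistd.h>

/* Hypothetical helper: issue a 3-argument system call and return its result. */
static long raw_syscall3(long nr, long a1, long a2, long a3)
{
#if defined(__x86_64__) && defined(__linux__)
    long ret;
    /* rax = number, rdi/rsi/rdx = arguments; syscall clobbers rcx and r11. */
    __asm__ volatile("syscall"
                     : "=a"(ret)
                     : "a"(nr), "D"(a1), "S"(a2), "d"(a3)
                     : "rcx", "r11", "memory");
    return ret;            /* the kernel left the result in rax */
#else
    return syscall(nr, a1, a2, a3);   /* assumption: libc fallback elsewhere */
#endif
}

/* Write "Hi\n" to stdout (fd 1); on success the result is the byte count. */
static long say_hi(void)
{
    static const char msg[] = "Hi\n";
    return raw_syscall3(SYS_write, 1, (long)msg, (long)(sizeof msg - 1));
}
```

Calling say_hi() prints Hi and returns the number of bytes written, which is the value the kernel placed in rax.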

That is the overall process, and it is fairly simple. The key is to understand the syscall and sysret instructions; once you understand them, the kernel source code becomes much easier to read.

For a detailed description of the syscall and sysret instructions, refer to the Intel® 64 and IA-32 Architectures Software Developer's Manual.

With the above understanding of system calls, let's take a look at the specific implementation details.

Take the write system call as an example. The corresponding kernel source code is:

In the kernel, all system call functions are defined by macros such as SYSCALL_DEFINE, such as the write function above, which uses SYSCALL_DEFINE3.

After expanding the macro, we can get the following function definition:

As can be seen from the above, the SYSCALL_DEFINE3 macro expands into three functions. Only __x64_sys_write is externally accessible; the other two are declared static and cannot be accessed from outside. So __x64_sys_write should be the function registered in the sys_call_table array mentioned above.
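The three-function pattern can be imitated in user space. The sketch below is a simplified model, not the kernel's actual macro: MODEL_SYSCALL_DEFINE3, struct model_regs, and the model_ prefixes are all hypothetical, and the toy write body only reports the requested byte count.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for struct pt_regs: the first three argument fields only. */
struct model_regs {
    unsigned long di, si, dx;
};

/*
 * Hypothetical cut-down imitation of SYSCALL_DEFINE3. It emits:
 *   model_do_sys_<name>  - the body written by the author, with typed parameters
 *   model_se_sys_<name>  - takes every argument as long and casts to the real type
 *   model_x64_sys_<name> - the only non-static one; unpacks di, si, dx from regs
 */
#define MODEL_SYSCALL_DEFINE3(name, t1, a1, t2, a2, t3, a3)                 \
    static inline long model_do_sys_##name(t1 a1, t2 a2, t3 a3);            \
    static long model_se_sys_##name(long a1, long a2, long a3)              \
    {                                                                       \
        return model_do_sys_##name((t1)a1, (t2)a2, (t3)a3);                 \
    }                                                                       \
    long model_x64_sys_##name(const struct model_regs *regs)                \
    {                                                                       \
        return model_se_sys_##name((long)regs->di, (long)regs->si,          \
                                   (long)regs->dx);                         \
    }                                                                       \
    static inline long model_do_sys_##name(t1 a1, t2 a2, t3 a3)

/* Toy "write": returns the requested byte count, or -1 for a NULL buffer. */
MODEL_SYSCALL_DEFINE3(write, unsigned int, fd, const char *, buf, size_t, count)
{
    (void)fd;
    return buf ? (long)count : -1;
}
```

Only model_x64_sys_write is visible outside the file, which mirrors why the externally accessible function is the natural candidate for registration.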

So how did the function register with the array?

Without giving the answer, let's take a look at the definition of the sys_call_table array:

As you can see from above, the default value for each element of the array is __x64_sys_ni_syscall:

This function is also very simple: it directly returns the error code -ENOSYS, indicating that the system call is not implemented.

The sys_call_table array definition seems to set only the default value, not the actual system call function.

Let's look elsewhere to see if there is code that registers the real system call function into the sys_call_table array.

Unfortunately, no.

This is strange, so where on earth are the system call functions registered?

Let's go back and take a closer look at the definition of the sys_call_table array. After setting the default value, it also includes a header file called asm/syscalls_64.h. The position of this include is rather strange, so let's take a look at what the file contains.

However, this file does not exist.

Then we can only initially suspect that the header file was generated at compile time. With this doubt, we searched for the relevant content and did find some clues:

This file is indeed generated at compile time, and the syscalltbl.sh script and syscall_64.tbl template file are used in the above makefile to generate the syscalls_64.h header file.

Let's look at the contents of the syscall_64.tbl template file:

The write system call is indeed defined and marked with a number of 1.

Let's take a look at the generated syscalls_64.h header file:

It consists of a long list of what look like macro invocations.
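One way to picture how a row of the table file becomes one of these lines is the small C helper below. It is only a sketch of the transformation, assuming rows of the form "number, ABI, name, entry point" and a macro name built from the upper-cased ABI column; the real work is done by the syscalltbl.sh shell script, and tbl_row_to_macro is a hypothetical name.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: parse a row "<number> <abi> <name> <entry point>" and
 * format the matching macro invocation into out.
 * Returns 0 on success, -1 if the row does not parse or out is too small.
 */
static int tbl_row_to_macro(const char *row, char *out, size_t outsz)
{
    int nr;
    char abi[16], name[64], entry[64];

    if (sscanf(row, "%d %15s %63s %63s", &nr, abi, name, entry) != 4)
        return -1;

    /* Upper-case the ABI column so that "common" becomes the COMMON suffix. */
    for (char *p = abi; *p; p++)
        if (*p >= 'a' && *p <= 'z')
            *p = (char)(*p - 'a' + 'A');

    int n = snprintf(out, outsz, "__SYSCALL_%s(%d, %s)", abi, nr, entry);
    return (n < 0 || (size_t)n >= outsz) ? -1 : 0;
}
```

Feeding it the write row yields the text __SYSCALL_COMMON(1, sys_write); comment lines and blank lines fail to parse and are skipped by the caller.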

__SYSCALL_COMMON is the macro that was defined alongside the sys_call_table array.

Then take a look at the definition of the __SYSCALL_COMMON macro: it assigns the function represented by sym to index nr of the sys_call_table array.

So __SYSCALL_COMMON(1, sys_write) registers the __x64_sys_write function in slot 1 of the sys_call_table array.

And this __x64_sys_write function is exactly the externally accessible function that, as we guessed above, results from expanding the write system call defined with SYSCALL_DEFINE3.

It suddenly becomes clear: the real system call functions are registered by first defining the __SYSCALL_COMMON macro and then including the syscalls_64.h header file generated from the syscall_64.tbl template, which is very ingenious.
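The same trick can be reproduced in a single file. In this sketch a list macro stands in for the generated header, since a self-contained example cannot include a generated file; all model_ and MODEL_ names are hypothetical, and the range designator is a GNU C extension accepted by GCC and Clang.

```c
#include <assert.h>
#include <errno.h>

struct model_regs { unsigned long di, si, dx; };
typedef long (*model_sys_call_ptr_t)(const struct model_regs *);

#define MODEL_NR_MAX 7   /* hypothetical highest system call number */

/* Default handler for every unassigned slot. */
static long model_x64_sys_ni_syscall(const struct model_regs *regs)
{
    (void)regs;
    return -ENOSYS;
}

/* Two toy handlers standing in for real system calls. */
static long model_x64_sys_read(const struct model_regs *regs)  { (void)regs; return 100; }
static long model_x64_sys_write(const struct model_regs *regs) { (void)regs; return 101; }

/* Stand-in for the generated header: one macro invocation per table row. */
#define MODEL_SYSCALLS_64_H            \
    MODEL_SYSCALL_COMMON(0, sys_read)  \
    MODEL_SYSCALL_COMMON(1, sys_write)

/* Defined just before the table, so each invocation becomes "[nr] = handler,". */
#define MODEL_SYSCALL_COMMON(nr, sym) [nr] = model_x64_##sym,

static const model_sys_call_ptr_t model_sys_call_table[MODEL_NR_MAX + 1] = {
    /* Fill every slot with the default first. */
    [0 ... MODEL_NR_MAX] = &model_x64_sys_ni_syscall,
    /* Later designators override slots 0 and 1 with the real handlers. */
    MODEL_SYSCALLS_64_H
};
```

Compilers may warn that an initializer overrides an earlier one; in this pattern the override is the whole point.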

The process of registering the system call function with the sys_call_table array is clear at this point.

Let's move on to see where this array is being used:

do_syscall_64 is where it is used: it looks up the corresponding system call function in the sys_call_table array by nr, and then calls that function with regs as its argument.

This matches what we guessed above, and the type of the regs parameter passed in is the same as the parameter type of the system call functions we registered above.

That is to say, the fields of regs carry the parameters needed by each system call function. The functions generated by macros such as SYSCALL_DEFINE extract the real parameters from these fields, convert them, and finally pass them to the actual system call function.

For the functions produced by expanding the write system call macro above, __x64_sys_write first extracts the di, si, and dx fields from regs as the real parameters, then __se_sys_write converts these parameters to the correct types, and finally __do_sys_write is called with the converted parameters.

After the system call function is executed, the result is assigned to the ax field of regs.

As can be seen from the above, the parameters of the system call function and the return value are passed through regs.
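The lookup-and-call step can be modelled in a few lines of C. This is a simplified sketch under stated assumptions: the register frame is cut down to four fields, the table holds a single toy handler, and model_do_syscall_64 leaves out everything the real function does besides dispatch.

```c
#include <assert.h>
#include <errno.h>

/* Cut-down register frame: three argument fields plus ax for the result. */
struct model_regs { unsigned long di, si, dx, ax; };
typedef long (*model_sys_call_ptr_t)(const struct model_regs *);

/* Toy handler: adds the first two arguments. */
static long model_sys_add(const struct model_regs *regs)
{
    return (long)(regs->di + regs->si);
}

#define MODEL_NR_SYSCALLS 2
static const model_sys_call_ptr_t model_table[MODEL_NR_SYSCALLS] = {
    [1] = model_sys_add,   /* slot 0 is deliberately left empty */
};

/* Look up the handler by nr, call it with regs, store the result in regs->ax. */
static void model_do_syscall_64(unsigned long nr, struct model_regs *regs)
{
    if (nr < MODEL_NR_SYSCALLS && model_table[nr])
        regs->ax = (unsigned long)model_table[nr](regs);
    else
        regs->ax = (unsigned long)-ENOSYS;   /* unknown or unassigned number */
}
```

Both the arguments and the result travel through the same structure, which is the point the text makes about regs.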

But at the beginning of the article, didn't we say that the parameters and the return value of a system call are passed through registers? Why do they go through the fields of struct pt_regs here?

Don't worry, take a look at the definition of struct pt_regs:

Have you noticed that the field names in it are all register names?
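For reference, the x86-64 field order can be written out as a user-space struct. The struct is renamed model_pt_regs so that it cannot clash with a system header, and model_arg_offset is a hypothetical helper that maps the six argument positions to their fields.

```c
#include <assert.h>
#include <stddef.h>

/* Same field order as the x86-64 layout; the first field has the lowest address. */
struct model_pt_regs {
    unsigned long r15, r14, r13, r12, bp, bx;
    unsigned long r11, r10, r9, r8, ax, cx, dx, si, di;
    unsigned long orig_ax;                 /* system call number */
    unsigned long ip, cs, flags, sp, ss;   /* user-mode return state */
};

/* Byte offset of the field that holds the Nth (0-based) system call argument. */
static size_t model_arg_offset(int n)
{
    static const size_t off[6] = {
        offsetof(struct model_pt_regs, di),  offsetof(struct model_pt_regs, si),
        offsetof(struct model_pt_regs, dx),  offsetof(struct model_pt_regs, r10),
        offsetof(struct model_pt_regs, r8),  offsetof(struct model_pt_regs, r9),
    };
    assert(n >= 0 && n < 6);
    return off[n];
}
```

The structure has 21 word-sized fields, with r15 at offset 0 and ss at the highest offset.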

Does that mean that in the code that performs the system call, there is logic that puts the values in each register into the corresponding fields of this structure, and at the end of the system call, the values in these fields are assigned to the corresponding registers?

Getting closer and closer to the truth.

Let's move on to where do_syscall_64 is used:

The entry_SYSCALL_64 method in the figure above is the most important method in the system call process. In order to make it easier to understand, I made a lot of changes to this method and added a lot of comments.

One thing to note here is the logic of lines 100 to 121, which pushes the values of each register onto the stack to build the struct pt_regs object.

Can you really build a struct pt_regs object that way?

Right.

Look at the definition of struct pt_regs above again and check whether its field names and order are the reverse of the push order here.

Let's think again, when we want to build a struct pt_regs object, we need to allocate a piece of space in memory, and then use an address to point to that space, which is the pointer to the struct pt_regs object. It is important to note that the address in this pointer is the smallest address of this memory space.

Looking at the push sequence above, we can think of each push as allocating memory and assigning a value. When r15 is finally pushed, the entire memory space has been allocated and its data initialized. At that point, the top-of-stack address held in rsp is the smallest address of this memory space, because the top of the stack moves to a lower address with every push.

To sum up, after the pushes, the address in rsp is the address of a struct pt_regs object, that is, a pointer to that object.
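This argument can be checked with a simulated stack. The sketch below assumes a downward-growing stack of machine words; model_stack, model_rsp, model_push, and model_build_frame are hypothetical names, and most registers are pushed as zero because only the order matters here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct model_pt_regs {
    unsigned long r15, r14, r13, r12, bp, bx;
    unsigned long r11, r10, r9, r8, ax, cx, dx, si, di;
    unsigned long orig_ax;
    unsigned long ip, cs, flags, sp, ss;
};

#define MODEL_STACK_WORDS 64
static unsigned long model_stack[MODEL_STACK_WORDS];
static unsigned long *model_rsp = model_stack + MODEL_STACK_WORDS;  /* empty stack */

/* A push moves the stack pointer down one word, then stores the value there. */
static void model_push(unsigned long v)
{
    assert(model_rsp > model_stack);
    *--model_rsp = v;
}

/* Push in the reverse of the struct's field order, then read the frame at rsp. */
static struct model_pt_regs model_build_frame(unsigned long nr, unsigned long a1,
                                              unsigned long a2, unsigned long a3)
{
    const unsigned long push_order[] = {
        0,   /* ss */       0,   /* sp */   0,   /* flags */
        0,   /* cs */       0,   /* ip */   nr,  /* orig_ax */
        a1,  /* di */       a2,  /* si */   a3,  /* dx */
        0,   /* cx */       0,   /* ax */   0,   /* r8 */
        0,   /* r9 */       0,   /* r10 */  0,   /* r11 */
        0,   /* bx */       0,   /* bp */   0,   /* r12 */
        0,   /* r13 */      0,   /* r14 */  15,  /* r15, marker value */
    };
    struct model_pt_regs regs;

    model_rsp = model_stack + MODEL_STACK_WORDS;
    for (size_t i = 0; i < sizeof push_order / sizeof push_order[0]; i++)
        model_push(push_order[i]);

    /* rsp now holds the lowest address of the frame, so it doubles as the pointer. */
    memcpy(&regs, model_rsp, sizeof regs);
    return regs;
}
```

After the last push, reading the memory at model_rsp as a struct model_pt_regs gives back each value under the matching field name.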

After building the struct pt_regs object, line 123 assigns the system call number stored in rax to rdi, line 124 assigns the address of the struct pt_regs object stored in rsp, that is, its pointer, to rsi, and then the call instruction invokes the do_syscall_64 method.

rdi and rsi are assigned before calling do_syscall_64 to comply with the C calling convention, which stipulates that when a C function is called, the first parameter is placed in rdi and the second in rsi.

Let's go to the above to see if the definition, parameter type and order of the do_syscall_64 method are exactly the same as what we are talking about here.

After do_syscall_64 returns, the whole system call is almost over. Lines 129,133 in the figure above restore registers, for example popping the saved values from the stack into rax, rip, rsp, and so on.

It is important to note that the value of rax in the stack is set in the above do_syscall_64 method, which stores the final result of the system call.

In addition, the rip and rsp values popped from the stack are the next instruction address and the stack address of the user-mode program, respectively.

Finally, execute sysret, switch from kernel state to user state, and continue to execute the logic behind syscall.

At this point, the complete system call processing flow is almost finished, but there is still a small step left, that is, how the syscall instruction finds the entry_SYSCALL_64 method after entering the kernel state:

It is actually registered in the MSR_LSTAR register, and after entering the kernel state, the syscall instruction will directly take the address of the system call handler from this register and start execution.
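As a rough mental model, the MSR behaves like a slot holding a function address that the hardware jumps to. The sketch below simulates that in user space; it does not touch any real MSR, and every model_ name is hypothetical. The only real-world constant is the MSR number 0xc0000082.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_MSR_LSTAR 0xc0000082u   /* MSR number of LSTAR */

typedef long (*model_entry_fn)(long nr);

/* One simulated MSR slot; real MSRs are written from kernel mode only. */
static uint64_t model_lstar;

static void model_wrmsrl(uint32_t msr, uint64_t value)
{
    assert(msr == MODEL_MSR_LSTAR);   /* the sketch models only this register */
    model_lstar = value;
}

/* Stand-in for entry_SYSCALL_64: just shows that it was reached with nr. */
static long model_entry_SYSCALL_64(long nr)
{
    return nr + 1000;
}

/* Boot-time step: record the entry point's address in the simulated MSR. */
static void model_syscall_init(void)
{
    model_wrmsrl(MODEL_MSR_LSTAR, (uint64_t)(uintptr_t)model_entry_SYSCALL_64);
}

/* What the syscall instruction does in this model: jump to the address in LSTAR. */
static long model_syscall_insn(long nr)
{
    model_entry_fn entry = (model_entry_fn)(uintptr_t)model_lstar;
    return entry(nr);
}
```

The instruction itself knows nothing about entry_SYSCALL_64; it only follows whatever address was stored in the register during initialization.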

This is the logical processing of the kernel state of the system call.

Let's use an example to demonstrate the user mode part:

Compile and execute:

We use syscall to execute the write system call and write the string Hi\n. After the syscall finishes, we directly use the ret instruction, so the return value of write becomes the exit code of the program.

So in the figure above, Hi is printed and the exit code of the program is 3.

If you don't quite understand the assembly above, think of it like this:

Here, we use the write method in glibc to execute the system call, which is actually a layer of encapsulation of the syscall instruction, essentially using the assembly code above.
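A minimal C version of this idea might look like the following. The function name hi is an assumption made for illustration; the article's own listing is not reproduced here.

```c
#include <assert.h>
#include <unistd.h>   /* write(), STDOUT_FILENO */

/* Print "Hi\n" through the glibc wrapper and hand back write's return value. */
static int hi(void)
{
    return (int)write(STDOUT_FILENO, "Hi\n", 3);
}
```

Returning hi() from main makes the byte count the exit code of the program, matching the behavior described for the assembly version.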

This example ends here.

Does that feel a little unsatisfying?

We analyzed so much code that we ended up with such a small example. No, we have to do something more.

Why don't we write a system call ourselves?

Just do it.

Let's first define our own system call under the write system call:

The method is simple: it adds 10 to its parameter and returns the result.
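The logic of such a system call can be modelled in user space as follows. The name add10 and the single-field struct model_regs are hypothetical; in the kernel the function would be defined with one of the SYSCALL_DEFINE macros rather than written out by hand.

```c
#include <assert.h>

/* Cut-down register frame: only the first argument field is needed. */
struct model_regs { unsigned long di; };

/* Typed body, as the author of the system call would write it. */
static long model_do_sys_add10(int n)
{
    return n + 10;
}

/* Register-level entry point: the single argument arrives in the di field. */
static long model_x64_sys_add10(const struct model_regs *regs)
{
    return model_do_sys_add10((int)regs->di);
}
```

With di set to 10, the entry point returns 20, which is the value a caller would eventually see in rax.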

Compile the kernel and wait for execution.

Let's modify and compile the hi program written above:

Then start the newly compiled linux kernel in the virtual machine and execute the above program:

Look at the result, it's exactly 20.

After reading the above, do you have a better understanding of how Linux system calls are implemented?
