How to write Linux kernel module

2025-01-18 Update. From: SLTechnology News&Howtos

Shulou (Shulou.com), 05/31 report

This article explains how to write a Linux kernel module, working through one example step by step.

The Linux kernel is very different from user space: the stakes are higher, because a single bug in your code affects the entire system. There is no easy way to do floating-point math, the stack is fixed and small, and the code you write is always asynchronous, so you need to think about concurrency. Despite all this, the Linux kernel is just a very large and complex C program that is open for everyone to read, learn from and improve, and you can be part of that too.

Perhaps the easiest way to start kernel programming is to write a module: a piece of code that can be loaded into the kernel dynamically. There are limits to what modules can do; for example, they cannot add or remove fields in common data structures such as process descriptors. In all other respects, however, they are full-fledged kernel-level code, and they can always be compiled into the kernel if needed (which removes all of these restrictions). You can develop and compile a module outside the Linux source tree (unsurprisingly, this is called an out-of-tree build), which is convenient if you just want to play around and don't intend to submit your changes for inclusion in the mainline kernel.

In this tutorial, we will develop a simple kernel module that creates a /dev/reverse device. A string written to the device is read back with its word order reversed ("Hello World" becomes "World Hello"). This is a popular programmer interview puzzle, and you are likely to earn some bonus points by showing that you can implement it at the kernel level as well. A word of warning before you start: a bug in your module can lead to a system crash and (unlikely, but possible) data loss. Make sure you have backed up all important data or, better still, experiment in a virtual machine.

Avoid root where possible

By default, /dev/reverse is accessible only to root, so you would have to run your test programs with sudo. To work around this, create a /lib/udev/rules.d/99-reverse.rules file that contains the following:

    SUBSYSTEM=="misc", KERNEL=="reverse", MODE="0666"

Don't forget to reinsert the module. Giving non-root users access to device nodes is generally not a good idea, but it is quite useful during development. Besides, running test binaries as root is not a good idea either.

Anatomy of a module

Most Linux kernel modules are written in C (apart from low-level, architecture-specific parts), and it is recommended that you keep your module in a single file (say, reverse.c). We've put the complete source code on GitHub; here we'll look at some snippets of it. To begin with, let's include some common headers and describe the module using predefined macros:

    #include <linux/init.h>
    #include <linux/kernel.h>
    #include <linux/module.h>

    MODULE_LICENSE("GPL");
    MODULE_AUTHOR("Valentine Sinitsyn");
    MODULE_DESCRIPTION("In-kernel phrase reverser");

Everything is straightforward here, except for MODULE_LICENSE(): it is not a mere marker. The kernel strongly favors GPL-compatible code, so if you set the license to something that is not GPL-compatible (say, "Proprietary"), certain kernel functions will not be available to your module.

When not to write a kernel module

Kernel programming is fun, but writing (and especially debugging) kernel code in a real-world project requires certain skills. In general, you should descend to the kernel level only if there is no other way to solve your problem. You may be better off staying in user space in the following situations:

You are developing a USB driver: have a look at libusb.

You are developing a file system: try FUSE.

You are extending Netfilter: libnetfilter_queue may help you.

Native kernel code generally performs better, but for many projects the performance loss of a user-space solution is not critical.

Since kernel programming is always asynchronous, there is no main() function that Linux executes sequentially to run your module. Instead, you provide callbacks for various events, like this:

    static int __init reverse_init(void)
    {
        printk(KERN_INFO "reverse device has been registered\n");
        return 0;
    }

    static void __exit reverse_exit(void)
    {
        printk(KERN_INFO "reverse device has been unregistered\n");
    }

    module_init(reverse_init);
    module_exit(reverse_exit);

Here, we define functions to be called on module insertion and removal. Only the first one is strictly necessary. For now, they simply print a message to the kernel ring buffer (accessible from user space through the dmesg command); KERN_INFO is the log level (note that there is no comma after it). __init and __exit are attributes: pieces of metadata attached to functions (or variables). Attributes are rarely seen in user-space C code but are quite common in the kernel. Everything marked with __init is released after initialization so that the memory can be reused (remember the old "Freeing unused kernel memory..." message?). __exit says that the function can safely be omitted when the code is built statically into the kernel, since such code is never unloaded.

Finally, the module_init() and module_exit() macros set reverse_init() and reverse_exit() as the life-cycle callbacks of our module. The actual function names don't matter: you can call them init() and exit(), or start() and stop(), if you wish. They are declared static and are therefore invisible to other modules. In fact, any function in the kernel is invisible unless it is explicitly exported. However, prefixing your functions with the module name is a common convention among kernel programmers.

These are the bare basics; let's make things more interesting. Modules can receive parameters, like this:

    # modprobe foo bar=1

The modinfo command displays all parameters accepted by a module, and they are also available as files under /sys/module/<module name>/parameters. Our module will need a buffer to store phrases, so let's make its size user-configurable. Add the following three lines just below MODULE_DESCRIPTION():

    static unsigned long buffer_size = 8192;
    module_param(buffer_size, ulong, (S_IRUSR | S_IRGRP | S_IROTH));
    MODULE_PARM_DESC(buffer_size, "Internal buffer size");

Here, we define a variable to store the value, wrap it into a parameter, and make it readable by everyone through sysfs. The parameter's description (the last line) appears in the output of modinfo.

Since the user can set buffer_size directly, we need to sanitize it in reverse_init(). You should always check data that comes from outside the kernel; if you don't, you are opening yourself up to kernel panics or even security holes.

    static int __init reverse_init(void)
    {
        if (!buffer_size)
            return -1;
        printk(KERN_INFO
               "reverse device has been registered, buffer size is %lu bytes\n",
               buffer_size);
        return 0;
    }

A non-zero return value from a module init function indicates failure, and the module is not loaded.

Navigation

The Linux kernel is the ultimate source for everything you need when developing modules. However, it is quite big, and you may have trouble finding what you are after. Luckily, there are tools that make it easier to navigate large codebases. First of all, there is Cscope, a venerable tool that runs in a terminal. Simply run make cscope && cscope in the top-level directory of the kernel sources. Cscope integrates well with Vim and Emacs, so you can use it without leaving your favorite editor.

If terminal-based tools aren't your cup of tea, visit http://lxr.free-electrons.com. It is a web-based kernel navigation tool that has fewer features than Cscope (for example, you can't easily find the usages of a function), but it still provides enough for quick lookups.

Now it's time to compile the module. You need the headers for the kernel version you are running (linux-headers or an equivalent package) and build-essential (or a similar package). Next, create a boilerplate Makefile (note that recipe lines must start with a tab):

    obj-m += reverse.o

    all:
    	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules

    clean:
    	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean

Now, call make to build your module. If you typed everything correctly, you will find reverse.ko in the current directory. Insert it with sudo insmod reverse.ko, and run:

    $ dmesg | tail -1
    [5905.042081] reverse device has been registered, buffer size is 8192 bytes

Congratulations! For now, however, this message is not telling the truth: there is no device node yet. Let's fix that.

Miscellaneous devices

In Linux, there is a special character device type called "miscellaneous" (or simply "misc"). It is designed for small device drivers with a single entry point, which is exactly what we need. All misc devices share the same major number (10), so a single driver (drivers/char/misc.c) can look after all of them, and they are distinguished by their minor numbers. In all other respects, they are just normal character devices.

To register a minor number (and an entry point) for the device, you declare a struct miscdevice, fill in its fields (note the designated-initializer syntax), and call misc_register() with a pointer to this structure. For this to work, you also need to include the linux/miscdevice.h header file:

    static struct miscdevice reverse_misc_device = {
        .minor = MISC_DYNAMIC_MINOR,
        .name = "reverse",
        .fops = &reverse_fops
    };

    static int __init reverse_init(void)
    {
        ...
        misc_register(&reverse_misc_device);
        printk(KERN_INFO ...
    }

Here, we request the first available (dynamic) minor number for the device named "reverse"; the ellipsis stands for omitted code that we have already seen. Don't forget to unregister the device when the module is removed:

    static void __exit reverse_exit(void)
    {
        misc_deregister(&reverse_misc_device);
        ...
    }

The fops field stores a pointer to a struct file_operations (declared in linux/fs.h), and this is the entry point for our module. reverse_fops is defined as:

    static struct file_operations reverse_fops = {
        .owner = THIS_MODULE,
        .open = reverse_open,
        ...
        .llseek = noop_llseek
    };

Again, reverse_fops contains a set of callbacks (also known as methods) that are executed when user-space code opens the device, reads from it, writes to it, or closes the file descriptor. If you omit any of these callbacks, a sensible fallback is used instead. That's why we explicitly set the llseek method to noop_llseek(), which (as the name implies) does nothing. The default implementation changes the file pointer, and we don't want our device to be seekable for now (this is your homework for today).

Open and close

Let's implement these methods. We will allocate a new buffer for each open file descriptor and free it when the descriptor is closed. This is actually unsafe: if a user-space application leaks descriptors (perhaps intentionally), it may hog RAM and render the system unusable. You should always think about such possibilities in the real world, but for this tutorial the approach is acceptable.

We need a structure to describe the buffer. The kernel provides many generic data structures: linked lists (which are doubly linked), hash tables, trees, and so on. Buffers, however, are usually implemented from scratch. We will call ours "struct buffer":

    struct buffer {
        char *data, *end, *read_ptr;
        unsigned long size;
    };

data is a pointer to the string this buffer stores, and end points to the first byte after the end of the string. read_ptr is where read() should start reading data from. The buffer size is stored for completeness; we don't use this field yet. You shouldn't assume that users of your structure will initialize all of these fields correctly, so it is better to encapsulate buffer allocation and deallocation in functions. They are usually named buffer_alloc() and buffer_free().

    static struct buffer *buffer_alloc(unsigned long size)
    {
        struct buffer *buf;

        buf = kzalloc(sizeof(*buf), GFP_KERNEL);
        if (unlikely(!buf))
            goto out;
        ...
    out:
        return buf;
    }

Kernel memory is allocated with kmalloc() and freed with kfree(); the kzalloc() flavor additionally sets the memory to all zeros. Unlike the standard malloc(), its kernel counterpart takes a second argument with flags that specify the type of memory requested. Here, GFP_KERNEL means that we need normal kernel memory (not in the DMA or high-memory zones) and that the function may sleep (reschedule the process) if needed. sizeof(*buf) is a common way to get the size of a structure that is accessed through a pointer.

You should always check the return value of kmalloc(): dereferencing a NULL pointer results in a kernel oops (and possibly a panic). Also note the use of the unlikely() macro. It (and its counterpart, likely()) is widely used in the kernel to indicate that a condition is almost always false (or true). It doesn't affect control flow, but it helps the compiler and modern processors through better code layout and branch prediction.

Finally, note the goto statements. They are often considered evil, but the Linux kernel (and some other system software) uses them to implement centralized function exits. This results in less deeply nested, more readable code, much like the try-catch blocks used in higher-level languages.
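The allocation pattern described above can be tried out in user space. The following is a minimal sketch, not the article's kernel code: calloc() and free() stand in for kzalloc() and kfree(), the likely()/unlikely() macros are defined with the GCC/Clang __builtin_expect builtin, and the out_free label and the body of buffer_free() are assumptions made for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Branch hints built on __builtin_expect (assumes GCC or Clang). */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

struct buffer {
    char *data, *end, *read_ptr;
    unsigned long size;
};

/* Allocate a zeroed buffer; every failure path leaves through one exit. */
static struct buffer *buffer_alloc(unsigned long size)
{
    struct buffer *buf;

    assert(size > 0);                 /* the caller validates the size */

    buf = calloc(1, sizeof(*buf));    /* stand-in for kzalloc() */
    if (unlikely(!buf))
        goto out;

    buf->data = calloc(1, size);
    if (unlikely(!buf->data))
        goto out_free;

    buf->size = size;
    buf->end = buf->data;             /* empty buffer: nothing to read */
    buf->read_ptr = buf->data;
    return buf;

out_free:
    free(buf);
    buf = NULL;
out:
    return buf;
}

/* Release a buffer obtained from buffer_alloc(); NULL is accepted. */
static void buffer_free(struct buffer *buf)
{
    if (!buf)
        return;
    free(buf->data);
    free(buf);
}
```

Each additional resource would get its own label above out_free, so the cleanup code unwinds in the reverse order of acquisition without nesting the happy path.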

With buffer_alloc() and buffer_free() in place, the implementation of the open and close methods becomes pretty straightforward:

    static int reverse_open(struct inode *inode, struct file *file)
    {
        int err = 0;

        file->private_data = buffer_alloc(buffer_size);
        ...
        return err;
    }

struct file is a standard kernel data structure that stores information about an open file, such as the current file position (file->f_pos), flags (file->f_flags), or open mode (file->f_mode). Another field, file->private_data, is used to associate the file with some arbitrary data. Its type is void *, and it is opaque to the kernel outside the file's owner. We store a buffer there.

If the buffer allocation fails, we signal this to the calling user-space code by returning a negative value (-ENOMEM). The C library (such as glibc) that performs the open(2) system call detects this and sets errno appropriately.

Learn to read and write

The "read" and "write" methods are where the real work is done. When data is written to the buffer, we drop its previous contents and reverse the phrase in place, without any temporary storage. The read method simply copies the data from the kernel buffer into user space. But what should reverse_read() do if there is no data in the buffer yet? In user space, the read() call would block until data is available. In the kernel, you must wait. Luckily, there is a mechanism for this, and it is called "wait queues".

The idea is simple. If the current process needs to wait for some event, its descriptor (a struct task_struct, referred to as "current") is put into a non-runnable (sleeping) state and added to a queue. Then schedule() is called to select another process to run. The code that generates the event uses the queue to wake up the waiters by putting them back into the TASK_RUNNING state. The scheduler will select one of them at some point in the future. Linux has several non-runnable states, most notably TASK_INTERRUPTIBLE (a sleep that can be interrupted by a signal) and TASK_KILLABLE (a sleep that can be interrupted only by a fatal signal). All of this needs to be handled correctly, and wait queues do it for you.

A natural place to store the head of our read wait queue is struct buffer, so start by adding a wait_queue_head_t read_queue field to it. You should also include the linux/sched.h header file. A wait queue head can be declared statically with the DECLARE_WAIT_QUEUE_HEAD() macro. In our case, dynamic initialization is needed, so add this line to buffer_alloc():

    init_waitqueue_head(&buf->read_queue);

We wait for data to become available, that is, for the read_ptr != end condition to become true. We also want the wait to be interruptible (say, by Ctrl+C). So the "read" method should start like this:

    static ssize_t reverse_read(struct file *file, char __user *out,
                                size_t size, loff_t *off)
    {
        struct buffer *buf = file->private_data;
        ssize_t result;

        while (buf->read_ptr == buf->end) {
            if (file->f_flags & O_NONBLOCK) {
                result = -EAGAIN;
                goto out;
            }
            if (wait_event_interruptible(buf->read_queue,
                                         buf->read_ptr != buf->end)) {
                result = -ERESTARTSYS;
                goto out;
            }
        }
        ...

We loop until data is available and use wait_event_interruptible() to wait if it isn't (it is a macro, not a function, which is why the queue is passed by value). If wait_event_interruptible() is interrupted, it returns a non-zero value, which we translate to -ERESTARTSYS. This code means that the system call should be restarted. The file->f_flags check accounts for files opened in non-blocking mode: if there is no data, we return -EAGAIN.
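From the application's side, the -EAGAIN path looks like a failed read() with errno set to EAGAIN. The sketch below shows this in user space, with an ordinary pipe standing in for /dev/reverse (an assumption made so that the example runs without the module); the helper name is hypothetical.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Read from an empty, non-blocking pipe and report the errno that read()
 * set, or 0 if the read unexpectedly did not fail.
 */
static int errno_on_empty_nonblocking_read(void)
{
    int fds[2];
    char c;
    int saved = 0;

    if (pipe(fds) != 0)
        return errno;

    /* Put the read end into non-blocking mode, as O_NONBLOCK at open() would. */
    fcntl(fds[0], F_SETFL, fcntl(fds[0], F_GETFL) | O_NONBLOCK);

    /* No data has been written, so a blocking read would sleep here. */
    if (read(fds[0], &c, 1) == -1)
        saved = errno;

    close(fds[0]);
    close(fds[1]);
    return saved;
}
```

Judging by the kernel code above, opening /dev/reverse with O_RDWR | O_NONBLOCK and reading before anything has been written should fail in the same way.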

We can't use if() instead of while(), since there may be many processes waiting for the data. When the write method wakes them up, the scheduler chooses the one to run in an unpredictable way, so by the time this code gets a chance to execute, the buffer may be empty again. Now we need to copy the data from the buffer to user space. The copy_to_user() kernel function does just that:

        size = min(size, (size_t) (buf->end - buf->read_ptr));
        if (copy_to_user(out, buf->read_ptr, size)) {
            result = -EFAULT;
            goto out;
        }

The call can fail if the user-space pointer is wrong; if that happens, we return -EFAULT. Remember not to trust anything that comes from outside the kernel!

        buf->read_ptr += size;
        result = size;
    out:
        return result;
    }

Simple arithmetic is needed so that the data can be read in arbitrary chunks. The method returns the number of bytes read, or an error code.
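The chunked-read arithmetic can be checked in isolation. Below is a user-space sketch under stated assumptions: memcpy() stands in for copy_to_user(), and buffer_read() is a hypothetical helper, not a function from the module.

```c
#include <string.h>
#include <sys/types.h>

struct buffer {
    char *data, *end, *read_ptr;
    unsigned long size;
};

/*
 * Copy at most size bytes of unread data to out and advance read_ptr.
 * Returns the number of bytes copied (0 once the buffer is drained).
 */
static ssize_t buffer_read(struct buffer *buf, char *out, size_t size)
{
    size_t avail = (size_t)(buf->end - buf->read_ptr);

    if (size > avail)
        size = avail;                   /* same role as min() in the module */
    memcpy(out, buf->read_ptr, size);   /* stand-in for copy_to_user() */
    buf->read_ptr += size;
    return (ssize_t)size;
}
```

Reading "World Hello" five bytes at a time yields "World", then " Hell", then "o". Unlike the module, which sleeps when the buffer is empty, this sketch simply returns 0.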

The write method is shorter. First, we check that the buffer has enough space, then we use the copy_from_user() function to get the data. Then the read_ptr and end pointers are reset, and the buffer contents are reversed:

    buf->end = buf->data + size;
    buf->read_ptr = buf->data;
    if (buf->end > buf->data)
        reverse_phrase(buf->data, buf->end - 1);

Here, reverse_phrase() does all the heavy lifting. It relies on the reverse_word() function, which is quite short and marked inline. This is another common optimization; however, you shouldn't overuse it, since aggressive inlining makes the kernel image unnecessarily large.
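The word-order reversal itself can be tried out in user space first. The following is a minimal sketch that assumes the usual two-pass technique (reverse the whole phrase, then reverse each word back); it is not necessarily identical to the code in the GitHub repository. The names reverse_word() and reverse_phrase() follow the article, while reverse_words() is a hypothetical convenience wrapper, and treating a single space as the only separator is an assumption.

```c
#include <assert.h>
#include <string.h>

/* Reverse the characters in the inclusive range [start, end] in place. */
static inline void reverse_word(char *start, char *end)
{
    while (start < end) {
        char tmp = *start;
        *start++ = *end;
        *end-- = tmp;
    }
}

/*
 * Reverse the order of space-separated words in [start, end] in place.
 * Pass 1 reverses the whole range; pass 2 reverses each word back, so
 * the letters read normally while the word order is flipped.
 */
static char *reverse_phrase(char *start, char *end)
{
    char *word_start = start;
    char *p;

    assert(start <= end);
    reverse_word(start, end);

    for (p = start; p <= end; p++) {
        if (*p == ' ') {
            if (p > word_start)
                reverse_word(word_start, p - 1);
            word_start = p + 1;
        }
    }
    if (word_start <= end)
        reverse_word(word_start, end);
    return start;
}

/* Hypothetical wrapper for NUL-terminated strings; empty input is a no-op. */
static char *reverse_words(char *s)
{
    size_t len = strlen(s);

    if (len > 0)
        reverse_phrase(s, s + len - 1);
    return s;
}
```

The length check in reverse_words() plays the same role as the buf->end > buf->data test in the write method: with no data there is no valid last byte to pass as the end pointer.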

Finally, we need to wake up the processes waiting for data on read_queue, as described earlier. wake_up_interruptible() does just that:

    wake_up_interruptible(&buf->read_queue);

Phew! You now have a kernel module that at least compiles successfully. Now it's time to test it.

Debugging kernel code

Perhaps the most common debugging method in the kernel is printing. You can use plain printk() (presumably with the KERN_DEBUG log level) if you wish. However, there are better ways. Use pr_debug(), or dev_dbg() if you are writing a device driver that has its own "struct device": they support the dynamic debug (dyndbg) feature and can be enabled or disabled on request (see Documentation/dynamic-debug-howto.txt). For pure development messages, use pr_devel(), which becomes a no-op unless DEBUG is defined. To enable DEBUG for our module, add the following line to the Makefile:

    CFLAGS_reverse.o := -DDEBUG

With that done, use dmesg to view the debug messages generated by pr_debug() or pr_devel(). Alternatively, you can send debug messages directly to the console. To do this, either set the console_loglevel kernel variable to 8 or greater (echo 8 > /proc/sys/kernel/printk), or temporarily print the debug message in question at a high log level such as KERN_ERR. Naturally, you should remove debug statements of this kind before publishing your code.

Note that kernel messages appear on the console, not in a terminal emulator window such as xterm; this is one reason why you'll find recommendations not to do kernel development in the X environment.

Surprise, surprise!

Compile the module and load it into the kernel:

    $ make
    $ sudo insmod reverse.ko buffer_size=2048
    $ lsmod
    reverse 2419 0
    $ ls -l /dev/reverse
    crw-rw-rw- 1 root root 10, 58 Feb 22 15:53 /dev/reverse

Everything seems to be in place. Now, to test how the module works, we'll write a small program that reverses its first command-line argument. The main() function (sans error checking) may look like this:

    int fd = open("/dev/reverse", O_RDWR);
    write(fd, argv[1], strlen(argv[1]));
    read(fd, argv[1], strlen(argv[1]));
    printf("Read: %s\n", argv[1]);

Run it like this:

    $ ./test 'A quick brown fox jumped over the lazy dog'
    Read: dog lazy the over jumped fox brown quick A

It works! Play with it a little: try passing single-word or single-letter phrases, empty or non-English strings (if you have a keyboard layout set up for them), and anything else you can think of.
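The test program's main() skips error checking. A more defensive variant might look like the sketch below; the helper name reverse_via_device() and its return convention are assumptions made for illustration, not part of the original test program.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Write phrase to the device at path and read the result back into the
 * same buffer. Returns 0 on success and -1 on any error.
 */
static int reverse_via_device(const char *path, char *phrase)
{
    size_t len = strlen(phrase);
    ssize_t n;
    int ret = -1;
    int fd;

    if (len == 0)
        return 0;            /* nothing to reverse; a read would find no data */

    fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;           /* e.g. the module is not loaded */

    if (write(fd, phrase, len) != (ssize_t)len)
        goto out;

    n = read(fd, phrase, len);
    if (n < 0)
        goto out;
    phrase[n] = '\0';        /* n <= len, so this stays inside the string */
    ret = 0;
out:
    close(fd);
    return ret;
}
```

A caller would pass "/dev/reverse" as the path and print the phrase only when the function returns 0; on failure the original string is left as it was unless the read itself had already started overwriting it.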

Now let's make things a little trickier. We'll create two processes that share a file descriptor (and hence the kernel buffer). One will continuously write strings to the device, and the other will read them. The fork(2) system call is used in the example below, but pthreads would work as well. The code that opens and closes the device, as well as the error checking, is omitted here (again):

    char *phrase = "A quick brown fox jumped over the lazy dog";

    if (fork())
        /* Parent is the writer */
        while (1)
            write(fd, phrase, len);
    else
        /* Child is the reader */
        while (1) {
            read(fd, buf, len);
            printf("Read: %s\n", buf);
        }

What do you expect this program to output? Below is what I got on my laptop:

    Read: dog lazy the over jumped fox brown quick A
    Read: A kcicq brown fox jumped over the lazy dog
    Read: A kciuq nworb xor jumped fox brown quick A
    Read: A kciuq nworb xor jumped fox brown quick A
    ...

What's going on here? It's a race. We assumed that read and write were atomic, that is, executed in one go from beginning to end. However, the kernel is a concurrent beast, and it can easily reschedule the process running the kernel-mode part of the write operation somewhere inside the reverse_phrase() function. If the process that does the read() is scheduled before the writer gets a chance to finish, it sees the data in an inconsistent state. Such bugs are really hard to track down. But how do we fix this one?

Basically, we need to ensure that no read method can be executed until the write method returns. If you have ever written a multithreaded application, you have probably seen synchronization primitives (locks) such as mutexes or semaphores. Linux has them as well, but there are nuances. Kernel code can run in process context (working "on behalf" of user-space code, as our methods do) and in interrupt context (for example, in an IRQ handler). If you are in process context and the lock you need is already taken, you simply sleep and retry until you succeed. You can't sleep in interrupt context, so the code spins in a loop until the lock becomes available. The corresponding primitive is called a spinlock, but in our case a simple mutex, an object that only one process can "hold" at any given time, is sufficient. Real-world code may also use a read-write semaphore for performance reasons.

Locks always protect some data (in our case, a "struct buffer" instance), and it is very common to embed them in the structure they protect. So we add a mutex ("struct mutex lock") to "struct buffer". We must also initialize the mutex with mutex_init(); buffer_alloc() is a good place for that. Code that uses mutexes must also include linux/mutex.h.

A mutex is much like a traffic light: it is useless unless drivers look at it and obey the signals. So we need to update reverse_read() and reverse_write() to acquire the mutex before doing anything to the buffer and to release it when they are done. Let's have a look at the read method; write works just the same way:

    static ssize_t reverse_read(struct file *file, char __user *out,
                                size_t size, loff_t *off)
    {
        struct buffer *buf = file->private_data;
        ssize_t result;

        if (mutex_lock_interruptible(&buf->lock)) {
            result = -ERESTARTSYS;
            goto out;
        }

We acquire the lock at the very beginning of the function. mutex_lock_interruptible() either grabs the mutex and returns, or puts the process to sleep until the mutex is available. As before, the _interruptible suffix means that the sleep can be interrupted by a signal.

        while (buf->read_ptr == buf->end) {
            mutex_unlock(&buf->lock);
            /* ... wait_event_interruptible() here ... */
            if (mutex_lock_interruptible(&buf->lock)) {
                result = -ERESTARTSYS;
                goto out;
            }
        }

This is our "wait for data" loop. You should never wait for data while holding the mutex, or a situation called a "deadlock" may occur: the writer would need the same mutex to supply that data. So, if there is no data, we release the mutex and call wait_event_interruptible(). When it returns, we reacquire the mutex and continue as usual:

        if (copy_to_user(out, buf->read_ptr, size)) {
            result = -EFAULT;
            goto out_unlock;
        }
        ...
    out_unlock:
        mutex_unlock(&buf->lock);
    out:
        return result;
    }

Finally, the mutex is unlocked when the function ends or if an error occurs while the mutex is held. Recompile the module (don't forget to reload it) and run the second test again. You shouldn't see any corrupted data this time.
