
How to inject I/O faults

2025-04-20 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)05/31 Report--

Most people are not very familiar with the topic of this article, "how to inject I/O faults", so the editor has summarized it below in detail and with clear steps. I hope you gain something from reading it.

In production environments, file system errors often occur because of disk failures, operator mistakes and other causes. Chaos Mesh has long provided the ability to inject file system errors. Users only need to add an IOChaos resource to make file system operations on the specified files fail or return wrong data. Before Chaos Mesh 1.0, using IOChaos required injecting a sidecar container into the Pod and rewriting the startup command; even when no fault was being injected, the injected sidecar caused significant performance overhead. Chaos Mesh 1.0 provides the ability to inject file system errors at run time, making IOChaos as easy to use as every other type of Chaos.

Prerequisites

The content of this article assumes that you already have the following knowledge. Of course, you don't have to read it right now, but you can look back and learn when you come across a noun you haven't seen before.

I will try my best to provide relevant learning materials, but I will not distill and repeat them here: first, because this knowledge can be found with a simple Google search; second, because most of the time learning from first-hand material works far better than learning from second-hand material, and learning from n-th-hand material works far better than learning from (n+1)-th-hand material.

FUSE: Wikipedia, man(4)

mount_namespaces: man page, Kubernetes mount propagation

x86 assembly language: Wikipedia

mount: man(2), especially MS_MOVE

Mutating admission webhooks: Kubernetes documentation

syscall: man(2), take a look at the calling convention

ptrace: man(2)

Device nodes and char devices: "Device names, device nodes, and major/minor numbers"

Reading articles related to TimeChaos is also very helpful in understanding this article, because they use similar techniques.

In addition, I hope that when reading this document, the reader will actively think about the cause and effect of each step. There is no complex knowledge that requires the brain to run at a high speed, only a step-by-step guide to action. I also hope you can keep thinking in your head, "what should I do if I want to implement runtime filesystem injection myself?" In this way, this article will change from simple indoctrination to the exchange of ideas, which will be much more interesting.

Error injection

A common way to find an injection point is to first observe the call path when nothing is injected. While implementing TimeChaos, we observed how applications get the time and learned that most programs read it through the vDSO, so we chose to modify part of the vDSO memory of the target program to change the time.

So, when the application issues system calls such as read and write and these requests reach the target file system, is there a point where we can inject? As a matter of fact, there is. You can inject into the relevant system calls using bpf, but that cannot be used to inject delay. Another way is to add another file system layer in front of the target file system, which we will call ChaosFS for the time being:

ChaosFS takes the original target file system as its back end and accepts write requests from the operating system, making the whole call chain Target Program syscall -> Linux Kernel -> ChaosFS -> Target Filesystem. Because we control the implementation of the ChaosFS file system, we can add delays and return errors at will.
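To make the layering idea concrete, here is a toy in-process model of it. It is not a FUSE file system and not the Chaos Mesh implementation; the class name ChaosLayer and its parameters delay and fail_with are hypothetical names chosen for this sketch. It only shows how a layer that sits in front of a backend directory can add latency or return an error before forwarding the request.

```python
import errno
import os
import time


class ChaosLayer:
    """Toy stand-in for ChaosFS: forwards requests to a backend directory.

    Hypothetical illustration only. A real ChaosFS would be a FUSE file
    system mounted in front of the target file system.
    """

    def __init__(self, backend_dir, delay=0.0, fail_with=None):
        self.backend_dir = backend_dir
        self.delay = delay          # seconds added to every request
        self.fail_with = fail_with  # errno value to return, or None

    def _inject(self):
        # Delay first, then fail, so both faults can be combined.
        if self.delay:
            time.sleep(self.delay)
        if self.fail_with is not None:
            raise OSError(self.fail_with, os.strerror(self.fail_with))

    def read(self, rel_path):
        self._inject()
        with open(os.path.join(self.backend_dir, rel_path), "rb") as f:
            return f.read()

    def write(self, rel_path, data):
        self._inject()
        with open(os.path.join(self.backend_dir, rel_path), "wb") as f:
            return f.write(data)
```

With fail_with=errno.EIO every request fails with EIO; with fail_with=None and delay=0.0 the layer is a plain pass-through.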

If you are already thinking about your own implementation of file system error injection at this point, you have probably noticed some problems:

If ChaosFS itself reads and writes files on the target file system, its mount path must be different from the path of the target directory, because the mount path is almost the only way to access a file system.

That is, if the target program wants to write to /mnt/a, then ChaosFS has to be mounted on /mnt/a, and the target directory cannot be /mnt/a! But the Pod configuration says to mount the target file system on /mnt. What to do?

This does not meet the requirements of runtime injection. If the target program has already opened some files on the original target file system, the newly mounted file system only takes effect for files opened afterwards (not to mention the path conflict described above). To inject file system errors into the target program, you would have to mount ChaosFS before the target process starts.

We also have to find a way to mount the file system into the mnt namespace of the target container.

The original IOChaos solved these three problems with a Mutating Webhook:

Use the Mutating Webhook to first run a script in the target container that moves the directory, for example from /mnt/a to /mnt/a_bak. The storage back end of ChaosFS can then be the /mnt/a_bak directory, and ChaosFS itself can be mounted on /mnt/a.

Use the Mutating Webhook to modify the startup command of the Pod. For example, if the startup command is /app, we change it to /waitfs.sh /app. The waitfs.sh script that we provide keeps checking whether the file system has been mounted, and starts /app once it has.

Naturally, we also use the Mutating Webhook to add an extra container to the Pod to run ChaosFS. The container running ChaosFS needs to share a volume, such as /mnt, with the target container, and then mounts ChaosFS on the target directory, such as /mnt/a. At the same time, the appropriate mount propagation is enabled so that the mount in the ChaosFS container's volume is shared to the host and then propagated as a slave mount from the host to the target container. If you know about mnt namespaces and mount, you will know what share and slave mean.
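The startup wrapper from the second item can be sketched as follows. The original waitfs.sh is a shell script whose contents are not shown in this article, so this Python version is only an assumption about its behavior: poll until the path is a mount point, then replace the wrapper with the real command. The function names wait_for_mount and run_after_mount are hypothetical.

```python
import os
import time


def wait_for_mount(path, timeout=30.0, interval=0.1):
    """Poll until `path` is a mount point; return False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        if os.path.ismount(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def run_after_mount(path, command):
    """Wait for the mount, then exec the original startup command.

    `command` is a list such as ["/app", "--flag"]. On success this call
    does not return, because execvp replaces the current process.
    """
    if not wait_for_mount(path):
        raise TimeoutError(f"{path} was not mounted in time")
    os.execvp(command[0], command)
```

This also shows why the command has to be written explicitly in the Pod spec: the wrapper needs to know what to exec once the mount appears.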

In this way, injection into the I/O path of the target program is complete. But it is very hard to use:

You can only inject a subdirectory of a volume, not the entire volume.

The command must be written explicitly in the Pod spec rather than implicitly taken from the image. If the command implied by the image is used, /waitfs.sh does not know how to start the application after the mount succeeds.

The corresponding container must have a sufficient mount propagation configuration. Of course we could add it quietly in the Mutating Webhook, but tampering with the user's container is never a good idea (and may even cause security problems).

There are too many things to fill in the injection configuration! It's troublesome to configure. And after the configuration is complete, a new pod must be created before it can be injected.

You cannot exit ChaosFS at run time, so even without delays or errors, it still has a significant impact on performance.

The first problem can be overcome: as long as mount move is used instead of mv (rename), the mount point of the target volume itself can be moved. The remaining problems are not so easy to overcome.

Injecting errors at run time

Combine the other knowledge you have (such as namespaces and the use of ptrace), re-examine these problems, and you can find a solution. We relied entirely on the Mutating Webhook to construct the previous implementation, but most of its drawbacks were also caused by the Mutating Webhook approach. (If you like, you can call it the sidecar approach. Many projects use that name, but it hides what is actually going on and does not save many words, so I do not like it very much.) Next we will show how to do this without using a Mutating Webhook.

Invade the namespace

The purpose of using the Mutating Webhook to add a container running ChaosFS is to mount the file system into the target container through the mount propagation mechanism. This is not the only way to achieve that: we can also directly use the setns system call provided by Linux to change the namespace of the current process. In fact, most of Chaos Mesh uses the nsenter command, the setns system call and similar means to enter the namespace of the target container instead of adding a container to the Pod, because this is more convenient to use and more flexible to develop.

In other words, you can first let the current thread enter the mnt namespace of the target container through setns, and then call system calls such as mount to complete the mount of the ChaosFS in this namespace.

Suppose the file system we need to inject into is /mnt:

Let the current thread enter the mnt namespace of the target container through setns

Move /mnt to /mnt_bak via mount --move

Mount ChaosFS on /mnt with /mnt_bak as the storage back end.

As you can see, the injection process is almost complete at this point: if the target container now opens, reads or writes files in /mnt, the requests pass through ChaosFS, which can inject delays or errors into them.
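The first two steps above can be sketched with the libc setns and mount functions called through ctypes. This is a minimal sketch rather than the Chaos Mesh code: the helper names are hypothetical, it must run as root in a single-threaded process, and mounting ChaosFS itself (the third step) is left out. MS_MOVE can also fail depending on the propagation type of the parent mount, which the sketch does not handle.

```python
import ctypes
import os

CLONE_NEWNS = 0x00020000  # namespace type for setns: mount namespace
MS_MOVE = 8192            # mount flag: move an existing mount point

_libc = ctypes.CDLL(None, use_errno=True)


def _check(ret):
    # libc returns -1 and sets errno on failure.
    if ret != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def backup_path(target):
    """Hypothetical naming rule: /mnt -> /mnt_bak."""
    return target.rstrip("/") + "_bak"


def enter_mount_namespace(pid):
    """Step 1: join the mnt namespace of the process `pid`."""
    fd = os.open(f"/proc/{pid}/ns/mnt", os.O_RDONLY)
    try:
        _check(_libc.setns(fd, CLONE_NEWNS))
    finally:
        os.close(fd)


def move_mount(old, new):
    """Step 2: the equivalent of `mount --move old new`."""
    os.makedirs(new, exist_ok=True)
    _check(_libc.mount(old.encode(), new.encode(), None,
                       ctypes.c_ulong(MS_MOVE), None))


def prepare_injection(pid, target="/mnt"):
    """Run steps 1 and 2 and return the path to use as the back end."""
    enter_mount_namespace(pid)
    backend = backup_path(target)
    move_mount(target, backend)
    return backend  # step 3 would mount ChaosFS on `target` over this
```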

But two problems remain:

What about files that are already open by the target process?

How to recover? After all, it is not possible to umount when a file is opened.

We will solve both problems with the same approach: use ptrace to replace the fds that are already open at run time. (This article takes fd as the example. In fact, besides fds, the cwd and mmaps also need to be replaced; the implementation is similar, so it is not described separately.)

Dynamic replacement of fd

We mainly use ptrace to dynamically replace fd. Before introducing specific methods, we might as well feel the power of ptrace:

How can you use ptrace to make the tracee (the thread being traced) run an arbitrary system call? With some knowledge of ptrace and x86_64 this is not difficult. ptrace can modify registers, and the rip register (instruction pointer) on the x86_64 architecture always points to the address of the next instruction to run. So you only need to change the memory that rip currently points to into 0x050f (the syscall instruction), set each register to the system call number or argument required by the system call calling convention, and then single-step with ptrace. You can read the return value of the system call from the rax register. Remember to restore both the registers and the modified memory after the call completes.

The process above uses ptrace requests such as POKETEXT, SETREGS, GETREGS and SINGLESTEP. If you are not familiar with them, consult the ptrace manual.
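The bookkeeping in that procedure is easy to get wrong, so here is a small sketch of just the data preparation, with no actual ptrace call. Registers are modeled as a plain dict, which is an assumption of this sketch; the function names are hypothetical. It shows the x86_64 syscall calling convention (number in rax, arguments in rdi, rsi, rdx, r10, r8, r9) and why the word written to memory is 0x050f: on a little-endian machine it is stored as the bytes 0f 05, the encoding of syscall.

```python
import struct

SYSCALL_WORD = 0x050F  # stored little-endian as bytes 0f 05 ("syscall")
SYS_DUP2 = 33          # x86_64 system call number of dup2

# x86_64 system call calling convention: argument registers in order.
ARG_REGISTERS = ("rdi", "rsi", "rdx", "r10", "r8", "r9")


def syscall_registers(saved_regs, number, *args):
    """Return a copy of the saved registers set up for one system call.

    `saved_regs` is what GETREGS returned; it is left untouched so it can
    be written back with SETREGS after the single step.
    """
    if len(args) > len(ARG_REGISTERS):
        raise ValueError("a system call takes at most 6 arguments")
    regs = dict(saved_regs)
    regs["rax"] = number
    for name, value in zip(ARG_REGISTERS, args):
        regs[name] = value
    return regs


def patch_word(original_word):
    """Replace the two lowest bytes of the 8-byte word at rip.

    `original_word` is the value read with PEEKTEXT; the result is what
    would be written with POKETEXT. The original value must be kept so
    the memory can be restored afterwards.
    """
    return (original_word & ~0xFFFF) | SYSCALL_WORD
```

For example, syscall_registers(saved, SYS_DUP2, 2, 1) prepares the registers for dup2(2, 1) while leaving rip and every other saved register as it was.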

How can you use ptrace to make the tracee (the target process of ptrace) run an arbitrary binary program?

The idea is similar. As with running a system call, you can overwrite the memory following rip with the program you want to run, and add an int3 instruction at the end of the program to trigger a breakpoint. After the program has finished, restore the registers and memory of the target program.

In fact, we can choose a slightly cleaner approach: use ptrace to call mmap in the target program, allocate the needed memory, and then write the binary to the newly allocated memory area, pointing the rip to it. Calling munmap at the end of the run keeps the memory area clean.

In practice, we write with process_vm_writev instead of ptrace POKETEXT, which is more stable and efficient when writing large amounts of data.

With the above means, if a process has its own way to replace its own fd, then through ptrace, it can run the same program to replace fd. This makes the problem simple: we just need to find a way for the process to replace its own fd. If you are familiar with Linux's system calls, you will find the answer right away: dup2.

Replace fd with dup2

The function signature of dup2 is int dup2(int oldfd, int newfd); it creates a copy of oldfd whose fd number is newfd. If newfd is already an open fd, it is closed automatically.

Suppose the process currently has /var/run/__chaosfs__test__/a open as fd 1 and wants to replace it with /var/run/test/a. It needs to do the following:

Use the fcntl system call to get the OFlags of /var/run/__chaosfs__test__/a (that is, the flags passed to the open call, such as O_WRONLY)

Use the lseek system call to get the current seek position

Use the open system call to open /var/run/test/a with the same OFlags; assume the new fd is 2

Use lseek to set the seek position of the newly opened fd 2

Use dup2(2, 1) to replace fd 1 of /var/run/__chaosfs__test__/a with the newly opened fd 2

Close fd 2.

After that, fd 1 of the current process points to /var/run/test/a, and any operation on it passes through FUSE and can be injected with errors.
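The six steps above translate almost one to one into code when a process does this to itself. The following sketch uses the Python wrappers of the same system calls; the function name replace_fd is hypothetical, and the real implementation runs the equivalent sequence as injected machine code inside the target process, not as Python.

```python
import fcntl
import os


def replace_fd(fd, new_path):
    """Make `fd` refer to `new_path`, keeping its flags and offset."""
    # 1. Get the status flags and access mode the fd was opened with.
    flags = fcntl.fcntl(fd, fcntl.F_GETFL)
    # 2. Get the current seek position without moving it.
    offset = os.lseek(fd, 0, os.SEEK_CUR)
    # 3. Open the replacement file with the same flags.
    new_fd = os.open(new_path, flags)
    try:
        # 4. Move the new fd to the same position.
        os.lseek(new_fd, offset, os.SEEK_SET)
        # 5. dup2 closes the old file behind `fd` and points `fd` at the
        #    new one.
        os.dup2(new_fd, fd)
    finally:
        # 6. The temporary fd is no longer needed.
        os.close(new_fd)
```

Code that keeps using the same fd number does not notice the switch: a read after replace_fd continues at the same offset, but in the new file. Note that lseek only makes sense for seekable files, so pipes and sockets would need different handling.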

Use ptrace to have the target process run a program that replaces fd

So as long as you combine the knowledge of "using ptrace to make tracee run arbitrary binary programs" and the method of "using dup2 to replace the fd you have opened", you can let tracee replace the open fd itself!

Following the steps described above and using the syscall instruction, it is easy to write the corresponding assembly code. You can see the corresponding source code here, and you can turn it into a usable binary program with an assembler (we use dynasm-rs). Then use ptrace to let the target process run this program, and the fd is replaced at run time.

Readers can think a little bit about how to change cwd and replace mmap in a similar way. Their processes are completely similar.

Note: the implementation assumes that the target program uses POSIX threads and that open files are shared between the target process and its threads, that is, clone was called with CLONE_FILES when the threads were created. Therefore the fds are replaced only through the first thread of each thread group.

Process overview

After understanding all these techniques, the idea behind run-time file system injection should gradually become clear. In this section I will directly show the flowchart of the entire injection implementation:

Parallel lines represent different threads, from left to right in chronological order. You can see that it is necessary to arrange the tasks of "mounting / unmounting the file system" and "replacing resources such as fd" in a more detailed sequence. Why? If the reader's understanding of the whole process is clear enough, try to think about the answer for yourself.

Minor issues

mmap failures that may be caused by the mnt namespace

Is an existing mmap still valid after switching the mnt namespace? For example, an mmap points to /a/b, and /a/b disappears after the mnt namespace switch. Will accessing the mmap again cause an unexpected crash? Note that dynamically linked libraries are all loaded into memory through mmap, so would accessing them be a problem?

As a matter of fact, there is no problem. This has to do with how the mnt namespace works and what it is for. The mnt namespace only controls what a thread can see. When setns is called, the kernel modifies the vfsmount pointer in the task_struct of the thread, so that when the thread uses any system call that takes a path (such as open or rename), the Linux kernel looks up the file (as a file structure) from the path name through vfsmount, and this lookup is affected by the namespace. For an fd that is already open (pointing to a file structure), operations such as write and read go directly through the function pointers of the corresponding file system and are not affected by the namespace. For an existing mmap (pointing to an address_space structure), operations such as writepage and readpage also go directly through the function pointers of the corresponding file system and are not affected by the namespace.

Range of injection

It is not feasible to pause every process running on the machine during injection and check the fds, mmaps and other resources they have open; the overhead would be unacceptable. In practice, we first enter the pid namespace of the target container, then pause and check all processes visible in that namespace.

So the scope of injection and recovery is all processes in the pid namespace. Switching the pid namespace means that you have to set the pid namespace for child processes first and then clone (because Linux does not allow changing the pid namespace of the current process), which brings a number of problems.

Switching namespace has some restrictions on clone flag

Switching the mnt namespace is not allowed for a process created by clone with CLONE_FS. Once the pid namespace for child processes has been set, clone is not allowed to carry CLONE_THREAD. To deal with this, we chose to modify the source code of glibc. You can find the modified glibc source code in chaos-mesh/toda-glibc. Only the flags passed to clone in the pthread part are modified.

Without CLONE_THREAD and CLONE_FS, pthread behaves quite differently from before. The biggest difference is that a new pthread thread is no longer a task of the original process but a new process with a different tgid. The relationship between pthread threads thus changes from that between a process and its tasks to that between a process and its child processes. This can lead to some problems, such as the need for additional cleanup of child processes when exiting.

In lower versions of the kernel, processes with different pid namespace are not allowed to share SIGHAND, so CLONE_SIGHAND needs to be removed.

Why not use nsenter

In chaos-daemon, many operations that need to run in the target namespace are done through nsenter, such as nsenter iptables. nsenter, however, cannot handle the IOChaos scenario: if the process is already in the target mnt namespace when it starts, it cannot find the dynamically linked libraries it needs (such as libfuse.so and the custom glibc).

Constructing /dev/fuse

The target container does not necessarily have /dev/fuse (in fact, it most likely does not), so mounting FUSE after entering the mnt namespace of the target container will fail. You therefore need to construct /dev/fuse after entering the target's mnt namespace. This is easy because the major and minor numbers of fuse are fixed at 10 and 229, so you can create /dev/fuse with just the makedev function and the mknod system call.
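A minimal sketch of that construction, using the Python wrappers os.makedev and os.mknod. The function names are hypothetical, creating a device node requires root (CAP_MKNOD), and the 0o666 mode is an assumption of this sketch that is further restricted by the process umask.

```python
import os
import stat

FUSE_MAJOR = 10   # major number of misc character devices
FUSE_MINOR = 229  # fixed minor number of /dev/fuse


def fuse_device_number():
    """Combine major and minor into the dev_t value mknod expects."""
    return os.makedev(FUSE_MAJOR, FUSE_MINOR)


def ensure_dev_fuse(path="/dev/fuse"):
    """Create the FUSE character device if it is missing.

    Returns True if the node was created, False if it already existed.
    Must be called after entering the target's mnt namespace.
    """
    if os.path.exists(path):
        return False
    os.mknod(path, stat.S_IFCHR | 0o666, fuse_device_number())
    return True
```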

The problem of waiting for the child process to die after removing the CLONE_THREAD

When a child process dies, a SIGCHLD signal is sent to its parent process to notify it of the death. If the parent does not handle this signal properly (by explicitly ignoring it or by calling wait in the signal handler), the child process stays in the defunct state.

In our scenario, the problem becomes more complicated: when the parent of a process dies, its parent is reset to process 1 of its pid namespace. Generally speaking, a good init process (such as systemd) will be responsible for cleaning up these defunct processes, but in the container scenario, the application as pid 1 is not designed to be a good init process and will not be responsible for handling these defunct processes.

To solve this problem, we use the subreaper mechanism: when the parent of a process dies, the process is reparented not directly to process 1 but to the nearest subreaper in the process tree. We then use wait to wait for all child processes to die before exiting.
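The subreaper mechanism is enabled with prctl(PR_SET_CHILD_SUBREAPER). The sketch below calls it through ctypes and then reaps every remaining child; the function names are hypothetical and this is only an illustration of the mechanism, not the code used in Chaos Mesh.

```python
import ctypes
import os

PR_SET_CHILD_SUBREAPER = 36  # from <linux/prctl.h>

_libc = ctypes.CDLL(None, use_errno=True)


def become_subreaper():
    """Make this process adopt orphaned descendants instead of pid 1."""
    ret = _libc.prctl(PR_SET_CHILD_SUBREAPER, ctypes.c_ulong(1),
                      ctypes.c_ulong(0), ctypes.c_ulong(0),
                      ctypes.c_ulong(0))
    if ret != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def reap_all_children():
    """Block until every child, including adopted ones, has exited.

    Returns the list of reaped pids.
    """
    reaped = []
    while True:
        try:
            pid, _status = os.wait()
        except ChildProcessError:
            # ECHILD: no children are left to wait for.
            return reaped
        reaped.append(pid)
```

Calling become_subreaper() early and reap_all_children() just before exiting ensures that no defunct process is left behind for the container's pid 1 to deal with.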

waitpid behaves differently in different kernel versions

waitpid behaves differently in different versions of the kernel. In earlier kernel versions, calling waitpid on a tracee that is a child thread (that is, a thread other than the main thread) fails with ECHILD. The reason has not been determined, and no relevant documentation has been found.

