2025-04-05 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/01 Report--
How do Linux containers work under the hood? This article walks through the underlying mechanisms in detail, in the hope of giving readers who want to understand containers a simple and practical path in.
1. Container
A container is a lightweight virtualization technology: compared with a virtual machine, it removes the hypervisor layer. The following diagram briefly describes how a container starts.
At the bottom is a disk, on which container images are stored. Above that is a container engine, which can be docker or another engine. The engine issues a request, for example to create a container, and turns the image on disk into a running process on the host.
For a container, the key question is how to guarantee that the resources used by this process are isolated and limited. On the Linux kernel this is ensured by the cgroup and namespace technologies. Below, taking docker as an example, we introduce resource isolation and container images in detail.
Resource isolation and restriction: namespace
Namespaces provide resource isolation. The Linux kernel has seven namespaces, of which docker uses the first six. The seventh, the cgroup namespace, is not used by docker itself, but it is implemented in runC.
Let's take a look at it from the beginning:
The first is the mount namespace. The mount namespace ensures that the file system view the container sees is the one provided by its image, i.e. it cannot see other files on the host. The exception is directories bound in with the -v parameter, which makes selected host directories and files visible inside the container.
The second is the uts namespace, which mainly isolates the hostname and domain name.
The third is the pid namespace, which ensures that the container's init process starts as PID 1.
The fourth is the network namespace. Except for containers in host network mode, every network model has its own network namespace file.
The fifth is the user namespace, which controls the mapping of user UIDs and GIDs between the container and the host; this namespace is rarely used.
The sixth is the IPC namespace, which controls inter-process communication resources, such as semaphores.
The seventh is the cgroup namespace. The two diagrams on the right of the image above show cgroup namespace turned on and off. One benefit of enabling it is that the cgroup view seen inside the container is presented as rooted, just like the cgroup view a process on the host sees of its own namespace; another benefit is that using cgroups inside the container becomes safer.
Here we briefly use unshare to illustrate how a namespace is created. Namespace creation for containers is in fact done with the unshare system call.
The top half of the figure shows the usage of the unshare command, and the bottom half is a pid namespace I actually created with unshare. You can see that the bash process is in a new pid namespace, and ps shows that this bash now has pid 1, indicating that it is a new pid namespace.
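The unshare demo above can be sketched programmatically. Below is a minimal Python sketch of calling unshare(2) through libc; the flag values are the standard Linux clone flags, and the actual call requires CAP_SYS_ADMIN (root) for most namespace types, so this is illustrative rather than something to run casually.

```python
# Sketch of creating new namespaces with the unshare(2) syscall, the same
# mechanism behind `unshare --fork --pid --mount-proc bash`.
import ctypes
import os

# Standard Linux clone flags for each namespace type.
CLONE_FLAGS = {
    "mount": 0x00020000,  # CLONE_NEWNS
    "uts":   0x04000000,  # CLONE_NEWUTS
    "ipc":   0x08000000,  # CLONE_NEWIPC
    "user":  0x10000000,  # CLONE_NEWUSER
    "pid":   0x20000000,  # CLONE_NEWPID
    "net":   0x40000000,  # CLONE_NEWNET
}

def unshare_flags(namespaces):
    """OR together the clone flags for the requested namespace types."""
    flags = 0
    for name in namespaces:
        flags |= CLONE_FLAGS[name]
    return flags

def unshare(namespaces):
    """Call unshare(2) via libc; needs CAP_SYS_ADMIN for most flags.
    After unsharing "pid", the *next* child forked becomes PID 1 of the
    new pid namespace (the calling process itself keeps its old pid)."""
    libc = ctypes.CDLL(None, use_errno=True)
    if libc.unshare(unshare_flags(namespaces)) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
```

Note the detail the comment hedges at: unshare(CLONE_NEWPID) does not move the caller into the new pid namespace; only its subsequently forked child lands there as PID 1, which is why the command-line tool pairs --pid with --fork.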
cgroup: two cgroup drivers
Cgroups are mainly for resource limits. Docker containers can use one of two cgroup drivers: systemd and cgroupfs.
Cgroupfs is easier to understand. Say you want to limit memory usage or assign a CPU share: you write the pid into the corresponding cgroup's procs file, then write the resource limits into the corresponding memory cgroup file and CPU cgroup file.
The other driver is systemd. It exists because systemd itself can manage cgroups, so if systemd is used as the cgroup driver, all cgroup writes must go through the systemd interface; the cgroup files must not be modified by hand.
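To make the cgroupfs driver concrete, here is a small sketch of the file writes it boils down to. The paths assume a cgroup v1 hierarchy and a docker-style `/sys/fs/cgroup/<controller>/docker/<id>` layout; the function and container id are illustrative, not docker's actual code.

```python
# Sketch of the writes a cgroupfs-style driver performs: put limit values
# into the controller files, then join the process via cgroup.procs.
def cgroupfs_writes(container_id, pid, mem_limit_bytes, cpu_shares):
    """Return the (path, value) pairs to write, cgroup v1 layout assumed."""
    mem = f"/sys/fs/cgroup/memory/docker/{container_id}"
    cpu = f"/sys/fs/cgroup/cpu/docker/{container_id}"
    return [
        (f"{mem}/memory.limit_in_bytes", str(mem_limit_bytes)),
        (f"{cpu}/cpu.shares", str(cpu_shares)),
        (f"{mem}/cgroup.procs", str(pid)),  # attach the process to the group
        (f"{cpu}/cgroup.procs", str(pid)),
    ]

def apply(writes):
    """Actually perform the writes; needs root and existing cgroup dirs."""
    for path, value in writes:
        with open(path, "w") as f:
            f.write(value)
```

This also shows why the two drivers must not be mixed: under the systemd driver, these same files are owned and rewritten by systemd, so hand-written values would be silently lost.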
cgroups commonly used in containers
Next, take a look at the cgroups commonly used in containers. The Linux kernel itself provides many cgroup controllers, but docker containers use only the following six:
The first is CPU. The CPU controller generally sets cpu shares and cpuset to control CPU utilization.
The second is memory, which controls how much memory a process can use.
The third is device, which controls which devices are visible in the container.
The fourth is freezer. Like device, it exists for security. When you stop a container, freezer writes all current processes into the cgroup and freezes them. The purpose is to prevent any process from forking while the container is being stopped; in effect, it prevents processes from escaping to the host.
The fifth is blkio, which mainly limits the IOPS and bps rate of the disks used by the container. Note that with cgroup v1, blkio can only limit synchronous (direct) IO; buffered IO cannot be limited.
The sixth is the pid cgroup, which limits the maximum number of processes the container can use.
Uncommonly used cgroup
There are also cgroups that docker containers do not use. The distinction between commonly used and uncommonly used is specific to docker: for runC, all of these cgroups except rdma at the bottom are actually supported, but docker does not enable that support, so docker containers do not support the cgroups shown below.
2. Container image: docker images
Next, let's talk about container images, taking the docker image as an example to explain how a container image is composed.
A docker image is based on a union file system. Briefly, a union file system allows files to be stored in different layers, while a unified view ultimately shows all the files across those layers.
As shown in the image above, on the right is a diagram of container storage taken from docker's official website.
This diagram vividly shows that docker storage, based on a union file system, is layered. Each layer is a Layer made up of different files, and these Layers can be reused by other images. When an image is run as a container, the topmost layer becomes the container's read-write layer, and this read-write layer can in turn be committed to become the newest top layer of an image.
Docker image storage sits on top of different underlying file systems, so its storage drivers are customized per file system, such as AUFS, btrfs, devicemapper and overlay. Docker implements a corresponding graph driver for each of these, and through the graph driver the image is stored on disk.
Storage flow, taking overlay as an example
Next, let's take the overlay file system as an example and see how docker images are stored on disk.
Take a look at the following diagram, which briefly describes how the overlay file system works.
At the bottom is the lower layer, the image layer, which is read-only.
Above it on the right is the upper layer, the container's read-write layer. The upper layer uses a copy-on-write mechanism: only when a file needs to be modified is it copied up from the lower layer, and all subsequent modifications go to the upper-layer copy.
Alongside upper there is a workdir, which acts as an intermediate layer. When a copy in the upper layer is modified, the result is first put into workdir and then moved from workdir into upper. This is how overlay works.
At the top is mergedir, the unified view layer. From mergedir you can see the merge of all the data in upper and lower, and when we docker exec into the container, the file system we see is in fact this mergedir unified view layer.
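The four directories above map directly onto the options of an overlay mount. Here is a minimal sketch of assembling that mount; the directory names are made up for illustration, and the actual mount(2) call requires root.

```python
# Sketch of an overlay mount: lowerdir (read-only image layer),
# upperdir (container read-write layer), workdir (staging area),
# and the merged directory presenting the unified view.
import subprocess

def overlay_options(lowerdir, upperdir, workdir):
    """Build the option string that `mount -t overlay` expects."""
    return f"lowerdir={lowerdir},upperdir={upperdir},workdir={workdir}"

def mount_overlay(lowerdir, upperdir, workdir, merged):
    """Equivalent to: mount -t overlay overlay -o <opts> <merged> (root only)."""
    opts = overlay_options(lowerdir, upperdir, workdir)
    subprocess.run(
        ["mount", "-t", "overlay", "overlay", "-o", opts, merged],
        check=True,
    )
```

Note that workdir must live on the same file system as upperdir, since files are atomically moved between the two.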
File operation
Next, let's talk about how to operate the files in the container based on the storage of overlay.
Look at the read operation first. When a container has just been created, upper is in fact empty. A read at this point fetches all data from the lower layer.
For the write operation, as just mentioned, overlay's upper layer uses a copy-on-write mechanism. When certain files need to be modified, overlay performs a copy-up action that copies the file from the lower layer, and subsequent write modifications operate on that copy.
Now the delete operation. Overlay has no real delete operation. Its so-called deletion actually marks the file, and the top unified view layer, seeing the mark, hides the file and treats it as deleted. There are two ways to mark:
One is the whiteout method.
The second, used to delete a directory, is to set an extended attribute (xattr) on the directory.
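The read, copy-up, and whiteout behaviors described above can be captured in a toy in-memory model. This is not overlayfs code, just an illustration of the semantics: reads fall through to lower, writes copy up, and deletes leave a marker in upper without touching the lower layer.

```python
# Toy model of overlay union semantics: a read-only lower dict, a
# read-write upper dict, and whiteout markers standing in for the
# character-device markers real overlayfs creates.
WHITEOUT = object()

class Overlay:
    def __init__(self, lower):
        self.lower = dict(lower)  # image layer, never modified
        self.upper = {}           # container read-write layer

    def read(self, name):
        if name in self.upper:
            if self.upper[name] is WHITEOUT:
                raise FileNotFoundError(name)  # hidden by the whiteout mark
            return self.upper[name]
        return self.lower[name]   # fall through to the image layer

    def write(self, name, data):
        # copy-up: modifications land in upper; lower stays intact
        self.upper[name] = data

    def delete(self, name):
        self.upper[name] = WHITEOUT  # mark as deleted, don't remove from lower

fs = Overlay({"/etc/hostname": "image", "/etc/issue": "text"})
fs.write("/etc/hostname", "container")  # copy-up then modify
fs.delete("/etc/issue")                 # whiteout, lower copy survives
```

The key property the model preserves: the lower layer is never mutated, which is what lets many containers share one image.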
Operation steps
Next, look at an actual container launched with docker run from the busybox image: what does its overlay mount point look like?
The second figure shows the mount output: you can see the mount of the container's rootfs, with overlay as the mounted type, and it includes the three levels: upper, lower and workdir.
Then look at writing a new file in the container. Create a new file with docker exec; the diff directory, as seen above, is its upperdir. Looking at the file inside upperdir, its contents are exactly what docker exec wrote.
Finally, look at mergedir at the bottom. Mergedir is the merge of upperdir and lowerdir content, and the data we wrote is visible there too.
3. Container engine: containerd container architecture
Next, let's talk about the general composition of a container engine, based on containerd, one of CNCF's container engine projects. The picture below is an architecture diagram taken from containerd's official website; based on it, we first briefly introduce containerd's architecture.
If you split the image into left and right halves, containerd can be seen as providing two major groups of functionality.
The first, on the right, is runtime, i.e. container life cycle management; the storage part on the left is image storage management: containerd is responsible for pulling and storing images.
Viewed by horizontal layers:
The first layer is GRPC: containerd provides services to the upper layers in the form of GRPC services. The Metrics part mainly exposes cgroup metrics.
In the lower layer, the left side is container image storage. Below the center line, images and containers make up the Metadata, which is stored on disk via boltdb. Tasks on the right is the structure that manages the running containers. Events means that container operations emit events upward, and the upper layers can subscribe to these events to learn of changes in container state.
The lowest layer is the Runtimes layer, where runtimes are distinguished by type, such as runC or a secure container.
What is shim v1/v2?
Next, the general architecture of containerd on the runtime side. The picture below is taken from kata's official website; the top half is the original image, and the bottom adds some extension examples. Based on this picture, let's look at containerd's architecture at the runtime layer.
As the figure shows, from left to right is the order in which a request flows from the upper layers down to the final runtime.
On the far left is a CRI Client. Typically, kubelet sends requests to containerd via CRI. After containerd receives the container request, it passes through a containerd-shim. The containerd-shim manages the container life cycle and is responsible for two main things:
The first is forwarding IO.
The second is forwarding signals.
The top half of the picture covers secure containers, i.e. the kata flow, which we won't expand on here. In the lower half you can see a variety of different shims; let's introduce the containerd-shim architecture.
At first containerd had only one shim, the containerd-shim framed in blue. Whether for a kata container, a runC container, or a gvisor container, the shim used above was this same containerd-shim.
Later, containerd added an extension mechanism for different types of runtime, through the shim-v2 interface: as long as the shim-v2 interface is implemented, each runtime can provide its own customized shim. For example, runC can make its own shim, shim-runc; gvisor can make shim-gvisor; kata can make shim-kata. These shims can replace the containerd-shim in the blue box above.
There are many advantages to this; kata gives a vivid example. With shim-v1 there were actually three components, due to a limitation of kata itself, but after moving to the shim-v2 architecture, those three components can be built into a single binary, one shim-kata component. This reflects one benefit of shim-v2.
containerd architecture details: container process examples
Let's use two examples to explain in detail how the container processes work. The following two diagrams draw the workflow of a container on top of the containerd architecture.
Start process
First, take a look at the process of container start:
This picture consists of three parts:
The first part is the container engine part, which can be docker or something else.
The second part, framed by two dotted lines, is containerd and containerd-shim, which belong to the containerd architecture.
At the bottom is the container part, which is pulled up by the runtime; it can be thought of as a container created by the shim invoking the runC command.
Now how the process works. The figure marks steps 1, 2, 3, 4, which show how containerd creates a container.
First it creates the metadata, then sends a request to the task service to create the container. Through a series of components in the middle, the request finally reaches a shim; the interaction between containerd and the shim is also done over GRPC. After containerd sends the creation request to the shim, the shim calls the runtime to create the container. This is the container start example.
Exec process
Next, take a look at the following picture of how to exec a container.
It is very similar to the start process: the structure is about the same, and the difference lies in how containerd handles this flow. As in the figure above, steps 1, 2, 3, 4 mark the order in which containerd does exec.
As you can see from the figure, the exec operation is still sent to containerd-shim. To the shim, there is no essential difference between starting a container and exec-ing into one.
The fundamental difference is whether new namespaces are created for the process running in the container:
With exec, the process must be added to the container's existing namespaces;
With start, the container process's namespaces must be created fresh.
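The start-vs-exec distinction comes down to unshare(2) versus setns(2): start unshares fresh namespaces, while exec joins the ones an existing container process already owns, via the files under /proc/&lt;pid&gt;/ns/. Below is a hedged sketch of the exec side; the namespace list and function names are illustrative, and the actual setns calls require root.

```python
# Sketch of what exec must do: join the namespaces of an existing
# container process via setns(2) on its /proc/<pid>/ns/* files.
import ctypes
import os

# Namespace files a typical exec would join (order simplified).
NS_TYPES = ["mnt", "uts", "ipc", "net", "pid"]

def ns_paths(target_pid):
    """Paths of the namespace files owned by the container's main process."""
    return [f"/proc/{target_pid}/ns/{ns}" for ns in NS_TYPES]

def join_namespaces(target_pid):
    """setns() into each namespace of target_pid (root only).
    After joining "pid", it is the *next* forked child that lands in the
    target pid namespace, mirroring the unshare behavior for start."""
    libc = ctypes.CDLL(None, use_errno=True)
    for path in ns_paths(target_pid):
        fd = os.open(path, os.O_RDONLY)
        try:
            if libc.setns(fd, 0) != 0:  # 0 = don't check the namespace type
                err = ctypes.get_errno()
                raise OSError(err, os.strerror(err))
        finally:
            os.close(fd)
```

This is the same mechanism the nsenter tool uses, and conceptually what the shim arranges when containerd forwards an exec request.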
Here is a brief summary of this article:
How containers use namespaces for resource isolation and cgroups for resource limits;
container image storage based on the overlay file system;
and, taking docker + containerd as an example, how a container engine works.