
How to understand the Namespace mechanism

Updated: 2025-01-19. Source: SLTechnology News&Howtos (Shulou.com), Servers section.


This article explains the Linux Namespace mechanism: what each namespace isolates, which system calls create and join namespaces, and how the pieces fit together in a container. Many readers are probably not very familiar with this topic, so the article is shared here for reference.

Namespace

Linux Namespace is a kernel-level environment isolation mechanism provided by Linux. It is very similar to chroot, which turns a directory into the root directory of a process so that nothing outside it can be accessed. Building on that idea, Linux Namespace provides isolation for UTS, IPC, Mount, PID, Network, User, and so on, as shown below.

    Classification       System call parameter   Related kernel version
    Mount Namespaces     CLONE_NEWNS             Linux 2.4.19
    UTS Namespaces       CLONE_NEWUTS            Linux 2.6.19
    IPC Namespaces       CLONE_NEWIPC            Linux 2.6.19
    PID Namespaces       CLONE_NEWPID            Linux 2.6.24
    Network Namespaces   CLONE_NEWNET            started in Linux 2.6.24, completed in Linux 2.6.29
    User Namespaces      CLONE_NEWUSER           started in Linux 2.6.23, completed in Linux 3.8

Note: for further reading on Linux Namespace, see the LWN article series "Namespaces in operation".

Three system calls are used to work with Namespace:

clone() - creates a new process; passing the flags listed above places the new process in new namespaces.

unshare() - moves the calling process out of a namespace it currently shares and into a new one

setns(int fd, int nstype) - adds the calling process to an existing namespace
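The flags in the table are plain bit values that can be OR-ed together before being handed to clone() or unshare(). As a minimal sketch of that idea, the following maps short namespace names to their flags. The names ns_kind, ns_flag and ns_mask are hypothetical helpers invented for this illustration; only the CLONE_NEW* constants come from the kernel headers.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* exposes the CLONE_NEW* constants in <sched.h> */
#endif
#include <sched.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical lookup table: short name -> namespace flag. */
struct ns_kind {
    const char *name;
    int flag;
};

static const struct ns_kind NS_KINDS[] = {
    { "mnt",  CLONE_NEWNS   },
    { "uts",  CLONE_NEWUTS  },
    { "ipc",  CLONE_NEWIPC  },
    { "pid",  CLONE_NEWPID  },
    { "net",  CLONE_NEWNET  },
    { "user", CLONE_NEWUSER },
};

/* Return the flag for one namespace name, or 0 if the name is unknown. */
int ns_flag(const char *name)
{
    for (size_t i = 0; i < sizeof(NS_KINDS) / sizeof(NS_KINDS[0]); i++) {
        if (strcmp(NS_KINDS[i].name, name) == 0)
            return NS_KINDS[i].flag;
    }
    return 0;
}

/* OR together the flags for a NULL-terminated list of names.
 * Returns -1 if any name is unknown. */
int ns_mask(const char *const names[])
{
    int mask = 0;
    for (size_t i = 0; names[i] != NULL; i++) {
        int flag = ns_flag(names[i]);
        if (flag == 0)
            return -1;
        mask |= flag;
    }
    return mask;
}
```

A caller could then pass something like ns_mask(names) | SIGCHLD as the flags argument of clone(), which is the same pattern the listings in this article write out by hand.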

The sections below use these system calls to demonstrate the effect of each Namespace. For more detail, see "DOCKER basic technology: LINUX NAMESPACE" (part 1 and part 2).

UTS Namespace

UTS Namespace is mainly used to isolate hostnames, so that each container has its own hostname. The following code demonstrates it. Note: if the hostname is not set inside the container, the container uses the hostname of the host; if the hostname is set inside the container but CLONE_NEWUTS was not passed, the hostname of the host itself is changed.

    #define _GNU_SOURCE
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <sched.h>
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define STACK_SIZE (1024 * 1024)

    static char container_stack[STACK_SIZE];
    char *const container_args[] = { "/bin/bash", NULL };

    int container_main(void *arg)
    {
        printf("Container [%5d] - inside the container!\n", getpid());
        /* With CLONE_NEWUTS this only changes the hostname of the new namespace. */
        sethostname("container_dawn", strlen("container_dawn"));
        execv(container_args[0], container_args);
        printf("Something's wrong!\n");
        return 1;
    }

    int main(void)
    {
        printf("Parent [%5d] - start a container!\n", getpid());
        /* The stack grows downward, so clone() gets the top of the buffer. */
        int container_id = clone(container_main, container_stack + STACK_SIZE,
                                 CLONE_NEWUTS | SIGCHLD, NULL);
        waitpid(container_id, NULL, 0);
        printf("Parent - container stopped!\n");
        return 0;
    }

PID Namespace

Each container has its own process environment: PIDs inside the container are numbered from 1, while PIDs on the host are also numbered from 1. In other words, there are two process environments, one on the host and one in the container.

Why does numbering PIDs from 1 amount to isolating the process environment? In a traditional UNIX system, the process with PID 1 is init, which has a special status. As the ancestor of all processes it has many privileges, and it also watches over the other processes: if a process is orphaned (its parent exits without waiting for it), init adopts it and reclaims its resources when it terminates. So to achieve process isolation, we first need to create a process whose PID is 1.
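The reclaiming duty described above comes down to calling waitpid() on any child until none are left. A minimal sketch of that loop follows; reap_all is a hypothetical helper name, and a real init would usually run this from a SIGCHLD handler or a main loop rather than block as this version does.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Block until every child of the caller has exited.
 * Returns the number of children that were reaped. */
int reap_all(void)
{
    int reaped = 0;

    for (;;) {
        pid_t pid = waitpid(-1, NULL, 0);   /* -1 means "any child" */
        if (pid > 0) {
            reaped++;                       /* one zombie reclaimed */
            continue;
        }
        if (pid < 0 && errno == EINTR)
            continue;                       /* interrupted by a signal: retry */
        break;                              /* ECHILD: no children remain */
    }
    return reaped;
}
```

Inside a PID namespace, orphaned processes are re-parented to the namespace's PID 1, so a container entry process that never waits for children can accumulate zombies.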


    /* Headers (including <string.h>), STACK_SIZE, container_stack and
       container_args are the same as in the UTS Namespace listing. */
    int container_main(void *arg)
    {
        printf("Container [%5d] - inside the container!\n", getpid());
        sethostname("container_dawn", strlen("container_dawn"));
        execv(container_args[0], container_args);
        printf("Something's wrong!\n");
        return 1;
    }

    int main(void)
    {
        printf("Parent [%5d] - start a container!\n", getpid());
        int container_id = clone(container_main, container_stack + STACK_SIZE,
                                 CLONE_NEWUTS | CLONE_NEWPID | SIGCHLD, NULL);
        waitpid(container_id, NULL, 0);
        printf("Parent - container stopped!\n");
        return 0;
    }

If you run ps, top and similar commands in the shell of the child process, you can still see all the processes of the host. This is because ps and top read the /proc file system, and since the file system has not been isolated yet, the parent and child processes see the same content through these commands.
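To make the dependence on /proc concrete, here is a minimal sketch that extracts the "Pid:" field from the text of a /proc/PID/status file. The function names are hypothetical. In a new PID namespace whose /proc has not been remounted, the value read this way would typically be the host-side PID, while getpid() returns the PID inside the container.

```c
#include <stdio.h>
#include <string.h>

/* Find the "Pid:" line in the text of a /proc/PID/status file.
 * Returns the PID, or -1 if no such line exists. */
long pid_from_status_text(const char *text)
{
    const char *p = text;

    while (p != NULL && *p != '\0') {
        long pid;
        /* Matches only at the start of a line, so "PPid:" is skipped. */
        if (sscanf(p, "Pid: %ld", &pid) == 1)
            return pid;
        p = strchr(p, '\n');
        if (p != NULL)
            p++;                    /* move to the start of the next line */
    }
    return -1;
}

/* Read <proc_root>/self/status (proc_root is normally "/proc"). */
long pid_from_proc(const char *proc_root)
{
    char path[256];
    char buf[4096];

    snprintf(path, sizeof(path), "%s/self/status", proc_root);
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    size_t n = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[n] = '\0';
    return pid_from_status_text(buf);
}
```

Comparing pid_from_proc("/proc") with getpid() inside the child shell's process is one way to check whether /proc still belongs to the host's PID namespace.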

IPC Namespace

Common IPC mechanisms include shared memory, semaphores, message queues and so on. When IPC Namespace is used to isolate IPC, only processes in the same Namespace can communicate with each other, because the IPC objects of the host and of other Namespaces are invisible. The isolation works because every IPC object that is created gets a unique ID, so isolating the IDs is sufficient.

To start IPC isolation, you just need to add the CLONE_NEWIPC parameter when you call clone.

    /* Headers (including <string.h>), STACK_SIZE, container_stack and
       container_args are the same as in the UTS Namespace listing. */
    int container_main(void *arg)
    {
        printf("Container [%5d] - inside the container!\n", getpid());
        sethostname("container_dawn", strlen("container_dawn"));
        execv(container_args[0], container_args);
        printf("Something's wrong!\n");
        return 1;
    }

    int main(void)
    {
        printf("Parent [%5d] - start a container!\n", getpid());
        int container_id = clone(container_main, container_stack + STACK_SIZE,
                                 CLONE_NEWUTS | CLONE_NEWPID | CLONE_NEWIPC | SIGCHLD,
                                 NULL);
        waitpid(container_id, NULL, 0);
        printf("Parent - container stopped!\n");
        return 0;
    }

Mount Namespace

Mount Namespace allows the container to have its own root file system. Note that when a mount namespace is created with CLONE_NEWNS, the child process receives a copy of the parent's mount tree. So as long as the child process does not remount anything, its view of the file system is the same as the parent's. To change what the container process sees, you must remount (this is where mount namespace differs from the other namespaces).

In addition, mount operations performed in the new namespace only affect the child's own view of the file system and have no effect on the outside world, which allows stricter isolation (shared mounts are the exception). Note that this applies to mount operations only; creating or modifying files still affects the underlying file system.

Let's remount the /proc directory in the child process so that ps shows what is going on inside the container.

    /* Same headers and definitions as the UTS Namespace listing,
       plus <sys/mount.h> and <string.h>. */
    int container_main(void *arg)
    {
        printf("Container [%5d] - inside the container!\n", getpid());
        sethostname("container_dawn", strlen("container_dawn"));
        /* Remount /proc so that ps and top show the container's processes. */
        if (mount("proc", "/proc", "proc", 0, NULL) != 0) {
            perror("proc");
        }
        execv(container_args[0], container_args);
        printf("Something's wrong!\n");
        return 1;
    }

    int main(void)
    {
        printf("Parent [%5d] - start a container!\n", getpid());
        int container_id = clone(container_main, container_stack + STACK_SIZE,
                                 CLONE_NEWUTS | CLONE_NEWPID | CLONE_NEWNS | SIGCHLD,
                                 NULL);
        waitpid(container_id, NULL, 0);
        printf("Parent - container stopped!\n");
        return 0;
    }

Note: there is a problem here. After the child process exits, running ps -elf on the host again reports an error.

This is because /proc is a shared mount, so operations on it propagate to all mount namespaces. See: http://unix.stackexchange.com/questions/281844/why-does-child-with-mount-namespace-affect-parent-mounts
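Whether a mount is shared can be read from /proc/self/mountinfo, where shared mounts carry a "shared:N" tag among the optional fields. The sketch below checks one mountinfo line for that tag, and also shows the commonly used remedy of marking the whole mount tree private inside the new namespace before mounting /proc. Both function names are hypothetical, and the remedy is an addition of this sketch rather than something the listings in this article do.

```c
#include <stddef.h>
#include <string.h>
#include <sys/mount.h>

/* Inspect one line of /proc/self/mountinfo.
 * Returns 1 if it has a "shared:N" tag, 0 if not, -1 if malformed. */
int mountinfo_is_shared(const char *line)
{
    char buf[1024];
    int field = 0;

    if (strlen(line) >= sizeof(buf))
        return -1;
    strcpy(buf, line);

    for (char *tok = strtok(buf, " \n"); tok != NULL; tok = strtok(NULL, " \n")) {
        field++;
        if (field <= 6)
            continue;                   /* six fixed fields precede the tags */
        if (strcmp(tok, "-") == 0)
            return 0;                   /* separator reached, no shared tag */
        if (strncmp(tok, "shared:", 7) == 0)
            return 1;
    }
    return -1;                          /* separator never seen */
}

/* Stop mount events from propagating out of the current mount namespace.
 * Needs CAP_SYS_ADMIN; returns 0 on success, -1 with errno set on failure. */
int make_mounts_private(void)
{
    return mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL);
}
```

Under these assumptions, calling make_mounts_private() in the child before mount("proc", "/proc", ...) would keep the new /proc mount from replacing the host's view.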

The code above only remounts the /proc directory; every other directory still looks the same as in the parent process. In general, once a container is created, the container process should see a separate, isolated environment rather than the file system inherited from the host. Next, we build a fake image to mimic Docker's Mount Namespace, that is, a relatively complete and independent root file system for the child process, so that the process can only access the contents of its own file system (think of how Docker containers are normally used).

First, we use docker export to export the busybox image into a rootfs directory, which already contains special directories such as /proc and /sys.

Then, in the code, we remount some special directories and use the chroot() system call to change the root directory of the process to the rootfs directory above.

    /* Same headers and definitions as the previous listing, except container_args. */
    char *const container_args[] = { "/bin/sh", NULL };

    int container_main(void *arg)
    {
        printf("Container [%5d] - inside the container!\n", getpid());
        sethostname("container_dawn", strlen("container_dawn"));

        /* Remount the special directories inside rootfs. */
        if (mount("proc", "rootfs/proc", "proc", 0, NULL) != 0) {
            perror("proc");
        }
        if (mount("sysfs", "rootfs/sys", "sysfs", 0, NULL) != 0) {
            perror("sys");
        }
        if (mount("none", "rootfs/tmp", "tmpfs", 0, NULL) != 0) {
            perror("tmp");
        }
        if (mount("udev", "rootfs/dev", "devtmpfs", 0, NULL) != 0) {
            perror("dev");
        }
        if (mount("devpts", "rootfs/dev/pts", "devpts", 0, NULL) != 0) {
            perror("dev/pts");
        }
        if (mount("shm", "rootfs/dev/shm", "tmpfs", 0, NULL) != 0) {
            perror("dev/shm");
        }
        if (mount("tmpfs", "rootfs/run", "tmpfs", 0, NULL) != 0) {
            perror("run");
        }

        /* Change the root directory of the process to rootfs. */
        if (chdir("./rootfs") != 0 || chroot("./") != 0) {
            perror("chdir/chroot");
        }

        /* After the root directory changes, /bin/sh is looked up under the new root. */
        execv(container_args[0], container_args);
        perror("exec");
        printf("Something's wrong!\n");
        return 1;
    }

    int main(void)
    {
        printf("Parent [%5d] - start a container!\n", getpid());
        int container_id = clone(container_main, container_stack + STACK_SIZE,
                                 CLONE_NEWUTS | CLONE_NEWPID | CLONE_NEWNS | SIGCHLD,
                                 NULL);
        waitpid(container_id, NULL, 0);
        printf("Parent - container stopped!\n");
        return 0;
    }

Finally, check out the implementation effect as shown in the following figure.

In fact, Mount Namespace was developed as a continuous improvement on chroot.

chroot can be regarded as the first Namespace in Linux. The file system mounted at the container's root directory to provide an isolated execution environment is the so-called container image, also known as rootfs (root file system). To be clear, rootfs only contains the files, configuration, and directories of an operating system, not the operating system kernel.

User Namespace

The UID and GID seen inside the container differ from those outside. For example, the user dawn may show up as UID 0 inside the container while actually being UID 1000 on the host. To achieve this, the UID inside the container must be mapped to a UID on the host. The files to modify are /proc/PID/uid_map and /proc/PID/gid_map. The format of both files is

ID-INSIDE-NS ID-OUTSIDE-NS LENGTH

ID-INSIDE-NS: the UID or GID displayed inside the container

ID-OUTSIDE-NS: the real UID or GID outside the container that it is mapped to

LENGTH: the size of the mapped range, usually 1, meaning a one-to-one mapping
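The three-field format is simple enough to produce and read back with the standard formatted I/O functions. A minimal sketch, with hypothetical helper names format_id_map and parse_id_map:

```c
#include <stdio.h>

/* Write one "ID-INSIDE-NS ID-OUTSIDE-NS LENGTH" line into buf.
 * Returns the number of characters written, or -1 if buf is too small. */
int format_id_map(char *buf, size_t size,
                  unsigned inside, unsigned outside, unsigned length)
{
    int n = snprintf(buf, size, "%u %u %u\n", inside, outside, length);
    return (n > 0 && (size_t)n < size) ? n : -1;
}

/* Parse one mapping line. Returns 0 on success, -1 on malformed input. */
int parse_id_map(const char *line,
                 unsigned *inside, unsigned *outside, unsigned *length)
{
    return sscanf(line, "%u %u %u", inside, outside, length) == 3 ? 0 : -1;
}
```

With these helpers, format_id_map(buf, sizeof(buf), 0, 1000, 1) yields the line that maps host UID 1000 to UID 0 in the container.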

For example, here the real uid=1000 is mapped to uid=0 in the container:

    $ cat /proc/8353/uid_map
    0 1000 1

As another example, the following means that UIDs starting from 0 inside the namespace are mapped to external UIDs starting from 0, over the full range of an unsigned 32-bit integer (the command is run in the host environment).

    $ cat /proc/$$/uid_map
    0 0 4294967295

If CLONE_NEWUSER is set but the two files above are not written, the container shows 65534 by default. An unmapped UID cannot be translated, so the kernel reports the overflow UID instead, which defaults to 65534 (see /proc/sys/kernel/overflowuid). The following code demonstrates this:

    #define _GNU_SOURCE
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <sched.h>
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define STACK_SIZE (1024 * 1024)

    static char container_stack[STACK_SIZE];
    char *const container_args[] = { "/bin/bash", NULL };

    int container_main(void *arg)
    {
        printf("Container [%5d] - inside the container!\n", getpid());
        printf("Container: eUID = %ld; eGID = %ld, UID=%ld, GID=%ld\n",
               (long) geteuid(), (long) getegid(), (long) getuid(), (long) getgid());
        printf("Container [%5d] - setup hostname!\n", getpid());
        /* set the hostname of the new UTS namespace */
        sethostname("container", strlen("container"));
        execv(container_args[0], container_args);
        printf("Something's wrong!\n");
        return 1;
    }

    int main(void)
    {
        printf("Parent: eUID = %ld; eGID = %ld, UID=%ld, GID=%ld\n",
               (long) geteuid(), (long) getegid(), (long) getuid(), (long) getgid());
        printf("Parent [%5d] - start a container!\n", getpid());
        /* No uid_map or gid_map is written here, so the child sees 65534. */
        int container_pid = clone(container_main, container_stack + STACK_SIZE,
                                  CLONE_NEWUTS | CLONE_NEWPID | CLONE_NEWUSER | SIGCHLD,
                                  NULL);
        printf("Parent [%5d] - Container [%5d]!\n", getpid(), container_pid);
        waitpid(container_pid, NULL, 0);
        printf("Parent - container stopped!\n");
        return 0;
    }

When the program is run as the user dawn, the container reports 65534 for the UID and GID. The result is the same when it is run as root.

Next, we implement the mapping, so that the user dawn appears as UID 0 inside the container. The code is taken almost entirely from Uncle Mouse's blog (the "DOCKER basic technology: LINUX NAMESPACE" articles mentioned above):

    /* Same headers and definitions as the previous listing, plus <sys/mount.h>. */
    int pipefd[2];

    void set_map(char *file, int inside_id, int outside_id, int len)
    {
        FILE *mapfd = fopen(file, "w");
        if (NULL == mapfd) {
            perror("open file error");
            return;
        }
        fprintf(mapfd, "%d %d %d", inside_id, outside_id, len);
        fclose(mapfd);
    }

    void set_uid_map(pid_t pid, int inside_id, int outside_id, int len)
    {
        char file[256];
        sprintf(file, "/proc/%d/uid_map", pid);
        set_map(file, inside_id, outside_id, len);
    }

    int container_main(void *arg)
    {
        printf("Container [%5d] - inside the container!\n", getpid());
        printf("Container: eUID = %ld; eGID = %ld, UID=%ld, GID=%ld\n",
               (long) geteuid(), (long) getegid(), (long) getuid(), (long) getgid());

        /* Wait for the parent to finish the mapping (inter-process synchronization). */
        char ch;
        close(pipefd[1]);
        read(pipefd[0], &ch, 1);

        printf("Container [%5d] - setup hostname!\n", getpid());
        sethostname("container", strlen("container"));

        /* Remount /proc so that top and ps show the container's information. */
        mount("proc", "/proc", "proc", 0, NULL);

        execv(container_args[0], container_args);
        printf("Something's wrong!\n");
        return 1;
    }

    int main(void)
    {
        const int uid = getuid();

        printf("Parent: eUID = %ld; eGID = %ld, UID=%ld, GID=%ld\n",
               (long) geteuid(), (long) getegid(), (long) getuid(), (long) getgid());

        pipe(pipefd);

        printf("Parent [%5d] - start a container!\n", getpid());
        int container_pid = clone(container_main, container_stack + STACK_SIZE,
                                  CLONE_NEWUTS | CLONE_NEWPID | CLONE_NEWNS |
                                  CLONE_NEWUSER | SIGCHLD, NULL);
        printf("Parent [%5d] - Container [%5d]!\n", getpid(), container_pid);

        /* To map the uid, the parent edits /proc/PID/uid_map of the child
           (/proc/PID/gid_map works the same way). */
        set_uid_map(container_pid, 0, uid, 1);
        printf("Parent [%5d] - user/group mapping done!\n", getpid());

        /* Notify the child: closing the write end makes its read() return. */
        close(pipefd[1]);

        waitpid(container_pid, NULL, 0);
        printf("Parent - container stopped!\n");
        return 0;
    }

With the mapping in place, the user dawn is displayed as UID 0 (root) inside the container. However, the /bin/bash process in the container is still running as the ordinary user dawn on the host; only the displayed UID is 0. That is why it still has no permission to view the /root directory.

A User Namespace can be created by a normal user, but the other Namespaces require root permissions. So what should be done when several Namespaces are needed? First create a User Namespace as a normal user and map that user to root inside it, then use this root identity to create the other Namespaces.
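The order of operations just described can be sketched as follows. This is an illustration under stated assumptions, not a definitive implementation: it assumes a kernel that permits unprivileged user namespaces and provides /proc/self/setgroups (Linux 3.19 or later), and the function names are hypothetical. The work happens in a forked child so that the caller's own namespaces are left untouched.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for unshare() and the CLONE_NEW* constants */
#endif
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Write text to an existing file with a single write() call. */
static int write_file(const char *path, const char *text)
{
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    ssize_t n = write(fd, text, strlen(text));
    close(fd);
    return n == (ssize_t)strlen(text) ? 0 : -1;
}

/* Step 1: new user namespace. Step 2: map the caller to root inside it.
 * Step 3: use that root identity to create a UTS namespace. */
static int userns_then_uts(void)
{
    char map[64];
    uid_t uid = getuid();       /* read the IDs before leaving the old namespace */
    gid_t gid = getgid();

    if (unshare(CLONE_NEWUSER) != 0)
        return -1;
    /* An unprivileged writer must disable setgroups() before writing gid_map. */
    if (write_file("/proc/self/setgroups", "deny") != 0)
        return -1;
    snprintf(map, sizeof(map), "0 %u 1", (unsigned)uid);
    if (write_file("/proc/self/uid_map", map) != 0)
        return -1;
    snprintf(map, sizeof(map), "0 %u 1", (unsigned)gid);
    if (write_file("/proc/self/gid_map", map) != 0)
        return -1;

    if (unshare(CLONE_NEWUTS) != 0)
        return -1;
    return sethostname("container", strlen("container"));
}

/* Run the sequence in a child process. Returns 0 if every step succeeded,
 * 1 otherwise (for example when user namespaces are disabled). */
int try_rootless_uts(void)
{
    int status = 0;
    pid_t pid = fork();

    if (pid < 0)
        return 1;
    if (pid == 0)
        _exit(userns_then_uts() == 0 ? 0 : 1);
    if (waitpid(pid, &status, 0) < 0)
        return 1;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : 1;
}
```

The same pattern extends to the other flags: once the process holds root inside its own user namespace, unshare() can be called with CLONE_NEWNS, CLONE_NEWPID and so on.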

Network Namespace

Network Namespace isolates the network of each container, so that each one has its own virtual network interfaces and IP addresses. In Linux, you can use the ip command to create a Network Namespace (Docker's source code does not call the ip command; it implements the relevant parts of the ip command itself).

Let's use the ip command to walk through the construction of a Network Namespace, taking the bridge network as an example. In the typical bridge topology, br0 is a Linux bridge on the host and each container is attached to it through a veth pair.

When using Docker, if you start a container and run ip link show on the host, you will see a docker0 device and a virtual network card named veth****. That veth device is the host end of the veth pair in the topology, and docker0 plays the role of br0.

So we can use the following commands to create an effect similar to Docker's (adapted from Uncle Mouse's blog, with some explanatory text added).

    ## 1. First, add a bridge lxcbr0 to imitate docker0
    brctl addbr lxcbr0
    brctl stp lxcbr0 off
    ifconfig lxcbr0 192.168.10.1/24 up    # set the IP address of the bridge

    ## 2. Next, create a network namespace named ns1

    # add a namespace named ns1 (using the ip netns add command)
    ip netns add ns1

    # activate loopback (127.0.0.1) in the namespace; "ip netns exec ns1" enters
    # the namespace ns1, so "ip link set dev lo up" is executed inside ns1
    ip netns exec ns1 ip link set dev lo up

    ## 3. Then, add a pair of virtual network cards

    # add a veth pair (note the veth type): veth-ns1 goes into the container,
    # lxcbr0.1 is attached to the bridge lxcbr0 and is the veth seen on the host
    ip link add veth-ns1 type veth peer name lxcbr0.1

    # move veth-ns1 into the namespace ns1, so the container gets a new network card
    ip link set veth-ns1 netns ns1

    # rename veth-ns1 inside the container to eth0
    # (this would conflict outside the container, but not inside)
    ip netns exec ns1 ip link set dev veth-ns1 name eth0

    # assign an IP address to the network card in the container and activate it
    ip netns exec ns1 ifconfig eth0 192.168.10.11/24 up

    # veth-ns1 is now inside the container; attach lxcbr0.1 to the bridge
    brctl addif lxcbr0 lxcbr0.1

    # add a default route so that the container can reach the outside network
    ip netns exec ns1 ip route add default via 192.168.10.1

    ## 4. Set resolv.conf for this namespace so that domain names can be resolved
    ##    in the container ("ip netns exec" picks up files under /etc/netns/ns1/)
    echo "nameserver 8.8.8.8" > conf/resolv.conf
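What "ip netns exec ns1" does before running a command can be approximated in C with setns(): open the namespace file that "ip netns add" created and join it. A minimal sketch, assuming the iproute2 convention that named network namespaces live under /var/run/netns; enter_netns is a hypothetical helper name, and the real ip command performs additional setup beyond this.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for setns() and CLONE_NEWNET */
#endif
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

/* Join the network namespace registered under the given name.
 * Returns 0 on success, -1 on failure (errno is set by open or setns). */
int enter_netns(const char *name)
{
    char path[256];
    int n = snprintf(path, sizeof(path), "/var/run/netns/%s", name);

    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;

    int fd = open(path, O_RDONLY | O_CLOEXEC);
    if (fd < 0)
        return -1;                      /* no namespace with that name */

    /* CLONE_NEWNET makes setns() refuse a descriptor of another namespace type. */
    int rc = setns(fd, CLONE_NEWNET);
    close(fd);
    return rc;
}
```

After a successful enter_netns("ns1"), sockets created by the process would use the interfaces and routes configured inside ns1. Joining a namespace this way requires CAP_SYS_ADMIN.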

The above is basically how Docker networking works, with these differences:

Docker does not use the ip command; it implements the relevant parts of the ip command itself.

Docker does not set up resolv.conf this way. It writes a resolv.conf of its own and mounts it into the container's file system, together with hostname and hosts, when the container starts.

Docker uses the PID of the process as the name of the network namespace.

In the same way, we can add a new network card to a running Docker container:

    ip link add peerA type veth peer name peerB
    brctl addif docker0 peerA
    ip link set peerA up
    ip link set peerB netns ${container-pid}
    ip netns exec ${container-pid} ip link set dev peerB name eth2
    ip netns exec ${container-pid} ip link set eth2 up
    ip netns exec ${container-pid} ip addr add ${ROUTEABLE_IP} dev eth2

Namespace status check

The operating interface of Cgroup is the file system, located at /sys/fs/cgroup. Namespaces can be inspected through the file system as well, mainly through the /proc/PID/ns directory.

Take the PID Namespace program above as an example. When this program runs, suppose its PID is 11702.

We keep the child process running, open another shell, and look up the PID of the child process created by this program, that is, the host-side PID of the process running in the container (11703 here).

Finally, we look at the /proc/11702/ns and /proc/11703/ns directories, which describe the namespaces of the two processes. cgroup, ipc, mnt, net and user have the same IDs in both, while pid and uts differ. If two processes show the same namespace number, they are in the same namespace; otherwise they are in different namespaces.
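The comparison can also be done programmatically: each entry under /proc/PID/ns is a symbolic link whose target looks like uts:[4026531838], and the bracketed number identifies the namespace. A minimal sketch with hypothetical helper names:

```c
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Extract the number from link text such as "uts:[4026531838]".
 * Returns 0 if the text does not have that shape. */
unsigned long long ns_inode_from_link(const char *text)
{
    unsigned long long inode = 0;

    if (sscanf(text, "%*[^:]:[%llu]", &inode) != 1)
        return 0;
    return inode;
}

/* Read /proc/<pid>/ns/<ns> and return its namespace number, or 0 on error. */
unsigned long long ns_inode(pid_t pid, const char *ns)
{
    char path[64];
    char link[64];

    snprintf(path, sizeof(path), "/proc/%ld/ns/%s", (long)pid, ns);
    ssize_t n = readlink(path, link, sizeof(link) - 1);
    if (n < 0)
        return 0;
    link[n] = '\0';                     /* readlink() does not terminate */
    return ns_inode_from_link(link);
}

/* 1 if both processes are in the same namespace of the given type, else 0. */
int same_namespace(pid_t a, pid_t b, const char *ns)
{
    unsigned long long ia = ns_inode(a, ns);
    unsigned long long ib = ns_inode(b, ns);

    return ia != 0 && ia == ib;
}
```

For the two processes in the example, same_namespace(11702, 11703, "net") would be expected to return 1 and same_namespace(11702, 11703, "uts") to return 0, provided the caller is allowed to read both ns directories.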

One more point about the ns directory: once one of these files is opened, the namespace continues to exist for as long as the fd is held, even if all the processes in the namespace have ended. For example, you can use mount --bind /proc/11703/ns/uts ~/uts to keep the UTS Namespace of process 11703 alive.

Summary

Namespace technology actually modifies the "view" an application process has of the whole computer: the operating system restricts that view so that the process can only "see" the specified content. The restriction affects only the application process. To the host, these isolated processes are still ordinary processes, not much different from any others, and all of them are managed by the host; they simply carry additional Namespace parameters. So the role the Docker project plays here is closer to bypass-style assistance and management.

This is also why containers are more popular than virtual machines. If you use a virtual machine as an application sandbox, the Hypervisor must create a real virtual machine, and a complete Guest OS must run inside it before the user's application process can execute. This inevitably costs additional resources. According to experiments, a KVM virtual machine running CentOS takes up 100-200 MB of memory after startup without optimization. In addition, the user application runs inside the virtual machine, so its calls to the host operating system are inevitably intercepted and processed by the virtualization software. That is another layer of overhead, especially for computing resources, network and disk I/O.

With containers, a containerized application is still just a process on the host, which means the performance loss caused by virtualization does not exist. Moreover, containers that use Namespace for isolation do not need a separate Guest OS, which makes the additional resource consumption of a container almost negligible.

Generally speaking, "agility" and "high performance" are the biggest advantages of containers over virtual machines, and an important reason why containers are so popular on platforms with fine-grained resource management such as PaaS.

However, the isolation mechanism based on Linux Namespace also has many shortcomings compared with virtualization technology, and the main one is that the isolation is incomplete.

First, a container is just a special process running on the host, so all containers still use the operating system kernel of the same host. The files of a different distribution, such as CentOS or Ubuntu, can be mounted in the container through a mount namespace, but this does not change the fact that the host kernel is shared. This means that running a Linux container on Windows, or running a container that needs a newer kernel on a host with an older kernel, is not feasible.

Virtual machines, which bring their own independent Guest OS, are much more convenient in this respect.

Second, many resources and objects in the Linux kernel cannot be namespaced, and time is a typical example. If a program in your container changes the time with the settimeofday(2) system call, the time of the entire host changes with it. (Linux 5.6 added a time namespace, but it only virtualizes the monotonic and boot-time clocks, so the wall clock remains shared.)

Compared with the freedom of a virtual machine, where almost anything goes, "what can and cannot be done" is a question users must consider when deploying applications in a container. In addition, the attack surface a container exposes to the application is quite large, and breaking out of a container is much less difficult than breaking out of a virtual machine. In practice, technologies such as Seccomp can filter the system calls made inside the container for security hardening, but the extra filtering layer also affects container performance. Therefore, in production environments, hardly anyone dares to expose a Linux container running on a physical machine directly to the public network.

Finally, the container follows a "single process" model. A container is in essence a process: the user's application is the PID=1 process in the container, and it is also the parent of all processes started later. This means you cannot run two different applications in one container at the same time unless you find a common PID=1 program to act as the parent of both, such as systemd or supervisord. The design intent is for the container and the application to share the same life cycle, rather than having a container that is still running while the application inside it is already dead.
