Getting started with K8s from scratch | The founder of Kata Containers introduces secure container technology

Author | Wang Xu, Senior Technical Expert, Ant Financial Services Group

This article is compiled from lesson 28 of the "CNCF x Alibaba Cloud Native Technology Open Course".

I. Origin: the naming of secure containers

Phil Karlton has a famous saying: "There are only two hard things in computer science: cache invalidation and naming things."

For those of us in the container community, naming certainly lives up to that saying. It is a topic that leaves veteran developers speechless and brings newcomers to tears. In system software, what we now call "Linux container technology" has also gone by names such as Jail, Zone, Virtual Server, and Sandbox. Likewise, in the early virtualization stack, one class of virtual machines was also called containers. After all, the word itself simply refers to something used to contain, encapsulate, and isolate. The word is so overloaded that Wikipedia, known for its rigor, files the topic under "OS-level virtualization", thereby sidestepping the question of what a container actually is.

After Docker appeared in 2013, the concept of the container, together with ideas such as immutable infrastructure and cloud native, overturned within a few years the practice of deploying applications as fine-grained combinations of "packages + configuration". The software stack came to be defined by simple declarative policies and immutable containers. How applications are deployed is somewhat beside the point here. What I want to emphasize is:

"Containers in the cloud native context are essentially 'application containers': applications packaged in a standard format that run in a standard operating system environment (usually the Linux ABI), or the programs and technologies that run such application packages."

This definition is my own wording, but it is not merely my opinion; it is based on the consensus captured in the OCI specifications. The specifications define the environment in which the application in a container runs and how it is run: which executable on the container's root file system is executed, which user it runs as, how much CPU and memory it needs, what external storage is attached, and what is shared.

An application container therefore consists of two things: packaging in a standard format, and a standard, application-centric operating system environment.
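
To make the list above concrete, here is a minimal sketch of the kind of information an OCI runtime configuration carries, written as a Python dictionary and printed as JSON. The field names follow the OCI runtime specification's config.json; the helper name `minimal_oci_config`, the paths, and every value are assumptions chosen purely for illustration, and a real configuration contains many more fields.

```python
import json


def minimal_oci_config(args, uid=1000, gid=1000, rootfs="rootfs",
                       memory_limit_bytes=256 * 1024 * 1024, cpu_shares=512):
    """Build a small, illustrative subset of an OCI runtime config.json."""
    return {
        "ociVersion": "1.0.2",
        "process": {
            # Which user the application runs as.
            "user": {"uid": uid, "gid": gid},
            # Which executable on the root file system is run, with arguments.
            "args": list(args),
            "cwd": "/",
        },
        # Where the container's root file system lives.
        "root": {"path": rootfs, "readonly": True},
        # External storage made visible inside the container (hypothetical path).
        "mounts": [
            {"destination": "/data", "type": "bind",
             "source": "/srv/example-data", "options": ["rbind", "ro"]},
        ],
        "linux": {
            # CPU and memory requirements.
            "resources": {
                "memory": {"limit": memory_limit_bytes},
                "cpu": {"shares": cpu_shares},
            },
            # What is isolated; a namespace can be shared by pointing at a path.
            "namespaces": [{"type": "pid"}, {"type": "mount"}, {"type": "network"}],
        },
    }


config = minimal_oci_config(["/bin/app", "--serve"])
print(json.dumps(config, indent=2))
```

Any OCI runtime, whether it uses namespaces or a virtual machine underneath, consumes a description of this shape, which is what lets a secure container slot in behind the same interface.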

With this consensus in place, we can talk about secure containers. When my co-founder Zhao Peng and I started out, we called our technology a "virtual container", and to attract attention we used the slogan "Secure as VM, Fast as Container". People for whom container security was a sore point immediately began calling this kind of technology a "Secure Container". In our view, the technology adds an extra isolation layer, which is only one part of security, but users still prefer the name secure container. Our definition of a secure container is:

A secure container is a runtime technology that provides a complete operating system execution environment (usually the Linux ABI) for container applications while isolating the application's execution from the host operating system. It prevents the application from directly accessing host resources and thereby provides additional protection between containers and the host, and between containers.

This is what we mean by a secure container.

II. The indirection layer: the essence of secure containers

Any discussion of secure containers has to mention the "indirection layer". The idea comes from remarks Linus Torvalds made at LinuxCon in 2015:

"The only solution to security problems is to accept that the bugs that cause them will occur, and to contain them with additional isolation layers."

Why introduce an isolation layer for the sake of security? Linux itself is so large that it is impossible to prove that it is free of bugs, so once a suitable bug is exploited, a security risk becomes a security incident. Security frameworks and patches alone cannot guarantee security, so we need additional isolation to reduce the number of exploitable vulnerabilities and the risk that one of them leads to a complete breach.

This is where secure containers come from.

III. Kata Containers: cloud native virtualization

In December 2017, we announced the Kata Containers secure container project at KubeCon. It has two predecessors: runV, which we had started earlier, and Intel's Clear Containers project. Both projects began in May 2015, which means they actually predate Linus's remarks at LinuxCon 2015.

Their idea is simple:

The operating system's own container mechanisms cannot solve the security problem, so an isolation layer is needed.

A virtual machine (VM) is a ready-made isolation layer. Clouds such as Alibaba Cloud and AWS are all built on virtualization, so it is widely accepted that if a technology is "secure as a VM", it is secure enough for the public cloud.

If there is a kernel inside the virtual machine, it can satisfy the OCI definition discussed above: it provides a Linux ABI environment, and running a Linux application in that environment is not difficult.

The remaining problem is that virtual machines are not fast enough, which hinders their use in container environments. If a virtual machine could reach the "speed of a container", we would have a secure container technology that uses virtual machines for isolation. This is the idea behind Kata Containers: use a virtual machine as the Kubernetes PodSandbox. The VMMs that Kata can use include QEMU, Firecracker, ACRN, and Cloud Hypervisor.

The following figure shows how Kata Containers integrates with Kubernetes. The example uses containerd; CRI-O works the same way.

Today, Kata Containers is usually used with Kubernetes. The kubelet first reaches containerd or CRI-O through the CRI. Image operations are generally handled by containerd or CRI-O themselves. When a request arrives, containerd or CRI-O turns the runtime part of it into an OCI spec and hands it to an OCI runtime for execution: for example, kata-runtime in the upper half of the figure, or the streamlined containerd-shim-kata-v2 in the lower half. The process is as follows:

When containerd receives a request, it first creates a shim-v2. This shim-v2 represents the PodSandbox, that is, the VMM. Each Pod has one shim-v2 that carries out operations on behalf of containerd or CRI-O.

The shim-v2 starts a virtual machine for the Pod and boots a Linux kernel in it, shown as the Guest Kernel in the figure. When QEMU is used, we slim it down through configuration and patches. There is no additional guest operating system: no complete distribution such as CentOS or Ubuntu runs inside.

Next, the container's spec and its packaged storage, including the rootfs and file systems, are handed to the PodSandbox, where kata-agent starts the container inside the virtual machine.

In line with CRI semantics and the OCI specification, several related containers can be started in one Pod. They are placed in the same virtual machine and can share certain namespaces as needed.

External storage and volumes can also be attached to the PodSandbox by hot plugging.

On the network side, almost all Kubernetes CNI plugins can be used seamlessly by means of a TC filter. We also provide an "enlightened" mode, in which a dedicated CNI plugin improves the container's network capabilities.
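
From the user's side, selecting this flow usually comes down to naming a runtime for the Pod. The sketch below builds the two Kubernetes objects involved, a RuntimeClass and a Pod that refers to it, as Python dictionaries. The `node.k8s.io/v1` RuntimeClass kind and the Pod field `runtimeClassName` are part of the Kubernetes API; the handler name `kata`, the object names, and the image are assumptions, since the handler must match whatever name the node's containerd or CRI-O configuration gives the Kata shim.

```python
import json


def runtime_class(name, handler):
    """RuntimeClass object mapping a cluster-level name to a CRI handler."""
    return {
        "apiVersion": "node.k8s.io/v1",
        "kind": "RuntimeClass",
        "metadata": {"name": name},
        # Assumed handler name; it must match the runtime configured on the node.
        "handler": handler,
    }


def pod_with_runtime_class(pod_name, image, runtime_class_name):
    """Pod that asks to be run by the given RuntimeClass."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": pod_name},
        "spec": {
            "runtimeClassName": runtime_class_name,
            "containers": [{"name": "app", "image": image}],
        },
    }


rc = runtime_class("kata", handler="kata")
pod = pod_with_runtime_class("sandboxed-demo", "nginx", rc["metadata"]["name"])
print(json.dumps([rc, pod], indent=2))
```

Everything else in the manifest is unchanged from an ordinary Pod, which reflects the point of the design: the virtual machine sits behind the standard CRI and OCI interfaces.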

As you can see, the PodSandbox contains only a Guest Kernel that runs the container's packaging and the container applications; it does not contain a complete operating system. In other words, it is not used the way a traditional virtual machine is. As far as containers are concerned, it holds only the container engine, and it further reduces memory overhead by avoiding unnecessary memory use and by sharing memory that can be shared.

Compared with a traditional virtual machine, it has lower overhead and starts faster, and in most scenarios it can be both "secure as VM" and "fast as container". Beyond security, it is also more flexible than a traditional virtual machine and feels less like a physical machine: resources can be hot-plugged and unplugged dynamically, and it uses technologies such as virtio-fs. virtio-fs was designed specifically for scenarios like Kata, to share host file system content (such as the container rootfs) with the virtual machine.

With DAX, a technology originally developed for non-volatile storage and non-volatile memory, read-only memory that can be shared is shared between different PodSandboxes, that is, between different Pods and between different containers. This saves a great deal of memory across PodSandboxes. In addition, all Pods are managed from the outside through Kubernetes as containers, and metrics and debugging information are also collected from the outside, so there is no sense of logging into a virtual machine. It therefore looks like a thoroughly containerized way of operating. Underneath it is still a virtual machine, but it is a cloud native kind of virtualization.

IV. gVisor: process-level virtualization

gVisor, which we also call process-level virtualization, takes a different approach from Kata.

In May 2018, at KubeCon in Copenhagen, Google open-sourced gVisor, a secure container it had been developing internally for five years, as a response to Kata Containers and a statement that it had a different solution for secure containers.

Whereas Kata Containers builds an isolation layer between containers by combining and adapting existing isolation technologies, the design of gVisor is clearly more concise.

As shown on the right side of the figure above, gVisor is an operating system kernel rewritten in Go that runs in user mode. This kernel is called sentry, and it does not depend on virtualization or virtual machine technology. Instead, it uses an internal capability called a Platform to have the host operating system redirect every operating system operation the application attempts to sentry. Sentry processes these operations: it passes some of them on to the host operating system and handles most of them itself.

gVisor is a purely application-oriented isolation layer. It was never meant to be equivalent to a virtual machine; it exists to run a Linux program on Linux. As an isolation layer, its security rests on the following:

First, the gVisor developers reduce the attack surface: the host operating system exposes only about 20% of its system calls to the sandbox.

Linux has about 300 system calls, but the calls sentry ultimately makes to the host operating system are concentrated in just over 60 of them. This follows from research by the gVisor developers into operating system security, which found that most successful attacks on the operating system come through rarely used system calls.

This is easy to understand. The code paths behind rarely used system calls tend to be old and not actively developed, with only a few developers maintaining them, whereas code on hot paths is more secure because it has been reviewed many more times. gVisor is therefore designed so that rarely used system calls from the application never reach the host operating system at all; they are handled entirely within sentry.

The calls that sentry makes to the host use only system calls on validated, more mature, hotter paths, so security is much better than it would otherwise be. The number of system calls exposed is now 1/5 of the original, but the likelihood of a successful attack shrinks to less than 1/5.
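
The dispatch idea can be modeled in a few lines. The sketch below is a toy, not gVisor's implementation: the class `SentryModel`, the emulation table, and the allowlist are all hypothetical. It shows an application-facing call being either handled entirely inside the user-space kernel, forwarded to the host through a small allowlist, or rejected without the host ever seeing it.

```python
# Hypothetical table: application-facing syscall mapped to the host syscalls
# the model kernel needs in order to emulate it. An empty tuple means the call
# is handled entirely inside the user-space kernel.
EMULATION_TABLE = {
    "getpid": (),
    "uname": (),
    "read": ("read",),
    "write": ("write",),
}

# Hypothetical allowlist of calls the model kernel may issue to the host.
HOST_ALLOWLIST = frozenset({"read", "write"})


class SentryModel:
    """Toy user-space kernel that mediates every application syscall."""

    def __init__(self, table, host_allowlist):
        self.table = dict(table)
        self.host_allowlist = frozenset(host_allowlist)
        self.host_calls = []  # record of what actually reached the host

    def handle(self, syscall):
        if syscall not in self.table:
            # Unsupported or rarely used call: rejected here, never forwarded.
            return "ENOSYS"
        for host_call in self.table[syscall]:
            if host_call not in self.host_allowlist:
                raise PermissionError(f"{host_call} is not on the host allowlist")
            self.host_calls.append(host_call)
        return "OK"


sentry = SentryModel(EMULATION_TABLE, HOST_ALLOWLIST)
results = [sentry.handle(name) for name in ("getpid", "write", "kexec_load")]
print(results)
print(sentry.host_calls)

# The figures from the text: roughly 60 of about 300 syscalls reach the host.
print(60 / 300)
```

Only the `write` call in this run touches the host; `getpid` is answered locally and the unsupported call stops at the model kernel.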

Second, they found that some frequently attacked system calls need to be isolated, such as open(), which opens a file.

In Unix systems most things are files, so open() can do a great deal, and many attacks go through it. The gVisor developers moved open() into a separate process called Gofer. This separate process is protected further by seccomp, by system restrictions, and by dropping capabilities. Gofer can do less and can run as a non-root user, which further improves the security of the whole system.
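
The value of routing every open through one narrow component can be illustrated with a small in-process model. This is only a sketch under stated assumptions: the real Gofer is a separate, restricted process, whereas the hypothetical `GoferModel` class below is an ordinary Python object whose single job is to open files under one root directory and refuse any path that resolves outside it.

```python
import os
import tempfile


class GoferModel:
    """Toy stand-in for a file proxy: the only place where files get opened."""

    def __init__(self, root):
        # Resolve symlinks once so later comparisons use canonical paths.
        self.root = os.path.realpath(root)

    def open_readonly(self, path):
        # Treat every requested path as relative to the sandbox root.
        candidate = os.path.realpath(os.path.join(self.root, path.lstrip("/")))
        # Refuse anything that resolves outside the root (for example "../..").
        if os.path.commonpath([self.root, candidate]) != self.root:
            raise PermissionError(f"path escapes the sandbox root: {path}")
        return open(candidate, "rb")


with tempfile.TemporaryDirectory() as rootfs:
    with open(os.path.join(rootfs, "hello.txt"), "wb") as f:
        f.write(b"hello from the rootfs")

    gofer = GoferModel(rootfs)
    with gofer.open_readonly("/hello.txt") as f:
        data = f.read()

    try:
        gofer.open_readonly("../../etc/passwd")
        escaped = True
    except PermissionError:
        escaped = False

print(data, escaped)
```

Because there is exactly one place where files are opened, that place can be given the fewest privileges and audited on its own, which is the design point the text attributes to Gofer.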

Finally, sentry and Gofer are written in Go rather than in traditional C.

Go is a more memory-safe language, so gVisor as a whole is less susceptible to attacks and memory-safety problems. Of course, Go is in some respects not well suited to systems programming, and the gVisor developers acknowledge that they made many adjustments to the Go runtime to make this work and contributed those changes back to the Go community.

The architecture of gVisor can fairly be called elegant. Many developers have told me candidly that they like it and find it simpler, purer, and cleaner. But elegant as it is, reimplementing a kernel is something only giants like Google can do; Microsoft's WSL 1 is perhaps a comparable case. And because the design is rather ahead of its time, it does have some problems:

First, sentry is not Linux, so its compatibility falls short of solutions like Kata. This cannot be helped, though for specific applications it may not matter.

Second, given how system calls and CPU instruction sets work today, intercepting a system call from the application and sending it to sentry for execution is quite expensive. gVisor can perform better in certain scenarios, but in most scenarios its performance still lags behind solutions like Kata.

A solution like gVisor will therefore not become the ultimate answer in the short term, but it fits specific scenarios and offers some inspiration, which I think may influence the future development of operating systems and CPU instruction sets. I believe that both Kata and gVisor will keep evolving, and we look forward to a common solution that addresses application execution in a unified way.

V. Secure containers: more than security

Although the name says security, what a secure container provides is isolation, and isolation is useful for more than security.

A secure container uses an isolation layer so that problems in an application, whether external malicious attacks or unexpected errors, do not affect the host or spill over between Pods. The extra isolation layer therefore has an effect beyond security: it benefits system scheduling, quality of service, and the protection of application information.

Traditional operating system container technology is an extension of kernel process management. A container is a group of related processes that is fully visible to the host's scheduler, and all containers and processes in a Pod are scheduled and managed by the host. In an environment with a large number of containers, the burden on the host kernel is therefore heavy, and this overhead can already be observed in many real-world environments.

As hardware advances, a single operating system instance may manage a great deal of memory and many CPUs; hundreds of gigabytes of memory are common. If a large number of containers are placed on such a machine, the scheduling system carries a very heavy load. With secure containers, the host no longer sees the full detail, and the isolation layer takes over part of the scheduling of the applications above it, so the host only needs to schedule the sandboxes. This reduces the host's scheduling overhead, which is why scheduling efficiency improves.
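
A back-of-the-envelope model shows the scale of the effect. The function name and all numbers below are hypothetical, and the model assumes that with sandboxing the host schedules only each sandbox's virtual CPUs while the guest kernel schedules the processes inside; real systems have additional helper threads on both sides.

```python
def host_scheduled_entities(pods, processes_per_pod, vcpus_per_sandbox=None):
    """Count the entities the host scheduler has to manage in this toy model.

    Without sandboxing, every container process is a host-visible task.
    With sandboxing, the host only sees each sandbox's virtual CPUs.
    """
    if vcpus_per_sandbox is None:
        return pods * processes_per_pod
    return pods * vcpus_per_sandbox


# Hypothetical node: 500 Pods with 40 processes each.
plain = host_scheduled_entities(500, 40)
sandboxed = host_scheduled_entities(500, 40, vcpus_per_sandbox=2)
print(plain, sandboxed, plain // sandboxed)
```

Under these assumed numbers the host scheduler handles 1,000 entities instead of 20,000; the remaining scheduling work moves into the per-sandbox kernels, where it is spread out and isolated.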

Besides improving scheduling efficiency, the isolation layer separates applications from one another, which avoids interference between containers and between containers and the host, and so improves quality of service. Our original intention in building secure containers was to protect the host from malicious or faulty applications in containers. As a cloud provider we may face malicious attacks, so we are protecting ourselves.

At the same time, users do not want us to have much access to their resources. They need to use the resources, but they do not need us to see their data. A secure container can fully encapsulate what the user runs, so that operations and maintenance on the host cannot reach the application data, and the data stays protected inside the sandbox without the provider ever touching it. If a cloud provider wants to access user data, it must ask the user for authorization, and the user then cannot be sure whether the provider will do anything malicious. If the sandbox is well encapsulated, no such extra authorization is needed, which better protects user privacy.

Looking ahead, secure containers do more than security isolation. The kernel in a secure container's isolation layer is independent of the host kernel and dedicated to serving the application. Seen this way, it amounts to a sensible division and optimization of functions between the host and the application, and it shows considerable potential. Future secure containers may not only reduce the performance cost of isolation but actually improve application performance. Isolation technology will make cloud native infrastructure better.

VI. Summary

That concludes the main content of this article. Here is a brief summary:

A "secure container" is a container runtime technology that provides a complete operating system execution environment (usually the Linux ABI) for container applications while isolating the application's execution from the host operating system. It prevents the application from directly accessing host resources and thereby provides additional protection between containers and the host, and between containers.

Kata Containers is an open source secure container project that uses virtualization to provide the isolation layer and is fully compatible with cloud native ecosystems such as Kubernetes. The project is hosted by the OpenStack Foundation and jointly led by Ant Financial Services Group and Intel.

gVisor is a secure container technology based on process-level virtualization. It was developed and open-sourced by Google and implements a compatible kernel in user mode in Go.

Finally, the isolation that secure containers provide is not only a matter of security; it also provides isolation in terms of performance, scheduling, and management.
