CRI + shimv2: A New Way for Kubernetes to Integrate Container Runtimes

2025-02-22 Update From: SLTechnology News&Howtos


The current focus of the Kubernetes project is to expose more interfaces and extensibility mechanisms to developers and users, and to delegate more user requirements to the community. Among these, the most mature and important interface is CRI. In 2018, the shimv2 API, led by the containerd community, brought users a more mature and convenient way to integrate their own container runtimes on top of CRI.

This talk covers the design and implementation of key technical features including the Kubernetes interface design, CRI, container runtimes, shimv2, and RuntimeClass, and uses KataContainers as an example to demonstrate them. This article is based on the transcript of Zhang Lei's talk at KubeCon + CloudNativeCon 2018.

Zhang Lei

Alibaba senior technical expert

Senior member of the Kubernetes community and project co-maintainer

Zhang Lei is a senior technical expert at Alibaba Group, a senior member of the Kubernetes community, and a project co-maintainer. He focuses on the Container Runtime Interface (CRI), scheduling, resource management, and virtualization-based container runtimes, and is jointly responsible for engineering work on upstream Kubernetes and Alibaba's large-scale cluster management system. He previously worked at Microsoft Research (MSR) and on the KataContainers team, and has been a popular speaker at KubeCon.

Today I'd like to share how Containerd, KataContainers, and Kubernetes are used and integrated together. Hello everyone, I'm Zhang Lei, and I currently work at Alibaba Group. Since we're going to talk about the Kubernetes project today, let's first take a brief look at how Kubernetes works.

How Kubernetes works

As most of you know, the top layer of the Kubernetes project is the control plane, which many people also call the Master node. When you submit a workload, that is, your application, to Kubernetes, the first component that handles it is the API server, which saves your application in etcd as an API object.

In Kubernetes, the orchestrator is the Controller Manager, which runs a set of controllers in control loops. Orchestration is done through these control loops, which create the Pods these applications need. Note that it creates Pods, not individual containers.

Once a Pod appears, the Scheduler watches for changes to new Pods. When it finds a new Pod, the Scheduler runs all the scheduling algorithms and writes the result, a node name, into the Pod object's NodeName field. This is the so-called bind operation.

It then writes the bind result back into etcd; that is the Scheduler's working process. So what is the end result of all this activity in the control plane? One of your Pods is bound to a Node. This is called scheduling.

And what about the Kubelet? It runs on every node. The Kubelet watches for changes to Pod objects, and when it finds that a Pod has been bound to a node, and that the bound node is itself, the Kubelet takes over everything else.

So what is the Kubelet actually doing? Quite simply, when the Kubelet gets this information, it calls the Containerd process running on each machine to run every container in the Pod.

Containerd in turn calls runC, so in the end it is actually runC that sets up the namespaces, cgroups, and so on; it does the chroot and "builds" the container the application needs. This, briefly, is how the whole of Kubernetes works.

Linux Container

So at this point you may ask: what exactly is a container? It's actually quite simple. The containers we usually talk about are Linux containers, and a Linux container can be divided into two parts: the container runtime and the container image.

The runtime part is the dynamic view and resource boundary of the process you are running, built for you from namespaces and cgroups. The image you can think of as a static view of the program you want to run: your program, its data, all its dependencies, and all its directory files, packaged together.

When these packages are mounted together via a union mount, we call the result a rootfs. The rootfs is the static view of your entire process; it is how your processes see the world. That is a Linux container.

KataContainer

But today we're going to talk about another kind of container, which is very different from the Linux container above. Its container runtime is implemented with a hypervisor and hardware virtualization, just like a virtual machine. So every KataContainers Pod is a lightweight virtual machine with a complete Linux kernel.

That is why we often say KataContainers provides the same strong isolation as a VM, while its optimization and performance design gives it agility comparable to containers. I'll come back to this point later. As for the image part, KataContainers is no different from Docker: it behaves like a standard Linux container and supports standard OCI images, so that part is exactly the same.

Container security

But you might ask why we need a project like KataContainers at all. The reason is simple: security. In many financial scenarios, encryption scenarios, and even many current blockchain scenarios, we need a secure container runtime, and that is one reason we emphasize KataContainers.

If you are using Docker today, let me ask: how do you use Docker securely? You may have a lot of tricks. For example, you can drop some Linux capabilities to restrict what the runtime can and cannot do.

Second, you can use read-only mount points. Third, you can use tools such as SELinux or AppArmor to protect the container. Yet another way is to reject certain syscalls directly, using seccomp.

But I have to emphasize that all of these measures introduce a new layer between your container and the host, because that layer has to filter and intercept your syscalls. The more layers you pile up here, the worse your container's performance; there is inevitably an additional performance penalty.

More importantly, before doing any of this you need to figure out what to do: which syscalls should you drop? That requires case-by-case analysis of each specific workload, so how am I supposed to tell my users what to do?

So these things are easy to talk about, but few people actually know how to do them in practice. As a result, in 99.99% of cases, people end up running containers inside virtual machines, especially in public cloud scenarios.

For a project like KataContainers, because it uses the same hardware virtualization as a virtual machine and has a separate kernel, the isolation it provides is completely trustworthy, in the same way you trust a VM.

More importantly, since each Pod now has an independent kernel, just like a small virtual machine, the kernel version your container runs is allowed to be completely different from that of the host machine.

That is completely fine, just as in a virtual machine, so that is another reason I emphasize KataContainers: it provides security and multi-tenancy capability.

Kubernetes + security container

So a need naturally arises: how do we run KataContainers in Kubernetes?

To answer that, let's look again at what the Kubelet is doing. The Kubelet has to find a way to call KataContainers just as it calls Containerd, and KataContainers is then responsible for setting up the hypervisor and running that small VM. So we need to think about how to let Kubernetes operate KataContainers reasonably.

Container Runtime Interface (CRI)

This demand is exactly what the Container Runtime Interface, which we have been promoting in the community, is about. We call it CRI.

CRI really has only one function: it describes, for Kubernetes, what operations a container should have and what parameters each operation should take. That is one design principle of CRI. Note, however, that CRI is a container-centric API, and there is no concept of a Pod in it. Keep that in mind.

Why do we say that? Why did we design it this way? Quite simply, we didn't want a project like Docker to have to understand what a Pod is and expose a Pod API. That would be an unreasonable demand. The Pod is always a Kubernetes orchestration concept that has nothing to do with containers, and that's why we made this API container-centric.

Another reason has to do with maintenance: if the Pod concept existed in CRI, then any subsequent change to a Pod feature could force a change in CRI, which is expensive for an interface. So if you take a closer look at CRI, you'll see that it actually consists of some very generic container interfaces.

Here I can roughly divide CRI into Container operations and Sandbox operations. The Sandbox describes the mechanism I use to implement the Pod, so it is actually the only place where the Pod concept touches the container project. For Docker or Linux containers, what the Sandbox maps to at runtime is a so-called infra container: a tiny container used to hold the namespaces of the entire Pod.

However, if Kubernetes uses a Linux container runtime such as Docker, it will not give you Pod-level isolation beyond a layer of Pod-level cgroups. That is one difference. Whereas if you use KataContainers, KataContainers will create a lightweight virtual machine for you at this step.

Next come the Container APIs. For Docker, these start the user containers on the host machine, but that is not the case for Kata. Kata sets up the namespaces needed by these user containers inside the lightweight virtual machine corresponding to the Pod, that is, inside the Sandbox created earlier, instead of starting new containers on the host.

So with this mechanism in place, once the control plane above finishes its work and has scheduled the Pod, the Kubelet side starts creating this Pod, works its way down, and the last step is a call to our so-called CRI. Before that point, there is no concept of a container runtime anywhere in the Kubelet or Kubernetes.

Once you reach this point, if you are using Docker, then it is Dockershim that responds to the CRI requests from Kubernetes. But if you are not using Docker, you have to follow a mode called remote: you need to write a CRI shim of your own to serve these CRI requests. That is the next topic we'll talk about today.

How does CRI Shim work?

What does a CRI shim do? It translates CRI requests into runtime API calls. Let me give an example. Say a Pod contains a container A and a container B. After we submit this to Kubernetes, the CRI calls issued from the Kubelet side look roughly like this sequence: first, run Sandbox foo. If the runtime is Docker, that sets up an infra container, a very small container called foo; if it is Kata, it creates a virtual machine called foo. That is the difference.

Next, when you create and start containers A and B, in Docker these become two containers on the host, but in Kata they become two small namespace sets inside my little virtual machine, inside that Sandbox. That is the difference.

So if you put all of this together, you'll find: OK, I want to run Kata in Kubernetes, so I need to do this work, and at this step I need a CRI shim. I have to find a way to build a CRI shim for Kata.

And we can consider whether an existing CRI shim can be reused. Which ones exist today? For example, the cri-containerd project is a CRI shim for containerd that can respond to CRI requests, so I can translate those requests into Kata operations. That works, and it is also the approach we took: connecting KataContainers to containerd.

At this point it works like this. Containerd has a particular design: it launches a containerd-shim for each container. After you run your workload, if you look at the host you will see a whole row of these containerd-shim processes, one per container.

Now, because Kata is a container runtime with the concept of a Sandbox, Kata needs to maintain the relationship between those shims and Kata itself, so Kata wrote a katashim. It ties these pieces together and translates the way containerd handles containers into Kata requests. That is what we did before.

But you can see there are some problems. The most obvious is that Kata and gVisor both have the concept of a real Sandbox, and given that concept, there should no longer be one shim per container startup, because that brings a lot of extra performance overhead. We don't want one shim per container; we want one shim per Sandbox.

In addition, you'll find that CRI serves Kubernetes: it reports status upward and helps Kubernetes, but it does not help the container runtime. So when you do this integration, you'll find that, especially for VM-based runtimes like gVisor and KataContainers, many of CRI's assumptions and the way its API is written simply don't correspond to them. Your integration work becomes laborious; it's a mismatch.

The last problem is maintenance. With CRI, RedHat, for example, has its own CRI implementation called cri-o, which is essentially no different from containerd: in the end both rely on runC to launch containers. So why do we need both?

We don't know, but as a Kata maintainer I had to write two integrations to connect Kata to both of them. That's troublesome, and it means that if there were 100 kinds of CRI implementation, I'd have to write 100 integrations, all with duplicated functionality.

Containerd ShimV2

So what I'm proposing to you today is called Containerd ShimV2. As we said earlier, CRI determines the relationship between the runtime and Kubernetes. So can we now have a more fine-grained layer of API that determines what the real interface is between my CRI shim and the runtime below it?

That's why ShimV2 appeared. It is a standard interface from the CRI shim down to the containerd runtime. Before, the path went directly from CRI to Containerd to runC; now it doesn't. We go from CRI to Containerd to ShimV2, and from ShimV2 to runC or to KataContainers. What's the advantage of doing this?

Let's take a look. The biggest difference is that in this way, you can specify one shim per Pod. Previously, Containerd directly launched one containerd-shim per container to respond, but the new API is written as containerd-shim start or stop, and how to implement these start and stop operations is up to you.

Now, as a maintainer of the KataContainers project, I can do this: when the start call arrives while I am creating the Sandbox, I start one containerd-shim-v2. But when the subsequent Container API calls from CRI arrive, I no longer start new shims; I reuse the Sandbox already created for you. This gives your implementation a great deal of freedom.

So at this point you'll find the whole implementation approach has changed: Containerd no longer insists on mapping each container to a containerd-shim; that is up to you. My implementation is that I create a containerd-shim-v2 only when creating the Sandbox, and all subsequent container-level operations go to this same containerd-shim-v2, reusing the Sandbox. This is very different from before.

So if you sum up this picture now, the way we implement it is like this:

First, you still use the original CRI support in containerd, but now, on that machine, you install both runC and KataContainers. Next, from the Kata side, we write an implementation called kata-containerd-shimv2. Previously I had to write a big pile of CRI code up front; now I no longer need to.

Now we focus only on how to connect containerd with KataContainers, that is, on implementing the shimv2 API. That is all we have to do. And what we implement here is really just a series of APIs related to running a container.

For example, create and start: those operations are all mapped onto my shimv2 implementation. I'm no longer thinking about how to map and implement all of CRI. That earlier degree of freedom was too large, which led to our current situation with lots of CRI shims floating around. That's actually a bad thing; there are many political and non-technical reasons behind it, and those are not what we as engineers should have to care about. Now you only need to think about how to hook into shimv2.

Next, I'll show you a Demo that calls KataContainers through CRI + containerd shimv2.

Summary

The core design idea of Kubernetes today is to peel complex, intrusive features away from the backbone code and decouple them from the core libraries through interfaces and plugins. In this process, CRI was the earliest plugin interface in the Kubernetes project.

This talk mainly introduced another way to integrate a container runtime based on CRI: CRI + containerd shimv2. In this approach, you no longer need to write a CRI implementation (a CRI shim) for your own container runtime. Instead, you can directly reuse containerd's existing CRI support and connect the specific container runtime (such as runC) through containerd shimv2.

At present, this integration method has become the community's mainstream approach for connecting lower-level container runtimes. Many container projects based on independent kernels or virtualization, such as KataContainers, gVisor, and Firecracker, have also begun to connect seamlessly to Kubernetes through shimv2, with the help of the containerd project.

In fact, PouchContainer itself chose containerd as its main container runtime management engine and implemented an enhanced CRI interface to meet Alibaba's strongly isolated, production-grade container requirements.

So after the release of the shimv2 API in the containerd community, the PouchContainer project took the lead in exploring connecting lower-level container runtimes through containerd shimv2, so as to integrate other kinds of container runtimes, especially virtualized containers, more efficiently.

Since going open source, the PouchContainer team has been actively promoting the development and evolution of the upstream containerd community, and in this CRI + containerd shimv2 shift, PouchContainer is once again at the forefront among container projects.
