This article walks through what happens when kubectl creates a Pod. Many people run into trouble here in real-world operations, so let's follow the request step by step, from the command line all the way to a running container. I hope you read it carefully and get something out of it!
1. Kubectl validation and generators
When the enter key is pressed, kubectl first performs some client-side validation to ensure that invalid requests (such as creating an unsupported resource or using a malformed image name) fail fast and are never sent to kube-apiserver. This keeps unnecessary load off the cluster.
After validation passes, kubectl builds the HTTP request it will send to kube-apiserver. All requests that access or change the state of the Kubernetes system go through kube-apiserver, which in turn talks to etcd, and kubectl is no exception. kubectl uses generators to construct the HTTP request; a generator is an abstraction that takes care of serialization.
With kubectl run you can deploy not just Deployments but a variety of other resource types as well, by specifying the --generator parameter. If no value is given for --generator, kubectl infers the resource type automatically.
For example, resources with the parameter --restart-policy=Always are deployed as Deployments, while resources with --restart-policy=Never are deployed as Pods. kubectl also checks whether other actions need to be triggered, such as recording the command (for rollbacks or auditing).
Once kubectl determines that it is going to create a Deployment, it uses the DeploymentV1Beta1 generator to build a runtime object from the parameters we provided.
API version negotiation and API group
To make it easier to remove fields or restructure resources, Kubernetes supports multiple API versions, each under a different API path, such as /api/v1 or /apis/extensions/v1beta1. Different API versions indicate different levels of stability and support; a more detailed description can be found in the Kubernetes API overview.
API groups exist to categorize similar resources so that the Kubernetes API is easier to extend. The group name is specified in the REST path or in the apiVersion field of a serialized object. For example, the API group of Deployment is apps and (at the time this was written) its latest version was v1beta2, which is why you type apiVersion: apps/v1beta2 at the top of a Deployment manifest.
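For example, a minimal Deployment manifest from that era might look like the following sketch (the name and image are placeholders for illustration):

apiVersion: apps/v1beta2          # API group "apps", version "v1beta2"
kind: Deployment
metadata:
  name: nginx-deployment          # hypothetical name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.13         # hypothetical image tag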
After kubectl has generated the runtime object, it starts to find the appropriate API group and API version for it, and then assembles a versioned client that knows the various REST semantics of the resource. This discovery phase is called version negotiation: kubectl scans the /apis path on the remote API to retrieve all possible API groups. Because kube-apiserver exposes its schema document (in OpenAPI format) at this path, it is easy for clients to discover the right API.
To improve performance, kubectl caches the OpenAPI schema in the ~/.kube/cache directory. If you want to see this API discovery in action, delete that directory and run a kubectl command with the -v flag turned up to its maximum; you will then see all the HTTP requests that try to find these API versions. Refer to the kubectl cheat sheet.
The final step is to actually send the HTTP request. Once the request gets a successful response, kubectl prints a success message in the desired output format.
Client authentication
Client authentication has to happen before the HTTP request can be sent. We skipped over it above, so let's look at it now.
In order to send the request successfully, kubectl needs to authenticate first. The user credentials are stored in a kubeconfig file, and kubectl locates that file in the following order (a minimal kubeconfig is sketched after the list):
If the --kubeconfig parameter is provided, kubectl uses the kubeconfig file it points to.
Otherwise, if the environment variable $KUBECONFIG is set, the kubeconfig file it names is used.
Otherwise, kubectl falls back to the default kubeconfig file, $HOME/.kube/config.
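For reference, a minimal kubeconfig is sketched below; the cluster name, server address, and credential paths are hypothetical:

apiVersion: v1
kind: Config
current-context: my-context                  # the context kubectl uses by default
clusters:
- name: my-cluster
  cluster:
    server: https://1.2.3.4:6443             # hypothetical apiserver address
    certificate-authority: /path/to/ca.crt
contexts:
- name: my-context
  context:
    cluster: my-cluster
    user: my-user
users:
- name: my-user
  user:
    client-certificate: /path/to/client.crt  # x509 client credentials
    client-key: /path/to/client.key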
After parsing the kubeconfig file, kubectl determines the context to use, the cluster it is currently pointing at, and any authentication information associated with the current user. If the user provides extra flags (for example --username), they take precedence over the values in kubeconfig. With this information in hand, kubectl populates the HTTP request it is about to send:
x509 certificates are sent using tls.TLSConfig (this includes the CA certificate).
Bearer tokens are sent in the Authorization HTTP request header.
A username and password are sent via HTTP basic authentication.
The OpenID authentication process is handled manually by the user beforehand and produces a token, which is sent like a bearer token.
2. Kube-apiserver authentication
Now that our request has been sent successfully, what happens next? This is where kube-apiserver makes its debut! kube-apiserver is the main interface used by clients and system components to store and retrieve cluster state. To do its job, kube-apiserver must first be able to verify that the requester is who it claims to be, a process known as authentication.
So how does apiserver authenticate the request? When kube-apiserver starts, it looks at all the CLI flags provided by the user and assembles a list of suitable authenticators.
For example: if the --client-ca-file flag is provided, the x509 client certificate authenticator is added to the list; if the --token-auth-file flag is provided, the bearer token authenticator is added to the list.
"each time a request is received, apiserver authenticates through the token chain until a certain authentication is successful:"
The x509 handler verifies that the HTTP request is encoded with a TLS key signed by the CA root certificate.
The bearer token handler verifies that the token supplied with the request exists in the file provided by the --token-auth-file flag.
The basic authentication handler ensures that the HTTP request's basic auth credentials match its locally configured state.
If every authenticator fails, the request fails and an error is returned; if authentication succeeds, the Authorization header is removed from the request and the user information is added to its context. This gives subsequent authorization and admission controllers access to the previously established identity of the user.
Authorization
OK, now that the request has been sent and kube-apiserver has successfully verified who we are, we are finally free!
However, we are not done yet. We have proved who we are, but do we have permission to perform this operation? After all, identity and permission are not the same thing. Before it can proceed, kube-apiserver must also authorize the user.
kube-apiserver handles authorization much like authentication: it is configured through the --authorization-mode startup flag, which assembles a chain of authorizers that is consulted for every incoming request. If every authorizer denies the request, it fails with an error and goes no further; if any authorizer approves it, the request proceeds.
kube-apiserver currently supports the following authorization modes:
Webhook: interacts with an HTTP(S) service outside the cluster.
ABAC: enforces policies defined in a static file.
RBAC: uses the rbac.authorization.k8s.io API group to make authorization decisions, allowing administrators to configure policies dynamically through the Kubernetes API (a minimal example follows this list).
Node: ensures that a kubelet can only access resources on its own node.
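As a rough sketch of what an RBAC policy looks like (the role name, namespace, and user are made up for illustration), a Role plus RoleBinding that lets a user create Deployments might be:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: default
  name: deployment-creator            # hypothetical role name
rules:
- apiGroups: ["apps", "extensions"]
  resources: ["deployments"]
  verbs: ["create", "get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  namespace: default
  name: bind-deployment-creator
subjects:
- kind: User
  name: jane                          # hypothetical user taken from the kubeconfig credentials
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: deployment-creator
  apiGroup: rbac.authorization.k8s.io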
Admission control
Having made it past the two gates of authentication and authorization, can the client's request finally get a real response from the API Server? The answer is: not yet!
From kube-apiserver's point of view, it has verified our identity and granted us permission to continue, but from Kubernetes' point of view, other components still get a say in whether this should be allowed to happen. So the request must also pass through a chain of checks controlled by admission controllers. Kubernetes ships with a set of standard admission controllers, and the chain can be customized and extended.
While authorization answers whether the user has permission, admission controllers intercept the request to ensure it meets the broader expectations and rules of the cluster. They are the last bastion before a resource object is saved to etcd, encapsulating a series of extra checks to make sure the operation does not produce unexpected or negative results. Unlike authorization and authentication, which only care about the requesting user and the operation, admission control also looks at the content of the request, and it applies only to create, update, delete, or connect (for example, proxy) operations, not to reads.
Admission controllers work much like authenticators and authorizers, with one difference: unlike the authentication and authorization chains, if a single admission controller check fails, the whole chain is short-circuited, the request is rejected immediately, and an error is returned to the end user.
Admission controllers are designed with extensibility in mind: each controller lives as a plugin in the plugin/pkg/admission directory, implements a small interface, and is compiled into the kube-apiserver binary.
Most admission controllers are fairly easy to understand, so let's focus on three of them: SecurityContextDeny, ResourceQuota, and LimitRanger.
SecurityContextDeny: this plugin forbids the creation of Pods that set a SecurityContext.
ResourceQuota: limits both the number of objects created in a Namespace and the total amount of resources requested by Pods in that Namespace. This admission controller implements quota management together with the ResourceQuota resource object.
LimitRanger: similar to ResourceQuota above, but enforces limits on individual objects in a Namespace (Pod, Container, and so on). This plugin implements quota management together with the LimitRange resource object (both objects are sketched below).
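As a rough illustration (the namespace and values are made up), the two resource objects these admission controllers work with look like this:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: dev                 # hypothetical namespace
spec:
  hard:
    pods: "10"                   # at most 10 Pods in this namespace
    requests.cpu: "4"
    requests.memory: 8Gi
---
apiVersion: v1
kind: LimitRange
metadata:
  name: container-limits
  namespace: dev
spec:
  limits:
  - type: Container
    default:                     # limits applied to containers that specify none
      cpu: 500m
      memory: 256Mi
    defaultRequest:
      cpu: 100m
      memory: 128Mi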
3. Etcd
By now Kubernetes has thoroughly vetted the client's request and allowed it to proceed. In the next step, kube-apiserver deserializes the HTTP request, constructs runtime objects from the result (somewhat like the inverse of kubectl's generator), and saves them to etcd. Let's break this down.
How does kube-apiserver know what to do when it receives the request? In fact, a fairly elaborate chain of handlers is set up long before any client sends a request. Let's start from the moment the kube-apiserver binary is first run:
When the kube-apiserver binary is run, it creates a server chain, which allows apiserver aggregation. This is one way of extending the Kubernetes API.
A generic apiserver is also created as the default implementation.
The generated OpenAPI schema is then used to populate the apiserver's configuration.
kube-apiserver then iterates over all the API groups specified in its data structures and configures a generic storage abstraction backed by etcd for each of them. kube-apiserver talks to these when you access or change the state of a resource.
For every API group it also iterates over each group version and installs the REST mappings for every HTTP route.
When the request METHOD is POST, kube-apiserver hands the request to the resource creation handler.
kube-apiserver now knows every route and its corresponding REST path, so it knows which handlers and storage backends to call when a request matches. What a clever design! Now assume the client's HTTP request has arrived at kube-apiserver:
If the handler chain can match the request to a registered route, the request is handed to the dedicated handler registered for that route. If no route matches, the request falls through to a path-based handler (for example, for calls to /apis); and if no path-based handler is registered for that path either, the request is forwarded to the not-found handler and a 404 is returned.
Fortunately, we do have a registered route, whose handler is called createHandler! What does it do? First it decodes the HTTP request and performs basic validation, such as ensuring that the JSON in the request is consistent with the versioned API resource.
Then we enter the stage of audit and admission control.
The resource is then saved to etcd via a storage provider. By default the key saved to etcd has the format <namespace>/<name>, but this is configurable.
Any errors that occur during resource creation are caught, and finally the storage provider performs a get call to confirm that the resource was actually created. If additional finalization is required, any post-create handlers and decorators are invoked.
Finally, the HTTP response is constructed and returned to the client.
It turns out apiserver does quite a lot of work that we never noticed. At this point the Deployment resource we created has been saved to etcd, but apiserver still cannot fully see it.
4. Initialization
After a resource object is persisted to the data store, apiserver cannot fully see it or schedule it until a series of Initializers has run. An Initializer is a controller associated with a resource type that executes some logic before the resource becomes available to the outside world. If a resource type has no Initializers, this initialization step is skipped and the resource becomes visible immediately.
As others have pointed out, Initializers are a powerful feature because they allow us to perform generic bootstrap operations. For example:
Inject a proxy sidecar container into every Pod that exposes port 80, or add a specific annotation.
Inject a volume holding test certificates into all Pods in a particular namespace.
Prevent the creation of a Secret whose password is shorter than 20 characters.
The initializerConfiguration resource object allows you to declare which Initializers should be run for certain resource types. If you want to run a custom Initializers every time you create a Pod, you can do this:
apiVersion: admissionregistration.k8s.io/v1alpha1
kind: InitializerConfiguration
metadata:
  name: custom-pod-initializer
initializers:
  - name: podimage.example.com
    rules:
      - apiGroups:
          - ""
        apiVersions:
          - v1
        resources:
          - pods
Once the InitializerConfiguration resource above is created, custom-pod-initializer is appended to the metadata.initializers.pending field of every new Pod. The initializer controller periodically scans for new Pods; as soon as it detects its own name in a Pod's pending field, it executes its logic and then removes its name from that field.
Only the first Initializer in the pending list may operate on the resource. When all Initializers have finished and the pending field is empty, the object is considered initialized.
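So a freshly created Pod might briefly carry metadata along these lines (a sketch with a hypothetical Pod name; only the relevant fields are shown):

apiVersion: v1
kind: Pod
metadata:
  name: my-pod                        # hypothetical name
  initializers:
    pending:
    - name: custom-pod-initializer    # removed once this initializer's logic has run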
You may notice a problem: if the kube-apiserver cannot display these resources, how does the user-level controller handle the resources?
To solve this problem, kube-apiserver exposes an includeUninitialized query parameter that returns all resource objects (including uninitialized ones).
5. Control loops
Deployments controller
At this stage, our Deployment record exists in etcd and all initialization logic has run; the next phases deal with the resource topology the Deployment depends on. In Kubernetes, a Deployment is really just a collection of ReplicaSets, and a ReplicaSet is a collection of Pods. So how does Kubernetes create this hierarchy of resources from a single HTTP request? All of this work is done by Kubernetes' built-in Controllers.
Kubernetes uses many Controllers throughout the system. A Controller is an asynchronous loop that reconciles the system from its current state toward the desired state. All Controllers run in parallel inside the kube-controller-manager component, and each is responsible for a specific control flow. Let's start with the Deployment Controller:
Once the Deployment record is stored in etcd and initialized, it becomes visible through kube-apiserver, and the Deployment Controller detects it (its job is to watch for changes to Deployment records). In our example, the controller registers a specific callback for create events via an Informer (see below for details).
When the Deployment first becomes visible, the Controller adds the resource object to its internal work queue and then starts reconciling it:
It checks whether the Deployment has any ReplicaSet or Pod records associated with it by querying kube-apiserver with a label selector.
Interestingly, this synchronization process is state-agnostic: it checks new records the same way it checks existing ones.
Finding no ReplicaSet or Pod records associated with the Deployment, the Deployment Controller begins a scale-up process:
It creates a ReplicaSet resource, assigns it a label selector, and sets its revision number to 1.
The ReplicaSet's PodSpec and other related metadata are copied from the Deployment's manifest. Sometimes the Deployment record needs to be updated afterwards as well (for instance, if a progress deadline is set).
When the above steps are completed, the Deployment's status is updated and the controller re-enters the same loop as before, waiting for the Deployment to match the desired state. Since the Deployment Controller only cares about ReplicaSets, reconciliation continues with the ReplicaSet Controller.
ReplicaSets controller
In the previous step, the Deployment Controller created the first ReplicaSet, but there are still no Pods, so it's time for the ReplicaSet Controller to take the stage! Its job is to monitor the lifecycle of ReplicaSets and their dependent resources (Pods). Like most other Controllers, it does this by triggering handlers on certain events.
When the ReplicaSet is created (created by Deployment Controller), the RS Controller checks the state of the new ReplicaSet and checks for deviations between the current state and the desired state, and then adjusts the number of copies of the Pod to achieve the desired state.
Pods are created in batches, starting with SlowStartInitialBatchSize and doubling on each successful iteration (a "slow start" operation). The goal is to reduce the risk of kube-apiserver being swamped by a large number of unnecessary HTTP requests when many Pods fail to start (for example, due to resource quotas). If creation is going to fail, it is better to fail gracefully with minimal impact on other system components!
Kubernetes builds a strict resource object hierarchy through Owner References (a field on a child resource that references the ID of its parent). This ensures that once a Controller-managed resource is deleted, its children are removed by the garbage collector (cascading deletion), and it gives parent resources an effective way to avoid fighting over the same child (imagine two sets of parents who both think they own the same child).
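For example, a ReplicaSet created by a Deployment carries metadata roughly like this sketch (the names and UID are placeholders):

apiVersion: apps/v1beta2
kind: ReplicaSet
metadata:
  name: nginx-deployment-12345678                 # hypothetical generated name
  ownerReferences:
  - apiVersion: apps/v1beta2
    kind: Deployment
    name: nginx-deployment                        # the parent resource
    uid: 7d1a3b6e-0000-0000-0000-000000000000     # placeholder UID
    controller: true
    blockOwnerDeletion: true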
Another benefit of Owner References is that they are stateful: if any Controller restarts, the topology of the resource objects is unaffected, because it is independent of the Controller itself. This focus on isolation is also reflected in the design of Controllers themselves: a Controller must not operate on resources it does not explicitly own; it should claim ownership and neither interfere with nor share other controllers' resources.
Sometimes there are orphaned resources in the system, which are usually generated in the following two ways:
Parent resources are deleted, but child resources are not deleted
Garbage collection policy prohibits deletion of child resources
When this happens, Controller will ensure that the orphan resource has a new Owner. Multiple parent resources can compete with each other for the same orphan resource, but only one will succeed (other parent resources will receive validation errors).
Informers
You may have noticed that some Controllers (the RBAC authorizer or the Deployment Controller, for example) need to retrieve cluster state before they can work. Take the RBAC authorizer: when a request comes in, it uses a cached snapshot of the user's state and then retrieves all the Roles and RoleBindings associated with that user. So how do Controllers access and modify these resource objects? Kubernetes solves this with the Informer mechanism.
An Informer is a pattern that allows a Controller to look up data cached in local memory (the cache is maintained by the Informer itself) and to list the resources it is interested in.
Although the Informer is an abstraction, it implements a lot of detail internally (such as caching). Caching matters because it reduces direct calls to the Kubernetes API and cuts down a lot of duplicated work on both the server and the Controllers. By using Informers, different Controllers can also interact with resources in a thread-safe way, without worrying about conflicts when multiple threads touch the same resource.
For more detailed parsing of Informer, please refer to this article: Kubernetes: Controllers, Informers, Reflectors and Stores
Scheduler
Once all the Controllers have run, one Deployment, one ReplicaSet, and three Pod resource records are stored in etcd and visible through kube-apiserver. However, the Pods are still in the Pending state because they have not yet been scheduled onto a suitable Node in the cluster. This is ultimately resolved by the Scheduler.
The Scheduler runs on the cluster control plane as a standalone component and works like any other Controller: it watches for events and adjusts the actual state toward the desired state. Specifically, the Scheduler binds each pending Pod to a suitable Node in the cluster according to a specific algorithm and scheduling policy, and writes the binding information back through kube-apiserver (it filters for Pods whose NodeName field in the PodSpec is empty). The default scheduling algorithm works as follows:
When the Scheduler starts, a chain of default predicate (pre-selection) policies is registered; these predicates evaluate candidate nodes to determine whether they can host the Pod being scheduled. For example, if the PodSpec declares CPU and memory requests (see the snippet after these steps), a node is rejected when its remaining capacity cannot satisfy them (remaining capacity = the node's total resources minus the sum of the CPU and memory requests of all containers already running on that node).
Once the candidate nodes that satisfy the predicates have been selected, a set of priority policies scores each of them, the candidates are ranked, and the highest-scoring node wins. For example, to spread workloads across the system, the priority functions favour the node with the lowest resource usage. Each node accumulates a score as it passes through the priority functions, and the node with the highest total score is chosen.
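The resource requests that the predicates sum up come straight from the PodSpec; here is a sketch with illustrative values:

spec:
  containers:
  - name: app                # hypothetical container name
    image: nginx:1.13
    resources:
      requests:              # what the scheduler sums against node capacity
        cpu: 250m
        memory: 256Mi
      limits:
        cpu: 500m
        memory: 512Mi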
Once a suitable node has been found, the Scheduler creates a Binding object whose Name and UID match the Pod's and whose ObjectReference field contains the name of the selected node, then sends it to apiserver with a POST request.
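The Binding object sent to apiserver looks roughly like this sketch (the Pod and node names are placeholders):

apiVersion: v1
kind: Binding
metadata:
  name: nginx-deployment-12345678-abcde   # must match the Pod's name
  namespace: default
target:
  apiVersion: v1
  kind: Node
  name: node-1                             # the node chosen by the scheduler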
When kube-apiserver receives this Binding object, the registry deserializes it and updates the following fields on the Pod resource:
It sets NodeName to the name given in the ObjectReference.
It adds the relevant annotations.
It sets the status of the PodScheduled condition to True. You can check this with kubectl:
$ kubectl get pod <PODNAME> -o go-template='{{range .status.conditions}}{{if eq .type "PodScheduled"}}{{.status}}{{end}}{{end}}'
Once Scheduler dispatches the Pod to a node, the node's Kubelet takes over the Pod and starts deployment.
Both the predicate and priority policies can be extended with the --policy-config-file parameter, and a custom scheduler can be deployed if the default one does not meet your needs. If podSpec.schedulerName is set to another scheduler, Kubernetes hands scheduling of that Pod over to that scheduler.
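For example, a Pod can opt into a custom scheduler simply by setting that field; a sketch with a hypothetical scheduler name:

apiVersion: v1
kind: Pod
metadata:
  name: custom-scheduled-pod            # hypothetical name
spec:
  schedulerName: my-custom-scheduler    # the default scheduler will ignore this Pod
  containers:
  - name: app
    image: nginx:1.13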
6. Kubelet
Pod synchronization
Now that all the Controllers have finished their work, let's summarize:
The HTTP request passed the authentication, authorization, and admission control phases.
One Deployment, one ReplicaSet, and three Pod resources were persisted to etcd.
A series of Initializers then ran.
Finally, each Pod is scheduled to the appropriate node.
So far, however, all of these state changes exist only as resource records in etcd. The next steps involve actually distributing the Pods across worker nodes and running them there, which is the whole point of a distributed system like Kubernetes. This work is done by the Kubelet component, so let's get started!
In a Kubernetes cluster, a Kubelet service process is started on each Node node, which is used to handle tasks sent by Scheduler to this node and manage the life cycle of Pod, including mounting volumes, container logging, garbage collection, and other Pod-related events.
If you shift your perspective, you can think of Kubelet as a special kind of Controller: every 20 seconds (this is configurable) it asks kube-apiserver for the list of Pods whose NodeName matches the node it runs on. Once it has the list, it compares it with its own internal cache to detect newly added Pods, and starts synchronizing the list if there are any differences. Let's look at the synchronization process in detail:
If Pod is being created, Kubelet records some metrics used in Prometheus to track Pod startup latency.
It then generates a PodStatus object, which represents the state of the Pod's current Phase. The Phase is the most concise summary of where a Pod is in its lifecycle, and takes the values Pending, Running, Succeeded, Failed, and Unknown. Generating this status is quite involved, so it is worth understanding what happens behind the scenes:
First, a chain of Pod sync handlers (PodSyncHandlers) runs serially; each handler checks whether the Pod should still be running on this node. If any of them decides the Pod should not be running here, the Pod's Phase changes to PodFailed and the Pod is evicted from the node. For example, when a Job's Pod has been retrying for longer than spec.activeDeadlineSeconds allows, the Pod is evicted.
Next, the Phase value of Pod is determined by the state of both the init container and the application container. Because the container has not been started yet, the container is considered to be in the waiting phase, and if at least one container in the Pod is in the waiting phase, its Phase value is Pending.
Finally, the Pod's Conditions are determined by the status of all containers in the Pod. Since our containers have not yet been created by the container runtime, the PodReady condition is set to False. You can check this with kubectl:
$ kubectl get pod <PODNAME> -o go-template='{{range .status.conditions}}{{if eq .type "Ready"}}{{.status}}{{end}}{{end}}'
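Putting this together, the status block of our Pod at this point looks roughly like the following sketch:

status:
  phase: Pending                 # at least one container is still waiting
  conditions:
  - type: PodScheduled
    status: "True"               # set as a result of the scheduler's Binding
  - type: Ready
    status: "False"              # containers have not been created yet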
After the PodStatus is generated (the status field in Pod), Kubelet sends it to the state manager of Pod, which is tasked with asynchronously updating records in etcd through apiserver.
Next, a series of admission handlers runs to ensure that the Pod has the right permissions (including enforcing the AppArmor profile and NO_NEW_PRIVS); Pods rejected at this stage stay in the Pending state.
If the --cgroups-per-qos flag was passed when Kubelet started, Kubelet creates a cgroup for the Pod and applies the corresponding resource limits. This makes quality-of-service (QoS) management of Pods easier.
It then creates the relevant directories for the Pod, including the Pod directory (by default /var/run/kubelet/pods/<podID>), its volumes directory (<podDir>/volumes), and its plugins directory (<podDir>/plugins).
The volume manager mounts the volumes defined in Spec.Volumes and waits for the mounts to succeed. Depending on the volume type, some Pods may have to wait longer (NFS volumes, for example).
All Secrets referenced in Spec.ImagePullSecrets are retrieved from apiserver so that they can later be injected into the container.
Finally, the container is started through the Container Runtime Interface (CRI), described in more detail below.
CRI and pause containers
At this stage, a great deal of initialization has been done and the container is ready to be started. The container itself is started by the container runtime, such as Docker or rkt.
To make the runtime easier to swap out, Kubelet has interacted with container runtimes through the Container Runtime Interface (CRI) since version 1.5.0. In short, CRI provides an abstraction between Kubelet and a specific runtime, communicating via protocol buffers (roughly, a faster JSON) and a gRPC API (well suited to Kubernetes-style operations). This is a very cool idea: by relying on a contract between Kubelet and the runtime, the exact details of how containers are orchestrated become largely irrelevant, and since no core Kubernetes code needs to change, developers can add new runtimes with minimal overhead.
Back to the container startup phase. When a Pod is first started, Kubelet invokes RunPodSandbox over the RPC protocol. A sandbox describes a set of containers; in Kubernetes it represents a Pod. Sandbox is deliberately a broad term, so the concept also makes sense for runtimes that are not container-based (in a hypervisor-based runtime, for example, a sandbox might be a virtual machine).
The runtime in our example is Docker, and creating the sandbox first creates the pause container. The pause container serves as the base container for all the other containers in the same Pod, providing each business container with a set of Pod-level resources, namely the Linux namespaces (including the network, IPC, and PID namespaces).
The pause container provides a way to manage all these namespaces and lets the business containers share them. The advantage of sharing the same network namespace is that containers in the same Pod can talk to each other over localhost. The second role of the pause container is tied to how the PID namespace works: processes in a PID namespace form a tree, and when a child becomes an "orphan" because its parent terminates abnormally, it is adopted by the init process, which eventually reaps it and reclaims its resources. More on how pause works can be found in The Almighty Pause Container.
Once the pause container is created, the next step is to check the disk status and start the business container.
CNI and Pod networks
Now our Pod has a basic skeleton: a pause container that shares all namespaces to allow business containers to communicate in the same Pod. But now there is another question, that is, how is the network of the container established?
When Kubelet creates the network for a Pod, it delegates the task to a CNI plugin. CNI stands for Container Network Interface, and it works much like the Container Runtime Interface: it is an abstraction that lets different network providers supply different network implementations to containers. Kubelet feeds data from a JSON configuration file (under /etc/cni/net.d by default) to the relevant CNI binary (under /opt/cni/bin by default); the plugin then configures the network for the pause container, and the other containers in the Pod reuse the pause container's network. Here is a simple example configuration file:
{"cniVersion": "0.3.1", "name": "bridge", "type": "bridge", "bridge": "cnio0", "isGateway": true, "ipMasq": true, "ipam": {"type": "host-local", "ranges": [[{"subnet": "${POD_CIDR}"]] "routes": [{"dst": "0.0.0.0max 0"}]}}
The CNI plug-in also specifies additional metadata, including Pod names and namespaces, for Pod through the CNI_ARGS environment variable.
The exact steps vary from one CNI plugin to another; here is what the bridge plugin does:
The plug-in first sets up a local Linux bridge in the root network namespace (that is, the host's network namespace) to serve all containers on that host.
It then inserts a network interface (one end of the veth device pair) into the network namespace of the pause container and connects the other end to the bridge. You can understand the veth device pair this way: it is like a long pipe, one end is connected to the container, the other end is connected to the root network namespace, and the packet is propagated in the pipe.
Next, the IPAM Plugin specified in the json file assigns an IP to the network interface of the pause container and sets the corresponding route, and now Pod has its own IP.
IPAM Plugin works like CNI Plugin: through binaries and with standardized interfaces, each IPAM Plugin must determine the IP, subnet, gateway and route of the container network interface and return the information to the CNI plug-in. The most common IPAM Plugin is host-local, which assigns IP addresses to containers from a predefined set of address pools. It stores the address pool information and allocation information in the host's file system, thus ensuring the uniqueness of the IP address of each container on the same host.
Finally, Kubelet passes the cluster IP addresses of the in-cluster DNS servers to the CNI plug-in, which writes them into the container's /etc/resolv.conf file.
Once the above steps are completed, the CNI plug-in returns the result of the operation to Kubelet in the format of json.
Cross-host container network
So far, we have described how containers communicate with hosts, but how do containers communicate across hosts?
Typically, an overlay network is used for cross-host container communication; it is a way to dynamically synchronize routes between multiple hosts. One of the most commonly used overlay network plugins is flannel; you can refer to the CoreOS documentation for how it works.
Container start
After all the networks are configured, the business container is actually started!
Once the sandbox has been initialized and is active, Kubelet can begin creating containers inside it. It first starts the init containers defined in the PodSpec and then starts the business containers themselves. The process is as follows:
First, pull the image of the container. If it is an image of a private repository, the Secret specified in PodSpec will be used to pull the image.
The container is then created through the CRI. Kubelet fills a ContainerConfig data structure from the PodSpec (defining the command, image, labels, mounts, devices, environment variables, and so on) and sends it over the CRI via protobufs. For Docker, this information is deserialized into Docker's own configuration and sent to the dockerd daemon. Along the way, some metadata labels (such as the container type, log path, and sandbox ID) are added to the container.
Next, the CPU manager constrains the container. This is a new alpha feature in Kubelet 1.8 that uses the UpdateContainerResources CRI method to assign the container to CPUs on the node.
Finally, the container is actually started.
If container lifecycle hooks (Hooks) are configured for the Pod, they run after the container starts. There are two kinds of Hook: Exec (run a command) and HTTP (send an HTTP request). If a PostStart Hook takes too long, hangs, or fails, the container never reaches the Running state.
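A sketch of what such a hook looks like in a Pod manifest (the container name and command are made up for illustration):

spec:
  containers:
  - name: app                          # hypothetical container
    image: nginx:1.13
    lifecycle:
      postStart:
        exec:                          # Exec-type hook; an httpGet handler could be used instead
          command: ["/bin/sh", "-c", "echo started > /tmp/started"]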
That's all for "how kubectl creates a Pod". Thank you for reading!