
The Core Principles of Kubernetes -- Part 1 (5)


I thought this could be done in one article, but I underestimated my own long-windedness. So the core principles of K8s will have to be introduced over two articles.

I. Analysis of Kubernetes API Server principles

1. Introduction to kubernetes API Server

The core function of the kubernetes API Server is to provide HTTP REST interfaces for creating, deleting, updating, querying and watching all kinds of kubernetes resource objects (pod, RC, service, etc.). It is the data bus and data hub of the whole system. Sometimes when we use kubectl to create or view resources such as pods and get no response, the cause may be that the kube-apiserver process has exited abnormally.

The kubernetes API Server provides its services through a process named kube-apiserver, which runs on the master node. By default the process serves REST requests over HTTP on local port 8080 (over HTTPS it uses port 6443).

The next few operations show how to interact with the kubernetes API Server over REST, which helps in understanding how the components inside K8s communicate:

[root@zy ~] # kubectl cluster-info # View master node information

[root@zy ~] # curl localhost:8080/api # View the version information of kubernetes API

[root@zy ~] # curl localhost:8080/api/v1 # View all the resource objects supported by the v1 API

Of course, we can also access specific resources.

[root@zy ~] # curl localhost:8080/api/v1/pods
[root@zy ~] # curl localhost:8080/api/v1/services
[root@zy ~] # curl localhost:8080/api/v1/replicationcontrollers

When we run kubectl get svc, we will find:

there is an extra service named "kubernetes" in the output. What is it? To let the processes inside pods know the access address of the kubernetes API Server, the API Server itself is also exposed as a service named "kubernetes"; its cluster IP is the first address in the cluster IP address pool, and the port it serves on is 443.
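As a quick check (a hedged sketch; the cluster IP you see depends on your cluster's service address range), you can inspect this built-in service directly:

[root@zy ~] # kubectl get svc kubernetes # the API server's own service in the default namespace
[root@zy ~] # kubectl describe svc kubernetes # shows its cluster IP and port 443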

2. Kubernetes proxy API interface

The kubernetes API Server also provides a very special kind of REST interface, the proxy interface. This interface proxies REST requests: the kubernetes API Server forwards the REST requests it receives to the REST port of the kubelet daemon on a node, and the kubelet process is responsible for responding.

For example:

masterIP:8080/api/v1/proxy/nodes/{node_name}/pods # all pod information on a node
masterIP:8080/api/v1/proxy/nodes/{node_name}/stats # statistics of the node's physical resources
masterIP:8080/api/v1/proxy/nodes/{node_name}/spec # summary information of the node

# next, let's talk about the more important pod-related interfaces

masterIP:8080/api/v1/proxy/namespaces/{namespace}/pods/{pod_name}/{path:*} # access a service interface of the pod
masterIP:8080/api/v1/proxy/namespaces/{namespace}/pods/{pod_name} # access the pod

For example, if there is a Tomcat pod named myweb:

we can access the pod's HTTP service by entering masterIP:8080/api/v1/proxy/namespaces/{namespace}/pods/myweb in a browser.

If it is a service backed by several web pods:

masterIP:8080/api/v1/proxy/namespaces/{namespace}/services/{service_name}

then you can access the service behind it; the request will, of course, eventually be routed to one of the corresponding pods through kube-proxy.
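A rough sketch of calling these proxy interfaces with curl, assuming the insecure 8080 port is enabled as above; the namespace "default" and the names node1, myweb and myweb-service are placeholders:

[root@zy ~] # curl localhost:8080/api/v1/proxy/nodes/node1/pods # pods reported by the kubelet on node1
[root@zy ~] # curl localhost:8080/api/v1/proxy/namespaces/default/pods/myweb/ # proxied to the myweb pod's HTTP service
[root@zy ~] # curl localhost:8080/api/v1/proxy/namespaces/default/services/myweb-service/ # proxied to the service, and from there to one of its pods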

3. Communication between cluster functional modules

As the hub of the cluster, the Kubernetes API Server is responsible for the communication between the cluster's functional modules. Each functional module stores its information in etcd through the API Server; likewise, when a module wants to obtain or manipulate this data, it does so through the API Server's REST interface (GET, LIST, WATCH). This is how the modules interact with one another.

Several typical interaction scenarios:

Scenario 1 (kubelet -- API Server): at regular intervals, the kubelet on each node calls the API Server's REST API to report its own status, and the API Server updates the node information in etcd. The kubelet also watches pod information through the API Server's Watch API: if a new pod replica is scheduled and bound to this node, it creates and starts the pod's containers; if a pod delete operation is observed, it deletes the corresponding pod containers on this node; if a modification is observed, it modifies the pod's containers accordingly.

Scenario 2 (kube-controller-manager -- API Server): the Node Controller module in kube-controller-manager watches Node information in real time through the WATCH interface provided by the API Server and handles it accordingly.

Scenario 3 (scheduler -- API Server): when the Scheduler observes a newly created pod replica through the API Server's Watch API, it retrieves the list of all nodes that meet the pod's requirements, runs the scheduling logic, and binds the pod to the target node once scheduling succeeds.

Having read the scenarios above, you may wonder: many functional modules use the API Server frequently, and this service is so important that it could fail under sustained pressure. To ease the pressure that the modules put on the API Server, a caching mechanism is used: each module periodically obtains resource object information from the API Server and caches it locally, so that a module first reads resource information from its local cache and only goes to the API Server when the information is not available locally.

II. Analysis of Controller Manager principle

Introduction: as the internal management and control center of the cluster, the Controller Manager is responsible for managing Nodes, Pod replicas, Endpoints, namespaces, service accounts, resource quotas and so on. When a Node goes down unexpectedly, the Controller Manager discovers the fault in time and starts an automatic repair process, ensuring that the cluster always stays in the expected working state.

The Controller Manager contains many controllers, each responsible for a specific control process, and the Controller Manager is the core manager of these controllers. Generally speaking, an intelligent or automated system works through a "control loop" mechanism that constantly adjusts the system's working state. In a kubernetes cluster, each controller is such a control loop: through the interfaces provided by the API Server it monitors the current state of every resource object in the cluster in real time, and when failures cause the system state to change, it tries to move the system from the "current state" to the "desired state".

Some of the more important controllers are introduced below.

1. Replication Controller

Before introducing the replication controller, one thing must be emphasized: do not confuse the resource object RC with the replication controller discussed here. The replication controller here is the controller of replicas, while RC is just a resource object; the replication controller is the layer above that manages every RC. Below, "replica controller" refers to the replication controller and "RC" refers to the resource object.

The core job of the replica controller is to ensure that the number of Pod replicas associated with any RC in the cluster stays at the preset value. If the number of Pod replicas exceeds the preset value, the replica controller destroys some of them; if it falls short, it creates new ones until the target value is reached. It is worth noting that the replica controller only manages pods whose restart policy is Always. In general, a pod object does not disappear after it has been created successfully; the only exception is when a pod stays in the Succeeded or Failed state for too long (the timeout can be configured), in which case the pod is automatically reclaimed by the system and the replica controller recreates and starts it on another worker node.

The Pod template in an RC is like a mold: once something made by the mold leaves it, there is no further relationship between the two. Once a pod has been created, changes to the template no longer affect it, and deleting an RC does not affect the pods it created. Of course, if you want to delete all pods controlled by an RC, set the replica count in the RC to 0; only then will all its pods be deleted automatically.

Responsibilities of the replication controller (replica controller):

  - ensure that the number of pods it currently manages equals the preset value
  - scale the system up or down by adjusting the spec.replicas attribute of the RC (see the sketch after this list)
  - implement rolling upgrades by changing the image in the RC's pod template

Usage scenarios of the replication controller (replica controller):

  - rescheduling: whether a node goes down or a pod dies unexpectedly, the RC guarantees that the number of pods it manages stays at the preset value
  - auto scaling: scale the number of replicas up or down according to the cluster's available resources and load
  - rolling upgrade: upgrade the application to a new version while keeping it able to serve requests throughout the upgrade
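As a hedged illustration of these responsibilities (the RC name myweb-rc and the image tag are placeholders; these are the classic commands for the RC-based workflow on older kubectl versions):

[root@zy ~] # kubectl scale rc myweb-rc --replicas=5 # scale up or down by changing spec.replicas
[root@zy ~] # kubectl rolling-update myweb-rc --image=tomcat:9 # rolling upgrade by changing the image in the pod template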

2. Node Controller

When the kubelet process starts, it registers its own node information through the API Server and periodically reports status information to it; the API Server writes this information into etcd. The node information stored in etcd includes node health status, node resources, node name, node addresses, operating system version, docker version, kubelet version and so on. A node's health status has three possible values: Ready (True), NotReady (False) and Unknown.

The core workflow of the Node Controller is as follows:

Concrete steps

 - if --cluster-cidr was set when the controller manager started, a CIDR address is generated for every node whose spec.PodCIDR is not set, and the node's spec.PodCIDR attribute is set to that address.

 - node information is read one by one. The node controller keeps a nodeStatusMap that stores this information; it compares the stored entry with the newly reported node information and updates the map. The node information reported by the kubelet falls into three cases: not reported, reported but unchanged, and changed. The node controller updates nodeStatusMap accordingly, and if no information has been received from a node for a certain period of time, it sets that node's status to "Unknown".

 - finally, nodes in the NotReady state are added to the queue to be deleted; after such a node is deleted, its information is removed from etcd through the API Server. If a node is Ready, its information is synchronized to etcd.
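To see the node information and health conditions this workflow maintains, a minimal check (the node name zy-node1 is a placeholder):

[root@zy ~] # kubectl get nodes # Ready / NotReady / Unknown status of each node
[root@zy ~] # kubectl describe node zy-node1 # conditions, capacity, and the OS/docker/kubelet versions reported by kubelet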

3. ResourceQuota controller

Kubernetes provides an advanced resource quota management function (the resourceQuota controller). Resource quota management ensures that specified resource objects never over-occupy the system's physical resources at any time, avoiding unexpected downtime caused by defects in the design or implementation of some business process; it plays a vital role in the stability of the whole cluster.

Currently, kubernetes supports the following three levels of resource quota management:

 -Container level: restrictions on CPU and memory

 -Pod level: you can limit the resources available to all containers in a pod

 -Namespace level: a resource limit for the namespace (multi-tenant) level, where the limited resources include:

  △ number of Pods

  △ number of RCs

  △ number of Services

  △ number of ResourceQuotas

  △ number of Secrets

  △ number of PVs that can be held
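A minimal sketch of a namespace-level quota covering object counts like those above (the namespace dev and the limit values are placeholders; the keys are the standard ResourceQuota object-count fields, with persistentvolumeclaims used for the PV-related limit):

[root@zy ~] # cat <<EOF | kubectl create -f -
apiVersion: v1
kind: ResourceQuota
metadata:
  name: object-quota
  namespace: dev
spec:
  hard:
    pods: "20"
    replicationcontrollers: "10"
    services: "10"
    resourcequotas: "1"
    secrets: "20"
    persistentvolumeclaims: "10"
EOF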

Quota management in Kubernetes is implemented through admission control. Admission control currently provides two quota constraint mechanisms: LimitRanger and ResourceQuota. LimitRanger acts on pods and containers, while ResourceQuota acts on a namespace and limits the total amount of resources used within that namespace.

There are roughly three routes through this mechanism, and the resourceQuota controller plays an important role in all three:

If the user declares a LimitRanger when defining pods, then whenever the user requests to create or modify a resource object through the API Server, admission control calculates the current quota usage; if the constraint is not met, the creation fails (route 1). For namespaces that define a ResourceQuota, the resourceQuota controller periodically counts the total amount of resources used by all kinds of objects in that namespace. The statistics include the number of instances of objects such as pods, services, RCs, secrets and PVs, as well as the resources (CPU, memory) used by all container instances in the namespace. These results are written to etcd, including the resource object name, the quota value and the usage value. Admission control then uses these statistics to decide whether the quota is exceeded, ensuring that the total amount of resources allocated in the namespace never exceeds the limits of the ResourceQuota (routes 2 and 3).

4. Namespace controller

Users can create a new namespace through the API Server, and it is saved in etcd; the namespace controller periodically reads these namespaces through the API Server. If a namespace is marked for graceful deletion through the API (by setting a deletion grace period), its status is set to "Terminating" and saved to etcd. At the same time, the namespace controller deletes the resource objects under that namespace, such as ServiceAccount, RC, Pod, Secret, PV, LimitRange, ResourceQuota and Event.

When a namespace's status is "Terminating", the NamespaceLifecycle plug-in of the admission controller prevents new resources from being created in it. Once the namespace controller has deleted all resource objects in the namespace, it performs a finalize operation on the namespace, clearing the information in its spec.finalizers field.

There is, of course, a special case: when the namespace controller finds that a namespace has a deletion grace period set and its spec.finalizers field is already empty, it deletes the namespace itself through the API Server.
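A small sketch of observing this flow (the namespace dev is a placeholder):

[root@zy ~] # kubectl delete namespace dev # marks the namespace for graceful deletion
[root@zy ~] # kubectl get namespace dev -o yaml # status.phase shows Terminating until the finalize step completes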

5. Service controller and endpoint controller

Regarding the relationship between services, endpoints and pods: an Endpoints object represents the access addresses of all pod replicas behind a service, and the endpoints controller is responsible for generating and maintaining all Endpoints objects.

It listens for changes to services and their corresponding pod replicas. If a service is deleted, it deletes the Endpoints object with the same name; if a new service is created or an existing one is modified, it obtains the related pod list from the service's information and creates or updates the corresponding Endpoints object; if a pod event is observed, it updates the Endpoints object of the service the pod belongs to (adding, deleting or modifying the corresponding endpoint entries).


We all know that after a service is associated with pods through labels, we can access the service's cluster IP; kube-proxy forwards the request to the corresponding backend endpoint (pod IP + port), and the service running in the container is finally reached. This is how the service's load balancing feature is realized.

So what is the service controller for? It is actually an interface controller between kubernetes and an external cloud platform. The service controller listens for changes to services; if a service is of type LoadBalancer, it ensures that the corresponding load balancer instance on the external cloud platform is created, deleted and updated accordingly (based on the endpoint entries).
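To see the Endpoints objects that the endpoints controller maintains for each service (mysql-service here matches the example used later in this article):

[root@zy ~] # kubectl get endpoints # one Endpoints object per service, with the same name as the service
[root@zy ~] # kubectl describe svc mysql-service # the Endpoints field lists the backend pod IP:port entries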

III. Analysis of Scheduler principle

1. Introduction

The kubernetes scheduler plays a "connecting link" role in the whole system. "Receiving from above" means it accepts the new pods created by the controller manager and arranges a "home" (a target node) for each of them to settle on; "handing off below" means that once placement is done, the kubelet service process on the target node takes over and is responsible for the remainder of the pod's life cycle.

Specifically, the kubernetes scheduler binds the pods waiting to be scheduled (newly created pods or supplemental replicas) to a suitable node in the cluster according to specific scheduling algorithms and policies, and writes the binding information into etcd. The scheduling process involves three objects: the list of pods to be scheduled, the list of available candidate nodes, and the scheduling algorithms and policies. In short, the pods in the pending list are created and started on appropriate nodes through appropriate scheduling algorithms and policies.

The scheduler workflow, in brief, consists of three steps:

 - traverse all candidate nodes and filter out those that meet the requirements; for this, kubernetes has a number of built-in pre-selection (predicate) policies

 - prioritize the candidates: on the basis of the first step, use the priority policies to compute a score for each candidate node; the node with the highest score wins

 - finally, the scheduler notifies the kubelet on the winning node of the pod to be scheduled through the API Server, and the kubelet creates and runs it
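A hedged way to observe the result of this workflow (the pod name myweb is a placeholder):

[root@zy ~] # kubectl get pod myweb -o wide # the NODE column shows which node the scheduler bound the pod to
[root@zy ~] # kubectl describe pod myweb # the Events section records the scheduling decision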

2. Scheduler pre-selection strategy

There are many pre-selection (predicate) policies available in the scheduler: NoDiskConflict, PodFitsResources, PodSelectorMatches, PodFitsHost, CheckNodeLabelPresence, CheckServiceAffinity, PodFitsPorts, and so on. Five of them are the default pre-selection policies: PodFitsPorts, PodFitsResources, NoDiskConflict, PodSelectorMatches and PodFitsHost. Only a node that passes all five of these policies is shortlisted and proceeds to the next stage.

Several common pre-selection policies:

(1) NoDiskConflict

Determines whether the gcePersistentDisk or AWSElasticBlockStore volumes of the candidate pod conflict with those of the pods that already exist on the candidate node. The detection process is as follows:

    - first, read all volume information of the candidate pod and perform the following conflict detection for each volume

    - if the volume is a gcePersistentDisk, compare it with every volume of every pod already on the candidate node; if the same gcePersistentDisk is found, return false to indicate a disk conflict, end the detection, and report to the scheduler that the candidate node is not suitable for this pod. If the volume is an AWSElasticBlockStore, do the same comparison; if the same AWSElasticBlockStore is found, return false, end the detection, and report that the candidate node is not suitable for this pod

    - finally, if no conflict is found for any volume of the candidate pod, return true to indicate there is no disk conflict, and report to the scheduler that the candidate node is suitable for this pod

(2) PodFitsResources

Determines whether the candidate node's resources meet the candidate pod's requirements. The detection process is as follows:

    - calculate the total resources (CPU and memory) required by all containers of the candidate pod plus the pods already on the node

    - obtain the status information of the candidate node, including its resource information

    - if the total resource demand (CPU and memory) of the candidate pod and the pods already on the node exceeds the resources the candidate node owns, return false, meaning the node is not suitable for the pod; otherwise return true

(3) PodSelectorMatches

Determines whether the candidate node carries the labels specified by the candidate pod's node selector:

    - if the pod does not specify a spec.nodeSelector, return true

    - otherwise, obtain the candidate node's labels and check whether they include the labels required by the pod's node selector; return true if they do, false if they do not

(4) PodFitsHost

Determines whether the node name specified in the candidate pod's spec.nodeName field matches the candidate node's name; returns true if they are the same, false otherwise.

(5) PodFitsPorts

Determines whether any port in the port list used by the candidate pod is already occupied on the candidate node; if so, returns false, otherwise returns true.
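A minimal pod sketch showing the fields these predicates look at (all names and values are placeholders): spec.nodeSelector is checked by PodSelectorMatches, spec.nodeName by PodFitsHost, hostPort by PodFitsPorts, and resources.requests by PodFitsResources.

[root@zy ~] # cat <<EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: myweb
spec:
  nodeSelector:
    disktype: ssd          # PodSelectorMatches: the node must carry this label
  # nodeName: zy-node1     # PodFitsHost: uncomment to pin the pod to a named node
  containers:
    - name: myweb
      image: tomcat
      ports:
        - containerPort: 8080
          hostPort: 8080   # PodFitsPorts: this host port must be free on the node
      resources:
        requests:
          cpu: 500m        # PodFitsResources: counted against the node's remaining capacity
          memory: 256Mi
EOF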

3. Scheduler priority (preference) policies

The priority (preference) policies in the Scheduler include LeastRequestedPriority, CalculateNodeLabelPriority, BalancedResourceAllocation and so on. Each candidate node receives a score from each priority policy it is evaluated against; the scores are added up, and the node with the highest total score is selected as the result.

Some commonly used priority policies:

(1) LeastRequestedPriority

This policy selects the node with the lowest resource consumption from the candidate node list:

    - calculate the CPU usage of the pods running on each candidate node plus the candidate pod

    - calculate the memory usage of the pods running on each candidate node plus the candidate pod

    - compute a score for each node according to a specific formula

(2) CalculateNodeLabelPriority

If the user specifies this policy in the configuration, the scheduler registers it through the registerCustomPriorityFunction method. The policy decides whether to prefer a candidate node based on whether the labels listed in the policy exist on that node. If the candidate node's label is in the policy's label list and the policy's presence value is true, or the node's label is not in the list and the presence value is false, then the node's score is 10; otherwise it is 0.

(3) BalancedResourceAllocation

This preference policy selects, from the candidate node list, the node whose utilization is most balanced across resource types:

    - calculate the CPU usage of the pods running on each candidate node plus the candidate pod

    - calculate the memory usage of the pods running on each candidate node plus the candidate pod

    - compute a score for each node according to a specific formula
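For reference, a commonly cited form of these two scoring functions (stated here as an assumption, since the exact formula depends on the Kubernetes version):

LeastRequestedPriority: score = ((cpuCapacity - cpuRequested) * 10 / cpuCapacity + (memCapacity - memRequested) * 10 / memCapacity) / 2
BalancedResourceAllocation: score = 10 - abs(cpuRequested/cpuCapacity - memRequested/memCapacity) * 10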

IV. Analysis of the operation mechanism of Kubelet

In a kubernetes cluster, a kubelet service process runs on every node. This process handles the tasks that the master sends down to the node and manages the pods on the node and the containers inside them. Each kubelet process registers the node's information with the API Server, periodically reports the node's resource usage to the master, and monitors container and node resources through cAdvisor.

1. Node management

A node decides whether to register itself with the API Server through the kubelet startup parameter "--register-node". If this parameter is true, the kubelet tries to register itself with the API Server. For self-registration, the kubelet startup also includes the following parameters:

   --api-servers: the location of the API Server

   --kubeconfig: the kubeconfig file, i.e. the security configuration used to access the API Server

   --cloud-provider: the cloud service provider address, used only in public cloud environments

If self-registration is not chosen, the user has to configure the node's resource information manually and tell the kubelet on the node where the API Server is. The kubelet registers the node information through the API Server at startup and periodically sends updated node messages; after receiving them, the API Server writes the information into etcd. How often the kubelet reports node status is set with the kubelet startup parameter "--node-status-update-frequency", which defaults to 10s.

2. Pod management

The kubelet obtains the list of pods to run on its own node in the following ways:

    File: files in the configuration directory specified by the kubelet startup parameter "--config" (default "/etc/kubernetes/manifests"). The interval at which these files are re-checked is set with --file-check-frequency and defaults to 20s.

    HTTP endpoint: set with the "--manifest-url" parameter. The interval at which the HTTP endpoint is re-checked is set with "--http-check-frequency" and defaults to 20s.

    API Server: the kubelet listens to the etcd directories and synchronizes the pod list through the API Server.

Note: static pods are not created through the API Server but by the kubelet itself. As mentioned in the previous article, a static pod is defined in the kubelet's configuration files and always runs on the node where that kubelet resides.
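A minimal static pod sketch: dropping a file like the one below into the --config directory (default /etc/kubernetes/manifests) makes the kubelet on that node create the pod and keep it running; the file name and image are placeholders.

[root@zy ~] # cat > /etc/kubernetes/manifests/static-web.yaml <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: static-web
  labels:
    role: static
spec:
  containers:
    - name: web
      image: nginx
      ports:
        - containerPort: 80
EOF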

The kubelet watches etcd, and all operations targeting pods on its node are observed by it. If a new pod is bound to this node, the pod is created as specified in its manifest. If a pod is deleted, the kubelet deletes the pod's containers through the docker client and removes the pod.

Specifically, for pod creation and modification tasks, the process is as follows:

    - create a directory for the pod

    - read the pod manifest from the API Server

    - mount the external volumes for the pod

    - download the secrets used by the pod

    - check the pods already running on the node; if the pod has no containers or its Pause container is not started, first stop all container processes in the pod; if there are containers in the pod that need to be deleted, delete them

    - do the following for each container in the pod:

        △ compute a hash value for the container, then look up the hash of the docker container with the same name; if the container is found and the two hashes differ, stop the container's process in docker as well as the associated Pause container's process; if the hashes are the same, do nothing

        △ if the container was stopped and the container has no restartPolicy (restart policy) specified, do nothing

        △ call the docker client to download the container image, then call the docker client to run the container

3. Container health checks

A pod uses two types of probes to check the health of its containers. One is the livenessProbe, which determines whether a container is healthy and tells the kubelet when a container is in an unhealthy state. If the livenessProbe detects that a container is unhealthy, the kubelet deletes the container and handles it according to the container's restart policy. If a container has no livenessProbe, the kubelet assumes the probe always returns "Success". The other probe is the readinessProbe, which determines whether a container has started and is ready to accept requests. If the readinessProbe fails, the pod's status is updated and the endpoints controller removes the endpoint entry containing the IP address of the pod from the service's Endpoints.

The kubelet periodically executes the livenessProbe defined for a container to diagnose its health. A livenessProbe can be implemented in one of the following three ways:

   - ExecAction: execute a command inside the container; if the command's exit status code is 0, the container is healthy.

   - TCPSocketAction: perform a TCP check against the container's IP address and port; if the port can be connected to, the container is healthy.

   - HTTPGetAction: perform an HTTP GET request against the container's IP address, port and path; if the response status code is at least 200 and below 400, the container is healthy.

The specific configuration is described in detail in the previous article: https://blog.51cto.com/14048416/2396640
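Still, a brief hedged sketch of declaring both probes on a container (the image, paths, ports and timing values are placeholders):

[root@zy ~] # cat <<EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: probe-demo
spec:
  containers:
    - name: web
      image: nginx
      livenessProbe:              # kubelet restarts the container per the restart policy if this fails
        httpGet:
          path: /
          port: 80
        initialDelaySeconds: 10
        periodSeconds: 10
      readinessProbe:             # endpoints controller removes the pod from the service endpoints if this fails
        tcpSocket:
          port: 80
        periodSeconds: 5
EOF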

V. Analysis of the Kube-proxy operation mechanism

1. Concept introduction

To introduce kube-proxy we first have to review services. Because a pod's IP address changes every time the pod is recreated, the concept of a service is introduced for convenient access and load balancing. A service gets a cluster IP when it is created; this IP is fixed, and the service is associated with its backend pods through a label selector. So if we want to access the backend application, we only need the service's cluster IP, and the request is then forwarded to a backend pod. The service effectively acts as a reverse proxy and load balancer.

In many cases, though, a service is just a concept; what really makes it work is the kube-proxy service process behind it. So let's look at kube-proxy in detail.

Every node in a kubernetes cluster runs a kube-proxy process, which can be regarded as a transparent proxy and load balancer for services. Its core function is to forward access requests for a service to the backend pod instances. For each TCP service, kube-proxy establishes a SocketServer on the local node to receive requests and then distributes them evenly to the port of one of the backend pods, using the round robin load balancing algorithm by default. kubernetes also provides session affinity for directed forwarding through the service's service.spec.sessionAffinity parameter: if the value is set to "ClientIP", subsequent requests from the same client IP are forwarded to the same backend pod.

In addition, concepts such as a service's cluster IP and NodePort are implemented by kube-proxy through iptables NAT rules. At runtime kube-proxy dynamically creates the service-related iptables rules, which redirect traffic destined for the cluster IP or NodePort to the local proxy port of the corresponding service on the kube-proxy process. Because this iptables mechanism targets the local kube-proxy port, a kube-proxy component has to run on every node; this way, a service can be accessed from any node inside the cluster. Thanks to kube-proxy, the client does not need to care how many pods sit behind the service; the communication, load balancing and failure recovery in between are all transparent.

2. Selecting a backend pod

Currently kube-proxy's load balancer only supports the round robin algorithm, which picks members from the list one by one and starts over from the beginning when a cycle ends, and so on. On top of round robin, kube-proxy's load balancer also supports session persistence. If the service definition specifies session persistence, then when kube-proxy accepts a request it looks in local memory for an affinityState object for the request's source IP; if the object exists and the session has not timed out, kube-proxy redirects the request to the backend pod that the affinityState points to. If there is no affinityState object for that source IP, kube-proxy picks an endpoint with the round robin algorithm and creates an affinityState object recording the request's IP and the chosen endpoint. Subsequent requests are thus "stuck" to that affinityState object, which implements client-IP-based session persistence.
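A hedged sketch of turning on this session persistence for a service (the service name, selector and ports are placeholders):

[root@zy ~] # cat <<EOF | kubectl create -f -
apiVersion: v1
kind: Service
metadata:
  name: myweb-service
spec:
  sessionAffinity: ClientIP       # requests from the same client IP go to the same backend pod
  selector:
    app: myweb
  ports:
    - port: 80
      targetPort: 8080
EOF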

3. Kube-proxy implementation details

kube-proxy establishes a "service proxy object" for each service and keeps it synchronized automatically by querying and watching the changes of services and endpoints in the API Server. The service proxy object is a data structure inside the kube-proxy program; it includes a SocketServer that listens for the service's traffic, whose port is a randomly chosen free local port. kube-proxy also creates a load balancer, loadBalancer, internally. The loadBalancer holds the dynamic routing table from each service to its backend endpoint list, and the actual route choice depends on the round robin algorithm and the service's session persistence setting.

For the list of services that have changed, kube-proxy processes them one by one. The specific process is:

   - if the service has no cluster IP set, do nothing; otherwise, obtain the list of all port definitions of the service

   - read the port information in the service's port definition list one by one and determine, based on the port name, service name and namespace, whether a corresponding service proxy object already exists locally. If it does not exist, create it; if it exists and the service port has been modified, first delete the iptables rules related to that service port, close the service proxy object, then go through the creation process again and create the related iptables rules for the service

   - update the forwarding address list of the corresponding service in the load balancer component; for a newly created service, determine the session persistence policy used when forwarding

   - clean up deleted services

Next, a concrete case to illustrate the principle of kube-proxy in practice:

# first create a service:

apiVersion: v1
kind: Service
metadata:
  labels:
    name: mysql
    role: service
  name: mysql-service
spec:
  ports:
    - port: 3306
      targetPort: 3306
      nodePort: 30964
  type: NodePort
  selector:
    mysql-service: "true"

The NodePort exposed by mysql-service is 30964, the port of the corresponding cluster IP (10.254.162.44) is 3306, and the backend pods' port is also 3306. The 30964 exposed here is the local port of the proxy object created for the mysql-service service; if this port is accessed on a node, the request is routed to the service.

The mysql-service backend proxies two pods, with IPs 192.168.125.129 and 192.168.125.131. Let's look at the iptables rules first.

[root@localhost] # iptables -S -t nat

First, if you are accessing through port 30964 of node, you will enter the following chain:

-A KUBE-NODEPORTS -p tcp -m comment --comment "default/mysql-service:" -m tcp --dport 30964 -j KUBE-MARK-MASQ
-A KUBE-NODEPORTS -p tcp -m comment --comment "default/mysql-service:" -m tcp --dport 30964 -j KUBE-SVC-67RL4FN6JRUPOJYM

And then further jump to the KUBE-SVC-67RL4FN6JRUPOJYM chain.

-A KUBE-SVC-67RL4FN6JRUPOJYM -m comment --comment "default/mysql-service:" -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-ID6YWIT3F6WNZ47P
-A KUBE-SVC-67RL4FN6JRUPOJYM -m comment --comment "default/mysql-service:" -j KUBE-SEP-IN2YML2VIFH5RO2T

Here the --probability feature of iptables is used so that a connection has a 50% probability of entering the KUBE-SEP-ID6YWIT3F6WNZ47P chain and a 50% probability of entering the KUBE-SEP-IN2YML2VIFH5RO2T chain.

The KUBE-SEP-ID6YWIT3F6WNZ47P chain DNATs the request to port 3306 of 192.168.125.129.

-A KUBE-SEP-ID6YWIT3F6WNZ47P -s 192.168.125.129/32 -m comment --comment "default/mysql-service:" -j KUBE-MARK-MASQ
-A KUBE-SEP-ID6YWIT3F6WNZ47P -p tcp -m comment --comment "default/mysql-service:" -m tcp -j DNAT --to-destination 192.168.125.129:3306

In the same way, the KUBE-SEP-IN2YML2VIFH5RO2T chain DNATs the request to port 3306 of 192.168.125.131.

-A KUBE-SEP-IN2YML2VIFH5RO2T -s 192.168.125.131/32 -m comment --comment "default/mysql-service:" -j KUBE-MARK-MASQ
-A KUBE-SEP-IN2YML2VIFH5RO2T -p tcp -m comment --comment "default/mysql-service:" -m tcp -j DNAT --to-destination 192.168.125.131:3306

In general, when a service is created, if no NodePort is specified the proxy object listens on a random free local port; if a NodePort is set, it is used as the port of the local proxy object. When a client accesses the port of the local proxy object, the request is forwarded according to the iptables rules to the service's clusterIP+port, then forwarded according to the load balancing policy to the targetPort of a backend endpoint, and finally reaches the application running in the container of the specific pod, which returns the response.

Core Mechanism part 2, shared Storage: https://blog.51cto.com/14048416/2412207

The content of this article references the book "The Definitive Guide to Kubernetes" (Kubernetes权威指南).
