Kubernetes Master Node Disaster Recovery Operations Guide


This article largely reprints other people's work; the sources are listed at the end of the article.

1. Basic description

This document outlines the steps for disaster recovery of the Kubernetes master node, to be carried out when a K8s master crashes. The etcd cluster is deployed inside K8s, and the master nodes are the highly available nodes that run the control components; disaster recovery for them is a necessary operation in any complete enterprise service solution. When the master node of a K8s cluster fails, existing pods keep running and services stay reachable, so there is no immediate impact on the business. We can therefore choose a suitable maintenance window to recover after the failure, keeping the impact on external customers to a minimum.

This document draws on the more formal practice from abroad of building a daily automatic backup mechanism. Main reference: Backup and Restore a Kubernetes Master with Kubeadm.

2. Basic concepts of Etcd

Etcd is a distributed, consistent KV store implemented on the Raft consensus algorithm; it is mainly used for shared configuration and service discovery. For the Raft consensus algorithm itself, see the animated demonstration.

For an interpretation of Etcd's internals, see the Etcd architecture and implementation analysis.

The V3 version of etcdctl uses different commands from V2, and etcd v2 and v3 data are not compatible. When using the v3 version of etcdctl, you need to set the environment variable ETCDCTL_API=3. Kubernetes uses the V3 version by default.
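As a quick illustration, you can select the v3 API explicitly before running any commands; a minimal sketch, assuming etcdctl is on the PATH of the machine or container you run it from:

export ETCDCTL_API=3
etcdctl version   # with API=3 this prints both the etcdctl version and an "API version: 3.x" line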

3. Etcd data backup and recovery

The data of etcd is stored under its working directory by default. Looking at the directory where the data lives, we find it is divided into two folders:

snap: stores snapshot data. etcd takes snapshots to keep the number of WAL files from growing too large, and the snapshots record the state of the etcd data.
wal: stores the write-ahead log. Its most important role is to record the entire history of data changes; in etcd, every change to the data is written to the WAL before it is committed.

A quick look at this layout on disk is sketched right after this list.
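Purely as a sanity check, this is what the layout looks like under the default kubeadm data directory (/var/lib/etcd is the kubeadm default used later in this article; adjust the path for other installs):

ls /var/lib/etcd/member/
# expected output: snap  wal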

3.1 Single-node etcd data backup

We use K8s v1.14.3. For ease of deployment and compatibility, we use the etcd image that ships with the K8s installation (k8s.gcr.io/etcd:3.3.10) as the running container. Using the following yaml file, run on the K8s master, we back up the etcd data every day.

First, look at the etcd pod information in the cluster, and while doing so extract the etcdctl invocation from the etcd pod as a reference for the command we will execute.

kubectl get pods -n kube-system
kubectl describe pods -n kube-system etcd-docker1

In the pod's liveness probe you will find the command:

ETCDCTL_API=3 etcdctl --endpoints=https://[127.0.0.1]:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt --key=/etc/kubernetes/pki/etcd/healthcheck-client.key
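Before automating anything, it can be reassuring to take a one-off snapshot by hand using the flags found above; a minimal sketch, assuming the default kubeadm certificate paths and that /tmp/etcd_backup/ exists (the file name here is just an example):

ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  snapshot save /tmp/etcd_backup/etcd-snapshot-manual.db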

Write etcd-backup.yaml

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: etcdbackup
  namespace: kube-system
spec:
  schedule: "0 0 * * *"                       # back up the data every day
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: backup
            # same image as in /etc/kubernetes/manifests/etcd.yaml
            image: k8s.gcr.io/etcd:3.3.10
            env:
            - name: ETCDCTL_API
              value: "3"
            command: ["/bin/sh"]
            args: ["-c", "etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt --key=/etc/kubernetes/pki/etcd/healthcheck-client.key snapshot save /backup/etcd-snapshot-$(date +%Y-%m-%d_%H:%M:%S_%Z).db"]
            volumeMounts:
            - mountPath: /etc/kubernetes/pki/etcd   # map in the server certificate location
              name: etcd-certs
              readOnly: true
            - mountPath: /backup                    # map in the backup location
              name: backup
          restartPolicy: OnFailure
          nodeSelector:
            node-role.kubernetes.io/master: ""      # select the master node
          tolerations:
          - key: "node-role.kubernetes.io/master"   # tolerate the master taint so the pod can run there
            effect: "NoSchedule"
          hostNetwork: true
          volumes:
          - name: etcd-certs
            hostPath:
              path: /etc/kubernetes/pki/etcd        # host certificate file location
              type: DirectoryOrCreate
          - name: backup
            hostPath:
              path: /tmp/etcd_backup/               # backup file location
              type: DirectoryOrCreate

From the above yaml file we can see the implementation ideas:

- It is defined as a CronJob, so the pod runs automatically every morning (schedule: "0 0 * * *").
- The pod runs on the master (via the nodeSelector plus tolerations).
- /tmp/etcd_backup/ on the master machine is mounted as the backup directory; in production it is best to mount this from, or promptly copy it to, another machine, to guard against problems on the machine itself.
- The argument passed in is an ETCDCTL_API v3 backup command. The $(date +%Y-%m-%d_%H:%M:%S_%Z).db in the args names each etcd backup according to the timestamp format.

Applying and checking the CronJob is sketched right after this list.
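A minimal sketch, using the file and namespace defined above:

kubectl apply -f etcd-backup.yaml
kubectl get cronjob -n kube-system etcdbackup
# after the first scheduled run, snapshots appear on the master:
ls /tmp/etcd_backup/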

3.2 Single-node etcd data recovery

If you already have backup data and only the etcd data is corrupted, restore the backup as follows. The main order to observe is: stop kube-apiserver, stop etcd, restore the data, start etcd, start kube-apiserver.
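Before touching the cluster, it is worth checking that the snapshot file itself is readable; a minimal sketch reusing the same etcd image (the snapshot name below is the same placeholder used in the restore step):

docker run --rm \
  -v '/tmp:/backup' \
  --env ETCDCTL_API=3 \
  'k8s.gcr.io/etcd:3.3.10' \
  etcdctl snapshot status '/backup/etcd-snapshot-xxx_UTC.db' --write-out=table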

1. Change the image version in the /etc/kubernetes/manifests/kube-apiserver.yaml file to stop the kube-apiserver service.

2. Change the image version in the /etc/kubernetes/manifests/etcd.yaml file to stop the etcd service.

3. Run the following command to move the corrupted data files somewhere else:

mv /var/lib/etcd/* /tmp/

4. Run the following command to restore the data from the backup into /var/lib/etcd/ using a temporary docker container:

docker run --rm \
  -v '/tmp:/backup' \
  -v '/var/lib/etcd:/var/lib/etcd' \
  --env ETCDCTL_API=3 \
  'k8s.gcr.io/etcd:3.3.10' \
  /bin/sh -c "etcdctl snapshot restore '/backup/etcd-snapshot-xxx_UTC.db'; mv /default.etcd/member/ /var/lib/etcd/"

5. Change the image version back in the /etc/kubernetes/manifests/etcd.yaml file to restore the etcd service.

6. Change the image version back in the /etc/kubernetes/manifests/kube-apiserver.yaml file to restore the kube-apiserver service.
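Once both static pods are running again, a short health check closes the loop; a minimal sketch, assuming the default kubeadm certificate paths:

ETCDCTL_API=3 etcdctl endpoint health \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key
kubectl get pods -n kube-system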

3.3 Backup of cluster etcd node control components

Generally speaking, when a master node needs backup and recovery, then apart from accidental deletion of data, it is likely that the whole machine has failed, so the etcd data may need to be restored at the same time.

For recovery there is one prerequisite: the machine name and IP address need to be the same as the configuration of the master node before the crash, because this configuration is written into the etcd data store.

Because the etcd data files are the same on every member of a high-availability cluster, we only need to back up one copy.

Refer to 3.1 for specific backup methods.

3.4 Recovery of etcd node control components

First, stop kube-apiserver on each of the three Master machines and make sure kube-apiserver has fully stopped:

mv /etc/kubernetes/manifests /etc/kubernetes/manifests.bak
docker ps | grep k8s   # check whether etcd and the apiserver are still up; wait until they have all stopped
mv /var/lib/etcd /var/lib/etcd.bak

The etcd cluster is restored with the same snapshot.

# prepare the restore file
cd /tmp
tar -jxvf /data/backup/kubernetes/2018-09-18-k8s-snapshot.tar.bz
rsync -avz 2018-09-18-k8s-snapshot.db 192.168.105.93:/tmp/
rsync -avz 2018-09-18-k8s-snapshot.db 192.168.105.94:/tmp/
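Since all three members must restore from an identical snapshot, it is prudent to confirm the copies match before going further; a minimal sketch (assuming sha256sum is available on each node):

# run on each of the three nodes; all hashes must be identical
sha256sum /tmp/2018-09-18-k8s-snapshot.db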

Execute on lab1:

cd /tmp/
ETCDCTL_API=3 etcdctl snapshot restore etcdbk.db \
  --data-dir=/var/lib/etcd \
  --initial-advertise-peer-urls="https://192.168.14.138:2380" \
  --initial-cluster="docker10=https://192.168.14.140:2380,docker8=https://192.168.14.138:2380,docker9=https://192.168.14.139:2380" \
  --initial-cluster-token etcd-cluster \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  --endpoints="https://192.168.14.138:2379" \
  --name docker8

Execute on lab2:

cd /tmp/
ETCDCTL_API=3 etcdctl snapshot restore etcdbk.db \
  --data-dir=/var/lib/etcd \
  --initial-advertise-peer-urls="https://192.168.14.139:2380" \
  --initial-cluster="docker10=https://192.168.14.140:2380,docker8=https://192.168.14.138:2380,docker9=https://192.168.14.139:2380" \
  --initial-cluster-token etcd-cluster \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  --endpoints="https://192.168.14.139:2379" \
  --name docker9

Execute on lab3:

cd /tmp/
ETCDCTL_API=3 etcdctl snapshot restore etcdbk.db \
  --data-dir=/var/lib/etcd \
  --initial-advertise-peer-urls="https://192.168.14.140:2380" \
  --initial-cluster="docker10=https://192.168.14.140:2380,docker8=https://192.168.14.138:2380,docker9=https://192.168.14.139:2380" \
  --initial-cluster-token etcd-cluster \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  --endpoints="https://192.168.14.140:2379" \
  --name docker10

After the restore is complete on all three nodes, restore the manifests on the three Master machines:

mv /etc/kubernetes/manifests.bak /etc/kubernetes/manifests
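Once the manifests are back in place, the kubelet recreates the static pods on its own; a quick way to watch them come up (assuming a Docker container runtime, as used elsewhere in this article):

watch 'docker ps | grep -E "etcd|kube-apiserver"'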

Final confirmation:

# check the keys again
[root@lab1 kubernetes]# etcdctl get / --prefix --keys-only --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key --cacert=/etc/kubernetes/pki/etcd/ca.crt
/registry/apiextensions.k8s.io/customresourcedefinitions/apprepositories.kubeapps.com
/registry/apiregistration.k8s.io/apiservices/v1.
/registry/apiregistration.k8s.io/apiservices/v1.apps
/registry/apiregistration.k8s.io/apiservices/v1.authentication.k8s.io
... (omitted here)

[root@docker8 manifests]# kubectl get pods -n kube-system
NAME                               READY   STATUS      RESTARTS   AGE
backup-1565740800-d86j9            0/1     Completed   0          2d9h
backup-1565827200-rz4lf            0/1     Completed   0          33h
backup-1565913600-vwbfd            0/1     Completed   0          9h
coredns-5c98db65d4-9jhw2           1/1     Running     40         3d3h
coredns-5c98db65d4-mfx68           1/1     Running     40         3d3h
etcd-docker10                      1/1     Running     0          5m34s
etcd-docker8                       1/1     Running     0          5m58s
etcd-docker9                       1/1     Running     0          5m54s
kube-apiserver-docker10            1/1     Running     0          5m37s
kube-apiserver-docker8             1/1     Running     1          5m59s
kube-apiserver-docker9             1/1     Running     0          5m55s
kube-controller-manager-docker10   1/1     Running     1          3d3h
kube-controller-manager-docker8    1/1     Running     1          3d3h
kube-controller-manager-docker9    1/1     Running     1          3d3h
kube-flannel-ds-amd64-7hkch        1/1     Running     0          3d3h
kube-flannel-ds-amd64-bnmdd        1/1     Running     0          3d3h
kube-flannel-ds-amd64-zvrvl        1/1     Running     0          3d3h
kube-proxy-gxfkl                   1/1     Running     0          3d3h
kube-proxy-x2tlp                   1/1     Running     0          3d3h
kube-proxy-xpgjf                   1/1     Running     0          3d3h
kube-scheduler-docker10            1/1     Running     1          3d3h
kube-scheduler-docker8             1/1     Running     2          3d3h
kube-scheduler-docker9             1/1     Running     0          3d3h

After confirming the relevant components, the data are all normal.

4. Final notes

1. This article is a synthesis of many other articles, sorted out; the reference links are at the end.
2. Single-node etcd data recovery was verified without problems, although the restore command could still be optimized.
3. Cluster etcd recovery also worked; since the cluster's initial-cluster-token is etcd-cluster-0, recovery succeeds even without writing it explicitly.
4. Time is limited; I will continue to optimize and edit this when free. Mark it for now!

5. Reference links

[Backup and recovery of Kubernetes on Alibaba Cloud](https://yq.aliyun.com/articles/336781)
[Backup and recovery of kubeadm-installed etcd](https://blog.51cto.com/ygqygq2/2176492)
[Kubernetes Master Node Disaster Recovery Operations Guide](https://www.cnblogs.com/aguncn/p/9983603.html)
[etcd analysis](https://jimmysong.io/kubernetes-handbook/concepts/etcd.html)
