This article summarizes common anomalies encountered when running Docker. The troubleshooting methods described here are simple, fast, and practical; let's walk through them.
Exception one
docker ps does not respond, and the node shows as NotReady in Kubernetes.
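Before digging into the daemon, it helps to confirm the symptom from both the Kubernetes side and the node itself. A minimal sketch (assumes kubectl access to the cluster and the coreutils timeout command on the node):
$ kubectl get nodes        # the affected node shows STATUS NotReady
$ timeout 10 docker ps     # run on the node itself; the command hangs and is killed after 10s instead of returning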
Runtime information:
$ docker -v
Docker version 17.03.2-ce, build f5ec1e2
$ docker-containerd -v
containerd version 0.2.3 commit:4ab9917febca54791c5f071a9d1f404867857fcc
$ docker-runc -v
runc version 1.0.0-rc2
commit: 54296cf40ad8143b62dbcaa1d90e520a2136ddfe
spec: 1.0.0-rc2-dev
Enable Docker debug mode
There are two ways to enable debugging. The recommended method is to set debug to true in the daemon.json file. This method applies to every Docker platform.
1. Edit the daemon.json file, which is usually located in /etc/docker/. If the file does not already exist, you may need to create it.
2. Add the following setting:
{"debug": true}
3. Send a HUP signal to the daemon to reload its configuration.
sudo kill -SIGHUP $(pidof dockerd)
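To confirm that debug logging is now active, you can follow the daemon's journal on a systemd host. A minimal sketch (the unit name docker.service and the availability of journalctl are assumptions about the environment):
$ journalctl -u docker.service -f | grep level=debug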
You can see the Docker debug-level log:
Dec 14 20:04:45 dockerd[7926]: time="2018-12-14T20:04:45.788669544+08:00" level=debug msg="Calling GET /v1.27/containers/json?all=1&filters=%7B%22label%22%3A%7B%22io.kubernetes.docker.type%3Dpodsandbox%22%3Atrue%7D%7D&limit=0"
Dec 14 20:04:45 dockerd[7926]: time="2018-12-14T20:04:45.790628950+08:00" level=debug msg="Calling GET /v1.27/containers/json?all=1&filters=%7B%22label%22%3A%7B%22io.kubernetes.docker.type%3Dcontainer%22%3Atrue%7D%7D&limit=0"
Dec 14 20:04:46 dockerd[7926]: time="2018-12-14T20:04:46.792531056+08:00" level=debug msg="Calling GET /v1.27/containers/json?all=1&filters=%7B%22label%22%3A%7B%22io.kubernetes.docker.type%3Dpodsandbox%22%3Atrue%7D%7D&limit=0"
Dec 14 20:04:46 dockerd[7926]: time="2018-12-14T20:04:46.794433693+08:00" level=debug msg="Calling GET /v1.27/containers/json?all=1&filters=%7B%22label%22%3A%7B%22io.kubernetes.docker.type%3Dcontainer%22%3Atrue%7D%7D&limit=0"
Dec 14 20:04:47 dockerd[7926]: time="2018-12-14T20:04:47.097363259+08:00" level=debug msg="Calling GET /v1.27/containers/json?filters=%7B%22label%22%3A%7B%22io.kubernetes.docker.type%3Dpodsandbox%22%3Atrue%7D%7D&limit=0"
Dec 14 20:04:47 dockerd[7926]: time="2018-12-14T20:04:47.098448324+08:00" level=debug msg="Calling GET /v1.27/containers/json?all=1&filters=%7B%22label%22%3A%7B%22io.kubernetes.docker.type%3Dcontainer%22%3Atrue%7D%2C%22status%22%3A%7B%22running%22%3Atrue%7D%7D&limit=0"
Dec 14 20:04:47 dockerd[7926]: ...
dockerd keeps calling the list-containers API, but the calls never get a response.
Print stack information:
$ kill -SIGUSR1 $(pidof dockerd)
The generated debug information is written to the following files:
...goroutine stacks written to /var/run/docker/goroutine-stacks-2018-12-02T193336z.log
...daemon datastructure dump written to /var/run/docker/daemon-data-2018-12-02T193336z.log
View the contents of the goroutine-stacks-2018-12-02T193336z.log file:
goroutine 248 [running]:
github.com/docker/docker/pkg/signal.DumpStacks(0x18fe090, 0xf, 0x0, 0x0, 0x0, 0x0)
    /root/rpmbuild/BUILD/docker-ce/.gopath/src/github.com/docker/docker/pkg/signal/trap.go:82 +0xfc
github.com/docker/docker/daemon.(*Daemon).setupDumpStackTrap.func1(0xc421462de0, 0x18fe090, 0xf, 0xc4203c8200)
    /root/rpmbuild/BUILD/docker-ce/.gopath/src/github.com/docker/docker/daemon/debugtrap_unix.go:19 +0xcb
created by github.com/docker/docker/daemon.(*Daemon).setupDumpStackTrap
    /root/rpmbuild/BUILD/docker-ce/.gopath/src/github.com/docker/docker/daemon/debugtrap_unix.go:32 +0x10a
goroutine 1 [chan receive, 91274 minutes]:
main.(*DaemonCli).start(0xc42048a840, 0x0, 0x190f560, 0x17, 0xc420488400, 0xc42046c820, 0xc420257320, 0x0, 0x0)
    /root/rpmbuild/BUILD/docker-ce/.gopath/src/github.com/docker/docker/cmd/dockerd/daemon.go:326 +0x183e
main.runDaemon(0x0, 0x190f560, 0x17, 0xc420488400, 0xc42046c820, 0xc420257320, 0x10, 0x0)
    /root/rpmbuild/BUILD/docker-ce/.gopath/src/github.com/docker/docker/cmd/dockerd/docker.go:86 +0xb2
main.newDaemonCommand.func1(0xc42041f200, 0xc42045df00, 0x0, 0x10, 0x0, 0x0)
    /root/rpmbuild/BUILD/docker-ce/.gopath/src/github.com/docker/docker/cmd/dockerd/docker.go:42 +0x71
github.com/docker/docker/vendor/github.com/spf13/cobra.(*Command).execute(0xc42041f200, 0xc42000c130, 0x10, 0x11, 0xc42041f200, 0xc42000c130)
    /root/rpmbuild/BUILD/docker-ce/.gopath/src/github.com/docker/docker/vendor/github.com/spf13/cobra/command.go:646 +0x26d
github.com/docker/docker/vendor/github.com/spf13/cobra.(*Command).ExecuteC(0xc42041f200, 0x16fc5e0, 0xc42046c801, 0xc420484810)
    /root/rpmbuild/BUILD/docker-ce/.gopath/src/github.com/docker/docker/vendor/github.com/spf13/cobra/command.go:742 +0x377
github.com/docker/docker/vendor/github.com/spf13/cobra.(*Command).Execute(0xc42041f200, 0xc420484810, 0xc420084058)
    /root/rpmbuild/BUILD/docker-ce/.gopath/src/github.com/docker/docker/vendor/github.com/spf13/cobra/command.go:695 +0x2b
main.main()
    /root/rpmbuild/BUILD/docker-ce/.gopath/src/github.com/docker/docker/cmd/dockerd/docker.go:106 +0xe2
goroutine 17 [syscall, 91275 minutes, locked to thread]:
...
At this point we can determine that docker ps is hanging because containerd is not responding; from the stack trace we can also see that the goroutine calling into containerd is blocked waiting on a lock.
View dmesg
Checking the kernel messages with dmesg, we find OOM errors reported by the memory cgroup:
Memory cgroup out of memory: Kill process 20357 (runc:[2:INIT]) score 970 or sacrifice child
After checking the dmesg output on most of the affected machines, we found the same OOM errors, so we suspected that containerd stops responding because of a container OOM.
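A quick way to scan a node for such events is to search the kernel ring buffer directly. A minimal sketch (dmesg -T prints human-readable timestamps; the exact message wording can vary by kernel version):
$ dmesg -T | grep -i "out of memory"
$ dmesg -T | grep "runc:\[2:INIT\]"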
Simulated OOM
Since we suspect that a container OOM causes containerd to stop responding, we might as well reproduce the scenario ourselves.
First, create a Deployment that will be OOM-killed and pin it to the target node with a nodeSelector:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    wayne-app: oomtest
    wayne-ns: test
    app: oomtest
  name: oomtest
spec:
  selector:
    matchLabels:
      app: oomtest
  template:
    metadata:
      labels:
        wayne-app: oomtest
        wayne-ns: test
        app: oomtest
    spec:
      nodeSelector:
        kubernetes.io/hostname: test-001
      containers:
      - resources:
          limits:
            memory: 0.2Gi
            cpu: '0.2'
          requests:
            memory: 0.2Gi
            cpu: '0.1'
        args:
        - '-m'
        - '10'
        - '--vm-bytes'
        - '128m'
        - '--timeout'
        - '60s'
        - '--vm-keep'
        image: progrium/stress
        name: stress
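To run the experiment, apply the manifest, wait for the stress container to be OOM-killed, and then check whether docker ps on the target node still responds. A minimal sketch (the file name oomtest.yaml is an assumption):
$ kubectl apply -f oomtest.yaml
$ kubectl get pods -l app=oomtest -o wide -w   # watch for OOMKilled restarts on node test-001
# then, on node test-001 itself:
$ timeout 10 docker ps                         # hangs once containerd is stuck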
After a while, docker ps stops responding on node test-001. The dmesg output and the containerd stack traces are consistent with those of the previously affected node. At this point it is almost certain that a container OOM is what causes containerd to hang.
Cause analysis
Searching the community issues and related PRs shows that the root cause is a bug in runc.
When a container is started with runc start or runc run, the stub process (runc:[2:INIT]) opens a FIFO for writing, and its parent runc process opens the same FIFO for reading; this is how the two synchronize.
If the stub process exits at the wrong moment, the parent runc process blocks forever.
This happens when two runc operations race with each other, for example runc run/start against runc delete. It can also happen for other reasons, for example the kernel's OOM killer may choose to kill the stub process.
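The reason the parent blocks is ordinary FIFO open semantics: an open for reading does not return until some process opens the same FIFO for writing. The following toy demo (not runc's actual code, just an illustration of the semantics) shows a reader waiting forever when its intended writer never shows up:
$ mkfifo /tmp/exec.fifo
$ cat /tmp/exec.fifo &    # the reader blocks in open(), like the parent runc process
# the would-be writer exits without ever opening the FIFO, so the reader above never returns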
Solution:
The fix resolves the exec FIFO race: if the stub process has already exited before the parent opens the FIFO, an error is returned instead of blocking forever.
Summary
containerd officially incorporated this fix in v1.0.2, so the problem can be solved by upgrading Docker. We have upgraded part of our machines to Docker 18.06 so far, and no similar problems have appeared on the upgraded machines.
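After upgrading, it is worth double-checking the versions actually running on each node. A minimal sketch (the docker-containerd binary name matches what this article used earlier; newer packages may ship it simply as containerd):
$ docker version         # the engine should report 18.06 or later
$ docker-containerd -v   # containerd should report v1.0.2 or later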
Related issues:
https://github.com/containerd/containerd/issues/1882
https://github.com/containerd/containerd/pull/2048
https://github.com/opencontainers/runc/pull/1698
Exception two
Docker configured with devicemapper in direct-lvm mode fails to start on CentOS.
Error starting daemon: error initializing graphdriver: devicemapper: Non existing device docker-thinpool
Dec 14 03:21:03 two-slave-41-135 systemd: docker.service: main process exited, code=exited, status=1/FAILURE
Dec 14 03:21:03 two-slave-41-135 systemd: Failed to start Docker Application Container Engine.
Dec 14 03:21:03 two-slave-41-135 systemd: Dependency failed for kubernetes Kubelet.
Dec 14 03:21:03 two-slave-41-135 systemd: Job kubelet.service/start failed with result 'dependency'.
Root cause
This problem occurs when Docker, using the devicemapper storage driver, tries to reuse an LVM thin pool that was created earlier, for example after changing Docker's data directory on a node. The error is a safety measure designed to prevent Docker from accidentally using and overwriting data in an existing LVM thin pool because of a configuration problem.
Solution
To get Docker starting again, you must delete and recreate the logical volumes so that Docker treats them as a new thin pool.
Warning: these commands will remove all existing images and volumes from the Docker data directory. Back up all important data before performing these steps.
1. Stop Docker
sudo systemctl stop docker.service
2. Delete the Docker directory
sudo rm -rf /var/lib/docker
3. Delete thin pool logical volumes that have been created
$ sudo lvremove docker/thinpool
Do you really want to remove active logical volume docker/thinpool? [y/n]: y
  Logical volume "thinpool" successfully removed
4. Create a new logical volume
lvcreate -L 500g --thin docker/thinpool --poolmetadatasize 256m
Adjust the thinpool and metadata sizes according to the actual disk size.
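Once the thin pool has been recreated, you can verify the logical volumes before bringing Docker back up. A minimal sketch (assumes the volume group is named docker, as in the commands above):
$ sudo lvs -a docker                  # thinpool and its hidden metadata volume should be listed
$ sudo systemctl start docker.service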
Docker automatic direct-lvm mode configuration
If you want Docker to automatically configure direct-lvm mode for you, continue with the following steps.
1. Edit the /etc/docker/daemon.json file and change the dm.directlvm_device_force value from false to true (a fuller example of this file is sketched after these steps). For example:
{"storage-driver": "devicemapper", "storage-opts": ["dm.directlvm_device_force=true"]}
2. In addition to deleting the logical volumes, delete the docker volume group:
$ sudo vgremove docker
3. Start Docker
sudo systemctl start docker.service
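For reference, the automatic direct-lvm configuration can also carve the thin pool out of a dedicated block device. The sketch below writes such a daemon.json with a heredoc; the device path /dev/xvdf is a placeholder, and the dm.* option names are taken from the Docker devicemapper storage driver documentation, so verify them against your Docker version before use:
$ sudo tee /etc/docker/daemon.json <<'EOF'
{
  "storage-driver": "devicemapper",
  "storage-opts": [
    "dm.directlvm_device=/dev/xvdf",
    "dm.thinp_percent=95",
    "dm.thinp_metapercent=1",
    "dm.thinp_autoextend_threshold=80",
    "dm.thinp_autoextend_percent=20",
    "dm.directlvm_device_force=true"
  ]
}
EOF
$ sudo systemctl start docker.service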
Summary
Although Docker is currently the most commonly used container solution, it still has many shortcomings.
Docker's isolation is relatively weak; co-locating workloads can easily cause services to interfere with one another, and a problem in one service may affect other services or even the whole cluster.
Docker also has bugs of its own. For historical reasons, many of them cannot be attributed purely to the kernel or to Docker, and fixing them requires coordinated changes on both sides.
At this point, you should have a deeper understanding of these common Docker anomalies. It is worth reproducing and working through them in practice.