How to monitor Kubernetes during production

2025-03-28 Update From: SLTechnology News&Howtos


This article is about how to monitor Kubernetes in production. We think it is very practical, so we share it here as a reference; read on for the details.

Monitoring Kubernetes components

Monitoring Kubernetes clusters is not an easy task. To illustrate the kinds of failures that can occur, here is an example from our AWS setup.

One of our clusters appeared perfectly healthy: SkyDNS was running and all pods had started. A few minutes later, however, SkyDNS entered the "CrashLoopBackoff" state. The application containers had started but were still not functional, because after their first restart they could not reach the database.

The result was a cluster outage, and all we could do was stare at events and pod statuses without any clear understanding of what had happened.

After logging into the master node and looking at the SkyDNS pod's logs, we found a problem with etcd: SkyDNS either could not connect, or the connection became unstable immediately after it was established. etcd itself was running, so what was the problem?

After quite a lot of research, we found the answer. High disk latency was causing read and write errors, which in turn made etcd fail to write to the file system. Although etcd was correctly configured and nominally running, it was not reliably available to Kubernetes services.

The lesson: even if you have successfully set up a cluster, there is no guarantee that it will keep working as expected.

So what is most likely to go wrong after configuration? The main failure modes are as follows:

Loss of connectivity between hosts

Lag caused by etcd downtime, instability, or misconfiguration

A broken overlay network layer between hosts

Failure of any individual node

Kubernetes API server or controller manager downtime

Docker failing to start containers

A network partition affecting a subset of nodes

We exchanged ideas with attendees of the first KubeCon and brainstormed possible solutions:

"How do you evaluate the health of Kubernetes clusters? @klizhenas suggests creating an app that can schedule and unschedule pods. Has anyone created this?"

-- Brandon Philips (@BrandonPhilips), 11 November 2015

Let's evaluate the way to monitor Kubernetes:

Typical monitoring

Application-oriented smoke testing

Typical monitoring solution

There is no shortage of traditional monitoring tools. One of the best choices in this category is monit.

It is an extremely lightweight (single executable), battle-tested daemon that runs on thousands of machines. That makes for a small footprint, but it is limited to monitoring a single system, which is its biggest disadvantage.

One problem we found with monit is the limited expressiveness and extensibility of its set of tests. Although it is configurable, we still had to extend its functionality by writing scripts, or by controlling special-purpose programs through a weak interface.

More importantly, we found it very difficult to connect several monit instances into a highly available, partition-tolerant system: the agents would have to collect the information they share and then cooperate to keep it up to date.

Smoke testing

The definition of the term "smoke test":

"A series of preliminary tests to reveal the severity of some simple failures in order to reject the expected release of the software. it usually contains a subset of tests that cover most important roles to determine that the important role is running as expected. the most frequent feature of smoke testing is that it runs very fast, usually in seconds."

With our existing knowledge of Kubernetes, we were convinced we could use smoke tests to build a monitoring system with the following properties:

Lightweight periodic testing

High availability and resilience to network partitions

Continued operation even when parts of the environment have failed

A time-series history of health data

Regardless of the level of abstraction at which a failure occurs, whether an application failure or a low-level network error, the system can trace it to the actual cause.

Monitoring agents coordinated by Serf

Our high-level solution is a set of agent programs, one residing on each node in the cluster. They communicate with one another through the gossip protocol provided by Serf:

Some agents monitor the status of key Kubernetes components (etcd, the scheduler, the API server, and so on), while others execute smoke tests: creating lightweight containers and checking that they can communicate with each other.

The agents synchronize data periodically, so each node keeps a reasonably current view of the cluster as a whole. Because the consistency guarantee provided by Serf is weak, this view is not strictly up to date. Periodic test results are saved to a back end, which can be as simple as a SQLite database or a time-series database such as InfluxDB.

A peer-to-peer system is very helpful for surfacing fault and monitoring information even when key parts of the system are down. In the following example, the primary node and most other nodes are down, which causes etcd to fail; however, we can still retrieve diagnostic information about the cluster by connecting to any surviving node.

[Screenshot from the original article: diagnostic view of a partially degraded system.]

Limitations

Because of its simplicity, the current model has some limitations. It works for smaller clusters (say, 8 nodes), but in a larger cluster you do not want every node communicating with every other node. The solution we plan to adopt is a special aggregator tier, borrowing ideas from Skype's supernodes and from Consul's "anti-entropy catalogs."

Conclusion

Monitoring the status of a Kubernetes cluster is not well served by traditional monitoring tools alone. Manual troubleshooting is complex, and a large part of that complexity can be eliminated with an automatic feedback loop inside the cluster. The Satellite project has proved useful to us when operating clusters, so we decided to open-source it in the hope that it becomes a system that helps improve Kubernetes error detection.
