
Getting started with K8s from scratch | observability: monitoring and logging


Author | Mo Yuan, Alibaba technical expert

I. Background

Monitoring and logging are important pieces of infrastructure for large-scale distributed systems: monitoring helps developers check the running status of the system, while logs assist in troubleshooting and diagnosing problems.

In Kubernetes, monitoring and logging are part of the ecosystem rather than core components, so most of these capabilities depend on adaptations by the upper-level cloud vendors. Kubernetes defines the interface standards and specifications involved, and any component that conforms to those standards can be integrated quickly.

II. Monitoring

First, let's take a look at monitoring. In K8s, monitoring can be divided into four different types.

1. Resource monitoring

This covers common resource indicators such as CPU, memory, and network. These indicators are usually reported as numerical values or percentages, and this is the most familiar form of monitoring. Conventional monitoring systems such as Zabbix or Telegraf can already handle this kind of collection.

2. Performance monitoring

Performance monitoring refers to APM monitoring, that is, checking common application-level performance indicators. Deeper metrics are usually obtained through hook mechanisms: implicitly in the virtual machine or bytecode-execution layer, or through explicit injection in the application layer. These metrics are generally used for application tuning and diagnosis. Common examples are the JVM or the PHP Zend Engine, where general hook mechanisms expose indicators such as the number of GC runs in the JVM, the distribution across memory generations, and the number of network connections, which can then be used to diagnose and tune application performance.

3. Security monitoring

Security monitoring mainly consists of a series of security policies, such as unauthorized-access management and security vulnerability scanning.

4. Event monitoring

Event monitoring is a monitoring method peculiar to K8s. A previous article introduced a design concept in K8s: state transitions based on a state machine. A transition from one normal state to another normal state produces a Normal event, while a transition from a normal state to an abnormal state produces a Warning event. Usually we care more about Warning events. Event monitoring takes Normal or Warning events offline to a data center, where they can be analyzed and alerted on, exposing the corresponding anomalies through DingTalk, SMS, or e-mail, thus making up for some shortcomings of conventional monitoring.
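
As a concrete illustration of consuming these events, here is a minimal sketch (assuming the official Python client, the `kubernetes` package, and access to a cluster) that lists the Warning events an event-monitoring pipeline would typically take offline for alerting:

```python
from kubernetes import client, config

def list_warning_events():
    # Load credentials from ~/.kube/config; inside a pod, use config.load_incluster_config()
    config.load_kube_config()
    v1 = client.CoreV1Api()
    # Only Warning events, across all namespaces
    events = v1.list_event_for_all_namespaces(field_selector="type=Warning")
    for ev in events.items:
        print(f"{ev.last_timestamp} {ev.involved_object.kind}/{ev.involved_object.name}: "
              f"{ev.reason} - {ev.message}")

if __name__ == "__main__":
    list_warning_events()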

Monitoring Evolution of Kubernetes

In the early days, in K8s versions before 1.10, everyone used components like Heapster for monitoring collection. The design principle of Heapster is actually relatively simple.

First of all, each Kubernetes node runs a wrapped cAdvisor, which is the component responsible for data collection. After cAdvisor completes its collection, the kubelet packages the data collected by cAdvisor and exposes it through the corresponding APIs. In the early days there were actually three different APIs:

The first is the summary interface, the second is the kubelet interface, and the third is the Prometheus interface.

In fact, the data source behind all three interfaces is cAdvisor; only the data formats differ. Heapster supports two of these collection interfaces: the summary interface and the kubelet interface. Heapster periodically pulls data from each node, aggregates it in its own memory, and then exposes a service for upper-level consumers. Common consumers in K8s, such as the dashboard or the HPA controller, call this service to obtain monitoring data and use it for auto scaling and for displaying monitoring data.

This earlier data-consumption link looks clear enough and does not seem to have many problems, so why did Kubernetes abandon Heapster and switch to metrics-server? A major driving force was the need to standardize the monitoring data interface. Why standardize it?

The first point is that customer needs are ever-changing. For example, today I collect basic resource data with Heapster, but tomorrow I want to expose a metric for the number of online users in my application, feed it into my own interface system for display, and consume it for something like HPA. Can this be done with Heapster? The answer is no, and this shows the limits of Heapster's extensibility. The second point is that Heapster provides many sinks to ship data offline, including influxdb, SLS, DingTalk, and so on. These sinks mainly collect the data and take it offline; many customers, for instance, use influxdb for this purpose and connect a visualization tool such as Grafana on top of influxdb to visualize the monitoring data.

But the community later found that most of these sinks were not being maintained. This left many bugs in the Heapster project sitting in the community with nobody to fix them, which posed a challenge to the activity and stability of the project.

For these two reasons, K8s deprecated Heapster and built a simplified version of the monitoring collection component, called metrics-server.

The image above shows the architecture of Heapster. You can see it is divided into several parts: the first part is the core, above which is an API exposed through standard HTTP or HTTPS. In the middle are the source part, which corresponds to the different interfaces exposed for data collection, and the processor part, which is responsible for data conversion and aggregation. Finally there is the sink part, which is responsible for shipping data offline. This is the architecture of the earlier Heapster. Later, K8s standardized the monitoring interface and gradually trimmed Heapster down into metrics-server.

The current 0.3.1 version of metrics-server has roughly the following structure, which is very simple: a core layer, a source layer in the middle, a simple API layer, and an additional API Registration layer. The function of this layer is to register the corresponding data interface with the K8s API server. From then on, consumers no longer need to access metrics-server directly through its own API layer; they go through the API registration layer to the API server, and the API server forwards the request to metrics-server. As a result, what the data consumer actually perceives is the API server, while the concrete implementation behind that API is metrics-server. This is the biggest change in metrics-server.

Monitoring Interface Standard of Kubernetes

There are three different monitoring interface standards in K8s. They standardize and decouple the consumption of monitoring data and align it with the community. They fall into three categories.

The first kind: Resource Metrics

The corresponding interface is metrics.k8s.io, and its main implementation is metrics-server, which provides resource monitoring. Common levels are node level, pod level, namespace level, and class level. These monitoring indicators can be obtained through the metrics.k8s.io API.
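
For illustration, a minimal sketch of reading this API with the official Python client (`kubernetes` package), assuming metrics-server is installed in the cluster:

```python
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

# Node-level resource metrics served through the aggregated metrics.k8s.io API
nodes = api.list_cluster_custom_object("metrics.k8s.io", "v1beta1", "nodes")
for item in nodes["items"]:
    print(item["metadata"]["name"], item["usage"]["cpu"], item["usage"]["memory"])

# Pod-level resource metrics in one namespace (kube-system used here as an example)
pods = api.list_namespaced_custom_object("metrics.k8s.io", "v1beta1", "kube-system", "pods")
for item in pods["items"]:
    print(item["metadata"]["name"], [c["usage"] for c in item["containers"]])
```

This is the same data that `kubectl top nodes` and `kubectl top pods` display.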

The second kind: Custom Metrics

The corresponding API is custom.metrics.k8s.io, and its main implementation is Prometheus. It provides both resource monitoring and custom monitoring. The resource monitoring part overlaps with the resource monitoring above, while custom monitoring refers to cases such as an application wanting to expose the number of online users, or slow queries of the MySQL database behind it. These metrics can be defined in the application layer, exposed through the standard Prometheus client, and then collected by Prometheus.

Once such metrics are collected, they can be consumed through the custom.metrics.k8s.io standard. That is, if you integrate Prometheus this way, HPA can consume the data through custom.metrics.k8s.io.
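
As an illustrative sketch of the application side, the snippet below exposes a hypothetical `app_online_users` gauge with the prometheus_client library so that Prometheus can scrape it; the metric name, port, and update loop are assumptions for the example:

```python
import random
import time

from prometheus_client import Gauge, start_http_server

# A custom business metric, e.g. the "number of online users" mentioned above
ONLINE_USERS = Gauge("app_online_users", "Number of users currently online")

if __name__ == "__main__":
    start_http_server(8080)  # metrics exposed at http://localhost:8080/metrics
    while True:
        ONLINE_USERS.set(random.randint(0, 100))  # stand-in for the real business value
        time.sleep(15)
```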

The third kind: External Metrics

External Metrics is a special category, because K8s has become an implementation standard for cloud native interfaces, and much of the time you are dealing with cloud services. For example, in an application the front end may be a message queue and the back end an RDS database. Sometimes, when consuming data, you also need to consume monitoring metrics of these cloud products, such as the number of messages in the message queue, the number of connections on an SLB at the access layer, or the number of requests at the SLB's upper layer.

How are these consumed? K8s also implements a standard for this: external.metrics.k8s.io. The main implementers are the providers of the various cloud vendors, through which cloud resource metrics can be consumed. On Alibaba Cloud, an alibaba-cloud-metrics-adapter has also been implemented to provide this external.metrics.k8s.io standard.

Prometheus - the monitoring "standard" of the open source community

Next, let's take a look at a common monitoring solution in the open source community: Prometheus. Why is Prometheus regarded as the monitoring standard of the open source community?

First, Prometheus is a graduated project of the CNCF cloud native community. Second, more and more open source projects use Prometheus as their monitoring standard; common projects like Spark, TensorFlow, and Flink all provide standard Prometheus collection interfaces. Third, common databases and middleware have corresponding Prometheus collection clients: etcd, ZooKeeper, MySQL, and PostgreSQL all have such interfaces, and where one is missing, the community provides an exporter.

Let's first take a look at the whole structure of Prometheus.

The image above shows the data links of Prometheus collection, which can be divided into three different acquisition paths.

The first is the push mode: data is pushed to a pushgateway, and Prometheus then pulls the data from the pushgateway. The main scenario for this mode is short-lived tasks. The most common Prometheus collection mode is pull, which has a problem: once the lifetime of the data is shorter than the collection cycle, data may be missed. For example, if my collection cycle is 30 seconds and the task only runs for 15 seconds, some data may be lost. The simplest remedy in this scenario is to push your metrics to a pushgateway first and let Prometheus pull from the pushgateway, so that metrics from short-lived jobs are not lost. The second is the standard pull mode, which pulls data directly from the corresponding targets. The third is Prometheus on Prometheus, which synchronizes data from one Prometheus to another Prometheus.
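
A minimal sketch of the push path for a short-lived job, assuming the prometheus_client library and a Pushgateway reachable at the illustrative address `pushgateway:9091`:

```python
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

# The batch job records its metrics into a local registry...
registry = CollectorRegistry()
duration = Gauge("batch_job_duration_seconds",
                 "Duration of the last batch run", registry=registry)
duration.set(12.7)  # stand-in for the measured value of a short-lived run

# ...and pushes them before exiting; Prometheus later scrapes the Pushgateway.
push_to_gateway("pushgateway:9091", job="short_lived_batch", registry=registry)
```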

These are the three collection modes in Prometheus. As for data sources, in addition to standard static configuration, Prometheus also supports service discovery; in other words, collection targets can be discovered dynamically through service discovery mechanisms. In K8s, the dynamic discovery mechanism of Kubernetes is commonly used: as long as you configure some annotations, the collection task can be configured automatically, which is very convenient.

For alerting, Prometheus provides an external component called Alertmanager, which can send the corresponding alert information by e-mail or SMS. On the data consumption side, data can be displayed and consumed through the upper-layer API clients, through the web UI, and through Grafana.

To sum up, Prometheus has the following five characteristics:

The first feature is a simple and powerful access standard: developers only need to implement the Prometheus Client interface standard to have their data collected. The second is diverse data collection and offline modes: data can be collected and shipped offline through push, pull, or Prometheus on Prometheus. The third is compatibility with K8s. The fourth is a rich plug-in mechanism and ecosystem. The fifth is the help of Prometheus Operator. Prometheus Operator is probably the most complex Operator we have seen so far, but it is also the one that brings out the dynamic capabilities of Prometheus most fully. If you use Prometheus in K8s, it is recommended to use Prometheus Operator for deployment and operations.

kube-eventer - Kubernetes event offline tool

Finally, let me introduce an event offline tool in K8s called kube-eventer. kube-eventer is an open-source component from Alibaba Cloud Container Service. Through the watch mechanism of the API server, it can take the various events in K8s, such as pod events, node events, core component events, CRD events, and so on, offline to sinks such as SLS, DingTalk, Kafka, and InfluxDB, and then use this offline data for auditing, monitoring, and alerting on events. The project is now open source on GitHub; if you are interested, you can take a look.

The picture above is a DingTalk alert. You can see a Warning event under the kube-system namespace for a specific pod: roughly, the pod failed to restart, the reason is backoff, and the exact time of the event is given. This information can be used for troubleshooting.
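
To make the idea concrete, here is a minimal sketch of the mechanism kube-eventer relies on (not its actual implementation): watch the event stream from the API server with the Python client and forward Warning events to a sink; the webhook URL is a placeholder, while kube-eventer itself ships sinks for SLS, DingTalk, Kafka, InfluxDB, and others.

```python
import json
import urllib.request

from kubernetes import client, config, watch

def forward(event_obj, webhook_url="http://example.invalid/sink"):
    # Serialize the interesting fields and POST them to a sink endpoint
    payload = json.dumps({
        "type": event_obj.type,
        "reason": event_obj.reason,
        "object": f"{event_obj.involved_object.kind}/{event_obj.involved_object.name}",
        "message": event_obj.message,
    }).encode()
    req = urllib.request.Request(webhook_url, data=payload,
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)

config.load_kube_config()
v1 = client.CoreV1Api()
# Stream events through the API server's watch mechanism
for item in watch.Watch().stream(v1.list_event_for_all_namespaces):
    ev = item["object"]
    if ev.type == "Warning":  # usually the interesting ones for alerting
        forward(ev)
```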

III. Logging

Next, let's look at the logging part of K8s. First, the logging scenarios: logging in K8s mainly falls into four scenarios.

1. Host kernel logs

The first is the log of the host kernel. The host kernel log can help developers diagnose some common problems. The first is network-stack anomalies, similar to our iptables mark, where you can see messages from something like the conntrack table. The second is driver anomalies, which are common in some network solutions or in GPU-related scenarios. The third is file-system anomalies: in early docker scenarios where overlayfs or AUFS was not very mature, problems occurred frequently, and developers had no good way to monitor or diagnose them; some of these anomalies can actually be found in the host kernel log. Finally, there are anomalies that affect the node, such as kernel panics or OOMs, which are also reflected in the host log.

2. Runtime logs

The second is the runtime log, most commonly the Docker logs. Docker logs can be used to troubleshoot problems such as a Pod deletion hanging.

3. Log of core components

The third is the log of the core components. In K8s, the core components include external middleware such as etcd and built-in components such as the API server, kube-scheduler, controller-manager, kubelet, and so on. The logs of these components help us see the resource usage on the control plane of the whole K8s cluster and whether the current running state has any anomalies.

There is also core network middleware such as Ingress, which lets us see the traffic of the whole access layer. Through the Ingress logs we can do good application analysis at the access layer.

4. Log of deployment application

Finally, there is the log of the deployed application, which can be used to view the status of the business layer. For example, you can see whether there are any 500 errors at the business layer, any panics, or any abnormal or erroneous accesses. All of these can be viewed through the application logs.

Log collection

First, let's look at log collection. In terms of where logs are collected from, three types need to be supported:

The first is host files. A common scenario is that my container writes its log files to the host through something like a volume; the logs are rotated by the host's log rotation policy and then collected by the agent on the host.

The second is log files inside the container. How is this commonly handled? A common approach is to stream them to stdout through a sidecar container, write from stdout to a corresponding log file, rotate it locally, and then collect it with an external agent. The third is writing directly to stdout, which is a fairly standard strategy: either the agent collects it directly to a remote backend, or it is written directly to the remote backend through a standard API such as SLS's.
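
For the stdout approach, a minimal sketch of the application side: emit one JSON record per line to stdout and leave rotation and collection to the runtime and the node agent; the field names here are illustrative.

```python
import json
import logging
import sys
import time

class JsonLineFormatter(logging.Formatter):
    """Render each log record as a single JSON line, easy for agents to parse."""
    def format(self, record):
        return json.dumps({
            "ts": time.strftime("%Y-%m-%dT%H:%M:%S%z"),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        })

handler = logging.StreamHandler(sys.stdout)  # write to stdout, not to a file
handler.setFormatter(JsonLineFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.getLogger("checkout").info("order created")
```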

What the community recommends is actually the Fluentd collection scheme. Fluentd runs an agent on each node, and the agents collect data to a Fluentd server, where the data can be shipped offline to something like Elasticsearch and displayed through Kibana, or shipped to influxdb and displayed through Grafana. This is currently the practice recommended by the community.

IV. Summary

Finally, let me summarize today's lesson and introduce best practices for monitoring and logging on Alibaba Cloud. At the beginning of the course we mentioned that monitoring and logging are not core components of K8s; K8s mostly defines the standard interfaces, which the upper-level cloud vendors then adapt to.

Components of the Alibaba Cloud Container Service monitoring system

First, let me introduce the monitoring system in Alibaba Cloud Container Service. This picture is a big picture of monitoring.

The four products on the right are closely related to monitoring and logging:

SLS

The first is SLS, the log service. As mentioned earlier, logs in K8s come from many different sources, such as core component logs, access-layer logs, application logs, and so on. In Alibaba Cloud Container Service, audit logs can be collected through the API server, access-layer logs through something like a service mesh or ingress controller, and application logs at the application layer.

The data link alone is not enough, because it only gets the data offline; we also need presentation and analysis on top. For example, audit logs can show how many operations and changes were made today, whether there was an attack, and whether the system has any anomalies, all of which can be viewed through the audit Dashboard.

ARMS

The second is application performance monitoring, which can be done through a product like ARMS. ARMS currently supports Java and PHP and can be used to diagnose and tune application performance.

AHAS

The third is a special one called AHAS. AHAS is architecture-aware monitoring. In K8s, most applications are deployed with a microservice architecture, but microservices mean that components and their replicas change frequently, which makes topology management complex.

If we want to see the traffic trend of an application in K8s, or troubleshoot a traffic anomaly, it is very complicated without good visualization. AHAS monitors the network stack to draw the topology of applications across the whole K8s cluster, and on top of that provides resource monitoring, network bandwidth and traffic monitoring, and diagnosis of abnormal events. With architecture-topology awareness, it offers another angle on monitoring.

Cloud Monitor

Finally, there is Cloud Monitor, the basic cloud monitoring. It collects standard Resource Metrics for monitoring display, and can display and alert on monitoring indicators at the node, pod, and other levels.

Enhanced features of Aliyun

This part covers the enhancements Alibaba Cloud has made on top of open source. The first is metrics-server. As mentioned at the beginning, metrics-server was greatly simplified, but from the customer's point of view this simplification cut away functionality and causes a lot of inconvenience. For example, many customers want to ship monitoring data offline to something like SLS or influxdb, and the community version can no longer do this. Here Alibaba Cloud keeps the commonly used, well-maintained sinks, which is the first enhancement.

The second enhancement addresses the fact that the components in the K8s ecosystem do not evolve at the same pace. Dashboard releases, for example, do not match K8s major versions: when K8s releases 1.12, Dashboard does not also release 1.12, because it releases on its own schedule. As a result, many components that previously relied on Heapster break immediately after upgrading to metrics-server. Alibaba Cloud's metrics-server keeps complete Heapster compatibility, which means it can be used from K8s 1.7 up to K8s 1.14, giving full compatibility for components that consume monitoring data.

There are also eventer and NPD. kube-eventer was introduced above; on NPD (node-problem-detector), Alibaba Cloud has also made many additional enhancements, adding many monitoring and detection items, such as detection of kernel hangs, monitoring of network access, detection of SNAT, and checks such as fd pressure. These are all NPD check items that Alibaba Cloud has enhanced. Developers can deploy the NPD checks, get node-diagnosis alerts, and ship them offline through eventer to Kafka or DingTalk.

Further up is the Prometheus ecosystem. In the storage layer, developers can connect to Alibaba Cloud's HiTSDB and InfluxDB; in the collection layer, optimized node-exporters and scenario-specific exporters are provided, such as exporters for Spark, TensorFlow, and Argo. In addition, Alibaba Cloud has made many enhancements for GPUs, such as support for per-card GPU monitoring and monitoring of GPU share. On top of Prometheus, together with the ARMS team, a hosted version of Prometheus has been launched: developers can use a Helm chart out of the box and get Prometheus monitoring and collection capabilities without deploying the Prometheus server themselves.

The Alibaba Cloud Container Service logging system

What enhancements has Alibaba Cloud made in logging? First, the collection methods are fully compatible: pod logs, core component logs, Docker engine logs, kernel logs, and middleware logs can all be collected into SLS. After collection into SLS, the data can be shipped offline to OSS or MaxCompute for archiving and offline computation.

Then there is real-time consumption of the data: it can go to OpenSearch, E-MapReduce, or Flink for log search and upper-layer consumption. For log display, we can connect not only open source Grafana but also something like DataV, achieving a complete data link for collection and consumption.

To summarize, this article mainly introduced monitoring, including the four common types of monitoring in container scenarios, the monitoring evolution and interface standards of Kubernetes, and the commonly used open source monitoring solution; for logging, it introduced four different scenarios and the Fluentd collection scheme; finally, it introduced best practices for logging and monitoring on Alibaba Cloud.

"Alibaba Cloud's native Wechat official account (ID:Alicloudnative) focuses on micro-services, Serverless, containers, Service Mesh and other technology areas, focuses on cloud native popular technology trends, and large-scale cloud native landing practices, and is the technical official account that best understands cloud native developers."
