Series of articles: cloud-native Kubernetes logging in practice

Updated 2025-04-26. From: SLTechnology News&Howtos


I have worked on logging for several years, and over the past year more and more people have asked how to build a log system for Kubernetes, or asked for help with problems they ran into along the way. It is better to teach people to fish than to give them a fish, so we want to share the experience we have accumulated over the years as a series of articles and help readers avoid detours. This is planned as a long-running series that leans toward hands-on practice and experience sharing, and it will be updated from time to time as the technology evolves.

Preface

I first heard the name Kubernetes in 2016, when it was still locked in a "Three Kingdoms" contest with Docker Swarm and Mesos. Kubernetes came out ahead thanks to a series of advantages (extensibility, declarative interfaces, cloud friendliness) and eventually became dominant. As one of the core CNCF projects, bar none, Kubernetes is the foundation on which Cloud Native is put into practice. Alibaba is currently carrying out a site-wide cloud-native transformation based on Kubernetes, and within 1-2 years 100% of Alibaba's business will be on the public cloud.

The core of CNCF's definition of Cloud Native is building and running elastic, fault-tolerant, easy-to-manage, observable, and loosely coupled application systems in public, private, and hybrid clouds by means of containers, service meshes, microservices, immutable infrastructure, and declarative APIs. Observability is an essential part of such systems, and one of the cloud-native design principles is designing for diagnosability, which covers cluster-level logs, metrics, and traces.

Why do we need a log system?

Locating an online problem usually goes like this: detect the problem through metrics, narrow it down to a module with traces, and then find the root cause in that module's logs. Logs record errors, key variables, code execution paths, and similar information, which is the core material for troubleshooting, so logs are always on the critical path of diagnosing online problems.
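The metric, trace, log drill-down described above can be sketched roughly as follows. This is only a minimal in-memory illustration: the `Span` and `LogRecord` types, the `diagnose` function, and the 10% error-rate threshold are hypothetical names and values chosen for the example, not part of any particular monitoring product.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """One hop of a traced request (hypothetical shape)."""
    trace_id: str
    module: str
    error: bool


@dataclass
class LogRecord:
    """One log line already tagged with its trace (hypothetical shape)."""
    trace_id: str
    module: str
    level: str
    message: str


def error_rate(spans):
    """Metric step: fraction of spans that ended in an error."""
    return sum(s.error for s in spans) / len(spans) if spans else 0.0


def failing_module(spans):
    """Trace step: the module that produced the most failed spans."""
    counts = {}
    for s in spans:
        if s.error:
            counts[s.module] = counts.get(s.module, 0) + 1
    return max(counts, key=counts.get) if counts else None


def diagnose(spans, logs, threshold=0.1):
    """Log step: return the suspect module and its ERROR logs for failed traces."""
    if error_rate(spans) <= threshold:
        return None, []  # the metric looks healthy, nothing to drill into
    module = failing_module(spans)
    failed = {s.trace_id for s in spans if s.error and s.module == module}
    hits = [r for r in logs
            if r.module == module and r.trace_id in failed and r.level == "ERROR"]
    return module, hits
```

In a real system each step would query a separate backend (metrics store, tracing system, log store); the point of the sketch is only that the trace ID and module name are what tie the three together.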

[Figure: cdn.nlark.com/yuque/0/2019/png/347081/1567957254811-b4ac58ed-1e1b-4886-87dc-436154b57cb5.png]

Over more than ten years at Alibaba, the log system has kept evolving along with the computing paradigm. The evolution falls roughly into three main stages:

In the stand-alone era, almost all applications were deployed on a single machine, and when service load increased the only option was to switch to higher-specification IBM minicomputers. Logs, as a part of the application, were mainly used for debugging and were usually analyzed with grep and other common Linux text commands.

As the stand-alone system became a bottleneck for Alibaba's business growth, the Feitian project was started to achieve true scale-out, and the Feitian 5K project was officially launched in 2013. At this stage every business began its distributed transformation, and calls between services changed from local to distributed. To better manage, debug, and analyze distributed applications, we developed a Trace (distributed tracing) system and a variety of monitoring systems. What these systems have in common is centralized storage of all logs (including metrics and so on).

To support faster development and iteration, in recent years we have begun containerizing our applications and embracing the Kubernetes ecosystem, moving the whole business to the cloud, Serverless, and so on. At this stage both the scale and the variety of logs are growing explosively, and the demand for digital and intelligent log analysis keeps rising, so a unified log platform has emerged.

The ultimate interpretation of observability

In CNCF's view, the main role of observability is problem diagnosis. Raised to the level of the whole company, observability covers not only DevOps but also business, operations, BI, audit, security, and other fields. The ultimate goal of observability is to make every aspect of the company digital and intelligent.

At Alibaba, almost every business role works with a wide variety of log data. To support the various application scenarios, we have developed many tools and functions: real-time log analysis, distributed tracing, monitoring, data processing, stream computing, offline computing, BI systems, audit systems, and so on. The log system itself focuses on real-time data collection, cleaning, intelligent analysis, and monitoring, and on connecting to the various stream computing and offline systems.

Difficulties in building a Kubernetes log system

There are many mature solutions for a simple log system, so I won't repeat them here; this time we only discuss building a log system on Kubernetes. Logging on Kubernetes is very different from our earlier logging solutions for physical machine and virtual machine scenarios, for example:

Log forms become more complex. Besides logs on physical and virtual machines, we also need to collect container standard output, files inside containers, container events, Kubernetes events, and other information.

The environment becomes more dynamic. In Kubernetes, machines going down, nodes going offline and online, Pod termination, and scaling out or in are all normal. Logs are therefore transient (for example, a Pod's logs are no longer visible after the Pod is terminated), so log data must be shipped to the server side in real time. Log collection itself must also adapt to this highly dynamic environment.

There are more kinds of logs. The figure above shows a typical Kubernetes architecture: a request from the client passes through CDN, Ingress, Service Mesh, Pod, and other components, and touches a variety of infrastructure. The number of log types grows accordingly, for example the logs of the various K8s system components, audit logs, Service Mesh logs, and Ingress logs.

The business architecture changes. More and more companies are adopting a microservice architecture on Kubernetes. With microservices, development becomes more complex, and services depend more and more on each other and on the underlying products. Troubleshooting gets harder, and correlating logs across all these dimensions becomes a difficult problem.

Integrating a logging solution is hard. We usually build a CI/CD system on Kubernetes that integrates and deploys the business as automatically as possible. Log collection, storage, and cleaning need to be part of that system and should stay as consistent as possible with K8s's declarative deployment style. However, existing log systems are usually standalone, and integrating them into CI/CD is very expensive.

Log scale becomes a problem. In the early stage of a system we usually choose to build our own log system from open source components. This works well during testing and verification or in a company's early days, but as the business grows and log volume reaches a certain scale, self-built open source systems tend to run into all kinds of problems, such as tenant isolation, query latency, data reliability, and system availability. Although the log system is not on the core path of IT, these problems can have a severe impact when they occur at a critical moment. For example, an emergency occurs during a major sales promotion, several engineers run concurrent queries while troubleshooting and bring the log system down, which lengthens recovery time and widens the impact of the failure.
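To make the first and fourth difficulties more concrete (container standard output as a log source, and correlating logs across dimensions), here is a minimal sketch of what a collector has to do with container stdout on a node. It assumes the node's runtime writes CRI-style log lines (`<timestamp> <stream> <P|F> <message>`) into files named `<pod>_<namespace>_<container>-<container id>.log`; both the layout and the function names `pod_metadata` and `parse_cri` are assumptions made for the example, so check them against your own runtime before relying on them.

```python
import re

# Assumed CRI line layout: "<RFC 3339 time> <stdout|stderr> <P|F> <message>".
# Tag P marks a partial line, F marks the final piece of a line.
CRI_LINE = re.compile(
    r"^(?P<time>\S+) (?P<stream>stdout|stderr) (?P<tag>[PF]) (?P<msg>.*)$"
)

# Assumed file name layout: "<pod>_<namespace>_<container>-<64 hex chars>.log".
LOG_FILE = re.compile(
    r"^(?P<pod>[^_]+)_(?P<namespace>[^_]+)_(?P<container>.+)-[0-9a-f]{64}\.log$"
)


def pod_metadata(filename):
    """Recover pod, namespace, and container name from a log file name."""
    m = LOG_FILE.match(filename)
    if m is None:
        raise ValueError(f"unexpected log file name: {filename}")
    return {"pod": m["pod"], "namespace": m["namespace"], "container": m["container"]}


def parse_cri(lines, metadata):
    """Yield one record per complete log line, tagged with Kubernetes metadata.

    Partial (P) pieces are buffered per stream and joined with the final (F)
    piece; the record carries the timestamp of that final piece.
    """
    pending = {}
    for line in lines:
        m = CRI_LINE.match(line.rstrip("\n"))
        if m is None:
            continue  # skip lines that do not follow the assumed layout
        stream = m["stream"]
        pending[stream] = pending.get(stream, "") + m["msg"]
        if m["tag"] == "F":
            yield {**metadata, "time": m["time"], "stream": stream,
                   "message": pending.pop(stream)}
```

Attaching the pod, namespace, and container name to every record at collection time is one way to keep logs queryable after the Pod itself is gone, and to join them later with logs from other components.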

Summary

I believe readers who have worked on building a K8s log system will find the difficulties analyzed above very familiar. In later articles we will describe in detail, from a practical point of view, how we build the K8s log system at Alibaba. Stay tuned.

Author: Yuanyi. This article is original content of the Yunqi community and may not be reproduced without permission.

