In the previous article, we covered why a log system is needed, why a cloud-native log system matters, and the difficulties of building one. DevOps, SRE, and operations engineers will have felt these pains firsthand. This article gets straight to the point: how to build a flexible, powerful, reliable, and scalable log system for cloud-native scenarios.
Requirement-driven architecture design
Technical architecture is the process of turning product requirements into a technical implementation. For any architect, the ability to analyze product requirements thoroughly is fundamental. Many systems are scrapped soon after completion, and the root cause is usually that they never addressed the product's real requirements.
Our log service team has nearly ten years of experience with logging, serving almost every internal team at Alibaba, covering e-commerce, payments, logistics, cloud computing, gaming, instant messaging, IoT, and other domains. The product's features have been optimized and iterated over the years in step with each team's evolving logging requirements.
Fortunately, the product has been commercially available on Alibaba Cloud for several years, serving tens of thousands of enterprise users, including top customers in domestic live streaming, short video, news media, gaming, and other industries. Going from serving one company to serving tens of thousands brings a qualitative difference in product functionality. Moving to the cloud has pushed us to think harder about which problems a log platform needs to solve for users, what the core requirements of logging are, and how to meet the needs of different industries and business roles.
Requirement decomposition and function design
In the previous section, we analyzed the log requirements of different roles within the company and summed up the following points:
- Support collecting logs in various formats and from various data sources, including non-K8s sources
- Quickly search for and locate problem logs
- Turn semi-structured / unstructured logs of various formats into structured records, and support fast statistics, analysis, and visualization (a parsing sketch follows this list)
- Support real-time computation over logs to derive business metrics, with real-time alerting on those metrics (essentially APM)
- Support correlation analysis across many dimensions over very large-scale logs, where some delay is acceptable
- Connect easily to various external systems, or support custom data consumption, for example feeding a third-party audit system
- Enable intelligent alerting, prediction, and root-cause analysis based on logs and related time-series data, with support for custom offline training to improve results
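As a concrete illustration of the third point, here is a minimal sketch of turning an unstructured access log line into a structured record. The log format and field names are assumptions for illustration, not part of the original article:

```python
import re

# Illustrative pattern for an Nginx-style access log line; adjust the
# pattern to the actual log format before real use.
ACCESS_LOG = re.compile(
    r'(?P<remote_addr>\S+) \S+ \S+ \[(?P<time_local>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<body_bytes>\d+)'
)

def parse_line(line):
    """Turn one raw log line into a structured dict, or None if unparseable."""
    m = ACCESS_LOG.match(line)
    return m.groupdict() if m else None

sample = '10.0.0.1 - - [02/Jun/2020:10:00:00 +0800] "GET /api/orders HTTP/1.1" 200 512'
print(parse_line(sample))
# {'remote_addr': '10.0.0.1', 'time_local': '02/Jun/2020:10:00:00 +0800',
#  'method': 'GET', 'path': '/api/orders', 'status': '200', 'body_bytes': '512'}
```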
In order to meet the above functional requirements, the functional modules that must be available on the log platform are:
- Omni-directional log collection: support both DaemonSet and Sidecar collection modes to cover different needs, and collect from data sources such as web, mobile, IoT, and physical / virtual machines
- Real-time log channel: required to connect upstream and downstream, ensuring logs can be consumed easily by a variety of systems
- Data cleansing (ETL: Extract, Transform, Load): clean logs in various formats, supporting filtering, enrichment, transformation, gap-filling, splitting, aggregation, and so on (a small cleansing sketch follows this list)
- Log display and search: required for every log platform, to locate logs quickly by keyword and view log context; a seemingly simple function that is the hardest to do well
- Real-time analysis: search only locates some problems; statistics and analysis help find root causes quickly and can also compute some business metrics
- Stream computing: a stream computing framework (Flink, Storm, Spark Streaming, etc.) is typically used to compute real-time metrics or do custom cleansing of the data
- Offline analysis: operations and security requirements call for correlation computation across many dimensions over large volumes of historical logs, which today can only be done by T+1 offline analysis engines
- Machine learning framework: conveniently feed historical logs into a machine learning framework for offline training, and load the trained results into an online real-time algorithm library
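To make the cleansing module more concrete, here is a minimal sketch of the filter / transform / enrich steps applied to a structured record like the one parsed above. The rules and field names are illustrative assumptions:

```python
# Minimal ETL sketch: filter, transform, and enrich one structured record.
# All rules and field names here are illustrative, not a fixed schema.

def clean(record):
    # Filter: drop health-check noise so it never reaches storage.
    if record.get("path") == "/healthz":
        return None
    # Transform: normalize string fields into numeric types for aggregation.
    record["status"] = int(record["status"])
    record["body_bytes"] = int(record["body_bytes"])
    # Enrich: derive a field that downstream alerting can group on.
    record["is_error"] = record["status"] >= 500
    return record

print(clean({"path": "/api/orders", "status": "502", "body_bytes": "0"}))
# {'path': '/api/orders', 'status': 502, 'body_bytes': 0, 'is_error': True}
```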
Open source solution design

[Figure: a typical open source logging platform architecture with ELK at its core]
With the help of a strong open source community, such a logging platform can easily be assembled from a combination of open source software. The figure above shows a very typical logging solution with ELK at its core:
- Collection agents such as Filebeat and Fluentd implement unified data collection on the containers
- To provide richer upstream / downstream connectivity and buffering capacity, Kafka serves as the receiver for the collected data
- The collected raw data needs further cleansing: Logstash or Flink subscribes to the data in Kafka and writes it back to Kafka after cleansing
- The cleansed data can then be fed to Elasticsearch for real-time query and retrieval, to Flink for computing real-time metrics and alerting, to Hadoop for offline data analysis, and to TensorFlow for offline model training (a sketch of the Kafka-to-Elasticsearch step follows this list)
- Common visualization components such as Grafana and Kibana visualize the data
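The Kafka-to-Elasticsearch leg of this pipeline might look roughly like the sketch below, assuming the kafka-python library and the version 8 elasticsearch Python client; the topic, index, and host names are placeholders:

```python
import json

from kafka import KafkaConsumer          # pip install kafka-python
from elasticsearch import Elasticsearch  # pip install elasticsearch

# Subscribe to the cleansed-data topic written by Logstash / Flink.
consumer = KafkaConsumer(
    "logs-cleaned",                                   # placeholder topic
    bootstrap_servers=["kafka:9092"],                 # placeholder broker
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
es = Elasticsearch("http://elasticsearch:9200")       # placeholder host

for message in consumer:
    record = message.value
    # Daily indices keep retention cheap: old indices can simply be dropped.
    index = "app-logs-%s" % record.get("date", "unknown")
    es.index(index=index, document=record)
```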
Why did we choose self-development?

Combining open source software is a very efficient approach: thanks to the strong open source community and the accumulated experience of a large user base, we can stand up such a system quickly, and it meets most of our needs.
Once this system is deployed, logs flow in from the containers, queries work in Elasticsearch, SQL runs successfully on Hadoop, charts appear in Grafana, and alert messages arrive. The whole process may take only a few days of overtime, and when the system finally runs end to end, you can breathe a sigh of relief and lean back in your office chair.
However, reality rarely lives up to the ideal. As we move through staging, testing, and production, the first application is onboarded, then gradually more applications follow and more people start using the platform. At this point, many problems may surface:
- Scaling: as business volume grows, so does log volume. Kafka and ES must be expanded continuously, and so must the connectors syncing from Kafka to ES. The most painful part is the collection agent: a DaemonSet-deployed Fluentd on each node has no way to scale out, so when a single agent hits its bottleneck the only option is switching to Sidecar mode, which brings a heavy operational workload plus a series of other problems, such as CICD integration, resource consumption, configuration planning, and unsupported stdout collection.
- Reliability: starting from edge businesses, more and more core businesses are onboarded, and reliability requirements keep rising. Developers report that data cannot be found in ES, operations say the statistical reports are inaccurate, security complains the data is not real-time. Every issue must be traced across collection, queuing, cleansing, transmission, and other stages, so troubleshooting is very expensive. A monitoring scheme for the log system itself is also needed, one that detects problems immediately and must not depend on the log system (see the canary sketch after this list).
- Multi-tenancy: as more developers use the platform to investigate problems, one or two users submitting large queries can drive up the overall system load, blocking other users' queries or even triggering Full GC. Capable companies modify ES to support multi-tenant isolation, or build separate ES clusters per business unit, and end up operating and maintaining many ES clusters, which is again a heavy workload.
- Cost: once enough manpower has been invested to keep the platform running day to day, finance comes along and says we are using too many machines and the cost is too high. But on reflection we really do need that many: most machines sit at 20-30% utilization daily, yet peaks can reach 70%, so capacity cannot be reclaimed without failing at peak. The only answer is peak shaving and valley filling, which is yet more work.
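For the self-monitoring point above, one common pattern is an out-of-band canary: write a unique marker through the normal logging path, then verify it becomes searchable within a deadline, alerting through a channel independent of the log system. A minimal sketch, assuming the version 8 elasticsearch Python client, with hypothetical index and endpoint names:

```python
import time
import uuid
import logging

from elasticsearch import Elasticsearch  # pip install elasticsearch

es = Elasticsearch("http://elasticsearch:9200")  # placeholder endpoint

def canary_probe(timeout_s=60.0):
    """Return True if a freshly written marker becomes searchable in time."""
    token = uuid.uuid4().hex
    # Emit through the same stdout/file path the collection agents watch.
    logging.getLogger("canary").warning("canary-token=%s", token)
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        result = es.search(index="app-logs-*",
                           query={"match": {"message": token}})
        if result["hits"]["total"]["value"] > 0:
            return True
        time.sleep(5)
    # The alert itself must travel over a channel that does not depend on
    # the log system, e.g. SMS or a phone call to the on-call engineer.
    return False
```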
These are the problems a mid-sized Internet company typically runs into when building a log platform. At Alibaba, they are magnified many times over:
- Facing Double 11 traffic, none of the open source software on the market can handle our volume.
- Facing tens of thousands of business applications within Alibaba and thousands of engineers using the platform concurrently, concurrency and multi-tenant isolation must be pushed to the extreme.
- Facing core scenarios such as orders and transactions, the stability of the whole pipeline must reach three nines or even four nines of availability.
- With so much data every day, cost optimization matters enormously: a 10% cost reduction can translate into savings of hundreds of millions.
Alibaba's K8s logging solution

To address the problems above, we have developed and refined our own K8s logging solution over the years:
- Our self-developed collection agent Logtail handles omni-directional K8s data collection. Logtail currently has millions of deployments across the group, and its performance and stability have been tested at financial grade through many Double 11s.
- The data queue, cleansing and processing, real-time retrieval, real-time analysis, and AI algorithms are natively integrated rather than assembled like building blocks from separate open source components. This greatly shortens the data path, and a shorter path also means fewer chances for error.
- The queue, cleansing, retrieval, analysis, and AI engines are all deeply customized and optimized for log scenarios, meeting the demands of high throughput, dynamic scaling, second-level search over hundreds of millions of logs, low cost, and high availability.
- For general-purpose needs such as stream computing and offline analysis, mature products exist both in open source and at Alibaba, and they are supported through seamless integration. The log service currently integrates with dozens of downstream open source and cloud products. (A write-path sketch follows this list.)
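For reference, writing a log entry to the platform from application code might look like the sketch below, assuming the interfaces of the open source aliyun-log-python-sdk; the endpoint, credentials, project, and logstore names are placeholders:

```python
import time

# pip install aliyun-log-python-sdk
from aliyun.log import LogClient, LogItem, PutLogsRequest

client = LogClient("cn-hangzhou.log.aliyuncs.com",   # placeholder endpoint
                   "<access_key_id>", "<access_key_secret>")

item = LogItem()
item.set_time(int(time.time()))
item.push_back("level", "INFO")
item.push_back("message", "order created")

request = PutLogsRequest("my-project", "my-logstore",  # placeholder names
                         topic="app", source="10.0.0.1", logitems=[item])
client.put_logs(request)
```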
This system currently supports log analysis for tens of thousands of companies across Alibaba Group, Ant Group, and Alibaba Cloud, with roughly 16 PB of data written per day. Developing and operating a system at this scale poses many problems and challenges that we will not expand on here; interested readers can refer to our team's talk: Design and implementation of Alibaba's 10 PB/day log system.
Summary
This article described, at the architecture level, how to build a K8s log analysis platform, covering both open source solutions and the system Alibaba developed in-house. However, a great deal of work remains to land such a system in a production environment and run it efficiently:
- What is the right way to produce logs on K8s?
- Which collection approach to choose on K8s: DaemonSet or Sidecar?
- How does the logging scheme integrate with CICD?
- How should log storage be partitioned per application under microservices?
- How to monitor K8s itself based on the K8s system logs?
- How to monitor the reliability of the log platform itself?
- How to automatically inspect multiple microservices / components?
- How to automatically monitor multiple sites and locate issues quickly when traffic is abnormal?
In follow-up articles, we will share how to implement this system step by step. Stay tuned.
"Alibaba Cloud's native Wechat official account (ID:Alicloudnative) focuses on micro-services, Serverless, containers, Service Mesh and other technology areas, focuses on cloud native popular technology trends, and large-scale cloud native landing practices, and is the technical official account that best understands cloud native developers."