Many newcomers are unclear about how to handle log output in Kubernetes. To help solve this problem, the following article explains the topic in detail; readers with this need are welcome to learn from it and will hopefully gain something.
Preface
We will build a log monitoring system in Kubernetes step by step from a practical point of view.
The first step in building a log system is producing the logs themselves, which is often the most complicated and difficult step.
The importance of logs in Kubernetes
The most basic function of a log is to record a program's execution trace. On top of that, many capabilities are derived, such as online monitoring, alerting, operations analysis, and security analysis, and these in turn place requirements on the logs. We need to standardize logs as much as possible to reduce the cost of collecting, parsing, and analyzing them.
In Kubernetes the environment is highly dynamic and logs are ephemeral, so they must be collected into central storage in real time. To support collection, there are additional requirements on how logs are output.
Below we list the common considerations for log output in Kubernetes (items marked with (*) are unique to Kubernetes):
How to choose a log level
The log level indicates the severity of the event a log line records, and it is a field every log must carry. Logs are usually divided into six levels:
FATAL (fatal): used for very serious or unexpected errors. Such errors should trigger an immediate alert and manual intervention.
ERROR (error): an unexpected error that may cause a system anomaly but does not affect the core business or the overall operation of the system.
WARN (warning): potentially dangerous or otherwise noteworthy information (on the core path).
INFO (information): details of the application's execution; from these you should be able to follow the main flow of each request.
DEBUG (debugging): information for offline debugging, used to analyze application logic; it should not be enabled for online applications.
TRACE (tracing): the most detailed traces, possibly including the data involved.
As a programmer, you must set log levels sensibly. Here are some lessons from development experience:
FATAL logs must be reserved for very serious errors and situations that require manual handling.
Many programmers find it hard to choose between ERROR and WARNING. Consider it from the alerting perspective: ERROR generally requires an alert, WARNING does not.
The log level indicates severity on the one hand and controls the application's log output on the other. Usually only INFO- and WARN-level logs are enabled online.
DEBUG logs can be printed liberally to make problem analysis easier.
All user request logs must be recorded.
For calls to external systems whose behavior is uncertain, log as comprehensively as possible.
The log library must be able to change the log level at runtime, so the level can be adjusted temporarily when a problem needs analysis; a minimal sketch of this follows the list.
When new features launch, the related logs can be temporarily promoted to a higher level for real-time observation and monitoring, then restored once things stabilize (leave a comment so it is easy to change back).
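As a concrete illustration of runtime level switching, here is a minimal Go sketch using the logrus library (github.com/sirupsen/logrus); the library choice and the /loglevel admin endpoint are assumptions made for illustration, not something the article prescribes.

package main

import (
	"fmt"
	"net/http"

	log "github.com/sirupsen/logrus"
)

func main() {
	log.SetLevel(log.InfoLevel) // normal online level

	// Hypothetical admin endpoint: GET /loglevel?level=debug lowers the
	// level temporarily while a problem is being analyzed.
	http.HandleFunc("/loglevel", func(w http.ResponseWriter, r *http.Request) {
		lvl, err := log.ParseLevel(r.URL.Query().Get("level"))
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		log.SetLevel(lvl)
		fmt.Fprintf(w, "log level set to %s\n", lvl)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}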
Log content specification
Without constraints, programmers run wild and all kinds of log content appear; logs that only their authors can understand are hard to analyze and alert on. So a project needs a top-down log specification to constrain developers, so that all logs look as if they were printed by one person and are easy to analyze.
Log fields
The fields usually required in every log are: time, level, and location. For a specific module / process / business, some common fields are also required, such as:
If you use a trace system, attach the TraceID to the log.
A fixed process needs its corresponding fields attached; for example, within an order's life cycle there must be the order number, user ID, and similar information, which can be attached to the process's log instance through a Context (see the sketch after this list).
HTTP requests should record: URL, Method, Status, Latency, Inflow, OutFlow, ClientIP, UserAgent, etc.; for details, see the Nginx log format.
If logs from multiple modules are printed to the same stream / file, there must be a field identifying the module name.
The log field specification is best driven top-down by the operations / middleware platform, constraining the programmers of each module / process to print logs according to the rules.
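As an illustration of attaching per-process fields through a Context, below is a minimal Go sketch, again assuming logrus; the field names (order_id, user_id, trace_id) and helper names are illustrative, not a mandated schema.

package main

import (
	"context"

	log "github.com/sirupsen/logrus"
)

type ctxKey struct{}

// WithOrderLogger stores a logger pre-populated with the order's common fields.
func WithOrderLogger(ctx context.Context, orderID, userID, traceID string) context.Context {
	entry := log.WithFields(log.Fields{
		"order_id": orderID,
		"user_id":  userID,
		"trace_id": traceID,
	})
	return context.WithValue(ctx, ctxKey{}, entry)
}

// Logger extracts the per-process logger, falling back to the standard one.
func Logger(ctx context.Context) *log.Entry {
	if e, ok := ctx.Value(ctxKey{}).(*log.Entry); ok {
		return e
	}
	return log.NewEntry(log.StandardLogger())
}

func main() {
	ctx := WithOrderLogger(context.Background(), "o-1001", "u-42", "t-abc")
	Logger(ctx).Info("order state changed") // every line carries the common fields
}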
Log representation
Generally speaking, we recommend the key/value pair log format; for example, the Alibaba Apsara (Feitian) logging library uses this form:
[2019-12-30 21:45:30.611992] [WARNING] [958] [block_writer.cpp:671] path:pangu://localcluster/index/3/prom/7/1577711464522767696_0_1577711517 min_time:1577712000000000 max_time:1577715600000000 normal_count:27595 config:prom start_line:57315569 end_line:57343195 latency(ms):42 type:AddBlock
Key/value logs are fully self-describing and easy to understand, and they are convenient to parse automatically at collection time.
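For instance, here is a minimal Go sketch of key/value (logfmt) output using the go-kit log package (github.com/go-kit/log), which the article recommends for Go further below; the exact keys are illustrative:

package main

import (
	"os"

	"github.com/go-kit/log"
)

func main() {
	logger := log.NewLogfmtLogger(os.Stderr)
	logger = log.With(logger, "ts", log.DefaultTimestampUTC, "caller", log.DefaultCaller)
	// Prints e.g.: ts=2019-03-08T10:02:47Z caller=main.go:15 level=warn msg="slow block write" latency_ms=42
	logger.Log("level", "warn", "msg", "slow block write", "latency_ms", 42)
}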
The JSON log format is also recommended: many log libraries can output JSON natively, and most log collection agents support collecting logs in JSON format.
{"addr": "tcp://0.0.0.0:10010", "caller": "main.go:98", "err": "listen tcp: address tcp://0.0.0.0:10010: too many colons in address", "level": "error", "msg": "Failed to listen", "ts": "2019-03-08T10:02:47.469421Z"}
Note: non-human-readable log formats (such as ProtoBuf, Binlog, etc.) are not recommended in most scenarios.
Keep each log entry on a single line
Unless it is really necessary, avoid spreading one log entry across multiple lines; multi-line entries are expensive to collect, parse, and index.
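One common trick, sketched below in Go, is to escape newlines in stack traces before logging so the whole trace stays on one line; whether this or a collector-side multi-line merge rule is preferable depends on your pipeline, so treat this as one option rather than the method.

package main

import (
	"runtime/debug"
	"strings"

	log "github.com/sirupsen/logrus"
)

func main() {
	// Replace newlines so the stack trace is one log entry, not many lines.
	stack := strings.ReplaceAll(string(debug.Stack()), "\n", "\\n")
	log.WithField("stack", stack).Error("unexpected panic recovered")
}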
Reasonable control of log output
Log output directly affects disk usage and application performance. Too many logs hinder viewing, collection, and analysis; too few hinder monitoring and leave nothing to investigate when problems arise.
In general, online applications need to control the volume of log data sensibly:
Request and response logs at the service entry point should be output and collected unless there is a special reason not to; the collected fields can be adjusted as required.
Error logs usually need to be printed in full; if there are too many, print them by sampling (see the sketch after this list).
Reduce useless log output; in particular, minimize logging inside loops.
Request logs (such as Ingress or Nginx access logs) should generally stay below 5 MB/s (at about 500 bytes each, no more than 10,000 entries/s), and application logs below 200 KB/s (at about 2 KB each, no more than 100 entries/s).
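Here is a minimal Go sketch of the sampling idea from the list above: an atomic counter prints one out of every N occurrences; N=100 and the field name are arbitrary illustrative choices.

package main

import (
	"errors"
	"sync/atomic"

	log "github.com/sirupsen/logrus"
)

var errCount uint64

// logErrorSampled prints roughly 1 of every 100 errors, tagging each
// printed line with the running occurrence count so volume stays visible.
func logErrorSampled(err error) {
	n := atomic.AddUint64(&errCount, 1)
	if n%100 == 1 {
		log.WithField("occurrence", n).Error(err)
	}
}

func main() {
	for i := 0; i < 250; i++ {
		logErrorSampled(errors.New("db connection refused"))
	}
}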
Select multiple log output targets
It is recommended to write different types of logs to different targets (files), which makes classified collection, viewing, and monitoring easier. For example (a routing sketch follows this list):
Put access logs in a separate file; if there are not many domain names, you can use one file per domain.
Put error-class logs in a separate file and configure monitoring alerts on it separately.
Put logs of calls to external systems in a separate file, to make later reconciliation and auditing easier.
Middleware is usually provided by a unified platform, and its logs are usually printed to a separate file.
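A minimal Go sketch of this routing using the standard library log package; the file paths and categories are illustrative only.

package main

import (
	stdlog "log"
	"os"
)

func mustOpen(path string) *os.File {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_APPEND|os.O_WRONLY, 0644)
	if err != nil {
		panic(err)
	}
	return f
}

func main() {
	accessLog := stdlog.New(mustOpen("/var/log/app/access.log"), "", stdlog.LstdFlags)
	errorLog := stdlog.New(mustOpen("/var/log/app/error.log"), "", stdlog.LstdFlags)
	externalLog := stdlog.New(mustOpen("/var/log/app/external.log"), "", stdlog.LstdFlags)

	accessLog.Println("GET /api/v1/orders 200 12ms")
	errorLog.Println("db connection refused")             // alerting watches this file
	externalLog.Println("call payment-gateway status=OK") // kept for reconciliation / audit
}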
Control the performance cost of logging
As an auxiliary module of the business system, logging must not affect normal business operation, so the performance cost of the logging module deserves separate attention. When selecting or developing a log library, benchmark it to ensure that logging consumes no more than 5% of overall CPU.
Note: make sure log printing is asynchronous and does not block the business system.
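To make the asynchrony concrete, below is a toy Go sketch of a non-blocking writer backed by a buffered channel; production log libraries (spdlog's async mode, Logback's AsyncAppender, etc.) already provide this, so the sketch only shows the principle.

package main

import (
	"fmt"
	"os"
	"time"
)

type asyncLogger struct{ ch chan string }

func newAsyncLogger(buf int) *asyncLogger {
	l := &asyncLogger{ch: make(chan string, buf)}
	// A single background goroutine does the actual (slow) IO.
	go func() {
		for msg := range l.ch {
			fmt.Fprintln(os.Stderr, msg)
		}
	}()
	return l
}

// Log never blocks the caller: when the buffer is full the line is
// dropped rather than stalling the business path.
func (l *asyncLogger) Log(msg string) {
	select {
	case l.ch <- msg:
	default: // buffer full: drop instead of blocking
	}
}

func main() {
	logger := newAsyncLogger(1024)
	logger.Log("request handled")
	time.Sleep(100 * time.Millisecond) // toy-only drain; a real logger exposes Flush/Close
}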
How to select a log library
There are many open source log libraries, dozens in almost every language, so choosing one that fits your company / business needs care. A simple guideline is to use a stable version of a popular log library, which makes you much less likely to run into pitfalls. For example:
Java: Log4j or Logback
Go: go-kit
Python: the logging library integrated by default is sufficient for most scenarios; the official Logging Cookbook is recommended reading
C++: spdlog is recommended, being high-performance and cross-platform
Choosing the log output form
In virtual machine / physical machine scenarios, most applications output logs to files (only some system applications write to syslog/journald). In container scenarios, there is one more option: standard output. If the application writes its logs to stdout or stderr, the logs automatically enter Docker's logging module and can be viewed directly via docker logs or kubectl logs.
Container standard output only suits relatively simple applications, such as some system components in K8s. Online service applications usually involve multiple layers (middleware) and interact with various services, and their logs fall into several categories; if all of them are printed to the container's standard output, they become hard to distinguish and process.
Meanwhile, container standard output costs DockerEngine performance: in our tests, a log volume of 100,000 entries/s consumed an extra CPU core of DockerEngine (100% of one core).
Whether to write logs to disk, and on what media
In Kubernetes you can also connect the log library directly to the log system, sending logs straight to the backend as they are printed, without ever touching disk. This eliminates disk storage and agent collection, so overall performance is much higher.
We generally recommend this approach only for scenarios with very large log volumes; in ordinary cases, write logs to disk. Compared with sending directly to the backend, the disk adds a layer of file cache that can buffer some data during network failures, and when the log system is unavailable, R&D and Ops staff can still read the files directly, which improves overall reliability.
Kubernetes provides a variety of storage options, such as local storage, remote file storage, and object storage. Because log-write QPS is very high and directly tied to the application, remote storage types add two to three extra network round trips per write. It is generally recommended to use local storage, either a HostVolume (hostPath) or an EmptyDir, to keep the impact on writing and collection as small as possible.
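For example, a minimal Pod manifest writing logs to an emptyDir volume might look like the sketch below; the names, image, and paths are hypothetical, and a node-level collection agent would typically mount or discover the same directory.

apiVersion: v1
kind: Pod
metadata:
  name: app-with-local-logs     # hypothetical name
spec:
  containers:
  - name: app
    image: my-app:latest        # hypothetical image
    volumeMounts:
    - name: app-logs
      mountPath: /var/log/app   # the application writes its log files here
  volumes:
  - name: app-logs
    emptyDir: {}                # node-local storage, freed when the Pod is deleted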
How to guarantee the log retention period
Compared with traditional virtual machine / physical machine scenarios, Kubernetes provides powerful scheduling, fault tolerance, and scale-in / scale-out capabilities at the node and application layers, which let us run applications with high reliability and extreme elasticity. A side effect of these advantages is that nodes and containers are created and deleted dynamically, so logs can be destroyed at any time, and there is no way to guarantee that log retention meets DevOps, audit, and other requirements.
Long-term log retention in such a dynamic environment can only be achieved with centralized log storage: through real-time collection, the logs of every node and container are shipped to the central log system within seconds, so even if a node or container dies, the scene can be reconstructed from its logs.
Log output is a crucial link in building a log system. A company / product line must follow a unified log specification to ensure that subsequent log collection, analysis, monitoring, and visualization proceed smoothly.