This article introduces the technologies used in a log collection system. It walks through the practical problems that come up when building a log collection agent from scratch and the techniques used to solve them.
Overview
Logs have changed greatly, from being human-oriented at first to being machine-oriented today. Initially, the main consumers of logs were software engineers, who read them to troubleshoot problems. Today, large numbers of machines process log data around the clock to generate readable reports that help humans make decisions. In this transformation, the log collection agent plays an important role.
A log collection agent is essentially a program that delivers data from a source to a destination. The destination is usually a centralized store that supports data subscription. The point of this design is to decouple log analysis from log storage: different consumers may be interested in the same log and handle it in different ways once they obtain it. With storage and analysis decoupled, each consumer can subscribe to the logs it cares about and choose its own analysis tools.
Kafka is the most popular example in the industry of centralized storage with data subscription; inside Alibaba the counterpart is DataHub, and on Aliyun it is LogHub. Data sources fall roughly into three categories: ordinary text files, log data received over the network, and data passed through shared memory. This article covers only the first category. That, roughly, is the core function of a log collection agent.
On top of this, functions such as log filtering, log formatting, and routing can be added, making the agent look like a production line. In terms of how logs are delivered, log collection can be divided into push mode and pull mode. This article mainly analyzes push-mode log collection.
In push mode, the log collection agent actively obtains data from the source and sends it to the destination; in pull mode, the destination actively pulls data from the source.
The most popular log collection agents in the industry today are Fluentd, Logstash, Flume, Scribe, and so on; inside Alibaba the agent is LogAgent, and on Aliyun it is LogTail. Among these products, Fluentd holds a dominant position and has successfully entered the CNCF camp. Its Unified Logging Layer greatly reduces the complexity of log collection and analysis.
Fluentd's view is that most existing log formats are poorly structured, thanks to humans' excellent ability to parse log data: logs were originally written for humans, who used to be their main consumers.
For this reason, Fluentd hopes to reduce the complexity of the whole log collection pipeline by unifying the log format. Imagine M kinds of log input and N kinds of storage connected behind the log collection agent. If every storage system has to parse every log format, the total complexity is M x N; if the agent unifies the log format, the total complexity drops to M + N. For example, with 5 input formats and 4 back ends, that is 20 format-specific parsers versus 9 adapters. This is the core idea of Fluentd, and its plug-in mechanism is also commendable.
Logstash is similar to Fluentd; it belongs to the ELK technology stack and is also widely used in the industry. For a comparison between the two, see the article "Fluentd vs. Logstash: A Comparison of Log Collectors".
Writing a log collection agent from scratch
In most people's eyes, a log collection agent is just a data "porter", and they often complain that this porter uses too many machine resources. Simply put, it is a tail -f command, corresponding to the in_tail plug-in in Fluentd.
As a developer who has built a log collection agent firsthand, the author hopes to use this article to explain some of the technical challenges involved. To keep the article coherent, the problems encountered along the way are described under the theme of "writing a log collection agent from scratch".
How do I find a file?
When we start writing a log collection agent, the first problem is how to find the files. The simplest way is for users to list the files to be collected in a configuration file; the agent reads the configuration, finds the list of files, and opens them for collection. That is about as simple as it gets.
In most cases, however, logs are generated dynamically and new files are created while collection is running, so listing them in advance in the configuration file is too cumbersome. Normally the user only needs to configure a rule that matches the log directory and file names. For example, Nginx logs are placed in the /var/www/log directory, with files named access.log and access.log-2018-01-10. Such files can be matched with a wildcard or a regular expression such as access.log(-[0-9]{4}-[0-9]{2}-[0-9]{2})?. With such rules, the log collection agent knows which files need to be collected and which do not.
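To make the discovery rule concrete, here is a minimal Go sketch. The directory /var/www/log and the regular expression come from the Nginx example above; everything else is an illustrative assumption, not the article's actual implementation.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"regexp"
)

// A minimal sketch: discover files to collect by matching the names in a
// configured directory against a pattern (both taken from the text's example).
func main() {
	dir := "/var/www/log"
	pattern := regexp.MustCompile(`^access\.log(-\d{4}-\d{2}-\d{2})?$`)

	entries, err := os.ReadDir(dir)
	if err != nil {
		fmt.Fprintln(os.Stderr, "read dir:", err)
		return
	}
	for _, e := range entries {
		if !e.IsDir() && pattern.MatchString(e.Name()) {
			fmt.Println("collect:", filepath.Join(dir, e.Name()))
		}
	}
}
```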
The next problem is how to discover newly created log files. Polling the directory periodically might work, but if the polling period is too long, discovery is not timely enough; if it is too short, it wastes CPU, and you do not want your agent to be blamed for consuming too much CPU. The Linux kernel provides an efficient Inotify mechanism: the kernel monitors changes to files in a directory and notifies user space through events.
But do not celebrate too soon. Inotify is not as good as it sounds, and it has several problems. First, not all file systems support Inotify. Second, it does not monitor directories recursively: if we watch directory A, and directory B is created under A with file C created under B immediately afterwards, we only receive the event for B's creation; the event for C is lost, so that file is never discovered or collected.
Inotify also cannot do anything about files that already exist; it can only discover newly created files in real time. More restrictions and bugs of Inotify are described in its man page. If you want to make sure nothing is missed, the best solution is to combine Inotify with polling: use a long polling period to catch missed and historical files, and rely on Inotify to discover newly created files in real time in most cases. Even in scenarios that do not support Inotify, polling alone still works.
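The combination could look roughly like the following Go sketch, built on the raw inotify syscalls from golang.org/x/sys/unix. The directory, the five-minute poll period, and the overall structure are assumptions for illustration only.

```go
package main

import (
	"fmt"
	"time"
	"unsafe"

	"golang.org/x/sys/unix"
)

// Sketch of inotify + slow polling: inotify catches newly created files in
// near real time, while an infrequent poll (stubbed out here) re-scans the
// directory for anything inotify missed.
func main() {
	const dir = "/var/www/log" // illustrative path

	fd, err := unix.InotifyInit1(0)
	if err != nil {
		panic(err)
	}
	// Watch for files created in, or moved into, the directory.
	if _, err := unix.InotifyAddWatch(fd, dir, unix.IN_CREATE|unix.IN_MOVED_TO); err != nil {
		panic(err)
	}

	// Fallback: a long-period poll re-scans the directory for missed files.
	go func() {
		for range time.Tick(5 * time.Minute) {
			fmt.Println("slow poll: re-scan", dir, "for missed/history files")
		}
	}()

	buf := make([]byte, 4096)
	for {
		n, err := unix.Read(fd, buf)
		if err != nil {
			panic(err)
		}
		// Each event is a fixed-size header followed by a NUL-padded name.
		for off := 0; off < n; {
			ev := (*unix.InotifyEvent)(unsafe.Pointer(&buf[off]))
			name := buf[off+unix.SizeofInotifyEvent : off+unix.SizeofInotifyEvent+int(ev.Len)]
			fmt.Println("inotify: new file", string(name[:clen(name)]))
			off += unix.SizeofInotifyEvent + int(ev.Len)
		}
	}
}

// clen returns the length of b up to the first NUL byte.
func clen(b []byte) int {
	for i, c := range b {
		if c == 0 {
			return i
		}
	}
	return len(b)
}
```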
At this point our agent can find files, so the next step is to open them and collect. But there is an unexpected situation: the machine crashes during collection. How do we make sure that data already collected is not collected again, and that we can resume from where we left off?
The advantage of the polling-based approach is that no file is missed (barring file-system bugs), and the CPU cost can be reduced by lengthening the polling period, at the price of timeliness. Inotify is efficient and real-time, but cannot guarantee that 100% of events are delivered. Combining polling with Inotify lets each make up for the other's weaknesses.

Keeping the point file highly available
Point file? Yes, the point file (checkpoint file) records each file name and the corresponding collection offset. So how do we make sure this point file is written reliably? A machine crash at the moment the file is being written can leave the point data lost or corrupted. To solve this, a write must either fully succeed or fully fail; a half-written file must never occur. The Linux kernel gives us atomic rename for exactly this.
One file can be atomically renamed onto another, and this can be used to keep the point file highly available. Suppose we already have a point file called offset, and every second we update it with the current collection position. The whole update process is as follows:
Write the point data to a temporary file offset.bak on disk
Call fdatasync to make sure the data reaches the disk
Rename offset.bak to offset with the rename system call
In this way the point file is valid at any moment: each update first writes the temporary file completely and then replaces the old one atomically, so the offset file is always usable (a sketch follows). In the extreme case, the point for the last second may not have been persisted yet; after a restart the agent re-collects and resends that last second of data, which basically meets the requirement.
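A minimal Go sketch of this write-then-rename sequence, assuming the point data has been serialized elsewhere and using the file names offset and offset.bak from the description above:

```go
package main

import (
	"os"

	"golang.org/x/sys/unix"
)

// saveOffsets writes the checkpoint to offset.bak, fdatasyncs it to disk, and
// atomically renames it over offset, so readers always see either the old or
// the new state and never a half-written file.
func saveOffsets(data []byte) error {
	f, err := os.Create("offset.bak")
	if err != nil {
		return err
	}
	if _, err := f.Write(data); err != nil {
		f.Close()
		return err
	}
	// Make sure the bytes are on disk before the rename makes them visible.
	// (A production agent would typically also fsync the containing directory.)
	if err := unix.Fdatasync(int(f.Fd())); err != nil {
		f.Close()
		return err
	}
	if err := f.Close(); err != nil {
		return err
	}
	// rename(2) atomically replaces offset with the fully written file.
	return os.Rename("offset.bak", "offset")
}

func main() {
	// Illustrative payload; the real format of the point data is up to the agent.
	if err := saveOffsets([]byte(`{"/var/www/log/access.log": 1024}`)); err != nil {
		panic(err)
	}
}
```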
However, the point file records the file name and the corresponding collection offset, which leads to another problem: what if the file is renamed while the process is down? After restarting, the agent cannot find the corresponding collection position. In log scenarios the file name is in fact very unreliable: renames, deletions, symbolic links, and so on can make the same name point to different files at different times, and keeping the full file path in memory is also memory-consuming.
The Linux kernel provides the inode as a file identifier and guarantees that, at any given moment, no two files share the same inode. So the problem above can be solved by recording the inode together with the collection offset in the point file. On startup, the agent discovers the files to collect, looks up each file's collection position in the point file by its inode, and resumes from there. Even if a file has been renamed, its inode has not changed, so the corresponding position can still be found.
But does the inode have any limitations? Of course, there is no free lunch: inodes can repeat across different file systems, and a machine can mount several file systems, so we also need the dev (device number) to disambiguate. The point file therefore needs to record (dev, inode, offset) triples. Now our agent can collect logs normally and, even after a crash and restart, continue where it left off.
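For illustration, a small Go sketch of reading the dev and inode values that would go into such a triple; the path is an assumption:

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// Read the (dev, inode) pair used as the checkpoint key described above.
func main() {
	fi, err := os.Stat("/var/www/log/access.log")
	if err != nil {
		panic(err)
	}
	st, ok := fi.Sys().(*syscall.Stat_t)
	if !ok {
		panic("not a Linux stat_t")
	}
	// dev distinguishes file systems, ino identifies the file within one;
	// together with the collected offset they form the checkpoint triple.
	fmt.Printf("dev=%d inode=%d\n", st.Dev, st.Ino)
}
```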
But one day we suddenly find two different files with the same inode. Doesn't the Linux kernel guarantee that inodes do not repeat? Is this a kernel bug? Note the wording "at the same time": the kernel only guarantees that no two files share an inode at the same moment. What does that mean? This is a major technical challenge for a log collection agent: how to identify a file accurately.
How do I identify a file?
How to identify a file is a challenging problem for a log collection agent. We first identified files by name, then found that names are unreliable and memory-consuming; we then switched to dev + inode, only to find that an inode is unique only at a given moment. What does that sentence mean?
Imagine that at time T1 we discover and start collecting a file with inode 1. Some time later the file is deleted and the Linux kernel releases its inode. When a new file is created, the kernel may assign the freed inode to it. When this new file is discovered and the agent looks up its last collection position in the point file, it finds the record left by the previous file, and the new file ends up being collected from the wrong position.
Fortunately, the Linux kernel exposes extended attributes (xattr) on file systems, so we can generate a unique identity for each file and record it in the point file. If the file is deleted and a new file is created with the same inode, the identities differ, and the agent can tell that these are two different files.
The problem, however, is that not all file systems support the xattr extended attribute, so it only solves part of the problem. Perhaps we can derive the identity from the file contents instead, for example by reading the first N bytes of the file as its identity. That is a workable solution, but how large should N be?
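A hedged Go sketch of the xattr idea follows. The attribute name user.log_agent_id is chosen purely for illustration; on file systems without xattr support the call fails, and the caller must fall back to another identity scheme such as hashing the first N bytes.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"

	"golang.org/x/sys/unix"
)

// fileID returns a per-file identity stored in a user xattr, creating one if
// it does not exist yet. The attribute name is an illustrative assumption.
func fileID(path string) (string, error) {
	buf := make([]byte, 64)
	n, err := unix.Getxattr(path, "user.log_agent_id", buf)
	if err == nil {
		return string(buf[:n]), nil
	}
	if err != unix.ENODATA {
		// xattr unsupported (or another failure): caller must fall back,
		// e.g. to a hash of the first N bytes of the file.
		return "", err
	}
	// No identity yet: generate a random one and attach it to the file.
	id := make([]byte, 16)
	if _, err := rand.Read(id); err != nil {
		return "", err
	}
	hexID := hex.EncodeToString(id)
	if err := unix.Setxattr(path, "user.log_agent_id", []byte(hexID), 0); err != nil {
		return "", err
	}
	return hexID, nil
}

func main() {
	id, err := fileID("/var/www/log/access.log") // illustrative path
	fmt.Println(id, err)
}
```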
The larger N is, the smaller the probability that two different files end up with the same identity and become indistinguishable. A general solution that identifies files with 100% accuracy is still an open question; let us assume 80% of the problem is solved here and collect logs with some peace of mind. Collecting a log is really just reading the file. Reads should be as sequential as possible to make full use of the Linux page cache; if necessary, posix_fadvise can be used after a log file has been collected to drop its page cache and proactively release system resources, as sketched below.
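For example, dropping the page cache after a file has been collected could look like this Go sketch using posix_fadvise with FADV_DONTNEED; the path is an assumption:

```go
package main

import (
	"os"

	"golang.org/x/sys/unix"
)

// After collecting a file, hint to the kernel that its cached pages will not
// be needed again so the memory can be reclaimed.
func main() {
	f, err := os.Open("/var/www/log/access.log-2018-01-10") // illustrative path
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// ... read the file sequentially here ...

	// posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED): offset 0, length 0 means
	// the whole file.
	if err := unix.Fadvise(int(f.Fd()), 0, 0, unix.FADV_DONTNEED); err != nil {
		panic(err)
	}
}
```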
So when is a file considered fully collected? When a read reaches the end of the file and returns EOF, the current pass is done. But after a while new content will be appended to the log file; how do we know there is new data and continue collecting?
How do I know the contents of the file have been updated?
Inotify can solve this: watch the file with Inotify, and every time new data is written an event fires, after which collection can continue. The problem with this scheme is that under heavy write traffic the event queue overflows; for example, N consecutive writes generate N events. In fact the agent only needs to know that the content has been updated; how many times it was updated does not matter, because each collection pass keeps reading the file until EOF. As long as the user keeps writing logs, collection keeps going.
In addition, there is an upper limit on the number of files Inotify can watch. So the simplest and most general scheme here is to poll the stat information of the files being collected, collect a file when it has been updated, and schedule the next poll after the current pass finishes. Simple and universal. With these techniques the agent can finally keep collecting logs without interruption. But logs are always deleted at some point; what if a file is deleted while we are collecting it?
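A minimal Go sketch of such a stat-polling loop, with an assumed path and a one-second interval chosen purely for illustration:

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// Poll the file's stat information and collect again whenever its size has
// grown past the recorded offset.
func main() {
	path := "/var/www/log/access.log" // illustrative path
	var offset int64

	for range time.Tick(time.Second) {
		fi, err := os.Stat(path)
		if err != nil {
			fmt.Fprintln(os.Stderr, "stat:", err)
			continue
		}
		if fi.Size() > offset {
			fmt.Printf("new data: collect from %d to %d\n", offset, fi.Size())
			// In a real agent the offset advances only after reading to EOF
			// and persisting the point file.
			offset = fi.Size()
		}
	}
}
```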
Rest assured: Linux keeps a reference count for each open file. Even if an open file is deleted, the reference count is only decremented by one; as long as some process still holds a reference, the content can still be read. So the agent can safely finish reading the log and only then release the file's fd, letting the system actually delete the file. But how do we know when collection is truly finished?
That sounds trivial: collection is finished when we reach the end of the file. But what if another process still has the file open and appends some content after we have read everything and already released the fd? The file is no longer visible on the file system, so file discovery cannot find it again, open it, and read the data. What then?
How do I release the file handle safely?
Fluentd's approach is to shift this responsibility to the user: the user configures a time window, and if no data is appended within it after the file is deleted, the fd is released. This is really a workaround. Setting the time too small increases the probability of losing data; setting it too large keeps the fd and disk space occupied longer than necessary, giving the false appearance of leaked space and wasting it for a while.
The essence of the problem is that we do not know who else is referencing this file. If someone is still referencing it, they may still write data to it, so it is better not to release the fd yet even though holding it costs resources; if no one is referencing it, the fd can be released immediately. How do we know who is referencing the file?
You must have used lsof to list the files opened by processes on the system. This tool scans the /proc/PID/fd/ directory of every process; each file descriptor there is a symlink, and readlink shows the file path it points to, as in the following example:
```
tianqian-zyf@ubuntu:~$ sudo ls -al /proc/22686/fd
total 0
dr-x------ 2 tianqian-zyf tianqian-zyf  0 May 27 12:25 .
dr-xr-xr-x 9 tianqian-zyf tianqian-zyf  0 May 27 12:25 ..
lrwx------ 1 tianqian-zyf tianqian-zyf 64 May 27 12:25 0 -> /dev/pts/19
lrwx------ 1 tianqian-zyf tianqian-zyf 64 May 27 12:25 1 -> /dev/pts/19
lrwx------ 1 tianqian-zyf tianqian-zyf 64 May 27 12:25 2 -> /dev/pts/19
lrwx------ 1 tianqian-zyf tianqian-zyf 64 May 27 12:25 4 -> /home/tianqian-zyf/.post.lua.swp
```
Process 22686 has this file open with fd 4, and the corresponding path is /home/tianqian-zyf/.post.lua.swp. In this way we can count the references to a file: if the reference count is 1, that is, only the current process references it, the fd can basically be released safely without losing data. The problem is that this is expensive: we have to traverse every process and walk its open-file table one by one, with complexity O(n). The problem is only solved perfectly if it can be done in O(1).
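To make the idea concrete, here is a Go sketch of this O(n) scan over /proc. It is the expensive approach described above, shown only for illustration, not a recommended production technique.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// refCount walks /proc/<pid>/fd of every process and counts how many open
// descriptors resolve to the target path. This is the lsof-style O(n) scan.
func refCount(target string) int {
	count := 0
	fdLinks, _ := filepath.Glob("/proc/[0-9]*/fd/*")
	for _, fdPath := range fdLinks {
		// Each entry is a symlink to the file the descriptor refers to.
		dest, err := os.Readlink(fdPath)
		if err != nil {
			continue // the process or fd may have gone away; ignore
		}
		if dest == target {
			count++
		}
	}
	return count
}

func main() {
	// Path taken from the example listing above.
	fmt.Println(refCount("/home/tianqian-zyf/.post.lua.swp"))
}
```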
After searching the relevant material, I found that this is almost impossible to do from user space; the Linux kernel does not expose a suitable API. It can only be solved in the kernel, for example by adding an API that returns a file's reference count given an fd. This is relatively easy to do inside the kernel: each process keeps its open files as struct file objects, through which the corresponding struct inode can be found, and the reference count is maintained inside that object. Hopefully a future Linux kernel will provide such an API and solve this problem completely.
This concludes our look at the technologies used in a log collection system. Hopefully it answers the common questions, and pairing the theory with some hands-on practice will make it stick.