Evolution from ELK to EFK


Background

Hujiang is the largest online education site in China. Users of the Hujiang log service currently include the log search and analysis services of multiple products from the online school, transaction, finance, CCTalk and other departments. More than a dozen kinds of logs are produced, and about 1 billion entries (1 TB) of logs are processed every day. Hot data is retained for the last 7 days, while cold data is kept permanently.

Why build a log system?

First of all, what is a log? A log is text data generated by a program that follows a certain format (usually carrying a timestamp).

Usually, logs are generated by servers and written to different files, such as system logs, application logs and security logs. These logs are stored separately on different machines.

Usually, when the system fails, engineers need to log in to each server and use Linux scripting tools such as grep / sed / awk to find the cause of the failure in the logs. In the absence of a log system, you first need to locate the server that handled the request. If that server deploys multiple instances, you then have to search the log directory of each application instance for the log files. Each application instance also has its own log rotation policy (for example, one file per day) as well as a log compression and archiving policy.

This whole process makes it troublesome to troubleshoot and locate the cause of a fault in time. If we can manage these logs centrally and provide centralized retrieval, we can not only improve diagnosis efficiency, but also gain a comprehensive view of the system and avoid passively fighting fires after the fact.

In my opinion, log data plays a very important role in the following aspects:

Data search: by retrieving log information and locating the corresponding bug, find a solution.

Service diagnosis: by collecting statistics on and analyzing log information, understand server load and service running status.

Data analysis: further data analysis can be done, such as finding the TOP 10 courses users are interested in based on the course id in requests.

To solve these problems and provide a distributed system for real-time log collection and analysis, we adopted a log data management solution commonly used in the industry, which mainly consists of three systems: Elasticsearch, Logstash and Kibana. The industry usually calls this solution ELK for short, after the initials of the three systems. After putting it into practice, we further optimized it into EFK, where F stands for Filebeat, introduced to solve the problems caused by Logstash. We will describe this in detail below.

The ELK stack versions involved in this article are:

Elasticsearch 5.2.2
Logstash 5.2.2
Kibana 5.2.2
Filebeat 5.2.2
Kafka 2.10

Logstash: a data collection and processing engine. It supports dynamically collecting data from a variety of data sources, filtering, parsing, enriching and normalizing it, and then storing it for later use.

Kibana: visualization platform. It can search and display indexed data stored in Elasticsearch. It can be easily used to display and analyze data with charts, tables and maps.

Elasticsearch: distributed search engine. It has the characteristics of high scalability, high reliability, easy management and so on. It can be used for full-text retrieval, structured retrieval and analysis, and can combine the three. Elasticsearch is developed based on Lucene and is now one of the most widely used open source search engines. Wikipedia, StackOverflow, Github and so on are all based on it to build their own search engines.

Filebeat: a lightweight data collection engine, built by reworking the original logstash-forwarder source code. In other words, Filebeat is the new logstash-forwarder, and it will be the first choice on the shipper side of the ELK Stack.

Since we want to talk about how ELK is applied in Hujiang's systems, we have to talk about ELK architecture. This sharing mainly lists the ELK architectures we have used and discusses the suitable scenarios, advantages and disadvantages of each, for your reference.

Simple version architecture

In this architecture, we connect the Logstash instance directly to the Elasticsearch instance. The Logstash instance reads data from the data source (such as Java logs, Nginx logs, etc.) through the Input plug-in, filters the logs through the Filter plug-in, and finally writes the data to the Elasticsearch instance through the Output plug-in.

At this stage, log collection, filtering and output are handled mainly by three core plug-in types: Input, Filter and Output; a minimal end-to-end example follows the plug-in descriptions below.

Input: input. Input data can be File, Stdin (input directly from the console), TCP, Syslog, Redis, Collectd, etc.

Filter: filters and transforms the log into the format we want. Logstash has rich filter plug-ins: Grok regular-expression capture, date processing, JSON codec, data modification with Mutate, and so on. Grok is the most important plug-in in Logstash, and it is strongly recommended that everyone use Grok Debugger to debug their own Grok expressions:

grok {
    match => ["message", "(?m)\[%{LOGLEVEL:level}\]\[%{TIMESTAMP_ISO8601:timestamp}\]\[%{DATA:logger}\]\[%{DATA:threadId}\]\[%{DATA:requestId}\]%{GREEDYDATA:msgRawData}"]
}

Output: output. The output targets can be Stdout (output directly from the console), Elasticsearch, Redis, TCP, File, etc.
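Putting the three stages together, a minimal illustrative Logstash pipeline (the log path, index name and Elasticsearch address below are placeholders, not our production settings) that tails an Nginx access log, parses it with a stock Grok pattern and writes it to Elasticsearch could look like this:

input {
    file {
        # example log path, adjust to your environment
        path => "/var/log/nginx/access.log"
        start_position => "beginning"
    }
}
filter {
    grok {
        # COMBINEDAPACHELOG is a built-in pattern matching the default Nginx/Apache access log format
        match => { "message" => "%{COMBINEDAPACHELOG}" }
    }
}
output {
    elasticsearch {
        hosts => ["localhost:9200"]
        index => "nginx-access-%{+YYYY.MM.dd}"
    }
}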

This is the simplest form of ELK architecture, where the Logstash instance connects directly to the Elasticsearch instance. Its advantage is that it is easy to build and easy to use. It is recommended for beginners to learn from and reference, but it should not be used in a production environment.

Cluster architecture

Under this architecture, we use multiple Elasticsearch nodes to form an Elasticsearch cluster. Because Logstash and Elasticsearch run in cluster mode, this avoids putting excessive pressure on a single instance. At the same time, a Logstash Agent is deployed on each production server, which is enough for scenarios with a small data volume and low reliability requirements.

Data collection: a Logstash Shipper Agent is deployed on each server to collect the logs on that server. The logs pass through the Shipper's Input, Filter and Output plug-ins and are transferred to the Elasticsearch cluster.

Data storage and search: the default Elasticsearch configuration is generally sufficient; we decide whether to add a replica according to the importance of the data, and add at most one replica when necessary (see the sketch after this list).

Data display: Kibana can make various charts based on Elasticsearch data to visually show the real-time status of the business.
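As a rough sketch of how a single replica can be applied to log indices (the template name and index pattern below are placeholders, not our real settings), an Elasticsearch 5.x index template can set the replica count once for all matching indices; the request is shown in console style, e.g. for Kibana Dev Tools:

PUT _template/log_replicas
{
  "template": "logstash-*",
  "settings": {
    "index.number_of_replicas": 1
  }
}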

The usage of this architecture is very limited, and there are two main problems.

Consumption of server resources: Logstash collection and filtering are done on the application server, which results in high system resource usage, poor performance, and difficulty in debugging, tracing and exception handling.

Data loss: under high concurrency, because log transmission peaks are high and there is no message queue to buffer them, the Elasticsearch cluster loses data.

This architecture is somewhat more complex than the previous one, but it is still easy to maintain and can meet the needs of businesses with small data volumes and low reliability requirements.

Introducing a message queue

In this scenario, data is first collected by the Logstash Shipper Agents and then delivered to the Kafka cluster through the Output plug-in. When the rate at which Logstash receives data exceeds the processing capacity of the Elasticsearch cluster, the queue shaves the peaks and fills the valleys, so the Elasticsearch cluster no longer loses data.
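As a rough sketch of this wiring (broker addresses, topic name, Elasticsearch hosts and index name below are placeholders, not our production values), the shipper swaps its Elasticsearch output for a Kafka output, and a separate Logstash Indexer consumes from Kafka and writes to Elasticsearch, assuming the Kafka input/output plug-ins bundled with Logstash 5.x:

# shipper: send collected events to Kafka
output {
    kafka {
        bootstrap_servers => "kafka1:9092,kafka2:9092"
        topic_id => "app-log"
        codec => json
    }
}

# indexer: consume from Kafka and write to Elasticsearch
input {
    kafka {
        bootstrap_servers => "kafka1:9092,kafka2:9092"
        topics => ["app-log"]
        group_id => "logstash-indexer"
        codec => json
    }
}
output {
    elasticsearch {
        hosts => ["es1:9200", "es2:9200"]
        index => "app-log-%{+YYYY.MM.dd}"
    }
}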

At present, in log service scenarios, the two message queues most commonly compared in the industry are Kafka and Redis. Although the ELK Stack official site recommends using Redis as the message queue, we recommend Kafka, mainly for the following two reasons:

Data loss: Redis queues are mostly used for real-time message push and do not guarantee reliability, while Kafka guarantees reliability at the cost of some latency.

Data accumulation: the capacity of a Redis queue depends on the machine's memory, and once the configured maxmemory is exceeded, data is discarded; Kafka's buffering capacity depends on the machine's disk size.

Based on the above reasons, we decided to use Kafka as the buffering queue. However, there are still some problems with this architecture:

The Logstash Shipper still consumes CPU and memory while collecting data, and multi-data-center deployment is not supported.

This architecture is suitable for the deployment of applications with large clusters, and solves the problems of message loss and network congestion through message queuing.

Multi-data-center deployment

With the rapid growth of Hujiang's business, a single-data-center architecture could no longer meet the demand. Inevitably, Hujiang's business had to be distributed across different data centers, which was also a big challenge for the log service. Of course, there are many mature approaches in the industry, such as Alibaba's unitization and Tencent's SET scheme. We will not go into unitization in detail here; you can refer to Weibo's [unitized architecture].

In the end, we decided to adopt unitized deployment to solve the problems encountered with ELK across multiple data centers (latency, heavy dedicated-line traffic, etc.). The generation, collection, transmission, storage and display of logs are all handled in a closed loop within the same data center, so there is no cross-data-center transmission or calling. Because closely interacting applications are deployed in the same data center as far as possible, this solution does not cause trouble for business queries.

The Logstash, Elasticsearch, Kafka and Kibana clusters are all deployed in the same data center, and each data center needs its own log service cluster. For example, logs from data center A's business can only be transferred to data center A's Kafka, consumed by data center A's Indexer cluster and written to data center A's Elasticsearch cluster, and displayed by data center A's Kibana cluster; no step in the middle depends on any service in data center B.

Introducing Filebeat

When the log volume is large, Logstash runs into the problem of high resource consumption. To solve this problem, we introduced Filebeat. Filebeat is a rework of the original logstash-forwarder source code, written in Golang; it does not depend on a Java environment, is efficient, uses little memory and CPU, has an installation package of less than 10 MB, and is very well suited to running on servers as an agent.

Let's look at the basic usage of Filebeat by writing a configuration file that collects and parses log data from the Nginx access.log:

# filebeat.yml
filebeat.prospectors:
- input_type: log
  paths:
    - /var/log/nginx/access.log
  json.message_key:
output.elasticsearch:
  hosts: ["localhost"]
  index: "filebeat-nginx-%{+yyyy.MM.dd}"
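In the Kafka-buffered architecture described earlier, Filebeat does not have to write to Elasticsearch directly. A rough sketch of pointing it at Kafka instead (broker addresses and topic name are placeholders, not our production values), using the output.kafka option available in Filebeat 5.x:

output.kafka:
  # placeholder broker list and topic
  hosts: ["kafka1:9092", "kafka2:9092"]
  topic: "nginx-access-log"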

Let's take a look at the pressure test data.

Pressure test environment:

Virtual machine: 8 cores, 64 GB memory, 540 GB SATA disk
Logstash version: 2.3.1
Filebeat version: 5.5.0

Pressure test scheme:

Logstash / Filebeat reads 3.5 million (350W) lines of log data and writes them to the console; each line is 580 bytes; eight processes write to the collected file.

Pressure test results:

Item        Workers    CPU (usr)    Total time    Collection speed
Logstash    8          53.7%        210 s         1.6w (16,000) lines/s
Filebeat    8          38.0%        30 s          11w (110,000) lines/s

Filebeat consumed only about 70% of the CPU that Logstash did, yet its collection speed was 7 times that of Logstash. From our practical experience, Filebeat does indeed solve Logstash's resource consumption problem, at low cost and with stable service quality.

Finally, I would like to share some hard-won lessons; I hope you can avoid stepping into the same pits.

1. The Indexer hangs after running for a period of time

One day, monitoring suddenly showed that logs were no longer being consumed; it turned out that the Indexer process that consumes data from Kafka had died. Therefore, the Indexer process also needs to be monitored with supervisor to make sure it is always running.
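A minimal supervisord program entry for this could look like the following (the binary and config paths are illustrative, not our actual deployment):

; /etc/supervisor/conf.d/logstash-indexer.conf
[program:logstash-indexer]
; illustrative paths to the Logstash binary and the indexer pipeline config
command=/opt/logstash/bin/logstash -f /etc/logstash/indexer.conf
autostart=true
autorestart=true
startretries=3
stderr_logfile=/var/log/supervisor/logstash-indexer.err.log
stdout_logfile=/var/log/supervisor/logstash-indexer.out.log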

2. Java exception log output

At first, when we split logs with grok, we found that Java exception output spans multiple lines, which broke the line-by-line parsing. We later solved this with the Logstash multiline codec plug-in:

input {
    stdin {
        codec => multiline {
            pattern => "^\["
            negate => true
            what => "previous"
        }
    }
}

3. Log timestamps are off by 8 hours due to the time zone

The Logstash 2.3 date plug-in was configured as follows; checking the parsed results, we found that @timestamp was 8 hours earlier than China time.

The solution: Kibana reads the browser's current time zone and then converts the displayed time on the page accordingly.

date {
    match => ["log_timestamp", "YYYY-MM-dd HH:mm:ss.SSS"]
    target => "@timestamp"
}

4. Grok parse failure

We once found that the logs of an online node suddenly could not be searched for several days. Later, pulling the raw logs and comparing them, we found that the generated log format was wrong: it contained both JSON-format and non-JSON-format lines, while our grok parsing assumed JSON format. It is recommended to keep the output log format consistent and avoid abnormal characters such as stray spaces. You can use the online Grok debugger (http://grokdebug.herokuapp.com/) to debug your rules.
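If mixed formats cannot be completely avoided at the source, one defensive option (a sketch only, not our production filter; the fallback field name is made up for illustration) is to branch in the Logstash filter on whether a line looks like JSON:

filter {
    if [message] =~ /^\s*\{/ {
        # lines that look like JSON objects are decoded as JSON
        json {
            source => "message"
        }
    } else {
        # everything else falls back to a catch-all grok pattern
        grok {
            match => { "message" => "%{GREEDYDATA:raw_message}" }
        }
    }
}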

Summary

The advantages of an ELK-stack-based logging solution are mainly reflected in:

Scalability: a highly scalable distributed system architecture that supports TB-level new data every day.

Ease of use: a variety of statistical analysis functions are available through a graphical user interface; it is easy to use and quick to pick up.

Fast response: from log generation to query visibility, data collection, processing, search and statistics can be completed within seconds.

Attractive interface: in Kibana, search and aggregation can be done with a few mouse clicks, generating impressive dashboards.

Reference material:
https://www.elastic.co/guide/en/beats/filebeat/1.3/filebeat-overview.html
https://zhuanlan.zhihu.com/p/26399963
