
How to use ELK to build a TB-level log monitoring system


This article describes how we used ELK to build a TB-level log monitoring system, and should be a useful reference for anyone facing similar log management problems.

In an enterprise microservice environment, running hundreds of services is considered a relatively small scale. In production, logs play a very important role: troubleshooting anomalies requires logs, performance optimization requires logs, investigating business issues requires logs, and so on.

However, with hundreds of services running in production, each writing its logs only to local disk, it is hard to even find the node holding the logs you need when a problem has to be investigated, and it is just as hard to mine any business value from those scattered logs.

Shipping all logs to one place for centralized management, processing them there, and turning the results into data that operations and development can actually use is a feasible way to solve log management and assist operations, and it is also something the business urgently needs.

Our solution

Based on these requirements, we built a log monitoring system that does two things:

Unified log collection, filtering, and cleaning.

Visualization, monitoring, alerting, and log search.

The functional flow, in outline:

Instrument each service node and collect the relevant logs in real time.

A unified log collection service filters and cleans the logs, then drives the visualization and alerting features.

Our architecture

① Log file collection. We use FileBeat, with its configuration managed by operations through our backend interface. Each machine runs one FileBeat, and the mapping from a FileBeat's logs to Kafka Topics can be one-to-one or many-to-one; the policy is chosen according to the daily log volume.

In addition to business service logs, we also collect MySQL slow-query and error logs, as well as logs from other third-party services such as Nginx.

Finally, integrated with our automated release platform, every FileBeat process is deployed and started automatically.

② For call stacks, traces, and process-level monitoring metrics we take the agent approach with Elastic APM, so nothing has to change in the business applications.

For business systems already in production, changing code just to add monitoring is neither advisable nor acceptable.

Elastic APM helps us collect the call chain of each HTTP interface, the internal method call stack, the SQL executed, and process CPU and memory usage metrics, among other things.

Some may wonder: with Elastic APM in place, most other logs no longer seem necessary, so why is FileBeat still needed?

It is true that the information Elastic APM collects lets us locate more than 80% of problems, but it does not support every language, C for example.

Second, it cannot collect the non-Error logs and the so-called key logs you actually want; for example, when an API call fails you want to see the logs printed shortly before and after the error, as well as logs written specifically to make business analysis easier.

Third, custom business exceptions are not system exceptions; they belong to the business domain, yet APM reports them as if they were system exceptions.

If you later alert on system exceptions, these will undermine alert accuracy, and you cannot simply filter business exceptions out, because there are many kinds of custom business exceptions.

③ At the same time, we did secondary development on the agent to collect more detailed GC, stack, memory, and thread information.
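
For a rough sense of the kind of data such an agent extension can pull, the sketch below reads GC, heap, and thread metrics through the standard java.lang.management API. It is only a minimal stand-in for illustration, not our actual agent code, and the class name is made up.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.lang.management.ThreadMXBean;

// Minimal sketch: reading GC / heap / thread metrics via the standard
// java.lang.management API. The real agent extension is assumed to collect
// richer data and ship it into the log pipeline.
public class JvmMetricsSampler {

    public static void main(String[] args) {
        // GC counts and accumulated collection time per collector
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("gc=%s count=%d timeMs=%d%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }

        // Heap usage
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        System.out.printf("heap usedMB=%d maxMB=%d%n",
                heap.getUsed() / (1024 * 1024), heap.getMax() / (1024 * 1024));

        // Thread counts
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        System.out.printf("threads live=%d peak=%d daemon=%d%n",
                threads.getThreadCount(), threads.getPeakThreadCount(),
                threads.getDaemonThreadCount());
    }
}
```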

④ Server metrics are collected with Prometheus.

⑤ Because we run a SaaS product with many services, many service logs cannot be unified or standardized, partly for historical reasons. A system that has nothing to do with the business cannot ask the existing business systems it connects to, directly or indirectly, to change their code to accommodate it; that simply cannot be pushed through.

A good design makes itself compatible with others rather than forcing others to adapt. Many logs are meaningless anyway, for example marker logs printed inside if/else branches during development just to show whether the if block or the else block was taken.

Some services even print Debug-level logs. With limited cost and resources, collecting every log is unrealistic; even if resources permitted, it would add up to a large expense over a year.

Therefore, we rely on filtering, cleaning, dynamically adjusting log collection priority, and similar measures. First, all logs are collected into a Kafka cluster with a short retention period.

We currently set retention to one hour; our resources can absorb one hour of data for the time being.
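
For reference, a raw-log topic with a one-hour retention could be created along these lines with the Kafka AdminClient; the topic name, partition count, replication factor, and broker address below are placeholders rather than our real settings.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.ExecutionException;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

// Sketch: create a raw-log topic whose data expires after one hour,
// matching the short retention window described in the text.
public class CreateRawLogTopic {

    public static void main(String[] args) throws ExecutionException, InterruptedException {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-1:9092"); // placeholder broker

        try (AdminClient admin = AdminClient.create(props)) {
            NewTopic rawLogs = new NewTopic("raw-logs", 12, (short) 2) // name/partitions/replication are illustrative
                    .configs(Map.of(TopicConfig.RETENTION_MS_CONFIG, "3600000")); // 1 hour
            admin.createTopics(Collections.singletonList(rawLogs)).all().get();
        }
    }
}
```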

⑥ Log Streams is our stream processing service for log filtering and cleaning. Why is an ETL filter needed at all?

Because the log service's resources are limited. But hold on: the original logs scattered across the local storage of each service also consume resources.

Now we are merely pooling them; once the logs are collected centrally, each service can free part of the resources its logs used to occupy.

In theory, the resources the individual services spent on logs are simply transferred to the log service, with no net increase.

But that is only theory. For online services, scaling resources up is easy; scaling them back down is not, and in practice it is extremely hard to pull off.

So the log resources tied up on each service cannot be handed over to the log service in the short term, which means the log service needs roughly as many resources as all the service logs currently consume combined.

And the longer logs are retained, the more resources they consume. If the cost of a solution outweighs the benefit of solving the problem at hand, then with limited funds no leader or company will be willing to adopt it.

Therefore, to control cost, we put a filter into the Log Streams service to discard low-value log data, which lowers the resource cost of the log service.

We use Kafka Streams for the ETL stream processing, and the dynamic filtering and cleaning rules are configured through an administrative interface.
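
Below is a minimal sketch of what such a Kafka Streams filter can look like. The topic names (raw-logs, clean-logs), the broker address, and the hard-coded level/keyword checks are illustrative assumptions; in our system the rules are loaded dynamically from the configuration interface rather than written into code like this.

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

// Minimal ETL sketch: drop DEBUG noise, always keep ERROR, and keep other
// lines only when they contain a configured key-log marker. Topic names and
// the rule itself are illustrative placeholders.
public class LogEtlFilter {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "log-streams-etl");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-1:9092"); // placeholder broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> rawLogs = builder.stream("raw-logs");

        rawLogs.filter((serviceKey, line) -> {
                    if (line == null || line.contains("DEBUG")) {
                        return false;                 // never keep debug noise
                    }
                    if (line.contains("ERROR")) {
                        return true;                  // errors are always collected
                    }
                    return line.contains("KEY_LOG");  // placeholder for configured key-log markers
                })
               .to("clean-logs");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```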

The general rules are as follows:

Log collection is configured through the interface; Error-level logs are fully collected by default.

Centered on the time an error occurs, a window is opened in the stream processing and non-Error logs within a configurable range of N time points before and after it are collected; only Info level is selected by default.

Each service can configure up to 100 key logs, which are collected by default.

For slow SQL, different latency thresholds can be configured per business category.

Business SQL is aggregated in real time according to business needs, for example counting the query frequency of similar business SQL within one hour during peak periods. This gives the DBA a basis for optimizing the database, such as adding an index for a frequently run query.

During peak hours, logs are dynamically cleaned and filtered according to a weight for the business type, the log level, each service's log cap within a time window, the time period, and so on.

The time window is shrunk dynamically for different time periods.

Index generation rules: indexes are generated according to the log files each service produces. For example, if a service's logs are split into debug, info, error, and xx_keyword files, the generated indexes also carry the debug, info, error, or xx_keyword suffix plus the date. The point is to let developers search logs the way they are already used to reading them.
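
As a small illustration of this naming rule, the helper below assembles an index name from the service name, log-file suffix, and date; the separator, date pattern, and the example service name are assumptions, not necessarily our exact convention.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Sketch of the index naming rule: <service>-<suffix>-<date>, where the
// suffix mirrors the service's log file (debug / info / error / xx_keyword).
public final class LogIndexNames {

    private static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("yyyy.MM.dd");

    public static String indexFor(String service, String fileSuffix, LocalDate day) {
        return service + "-" + fileSuffix + "-" + DAY.format(day);
    }

    public static void main(String[] args) {
        // A hypothetical service whose logs are split into four files
        for (String suffix : new String[] {"debug", "info", "error", "xx_keyword"}) {
            System.out.println(indexFor("order-service", suffix, LocalDate.now()));
        }
    }
}
```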

⑦ For the visual interface we mainly use Grafana, which supports many data sources, including Prometheus and Elasticsearch, and integrates with Prometheus practically seamlessly. Kibana, on the other hand, is used mainly for visual analysis of APM data.

Log visualization

Our log visualization is shown below:

Thank you for reading. I hope this article on how to use ELK to build a TB-level log monitoring system has been helpful.
