
How to Practice Log Storage and Analysis Based on Real-Time ETL

Updated 2025-04-02. From: SLTechnology News&Howtos, Shulou (Shulou.com), 05/31

This article explains how to practice log storage and analysis based on real-time ETL.

Fish and bear's paw: the dilemma in big-data logging

We live in an era of big data and diverse, often unstructured, data, in which real-time machine data is generated at a rapid pace. A core capability of any data-driven company is making full use of its large volume of log data.

Against this background, the collection, storage, analysis and management of logs face greater challenges, including a choice between the proverbial fish and bear's paw (two things that are both wanted but hard to have at once):

Fish (cost): high storage cost may force data to be deleted, so its value is never discovered. As data volume grows rapidly, customers need to retain logs for longer and hope to cut storage costs by half or more in the relevant scenarios.

Bear's paw (experience): the share of real-time data in machine data keeps growing. As real-time value receives more and more attention, customers want to keep an interactive, one-stop experience.

How can you have both the fish and the bear's paw? This article discusses how to balance cost and experience.

SLS is a one-stop service for machine data. It provides fast data collection, consumption, delivery, query and analysis to improve the efficiency of both O&M and business operations.

While serving a large number of customers, we observed that in many scenarios, as log volume keeps growing, different parts of the data are accessed with very different frequency. For example:

Machine metrics are updated constantly, but on a monitoring dashboard new data is accessed far more often than data from a day ago.

When troubleshooting anomalies, developers watch for changes in ERROR/WARN logs with tail and grep, and usually do not need program logs from several days ago to locate the problem.

The importance of data depends on its business attributes. A large amount of non-production log data is unlikely to be accessed after 7 days, while recent production logs need to be flexibly accessible.

The following sections introduce a storage strategy and practice on SLS that balance flexibility and economy for log data.

A business-tiered data architecture based on data processing and delivery

Take SLB access log processing as an example. Data from multiple instances in a region is usually stored in one full Logstore (with a delay of 10 seconds). A data processing job is configured on that Logstore to preprocess the data and route it according to service labels.

Requests with errors or high latency need real-time query and fast statistics, so they can be routed to a Logstore with the SLS index enabled.

All other request logs from production domains need to be stored for a long time for audit and compliance. They can be written to a temporary Logstore, which acts as a bridge, and then delivered to more economical storage.

O&M data pipelines are often complex. The serverless processing and delivery services provided by SLS work out of the box, which makes the solution above easier to implement and gives it a cost advantage.

Preprocessing with data processing

In the access logs of SLB layer-7 listeners, the URI field contains high-value business key-value pairs, and the UserAgent field helps monitor service quality and stability on each client type.

Some fields of the original log before processing:

request_uri: /api/get.convert.v2?fn=callback&url=https%3A%2F%2Fmini.yyrtv.com%2Fr%2F80ba436b763b747d.html%3Ffrom%3D320101%26site%3D1
http_user_agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3947.100 Safari/537.36

The processing DSL handles log cleanup and extraction:

Business key-value pairs in the URI are extracted with e_kv.

The http_user_agent string is parsed automatically with ua_parse_all.

e_kv("request_uri", prefix="uri.")
e_set("ua", ua_parse_all(v("http_user_agent")))
e_json("ua", depth=1, fmt="root")
e_drop_fields("ua", "__tag__:__receive_time__")

Extraction from the URI yields two business key-value pairs, fn and url. The e_json function then expands the result of ua_parse_all into structured device, OS and UA information.
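To make the key-value step concrete, here is a minimal Python sketch of the kind of extraction e_kv performs on request_uri. It is not the SLS implementation: the helper name extract_uri_kv is hypothetical, and the sketch assumes that pairs are separated by & and =, and that values stay URL-encoded, as the url value does in this article's example.

```python
from typing import Dict
from urllib.parse import urlsplit


def extract_uri_kv(request_uri: str, prefix: str = "uri.") -> Dict[str, str]:
    """Return the query-string pairs of request_uri as prefixed fields.

    Hypothetical helper: it only mimics the effect described for e_kv
    in this article and leaves values URL-encoded.
    """
    query = urlsplit(request_uri).query
    fields: Dict[str, str] = {}
    for pair in query.split("&"):
        key, sep, value = pair.partition("=")
        if sep and key:  # skip fragments that are not key=value
            fields[prefix + key] = value
    return fields


event = {
    "request_uri": (
        "/api/get.convert.v2?fn=callback"
        "&url=https%3A%2F%2Fmini.yyrtv.com%2Fr%2F80ba436b763b747d.html"
        "%3Ffrom%3D320101%26site%3D1"
    )
}
# Add the extracted fields to the event, next to the original field.
event.update(extract_uri_kv(event["request_uri"]))
```

Because the nested url value is percent-encoded, its own from and site parameters are not split out; only the top-level fn and url keys become fields.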

The result fields after processing are as follows:

uri.fn: callback
uri.url: https%3A%2F%2Fmini.yyrtv.com%2Fr%2F80ba436b763b747d.html%3Ffrom%3D320101%26site%3D1
ua.device: {"family": "Other"}
ua.os: {"family": "Windows", "major": "7"}
ua.user_agent: {"family": "Chrome", "major": "69", "minor": "0", "patch": "3947"}

Data routing with data processing

Data processing provides operators for quickly collecting data from multiple sources and distributing data from one source to multiple targets. It supports batched delivery (which increases throughput and favors compressed storage) and automatically retries failed writes.
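The batching and retry behavior is handled by the service itself, but the idea can be illustrated with a short Python sketch. Everything here is an assumption made for illustration: the BatchingWriter class, its parameters, and the flaky_send function are hypothetical and do not correspond to any SLS API.

```python
import time
from typing import Callable, Dict, List

Event = Dict[str, str]


class BatchingWriter:
    """Buffers events and sends them in batches, retrying failed sends."""

    def __init__(self, send: Callable[[List[Event]], None],
                 max_batch: int = 100, max_retries: int = 3,
                 backoff_seconds: float = 0.5) -> None:
        self._send = send
        self._max_batch = max_batch
        self._max_retries = max_retries
        self._backoff_seconds = backoff_seconds
        self._buffer: List[Event] = []

    def write(self, event: Event) -> None:
        self._buffer.append(event)
        if len(self._buffer) >= self._max_batch:
            self.flush()

    def flush(self) -> None:
        if not self._buffer:
            return
        batch, self._buffer = self._buffer, []
        for attempt in range(1, self._max_retries + 1):
            try:
                self._send(batch)
                return
            except OSError:
                if attempt == self._max_retries:
                    # Give up: restore the batch so the caller can decide.
                    self._buffer = batch + self._buffer
                    raise
                time.sleep(self._backoff_seconds * attempt)


delivered: List[List[Event]] = []
failures_left = [1]  # hypothetical target that fails exactly once


def flaky_send(batch: List[Event]) -> None:
    if failures_left[0] > 0:
        failures_left[0] -= 1
        raise OSError("temporary write failure")
    delivered.append(list(batch))


writer = BatchingWriter(flaky_send, max_batch=2, backoff_seconds=0.0)
for i in range(5):
    writer.write({"seq": str(i)})
writer.flush()  # send the remaining partial batch
```

Larger batches mean fewer write calls per event and give a compressor more repeated content to work with, which is the throughput and compression benefit the paragraph above refers to.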

The following processing DSL implements these rules:

If the RS (back-end server) processing delay has a value and is at least 5.0 s, or the status code is not 200, the log is written to the target debug.

All access logs generated by online domain names that match the regular expression are written to the target product-host.

e_if(
    op_or(
        op_and(
            op_ne(v("upstream_response_time"), "-"),
            op_ge(ct_float(v("upstream_response_time")), 5.0)
        ),
        op_ne(v("status"), "200")
    ),
    e_coutput(name="debug")
)
e_if(e_search('host ~= ".*-prod\.com"'), e_output(name="product-host"))
e_drop()

The source Logstore has no index enabled, and its retention period is shortened to 1 day. Save the two DSL segments above in one processing job; after real-time processing, the data flows to two downstream Logstores:

debug: set the retention period to 30 days and enable the index.

product-host: set the retention period to 1 day and enable OSS delivery.
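The routing rules can be checked offline with a small Python simulation. This is a sketch and not SLS code: the route function is hypothetical, and it assumes that the ~= operator in e_search performs a partial regular-expression match, which is modeled here with re.search.

```python
import re
from typing import Dict, List

# Assumption: mirrors the host pattern used in the processing DSL.
PROD_HOST = re.compile(r".*-prod\.com")


def route(event: Dict[str, str]) -> List[str]:
    """Return the names of the targets that would receive this event."""
    targets: List[str] = []

    rt = event.get("upstream_response_time", "-")
    slow = rt != "-" and float(rt) >= 5.0
    if slow or event.get("status") != "200":
        # Like e_coutput: write a copy, then continue with later rules.
        targets.append("debug")

    if PROD_HOST.search(event.get("host", "")):
        # Like e_output: write the event and stop processing it.
        targets.append("product-host")

    # An empty list corresponds to the final e_drop().
    return targets


slow_prod = {"upstream_response_time": "6.2", "status": "200",
             "host": "shop-prod.com"}
failed_dev = {"upstream_response_time": "-", "status": "502",
              "host": "dev.example.com"}
normal_dev = {"upstream_response_time": "0.03", "status": "200",
              "host": "dev.example.com"}
```

With these hypothetical events, route(slow_prod) returns both targets, route(failed_dev) returns only debug, and route(normal_dev) returns an empty list, meaning the event is dropped.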

Example: calculating the geographical distribution of request source IPs for which the back-end server's processing delay exceeds 60 seconds.

When SLS data is delivered to the OSS data lake, there are two common scenarios:

1. Very low cost storage

When configuring delivery, enable compression to reduce object file size (logs typically compress by a factor of 5 to 15). For long-term cold storage, you can even choose an OSS bucket of the archive or infrequent-access storage type.

2. Data lake storage, with medium- and low-frequency analysis in mind

SLS delivery to OSS supports the JSON, CSV and Parquet formats, and files can be built from a custom list of keys.

Choosing a file format that suits the computing engine (Spark, DLA and so on) balances computational efficiency against cost.

For example, OSS Select can run simple queries against specified object files, and many OSS-based practices that separate storage from compute can be accelerated with Select.
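The two delivery choices discussed above, a custom key list and compression, can be illustrated with a short Python sketch that builds one compressed CSV object in memory. It does not call SLS or OSS: the build_object helper, the sample events, and the choice of gzip are assumptions made for illustration.

```python
import csv
import gzip
import io
from typing import Dict, Iterable, List


def build_object(events: Iterable[Dict[str, str]], keys: List[str]) -> bytes:
    """Serialize events to CSV using only the chosen keys, then gzip it."""
    text = io.StringIO()
    writer = csv.DictWriter(text, fieldnames=keys,
                            extrasaction="ignore",  # drop unlisted fields
                            restval="")             # blank for missing keys
    writer.writeheader()
    writer.writerows(events)
    return gzip.compress(text.getvalue().encode("utf-8"))


# Hypothetical sample data; real logs vary more and compress less well.
events = [
    {"host": "shop-prod.com", "status": "200",
     "request_uri": "/api/get.convert.v2?fn=callback",
     "upstream_response_time": "0.012"}
    for _ in range(1000)
]
keys = ["host", "status", "request_uri"]

blob = build_object(events, keys)
raw_size = len(gzip.decompress(blob))
ratio = raw_size / len(blob)  # compression ratio for this sample only
```

Restricting the key list keeps fields such as upstream_response_time out of the object, so both the raw and the compressed sizes shrink before any storage-type choice is made.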

Tiered data flows can be built by connecting multiple storage entities.

On the road to log data integration, value release and efficient utilization, SLS data processing and delivery continue to provide pipeline services for an increasingly diverse range of scenarios.
