
Mining the Infinite Value of Traditional Industry Log Big Data (Part 1)

2025-02-21 Update From: SLTechnology News&Howtos


At 8:00 p.m. on August 27, Cheng Xuesong, senior solution architect at Qiniuyun, gave a live talk entitled "Mining the Infinite Value of Traditional Industry Log Big Data". He analyzed the common difficulties of operations and maintenance in traditional industries, explained why unified log management is necessary, and used real Pandora user cases to show how to mine the value of log data in traditional industries.

This article is a write-up of that talk and is divided into two parts. This first part mainly covers the common difficulties of operations and maintenance in traditional industries, the necessity of unified log management, and several typical log analysis scenarios.

What is operations and maintenance?

First of all, let's talk about what operations and maintenance is.

Many people have their own understanding of operations and maintenance, and many regard it as something simple: an enterprise buys IT products (hardware, software, and so on) and needs a team to keep them running normally. Problems inevitably arise during operation, so a dedicated team is needed to guarantee service. But if operations is understood as nothing more than keeping a platform running, I think that understanding is rather superficial. So what is it really? Descriptions found online divide the work into website operations, system operations, network operations, database operations, IT operations, operations development, and operations security. Seen through this division of labor, operations and maintenance is a complex and systematic engineering discipline.

The value of operations and maintenance

Operations engineers need to know exactly where the system's bottlenecks are and what its real capacity is, and they need to know how to add capacity quickly before a bottleneck is reached.

Knowing the system's risk points, they can coordinate the modules upstream and downstream of each risk point and design a redundancy strategy, which is more effective than focusing only on the stability of a single module.

Through long experience in this work they accumulate architecture design expertise and can guide and review new architecture designs.

Looking across the company's different businesses, they can abstract the common modules and manage them uniformly, forming internal capability platforms and infrastructure platforms, including shared microservices, together with automated management practices.

The current situation of operations staff and the challenges they face

From its value, we can see that operations and maintenance is a complex and systematic undertaking. Operations engineers handle a large amount of work every day, so helping them do that daily work well is very important. Yet they now run into many problems in daily operations, mainly because the IT environment is becoming more and more complex. IT systems are not built overnight: at different stages a company builds different business systems and application platforms and purchases different hardware. Because these procurement cycles overlap and stack on one another, the result is many types of network equipment, a large number of server models, multiple virtualization schemes, different operating systems, and a wide variety of applications and databases.

The choice of database is often made by the application's developers. Developers who are more familiar with MySQL tend to use MySQL as the backing database, while those who have long used Oracle tend to use Oracle. Different business software and business systems have different architectures and different underlying platforms, and each platform brings its own monitoring system and related tools. The enterprise IT environment therefore becomes very complex, which leads to many problems:

Too many monitoring tools: the monitoring software is numerous and varied, so it cannot be managed in a unified way.

Disorganized monitoring and alerting: each monitoring approach has its own gaps, so problems are not detected in time when they occur.

Long troubleshooting time: the systems are complex and the troubleshooting path is long, so the cause of a problem cannot be located quickly and accurately.

Weak global view: without comprehensive control over the overall situation, problems cannot be predicted effectively.

Security challenges: security problems such as intrusions and illegal operations cannot be detected efficiently.

Heavy management burden: faced with many heterogeneous monitoring tools, administrators carry a great mental load.

Operations management through logs

At present, a large number of operations teams manage their work through logs. Why?

A log system records, as text, every piece of status information produced while a system runs. This information can be understood as a record and projection, in the virtual world, of the behavior of devices and people. It helps us observe the normal state of the system and quickly locate the fault when something goes wrong.

There are many types of logs, including system logs, application logs, security logs, database logs, and so on. Each entry records a timestamp, the name of the device involved, the user, a description of the operation, and similar details. Through logs, operations staff and developers can learn about a server's software and hardware, check for errors in the configuration process, and find their causes. Regular log analysis reveals server load, performance, and security status, so that problems can be analyzed promptly and traced to their root cause.

Below are a few examples. You have probably encountered some of these monitoring or security logs in your daily work.

Daily log analysis mainly falls into the following scenarios:

Centralized monitoring of the server room

The first scenario is centralized monitoring of the server room. Many server rooms now contain servers and network equipment from a large number of different brands. Large enterprises in particular are often reluctant to buy from a single brand, in order to avoid dependence on one vendor, so the room ends up with multiple brands and even heterogeneous equipment, and the operations staff need one management and control platform for all of it. Logs from switches, servers, and other hardware, from the software involved, from security systems, from business systems, and from user access behavior are all collected in one place to monitor the daily running status of the whole room.

The example diagram above is the large-screen dashboard we built for a customer, which shows the operating status of the whole server room. Operations staff can see the overall daily status of the room at a glance. Below is the architecture we designed: we collect monitoring metrics for the relevant hardware and software from the switches and servers, feed them into the log management system for unified storage, analysis, monitoring, and alerting, and finally present them on the large screen. This is the most classic way that operations engineers use logs.

Application quality management

The second scenario is application quality management, that is, APM. Every business system produces logs as it runs, so by collecting those logs we can quickly analyze the quality of service that the application delivers to end users.

For example, suppose the enterprise has an OA system. People use it to look up the organizational structure, to route electronic workflows, and to submit business requests for approval, all of which generates a large volume of logs. By analyzing these logs we can see the average response time of the service or how often people use the platform, and so manage and track the quality of the application comprehensively. Suppose everyone starts complaining that the OA system is slow to open or that data queries are slow. Where is the problem? With the application quality management module we can find the point of failure and then optimize the application to give end users a better experience. Application quality management is used daily not only by Internet companies but also by many traditional enterprises.

Unified log management platform

The third scenario is the unified log management platform, which is really a deeper extension of scenarios 1 and 2. At first we may only monitor the equipment in the server room, and then we want to monitor the higher-level business and application systems as well. Now we want to collect everything in the enterprise that can generate logs: logs produced by the development team during development, logs produced during business operation, server room operations logs, and so on. Collected together, these form a unified log warehouse, similar to a traditional data warehouse.

A data warehouse stores all business data and structured data together for later analysis. A unified log management platform collects all the logs the enterprise generates, runs real-time or offline analysis on them, and outputs the results through an API or a message queue to support specific business applications. Staff can search and analyze these logs to locate problems faster and to keep mining the value of the data. Many enterprises are gradually moving in this direction, building not only a unified internal data management platform but also a unified internal log management platform.

Data analysis and monitoring for the Internet of Things

The fourth scenario ties in with Industry 4.0 and Made in China 2025, which the state is vigorously promoting. The aim is to better support the development of manufacturing through the Internet of Things. Many manufacturers now add IoT probes or sensors to their production lines to collect all kinds of data generated while the line is running, such as workshop temperature and humidity, machine speed, pressure, and flow. The data is streamed back to the data platform and aggregated and analyzed in real time, for example through rolling statistics or real-time monitoring. When temperature, humidity, rotational speed, pressure, or flow becomes abnormal, the system must raise an alarm promptly so that workshop managers can resolve the problem in time.
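The alarm step described above can be sketched as a simple range check on each incoming sensor reading. This is a minimal illustration only, not how any particular platform implements alerting; the metric names and the limit values are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical acceptable ranges per metric as (low, high); values are illustrative only.
LIMITS: Dict[str, Tuple[float, float]] = {
    "temperature_c": (15.0, 30.0),
    "humidity_pct": (30.0, 70.0),
    "speed_rpm": (500.0, 3000.0),
    "pressure_kpa": (90.0, 120.0),
}


def check_reading(reading: Dict[str, float]) -> List[str]:
    """Return one alarm message per metric that falls outside its range."""
    alarms = []
    for metric, value in reading.items():
        if metric not in LIMITS:
            continue  # metrics without a configured range are ignored in this sketch
        low, high = LIMITS[metric]
        if value < low or value > high:
            alarms.append(f"{metric}={value} outside [{low}, {high}]")
    return alarms
```

A real deployment would more likely evaluate rules inside the streaming platform and suppress repeated alarms; the sketch only shows the comparison itself.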

In addition, the operation of the whole production line should be monitored over a period of time and even combined with quality control and quality management, in order to find causal relationships between indicators such as temperature and humidity on the line and actual production quality. Many enterprises are now experimenting with this in the Internet of Things. I think these four scenarios cover the log and operations problems encountered by both traditional and emerging industries.

The necessity of unified log management

Unified log management is therefore clearly very important for traditional industries. It not only solves their operations problems but also improves capabilities at the business level, including support for future business decisions and development. In the past, logs were scattered across servers without centralized management, were difficult to correlate, and were sometimes even deleted.

To take a simple example, security data from traditional firewalls, IPS, and similar devices is stored in each device's own log system. Very few enterprises correlate their security logs, so data like this is largely wasted. If you manage dozens or hundreds of servers and still check logs by logging in to each machine in turn, the work is tedious and inefficient. The top priority is centralized log management: collecting and aggregating the logs from all servers. In the era of big data the volume and variety of logs are huge, and enterprise data is like a gold mine waiting to be developed. But once logs are centralized, statistics and retrieval become more difficult. Traditionally we use Linux commands such as grep, awk, and wc for log retrieval and statistics, but this approach falls short when query, sorting, and statistics requirements are more demanding and the number of machines is large.
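For reference, the traditional single-machine approach amounts to something like the sketch below, a rough Python counterpart of piping grep into wc -l. The function name is illustrative. It only sees files on the local machine, which is exactly the limitation described above.

```python
import re
from pathlib import Path
from typing import Iterable


def count_matches(paths: Iterable[Path], pattern: str) -> int:
    """Count the lines matching `pattern` across local files."""
    regex = re.compile(pattern)
    total = 0
    for path in paths:
        # errors="replace" keeps the scan going if a log contains invalid bytes.
        with path.open(encoding="utf-8", errors="replace") as handle:
            total += sum(1 for line in handle if regex.search(line))
    return total
```

Running this over hundreds of servers would still require logging in to each one or copying the files somewhere first, which is the case for centralized collection.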

Technical choices for log management

There are many technical options for log management. The most traditional and simple is scripting with tools such as grep, sed, and awk. It needs no additional tooling, and many operations engineers can write such scripts themselves, but it is inefficient and error-prone. The next step is to collect the data into MySQL for unified aggregation and some simple calculations. This is easy to use, but because of MySQL's own performance limits it cannot support very large data volumes, so capacity is limited. Some enterprises use a NoSQL database to store large amounts of data, but it does not support cross-query or full-text search, so looking up a specific log entry becomes a heavy burden.

Later came big data technologies such as Hadoop, Spark, and Storm, which can process data offline, in real time, or as a data stream. They are relatively complicated to use, which places high demands on the operations team and the IT department, and they do not support full-text search, so few companies use Hadoop or Spark for log management. Today the vast majority of people doing log management use ELK, which is easy to download and install. However, ELK's optimization for production use and user experience is far from sufficient. Trying out features on small batches of data is no problem, but using ELK as a unified log warehouse or enterprise log center for a whole server room or a whole enterprise will severely test its stability and ease of use. With around 100 terabytes of data in particular, you will encounter many problems using ELK.

Key points in building a log management system

So how should we choose a log management system to support internal operations or log analysis? I think the key points of building a log management platform can be considered from eight angles: data collection, cleaning, storage, search, monitoring and alerting, analysis, reporting, and openness.

Data acquisition

Data acquisition seems like a very simple concept, but it can be subdivided into four functions: collection, parsing, conversion, and transmission.

Data collection requires the log management platform to support a wide variety of data sources, which is essential for a good collection platform. These include relational databases such as MySQL, Oracle, and perhaps SQL Server, as well as non-relational databases, message queues, search platforms such as ES, and Hadoop services. Collecting data accurately from these sources is a capability the platform must have. It is also best to collect hardware metrics, such as server CPU and memory utilization, storage utilization, and the network traffic of network devices. These metrics may not appear in the form of logs, so you need a collection tool that can be deployed on servers or network devices to gather them from the underlying hardware. This too is a capability the collection function needs to provide.

Many logs are recorded as text. For in-depth log analysis, statistics, and calculation, the log content must be extracted and split into fields. For example, a security device log needs to be divided into fields such as the time, the source device, the security event name, and the description. So the first thing to look for in a log management platform is a rich set of predefined parsing rules, so that whatever format the logs arrive in, they can be parsed into fields conveniently.
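As a rough illustration of what a parsing rule does, the sketch below splits one text line into the fields mentioned above using a regular expression. The line layout, field names, and sample values are assumptions made for the example; real devices each have their own formats.

```python
import re
from typing import Dict, Optional

# Hypothetical line layout: "<timestamp> <device> <EVENT_NAME>: <description>"
LINE_PATTERN = re.compile(
    r"^(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<device>\S+) "
    r"(?P<event>[A-Z_]+): "
    r"(?P<description>.*)$"
)


def parse_line(line: str) -> Optional[Dict[str, str]]:
    """Split one raw log line into named fields, or return None if it does not match."""
    match = LINE_PATTERN.match(line.strip())
    return match.groupdict() if match else None
```

A platform with predefined rules would in effect ship many such patterns, one per known log format, and try them against incoming data.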

The second is support for custom parsing rules for non-standard log formats. Log formats, contents, and rules are defined by each application's developers during development, so "a hundred flowers bloom" and every company's logs are different. If we apply a single set of parsing rules to the different logs of different systems, the rules will not fit. It is therefore very convenient for daily operations if users can easily define their own parsing rules, for example by marking up a sample log line to split it into fields and having the system generate the corresponding parsing rule automatically.

After collection and parsing comes data conversion. Why is conversion still needed? Because we want some fields in the log to be more readable. For example, when a user on the intranet accesses a business system, the log records the source IP address of the access. But when I analyze the log later, I do not really care about the IP address. What I care about is the account or the specific person behind it, so a translation step is needed to convert the IP address into the corresponding entity. With such conversion rules, later analysis and statistics are more accurate and the platform is easier to use.
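The conversion step can be pictured as a lookup that adds a readable field to each parsed record. The sketch below assumes a hypothetical in-memory table and hypothetical field names (src_ip, user); in practice the mapping might come from DHCP leases or a directory service.

```python
from typing import Dict

# Hypothetical lookup table; addresses and account names are illustrative only.
IP_TO_USER: Dict[str, str] = {
    "10.0.0.8": "alice",
    "10.0.0.9": "bob",
}


def enrich(record: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the record with a `user` field resolved from `src_ip`."""
    enriched = dict(record)  # copy so the original record is left untouched
    enriched["user"] = IP_TO_USER.get(record.get("src_ip", ""), "unknown")
    return enriched
```

Keeping the original IP address alongside the resolved user, as this sketch does, preserves the raw evidence in case the mapping changes later.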

Collection, parsing, and conversion are therefore all important, and none of them can be omitted. Finally, after the data has been processed, it must be sent to a store for persistence or for later analysis. Collection, parsing, conversion, and transmission are thus the four aspects to consider under data acquisition.

Data processing

Once data collection is complete, the data may need deeper processing. Simple data can be analyzed or searched without any processing. Complex business scenarios need more. For example, after a large volume of data is collected, you may need to compute simple statistics on it every five or ten minutes. Or, for business applications with strict real-time requirements, the data must be matched against an existing business model or security model as it arrives, in order to drive business services or security situation monitoring. A data acquisition platform alone cannot meet these needs. What is required is a powerful data processing platform, preferably built on a big data computing engine such as Hadoop or Spark, that can run real-time or offline computations over different kinds of data sources, support scheduled and recurring task execution, and export the results to object storage or log analysis, or to a business database that directly supports production.
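To make the "statistics every five minutes" example concrete, the sketch below groups event timestamps into five-minute windows and counts them. It is plain single-process Python for illustration, not a Hadoop or Spark job, and the window size constant is an assumption taken from the example in the text.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import Dict, Iterable

WINDOW_SECONDS = 300  # five minutes


def count_per_window(timestamps: Iterable[datetime]) -> Dict[datetime, int]:
    """Count events per five-minute window, keyed by the window start time."""
    counts: Counter = Counter()
    for ts in timestamps:
        # Seconds elapsed since the start of the window that contains ts.
        offset = (ts.minute * 60 + ts.second) % WINDOW_SECONDS
        start = ts.replace(microsecond=0) - timedelta(seconds=offset)
        counts[start] += 1
    return dict(counts)
```

A computing engine does the same grouping in a distributed way and on a schedule, which is what makes it suitable for the data volumes discussed here.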

Data analysis

After acquisition and processing comes data analysis. At this stage the collected logs must be analyzed comprehensively and quickly and the results displayed. First, the logs must be stored in one place that can hold at least terabytes or even petabytes of data and support fast search over it. To produce charts and to support monitoring, alerting, analysis, and prediction, the log management platform also needs to provide APIs for connecting to third-party monitoring platforms and tools, or for directly supporting business systems such as precision marketing and user profiling. These are the functions that data analysis needs to support. In my day-to-day conversations with users, I find that most of them run into some pain points in log analysis. I have summarized four:


Automatic field analysis

Log parsing is completed in the collection phase, where each text log is split into fields. Can the platform then compute statistics on those fields automatically, so that operations staff no longer have to write scripts and configure tasks to do the calculations? For example, the system could automatically report the average network traffic and its peak and minimum values. For error logs, it could work out the top 10 errors and which user or device each one comes from. Automatic field analysis of this kind greatly reduces the computation and task configuration work that users have to do on the platform.
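The kind of statistics described above can be sketched in a few lines once the logs are already parsed into fields. The function names and the `error` field name below are hypothetical; the point is only to show what "automatic" statistics would compute.

```python
from collections import Counter
from statistics import mean
from typing import Dict, Iterable, List, Tuple


def summarize_numeric(values: Iterable[float]) -> Dict[str, float]:
    """Average, peak, and minimum of a numeric field such as traffic."""
    data = list(values)  # raises statistics.StatisticsError below if empty
    return {"avg": mean(data), "max": max(data), "min": min(data)}


def top_errors(records: Iterable[Dict[str, str]], n: int = 10) -> List[Tuple[str, int]]:
    """The n most frequent values of the (hypothetical) `error` field."""
    counts = Counter(r["error"] for r in records if r.get("error"))
    return counts.most_common(n)
```

Grouping the error counts by user or device would follow the same pattern with a tuple key such as (error, device).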

Joint search

As the name implies, joint search means searching multiple log repositories at once with a single condition. For example, firewall, IPS, antivirus, and access logs may originate in different places, and after being collected on the log management platform they are usually kept in separate log repositories. When a security event occurs, the event record includes the source IP address or user name involved. I then need to retrieve the logs of all security devices by that IP address or user name and display the results together. This is the joint search scenario: a single query that searches all the content visible across the log repositories.
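In its simplest form, joint search is one condition applied to every repository with the hits merged into a single result. The sketch below models repositories as in-memory lists, which is an assumption for illustration; the repository and field names are hypothetical.

```python
from typing import Dict, List

Record = Dict[str, str]


def joint_search(repos: Dict[str, List[Record]], field: str, value: str) -> List[Record]:
    """Search every repository for records where `field` equals `value`.

    Each hit is returned as a copy tagged with the repository it came from.
    """
    hits: List[Record] = []
    for repo_name, records in repos.items():
        for record in records:
            if record.get(field) == value:
                hits.append({**record, "repo": repo_name})
    return hits
```

Tagging each hit with its source repository lets the combined result still show which device reported what.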

Word analysis

In daily use of log analysis, not every task is fixed, and sometimes you need to adapt flexibly to business requirements. For example, today I need to analyze the daily access behavior of a device or a user, so I search for that user's name and the log management platform lists everything that matches. On closer inspection the search returns a great deal: hundreds or even thousands of log entries, and possibly more for sensor logs. A single search condition often cannot meet your analysis needs. At that point you can add an AND condition in the search box to filter the results further.

But is there an easier way? With all the logs for this user name already found, could I select a piece of text in one of the search results and have it filled into the search box automatically to filter the results a second time? Or could I exclude from the results any log entries containing the selected words? Such a feature would greatly improve the platform's ease of use and remove a lot of everyday frustration. This is the pain point that word analysis addresses.
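Behind the interaction, selecting text to narrow or exclude results comes down to a second filter over the current result set. The sketch below assumes plain substring matching on raw log lines, which is a simplification; the function and parameter names are illustrative.

```python
from typing import Iterable, List, Sequence


def refine(
    lines: Iterable[str],
    include: Sequence[str] = (),
    exclude: Sequence[str] = (),
) -> List[str]:
    """Keep lines containing every `include` term and none of the `exclude` terms."""
    return [
        line
        for line in lines
        if all(term in line for term in include)
        and not any(term in line for term in exclude)
    ]
```

Selecting a phrase in a result would append it to include (the AND case) or to exclude (the rule-out case) and rerun the filter.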

Real-time search

All the logs generated in the system are continuously collected onto the log platform as a data stream. When searching, we want newly arrived logs to be displayed in real time as well. Then, when I change a service or recover from a failure, I can see the latest logs and easily tell whether the service is back to normal. This is much like the live scrolling of tail -f that we use every day. It is also a pain point that many users encounter in data analysis.

A product that solves these pain points and reduces the burden of using the platform can greatly relieve the pressure of daily operations and improve overall work efficiency.
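The tail -f style behavior mentioned under real-time search can be sketched as a generator that polls an open file for newly appended lines. This is a minimal local-file illustration under stated assumptions, not how a log platform streams results; the `max_idle_polls` parameter is a hypothetical addition so the loop can stop when nothing new arrives.

```python
import time
from typing import Iterator, TextIO


def follow(handle: TextIO, poll_seconds: float = 0.5, max_idle_polls: int = 0) -> Iterator[str]:
    """Yield lines appended to an open file, starting from its current position.

    With max_idle_polls=0 the generator runs forever; a positive value makes it
    stop after that many consecutive polls that find no new data.
    """
    idle = 0
    while True:
        line = handle.readline()
        if line:
            idle = 0
            # Note: a line still being written may be yielded in pieces.
            yield line.rstrip("\n")
            continue
        idle += 1
        if max_idle_polls and idle >= max_idle_polls:
            return
        time.sleep(poll_seconds)
```

To mimic tail -f on an existing file, a caller would first seek to the end of the file and then iterate over follow(handle).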
