2025-02-21 Update From: SLTechnology News&Howtos
Shulou (Shulou.com) 06/03 report
As data has gradually become a valuable corporate asset, the big data team tends to play an ever more important role in a company. A big data team often carries important responsibilities such as maintaining the data platform, developing data products, and mining business value from data. So for many big data engineers, choosing appropriate big data components for the business and doing sound big data architecture work is the most common problem in daily work. Here, based on Qiniuyun's log analysis practice at a scale of hundreds of billions of new records per day, I would like to share some experience in selecting a big data technical architecture.
What does a big data architect focus on?
In a big data team, the core issue a big data architect focuses on is the selection of the technical architecture. What factors affect this choice? In our practice, architecture selection in the big data domain is most affected by the following factors:
Data magnitude
This is a particularly important factor in the big data field. Fundamentally, data volume is itself a measure of the business scenario: differences in data volume usually indicate different business scenarios.
Business requirements
An experienced big data architect can extract the core technical points from numerous business requirements and choose an appropriate technical architecture based on those abstracted points. Typical business requirements include: the application's real-time requirements, query dimensions and flexibility, multi-tenancy, security and audit requirements, and so on.
Maintenance cost
On the one hand, a big data architect should clearly understand the strengths and weaknesses of the various big data technology stacks and, while meeting business requirements, fully optimize the architecture: a reasonable architecture reduces maintenance costs and improves development efficiency.
On the other hand, the architect should know the team members well, understand their technical expertise and preferences, and ensure the chosen architecture is recognized and understood by the team so that it can be maintained and evolved well.
Next, we will look at how each of these factors shapes the architecture selection that best suits a team's business.
Technical architecture selection
Business requirements are varied, and what usually drives our technology selection is not the details of a requirement but a few specific scenarios distilled from it. For example, the business may ask us to build a log analysis system or a user behavior analysis system. What specific points should we focus on behind such requirements? This is a very interesting question. In practice, our questions about these requirements usually come down to the following:
Among them, data volume is an important factor in our technology selection. Beyond the amount of data, the needs of the various business scenarios also influence our choice of technical components.
Data magnitude
As mentioned above, data volume is a measure of a particular business scenario, and it is also the factor with the greatest impact on a big data application. Businesses at different data volumes usually call for different ways of thinking.
At a data volume of roughly 10 GB, with total record counts in the tens of millions, the data is often the core data of the business, such as the user information base. Because of its core business value, this data usually requires strong consistency and real-time access. At this level, a traditional relational database such as MySQL can handle most business requirements well. Of course, for problems that relational databases solve poorly, such as full-text search, the architect should bring in a search engine such as Solr or Elasticsearch as the business requires.
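To make the full-text point concrete, here is a minimal, illustrative sketch of the inverted-index idea that engines such as Solr and Elasticsearch are built on. It is a toy model, not their actual implementation: real engines add tokenization, scoring, and on-disk segment structures.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lower-cased term to the set of doc ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of docs containing every query term (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

docs = {1: "user login failed", 2: "payment timeout", 3: "user payment ok"}
idx = build_inverted_index(docs)
```

A term lookup here is a dictionary access plus set intersections, which is why full-text engines answer keyword queries without the full table scan a SQL LIKE '%keyword%' requires.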
If the data volume grows to the range of 100 million to 1 billion records, this stage generally presents a choice: adopt a traditional RDBMS with careful indexing plus sharding across databases and tables, or move to components such as SQL on Hadoop, HTAP, or OLAP systems? There is considerable flexibility here. Generally speaking, our experience is that if the team has expert engineers in databases and middleware and wants to keep the architecture simple, it can continue with a traditional relational database. If, however, you want higher scalability for future business and the ability to support a wider range of requirements in the foreseeable future, big data components are recommended.
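For the RDBMS-plus-sharding route, the core mechanic is deterministic key routing. A minimal sketch, where the shard counts and the db_N / user_tab_N naming are invented for illustration:

```python
import hashlib

def route(user_id, num_dbs=4, tables_per_db=8):
    """Deterministically map a key to a (database, table) pair.

    Hashing spreads keys evenly; the same key always lands on the
    same shard, so lookups by user_id need to touch only one table.
    """
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    slot = int(digest, 16) % (num_dbs * tables_per_db)
    return f"db_{slot // tables_per_db}", f"user_tab_{slot % tables_per_db}"
```

The main cost of this strategy is that queries which do not include the shard key (cross-shard joins, global aggregations) become application-level fan-out work, which is exactly the pain point that pushes teams toward big data components.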
When the data volume grows to 1 billion to 10 billion records, and especially beyond 10 TB, traditional relational databases are usually excluded from the candidate architectures. At this point we have to select components for specific scenarios by examining the business carefully. For example: does the scenario require a large number of update operations? Does it need random read/write capability? Does it need a full-text index?
[Figure: typical performance of mainstream analytical engines at different data volumes]
The chart shows the general performance of some mainstream analytical engines at various data volumes; the numbers reflect typical behavior in most scenarios (not precise test results, for reference only). It is worth noting that although we always want the lowest possible latency at ever larger data volumes, there is no silver bullet in the big data field that solves every problem. Each technical component sacrifices some scenarios in order to keep an edge in its own domain.
Real-time performance
Real-time performance is such an important factor that we must consider the business's real-time requirements from the very beginning. "Real-time" in business usually has two meanings:
On the one hand, real-time performance refers to data ingestion: when business data changes, how much delay can our big data application accept before the change becomes visible? Ideally the business wants the system to be as real-time as possible, but weighing cost and technology, systems are generally divided into real-time (millisecond delay), near-real-time (second-level delay), quasi-real-time (minute-level delay), and offline (hour- or day-level delay). In general, latency trades off against throughput and computational completeness: the higher the throughput and the more thorough the computation, the longer the delay.
On the other hand, real-time performance also refers to query latency: how long a user has to wait after issuing a query before the server returns results. In most cases this depends on the form of the product. If the product is shown to end users, such as trending charts, hot search lists, recommendations, and other high-QPS products, latency must be kept sub-second. In another scenario, if the product is for data analysts or operations staff to explore data, queries will often involve large-scale, unpredictable computation, which is better suited to an offline task model; users are more tolerant there, and minute- or even hour-level output is acceptable.
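The ingestion-delay tiers described above can be expressed as a simple classifier; the thresholds follow the text (milliseconds, seconds, minutes, then hours and beyond) and are boundaries of convenience, not industry standards:

```python
def latency_tier(delay_seconds):
    """Classify end-to-end ingestion delay into the tiers named above."""
    if delay_seconds < 1:
        return "real-time"        # millisecond-level delay
    if delay_seconds < 60:
        return "near-real-time"   # second-level delay
    if delay_seconds < 3600:
        return "quasi-real-time"  # minute-level delay
    return "offline"              # hour- or day-level delay
```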
[Figure: technical component choices by real-time requirement]
As the figure shows, in the real-time domain we generally choose components such as HBase or Cassandra that support transactions and high update throughput, or HTAP components such as TiDB, Spanner, or Kudu that support both transactions and analysis.
For higher analytical performance, you can choose professional OLAP (On-Line Analytical Processing) components such as Kylin or Druid, which belong to the MOLAP (Multi-dimensional OLAP) family. They support building data cubes in advance and pre-aggregating metrics; some query flexibility is sacrificed, but real-time query performance is guaranteed.
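The pre-aggregation idea behind MOLAP engines can be sketched in a few lines: materialize the metric for every combination of dimensions up front, then answer group-by queries by lookup. This is a toy model of the data-cube concept, not how Kylin or Druid are actually implemented:

```python
from collections import defaultdict
from itertools import combinations

def build_cube(rows, dimensions, metric):
    """Pre-aggregate the metric over every subset of dimensions."""
    dims_sorted = sorted(dimensions)
    cube = defaultdict(float)
    for row in rows:
        for r in range(len(dims_sorted) + 1):
            for dims in combinations(dims_sorted, r):
                key = (dims, tuple(row[d] for d in dims))
                cube[key] += row[metric]
    return dict(cube)

def query(cube, **filters):
    """Answer an aggregate query by a single lookup in the cube."""
    dims = tuple(sorted(filters))
    return cube.get((dims, tuple(filters[d] for d in dims)), 0.0)
```

Query time is a dictionary lookup regardless of row count, which is the millisecond-latency property these engines trade storage and ingestion-time work for.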
Elasticsearch is the most flexible NoSQL query engine. On the one hand it supports full-text indexing, which the other engines lack; it also supports limited updates, aggregate analysis, and search over detail data, which suits many near-real-time scenarios. However, because ES is built on the Lucene storage engine, its resource cost is relatively high, and its analytical performance has no advantage over the other engines.
In addition, if our data is archived offline or append-only, and the product depends on computation over large amounts of data, the product can usually tolerate high query latency. The Hadoop ecosystem fits this field well: the successor to MapReduce, Spark, and the SQL on Hadoop family, such as Drill, Impala, and Presto, each have their own strengths, and we can choose among them in combination with other business needs.
Computation dimensions / flexibility
Computation dimensions and computational flexibility are two important factors in engine selection. Imagine that our product only produces a fixed set of metrics: then we can compute them offline with Spark, import the results into a business database such as MySQL, and serve the result set for display.
But if the query is interactive and users can choose the dimensions to aggregate on, we cannot predict all the permutations and combinations of dimensions, so we may need an OLAP component that pre-aggregates metrics along specified dimensions. This choice increases the flexibility of result display and greatly reduces query latency.
Going further, if users not only compute metrics but also query the original detail records, an OLAP component may no longer be applicable, and a more flexible component such as ES or SQL on Hadoop is needed. At this point, if full-text search is required, choose ES; if not, choose SQL on Hadoop.
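The selection logic of the last three paragraphs can be condensed into a small heuristic. This is our reading of the reasoning above, not an authoritative decision rule:

```python
def suggest_engine(fixed_metrics, needs_raw_detail, needs_full_text):
    """Rough engine suggestion following the selection reasoning above."""
    if fixed_metrics:
        # Fixed indicators: precompute offline, serve results from an RDBMS.
        return "offline compute + RDBMS result store"
    if not needs_raw_detail:
        # Interactive dimension aggregation without detail queries.
        return "MOLAP engine (e.g. Druid or Kylin)"
    if needs_full_text:
        return "Elasticsearch"
    return "SQL on Hadoop (e.g. Presto or Impala)"
```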
Multi-tenant
Multi-tenancy is another issue big data architects often need to consider. Multi-tenant requirements usually come from serving many different users, which is very common for a company's infrastructure department.
What should a multi-tenant system consider?
The first is resource isolation. From the perspective of saving resources, sharing resources among tenants makes the fullest use of them, which is what an infrastructure department generally wants. However, some tenants may run higher-priority businesses or much larger data volumes, and sharing resources with ordinary tenants may lead to resource contention. In such cases, physical resource isolation should be considered.
Second, user security must be considered. On the one hand, authentication is needed to prevent malicious or unauthorized access to data. On the other hand, a thorough security audit is needed: every sensitive operation is recorded in an audit log, so each action can be traced back to its source IP and operating user.
The third and most important point is data permissions. A multi-tenant system does not only mean isolation; it also means resources can be shared and used more reasonably and effectively. Today, data permissions are often no longer limited to read/write access on a file or a whole warehouse. More often we must be able to grant access to a subset of the data or to specific fields, so that each data owner can distribute resources safely to the tenants that need them. Data is then used more efficiently, which is an important mission of a data platform or application.
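Subset- and field-level grants can be pictured as a filter-and-project step applied on behalf of the tenant. A minimal sketch with a made-up grant format (real systems express this in policy engines or view definitions):

```python
def apply_grant(rows, grant):
    """Return only the rows and fields a tenant's grant allows.

    Illustrative grant shape:
      {"filter": {...equality conditions...}, "fields": [...visible columns...]}
    """
    visible = []
    for row in rows:
        # Row-level restriction: the tenant sees only its data subset.
        if all(row.get(k) == v for k, v in grant["filter"].items()):
            # Field-level restriction: project away ungranted columns.
            visible.append({f: row[f] for f in grant["fields"] if f in row})
    return visible
```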
Maintenance cost
For architects, the maintenance cost of a big data platform is a crucial indicator, and experienced architects choose technical solutions that fit the characteristics of their own team.
[Figure: four quadrants of big data platforms by service dependence and component complexity]
As the figure shows, big data platforms can be divided into four quadrants according to service dependence (relying on cloud services versus self-built platforms) and the complexity of the technical components.
Usage cost is proportional to component complexity. Generally, the more complex the components and the greater their number, the higher the cost of using them together.
Maintenance cost is related to the service provider and to component complexity. Generally, a single technical component costs less to maintain than a complex combination, and components provided as cloud services cost less to maintain than self-built big data components.
Team requirements, like usage cost, generally rise with component complexity. On the other hand, team requirements also depend on the service provider: if a cloud provider takes on the operations and maintenance of the components, it frees more of the business team's engineers from operations work to participate in big data application development.
So in general, an architect's preference should be to choose the simplest technical architecture that meets the business and data-volume requirements, because that selection is usually the easiest to use and maintain. On that basis, if you have a strong development and operations team, you can build the big data platform yourself; if you lack sufficient operations and development support, a cloud service platform is recommended to support the business.
How does Qiniuyun do architecture selection?
Qiniuyun's big data team is called Pandora. Its main work is to support the big data platform needs within Qiniuyun; it is also responsible for productizing the big data platform and providing professional big data services to external customers. Qiniuyun can be said to be Pandora's first customer, and much of our technology-selection experience was accumulated while carrying the company's various internal needs.
Features and business challenges of Qiniuyun
A brief introduction to the challenges we face at Qiniuyun. Besides Pandora, Qiniuyun has six product teams, including cloud storage, live streaming cloud, CDN, intelligent multimedia API services, and container cloud. All business data and log data generated by these teams are collected into Pandora's unified log storage through logkit (Professional Edition), a collection tool developed by Pandora. Various departments then use this data to build a variety of data applications.
First, the commercial operations department, which carries the important mission of Qiniuyun's revenue and growth, needs the instrumentation and log data collected by each team to build a unified user view and user profiles on top of it, provide customers with more attentive operational services, and improve customer satisfaction.
In addition, the SRE team needs in-depth performance tracing of the online systems, and we provide OpenTracing interface support here. Under Qiniuyun's relatively unified technology stack, we can easily support full-link monitoring, so the SRE department can trace and monitor online service performance without relying on the R&D teams, and can more easily find where a service goes wrong.
The product R&D side raised the need for full-text indexing: in nearly 100 TB of daily logs, we must be able to quickly locate log data by keyword and query the log context. Beyond that, we also need to parse key fields in application logs, such as user id, response time, and download traffic, so as to build user-level operational metrics monitoring and serve customers more precisely.
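Field extraction from application logs is typically regex- or parser-driven. A minimal sketch, assuming a hypothetical log layout (real formats will differ, and Pandora's actual parsers are not shown here):

```python
import re

# Hypothetical log layout, e.g.:
# "2024-01-01T12:00:00 uid=123 resp_ms=45 bytes=1024 GET /v1/file"
LOG_RE = re.compile(r"uid=(?P<uid>\d+)\s+resp_ms=(?P<resp_ms>\d+)\s+bytes=(?P<bytes>\d+)")

def extract_fields(line):
    """Pull user id, response time, and traffic out of one log line."""
    m = LOG_RE.search(line)
    if m is None:
        return None  # line does not match; route to a dead-letter path
    return {k: int(v) for k, v in m.groupdict().items()}
```

Once lines are structured like this, per-user aggregation (latency percentiles, traffic totals) becomes an ordinary group-by over the extracted fields.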
Of course, no matter which business unit raises the requirements, they all need an excellent and flexible report display system that supports analysis, exploration, and decision-making, with a reasonable architecture underneath to support complex business reports and BI requirements.
Landing the architecture at Qiniuyun
Taking into account the product needs of all parties, we made the following product design:
[Figure: Pandora product architecture]
We first developed logkit Professional Edition to professionally collect and synchronize data from various open source systems and log files. We also designed a data bus, Pipeline, around a characteristic of Qiniuyun's workload: data throughput is very large, but second-level delays are acceptable. Here we use multiple Kafka clusters plus Spark Streaming with a self-developed traffic scheduling system, which can efficiently export data to the downstream unified log storage products, while Spark Streaming makes it easy to do log parsing, field extraction, and similar work.
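The traffic scheduling system itself is proprietary and not described in detail here; as a toy stand-in for the idea of spreading load across multiple Kafka clusters, weighted routing can be sketched like this (cluster names and weights are invented):

```python
import bisect
import random

class WeightedRouter:
    """Spread events across clusters in proportion to configured weights."""

    def __init__(self, weights):
        # e.g. {"kafka-a": 3, "kafka-b": 1} sends ~75% of traffic to kafka-a
        self.names = list(weights)
        self.cum, total = [], 0
        for name in self.names:
            total += weights[name]
            self.cum.append(total)  # cumulative weight boundaries
        self.total = total

    def pick(self, rng=random.random):
        """Pick a cluster; rng is injectable so tests are deterministic."""
        x = rng() * self.total
        return self.names[bisect.bisect_right(self.cum, x)]
```

A real scheduler would also react to lag and broker health rather than using static weights; this only illustrates the proportional-routing core.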
On top of the unified log storage we support both self-developed and various third-party chart display systems. We use a hybrid architecture for the back-end data systems, whose main body contains three basic products.
Log analysis platform
Based on Qiniuyun's customized version of ES, we built a log storage and indexing system; the cluster can still guarantee second-level returns for searches over billions of records under a throughput of one million events per second.
Data cube
Data Cube, an OLAP product based on a customized Druid, provides high-performance multi-tenant queries, delivering millisecond-level aggregate analysis for our largest customers with 30 TB+ of raw data per day.
Offline workflow
Based on the storage layer and a Spark workflow platform, it provides offline data computing and can handle large-scale computation and analysis of PB-level data.
Architectural advantage
After all this practice, what kind of product does Pandora bring to internal and external users? Comparing with excellent commercial and open source products in the industry, we conclude that Qiniuyun has the following advantages:
Comprehensive multi-tenant support
Pandora provides multiple levels of isolation in multi-tenant resource isolation, including low-level namespace isolation: by limiting shared resources such as CPU and memory, we ensure that all customers can use the cluster safely. Furthermore, to meet the customized needs of more customers, we also use dynamically scalable multi-cluster deployment to support spatial isolation between tenants, so users can run on independent resources.
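Namespace-style soft isolation boils down to checking per-tenant usage against quotas before admitting work. A minimal sketch, with invented quota keys (real enforcement happens in the scheduler or container runtime, not application code like this):

```python
class QuotaGuard:
    """Track per-tenant resource usage against fixed quotas."""

    def __init__(self, quotas):
        # e.g. {"tenant-a": {"cpu": 8, "mem_gb": 32}}
        self.quotas = quotas
        self.usage = {t: {k: 0 for k in q} for t, q in quotas.items()}

    def try_acquire(self, tenant, **request):
        """Admit the request only if it fits within the tenant's quota."""
        quota, used = self.quotas[tenant], self.usage[tenant]
        if any(used[k] + v > quota[k] for k, v in request.items()):
            return False  # would exceed quota; reject, leave usage unchanged
        for k, v in request.items():
            used[k] += v
        return True
```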
In addition, we have done a great deal of work on security, permissions, and auditing, which matter in multi-tenant scenarios. We can manage data at the granularity of data subsets and fields and grant it to other tenants. At the same time, we audit every operation on the data, down to the source IP and operator, to ensure the data security of the cloud service.
Support rich business scenarios
On Pandora's log-domain big data platform, we support both real-time and offline computing models. Using the workflow interface, it is easy to operate the various jobs, and products and tools such as log analysis and the data cube support a variety of business scenarios, including but not limited to:
User behavior analysis
Application performance monitoring
System equipment performance monitoring
Non-intrusive instrumentation, with full-link tracing to find bottlenecks in distributed applications
Data Analysis and Monitoring of IoT equipment
Security, audit and monitoring
Machine learning, automatic system anomaly detection and attribution analysis
Very large-scale data verification in public cloud
We have served more than 200 named customers on the public cloud, with more than 250 TB of data flowing in every day, about 365 billion records per day. The amount of data involved in daily computation and analysis has exceeded 3.2 PB. At this very large public cloud scale, the Pandora big data log analysis platform has been verified to provide customers with a stable computing platform and good business support.
Users enjoy the lowest operation and maintenance cost
Pandora's product design philosophy holds that a cloud service should be an integrated product. Therefore, although Pandora adapts to a large number of application scenarios, to the customer it is still a single product component, so the operations cost is lowest for customers adopting Qiniuyun's big data service: a single development team can take care of all data development and operations, which greatly helps rapid business iteration and growth.
These are some of the experiences, drawn from Pandora's practice, that I wanted to share with you.