This article shows you how to design the architecture of a big data platform. The content is concise and easy to follow; I hope the detailed introduction below gives you something to take away.
Research on big data platform architecture design. The McKinsey Global Institute defines big data as data collected at a scale so large that it far exceeds the capabilities of traditional database software tools for acquisition, storage, management and analysis. It has four characteristics: massive data scale, rapid data flow, diverse data types and low value density.
In recent years, with the continuous development of IT technology, big data, machine learning and algorithms, more and more enterprises have realized the value of data, managing data as a valuable asset in its own right and using big data and machine learning capabilities to mine, identify and utilize those assets. If the overall data architecture design is ineffective or partially lacking, it becomes difficult for the business layer to use big data directly, and a huge gap opens up between big data and the business. That gap produces a series of problems in the use of big data: nobody knows what data exists, requirements are hard to express, data is hard to share, and so on. This article introduces some data platform design ideas to help the business reduce the pain points and difficulties of data development.
By combining these big data components through various platforms, we can create an efficient, easy-to-use data platform that improves the performance of business systems. Business developers no longer need to fear complex data components or concern themselves with the underlying implementation: SQL alone is enough for one-stop development and delivery of results, so big data is no longer a skill exclusive to data engineers.
I. Big data technology stack
The overall big data pipeline involves many modules, each of which is relatively complex. The figure below lists these modules and components along with their functional characteristics; later articles will cover each module's domain knowledge in detail, for example data acquisition, data transmission, real-time computing, offline computing and big data storage.
II. Lambda architecture and Kappa architecture
At present, essentially all big data architectures are based on the Lambda and Kappa patterns, and different companies have designed data architectures following these two patterns to suit their own needs. The Lambda architecture lets developers build large-scale distributed data processing systems; it is flexible and extensible, and it tolerates hardware failures and human error well. Many articles about the Lambda architecture can be found online. The Kappa architecture removes the Lambda architecture's need for two separate processing systems and the various costs that duplication brings; it is the current direction of stream-batch unification research, and many enterprises have begun to adopt this more advanced architecture.
Lambda architecture
Kappa architecture
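To make the Lambda pattern concrete, here is a minimal sketch of its serving layer, where the same metric exists as a slow but complete batch view and a fast but partial speed view, and a query merges the two. All names and the in-memory dictionaries are illustrative, not a real system's API.

```python
# Hypothetical sketch of the Lambda serving layer: the batch view covers
# everything up to the last batch run, the speed view covers only events
# that arrived since then, so summing the two gives a complete answer.


def query_page_views(page: str, batch_views: dict, speed_views: dict) -> int:
    """Merge the pre-computed batch view with the incremental speed view."""
    return batch_views.get(page, 0) + speed_views.get(page, 0)


# Example: the batch layer counted 1000 views as of last night's run,
# and the speed layer has seen 42 more since then.
batch_views = {"/home": 1000}
speed_views = {"/home": 42}
print(query_page_views("/home", batch_views, speed_views))  # 1042
```

The Kappa architecture removes exactly this duplication: everything is treated as a stream, and the batch view is just a replay of the log, so only one code path has to be maintained.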
III. Overall big data architecture under the Kappa and Lambda frameworks
At present, major companies basically use the Kappa or Lambda architecture pattern. Under these two patterns, an overall big data architecture may look as follows:
IV. End-to-end data pain points
Although the architecture above appears to connect a variety of big data components into integrated management, anyone who has done data development will feel this keenly: with such a bare architecture, business data development has to deal with many low-level tools directly. There are many pain points and difficulties in actual data development, reflected in the following aspects.
There is no data development IDE to manage the entire development process, so long pipelines cannot be managed.
There is no standard data modeling system, so different data engineers interpret the same metric with different calculation calibers.
Big data components demand a lot of expertise; when ordinary business teams use HBase, ES and other components directly, all kinds of problems arise.
Almost every company's big data stack is very complex and involves many links; when problems occur they are hard to locate, and it is hard to find the responsible person.
Data silos are hard to break: it is difficult to share data across teams and departments, and nobody knows what data the others have.
Two computing models, batch and streaming, have to be maintained; development is hard to get started with, and a unified stream-batch SQL is needed.
There is no company-level metadata system, so the same piece of data is hard to reuse across real-time and offline computing, and every development task has to sort things out from scratch.
Basically, most companies face the above pain points in data platform governance and openness. In a complex data architecture, every ambiguous link or unfriendly function makes an already complex pipeline more complex for data users. To solve these pain points, every link must be carefully polished and the technical components above seamlessly connected, so that using data end to end is as simple as writing a SQL query against a database.
V. An excellent overall big data architecture design
Provide a variety of platforms and tools to support the data platform: a data acquisition platform for multiple data sources, a one-click data synchronization platform, a data quality and modeling platform, a metadata system, a unified data access platform, real-time and offline computing platforms, a resource scheduling platform, and a one-stop development IDE.
VI. Metadata: the cornerstone of the big data system
Metadata connects data sources, data warehouses and data applications, recording the complete path of data from production to consumption. Metadata includes static table, column and partition information (the metastore); dynamic task and table dependency mappings; the warehouse's model definitions and data life cycles; and ETL scheduling information such as each task's inputs and outputs. It is the foundation of data management, data content and data applications. For example, metadata can be used to build a data map across tasks, tables, columns and users; to construct the task dependency DAG and order task execution; to build task profiles for managing task quality; and to give individuals or business units asset management and an overview of compute resource consumption.
The entire data flow of a big data system can be considered to be governed by metadata. Without a complete metadata design, problems arise such as data that is hard to trace, permissions that are hard to control, resources that are hard to manage, and data that is hard to share.
Many companies rely on Hive to manage metadata, but in my view, beyond a certain stage of growth it becomes necessary to build a dedicated metadata platform to match the architecture.
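To make the idea tangible, here is a minimal sketch of the kind of record such a platform might keep per table, assuming a simple in-memory registry; field names such as `owner` and `upstream` are illustrative, not any real system's schema.

```python
# Hypothetical table-metadata record plus lineage edges, so that impact
# analysis and ownership lookups become simple graph walks.

from dataclasses import dataclass, field


@dataclass
class TableMeta:
    name: str                  # e.g. "dwd_order_detail"
    columns: dict              # column name -> type: the static metastore part
    owner: str                 # responsible person, so issues can be routed
    upstream: list = field(default_factory=list)    # tables this one reads
    downstream: list = field(default_factory=list)  # tables/tasks reading it


registry: dict[str, TableMeta] = {}


def register(meta: TableMeta) -> None:
    """Record a table and wire up its lineage edges."""
    registry[meta.name] = meta
    for parent in meta.upstream:
        if parent in registry:
            registry[parent].downstream.append(meta.name)


register(TableMeta("ods_order", {"order_id": "bigint"}, owner="alice"))
register(TableMeta("dwd_order_detail",
                   {"order_id": "bigint", "spu_id": "bigint"},
                   owner="bob", upstream=["ods_order"]))

# With lineage edges in place, "who reads this table?" is one lookup:
print(registry["ods_order"].downstream)  # ['dwd_order_detail']
```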
VII. Unified stream-batch computing
Maintaining two computing engines, such as Spark for offline computing and Flink for real-time computing, causes great trouble for users, who must learn both streaming and batch domain knowledge. If Spark or Hadoop is used offline and Flink in real time, a custom DSL description language can be developed to map onto the different engines' syntaxes; upper-layer users then need not care about the underlying execution details and only have to master one DSL to drive Spark, Hadoop, Flink and other engines.
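Here is a hedged sketch of the "one DSL, two engines" idea in its simplest form, where the shared language is SQL itself: the user submits a single statement, and a thin dispatch layer decides whether it runs on Spark (batch) or Flink (streaming). The PySpark and PyFlink calls below are real APIs, but the routing layer and all names are illustrative; a production DSL would also unify table registration and type differences between the engines.

```python
# Hypothetical dispatch layer: one SQL statement, two execution engines.


def run_unified_sql(sql: str, mode: str):
    """Execute the same SQL on the batch or streaming engine."""
    if mode == "batch":
        from pyspark.sql import SparkSession
        spark = SparkSession.builder.appName("unified-sql").getOrCreate()
        return spark.sql(sql)          # Spark SQL handles the batch path
    elif mode == "stream":
        from pyflink.table import EnvironmentSettings, TableEnvironment
        t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())
        return t_env.execute_sql(sql)  # Flink SQL handles the streaming path
    raise ValueError(f"unknown mode: {mode}")


# The business side writes the SQL once; only the mode flag differs
# (assumes an `orders` table is registered with each engine):
# run_unified_sql("SELECT spu_id, COUNT(*) FROM orders GROUP BY spu_id", "batch")
# run_unified_sql("SELECT spu_id, COUNT(*) FROM orders GROUP BY spu_id", "stream")
```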
VIII. Real-time and offline ETL platforms
ETL, or Extract-Transform-Load, describes the process of extracting data from a source, transforming it, and loading it into a destination. The term is most common in data warehousing, but its scope is not limited to warehouses. Generally speaking, an ETL platform plays an important role in data cleansing, data format conversion, data completion, data quality management and so on. As the key intermediate layer for data cleansing, an ETL platform should provide at least the following capabilities (a sketch follows the list):
Support for multiple data sources, such as message systems and file systems
Support for multiple operators: filtering, splitting, conversion, output, completion by querying a data source, and other operator capabilities
Support for dynamic logic changes; for example, the operators above can be published as dynamically submitted jars, changing behavior without stopping the service
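Below is a minimal sketch of that operator model: each operator is a plain function over a record, and the platform chains them. The operator names and the enrichment lookup are illustrative; a real platform would load operators from dynamically submitted jars or plugins to allow changes without downtime.

```python
# Hypothetical ETL operator chain: filter -> transform -> enrich.

from typing import Callable, Iterable, Iterator, Optional

Record = dict
Operator = Callable[[Record], Optional[Record]]


def filter_op(record: Record) -> Optional[Record]:
    """Drop malformed records (returning None removes the record)."""
    return record if record.get("user_id") else None


def transform_op(record: Record) -> Record:
    """Normalize field names, e.g. unify good_id/spu_id (see section X)."""
    if "good_id" in record:
        record["spu_id"] = record.pop("good_id")
    return record


def enrich_op(record: Record) -> Record:
    """Complete the record from a dimension source (stubbed here)."""
    record.setdefault("channel", "unknown")
    return record


def run_pipeline(records: Iterable[Record], ops: list) -> Iterator[Record]:
    for record in records:
        out: Optional[Record] = record
        for op in ops:
            out = op(out)
            if out is None:  # filtered out, stop applying operators
                break
        if out is not None:
            yield out


raw = [{"user_id": 1, "good_id": 101}, {"good_id": 102}]  # second is malformed
print(list(run_pipeline(raw, [filter_op, transform_op, enrich_op])))
# [{'user_id': 1, 'spu_id': 101, 'channel': 'unknown'}]
```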
IX. Intelligent unified query platform
Most data queries are demand-driven: one requirement leads to one or more developed interfaces, interface documentation is written, and the interface is opened to the business side. This model has many problems under a big data system:
The architecture is simple, but the interface granularity is very coarse, flexibility is low, extensibility is poor, and the reuse rate is low. As business requirements grow, the number of interfaces grows significantly and maintenance costs rise.
Development efficiency is also low; for a massive data system this clearly causes a lot of duplicated development, makes data and logic hard to reuse, and seriously degrades the experience of the business side.
If there is no unified query platform and HBase and similar stores are exposed directly to the business, subsequent operation, maintenance and data permission management become very difficult. Accessing big data components directly is also painful for the business application side, where all kinds of problems arise at the slightest carelessness.
These query pain points can be solved with a unified intelligent query platform, sketched below.
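The following is an illustrative sketch of such a layer: instead of one handwritten interface per requirement, the business describes a query declaratively and the platform checks permissions and routes it to the right storage engine. The backends here are stubs, and the routing rules and names are assumptions, not a real product's API.

```python
# Hypothetical unified query entry point with permission check and routing.


def check_permission(user: str, table: str) -> bool:
    # Stub: a real platform would consult the metadata/permission system.
    return True


def unified_query(user: str, table: str, filters: dict, backend: str):
    """One entry point; the engine choice stays hidden from the caller."""
    if not check_permission(user, table):
        raise PermissionError(f"{user} may not read {table}")
    if backend == "hbase":
        return hbase_get(table, filters)   # point lookups by row key
    if backend == "es":
        return es_search(table, filters)   # full-text / ad-hoc search
    if backend == "olap":
        return olap_sql(table, filters)    # aggregations over large scans
    raise ValueError(f"unknown backend: {backend}")


def hbase_get(table, filters):  # stubbed storage adapters
    return [{"source": "hbase", "table": table, **filters}]

def es_search(table, filters):
    return [{"source": "es", "table": table, **filters}]

def olap_sql(table, filters):
    return [{"source": "olap", "table": table, **filters}]


print(unified_query("alice", "dwd_order_detail", {"spu_id": 101}, "hbase"))
```

With such a layer in place, adding a new query is configuration rather than a new interface, and permission and audit logic lives in one place instead of in every handwritten endpoint.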
X. Data warehouse modeling standards
As business complexity and data scale grow, chaotic data calls and copies, resources wasted on duplicated construction, ambiguity from inconsistently defined metrics, and an ever-rising threshold for data use all become problems. Take tracking events and warehouse usage the author has witnessed: for the same product identifier, some table fields are named good_id, others spu_id, and there are many other variants, which causes great trouble for anyone trying to use the data. The lack of a complete big data modeling system therefore creates great difficulties for data governance, reflected in the following aspects:
Data standards are inconsistent: even the same name can carry different definition calibers. UV alone has more than a dozen definitions, which raises the questions: they are all called uv, so which one should I use? And if they are all uv, why do the numbers differ?
R&D costs balloon: every engineer has to know every detail of the development process from beginning to end, and everyone steps into the same traps again, wasting development time and energy. This is exactly the problem the author ran into: extracting data in actual development was far too hard.
There is no unified standard management, leading to wasted resources such as duplicated computation; unclear table layering and granularity also make duplicated storage severe.
Therefore, big data development and warehouse table design must adhere to design principles, and the development platform can programmatically reject unreasonable designs, as in Alibaba's OneData methodology. In general, data development should follow guidelines like the following:
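As one illustration, here is a small sketch of how a platform could enforce such standards at table-creation time. The layer prefixes follow the common ods/dwd/dws/ads warehouse convention, but the exact rules and the canonical-field table are assumptions for the example.

```python
# Hypothetical standards check run before a table definition is accepted.

import re

# layer prefixes: ods = raw, dwd = detail, dws = summary, ads = application
TABLE_NAME_RE = re.compile(r"^(ods|dwd|dws|ads)_[a-z][a-z0-9_]*$")

# One blessed name per business concept, so the good_id/spu_id confusion
# described above is rejected before it enters the warehouse.
CANONICAL_FIELDS = {"good_id": "spu_id", "goods_id": "spu_id"}


def validate_table(name: str, columns: list) -> list:
    """Return a list of standards violations; empty means the design passes."""
    problems = []
    if not TABLE_NAME_RE.match(name):
        problems.append(f"table name '{name}' must follow layer_subject naming")
    for col in columns:
        if col in CANONICAL_FIELDS:
            problems.append(f"column '{col}' should be '{CANONICAL_FIELDS[col]}'")
    return problems


print(validate_table("dwd_order_detail", ["spu_id", "order_id"]))  # []
print(validate_table("OrderTable", ["good_id"]))  # two violations
```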
XI. One-click data integration platform
Collecting all kinds of data into the data platform should take one click. The data transmission platform connects data seamlessly to the ETL platform; ETL talks to the metadata platform to standardize schema definitions, then converts and routes the data to the real-time and offline computing platforms. Any subsequent offline or real-time processing of the data only requires applying for permission on the metadata table before developing the computing task. Data acquisition supports many sources, such as binlog, log collection, front-end tracking events, and Kafka message queues.
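Here is a hedged sketch of the collection side of such a platform: events from a Kafka topic are validated against the schema registered in the metadata platform and forwarded to the ETL layer. The `KafkaConsumer` API is from the kafka-python package; the topic name, schema and forwarding stub are illustrative assumptions.

```python
# Hypothetical collector: Kafka topic -> schema check -> ETL platform.

import json
from kafka import KafkaConsumer

# Schema the metadata platform would hand back for this source (assumed).
REGISTERED_SCHEMA = {"user_id", "event", "ts"}


def collect(topic: str, bootstrap: str = "localhost:9092") -> None:
    consumer = KafkaConsumer(
        topic,
        bootstrap_servers=bootstrap,
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )
    for message in consumer:
        record = message.value
        # Standardize against the registered schema before handing to ETL.
        if REGISTERED_SCHEMA.issubset(record):
            forward_to_etl(record)


def forward_to_etl(record: dict) -> None:
    # Stub: the real platform would write to the ETL platform's ingest API.
    print("to ETL:", record)


# collect("frontend_burial_points")  # e.g. front-end tracking events
```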
XII. Data development IDE: an efficient end-to-end tool
The data development IDE is the one-stop tool for efficient data development: through it, both real-time and offline computing tasks can be developed, and all the platforms above are connected into a single solution. It provides a full range of product services such as data integration, data development, data management, data quality and data services in one development and management interface: data can be transmitted, transformed and integrated through the IDE, imported from different data stores, transformed and developed, and finally synchronized to other data systems. An efficient big data IDE shields engineers from the various pain points and combines the platform capabilities above, so that big data development becomes as simple as writing SQL.
That is how to design the architecture of a big data platform. I hope you have picked up some knowledge or skills along the way.