This article explains in detail how to implement a cross-database join table SQL generator. The content is of high quality, so the editor shares it here as a reference; I hope you will have a solid understanding of the relevant topics after reading it.
1. Overview
The main purpose of the ADC (Alibaba DChain Data Converger) project is to build a set of tools: after a simple configuration of indicators at the front end, the user can query the real-time data they need from a wide table that the system generates automatically. The data sources may span databases, and multiple target media are supported. At a higher level, global real-time visibility into data is an effective way to counter the "bullwhip" effect in supply-chain data, and ADC is built with that goal in mind. The architecture of the entire ADC system is broken down below.
Architecture breakdown:
The initial data comes from the metadata center.
The dispatch center passes data in an internal format to the planning center, which analyzes the data requirements and models, generates resources and SQL through the SQL generator, and sets monitoring rules and reconciliation rules through the alarm center and the reconciliation center respectively.
The reconciliation center reconciles the data periodically to check that it is aligned.
The alarm center raises alerts for task errors and high latency.
The resource management center controls the lifecycle of resources; for example, it reclaims resources when a view is deleted.
The basic resource adaptation layer connects the group's basic resource management capabilities with various Alibaba data services, such as Aliyun MaxCompute, Flink, and Aliyun AnalyticDB.
Among them, the upstream and downstream of the SQL generator mainly involve:
Upstream planning center
Metric configuration: the user configures at the front end what data they want to see.
Raw data production: based on the user's input, determine which tables are used as data sources and the join relationships between them.
Downstream basic resource adaptation layer (adapter)
Publishes the SQL to Flink and creates the physical tables according to the table-building data.
This article introduces the SQL generator mainly from a technical point of view.
2. Technical implementation
In the implementation phase of the project, the work proceeds through requirement analysis, technical design, and joint testing and debugging. This article does not focus on the software development process; it focuses on the choice of design patterns and the design of data structures and algorithms.
Requirements analysis
In the requirements analysis phase, we identified the points that the automatic SQL generation module needs to consider, including the following:
Multiple fact tables (flow tables) and multiple dimension tables need to be supported, one of which is the main table and the others are auxiliary tables.
Changes in dimension tables should also lead to final database updates.
The relationship between the primary table and an auxiliary table is 1:1 or N:1, which means the granularity of the primary table is the finest; auxiliary tables are joined to the primary table through a unique key.
Among the flow tables there may be several that share the same unique key; these need to be associated with a full join. Tables with different unique keys are associated with a left join (a minimal sketch of these join rules follows this list).
Only table joins and UDFs are involved; there are no GROUP BY operations.
The synchronization delay must be small, and multiple sources and target media must be supported. Because query pressure falls on the target media, there is no query QPS requirement for this system.
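To make the join rules above concrete, here is a minimal sketch of the kind of SQL the generator aims to produce. The table names (order_flow, pay_flow, item_dim) and key columns are hypothetical and only illustrate the rules; the actual generated SQL depends on the configured sources.

```java
// A minimal, hypothetical illustration of the join rules described above.
// Table names (order_flow, pay_flow, item_dim) and keys are assumptions,
// not the project's actual schema.
public final class JoinRuleExample {

    // Flow tables sharing the same unique key (order_id) -> FULL JOIN.
    static final String SAME_KEY_FLOW_JOIN = """
        SELECT COALESCE(o.order_id, p.order_id) AS order_id,
               o.buyer_id, p.pay_amount
        FROM order_flow o
        FULL JOIN pay_flow p ON o.order_id = p.order_id
        """;

    // A dimension table with a different unique key (item_id) -> LEFT JOIN,
    // driven by the finest-grained main table.
    static final String DIM_LEFT_JOIN = """
        SELECT m.order_id, m.item_id, d.item_name
        FROM order_flow m
        LEFT JOIN item_dim d ON m.item_id = d.item_id
        """;

    public static void main(String[] args) {
        System.out.println(SAME_KEY_FLOW_JOIN);
        System.out.println(DIM_LEFT_JOIN);
    }
}
```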
System flow chart
After defining the requirements, we divide the overall functionality of the SQL generator into two parts:
Synchronously generate the SQL and the table-building data
Asynchronously publish SQL and build tables
The SQL generation phase is synchronous because it consists mainly of in-memory operations: if something is wrong with the data, SQL generation can fail fast. The release phase calls the basic resource adaptation layer and would take a long time to wait for synchronously, and each release step needs a stateful record so that it can be rolled back or retried, so it is implemented asynchronously. The overall function of the synchronous phase of the SQL generator is broken down into small modules, as shown in the following figure:
Inspection phase
Check whether there is a problem with the original data and fail fast so that no SQL is generated from bad input (a minimal sketch of these checks follows this list).
Parameter check: check whether the upstream provides basic parameters, such as fact table information (there can be no dimension table, but there must be a fact table).
Table type check: check whether the data source type is supported.
Partition field check: whether the wide table partition field is provided.
Join constraints: check whether the connection information of the flow table and dimension table is correct.
Master table uniqueness constraint: check whether the main table contains connection information and whether the unique key has ETL information.
Metadata check: check to see if HBase configuration information is included.
Primary key correction: fix the dimension table join key, which must be the unique key of the dimension table.
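Below is a minimal sketch of the fail-fast idea behind these checks. The request and field names (WideTableRequest, factTables, partitionField, hbaseConfig) are assumptions introduced for illustration, not the project's actual classes.

```java
import java.util.List;

// A minimal sketch of the fail-fast checks described above. The request/field
// names are hypothetical; they only illustrate failing before any SQL is built.
final class RequestValidator {

    static void check(WideTableRequest req) {
        // Parameter check: dimension tables are optional, fact tables are not.
        require(req.factTables() != null && !req.factTables().isEmpty(),
                "at least one fact table is required");
        // Partition field check: the wide table partition field must be provided.
        require(req.partitionField() != null, "wide table partition field missing");
        // Metadata check: HBase configuration must be present.
        require(req.hbaseConfig() != null, "HBase configuration missing");
        // Join constraint check: every auxiliary table must declare its join keys.
        req.auxTables().forEach(t ->
                require(!t.joinKeys().isEmpty(), "missing join keys for " + t.name()));
    }

    private static void require(boolean ok, String message) {
        if (!ok) {
            throw new IllegalArgumentException("SQL generation failed fast: " + message);
        }
    }

    // Hypothetical request shape, just enough to make the sketch self-contained.
    record WideTableRequest(List<Table> factTables, List<Table> auxTables,
                            String partitionField, Object hbaseConfig) {}
    record Table(String name, List<String> joinKeys) {}

    public static void main(String[] args) {
        Table fact = new Table("order_flow", List.of("order_id"));
        check(new WideTableRequest(List.of(fact), List.of(), "dt", new Object()));
        System.out.println("checks passed");
    }
}
```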
Data synchronization
Synchronize all the original tables and the data needed to join them (for example, a source table is synchronized in to generate a 1:1 HBase table).
Generate the priority queue: assign execution priorities to tasks such as joins and publishing.
Synchronization filling: populate the synchronization-phase HBase table data corresponding to each source table, along with the corresponding configuration items, type conversion (for example, if the source table is a MySQL table, its field types must be converted to HBase types; a minimal sketch follows this list), ETL filling, and message queues (downstream nodes are notified to run by sending messages).
Duplicate column pruning: deletes duplicate columns.
Blank column marking: columns that meet certain conditions (for example, they do not need to appear in the wide table and are not unique key columns, join key columns, or order-preserving columns) are marked as blank columns.
Order-preserving field filling: if upstream provides a field indicating when the data was created, that field is used as the order-preserving field; otherwise the time at which the system receives the data is filled in as the order-preserving field.
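As a rough illustration of the type conversion mentioned above, the sketch below maps a few MySQL column types to HBase-friendly types. The specific mapping is an assumption for illustration; the real project's conversion rules may differ.

```java
import java.util.Map;

// A minimal sketch of the type conversion performed during synchronization
// filling. The mapping below is assumed for illustration: HBase stores bytes,
// so source column types are usually reduced to a small set of serializable types.
final class MySqlToHBaseTypeMapper {

    private static final Map<String, String> MAPPING = Map.of(
            "INT", "LONG",
            "BIGINT", "LONG",
            "DECIMAL", "DOUBLE",
            "VARCHAR", "STRING",
            "DATETIME", "STRING"
    );

    static String toHBaseType(String mysqlType) {
        // Fall back to STRING for anything unrecognized.
        return MAPPING.getOrDefault(mysqlType.toUpperCase(), "STRING");
    }

    public static void main(String[] args) {
        System.out.println(toHBaseType("varchar")); // STRING
        System.out.println(toHBaseType("bigint"));  // LONG
    }
}
```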
Calculation phase
Generate a wide table and populate the SQL.
Intermediate table filling: populate the intermediate tables generated by full joins.
Reverse index filling: populate the reverse index information.
Message filling: add message queues to the intermediate tables (an update to an intermediate table can trigger downstream nodes).
Join chain alignment: align the join keys of the intermediate tables with those of the wide table.
ETL filling: populate the ETL information for the wide table columns.
Partition field filling: populate the wide table partition field.
SQL filling: populate the Flink source table mapping SQL, the Flink computation SQL, and the Flink result table mapping SQL (a hedged sketch of these three statements follows this list).
Save: store the SQL and the table-building data in the database; subsequent requests can reuse the existing data to avoid creating tables repeatedly.
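The three kinds of statements produced by the SQL filling step might look roughly like the sketch below. Table names and columns are hypothetical, and the WITH clauses carrying connector options are elided, since they depend on the concrete sources, sinks, and Flink version.

```java
// A hedged sketch of the three kinds of statements the "SQL filling" step
// produces. Everything here is illustrative; item_dim would be declared with
// its own source mapping DDL in the same way as order_flow.
public final class GeneratedSqlExample {

    // 1) Source table mapping: expose an upstream stream/table to Flink SQL.
    static final String SOURCE_DDL = """
        CREATE TABLE order_flow (
          order_id   STRING,
          item_id    STRING,
          gmt_create TIMESTAMP(3)
        ) WITH (...)
        """;

    // 2) Result (wide) table mapping: the physical target of the wide table.
    static final String SINK_DDL = """
        CREATE TABLE wide_table (
          order_id  STRING,
          item_name STRING,
          dt        STRING
        ) WITH (...)
        """;

    // 3) Computation SQL: joins the sources and writes into the wide table.
    static final String COMPUTE_SQL = """
        INSERT INTO wide_table
        SELECT o.order_id, d.item_name,
               DATE_FORMAT(o.gmt_create, 'yyyyMMdd') AS dt
        FROM order_flow o
        LEFT JOIN item_dim d ON o.item_id = d.item_id
        """;
}
```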
The asynchronous release phase publishes SQL statements to Flink.
Reasons for adding reverse indexes
Suppose two tables A and B are joined, where a non-primary-key field of table A is joined to the primary key of table B. In terms of timing, there are three possible situations:
Table B's data is generated many days before table A's data.
Table B's data is generated many days after table A's data.
Table B's data and table A's data are generated at the same time.
Let's analyze these three situations one by one.
Scenario 1: table B's data is generated many days before table A's data
If table B's data is stored in a database that supports high QPS, we can join table A's data directly against this table (as a dimension table) when table A's data arrives.
Scenario 2: table B's data is generated many days after table A's data
This scenario is troublesome. Table A's data is generated first and lands in the database too early, so by the time table B's data arrives, A's rows have already flowed through and cannot pick up B's data even when the B dimension table is joined. A related case falls under the same scenario: if B is updated after the A-B join has completed, how can the update to B be reflected in the wide table?
To solve this problem, we add a "reverse index table". If the primary key of A is id and the join key is ext_id, then we store the (ext_id, id) pairs in a table. When B's data is updated, we use B's primary key to look up the ext_id field of this table, pull out all the matching id values of table A, and flow those table A rows (by id) back into Flink for re-joining.
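Conceptually, the reverse index works as in the sketch below. In the real system the index lives in a store such as HBase and the affected rows are replayed through Flink; here an in-memory map and a print statement stand in for both, purely to show how the keys flow.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// A conceptual sketch of the reverse index described above.
final class ReverseIndexSketch {

    // ext_id (A's join key, which is B's primary key) -> set of A primary keys (id).
    private final Map<String, Set<String>> extIdToIds = new HashMap<>();

    // Called whenever a row of table A arrives: record (ext_id -> id).
    void onTableARow(String id, String extId) {
        extIdToIds.computeIfAbsent(extId, k -> new HashSet<>()).add(id);
    }

    // Called whenever a row of table B arrives or is updated: B's primary key
    // equals A's ext_id, so look up all affected A ids and replay them.
    void onTableBUpdate(String bPrimaryKey) {
        Set<String> affectedIds = extIdToIds.getOrDefault(bPrimaryKey, Set.of());
        for (String id : affectedIds) {
            System.out.println("re-emit table A row with id=" + id + " for re-join");
        }
    }

    public static void main(String[] args) {
        ReverseIndexSketch idx = new ReverseIndexSketch();
        idx.onTableARow("a1", "b100");   // A's rows arrive before B
        idx.onTableARow("a2", "b100");
        idx.onTableBUpdate("b100");      // B arrives later; a1 and a2 are replayed
    }
}
```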
3. Design patterns
Having covered the overall process of the system, let's look at its choice of design pattern. When choosing the design pattern, we observed some commonalities in data processing development work:
Once decomposed, there are many small functions.
Small functions are reused.
The small functions must execute in a strict order.
The running state of each small function needs to be recorded, so that execution can be rolled back, or interrupted and later resumed.
The data processing task has many steps, and the result of each stage affects the execution of the next, so the stages cannot be separated.
Referring to the PipeLine (pipeline) design pattern [2], after comprehensive consideration, the overall design of our system is as follows:
First, a global PipelineContainer manages multiple pipelines and the pipeline context. Each pipeline performs one task independently: for example, pipeline1 performs the synchronous SQL generation task, while pipeline2 performs the asynchronous publishing task. Publishing must happen after SQL generation, so the pipelines are stateful and chained in a fixed order. Each pipeline contains multiple reusable valves (functions). Valves can be reused and combined arbitrarily, making it easy to build further data processing tasks (for example, if we want to support access from the Tisplus dump platform in the future, we can simply splice together existing valves). A minimal sketch of this structure follows.
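The sketch below shows the container / pipeline / valve shape, assuming hypothetical class and method names; the real implementation adds state recording, rollback, and retry on top of this structure.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A minimal sketch of the container / pipeline / valve structure described above.
final class PipelineSketch {

    // A valve is one small, reusable processing step.
    interface Valve {
        void invoke(Map<String, Object> context);
    }

    // A pipeline runs its valves in a fixed order over a shared context.
    static final class Pipeline {
        private final List<Valve> valves = new ArrayList<>();
        Pipeline add(Valve v) { valves.add(v); return this; }
        void run(Map<String, Object> context) { valves.forEach(v -> v.invoke(context)); }
    }

    // The container chains pipelines and owns the shared context.
    static final class PipelineContainer {
        private final List<Pipeline> pipelines = new ArrayList<>();
        private final Map<String, Object> context = new HashMap<>();
        PipelineContainer add(Pipeline p) { pipelines.add(p); return this; }
        void runAll() { pipelines.forEach(p -> p.run(context)); }
    }

    public static void main(String[] args) {
        Pipeline generate = new Pipeline()
                .add(ctx -> ctx.put("checked", true))           // inspection valves
                .add(ctx -> ctx.put("sql", "INSERT INTO ..."));  // calculation valves
        Pipeline publish = new Pipeline()
                .add(ctx -> System.out.println("publishing: " + ctx.get("sql")));
        new PipelineContainer().add(generate).add(publish).runAll();
    }
}
```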
4. Data structures and algorithms
The key point of the SQL generator is expressing the relationships between tables (Meta nodes). Relationships between Meta nodes fall into two types: full join association and left join association (because a left join involves data timing and the added complexity of a reverse index, it is handled differently from a full join; to simplify the problem, we perform the full joins first and then the left joins).
The problem we want to solve is this: after multiple data sources are synchronized in, they are joined according to certain priorities, and the resulting wide table must be published automatically. Abstracted to the data structure level:
Each synchronized data source corresponds to a leaf node
There are many kinds of connection relationships between nodes, and there are execution priorities.
All nodes and relationships form a tree
Finally, a root node (the wide table) is obtained and published
Algorithm idea
The algorithm to solve this problem is explained below.
Priority queue
Because the joins between leaf nodes have different execution priorities, the tasks are first put into a priority queue. Each time, the highest-priority task is taken out and executed; tasks with the same priority can be reused and executed several times in a row. The schematic diagram of the priority queue is as follows:
Build a tree
The construction consists of the following steps:
1. First, we get tasks at four priority levels, from high to low:
Priority 1: the synchronization tasks for the six nodes
Priority 2: the full join tasks for nodes 1, 2, 3 and for nodes 4, 5
Priority 3: the left join tasks
Priority 4: the publish task
2. The priority-1 tasks are executed, and the six data sources are synchronized, producing six leaf nodes.
3. The priority-2 tasks are taken out and executed, producing intermediate tables 1 and 2.
4. The priority-3 tasks are taken out and executed; since nodes 1 and 4 now have parent nodes, the left joins are performed on intermediate tables 1 and 2 and node 6 instead, producing the root node.
5. The priority-4 task is taken out and executed, publishing the root node (a minimal sketch of this queue-driven construction follows).
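Here is a minimal sketch of the queue-driven construction, using Java's PriorityQueue and the priorities from the walkthrough above; the task objects and the tree bookkeeping are simplified placeholders.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// A minimal sketch of the "one queue drives the generation of one tree" idea.
// Priorities mirror the walkthrough: 1 = sync, 2 = full join, 3 = left join, 4 = publish.
final class QueueDrivenBuildSketch {

    record Task(int priority, String name) {}

    public static void main(String[] args) {
        PriorityQueue<Task> queue =
                new PriorityQueue<>(Comparator.comparingInt(Task::priority));

        queue.add(new Task(4, "publish root node"));
        queue.add(new Task(2, "full join nodes 1,2,3"));
        queue.add(new Task(2, "full join nodes 4,5"));
        queue.add(new Task(3, "left join intermediate tables with node 6"));
        for (int node = 1; node <= 6; node++) {
            queue.add(new Task(1, "sync leaf node " + node));
        }

        // Always take the highest-priority (smallest number) task next; each
        // task would attach a parent node to the tree until only the root remains.
        while (!queue.isEmpty()) {
            Task t = queue.poll();
            System.out.println("priority " + t.priority() + ": " + t.name());
        }
    }
}
```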
As you can see, the final data structure is a tree, and in this way we can support the automatic construction of complex SQL. Abstracting further, this "one queue drives the generation of one tree" model can solve a class of problems where:
The solution to the problem consists of a series of tasks with different priorities, some of which need to be reused.
The task relationship tree is constructed by taking high-priority tasks from the queue.
Finally, traverse the tree to complete each node task.
Due to space constraints, the focus here has been on the main data structures and design patterns used in developing the automatic SQL generation function.
At present, we have implemented the automatic generation and publication of SQL for arbitrary table associations, with the overall delay controlled within 2 seconds. Next, the SQL generator will be opened up to more third-party real-time computing platforms (such as Tisplus) to make access easier, and the overall system latency will be reduced further. Easy integration mainly tests the design of the architecture, which is also the focus of this article (including the design of data structures and algorithms and the choice of design pattern); reducing system latency involves optimizing the message middleware, improving code execution efficiency, and so on.
That is all on how to implement a cross-database join table SQL generator. I hope the content above has been helpful and that you have learned something from it. If you think the article is good, feel free to share it so more people can see it.