A series of methods for optimizing Join operations (1) 07/06 Update SLTechnology News&Howtos

A series of methods for optimizing Join operations (1)

2025-07-06 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Internet Technology >

Shulou(Shulou.com)06/03 Report--

JOIN is a common operation in relational database, which is used to associate multiple tables. The association condition is to determine whether the values of an associated field are equal or not. With the increase of associated tables or the increasing complexity of association conditions, it can be said that JOIN is the most challenging SQL operation in terms of understanding the meaning of the query, implementing the query statement, or in terms of query performance.

In particular, the performance of JOIN has always been a difficult problem. Next we will provide some ways to improve computing performance based on data computing middleware (DCM)-aggregator.

Of course, we are not introducing how to write JOIN when writing SQL statements, that is, we assume that we have a correct understanding of the query requirements and can correctly implement SQL. In this case, in order to improve the performance, we must start with the most basic means of improving data IO coordination algorithm and parallelism. Because of this, if the data is still stored in the database, there is no good way to speed up, because the IO of the database is very inefficient and almost impossible to run in parallel, even if the calculation is written in a delicate way. Therefore, in order to improve performance, we must move the data out of the database, our following discussion is based on this idea, and the aggregator is the sharp weapon to achieve this idea, even the artifact!

It is easy to move the data table out of the database and store it in the set file of the aggregator with two lines of code:

one

= db.cursor ("select * from order Table")

two

= file ("Order.btx") .export @ b (A1)

These two lines of code export the data from the order table in the database to the set file Order.btx.

Because the database IO performance is poor, and the amount of data may be very large, so this "move" action may not be short, but fortunately it is an one-time. All of our later calculations will be taken from the set file.

1 to judge the type of JOIN

After moving the data out of the database, we need to determine the type of JOIN before we can take targeted optimization measures.

JOIN operation is familiar to everyone, and it is divided according to the syntax definition of SQL, including INNER JOIN (inner join), LEFT JOIN (left join), RIGHT JOIN (right join) and FULL JOIN (full join). This is divided according to the rules for handling null values in the operation. From the semantic point of view closer to the requirements, our analysis and optimization will be divided according to the situation in which the primary keys of each table participate in the association. Generally speaking, there are three kinds: foreign key table, same-dimensional table and master child table.

Foreign key table

When some fields of table An are associated with the primary key of table B, B is called the foreign key table of A, and the field associated with the primary key of table B in table An is called the foreign key of A to B. At this time, A table is also called fact table, and B table is also called dimension table.

Table A:Order order table

Order number

CustomerID

Customer number

SellerID

Sales number

OrderDate

Order date

Amount

Order amount

Table B:Customer customer table

Customer number

Name

Customer name

Area

Located in the area

Table C:seller sales personnel table

Employee number

Name

Age

……

This is a typical example, where the customer number of the order table is associated with the primary key customer number of the customer table, and A to B is a many-to-one relationship, that is, there may be multiple records in A table pointing to the same record in B table.

In this case, the value of the foreign key field ("CustomerID" in the example) can be understood as a "pointer" pointing to the corresponding record in the foreign key table, while the corresponding record in the foreign key table can be understood as an object, and the field of the foreign key table can be understood as the property of the object, and the function of the "pointer" is only used to find the corresponding record in the foreign key table. In the example, when you associate Table A with Table B, you must want to get details such as the name or location of the customer for some order, so that it will be easier to understand if you can write customerID.name and customerID.area, and this syntax is perfectly supported in the aggregator.

At the same time, Table A can also have multiple foreign key tables. For example, the sales number (SellerID) of Table A can point to a salesperson information table C to obtain the attribute information of the order salesperson.

Same dimensional table

The primary key of table An is associated with the primary key of table B, and An and B are called identical-dimensional tables. The same-dimensional table is an one-to-one relationship, such as JOIN, LEFT JOIN, and FULL JOIN, such as the employee table and the manager table.

Table A:employee employee table

Employee number

Name

Salary

Wages

Table B:manager customer table

Serial number

Allowance

Subsidy

……

The primary key of both tables is the employee number ID, which means that the manager is one of the employees, but because the manager has more attributes than the average employee, it needs to be saved with a separate manager table. Logically, this one-to-one situation can be treated simply as a table. In the same dimensional table JOIN, both tables are associated by the primary key, and the corresponding records are unique.

Master and child table

The primary key of table An is associated with part of the primary key of table B. An is called the primary table and B is called the child table. Master child table is an one-to-many relationship, only JOIN and LEFT JOIN, there will be no FULL JOIN, such as: order and order details.

Table A:Order order table

Order number

CustomerID

Customer number

OrderDate

Order date

……

Table B:OrderDetail order schedule

Order number

Order serial number

Product

Order products

Price

……

The primary key of Table An is ID, and the primary keys of Table B are ID and NO. A record in Table A corresponds to multiple records in Table B. At this point, the relevant record in the order schedule can be regarded as an attribute of a record in the order table, the value of this attribute is a collection, and it is often necessary to use aggregation operations to calculate the set value as a single value. For example, query the total amount of each order, which can be described as:

SELECT ID, SUM (OrderDetail.Price) FROM Order

Obviously, the master-child table relationship is not equal, and references from both directions make sense. The case of referencing a child table from a master table is to get a single value by aggregating, while referencing a master table from a child table is similar to a foreign key table.

So, what is the use of dividing three kinds of JOIN operations, foreign key table, same-dimensional table, and master child table? To optimize performance, of course! For the JOIN operation that needs to be optimized, the later optimization will be more effective on the basis of accurately determining which type it is. In addition, it is important to note that tables An and B mentioned here do not have to be an entity table, but may also be a "logical table" generated by a subquery.

Next we begin to analyze and speed up these three types as well as the actual business situation.

2 Foreign key table with full memory

If all the data involved in the operation can be loaded into memory, then the "foreign key pointing" technology can be used to optimize the foreign key JOIN operation.

2.1 single foreign key

Taking the above order table and customer table as an example, to query the customer name and region of each order:

We need to query the order number, user name, user level and time when the order was placed for all orders, as SQL says:

SELECT order number, user name, VIP level, order time FROM order table, user information table WHERE order table. User number = user information table. User number

The implementation with the aggregator is like this:

one

= file ("user Information Table") .import@b ()

two

= A1.keys (user number)

three

= A1.index ()

four

= file ("order form") .import@b ()

five

= A4.switch (user number, A3: user number)

six

= A5.new (order number, user number. User name: user name, user number .VIP level: user level, time when the order was placed)

A1, query user data from the set file

A2, set the key of the user information table to the user number

A3, indexing by user code field

A4, query order data from set file

A5, association, establishing a pointer to the user information table record on the user code field of the A4 order table

A6, after the foreign key is pointed, the foreign key table field is used as the user name and user-level attribute.

The actual effective operations are A5 and A6, and the rest are data preparation.

For another example, this time you need to calculate the total number of orders for users at each VIP level, as SQL says:

SELECT VIP level, count (order number) order number FROM order table, user information table WHERE order table. User number = user information table. User number GROUP BY VIP level

Using an aggregator goes like this:

one

= file ("user Information Table") .import@b ()

two

= A1.keys (user number)

three

= A1.index ()

four

= file ("order form") .import@b ()

five

= A4.switch (user number, A3: user number)

six

= A5.new (order number, user number. User name: user name, user number .VIP level: user level, time when the order was placed)

seven

= A5.groups (user number .VIP level; count (order number): number of orders)

Most of the processing steps of this calculation are the same as those of the previous example, except that A7 is executed after the calculation of the previous example to summarize the associated results. This can be done because the pointer association of foreign keys has been completed in the last calculation, and the results of A5 can be reused here. In practice, these two calculations are performed in a DFX file, so the whole process only needs to be associated once.

This is also another feature of the aggregator, which can reuse the intermediate calculation results so as to improve the overall query performance. The more times of reuse, the more obvious the performance optimization. This cannot be done in SQL. Two queries have to execute SQL statements twice, and each execution has to be associated, so the overall performance is naturally poor.

In addition, there may be multi-field foreign keys, where multiple fields of the fact table are associated with a dimension table, which is slightly more complex and will be described in more detail in a later chapter.

2.2 multiple foreign keys in one layer

Let's take a look at an example of multiple foreign keys. Suppose there are three tables in the database: order table, user information table and seller information table:

We need to summarize the number of orders by user level and seller's credit rating, as SQL says:

SELECT VIP level user level, credit rating seller level, count (order number) number of orders

FROM order table, user information table, seller information table

WHERE order form. User number = user information table. User number AND order form. Seller number = seller information table. Seller number

GROUP BY VIP grade, credit rating

Using an aggregator is like this:

one

= file ("order form") .import@b ()

two

= file ("user Information Table"). Import@b (). Keys (user number). Index ()

three

= file ("seller information table"). Import@b (). Keys (seller number). Index ()

four

= A1.switch (user number, A2: user number; seller number, A3: seller number)

five

= A4.groups (user number .VIP level: user level, seller number. Credit rating: seller's grade; sum (order number): number of orders)

A1, query order data in

A2, query user data in

A3, query seller data in

A4, one-time association between the user information table and the seller information table

A5, summarize the associated results.

2.3 Multi-layer foreign keys

There may also be multiple layers of foreign key tables. In the following example, suppose there are three tables in the database: order schedule, commodity information table, and category information table:

We want to query the total number of goods sold according to the name of the broad category of goods. SQL says:

SELECT category name, SUM (item number) number of goods

FROM order schedule, commodity information table, category information table

WHERE order schedule. Commodity number = commodity information table. Item number AND commodity information table. Category number = category information table. Category number

GROUP BY large class name

Using an aggregator is like this:

one

= file ("order schedule") .import@b ()

two

= file ("Commodity Information Table"). Import@b (). Keys (item number). Index ()

three

= file ("Category Information Table") .import@b () .keys (Category number) .index ()

four

= A2.switch@i (category number, A3: category number)

five

= A1.switch@i (item number, A4: item number)

six

= A5.groups (article number. Category number. Large category name: large category name; sum (quantity): number of goods)

A1, query order details

A2, query commodity information data

A3, query category information data

A4, establish the pointer to the category information record in A2 through the switch function to realize the association.

A5, through the switch function to establish a pointer to the commodity information record in A1, to achieve the association, so that a three-tier correlation is obtained.

A6, summarize the associated results.

Use the method of foreign key pointing to solve JOIN, and all foreign keys can be parsed by traversing the fact table once. On the other hand, the HASH JOIN algorithm of the database can parse only one JOIN each time it is executed (meaning traversing the data).

At the same time, if all memory is used, the pointer can be reused once it is established, and if we have to calculate again for this association, there is no need to establish the association. But the database SQL is not good, but also have to do HASH JOIN, even if it is based on the SQL system of in-memory database, in this respect is not fast.

2.4 parallel computing

After the foreign key is pointed, the performance can be further optimized in parallel. For the fact table that has been loaded into memory, we can access it in segments, thus significantly improving the performance of computing in the way of parallel computing.

Again, in the case of multiple foreign keys above, the number of orders is still summarized according to the user level and the seller's credit rating. Let's see how to use the aggregator to access the fact table in parallel:

one

= file ("order form") .import@b ()

two

= A1.cursor@m (4)

three

= file ("user Information Table"). Import@b (). Keys (user number). Index ()

four

= file ("seller information table"). Import@b (). Keys (seller number). Index ()

five

= A2.switch (user number, A3: user number; seller number, A4: seller number)

six

= A5.groups (user number .VIP level: user level, seller number. Credit rating: seller's grade; sum (order number): number of orders)

This code is basically the same as the previous one, except that the access to the fact table of the order table is different. A2 uses the @ m option to divide the order table data into four segments, resulting in a multiplex cursor, based on which subsequent calculations are performed. The groups function determines automatically, and if A2 is a multipath cursor, it will be calculated automatically. Here is divided into 4 cursors, it is assumed that the number of CPU is 4, the actual use can be based on the number of CPU to determine how many segments.

Parallelism has an obvious effect on a single task, mainly because it makes full use of CPU resources, and it doesn't make much sense if there is not too much free CPU available when there is more concurrency.

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.