A series of methods for optimizing Join operations (2) 07/04 Update SLTechnology News&Howtos

A series of methods for optimizing Join operations (2)

2025-07-04 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Internet Technology >

Shulou(Shulou.com)06/03 Report--

3 foreign key table in semi-memory

The premise of foreign key pointing is that both the fact table and the dimension table can be loaded into memory, but the amount of data involved in the actual business may be very large, so this method cannot be used.

3.1 dimensional table memory

If only the fact table is large and the dimension table can still be loaded into memory, you can still use the foreign key pointing method above, as long as you modify the access to the fact table and use the cursor to fetch batches from the set file for processing. However, because this pointer is temporarily created when the cursor is fetched, it is not possible to reuse established pointers as they are in full memory.

We still summarize the number of orders according to the user level and the seller's credit rating, and the order table is too large to import into memory, so use the aggregator to achieve the following:

one

= file ("order form")

two

= A1.cursor@b ()

three

= file ("user Information Table"). Import@b (). Keys (user number). Index ()

four

= file ("seller information table"). Import@b (). Keys (seller number). Index ()

five

= A2.switch (user number, A3: user number; seller number, A4: seller number)

six

= A5.groups (user number .VIP level: user level, seller number. Credit rating: seller's grade; sum (order number): number of orders)

The principle of this implementation is the same as that of foreign key pointing, except that the data of the order table is not imported into memory at once, but is accessed through a cursor. Because fact tables continue to grow, it is common in real business that fact tables are large and dimension tables are small.

This is an example of multiple foreign keys. The situation of multi-layer foreign keys is similar to that of single-layer foreign keys, except that when a foreign key table is memorized, the foreign key table of that table must also be stored, so that the foreign key pointer of memory can be established in advance. Temporary cursor-based foreign key associations are only for the lowest level of foreign key tables.

Cursors can also implement parallel computing, as long as the above code is changed to this:

one

= file ("order form") .cursor@bm (; 4)

two

= file ("user Information Table"). Import@b (). Keys (user number). Index ()

three

= file ("seller information table"). Import@b (). Keys (seller number). Index ()

four

= A1.switch (user number, A2: user number; seller number, A3: seller number)

five

= A4.groups (user number .VIP level: user level, seller number. Credit rating: seller's grade; sum (order number): number of orders)

The order table data from the set file is divided into four cursors, and the groups function will be calculated in parallel when it is executed. The reason why we can do segmented fetch here is that the data has been exported to the set file, which cannot be done if the data is still in the database, which is one of the reasons why we export the data to the set file.

What if the dimension table is too large to load memory? This situation requires the use of clustering or optimized out-of-memory HASH JOIN technology, which we will explain in more detail in the following chapters.

3.2 Foreign key numbering

The idea of foreign key numbering is that if the primary key of the dimension table is a natural number starting from 1, then you can directly locate the dimension table record with the serial number, instead of calculating and comparing the hash value. This can be seen as the realization of foreign key pointing in external memory, thus further improving performance. According to the idea of foreign key numbering, the association between the previous order table and the user table can be changed to this:

one

= file ("user Information Table") .import@b ()

two

= file ("order form") .cursor@b ()

three

= A2.switch (subscriber number, A1VR #)

four

= A3.new (order number, user number. User name: user name, user number .VIP level: user level, time when the order was placed)

A1, import all customer tables into memory

A2, import the order table using cursors

A3, take the value of the user number as the serial number in the A2 order table, and use this serial number to find the corresponding record in the user information table and establish an association.

A4, through the attribution of the foreign key, the foreign key table field is used as the user name and user level attribute.

3.3 Serial preparation

However, the primary key of the dimension table is not necessarily an ordinal value, so you cannot directly use foreign key serialization for performance optimization. At this point, you can convert the primary key of the dimension table to a serial number and then use a foreign key to serialize it. The steps for processing are as follows:

1) create a new key value-sequence number corresponding table, and save the corresponding relationship between the key value of the dimension table and the natural sequence number.

2) replace the key value of the dimension table with the natural serial number to get a new dimension table file.

3) change the foreign key value in the fact table to the sequence number, which is based on the corresponding table of key value and sequence number, and get a new fact table after modification.

In this way, you get the new dimension table and fact table files, and the old table files can be deleted.

If new data is added to the dimension table, follow these steps:

1) append the corresponding table of key value and sequence number first

2) append the new data to the new dimension table, and correspond to the table according to the key value-serial number.

3) finally, append the fact table, which corresponds to the table according to the key value-serial number.

When you have completed the serialization of foreign keys, you can use the way of foreign key serialization to improve performance. The method of serialization is suitable for situations where the dimension table is basically unchanged, and the fact table data can be constantly appended.

The following is still an example of an order table and a user information table to illustrate a specific implementation of serialization:

1) create a new key value-sequence number corresponding table of the user information table, save it to the set file, and generate a user information table file at the same time.

one

= db.query ("select *, 0 AS NEW_ID from user Information Table order by user number")

two

= A1.run (#: NEW_ID)

three

= file ("OldKey_NewID") .export @ b (A2, user number, NEW_ID)

four

= file ("user information table") .export @ b (A2, NEW_ID: user number, user name, contact phone, VIP level)

A1 takes all the fields from the user information table of the database and adds a field NEW_ID to save the serial number

A2 assigns NEW_ID to a natural number starting at 1

A3 saves the old user number and serial number to the set file

A4 uses the value of the NEW_ID field as the value of the user number field, and saves the data to the user information table file without changing the other fields.

2) according to the order table, get the new order table.

one

= file ("OldKey_NewID") .import@b ()

two

= db.cursor ("select * from order Table")

three

= A2.switch (user number, A1: user number)

four

= A3.run (user number .new _ ID: user number)

five

= file ("New order form") .export @ ba (A4)

A1 imports the correspondence table into memory

A2 uses cursors to extract data from the order table

A3 replaces the user number field in the order table according to the corresponding table

A4 converts the value of the replaced user number field (the value of the user number field obtained by A3 is a record type, so it is converted to a field in A4)

A5 Export the cursor data to a new order table file (in practice, you may have to export it multiple times)

Through these two steps, we can complete the serialization of the existing data in the database, and export to the user information table and order table. At the same time, we also get a key value-serial number corresponding table file, named OldKey_NewID.

As mentioned earlier, serialization applies to situations where the dimension table data is basically unchanged, and if the dimension table changes, you need to recreate the data before using serialization. However, if you can clearly know the newly added data on the fact table and dimension table (for example, through conditions such as time), you can also do so in the following ways.

1) append user information table and key value-sequence number corresponding file first.

one

= db.query ("select *, 0 AS NEW_ID from user information table where registration time > '2018-01-01' order by user number")

two

= file ("user Information Table") .cursor@b () .skip ()

three

= A1.run (A2+#:NEW_ID)

four

= file ("OldKey_NewID") .export @ ab (A3, user number, NEW_ID)

five

= file ("user information table") .export @ ab (A3, NEW_ID: user number, user name, contact phone, VIP level)

A1 gets the new data to be appended to the user information table. Here is the newly registered user data from the database since 2018.

A2 gets the number of records in the user information table

A3 fill in the NEW_ ID value in the new data and continue counting from A2

A4 appends the user number and serial number to the file corresponding to the key value-serial number

A5 appends new data to the user information table file.

3) additional order form

one

= db.query ("select * from order form where order time > = '2018-01-01' order by order number")

two

= file ("OldKey_NewID") .cursor@b ()

three

= A1.switch (user number, A2: user number)

four

= A3.run (user number .new _ ID: user number)

five

= file ("order form") .export @ ba (A4)

A1 gets the cursor of the new data to be appended to the order table, here is to take the order since 2018 from the database as the new data

A2 is the corresponding table that gets the serial number of the key.

A3 replaces the user number field in the new data cursor according to the corresponding table

A4 converts the value of the replaced user number field

A5 uses a loop to take the number from the cursor and append it to the order table file, which is similar to the append of the user information table.

The above is an example of single foreign key numbering, and the same is true for multiple foreign keys, except that there are multiple dimensional tables to deal with. If it is a multi-layer foreign key, then the upper layer does not need to be numbered, as long as the lowest dimension table is numbered, because the upper layer has been indexed in full memory.

The essence of the serial processing of foreign keys is to optimize the method of finding foreign keys, and use the values of foreign keys as serial numbers to find records in the table directly, so the data that has been sequenced by foreign keys can still be calculated in parallel, in the same way as mentioned earlier. I won't go into detail here.

4 same dimension table and master child table

Here we analyze the two cases of the same dimension table and the master and child table together, because the speed-up means of these two cases are the same, that is, orderly merging.

4.1 orderly merger

Let's look at the simple situation first. if both tables are already ordered for the association key, then the merge algorithm can be used directly to deal with the association. Let's look at an example.

Order form

Order number

User number

Seller number

Date of order issued

Order schedule

Order number

Commodity number

Quantity

Amount

Seller information table

Seller number

Name

……

User information table

User number

User name

……

At this time, the order table is the main table, and the order schedule is the child table, which is a typical one-to-many situation. Now, if you want to query the order and its details, associate the two tables according to the order number field. First, let's take a look at the example when the amount of data is small. The goal of the calculation is to summarize the sales of each seller:

one

= file ("order form") .import@b ()

two

= file ("order schedule") .import@b ()

three

= join@m (A1: order, order number; A2: details, order number)

four

= A3.groups (order. Seller number: seller number; sum (details. Amount): total sales)

A1 imports all the order tables into memory.

A2 imports all the order schedule into memory.

A3 associates the two tables by order number through an ordered merge algorithm (@ m option).

A4 summarizes the results of join in groups.

The result of the join operation of the aggregator is different from that of SQL, the result of join in SQL is the field of two tables, while the result of aggregator join is to take the records of the two tables as the result field, so the syntax when doing groups needs to be written as "field. sub-field" (similar to "object. Attribute"), for example, visiting the seller number is written as "order. Seller number".

If the data is too large to be imported into memory, you can use cursors for orderly merging.

one

= file ("order form") .cursor@b ()

two

= file (order schedule). Cursor@b ()

three

= joinx (A1: order, order number; A2: details, order number)

four

= A3.groups (order. Seller number: seller number; sum (details. Amount): total sales)

Note that the premise of an orderly merge here is that the order table and order schedule are already ordered for the order number field.

A1 imports the order table through a cursor

A2 imports the order schedule via cursor

A3 associates two cursors according to the order number through the ordered merging algorithm

A4 summarizes the results of joinx in groups. Similarly, the result field of joinx is also a record, so the access syntax to the seller number in groups becomes the order. The seller's number and the access syntax to the amount becomes the details. The amount.

Orderly merging can also be used with cursor foreign keys, for example, we want to calculate user names with a total consumption amount greater than 1000:

one

= file ("order form") .cursor@b ()

two

= file (order schedule). Cursor@b ()

three

= file (user Information Table). Import@b ()

four

= A1.switch@i (user number, A3: user number)

five

= joinx (A4: order, order number; A2: details, order number)

six

= A5.groups (order. User number. User name; sum (details. Amount): total) .select (total > 1000)

A1 imports the order table through a cursor

A2 imports the order schedule via cursor

A3 imports the user information table into memory

A4 uses the user number field and the user information table for foreign key association

A5 associates two cursors according to the order number through the ordered merging algorithm

A 6 through the user name field (order. User number. The user name (user name) is summarized in groups, and those with a total amount of more than 1000 are selected.

4.2 data preparation for orderly merger

But what if the data is not ordered by the primary key in advance? Then you need to sort it in advance. The same dimension table and the master child table can be sorted in the data preparation stage, because for the association of the same dimension table or the master child table, which field is used, that is, the primary key (part); for the foreign key table, the fact table may have to be associated with multiple dimension tables, and the associated fields may be different each time, and it is impossible for a table to be orderly for all foreign keys at the same time.

Therefore, for the original data in the database that does not guarantee the order, we can sort the data externally at the same time. This section describes how to sort and how to update data in an orderly manner after sorting.

Let's first look at the export of the original data. If the data source of the same-dimensional table or primary child table that you want to sort is a database, sort with the database. If the data source is not a database, you can use the aggregator's sortx function for sorting. After sorting, use the export function to save to a new file. If you want to use segmented parallelism, also note that you add the option @ z when exporting. The process goes like this:

one

= db.query ("select * from order form order by order number") .cursor ()

two

= file ("order form") .export @ z (A1; order number)

three

= db.query ("select * from order schedule order by order number". Cursor ()

four

= file ("order details") .export @ z (A1; order number)

A1, import the order table from the database through a cursor and sort it

A2, write the sorted cursor data to the set file

A3 and A4 also sort the order schedule of the database and write it to the set file.

Let's take a look at what to do if new data is appended to these two tables, let's just take the addition of the order table as an example:

one

= file (order form). Cursor @ b ()

two

= db.query ("select * from order form where order date > = '2018-01-01' order by order number") .cursor ()

three

= [A1 ~ A2] .mergex (order number)

four

= file ("New order form") .export @ z (A1; order number)

A1, import the order table through a cursor

A2 to extract new data generated since 2018 from the database

A3, the two cursors are sequentially merged according to the order number field

A4, write the merged cursor data to a new file.

In the subsequent use, the old order table file is replaced with the new file, so that the orderly merging of the new data and historical data is completed, and can be processed in an orderly manner.

The mixing of new data and historical data is an orderly merging process, which does not need to be reordered, just read and write the data again, and the time cost is not high.

4.3 parallel computing

If the amount of data is really large and the cost of frequent rewriting is too high, you can rewrite all the data every other relatively appropriate cycle, save the data to a smaller file before the cycle point, and merge the small file and the whole history file at the cycle node, and the specific period is set according to the actual business. In this way, there will be two files: the full history file and the small file in the cycle. You can use multiway cursors to access the two files together.

For example, you can plan to rewrite all the data every other month, and the additional data every day can be merged into a small file of the current month, and only this small file and the current day's data will be merged in the middle of the month. At the end of the month, all the files and history files are merged, so that the number of full merging can be reduced and the total processing time can be reduced. In this way, the two files are the historical file and the current month file.

Of course, you can also keep the files from each month as historical data, and then use multiway cursors to access multiple sets of data, which may be better. This is the case with the date as an example, and the segmentation scheme can be designed according to other fields, such as by region, etc.

The following is illustrated by the method of keeping one file each month. First, the processing of the newly generated data on the same day is realized, and the order table is still taken as an example:

one

= file ("order form August") .import@b () .cursor ()

two

= db.query ("select * from order form where order date > = '2018-08 house XX' order by order number". Cursor ()

three

= [A1 ~ A2] .mergex (order number)

four

= file ("New order form August") .export @ z (A1; order number)

A1, import the order form month file for August through a cursor

A2 to extract new data generated since a certain day in August 2018 from the database

A3, the two cursors are sequentially merged according to the order number field

A4, write the merged cursor data to the new August file.

After processing, you can get the order table set file for each month, and in the same way, you can also get the set file for each month's order schedule. The two set files (order and details) of each month are generated according to the order time, and the corresponding master and child table records (order and its corresponding order details) are in the file of the same month. In this way, we can merge the data of each month in an orderly manner to achieve the join of the master and child tables, and further speed up. Taking the statistics of the total sales of sellers as an example, the following is the specific implementation:

one

= 12. (file (order form / ~ / month) .cursor@b ()

two

= 12. (file (order schedule / ~ / month) .cursor@b ()

three

= 12. (joinx (A1 (#): order, order number; A2 (#): details, order number))

four

= A3.mcursor ()

five

= A4.groups (order. Seller number: seller number; sum (details. Amount): total sales)

A1, create an order table cursor for 12 months

A2, create order schedule cursors for 12 months

A3, using joinx to merge 12-month data to get cursors

A4, merged into multiplex cursors

A5, the multipath cursors are grouped and summarized.

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.