In addition to Weibo, there is also WeChat
Please pay attention
WeChat public account
Shulou
2025-01-28 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Internet Technology >
Share
Shulou(Shulou.com)06/03 Report--
The master sub-table is one of the most common relationships in the database, the most typical include contracts and terms, order and order details, insurance policies and policy details, bank accounts and account pipelining, e-commerce users and orders, telecommunications accounts and billing lists or traffic details. When the amount of data in the master and child table is large, the performance of association computing will decrease sharply, which will seriously affect the user experience while increasing the server load. As a process-oriented structured data computing language, aggregator SPL can significantly improve the performance of large master sub-table association computing through orderly merging.
I. principle explanation
The so-called master-child table association calculation is to find the corresponding batch of records in the child table according to the associated field for each record of the main table. Take the order (main table) and the order detail (child table) as examples, with the order ID as the associated field. The following figure shows the processing of a record in the main table during association calculation. The red arrow indicates that the corresponding record has not been found (not associative), and the green arrow indicates that the corresponding record has been found (associable):
Suppose that the order (main table) has m records and the order details (child table) has n records. Without considering the optimization algorithm, the association of each record in the main table needs to traverse the sub-table, and the corresponding time complexity is O (n). The main table has a total of m records, so the complexity of the whole calculation is O (m records), which is obviously too high. Although the database is generally optimized by the hash scheme, when the amount of data is large or more tables are associated, there is still a temporary difficulty in parallelism and the use of external memory to cache data, and the performance will still deteriorate sharply.
As for the aggregator, for the large master sub-table association algorithm, significant optimization can be achieved through two steps: data ordering and merge association.
L data ordering
First of all, the main table and child table are sorted by the associated fields to form ordered data.
L merge association
First, the pointer is used to point to the first record on the main table and the child table respectively, and then the comparison is started. for the first record of the main table, if the child table encounters a matching record, it means that the pointer of the child table can be associated, and the pointer of the subsequent child table can be moved forward; if a mismatched record is encountered, it means that the correlation calculation of the first record of the main table is completed, and the pointer of the child table is fixed, and the pointer of the main table is moved down one bit to point to the second record. And so on.
After optimization, the association calculation of a single record can be illustrated by the following figure:
As you can see, after optimization, the association of a single record in the main table only needs to compare part of the data, no longer need to traverse the child table. In fact, the association of all records in the main table traverses the child table once, that is, the complexity is O (n). Plus the main table itself is traversed once, so the complexity of the whole calculation is O (masking n).
In this way, after the aggregator optimization, the time complexity of the algorithm becomes linear, and there is no need to generate landing intermediate data, so the performance is naturally greatly improved.
Of course, it should be noted that ordering itself can be time-consuming, so this optimization method is not suitable for association algorithms that are done only once. However, in the actual business, the association algorithm is usually executed repeatedly, and the cost of ordering is one-time and can be ignored.
II. Concrete realization
The following is to take the order and order details as an example to illustrate how the aggregator optimizes the association of master subtables.
Start by ordering the data (note that this is an one-time action). The aggregator script "data ordering .dfx" is as follows:
A
B
one
= connect ("orcl")
two
= A1.cursor ("select order ID, customer ID, order date from order order by order ID")
= A1.cursor ("select order ID, product ID, unit price, quantity from order detail order by order ID, product ID")
three
= file ("order .ctx") .create (# order ID, customer ID, order date)
= file ("order details .ctx") .create (# order ID,# product ID, unit price, quantity)
four
= A3.append (A2)
= B3.append (B2)
five
= A1.close ()
A1 connects to the Oracle data source, A5 closes the data source. The aggregator can connect to most commonly used data sources, including databases, Excel, Ali Cloud, SAP, and so on.
A2, B2: use SQL statement to take order and order details respectively, and sort by related fields. Because of the large amount of data, it can not be read into memory at once, so the cursor function cursor is used here.
A3, B3: create group table files "order .ctx" and "order details .ctx" respectively, which are used to store ordered data. You need to specify the field name here, where the field with the # sign is the primary key. The data is sorted by the primary key, and the value of the primary key cannot be repeated.
A4-B4: appends cursors to the group table file.
Second, for association algorithms that are usually executed repeatedly, you can use the aggregator script merge Association .dfx to implement the following:
A
B
one
= file ("order .ctx") .create () .cursor (order ID)
= file ("order details .ctx") .create () .cursor (order ID, quantity)
two
= joinx (A1: main table, order ID; B1: child table, order ID)
three
= A2.groups (; sum (child table. Quantity))
A1, B1: read the group table files "order .ctx" and "order details .ctx". Note that the group table defaults to column storage, so you only need to read in the fields needed for subsequent calculations, thus significantly reducing the Icano.
A2: merge and associate ordered cursors A1 and B1, where "main table" and "child table" are aliases to facilitate subsequent references. If aliases are omitted, subsequent references can be made through default aliases of _ 1 and _ 2. Note that the function joinx is associated internally by default, with the option @ 1 to specify the left association, or @ f to specify the full association. If there are multiple cursors to be associated with A1, they can be separated by semicolons.
A3: follow-up calculation of the association results, such as summarizing the number of products. In fact, subsequent calculations can support any algorithm, and it is beyond the scope of this article.
The above describes how to write the SPL script of the aggregator, but in the actual execution, you also need to deploy the running environment of the aggregator. There are two deployment options: embedded deployment and stand-alone deployment.
L embedded deployment
In embedded deployment, the use of aggregators is similar to that of embedded databases, and applications use aggregator drivers (JDBC) to execute aggregator scripts under the same JVM.
Here is the code for Java to call merge Association .dfx
1. Com.esproc.jdbc.InternalConnection con=null
2. Try {
3. Class.forName ("com.esproc.jdbc.InternalDriver")
4. Con = (com.esproc.jdbc.InternalConnection) DriverManager.getConnection ("jdbc:esproc:local://")
5. ResultSet rs = con.executeQuery ("call merge Association ()")
6.} catch (SQLException e) {
7. Out.println (e)
8.} finally {
9. If (contextual null) con.close ()
10.}
In the above JAVA code, the aggregator script is saved as a file, and the call syntax is similar to a stored procedure. If the script is simple, you can write the expression directly without saving the script file, and the calling syntax is similar to SQL, and line 5 can be written as follows:
ResultSet rs = con.executeQuery ("= joinx (file (\" order .ctx\ ") .create () .cursor (order ID), order ID; file (\" order details .ctx\ ") .create () .cursor (order ID, quantity), order ID) .groups (; sum (_ 2. Quantity)")
This article describes in detail the process of JAVA calling the aggregator: http://doc.raqsoft.com.cn/esproc/tutorial/bjavady.html
In addition to using Java code, you can also access the aggregator through the report, and then follow the method of accessing the general database. For more information, please refer to "making the Birt report script data source simple and powerful".
The script "data ordering .dfx" can be executed in the same way. However, this script is usually executed only once, so it can also be executed directly on the command line. The use of windows is as follows:
D:\ raqsoft64\ esProc\ bin > esprocx data ordering. Dfx
The usage under Linux is similar. You can refer to http://doc.raqsoft.com.cn/esproc/tutorial/minglinghang.html.
L standalone deployment
In stand-alone deployment, the aggregator is used similar to a remote database, and the application system can use the aggregator driver (JDBC or ODBC driver) to access the centralized server. In this case, the application system and the aggregator server are usually deployed on different machines.
For example, if the IP address of the aggregate server is 192.168.0.2 and the port number is 8281, then the JAVA application system can be accessed by the following code:
St = con.createStatement ()
St.executeQuery ("= callx (\" merge association .dfx\ "; [\" 192.168.0.2 8281\ "])")
For more information on the deployment and use of centralized computing servers, please refer to http://doc.raqsoft.com.cn/esproc/tutorial/fuwuqi.html
For JDBC and ODBC-driven deployment methods, please refer to
Http://doc.raqsoft.com.cn/esproc/tutorial/jdbcbushu.html
Http://doc.raqsoft.com.cn/esproc/tutorial/odbcbushu.html
III. Multithreading optimization
The previous introduction of the basic optimization ideas and implementation methods, that is, for the optimization of the data itself. In reality, the servers are multi-core CPU, so we can further optimize the above algorithms with multi-thread.
The principle of multithreading optimization is that the main table and child table are divided into N segments, and N threads are used for associated calculation at the same time.
Although the principle is simple, when it is really implemented, you will find many difficult problems:
L segmented efficiency
If you want to divide the data into N segments, you must first find the starting line number of each segment. If you count the line numbers in a stupid way of traversing, it will obviously consume a lot of Icord O resources in vain.
L data span
In theory, child table records with the same value of associated fields should be assigned to the same section. If the child table is segmented at will, it is likely to form cross-segment data.
L segment alignment
Further, in theory, the paragraph I data of the subtable should be aligned with the paragraph I data of the main table, that is, the range of associated field values of the master subtable should be the same. If the two are segmented independently, it may make it difficult to align the segmented data.
L secondary calculation
If subsequent calculations do not involve aggregation, for example, just filtering, then you only need to merge the results of the N threads directly. However, if subsequent calculations involve aggregation, such as sum or group summarization, then the second calculation aggregation should be done separately.
Fortunately, the aggregator has fully solved the above problems, segmentation will not consume IO resources, records with the same associated field values will be divided into the same segment, child tables and main tables will be aligned, and various secondary calculations do not need to be implemented separately.
Specifically, first of all, the data ordering script needs to be modified as follows (the red font is the modified part):
A
B
one
= connect ("orcl")
two
= A1.cursor ("select order ID, customer ID, order date from order order by order ID")
= A1.cursor ("select order ID, product ID, unit price, quantity from order detail order by order ID, product ID")
three
= file ("order multithreaded .ctx") .create (# order ID, customer ID, order date)
= file ("order details multithreaded .ctx") .create (# order ID,# product ID, unit price, quantity; # order ID)
four
= A3.append (A2)
= B3.append (B2)
five
= A1.close ()
B3: when generating order detail multithreaded .ctx, the data is segmented by "# order ID". This will ensure that the order ID has the same record and will be assigned to the same section in the future.
The script associated with the merge needs to be modified as follows:
A
B
one
= file ("order multithreaded .ctx") .create () .cursor@m (order ID)
= file ("order details multithreaded .ctx") .create () .cursor@m (order ID, quantity;; A1)
two
= joinx (A1: main table, order ID; B1: child table, order ID)
three
= A2.groups (; sum (child table. Quantity))
A _ 1 _. The number of threads is the default value, which is determined by the system parameter "maximum number of parallelism", or can be modified manually. For example, if you want to generate a 4-thread cursor, A1 should be written as follows:
= file ("order multithreaded .ctx") .create () .cursor@m (order ID;; 4)
B1: multithreaded cursors are also generated and aligned with the multithreaded cursors of A1.
A2-A3: merge and associate, and then execute the subsequent algorithm. There is no change in the writing of these two steps, but the bottom layer will automatically carry out multi-thread merging and secondary calculation, thus reducing the programming difficulty of the programmer.
IV. Structural optimization
On the basis of the previous algorithm, the computing performance can be further improved, that is, the data is stored in a hierarchical structure and the association relationship is recorded directly.
Specifically, first generate a group table file with "structural optimization ordering .dfx":
A
B
one
= connect ("orcl")
two
= A1.cursor ("select order ID, customer ID, order date from order order by order ID")
= A1.cursor ("select order ID, product ID, unit price, quantity from order detail order by order ID, product ID")
three
= file ("multi-tier order .ctx") .create (# order ID, customer ID, order date)
four
= A3.append (A2)
= A3.attach (order details, # product ID, unit price, quantity)
five
= B4.append (B2)
six
= A1.close ()
B4: add a child table to the main table and name it order details. Unlike the primary table, the child table inherits the primary key of the primary table by default, so you can omit the order ID and just write another primary key product, ID. In this way, the two tables are written in a group table file, thus forming a hierarchical structure.
B5: write data to the child table.
At this point, the group table "multi-tier order .ctx" will be stored in a hierarchical structure, as shown in the logical diagram:
# order ID
# Product ID
Unit price
Quantity
Customer ID
Order date
10248
VINET
2012-07-04
seventeen
fourteen
twelve
forty-two
nine
ten
seventy-two
thirty-four
five
10249
TOMSP
2012-07-05
fourteen
eighteen
nine
fifty-one
forty-two
forty
10250
HANAR
2012-07-08
forty-one
seven
ten
fifty-one
forty-two
thirty-five
sixty-five
sixteen
fifteen
...
...
...
...
...
...
As you can see, each master table record is logically closely related to the corresponding child table record, and there is no need for additional association, which can greatly improve the performance of the association algorithm.
When making association calculations, use the following script, "structure Optimization merge Association .dfx":
A
B
one
= file ("multi-tier order .ctx") .create ()
= A1.attach (order details)
two
= A1.cursor@m (order ID)
= B1.cursor@m (order ID, product ID)
three
= joinx (A2: main table, order ID; B2: child table, order ID)
four
= A3.groups (; sum (child table. Quantity))
A1, B1: open the main table and the child tables attached to the main table.
A2, B2: read the main table and child table in a multi-threaded way. It should be noted that there is a natural correlation between the real tables in the multi-tier group table, so there is no need to specify the segmentation relationship between the child table and the main table, and the code is clearer and simpler than before.
A3PowerA4: merge and associate and execute the subsequent algorithm, these two steps remain the same.
V. data update
The previous optimization methods are all based on the case that all database tables are exported to group table files, but the database tables always change in the actual business, so the problem of data update needs to be considered, that is, the changed data should be updated to the group table file regularly.
Obviously, updating data should be done when no one queries group table files, usually in the middle of the night or in the early hours of the morning. The frequency of updates needs to be set according to the real-time requirements of the data, such as once a day or once a week. As for the way of updating, it needs to be considered according to the law of changes in the data, the most common is the addition of data, sometimes it will encounter additions, deletions and changes.
Let's take a look at the data append:
Orders and order details generate new records every day, assuming that the records added yesterday need to be appended to the group table file at 2: 00 a. M. every day. The following figure shows the new record on 2018-11-23. Note that some orders (order ID:20001) do not have corresponding order details:
Order form
Order schedule
Order ID customer ID order date 19999APK2018/11/2220000APK2018/11/2320001APJ2018/11/2320002APL2018/11/2320003APP2018/11/24
Order ID products ID unit price quantity 199991757.1151999916204.516200001364.282000014640.22200021615.242000319245.25
The script "add group file .dfx" to append the master child table to the group table file is as follows:
A
B
two
= begin=datetime (elapse (date (now ()),-1))
= end=elapse (begin,1)
three
= connect ("orcl")
four
= A3.query@x ("select order. Order ID Master order ID, order detail. Order ID Sub-order ID, Product ID, Unit Price, quantity, customer ID, order date from order left join order detail on order. Order ID= order detail. Order ID where order date >? and order date =?", begin)
four
= file ("multi-tier order .ctx") .create ()
= A4.attach (order details)
five
= A4.cursor@m (order ID)
= B4.cursor@m (order ID, quantity)
six
= joinx (A5: main table, order ID; B5: child table, order ID)
seven
= A6.groups (; sum (child table. Quantity): total)
eight
= (A3 | A7) .groups (; sum): total)
A1: calculate the starting time of the day, that is, 2018-11-26 00:00:00.
A3: correlation calculation is carried out for the new data generated by the database on the same day. Due to the small amount of data for the day, performance is acceptable.
A4-A7: perform high-performance association calculation based on the historical data of group table files.
A8: merge the day and history, and make a second calculation to get the final result. The symbol | indicates vertical merging, which is the key to hybrid computing. In fact, this writing also shows that the aggregator supports mixed computing between arbitrary data sources, such as between Excel and elasticSearch.
For more details on the calculation of Tread0, please refer to the related article "implementation Scheme of Real-time report Thum0".
Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.
Views: 0
*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.
Continue with the installation of the previous hadoop.First, install zookooper1. Decompress zookoope
"Every 5-10 years, there's a rare product, a really special, very unusual product that's the most un
© 2024 shulou.com SLNews company. All rights reserved.