Network Security Internet Technology Development Database Servers Mobile Phone Android Software Apple Software Computer Software News IT Information

In addition to Weibo, there is also WeChat

Please pay attention

WeChat public account

Shulou

Performance Optimization of arbitrary combination Summary report of Multi-tier subjects (part two)

2025-04-04 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Internet Technology >

Share

Shulou(Shulou.com)06/03 Report--

2.4 ordered computing scheme

After making full use of the characteristics of traversal for optimization, we may also feel that the computing performance is a little slow, hoping to have room for further optimization. Since only a small part of the total amount of data needs to be taken out at a time (there are about hundreds of item numbers involved in 100 indicators, that is, hundreds of records out of millions of records), what we can usually think of is that if we can use the orderly data to search directly (if the source data is orderly, we can quickly locate these hundreds of records without having to traverse millions of records or more). Will be able to achieve better query efficiency.

We can use the iselect () function provided by the aggregator to search each calculated metric in order to reduce the number of traverses. There are two key points to note here:

1. The iselect () function looks up faster with a single primary key than with multiple primary keys, and it is much easier to write.

2. In data preprocessing, when you encounter multiple primary keys, you should find a way to merge them into one, and sort them after digitization in order to use the iselect () function.

The specific usage of the iselect () function and the explanation of ordered computation are not discussed here, but can be found in the relevant chapters of the aggregator tutorial.

2.4.1 merge primary keys, sort

In order to meet the two key points mentioned above, we need to preprocess the source data again. We have already talked about the principles of grouping to calculate the summary value and using cross-row groups to calculate the cumulative value. Here we mainly talk about merging primary keys and sorting.

First, in the original data, use the "year" and "month" column fields to dynamically calculate a variable value, called "month number", so that it can be merged with the account field to form a unique primary key. The corresponding changes in the code are as follows:

A

B

one

= file ("General Ledger voucher-pre.btx")

two

= file ("General Ledger voucher-mid.btx")

three

= A1.cursor@b ()

> A3.run (year-inityear) * 12 + month): month)

four

= A3.groupx (subject, month: month number; sum (amount): amount)

five

For A4; subject

= A5.run (amount = amount [- 1] + amount)

six

> A2.export@ab (B5 recording 1: subject, # 2: month number, # 3: cumulative amount)

The code for other lattices has been explained earlier, so I won't repeat it here.

B3: first, define the parameter name: inityear in the aggregator, and set the value to 2014, as shown below:

Assume that the original data started in 2014, so set the default value for the initial year to 2014. The so-called "month number" means that the time of each record is the month ordinal from January of the initial year. For example, if the current data record is middle-aged 2017 and the month is 3, then according to the result of this formula: month number = (2017-2014) * 12: 3, that is, the 39th month beginning in January 2014. The result of the calculation is reassigned to the month field using the run () function so that a unique primary key is constructed later with the account.

A4: group by subject and monthly number, and sum the amount (explained earlier)

B5: accumulate the amount field (explained earlier)

B6: the calculated result set is appended to the set file (explained earlier), that is, the general ledger voucher-mid.btx. The execution result is as follows:

The second step is to summarize the amount of the current N digits of the subject; how to calculate the total value of multi-tier accounts has been mentioned above, here we mainly focus on merging the monthly number and the account into the primary key key, and then sort them. The month number is calculated as 2 digits (assuming that the data record span is no more than 99 months), and the subject is fixed 10 digits, so in order to ensure the uniqueness of the merged primary key, the total length of the new primary key needs to be defined as 12 digits.

In this way, the construction rule of the new primary key is: key (12 digits) = month number (2 digits) 10000000000000 + general ledger account (up to 10 digits). There is a trick to explain: here set the length of key to 12 bits, can be stored in a long type, if longer (related to requirements), you will use strings, although it will be relatively slow, but it does not have much impact.

The SPL script for the aggregator is as follows:

A

one

= file ("General Ledger voucher-mid.btx")

two

= file ("General Ledger voucher-later.btx")

three

= A1.cursor@b ()

four

= channel (A3) .groupx ((subject\ 100): subject, monthly number; sum (cumulative amount): cumulative amount summary)

five

= channel (A3) .groupx ((subject\ 10000): subject, monthly number; sum (cumulative amount): cumulative amount summary)

six

= A3.groupx ((account\ 1000000): subject, monthly number; sum (cumulative amount): cumulative amount summary)

seven

= [A6MagneA5.result (), A4.result ()] .conjx ()

eight

= A7.new (# 2100000000000000000000000000000000000000000000000000000000000000000000): accumulative amount summary) .sortx (key)

nine

> A2.export@z (A8)

A1-A3: I've already explained that, so I won't repeat it here.

A4: create a pipeline to push the data from Vernier A3 to the pipeline, where the ch.groupx () function groups the ordered records in the pipeline and returns the pipeline; the first 8 digits and monthly numbers are grouped by subject, and the cumulative amount is summarized. The returned data structure is shown in the following figure:

A5: in the same way, the return pipeline of A4 will be grouped according to the first 6 digits and monthly number of subjects, and the cumulative amount will be summarized. The returned data structure is shown in the following figure:

A6: return the cursor, group the first 4 digits and monthly number according to the subject, and summarize the cumulative amount. The returned data structure is shown in the following figure:

A7: multiple cursor operation results are merged into a single result set, where ch.result () represents the pipeline operation result.

A8: generate a new sequence table for each record of A7, which contains two fields: the month number (2 digits) 10000000000, and then the new account generated after interception (the longest is 10 digits), redefined as key and cumulative amount summary two column fields, and then sort the key.

In particular, after the cs.groupx () function is grouped according to the field, it will sort the field, that is to say, the result after the operation is in order, so we can take advantage of this feature to first group by the month number (write before), and then use the cs.mergex () function to do an orderly merge operation according to the month number and subject, so that the merged results no longer need to be sorted. The corresponding code changes are as follows:

A

...

...

four

= channel (A3) .groupx (monthly number, (subject\ 100): subject; sum (cumulative amount): cumulative amount summary)

five

= channel (A3) .groupx (monthly number, (subject\ 10000): subject; sum (cumulative amount): cumulative amount summary)

six

= A3.groupx (monthly number, (subject\ 1000000): subject; sum (cumulative amount): cumulative amount summary)

seven

= [A6MagneA5.result (), A4.result ()] .mergex (month number, subject)

eight

= A7.new (# 110000000000000000000: cumulative amount summary 3: accumulative amount)

...

...

A9: the calculated result set is exported and saved to the set file, that is, general ledger voucher-later.btx. The data structure is shown in the following figure:

2.4.2 orderly query

In this way, we completed the data preprocessing as required, and then verified the report query in two steps:

1. Define the subroutine: any given index can quickly return the total value of the index, and then call the subroutine many times to complete the calculation of 100 indicators and return the result set.

2. Given any 100 calculated indicators, quickly return the total value of the corresponding indicators.

2.4.2.1 multiple ISELECT queries

First of all, define a subroutine, any given a calculation indicator (perhaps only the first N bits of the item number, such as the first 4 bits / 6 bits / 8 bits / 10 bits, etc., free combination appears), return the total value of this indicator.

Then, the 100 metrics are calculated by calling the subroutine. First, the query parameters are defined. Yyyy represents the query year and mm represents the query month. For example, query the data of January 2017, as shown in the following figure:

An example of calling a subroutine:

A

B

C

one

= inityear=2014

= ((yyyy-inityear) * 12+mm) 10000000000

= file ("General Ledger voucher-later.btx")

two

Func

three

= A2. (B1+)

= B3.sort ()

four

= C1.iselect@b (C3Query key)

five

= B4.fetch ()

six

Return B5.sum (cumulative amount summary)

/ indicator parameter column

seven

= func (A2maeC7)

[1001,1002]

eight

= func (A2maeC8)

[2702,153102,12310105,1122,12310101,12310401,12319001,12310201,12310301,12310501,12310601,12310701,12310801,12319101]

...

...

...

one hundred and seven

Return [A7:A106]

A1: define the variable inityear, assuming that the original data started in 2014, so set the default value to 2014

B1: generate the "month number" according to the same rules as before. If the parameter is January 2017, the execution result is as follows:

C1: data file object after preprocessing

A2-C6: subroutine code. A subroutine is a block of code with the statement func as its nominative, and the result is returned with a return statement. The main function of this subroutine is to return the summary value for any given index.

B3: receive each item number in the parameters, using the month number (month number is 12 digits) plus the current item number to form a set of parameters of the index. For example, if the input parameter is [1214jue 1207], the execution result is as follows:

C3: then sort the set of index parameters B3. The execution result is as follows:

B4: returns the cursor by comparing the parameter set in indicator C3 with the ordered key field in the result set file, where @ b represents reading from the set file.

B5: get the record from the cursor, and the execution result is as follows:

B6: sum the cumulative amount and return the calculation result of the indicator.

C7: parameter condition of indicator A (a set of values intercepted by the first four digits of the account number)

C8: the parameter condition of index B (the parameter set formed by the values intercepted by the first 4 bits / the first 6 bits / the first 8 bits of the subject number), the remaining 98 indicators are calculated in a similar way to A8, and the parameters are written in a similar way to C8, and then to 100.

A7: call the func subroutine, transfer the index parameter values of C7 into the subroutine, and the subroutine returns the result after calculation.

A8: similarly, calculate the result set of indicator B

A107: merges the values of each cell of the A7-A106 (from top to bottom, the calculation results of 100 indicators) and returns a single-column dataset that can be used by the reporting tool.

This can already take advantage of ordered queries, but the iselect () function of the subroutine needs to be executed 100 times to calculate 100 indicators, which is still too much traversal and the coding process is cumbersome.

2.4.2.2 one ISELECT query

So, is there a way to do an iselect query just once?

The answer is yes! We can sort out the item numbers of the 100 indicators that need to be calculated, and then execute the iselect () function to find out the summary results of all the indicators, so that we are done.

Here, there are two key points to note:

1. It is necessary to merge different item numbers in multiple calculation indicators, construct primary keys and sort them.

2. Using the function skill of pos (), according to the primary key constructed by multiple item numbers and month numbers in each calculation index, find the coordinate position in the result set (compared with the key column field), return the position sequence number, then sum the cumulative amount summary field found in the result set according to the position sequence number, and then reverse the sum result back to each index according to the position sequence number, that is, the summary value of each index is calculated.

To make it easier to understand, for example, explain in detail how to use the pos () function to achieve location calculation? The schematic diagram is as follows:

Explanation: all the item numbers of index An and index B are merged, and then sorted uniformly to generate the sequence number, through the sequence number to find the corresponding amount in the ordered result set, and then use the position sequence number to pour the amount back to each index. the sum of the amount of multiple item numbers under each index, that is, the total value of the index.

Finally, an example SPL script for an aggregator that calculates 100 metrics is as follows:

A

B

C

D

E

one

/ parameter variable

= now ()

= ((yyyy-inityear) * 12+mm) * 10000000000

two

[1001,1002,1012]

[2001]

[1101]

[1121,12310106,12310206,12310306,12310406,12310506,12310606,12310706,12310806,12319006,12319106]

[2101]

...

...

...

...

...

...

twenty-one

[221102]

[1221,12310102,12310202,12310302,12310402,12310502,12310602,12310702,12310802,12319002,12319102]

[2221]

[1321,1401,1402,1403,1404,1405,1406,1407,1408,1409,1411,1412,1461,1471]

[1403,147101,1471050100]

twenty-two

= [A2:E21]

= A22. ((C1 +))

= A22.union () .sort ()

twenty-three

= file ("General Ledger voucher-later.btx")

twenty-four

= A23.iselect@b (C22pyrkey)

= A24.fetch ()

= B24. (key)

= B24. (cumulative amount summary)

twenty-five

= A22. (C24.pos@b ())

twenty-six

= A25. (.sum (D24 (~)

twenty-seven

Return A26

= interval@ms (B1 now ())

A22: merge the item numbers of 100 calculated indicators in the range from A2 to E21, where A2 grid represents indicator 1, and so on. After merging, the execution result is as follows:

B22: for each item number in the A22 index, use the month number (2 digits) * 10000000000, plus the current item number, to form a set of parameter groups of the index. The execution result is as follows:

C22: merge the set of metric parameters, and then sort

A23: open the preprocessed set file object

A24: the set of parameter groups constructed in indicator C22 is compared with the ordered key field in the file, and the record is returned as a cursor.

B24: get the record from the cursor and return the result set of all the items queried. The execution result is as follows:

C24-D24: get the summary of key and cumulative amount in the result set respectively.

A25: according to the order of each item number in A22, compare the key column with the primary key constructed by the current account number + month number in the B24 result set, and then return the position sequence number, where @ b in the pos () function represents the use of dichotomy search, which is more efficient, but requires the searched sequence to be orderly. The result of the operation is as follows:

A26: using the sequence number of the coordinate position of each member of A25, the summation is carried out by comparing the accumulated amount summary field found in the result set B24, and the summation result is then reversed to each index according to the position sequence number. for example, the result returned by the serial number 1 represents the summary result of the index 1 queried by the parameters of A2, and the result returned by the serial number 2 represents the summary result of the index 2 queried by the parameters of B2. And so on. That is, the calculation of the total value of each index is completed. The execution result is as follows:

A27: returns the result set for use by the reporting tool.

At this point, using a traversal, get everything done, the problem can be easily solved!

Measured results: it takes about 1-2 seconds for the report to show the whole link from the number to the display, in which the index calculation part takes less than 1 second.

2.5 as a report data source

For the report production process, there is no need to change, just need to switch the data source to the aggregator. Suppose the report tool is to dry the report V5:

First of all, import the excel table sample, create the dataset type as the aggregator, select the dfx script that has been done, and set the corresponding query parameters.

Then, the summary results of each index can be obtained by taking values in each cell of the report in order; for example, the expression of cell C5 is written as follows: = ds1.select (# 1) (1), cell C6 is written as = ds1.select (# 1) (2),. Report cell expressions from top to bottom, and so on The example is shown in the following figure:

2.6 Summary

In the actual report development process, when we encounter problems, we often can not think of the best solution from the beginning. We can try to achieve it in the simplest and easiest way, and then optimize it step by step; compare the shortcomings of each scheme and the performance improvement brought about by the improvement, so as to finally meet the business needs.

We have adopted this approach in this article, and the steps of gradual optimization are as follows:

1. Multiple traversal scheme

2. One-time traversal scheme

3. Summarize the scheme in advance, and optimize the query part on the basis of 2.

4. Ordered calculation scheme

In the whole process, the related technologies of aggregator include: cursor, pipeline, ergodic reuse, data external, grouping subset, cross-row group calculation, ordered calculation / query, dichotomy search, location serial number and so on.

After understanding these concepts and skillfully mastering the functions related to the aggregator, we can write efficient code to quickly achieve the preparation of the report dataset!

Third, the report system structure after introducing the aggregator.

In the actual business, we use this scheme to measure the POC according to the production environment data provided by customers (the original data table is about 6000 million records). The results show that the balance sheet, which used to take 30-40 seconds to present, is now within 1-2 seconds. Moreover, aggregator scripts can be managed together with report templates, thus effectively reducing the complexity of application management.

After pre-processing calculation by using the aggregator, the data cache file formed can well optimize the existing report implementation mode and effectively solve the problem of slow operation of big data report.

The comparison between the original mode and the report system structure after introducing the aggregator is shown in the following figure:

According to the business characteristics of the report, the specific implementation steps are usually as follows:

1. The aggregator extracts data from the data source, and according to the business rules of the report, takes out the required detailed records, such as dimensions, filtering fields, calculation indicators, and so on.

2. The aggregator pre-processes and calculates the detailed records to generate all kinds of indicators needed in the report.

3. The calculated indicators are stored in the form of data cache files, which can be managed by multi-level catalogs according to business types, module relationships and time order, or together with report templates.

4. The report tool calls the aggregator through JDBC, and the calculation result is returned to the report tool for presentation.

5. You can set the scheduled tasks to be executed regularly to complete the above data preprocessing actions.

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

Views: 0

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.

Share To

Internet Technology

Wechat

© 2024 shulou.com SLNews company. All rights reserved.

12
Report