In addition to Weibo, there is also WeChat
Please pay attention
WeChat public account
Shulou
2025-01-28 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Internet Technology >
Share
Shulou(Shulou.com)06/03 Report--
Background of a problem
Let's start with a balance sheet:
This is a typical Chinese complex report format, and its complexity does not lie in the layout, but in that each cell of the "final balance" is an indicator that needs to be calculated independently, and there is almost nothing to do with each other. in fact, it is a summary list of various indicators, and these indicators often have as many as hundreds.
In the source data table structure, there is a field called account, whose length is always fixed 10 digits, such as 1234567890, as shown in the following figure:
The value of the account field is actually a hierarchical code, and the hundreds of indicators in the previous table are the statistical results of account data at different levels according to the demand. The specific method is to determine the level by intercepting the first few bits of the account. Then freely combine according to the demand, filter as a condition, and finally accumulate the amount field.
For example, the list of subjects corresponding to the calculation index An is [1001jue 1002], which represents all the subjects with the first four positions of 1001 and 1002, which are written in SQL: select sum (amount) from T1 where concat (year, month) A1.close ()
A1: connect to the database demo
A2: return from creating database cursors based on sql
A3: where the set file is saved
A4: export the entire table data and save it to a file, where the @ b option of the export () function is written to the set file, that is, the general ledger voucher-pre.btx
A5: close the database connection
2. Traverse the source data at once and calculate 100 metrics. The SPL script of the aggregator is as follows:
A
B
C
one
= file ("General Ledger voucher-pre.btx")
/ indicator parameter column
two
= A1.cursor@b ()
three
= A2.select (concat (year, month) A2.export@b (A4)
A1: open the set file object of the original data table before preprocessing
A2: where the set file of the intermediate result data is saved after calculation
A3: returns from file creation cursors, where the @ b option of the cursor () function represents a read from the set file
A4: first grouped by subject, year and month, and sum up
A5: the calculation result of performing A4 is written to the set file, where the export () function uses the option of @ b, and @ b represents the written set file format, that is, general ledger voucher-mid.btx.
2.3.2 using cross-line groups to calculate cumulative values
Our final problem is to calculate the final value of the indicator, that is, the cumulative value of the amount as of a certain date; the previous step is to calculate the total value of the amount of the current month, so how to calculate the cumulative value of the amount?
The aggregator provides syntax for cross-line references, and you can use A [- 1] to represent the An of the previous row, so that you can calculate: cumulative value = cumulative value of the previous row + current row value.
In the script, then make the following modifications in the previous step to calculate the cumulative value:
A
B
one
= file ("General Ledger voucher-pre.btx")
two
= file ("General Ledger voucher-mid.btx")
three
= A1.cursor@b ()
four
= A3.groupx (subject, year, month; sum (amount): amount)
five
For A4; subject
= A5.run (amount = amount [- 1] + amount)
six
> A2.export@ab (B5 recording 1: subject, # 2: year, # 3: month, # 4: cumulative amount)
The other grid code has been explained above, so I won't repeat it here.
A5: use the for loop cursor A4, where the semicolon parameter "account" indicates that each time a set of records with the same item value are read from the cursor. Let's step through it and return the record of a particular subject:
Then execute a for loop to return the records for the next set of subjects:
B5: accumulate the amount for the records of the same account taken out; where the expression: amount = amount [- 1] + amount, the amount represents the amount of the current bank, and the amount [- 1] represents the cumulative amount of the previous row. Add and calculate and then re-assign to the amount field. The following figure shows the change of the result after the execution of the A5 grid:
B6: the result of the calculation is written to the set file. The export () function uses the option @ ab, and @ b represents the set file format. Because in the for loop, it needs to be executed many times, so use @ a to gradually save the results to the file in an additional way to ensure the integrity of the file; that is, the general ledger voucher-mid.btx. Some of the execution results are shown below:
2.3.3 construct the total value of multi-tier subjects
Now that we have calculated the cumulative value of the detailed accounts, we also need to calculate the summary value corresponding to the high-level accounts (the first N digits).
As can be seen from the requirements, each calculation index intercepts the first 4 bits, the first 6 bits and the first 8 bits according to the subject as a set of parameters, so it is necessary to match this rule when constructing different levels of subject numbers. thus the aggregation values of different levels are calculated.
For example, if the account is 1234567890, then you need to add the summary amount corresponding to the account numbers 1234, 123456, and 12345678. That is, for each 1234567890 such 10-digit subject number, we also need to add 4, 6, 8-digit subjects 1234, 123456, 12345678 respectively. Among them, subject 1234 will accumulate the amount values of all accounts beginning with 1234, and so on. In fact, this is the common means of CUBE.
It should be noted that based on the calculation results of the previous step, the amount of data needs to be discussed in two cases:
1. The results can be directly read into memory to participate in the next step of calculation.
2. The result is still very large, and external memory calculation is needed (Vernier technology can be calculated while reading, and multiple calculations also need pipeline technology to cooperate)
2.3.3.1 memory calculation
If the result set can be fully loaded into memory, an example of a multi-level account summary value constructed by the aggregator SPL script is as follows:
A
one
= file ("General Ledger voucher-mid.btx")
two
= file ("General Ledger voucher-later.btx")
three
= A1.import@b ()
four
= A3.groups ((account\ 100): subject, year, month; sum (cumulative amount): cumulative amount summary)
five
= A4.groups ((account\ 100): subject, year, month; sum (cumulative amount summary): cumulative amount summary)
six
= A5.groups ((account\ 100): subject, year, month; sum (cumulative amount summary): cumulative amount summary)
seven
= [A6, A5, A4] .conj ()
eight
> A2.export@z (A7)
A1: open the set file object of the intermediate calculation result
A2: where the calculated result set file is saved
A3: the content read from the file object A1 is returned as a record forming the result set; where @ b represents reading from the set file.
A4: group the top 8 digits by subject, year and month, and summarize the cumulative amount. If you intercept the top 7 digits, you need to write as follows: (account\ 1000). The specific number of bits to intercept is determined by the demand scenario. The execution result is as follows:
A5: on the basis of the result set of A4, the current 6 digits, years and months are grouped by subject\ 100, and the cumulative amount is summarized. The implementation result is as follows:
A6: similarly, on the basis of A5, the current 4 digits, years and months are grouped according to the subject\ 100, and the cumulative amount is summarized. The implementation result is as follows:
A7: multiple result sets are merged into one result set
A8: the calculated result set is exported and saved to a file, where the export () function uses the option @ z to represent the segmented writing to the set file, that is, general ledger voucher-later.btx.
2.3.3.2 out-of-memory calculation (Vernier + pipeline)
In the previous example, we have used cursors, which need to be emphasized that cursors can only be moved forward and backward one-way, performing a traversal calculation, and only the cs.fetch () function in the resulting cursor can effectively get the data. After traversal, other cursors generated during the calculation will not be able to read the data again.
However, sometimes, in the process of reading data, we need to calculate multiple results at the same time, so we need to use a pipeline similar to a cursor, and use channel (cs) to set up a pipeline to press the data of the cursor cs into the pipeline while traversing to perform other operations.
Compared with memory, the speed of out-of-memory is much slower, so it is necessary to minimize hard disk access, so we use the cursor + pipeline mechanism to traverse at one time to get the required summary results:
A
one
= file ("General Ledger voucher-mid.btx")
two
= file ("General Ledger voucher-later.btx")
three
= A1.cursor@b ()
four
= channel (A3) .groupx ((subject\ 100): subject, year, month; sum (cumulative amount): cumulative amount summary)
five
= channel (A3) .groupx ((subject\ 10000): subject, year, month; sum (cumulative amount): cumulative amount summary)
six
= A3.groupx ((account\ 1000000): subject, year, month; sum (cumulative amount): cumulative amount summary)
seven
= [A6MagneA5.result (), A4.result ()] .conjx ()
eight
> A2.export@z (A7)
A1-A3: I've already explained that, so I won't repeat it here.
A4: create a pipeline to push the data from cursor A3 to the pipeline, where the ch.groupx () function groups the ordered records in the pipeline and returns the pipeline; the first 8 digits, years and months are grouped by subject, and the cumulative amount is summarized
A5: in the same way, the return pipeline of A4 will be grouped according to the first 6 digits, year and month, and the cumulative amount will be summarized.
A6: return the cursor, group the first 4 digits, year and month by subject, and summarize the cumulative amount
A7: multiple cursor operation results are merged into a single result set, where ch.result () represents the pipeline operation result.
A8: the calculated result set is exported and saved to the set file, that is, general ledger voucher-later.btx
2.3.4 optimize the scheme of "once traversal"
After the above two steps of data preprocessing, the result data can be directly used as the data source of the report, as long as the calculation conditions of each index can be compared equally, and there is no need to intercept and calculate the first few bits.
So based on the aforementioned "once traversal" scenario, let's do some optimization; the sample SPL script for the aggregator is as follows:
A
B
C
one
= file ("General Ledger voucher-later.btx")
/ indicator parameter column
two
= A1.cursor@b ()
three
= A2.select (concat (year, month)
Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.
Views: 0
*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.
Continue with the installation of the previous hadoop.First, install zookooper1. Decompress zookoope
"Every 5-10 years, there's a rare product, a really special, very unusual product that's the most un
© 2024 shulou.com SLNews company. All rights reserved.