Network Security Internet Technology Development Database Servers Mobile Phone Android Software Apple Software Computer Software News IT Information

In addition to Weibo, there is also WeChat

Please pay attention

WeChat public account

Shulou

Optimization skills of SPL grouping

2025-03-30 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Internet Technology >

Share

Shulou(Shulou.com)06/03 Report--

1. General grouping

Groups can be used when the result set of group aggregation is small. For example, there are student scores stored in the set file, the table structure is as follows:

ScoresclassstudentIDsubjectscore...

Now calculate the total score of each student:

A1=file ("scores.btx") .import@b () 2=A1.groups (studentID; sum (score): TotalScore)

A1: the data of the score sheet is loaded into memory.

A2: group according to the studentID field and calculate the total score.

If the score table data is too large to be loaded into memory, you can use cursors:

A1=file ("scores.btx") .cursor@b () 2=A1.groups (studentID; sum (score): TotalScore)

A1: get the cursor of the score sheet.

A2: group according to the studentID field and calculate the total score.

When there are many records in the score sheet and the result set of group calculation is very large, use groupx:

A1=file ("scores.btx") .cursor@b () 2=A1.groupx (studentID; sum (score): TotalScore;10000)

A1: get the cursor of the score sheet.

A2: group according to the studentID field and calculate the total score.

The last parameter of groupx is the number of buffer rows, that is, the number of records in memory that can be processed at one time, which is specified as 10000, which should be adjusted according to your own situation.

two。 Ordered grouping 2.1. Front semi-ordered

Groups@o can be used when the data pairs of grouped fields are in order. As in the example above, when the score sheet is in order for the studentID field, you can add the option o:

A1=file ("scores.btx") .import@b () 2=A1.groups@o (studentID; sum (score): TotalScore)

If the score sheet is too large to load memory, you can use cursors:

A1=file ("scores.btx") .cursor@b () 2=A1.groups@o (studentID; sum (score): TotalScore)

Similarly, if the result set of ordered grouping is large, groups cannot be used, and group is used:

A1=file ("scores.btx") .cursor@b () 2=A1.group (studentID; sum (score): TotalScore)

Unlike groups, the return of group is a cursor.

2.2. The latter semi-ordered

The "second half" order means that the table T to be grouped has ordered the fields aand b, and now we are going to group the table T by field b. For example, there is a set file sale.btx that keeps sales records. The table structure is as follows:

SalesdateemployeeIDamount...

The sales table is ordered by the fields date,employeeID. Now you need to calculate the total sales of each employee and group the employeeID field. In this case, the regular grouping will calculate hash, but in this case, you can use groups@h to take a look at the implementation:

A1=file ("sale.btx") .import@b () 2=A1.groups@h (employeeID; sum (amount): TotalAmount)

If the sales table is too large to load memory, you can use cursors:

A1=file (sale.btx). Cursor@b () 2=A1.groups@h (employeeID; sum (amount): TotalAmount) 3. Serial number grouping

You can use groups@n when the result of a grouping expression is an ordinal. Still using the sales table in the above example, we now need to calculate the monthly sales, and the implementation is as follows:

A1=file (sale.btx) .cursor@b () 2=A1.groups@n0 (month (date): MONTH;sum (amount): TotalAmount)

A1: get the cursor for the sales table.

A2: convert the date field to a month and calculate the total sales for each month; if there is a case where the date field is null, add the option @ 0.

After using @ n, it is grouped according to the sequence number, instead of calculating the HASH, so it is faster than without @ n.

Using 300 million data tests, the actual test results of this case:

It takes time (seconds) to have @ n without @ n39494. Filter split

Filter splitting refers to grouping according to conditions, dividing into two sets that meet and do not meet the conditions. This split can be done using select, but iterated twice, once select (conditional true), and once select (conditional false).

Using group and align only needs to be traversed once. Take the above score sheet as an example. Now we want to divide those who pass and those who fail into two groups. Let's take a look at the implementation of group:

A1=file ("scores.btx") .import@b () 2=A1.group (score > = 60)

It is important to note that using group filtering to split the result is not necessarily two sets, it is possible that the result is only one set, as in this case, there may be scores greater than 60. At this point, you need to determine the number of result sets, or use align, because the result of align must be two sets, there is no qualified record, there will be an empty set. Take a look at the implementation of align:

A1=file ("scores.btx") .import@b () 2=A1.align@a ([true,false], score > = 60)

If you want to write the unsatisfied to the file in the filter split, you can use select (x _ share file) and take a look at the implementation:

A1=file ("scores.btx") .cursor@b () 2=A1.select (score > = 60 world file ("Fail.btx")) 3=A2.fetch ()

After the completion of the implementation, the grade in A3 is passed, and those who fail will be saved to the set file Fail.btx.

If you want to split into multiple sets, you can use groupn. For example, to divide grades into sets of excellent, good, passing and failing grades can be achieved as follows:

A1=file ("scores.btx"). Cursor@b () 2 = [file ("Excellent.btx"), file ("Good.btx"), file ("Pass.btx"), file ("Fail.btx")] 3=A1.groupn (if (score > = 90:1, score > = 80:2, score > = 603file); A2) 4=A3.fetch () 5. Intra-group TopN

Top can also be used in group, such as calculating the maximum salary in each department:

A1=file ("employee.btx"). Cursor@b () 2=A1.groups (dept;top (- 1): MaxSalary)

A1: get the cursor of the employee table.

A2: grouped according to the dept field, and then find the maximum salary value in the group, named topSalary.

The MaxSalary field of the A2 execution result is a numeric value. If you want to query the information of the employee with the highest salary, you can write it as follows:

A1=file ("employee.btx"). Cursor@b () 2=A1.groups (dept;top (- 1): MaxSalary)

Top can not only return the maximum value, but also get the first n values, so write 1 as n. For example, calculate the information of the top three salaried employees in each department:

A1=file ("employee.btx"). Cursor@b () 2=A1.groups (dept;top (- 3): topSalary) 6. Parallelism

When using groups for grouping calculations, you can also further improve performance in parallel, with the option @ m added.

As in the example above, calculate the total sales of each employee and take a look at the implementation of the parallel approach:

A1=file ("sale.btx") .import@m () 2=A1.groups@m (employeeID; sum (amount): TotalAmount)

If the data is too large to be loaded into memory, you can use multiway cursors:

A1=file ("sale.btx") .cursor@mb () 2=A1.groups@m (employeeID; sum (amount): TotalAmount)

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

Views: 244

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.

Share To

Internet Technology

Wechat

© 2024 shulou.com SLNews company. All rights reserved.

12
Report