Updated 2025-02-28. Source: Shulou (Shulou.com), SLTechnology News&Howtos.
Unlike traditional programming languages, SPL makes pervasive use of sets. Its most common data types, sequences and table sequences, are essentially sets and support genuine set operations, which greatly improves development efficiency and code performance. When using SPL, it therefore pays to understand the concept of sets well.
1 Sequences and sets in SPL
In SPL, a sequence is a basic data type as commonly used as integers and strings, with its own basic operations. Viewed as sets, two sequences A and B support the basic set operators: intersection A^B, union A&B, concatenation A|B, and difference A\B. Understanding these operators well and using them fluently encourages set-oriented thinking when solving problems: it makes fuller use of the data already at hand and leads to more direct reasoning and simpler, clearer code.
The following example shows how to use set operations to simplify code:
A1  =demo.query("select EID, NAME, SURNAME, GENDER, STATE from EMPLOYEE")
A2  =A1.select(GENDER=="M")
A3  =A1.select(STATE=="California")
A4  =A2^A3
A5  =A1.select(GENDER=="M" && STATE=="California")
A6  =A2&A3
A7  =A1.select(GENDER=="M" || STATE=="California")
A8  =A2\A3
A9  =A1.select(GENDER=="M" && STATE!="California")
In this code, A4, A6, and A8 use set operations to find, respectively, the male employees in California, the employees who are male or in California, and the male employees who are not in California. These forms are much more concise than the equivalent conditional queries in A5, A7, and A9.
Note, however, that although A6 and A7 contain the same employee records, the records appear in a different order in the two results.
The reason is that, unlike mathematical sets, sets in SPL are ordered and may contain duplicate members. Sequences, table sequences, and record sequences are all ordered sets of this kind.
A1  [1,2,3,4]
A2  [1,3,3,2]
A3  =[1,2,3]==[1,3,2]
In the table above, the sequence in A2 has duplicate members. The two sequences compared in A3 have the same members in a different order, so they are not considered equal and the comparison returns false.
In addition, in mathematics the intersection and union of sets are commutative, that is, A ∩ B = B ∩ A and A ∪ B = B ∪ A. Because sets in SPL are ordered, the commutative law does not hold: the result of an intersection or union follows the order of the left operand.
A1  [1,2,3]
A2  [3,1,5]
A3  =A1^A2
A4  =A2^A1
A5  =A1&A2
A6  =A2&A1
A3 returns [1,3] while A4 returns [3,1]; A5 returns [1,2,3,5] while A6 returns [3,1,5,2].
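The order dependence described above can be imitated outside SPL. The following is a minimal Python sketch, not SPL code: the function names isect, union, diff, and conj are chosen here for illustration, and the sketch only covers sequences without duplicate members.

```python
# Python analogy of SPL's ordered set operators on lists.
# Function names are illustrative only; they are not SPL API calls.

def isect(a, b):
    """Like A^B: members of a that also occur in b, in a's order."""
    return [x for x in a if x in b]

def union(a, b):
    """Like A&B: all of a, then the members of b not already in a."""
    return a + [x for x in b if x not in a]

def diff(a, b):
    """Like A\\B: members of a that do not occur in b."""
    return [x for x in a if x not in b]

def conj(a, b):
    """Like A|B: plain concatenation, duplicates kept."""
    return a + b

a, b = [1, 2, 3], [3, 1, 5]
print(isect(a, b), isect(b, a))  # [1, 3] [3, 1]
print(union(a, b), union(b, a))  # [1, 2, 3, 5] [3, 1, 5, 2]
```

In each case the left operand decides the order of the result, which is why swapping the operands changes the output.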
Because a sequence in SPL is an ordered set, the comparison operator == cannot be used to test whether two sequences have the same members. Use the function A.eq(B) instead:
A1  =[1,2,3]==[3,2,1]
A2  =[1,2,3]==[3,2,1].sort()
A3  =[1,2,3].eq([3,2,1])
A4  =[1,2,3].eq([3,2,2])
A5  =[1,2,2,3].eq([3,2,1,2])
A6  =[1,2,2,3].eq([3,2,3,1])
A1 and A2 test whether two sequences are equal: A1 returns false and A2 returns true.
This is because the sort function in A2 reorders the second sequence so that its members are in the same order as those of the first.
A3, A4, A5, and A6 all use A.eq(B) to test whether two sequences have the same members. A3 and A5 return true; A4 and A6 return false.
If two sequences have exactly the same members, they are said to be permutations of each other. In particular, a member that is duplicated in one sequence must occur the same number of times in the other.
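The rule above amounts to comparing the two sequences as multisets. A minimal Python sketch of that idea, where eq is a hypothetical helper written for this illustration rather than SPL's own implementation:

```python
from collections import Counter

def eq(a, b):
    """True when a and b hold the same members with the same multiplicities."""
    return Counter(a) == Counter(b)

print([1, 2, 3] == [3, 2, 1])          # False: list comparison is order-sensitive
print([1, 2, 3] == sorted([3, 2, 1]))  # True: same order after sorting
print(eq([1, 2, 3], [3, 2, 1]))        # True
print(eq([1, 2, 3], [3, 2, 2]))        # False
print(eq([1, 2, 2, 3], [3, 2, 1, 2]))  # True
print(eq([1, 2, 2, 3], [3, 2, 3, 1]))  # False: 2 and 3 occur a different number of times
```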
2 Loop functions
With a set data type available, many operations over the members of a set can be written in a single statement, with no need for explicit loop code.
A2  =A1.sum()
A3  =A1.avg()
A4  =A1.max()-A1.min()
The table above uses four loop functions: sum() in A2 computes the sum of the sequence's members, avg() in A3 computes their average, and max() and min() in A4 are combined to compute the difference between the largest and smallest values.
A loop function can work not only on the members themselves but also on values computed from them, such as the result of a formula applied to each member or a field value of a structured member. To do this, pass the expression as the function's parameter, using the symbol ~ to stand for the current member of the loop.
A1  [3,4,1]
A2  =A1.sum(~*~)
A3  =demo.query("select * from EMPLOYEE")
A4  =A3.min(~.BIRTHDAY)
A5  =A3.min(BIRTHDAY)
A6  =A3.avg(interval@y(BIRTHDAY,HIREDATE))
A2 in the table above computes the sum of squares of the sequence's members, that is, it squares each member and adds up the results.
A4, A5, and A6 loop over field values of the members of A3. A3 queries the employee table and produces a table sequence in which each member is one employee's record. A4 finds the earliest employee birthday, that is, the minimum of the members' BIRTHDAY values.
The ~. in the A4 expression can be omitted, giving the form in A5, which returns the same result as A4.
A6 computes the average age at which employees were hired, that is, the average number of years between each member's BIRTHDAY and HIREDATE.
Executing an aggregate function with a parameter can be understood as two steps:
1) Evaluate the parameter expression for each member of the set; the result is called a calculated column.
2) Aggregate the calculated column.
Formally, A.f(x) = A.(x).f(). For example, A1.sum(~*~) is equivalent to A1.(~*~).sum(), where A1.(~*~) is the calculated-column function: it squares each member of A1 and returns the results as a sequence.
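The identity A.f(x) = A.(x).f() can be checked with a small Python analogy. This is only a sketch of the two-step reading, and the sample values in a1 are assumed for illustration.

```python
a1 = [3, 4, 1]  # sample values, assumed for illustration

# One-step form, analogous to A1.sum(~*~)
sum_direct = sum(x * x for x in a1)

# Two-step form, analogous to A1.(~*~).sum()
calculated_column = [x * x for x in a1]  # step 1: evaluate the expression per member
sum_two_step = sum(calculated_column)    # step 2: aggregate the calculated column

print(calculated_column)         # [9, 16, 1]
print(sum_direct, sum_two_step)  # 26 26
```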
In the example above, A5 and A6 omit the ~ symbol. This is safe because only one level of loop function is involved, so omitting ~ causes no ambiguity. When loop functions are nested, ~ is interpreted as the member of the inner sequence; to refer to the member of the outer sequence, prefix ~ with the outer sequence's name.
A1  [A,B,C]
A2  [a,b,c]
A3  =A1.(A2.(~/~))
A4  =A1.(A2.(A1.~/~))
A5  =A1.(A1.(A1.~/~))
A6  =A1.((arg=~,A1.(arg/~)))
This example uses the string concatenation operator /. A3 concatenates two letters inside the loop, but a bare ~ only reaches the members of the inner sequence A2, so each resulting string is a lowercase letter repeated twice. In A4, the first ~ is qualified with the outer sequence's name and therefore refers to the outer sequence, so the results join an uppercase letter from A1 with a lowercase letter from A2. In A5, the inner and outer loops both run over A1, so the qualified reference cannot tell which A1 is meant and the outer member cannot be reached; only the inner member is used, and the results are repeated uppercase letters. To reference the outer member in this situation, use the approach in A6: assign the outer member to a temporary variable first and then refer to it through that variable, which yields every pairing of two uppercase letters.
The same rule for ~ applies to loops over table sequences and record sequences. When ~ is omitted from a field reference, the field is first looked up in the inner record sequence; only if no such field is found there is the outer level searched.
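Nested list comprehensions in Python show a comparable scoping effect, which may help make the four cases concrete. This is an analogy under the assumption that the outer sequence holds A, B, C and the inner one a, b, c; Python uses named loop variables where SPL uses ~.

```python
outer = ["A", "B", "C"]
inner = ["a", "b", "c"]

# Only the inner member is used: doubled lowercase letters.
a3 = [[y + y for y in inner] for _ in outer]

# Outer and inner members have distinct names: uppercase joined with lowercase.
a4 = [[x + y for y in inner] for x in outer]

# Both loops run over the same list with the same variable name:
# the inner x shadows the outer x, so only inner members are seen.
a5 = [[x + x for x in outer] for x in outer]

# Bind the outer member to a separate name first, then use it inside.
a6 = [[arg + x for x in outer] for arg in outer]

print(a3[0])  # ['aa', 'bb', 'cc']
print(a4[1])  # ['Ba', 'Bb', 'Bc']
print(a5[2])  # ['AA', 'BB', 'CC']
print(a6[1])  # ['BA', 'BB', 'BC']
```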
3 Loop order
Put simply, a loop function processes members in the order of the original sequence, and this property can be put to good use.
A1  [1,3,2,5]
A2  0
A3  =A1.(A2=A2+~)
A4  [1,1,0,0,0,0,1,0,…]
A3 uses a loop to compute the sequence of running totals of the members of A1.
A6 computes the length of the longest run of consecutive 0 members in sequence A4.
Many similar cases arise, and a single expression can do the work of a simple hand-written loop.
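Both calculations rely on visiting members in their original order. A Python sketch of the same two ideas follows; the input values are assumed for illustration, and the explicit loop shows the state that an order-dependent expression carries from one member to the next.

```python
from itertools import accumulate

a1 = [1, 3, 2, 5]  # sample values, assumed for illustration
running_totals = list(accumulate(a1))
print(running_totals)  # [1, 4, 6, 11]

flags = [1, 1, 0, 0, 0, 0, 1, 0, 0]  # sample values, assumed for illustration
longest = 0
current = 0
for v in flags:
    # Extend the current run on a 0, reset it on anything else.
    current = current + 1 if v == 0 else 0
    longest = max(longest, current)
print(longest)  # 4
```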
4 Computing sequences
The loop functions above, such as sum and avg, return a single aggregate value. Often, though, the result of a computation must itself be a set. Besides building new sets with the basic concatenation, intersection, and difference operations, a common approach is the computed-sequence function A.(x), which returns a sequence.
A1  […]
A2  =A1.(~*~)
A3  =A1.(~)
A4  =A1.()
A5  =A1.(1)
A6  =A1.(if(~%2==1,0,~))
A7  i love you
A8  =len(A7).(mid(A7,~,1))
A9  =A8.count(~=="o")
A2 through A6 each derive a new sequence from A1. A2 squares each member. A3 and A4 both return a new sequence made up of the original members. A5 returns a sequence of the same length as A1 whose members are all 1. A6 is slightly more involved: it tests each member of A1 in turn and returns 0 if the member is odd, or the member itself otherwise.
The full form of A8 is =to(len(A7)).(mid(A7,~,1)), where to(n) generates the sequence of integers from 1 to n. Like the ~ symbol discussed earlier, to() can be omitted in some cases once you are familiar with it. Looping over this sequence extracts the characters of the string in A7 one at a time, splitting it into a sequence of single characters. A9 then counts how many times the letter o occurs in it.
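For readers more familiar with list comprehensions, the A6, A8, and A9 steps can be mirrored in Python. This is an analogy only; the numbers in a1 are assumed, and 1-based positions are kept to match to(n) and mid().

```python
a1 = [1, 2, 3, 4]  # sample values, assumed for illustration

# Analogous to A6: odd members become 0, even members are kept.
a6 = [0 if x % 2 == 1 else x for x in a1]

a7 = "i love you"
# Analogous to A8: loop over positions 1..len and take one character at each.
a8 = [a7[i - 1:i] for i in range(1, len(a7) + 1)]
# Analogous to A9: count the members equal to "o".
a9 = sum(1 for ch in a8 if ch == "o")

print(a6)  # [0, 2, 0, 4]
print(a8)  # ['i', ' ', 'l', 'o', 'v', 'e', ' ', 'y', 'o', 'u']
print(a9)  # 2
```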
Besides returning a sequence, a computation over a sequence can also return a table sequence, using the new function.
A1  […]
A2  =A1.new(~:Origin,~*~:Square)
A3  =demo.query("select * from EMPLOYEE")
A4  =A3.new(NAME,age(BIRTHDAY):Age)
A5  =A3.new(NAME)
A6  =A3.(NAME)
A2 loops over A1 and returns a new table sequence with two fields: Origin holds the member of A1 and Square holds its square. As introduced earlier, ~ in the expressions stands for the current member of the loop.
A3 reads the EMPLOYEE table into a table sequence. A4 takes the NAME and BIRTHDAY fields from it, computes each employee's age from BIRTHDAY as a new field Age, and returns a new table sequence with the two fields NAME and Age.
A5 and A6 look similar but are different. A5 takes the NAME field from the table sequence in A3 and builds a new table sequence with a single NAME field, whereas A6 loops over A3 and returns a plain sequence of the NAME values. The difference is that a table sequence has a data structure and a sequence does not.
There is also the run function, which only performs the calculation: it modifies the original sequence in place rather than returning a new sequence of results. It is generally used to modify field values in a record sequence (table sequence).
A1  =demo.query("select * from EMPLOYEE")
A2  =A1.new(NAME,age(BIRTHDAY):Age)
A3  =A2.run(Age=Age+1)
In this example, A2 creates a new table sequence listing each employee's name and age. A3 then increases the age of every employee in A2 by 1. Because run modifies the data in the original table sequence A2 and returns that same modified table sequence, A2 and A3 hold the same result. Stepping through the code shows the contents of A2 changing.
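The contrast between building new records and modifying existing ones can be sketched in Python with a list of dictionaries. The rows and the helper names new_table and run are made up for this illustration and are not SPL functions.

```python
# Made-up rows standing in for the NAME/Age table; not real EMPLOYEE data.
ages = [{"NAME": "Rebecca", "Age": 40}, {"NAME": "Ashley", "Age": 35}]

def new_table(records, fields):
    """Analogy of new(): build fresh records holding only the chosen fields."""
    return [{f: r[f] for f in fields} for r in records]

def run(records, update):
    """Analogy of run(): change each record in place and return the same list."""
    for r in records:
        update(r)
    return records

names_only = new_table(ages, ["NAME"])
result = run(ages, lambda r: r.update(Age=r["Age"] + 1))

print(result is ages)  # True: no new list was created
print(ages)            # the original rows now show ages 41 and 36
print(names_only)      # unaffected copies holding only NAME
```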
5 Impure sets
SPL does not require the members of a sequence to share a data type, so numbers, strings, and even complex records can all be members of the same sequence.
A1  [1,…,5.4,$[4.50],2011-8-8]
A2  =[A1,4]
A1 contains members of several data types, and the sequence in A2 has two members: the sequence A1 and an integer.
For ordinary sequences, however, mixing data types in one sequence rarely has much business meaning, so it needs little attention.
For record sequences, on the other hand, it is genuinely convenient that the records may come from different table sequences.
A1  =demo.query("select * from EMPLOYEE")
A2  =demo.query("select * from FAMILY")
A3  =A1|A2
A4  =A3.count(left(GENDER,1)=="F")
A4 counts the women among the employees and their family members. Even though the employee table and the family table have different structures, the calculation works as long as both contain a GENDER field.
As this example shows, SPL does not care whether the records in a record sequence come from the same table sequence. As long as they have a field with the same name, they can be processed uniformly, with no need to first merge two differently structured tables into a new one with a UNION statement as SQL requires. The reasoning is clearer, the code is simpler, no extra memory is consumed, and execution is more efficient.
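The same idea can be shown with Python dictionaries of different shapes. The rows and every field other than GENDER are hypothetical; the point is only that a shared field name is enough for a uniform calculation.

```python
# Hypothetical rows; the two lists deliberately have different fields.
employees = [
    {"EID": 1, "NAME": "Rebecca", "GENDER": "F", "STATE": "California"},
    {"EID": 2, "NAME": "Matthew", "GENDER": "M", "STATE": "Florida"},
]
family = [
    {"EID": 1, "NAME": "Ann", "RELATION": "Daughter", "GENDER": "Female"},
    {"EID": 2, "NAME": "Tom", "RELATION": "Son", "GENDER": "Male"},
]

# Concatenate the two record lists without reshaping either of them.
combined = employees + family

# Only the shared GENDER field is touched, so the differing structures do not matter.
women = sum(1 for r in combined if r["GENDER"][:1] == "F")
print(women)  # 2
```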
6 Sets of sets
In particular, because members can be of any type, a set can itself be a member of a set. When A is a set of sets, the functions A.conj(), A.union(), A.diff(), and A.isect() compute the concatenation, union, difference, and intersection of the sets in A.
A1  [[…],…]
A2  =A1.conj()
A3  =A1.isect()
A4  =A1.(~.sum())
A5  =A1.(~.(~*~))
A1 is a sequence whose members are sequences. A2, A3, A4, and A5 compute, respectively, the concatenation of the member sequences, their intersection, the sum of each member sequence, and the square of every member of each member sequence.
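A rough Python counterpart, folding a pairwise operation over a list of lists, is sketched below. The nested values are assumed for illustration, the helper names are hypothetical, and the difference is taken as the first set minus all the others, which is an assumption about how the difference of several sets is read.

```python
from functools import reduce

a1 = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]  # sample values, assumed for illustration

def union2(a, b):
    return a + [x for x in b if x not in a]

def isect2(a, b):
    return [x for x in a if x in b]

def diff2(a, b):
    return [x for x in a if x not in b]

conj = [x for s in a1 for x in s]            # concatenation of all member lists
union = reduce(union2, a1)                   # ordered union across all members
isect = reduce(isect2, a1)                   # members common to every list
diff = reduce(diff2, a1)                     # first list minus all the others
sums = [sum(s) for s in a1]                  # one sum per member list
squares = [[x * x for x in s] for s in a1]   # nested loop over each member list

print(conj)   # [1, 2, 3, 2, 3, 4, 3, 4, 5]
print(union)  # [1, 2, 3, 4, 5]
print(isect)  # [3]
print(diff)   # [1]
print(sums)   # [6, 9, 12]
```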
Similarly, record sequences can be members of a sequence.
A1  =demo.query("select EID, NAME, SURNAME, GENDER, STATE from EMPLOYEE")
A2  =A1.select(STATE=="California")
A3  =A1.select(STATE=="Indiana")
A4  =A1.select(STATE=="Florida")
A5  =[A2,A3,A4]
A6  =A5.(~.count())
A7  =A5.(~(1).STATE)
A8  =A5.(STATE)
A9  =A5.new(STATE,~.count():Count)
A2, A3, and A4 select the employees from California, Indiana, and Florida respectively. A5 is a sequence made up of the three record sequences A2 through A4, that is, a set of sets.
A6 counts the employees in each state.
A7 retrieves the name of each state. The ~(1) in the expression can be omitted, so A8 is equivalent to A7 and returns the same result.
Like A6, A9 counts the employees in the three states, but it uses new to build a table sequence, which reads more clearly and makes later lookups by state name convenient.
7 Understanding grouping
Grouping is a common operation in SQL, but not everyone understands it deeply.
From the point of view of sets, grouping splits a set into several subsets according to some rule, so its return value should be a set of sets. In practice, people are usually less interested in the subsets themselves than in summary values computed over them, so grouping is often followed by an aggregate calculation on each subset.
This is exactly how SQL treats it: a GROUP BY clause is always paired with aggregate calculations. One reason is that SQL has no explicit set data type, so it cannot return data such as a set of sets and is forced to aggregate immediately after grouping.
Over time, people have come to assume that grouping must always be followed by aggregation, forgetting that grouping and aggregation are two separate steps.
Still, there are times when the grouped subsets themselves are of interest. Even when only the summary values matter, keeping the subsets is worthwhile: reusing them instead of regenerating them each time makes code simpler and faster.
Because SPL fully embraces set-oriented thinking, it can restore grouping to its original meaning. The basic grouping function in SPL performs pure grouping only, with aggregation separated out.
A1  =demo.query("select * from EMPLOYEE")
A2  =A1.group(month(BIRTHDAY),day(BIRTHDAY))
B2  /group employees by birthday (month and day)
A3  =A2.select(~.len()>1)
B3  /keep the groups with more than one member, that is, employees who share a birthday
A4  =A3.conj()
A5  =A1.group(STATE)
B5  /group employees by state
A6  =A5.new(~(1).STATE:State,~.count():Count)
B6  /build a table sequence from the grouping result: number of employees per state
A7  =A5.new(STATE,~.avg(age(BIRTHDAY)):Age)
B7  /build a table sequence: average age of employees per state
The result of grouping is a set of sets, so it can of course be grouped again. Each member of the grouping result is also a set and can be grouped in turn. These are two different operations, but both produce a multi-level set.
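Keeping grouping and aggregation as separate steps can be sketched in Python as follows. The group helper and the sample rows are hypothetical; the helper returns the subsets themselves, ordered by key, so that several summaries can reuse them.

```python
from collections import defaultdict

def group(records, key):
    """Pure grouping: return the subsets themselves, ordered by grouping key."""
    buckets = defaultdict(list)
    for r in records:
        buckets[key(r)].append(r)
    return [buckets[k] for k in sorted(buckets)]

# Made-up rows standing in for the EMPLOYEE table.
employees = [
    {"NAME": "Rebecca", "STATE": "California", "AGE": 40},
    {"NAME": "Ashley", "STATE": "Florida", "AGE": 30},
    {"NAME": "Rachel", "STATE": "California", "AGE": 50},
]

by_state = group(employees, lambda r: r["STATE"])  # a list of lists of records

# Two different summaries reuse the same subsets.
counts = [{"State": g[0]["STATE"], "Count": len(g)} for g in by_state]
avg_age = [{"State": g[0]["STATE"], "Age": sum(r["AGE"] for r in g) / len(g)}
           for g in by_state]

print(counts)   # California has 2 rows, Florida has 1
print(avg_age)  # California averages 45.0, Florida 30.0
```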
A1  =demo.query("select * from EMPLOYEE")
A2  =A1.group(year(BIRTHDAY))
B2  /group employees by year of birth
A3  =A2.group(int(year(~(1).BIRTHDAY)%100/10))
A4  =A2.group(int(year(BIRTHDAY)%100/10))
A5  =A2.(~.group(month(BIRTHDAY)))
B5  /group each group of the grouping result again
A3, A4, and A5 all return nested sequences of record sequences.
When the result of a set operation is nested too deeply, it may have little practical business meaning, but it is still useful for understanding set-oriented thinking and the nature of the operations.
While grouping, the group function also sorts the groups by the value of the grouping expression, for example:
A1  $select EID,NAME+' '+SURNAME FULLNAME, DEPT from EMPLOYEE
A2  =A1.group(DEPT)
B2  =A2.new(~.DEPT:DEPT,~.count():Count)
A3  =A2.sort(~.DEPT:-1)
B3  =A3.new(~.DEPT:DEPT,~.count():Count)
A4  =A1.group@u(DEPT)
B4  =A4.new(~.DEPT:DEPT,~.count():Count)
A5  =A1.group@o(DEPT)
B5  =A5.new(~.DEPT:DEPT,~.count():Count)
A1 returns a table sequence of employee IDs, full names, and departments.
A2 groups the employee data by department name; by default, the groups in A2 are sorted in ascending order of department name. Column B counts the employees in each group under each grouping variant, so that the ordering can be read directly from the DEPT column.
A3 re-sorts the groups from A2 in descending order of department name; the effect can be seen in B3.
Besides re-sorting the result, you can also add options to group to control the order of the groups.
In A4, the @u option keeps the groups in the order in which the departments first appear in the employee table.
The @o option in A5 specifies that the records are not sorted as a whole; only adjacent records with equal grouping values are placed in the same group, which makes the operation more like an adjacent merge. Clearly, the same grouping value may then appear in more than one group. B4 and B5 show the effect of these two options.
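The three orderings can be compared side by side in Python. This is an analogy built on itertools.groupby, which itself only merges adjacent equal items; the department values are made up for illustration.

```python
from itertools import groupby

depts = ["Sales", "HR", "Sales", "Sales", "HR", "R&D"]  # made-up DEPT values in table order

# Default behaviour: groups ordered by the grouping value.
sorted_groups = [(k, len(list(g))) for k, g in groupby(sorted(depts))]

# Like @u: groups kept in order of first appearance (dicts preserve insertion order).
first_seen = {}
for d in depts:
    first_seen.setdefault(d, []).append(d)
unsorted_groups = [(k, len(v)) for k, v in first_seen.items()]

# Like @o: only adjacent equal values are merged, so a value can recur.
adjacent_groups = [(k, len(list(g))) for k, g in groupby(depts)]

print(sorted_groups)    # [('HR', 2), ('R&D', 1), ('Sales', 3)]
print(unsorted_groups)  # [('Sales', 3), ('HR', 2), ('R&D', 1)]
print(adjacent_groups)  # [('Sales', 1), ('HR', 1), ('Sales', 2), ('HR', 1), ('R&D', 1)]
```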
8 Non-equivalence grouping
Besides the regular group function, SPL provides A.align@a() for alignment grouping and A.enum() for enumeration grouping.
Grouping with the group function is called equivalence grouping, and it has the following characteristics:
1) Every member of the original set belongs to exactly one subset; that is, the subsets together cover the original set completely and do not overlap.
2) No subset is empty.
Alignment grouping and enumeration grouping do not necessarily satisfy these two conditions.
Alignment grouping evaluates a grouping expression for each member of the set and assigns members to groups by matching the results against the values of a sequence specified in advance. It involves the following steps:
1) Specify a sequence of values in advance.
2) Place the members of the set whose expression result equals a given specified value into the same subset.
3) Each subset in the result corresponds to one of the values specified in advance.
Under this rule, a member may belong to no subset, a subset may be empty, and a member may belong to more than one subset.
Group employees by a specified sequence of states, as in the following example:
A1  =demo.query("select * from EMPLOYEE")
A2  [California,Florida,Chicago]
A3  =A1.align@a(A2,STATE)
A3 groups A1 by aligning it to A2: the STATE value of each member of A1 is matched against the members of A2. In this kind of grouping, some employees may fall into no group (employees from other states), and some groups may be empty (Chicago is not a state name, so no employee matches it).
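A minimal Python sketch of alignment grouping, assuming made-up employee rows and a hypothetical helper align_all, shows both effects: one row is dropped and one group stays empty.

```python
def align_all(records, values, key):
    """One group per pre-specified value, in the order of the values.
    Records whose key matches none of the values are left out."""
    return [[r for r in records if key(r) == v] for v in values]

# Made-up rows standing in for the EMPLOYEE table.
employees = [
    {"NAME": "Rebecca", "STATE": "California"},
    {"NAME": "Ashley", "STATE": "Texas"},
    {"NAME": "Rachel", "STATE": "Florida"},
    {"NAME": "Emily", "STATE": "California"},
]
states = ["California", "Florida", "Chicago"]

groups = align_all(employees, states, lambda r: r["STATE"])
print([len(g) for g in groups])  # [2, 1, 0]: the Texas row is in no group, Chicago is empty
```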
Enumeration grouping specifies a set of conditions in advance and evaluates each condition with each member of the set to be grouped as its parameter; members that satisfy a condition are placed in the corresponding subset. Here too, a member may belong to no subset, a subset may be empty, and a member may belong to more than one subset.
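The behaviour described above can be sketched in Python with a list of predicates. The age ranges, the sample ages, and the helper enum_groups are all hypothetical, and the sketch places a member in every group whose condition it satisfies, which is the overlapping case the text mentions rather than a statement about the default behaviour of A.enum().

```python
def enum_groups(members, conditions):
    """One group per condition; a member joins every group whose condition holds."""
    return [[m for m in members if cond(m)] for cond in conditions]

ages = [25, 35, 45, 60]  # made-up ages

conditions = [
    lambda age: age < 40,         # hypothetical range "under 40"
    lambda age: 30 <= age <= 50,  # hypothetical range "30 to 50", overlaps the first
    lambda age: age > 70,         # hypothetical range "over 70", matches nothing here
]

groups = enum_groups(ages, conditions)
print(groups)  # [[25, 35], [35, 45], []]: 35 is in two groups, 60 is in none
```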
Group employees by specified age ranges, as in the following example:
A1  =demo.query("select EID, NAME, SURNAME, GENDER, BIRTHDAY from EMPLOYEE")
A2  [?