Updated 2025-01-17. From: SLTechnology News&Howtos (shulou)
Shulou(Shulou.com)06/03 Report--
SPL is a programming language for structured data computing, and the aggregator (esProc) is its Java implementation. It provides an IDE for coding and debugging in a grid-style programming format. Its syntax is easier to understand than Java or SQL, and development is more efficient. This article lists some tips for improving computing performance, based on how the aggregator is implemented.
1 Data types
1.1 Numeric values
The numeric types in SPL are Integer, Long, Double, and BigDecimal. BigDecimal can represent numbers with arbitrary precision, but it computes much more slowly than the other types and takes up much more memory. When another type meets the precision requirement, using it instead of BigDecimal can significantly improve computational efficiency.
In practice, when data is read from a database through JDBC, the JDBC drivers of some databases return BigDecimal even for low-precision values. When optimizing performance, check whether such values can be converted to another type.
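The check described above can be sketched in Java, the language the aggregator is implemented in. This is only an illustration under stated assumptions: the class and method names are hypothetical, and the threshold of 15 significant digits and 15 decimal places is an assumed, conservative bound for values that a double can hold without visible loss. It is not part of SPL.

```java
import java.math.BigDecimal;

final class NumericNarrowing {
    /**
     * Returns a Long when v is a whole number that fits in 64 bits, a Double when
     * v has at most 15 significant digits and 15 decimal places (an assumed
     * threshold), and otherwise the unchanged BigDecimal.
     */
    static Number narrow(BigDecimal v) {
        BigDecimal s = v.stripTrailingZeros();       // 12.50 -> 12.5, 100 -> 1E+2
        if (s.scale() <= 0) {                        // no fractional part
            try {
                return Long.valueOf(s.longValueExact());
            } catch (ArithmeticException tooLargeForLong) {
                return v;                            // whole number, but beyond 64 bits
            }
        }
        if (s.precision() <= 15 && s.scale() <= 15) {
            return Double.valueOf(s.doubleValue());
        }
        return v;                                    // keep full precision
    }
}
```

A value such as 12.50 read through JDBC would come back as a Double, a whole number such as 42 as a Long, and a 20-digit decimal would stay a BigDecimal.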
1.2 Strings
Java's String objects take up a lot of space. A string of length 0 takes up more than 40 bytes, while an Integer or Long occupies only about 16 bytes (exact sizes depend on the JVM). Comparing and hashing strings is also slower than for Integer and Long.
In addition, when data is read from disk into Java objects, it often takes up several times, or even ten times, as much memory as it did on disk (and the gap is even larger if the on-disk storage is compressed). As a result, reading even a fairly small data file into Java objects may cause a memory overflow. If the memory footprint cannot be reduced, the only option is external-memory computing, which is generally far more complex than in-memory computing and performs much worse.
So, is there a way to reduce the memory footprint and improve computational efficiency at the same time?
A common method is to convert enumerated values into sequence numbers. Take the data from the following fact table:
For enumerated fields such as gender and region, you can build a code table for each and convert the gender and region values into sequence numbers 1, 2, …, so that the gender field in the fact table holds only the corresponding sequence number; the same goes for region. The converted data are as follows:
This reduces the memory footprint and improves computational efficiency, because comparing and grouping numbers is much faster than comparing and grouping strings. When outputting results, you can convert the sequence numbers back into strings as needed: the sequence number locates the corresponding record in the code table directly by position.
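A minimal Java sketch of such a code table, assuming one table per enumerated field. The class name and the sample values used with it are hypothetical; in SPL the code table would itself be a table, so this only illustrates the idea of encoding once and decoding by position.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class CodeTable {
    private final Map<String, Integer> codes = new HashMap<>();
    private final List<String> names = new ArrayList<>();

    /** Returns the 1-based sequence number of name, assigning the next one on first sight. */
    int encode(String name) {
        Integer code = codes.get(name);
        if (code == null) {
            names.add(name);
            code = names.size();          // sequence numbers start at 1
            codes.put(name, code);
        }
        return code;
    }

    /** Converts a sequence number back to its string by position, with no string comparison. */
    String decode(int code) {
        return names.get(code - 1);
    }
}
```

The string lookup happens only once per value while loading; afterwards the fact data carries small integers, and `decode` is needed only when results are output.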
2 Table sequence structure
2.1 Row append
A table sequence is similar to a table in a database, but its rows are ordered. Its data is stored in memory in a contiguous array. When memory is allocated for a table sequence, some extra space is generally reserved for possible growth, so that memory need not be reallocated every time data is appended; reserving too much, however, would waste memory.
As a result, frequently appending records keeps lengthening the array until it outgrows the space originally allocated. Expanding the allocation is not trivial: a larger block of space must be allocated and the data in the original space copied over. Finding space and copying data take CPU time, often more than the computation itself.
Therefore, if you know the number of rows in advance, create the table sequence in one go, so that memory is allocated only once at the start. Even if the field values take several steps to compute, you should first create the table with new and then modify the field values of its records, rather than computing one row and inserting one row at a time. SPL provides many ways to modify record field values.
Suppose we want to generate a table sequence of Fibonacci numbers with 20 rows and 2 columns. The first column, key, is the row number, that is, 1, 2, 3, …; the second column, value, holds the number itself. In the Fibonacci series, rows 1 and 2 take the value 1, and from row 3 onward each value is the sum of the two preceding rows. The computation has to proceed step by step, so appending the data dynamically is the natural idea:
Creating the table sequence in one go performs better, however, even though the computation itself still has to proceed step by step:
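A rough Java analogy of the two approaches (not SPL grid code): the first method appends rows to a growable list, the second sizes the table once and then fills in the values. The class and method names are hypothetical, and Java's ArrayList merely stands in for the growth behaviour described above.

```java
import java.util.ArrayList;
import java.util.List;

final class FibTable {
    /** Append style: the list's backing array is regrown and copied as rows arrive. */
    static List<long[]> byAppending(int rows) {
        List<long[]> table = new ArrayList<>();      // default capacity, grows on demand
        for (int i = 1; i <= rows; i++) {
            // Rows i-1 and i-2 sit at list indices i-2 and i-3.
            long value = i <= 2 ? 1 : table.get(i - 2)[1] + table.get(i - 3)[1];
            table.add(new long[] {i, value});        // {key, value}
        }
        return table;
    }

    /** Create-then-fill style: the table is sized once, then field values are set. */
    static long[][] preallocated(int rows) {
        long[][] table = new long[rows][2];          // sized up front, never regrown
        for (int i = 0; i < rows; i++) {
            table[i][0] = i + 1;                                          // key
            table[i][1] = i < 2 ? 1 : table[i - 1][1] + table[i - 2][1];  // value
        }
        return table;
    }
}
```

Both methods still compute the values one row at a time; only the allocation pattern differs, which is the point of this section.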
2.2 Column append
Besides appending data row by row, a table sequence can also be expanded by changing its data structure, adding a field to every row; this is column append. Column append is more complex than row append. A table sequence is itself one large array in which each row is a record, and each record is also physically an array. Because the data structure rarely changes, no spare space is reserved when the array for each row is created; otherwise too much memory would be wasted, since the reserve would be needed for every row. Given this implementation, appending a column triggers the reallocation described above for every record, followed by copying the original record data over. You can imagine the time cost; it is often far greater than that of the computation to be done on the appended column.
SPL can append columns to a table sequence, which is convenient but should be used with caution when performance matters. When you have to use it, add all the columns you need in one go, as described above, rather than one at a time. For columns whose values cannot be computed yet, fill in nulls first and use other functions to modify the field values later.
In the most common case, if you already know that you will derive a new column xxx after retrieving a table sequence from the database, you can add null as xxx to the SQL statement. All the required fields are then generated directly by the query, and no separate derive is needed.
For example, suppose we take the fields ORDERDATE and AMOUNT from the data table sales, sorted by ORDERDATE, and then append a column holding the cumulative value of AMOUNT. The natural way to write this is to read the data first and then append the column:
Instead, it is better to have the SQL statement generate the column:
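The per-row cost can be illustrated with a small Java sketch in which each record is an Object[] with no spare capacity. This models the description above rather than SPL's actual internals; the class and method names are hypothetical.

```java
import java.util.Arrays;

final class ColumnAppend {
    /**
     * Widens every row by `extra` null-filled fields. Each row has no spare
     * capacity, so every row is reallocated and its old fields are copied.
     */
    static Object[][] appendColumns(Object[][] table, int extra) {
        Object[][] wider = new Object[table.length][];
        for (int r = 0; r < table.length; r++) {
            // Arrays.copyOf allocates a longer array and pads the new slots with null.
            wider[r] = Arrays.copyOf(table[r], table[r].length + extra);
        }
        return wider;
    }
}
```

Calling `appendColumns(table, 2)` copies each row once, whereas two separate calls with `extra = 1` copy each row twice. The null-filled fields can then be assigned later, as suggested above.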
2.3 Record references
The two optimizations above, both of which adjust the table sequence structure, aim to reduce the copying of field values in the new and derive functions. In addition, SPL supports object references: the value of a field can be another record. So in most cases SPL does not need to copy fields into a new result set the way SQL does. To keep the whole original record available for further operations, simply reference it. This performs better and also takes up less space.
The earlier requirement, appending the cumulative AMOUNT with derive, can be implemented with the new function instead. new creates a new table sequence in which the SRC field references the original record and the CUMULATE field stores the cumulative value, as follows:
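The reference idea can be shown with a Java analogy (again, not SPL code). The field names follow the article's sales example; the classes themselves are hypothetical. Each result row stores a pointer to the source record plus the one new value, so no source fields are copied.

```java
import java.util.ArrayList;
import java.util.List;

final class CumulateByReference {
    /** A source record with the ORDERDATE and AMOUNT fields. */
    static final class Sale {
        final String orderDate;
        final double amount;
        Sale(String orderDate, double amount) {
            this.orderDate = orderDate;
            this.amount = amount;
        }
    }

    /** A result record: SRC references the original record, CUMULATE is the only new value. */
    static final class Row {
        final Sale src;
        final double cumulate;
        Row(Sale src, double cumulate) {
            this.src = src;
            this.cumulate = cumulate;
        }
    }

    static List<Row> cumulate(List<Sale> sales) {
        List<Row> out = new ArrayList<>(sales.size());  // sized once, as in section 2.1
        double running = 0;
        for (Sale s : sales) {
            running += s.amount;
            out.add(new Row(s, running));               // stores a reference, copies no fields
        }
        return out;
    }
}
```

The original fields remain reachable through the reference, for example `row.src.orderDate`.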
3 Loop functions
3.1 Replace loop statements with loop functions
SPL's grid programs provide the loop statement for and the branch statement if to implement complex logic. At run time the grid's execution order is interpreted dynamically, so a loop with many iterations executes a very large number of cells, and much of the time goes to interpreting the grid.
Besides loop statements, SPL provides loop functions that cover most scenarios in which a for statement would be used. When the calculation steps are not too complex and performance matters, use loop functions wherever possible. Likewise, avoid the if statement where the if function can be used.
The Fibonacci example in section 2.1 can be rewritten as follows:
Here # is the sequence number of the record currently being processed in the loop: it is 1 for the first record and increases by one each time. value[-1] is the value of the previous record, and value[-2] is the value of the record before that.
Each time the eval function executes, it parses the expression string given as its parameter into an expression and then evaluates it. If eval runs inside a loop, a great deal of time is spent parsing the string again and again. If the expression string does not change, you can use a macro instead of eval.
3.2 Move constants out of the loop
Moving the generation of constants out of the loop also helps performance. For example, to select the sales records for Beijing, Shanghai, and Shenzhen, the more "natural" way to write it is:
Because SPL sequences are mutable, the expression ["Beijing","Shanghai","Shenzhen"] produces a new sequence every time it is evaluated. If it is placed inside the loop function select as above, as many sequences are generated during execution as there are members in A2. With many iterations, these unnecessary operations consume a lot of time. Performance-oriented code should therefore be written as follows:
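The same pattern appears in Java, shown here as an analogy rather than SPL code. The class and method names are hypothetical. In the first method the list of cities is rebuilt for every record; in the second it is built once before the loop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class HoistConstant {
    /** Rebuilds the city list for every record: one throwaway list per iteration. */
    static List<String> selectNaive(List<String> cities) {
        List<String> out = new ArrayList<>();
        for (String c : cities) {
            if (Arrays.asList("Beijing", "Shanghai", "Shenzhen").contains(c)) {
                out.add(c);
            }
        }
        return out;
    }

    /** Builds the list once, before the loop, and reuses it for every record. */
    static List<String> selectHoisted(List<String> cities) {
        List<String> wanted = Arrays.asList("Beijing", "Shanghai", "Shenzhen");
        List<String> out = new ArrayList<>();
        for (String c : cities) {
            if (wanted.contains(c)) {
                out.add(c);
            }
        }
        return out;
    }
}
```

Both methods return the same result; the hoisted version simply avoids allocating one list per record.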
3.3 Beware of nested loops
Beware of loop functions nested inside other loop functions. The code looks simple, but after several layers of looping the actual amount of computation grows geometrically. This is common sense, yet it is sometimes overlooked: whatever can be done outside the loop should not be put inside it. Be especially wary of extremely time-consuming actions, such as reading files or accessing databases, inside a loop.
4 Coding habits
4.1 Free memory
Java performance degrades sharply when memory runs short, so memory should be released promptly. SPL has no statement that deletes a variable to release its memory; simply set the variable or cell to null. You can also use the clear statement to clear cells. For example:
A cell that starts with = is a calculation cell, and the return value of its expression is saved in the cell. A cell that starts with > is an execution cell, and the return value of its expression is not saved. cs.select and cs.join attach operations to a cursor and do not generate a new cursor, so their return values need not be saved. A7 releases the PART data that was read. Alternatively, you can use a clear statement to empty all the cell values from A1 to A5; just replace the A7 code as follows:
4.2 Compact code
The code blocks of for and if can be written on the same line as the statement; there is no need to start a new line as in Java, because SPL's grid already separates the statements clearly. The interpreter also spends time scanning blank cells, so for programs with loop statements that iterate a very large number of times, make the code compact and delete blank rows and columns to reduce the number of cells and improve the interpreter's efficiency.
Let's take fetching the first sales record of each day as an example to introduce SPL's code block rules. The cursor parameter sales holds the sales records, ordered by ORDERDATE.
A cell's code block consists of the cells to its right in the same row, together with the rows below in which the cell in its column and every cell to the left are blank. In the example above, the code block of the for in A2 is [B2:F5]. The code block of the if in B2 is [C2:F2]. B3, the cell in the same column as the if on the row just after the if block, contains else, and the cell to its left is blank, so B3 is the else branch of B2, and the code block of B3 is [C3:F5]. The else can also be written in a cell to the right of the corresponding if, on the same row.