2025-01-18 Update From: SLTechnology News&Howtos
Shulou (Shulou.com) 06/03 Report
Sorting is a resource-intensive operation, especially on big data: when the data does not fit in memory, the conventional approach resorts to external storage, which adds extra read and write passes over the data, and those IO operations usually cost more than the comparisons themselves.

The SPL sorting optimization techniques introduced in this article cover not only the conventional sorting algorithms but also alternative algorithms suited to the data characteristics of different scenarios, reducing both the number of comparisons and the amount of IO to improve performance.
1 Memory sort

When the data fits easily in memory, you can use SPL's in-memory sort function, A.sort(). SPL's default algorithm is a multithreaded merge sort, that is, it gains speed mainly by increasing the number of threads; the actual number of threads used is set by the [maximum parallelism] option in the aggregator configuration. The sample code is as follows:
A1  =5000*1000              / number of elements
A2  =A1\1000                / maximum random number
A3  =to(A1).(rand(A2))      / generate a random sequence
A4  =now()                  / current time
A5  =A3.sort()              / ascending sort
A6  =interval@ms(A4,now())  / time spent sorting
The test machine has a Core i7 CPU with 4 cores and 8 threads. The results under different [maximum parallelism] settings are as follows:

Maximum parallelism    Average time (ms)
1 (single thread)      1800
4                      800
8                      660
As the results show, on multi-core or multi-CPU machines, multithreaded sorting makes full use of the parallel computing power of each core and improves sorting performance significantly.
In this example each value repeats about 1000 times on average; for A.sort() the repetition count has little effect on performance. When values repeat heavily, however, sorting via grouping with A.group@s() can do better: it first hash-groups the elements, then sorts only the groups, and finally concatenates the sorted groups to produce the sorted result. The sample code is as follows:
A1  =5000*1000              / number of elements
A2  =A1\1000                / maximum random number
A3  =to(A1).(rand(A2))      / generate a random sequence
A4  =now()                  / current time
A5  =A3.group@s()           / each value repeats ~1000 times on average, so sort via grouping
A6  =interval@ms(A4,now())  / time spent sorting
With the grouping method, the average time drops to 360 milliseconds. This approach suits data with many duplicate values, and the more duplicates there are, the greater the gain.
2 Out-of-memory sort
When the data is too large to be loaded into memory, sorting needs the help of external storage. An out-of-memory sort reads the data in batches, sorts each batch and writes it to a temporary file, and finally merges all the temporary files into the final result.

The number of temporary files affects sorting efficiency: if there are too many of them, the merge phase holds more resources, and a merge with too many ways is itself slower. Reading more data per batch therefore improves sortx performance.
SPL's out-of-memory sort function is cs.sortx(). The number of temporary files generated is determined by the total data volume and the amount of data read per batch, and the batch size can be specified as a parameter of sortx. Users can choose a reasonable batch size based on the record size and the free memory capacity to get the best performance; if it is not specified, SPL estimates an approximate value from the virtual machine's available memory.

The temporary files are released when the result cursor is fetched to the end or when its close method is called.
Testing an out-of-memory sort requires a large volume of data, and fetching data from a database through JDBC is quite slow, so this article uses the more efficient bin file (btx) for testing.
The test data simulates order data. The table structure is {order_id, order_datetime, customer_id, employee_id, product_id, quantity, price}, and 20 million records are generated in order of order_id and order_datetime, with the other fields random.

SPL is also used to generate the test data; the creation code is as follows:
A1  2018-01-01 00:00:00
B1  =file("orders_2018.btx")
A2  =to(1000000)
B2  0
A3  for 20
B3  =A2.new(B2+~:order_id, elapse@s(A1,B2+~):order_datetime, rand(1000000)+1:customer_id, rand(1000)+1:employee_id, rand(10000)+1:product_id, rand(1000)+1:quantity, rand(100000000)/100:price)
B4  =B1.export@ab(B3)
B5  >B2=B2+A2.len()
The code for the sort calculation is as follows:
A1  =now()
A2  =file("orders_2018.btx").cursor@b()             / create a cursor on the order bin file
A3  =A2.sortx(customer_id;2000000)                  / sort the cursor by customer_id; the second parameter is the number of rows read per batch
A4  =file("orders_2018_customer.btx").export@b(A3)  / export the sorted result to a bin file
A5  =interval@s(A1,now())                           / time spent (seconds)
The test results are as follows:

Rows read per batch            Time (seconds)
2,000,000                      73
200,000 (more merge ways)      163
3 Ordered merge

When using SPL for big-data computation, the data is usually dumped into bin files or group tables sorted by the commonly used filtering dimensions beforehand, to get better filtering performance. With data stored this way, when new data must be appended to the history file and all the data must end up ordered again, we can exploit the existing order: since the historical data is already sorted by the dimension, we only need to sort the new data by that dimension and then merge it with the historical data to obtain the final ordered result. The amount of data actually sorted this way is far smaller than re-sorting everything from scratch. The test is as follows:
In the out-of-memory sort data-generation code above, change A1 to 2017-01-01 00:00:00 and B1 to =file("orders_2017.btx") to generate a simulated 2017 order data file, "orders_2017.btx". Then run the out-of-memory sort code to sort it by the customer_id field, producing the result file "orders_2017_customer.btx".
Then we want all the 2017 and 2018 data sorted as a whole, that is, we use the merge method to combine orders_2017_customer.btx and orders_2018_customer.btx into a new ordered file. The sample code is as follows:
A1   =now()
A2   =file("orders_2017_customer.btx").cursor@b()
A3   =file("orders_2018_customer.btx").cursor@b()
A4   =[A2,A3].mergex(customer_id)                 / merge the two cursors, each ordered by customer_id, into one cursor ordered by customer_id
A5   =file("orders1.btx").export@b(A4)            / export the merged data to a bin file
A6   =now()
B6   =interval@s(A1,A6)
A7   =file("orders_2017_customer.btx").cursor@b()
A8   =file("orders_2018_customer.btx").cursor@b()
A9   =[A7,A8].conjx().sortx(customer_id;2000000)  / concatenate the two cursors vertically and sort all the data
A10  =file("orders2.btx").export@b(A9)
A11  =now()
B11  =interval@s(A6,A11)
The first six lines, which use the merge method, take 30 seconds; lines 7 to 11, which simulate the straightforward full sort, take 133 seconds.
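The ordered-merge idea — sort only the new batch, then do one linear merge with the already ordered history — can be sketched in Python (the function name and default key are illustrative):

```python
import heapq

def append_ordered(history, new_rows, key=lambda r: r):
    """`history` is already ordered by `key`; only the new rows get a
    real sort, after which a single linear merge yields the full ordered
    result -- much cheaper than re-sorting history + new_rows."""
    new_sorted = sorted(new_rows, key=key)                  # small sort
    return list(heapq.merge(history, new_sorted, key=key))  # linear merge
```

For example, with orders kept ordered by customer_id, a new day's orders are sorted alone and merged into the history in one pass.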
4 Semi-ordered data

If the data must be sorted by several fields and is already ordered by the leading field(s) of that sort key, you can group the data by the already-ordered fields, sort each group in memory, and output the groups in turn. Compared with cs.sortx(), this saves one full round of reading and writing, so its performance is significantly better.

For example, order data is generated in date-time order; to sort the order table by date and customer, this method applies. The sample code is as follows:
A1  =now()
A2  =file("orders_2017.btx").cursor@b()
A3  =A2.run(order_datetime=date(order_datetime))  / convert datetime to date
A4  =A2.group@qs(order_datetime;customer_id)      / data already ordered by order_datetime; sort by order_datetime,customer_id
A5  =file("orders_2017_date_customer.btx").export@b(A4)
A6  =interval@s(A1,now())
The test takes 38 seconds (the value of cell A6), while replacing the expression in cell A4 with the following code takes 95 seconds.

A4  =A2.sortx(order_datetime,customer_id;2000000)  / sort all the data by order_datetime,customer_id

5 Index
SPL's group tables can create an index on chosen columns, and commonly used columns can additionally be stored inside the index; when a query touches only columns held in the index, there is no need to read the source file at all, which saves a lot of IO. If the sort field is the index field and every accessed field is stored in the index, T.icursor@s() can exploit the index order and return an already ordered cursor.

The code to create the index is as follows:
A1  2018-01-01 00:00:00
B1  =file("orders.ctx")
A2  =to(1000000)
B2  0
A3  =B1.create@y(#order_id,#order_datetime,customer_id,employee_id,product_id,quantity,price)
A4  / the loop below generates the group table data
A5  for 20
B5  =A2.new(B2+~:order_id, elapse@s(A1,B2+~):order_datetime, rand(1000000)+1:customer_id, rand(1000)+1:employee_id, rand(10000)+1:product_id, rand(1000)+1:quantity, rand(100000000)/100:price)
B6  =A3.append(B5.cursor())
B7  >B2=B2+A2.len()
A8  =A3.index(PriceIndex;price;product_id,order_datetime)  / build an index on the price field while also storing the product_id and order_datetime fields
The code that uses the index to produce an ordered cursor is as follows:
A1  =now()
B1  =file("orders.ctx").create()
A2  =B1.icursor@s(order_datetime,product_id,price;true,PriceIndex)  / use the index to return an ordered cursor
A3  for A2,10000         / traverse the cursor data
A4  =now()
B4  =interval@ms(A1,A4)
A5  =B1.cursor(order_datetime,product_id,price)
A6  =A5.sortx(price)     / out-of-memory sort
A7  for A6,10000         / traverse the cursor data
A8  =now()
B8  =interval@ms(A4,A8)
Traversing the ordered index cursor takes 4 seconds (the value of cell B4), while the out-of-memory sort takes 54 seconds (the value of cell B8).