When the data volume is huge, fetching the record set that corresponds to a large batch of random key values tests not only the database software itself but also the programmer's understanding of the data. How can performance be maximized with limited hardware resources?
This test focuses on the batch random key lookup performance of the aggregator's group table index and compares it with Oracle at the same data scale.
I. Test environment
II. Data description
2.1 Data structure
2.2 Data scale
Based on the above data structure, a row-storage group table file of 600 million records and the corresponding index file are created:
III. Testing process
3.1 Generating the test files
3.1.1 Creating the group table
A1: a string containing 26 English letters and 10 Arabic numerals.
A2, A3: create a group table file with the structure (id,data); the @r option indicates row storage.
A4: loop 6,000 times; the loop body (B4 and B5) generates 100,000 records of the corresponding structure on each pass and appends them to the group table file.
After execution, the group table file is generated: id_600m.ctx
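A rough SPL sketch of these cells could look as follows; the exact cell layout, treating id as the dimension column (#id), the 180-byte data field and the use of rands() to generate it are all assumptions:
A1  ="abcdefghijklmnopqrstuvwxyz0123456789"
A2  =file("id_600m.ctx")
A3  =A2.create@r(#id,data)
A4  for 6000
B4      =to((A4-1)*100000+1,A4*100000).new(~:id,rands(A1,180):data)
B5      >A3.append(B4.cursor())
Each pass of the loop builds 100,000 sequential ids with random string data and appends them to the group table.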
3.1.2 Creating the index
A2: build the index of the group table on its id column.
After execution, the index file of the group table is generated: id_600m.ctx__id_idx
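A corresponding sketch of the index-building script (the open() call in A1 is assumed):
A1  =file("id_600m.ctx").open()
A2  =A1.index(id_idx;id)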
3.2 Query test
A2: loop 10,000 times, each time taking a random value from the range of the group table's id column, then sort the resulting key set. (There may be a few duplicate values, but this has little effect on the test.)
A4: in the group table's icursor() function, the index id_idx is used to filter the group table with the condition A2.contain(id). The aggregator automatically recognizes that the A2.contain(id) condition can use this index, sorts the contents of A2, and looks the keys up in order.
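A sketch of the query script under the same assumptions (consecutive integer ids from 1 to 600,000,000, timing via now()/interval()):
A1  =file("id_600m.ctx").open()
A2  =10000.(rand(600000000)+1).sort()
A3  =now()
A4  =A1.icursor(;A2.contain(id),id_idx).fetch()
A5  =interval@ms(A3,now())
A5 holds the elapsed time of the batch lookup in milliseconds.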
3.3 A strange phenomenon
The original plan was to run the test several times and take the average as the result. However, each execution of the test code turned out to be faster than the previous one. Here are the timings of the first 5 test queries, starting from the first execution of the code:
Manually clicking the execute button in the designer and recording the query time is too laborious, so to find the pattern the code was changed to the following form:
B7: record the time taken by each icursor() query in the loop body into A1.
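A sketch of that looped form; the loop count of 1,000 and the structure of the timing table are assumptions:
A1  =create(no,ms)
A2  =file("id_600m.ctx").open()
A3  for 1000
B3      =10000.(rand(600000000)+1).sort()
B4      =now()
B5      =A2.icursor(;B3.contain(id),id_idx).fetch()
B6      =interval@ms(B4,now())
B7      >A1.insert(0,A3:no,B6:ms)
After the run, A1 contains one (no, ms) record per query, which is how the trend in query time was observed.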
During execution, the time of each newly appended query in A1 was compared with the previous one. After about 350 loops the time approached a limit of roughly 25 seconds, and over the following nearly one thousand loops it stayed basically stable at that level.
Was the aggregator caching the data? To check, the aggregator designer was restarted and the query code executed again. The first query after the restart still took about 25 seconds, so the speed-up apparently has no direct connection to the aggregator itself.
On the other hand, since the query on this data volume completes in a fairly short time, part of the data may already have been loaded into memory, which points to the Linux file cache as the likely cause of the phenomenon. After the server was restarted and the query run again through the aggregator designer, the elapsed time started at about 80 seconds and decreased from there.
In further testing, the Linux free command was used to watch system memory usage. Each time a group table query completed, the cached item had grown, and as cached grew the query time gradually fell.
3.4 The use of index@3
After looking up some information, I learned that Linux keeps a file cache in memory, usually called Cache Memory. To see the cached item with the free command, execute free -h:
When files are read and written, the Linux kernel caches them in memory to improve read/write efficiency and speed; this memory is Cache Memory. It is not released automatically even after the program finishes, so after frequent file reads and writes on a Linux system there may appear to be very little physical memory available. In fact this cache is freed automatically when memory is needed, so there is no need to worry about running out of memory. There is also a way to release Cache Memory manually, which is not covered here.
The test involves 111 GB of data, more than the machine's 64 GB of physical memory, so obviously not all of it can be cached. Which data, then, should be cached to improve query performance reliably? And can the required data be cached in advance so that high performance is reached from the start? After consulting an expert, I learned that there is an option for pre-loading the index's cache. Before querying with the icursor() function, T.index@2(idx) and T.index@3(idx) were applied to the group table index. The code is as follows:
The aggregator's index has a hierarchical cache, and @3 means the third-level cache of the index is loaded into memory in advance. With the index@3 pre-processing, even the first query reaches the limit time that previously required hundreds of queries to approach.
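A sketch of the pre-loading step (only the @3 variant is shown; @2 would load the second-level cache instead):
A1  =file("id_600m.ctx").open()
A2  =A1.index@3(id_idx)
A3  =10000.(rand(600000000)+1).sort()
A4  =A1.icursor(;A3.contain(id),id_idx).fetch()
A2 loads the index cache into memory once; the queries that follow then run at the stabilized speed from the start.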
IV. Comparison with Oracle
The test environment, data structure and scale are the same as above, and the test objects are as follows:
The Oracle table statement is as follows:
create table ctx_600m (id number(13), data varchar2(200));
After the aggregator generates the data as a text file of the same structure, it is imported into the table with Oracle's SQL*Loader.
The Oracle indexing statement is:
Create unique index idx_id_600m on ctx_600m (id)
When using Oracle for batch random fetch testing, we use a SQL like this:
Select * from ctx_600m where id in (…)
The aggregator script for querying using a single-threaded connection to Oracle is:
Because Oracle limits an IN list to 1,000 expressions, the script executes the query in batches and merges the results.
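A sketch of the single-threaded script; the data source name "oracle", the batch size of 1,000 and the SQL string construction are assumptions:
A1  =connect("oracle")
A2  =10000.(rand(600000000)+1).sort()
A3  =now()
A4  =A2.group((#-1)\1000)
A5  =A4.conj(A1.query("select * from ctx_600m where id in ("+~.concat@c()+")"))
A6  =interval@ms(A3,now())
A7  >A1.close()
A4 splits the key set into groups of 1,000, and A5 runs one IN query per group and concatenates the result sets.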
The aggregator script that uses 10 threads to connect to Oracle for query is:
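A sketch of the multi-threaded version, splitting the key set into 10 parts with fork (thread count, connection handling and batching are assumptions):
A1  =10000.(rand(600000000)+1).sort()
A2  =now()
A3  fork A1.group((#-1)%10)
B3      =connect("oracle")
B4      =A3.group((#-1)\1000)
B5      =B4.conj(B3.query("select * from ctx_600m where id in ("+~.concat@c()+")"))
B6      >B3.close()
B7      return B5
A4  =A3.conj()
A5  =interval@ms(A2,now())
Each forked thread opens its own connection, queries its share of the keys in batches of 1,000, and A4 merges the per-thread results.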
The aggregator script for querying the row-storage group table with a single thread is:
The aggregator script for querying the row-storage group table with 10 threads is:
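The single-threaded group table script is essentially the indexed query sketched in section 3.4; a sketch of the 10-thread variant, assuming the opened group table can be shared across the forked threads:
A1  =file("id_600m.ctx").open()
A2  =A1.index@3(id_idx)
A3  =10000.(rand(600000000)+1).sort()
A4  =now()
A5  fork A3.group((#-1)%10)
B5      return A1.icursor(;A5.contain(id),id_idx).fetch()
A6  =A5.conj()
A7  =interval@ms(A4,now())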
Taking a batch of 10,000 random key values from 600 million records, the test results with the row-storage index are:
V. Column storage index test
The aggregator's column storage uses segmented and compressed data, so traversal reads less data and performs better. For random access through an index, however, there is an extra decompression step and each fetch targets a whole block, so the operation is considerably more complex. In principle, performance here should therefore be worse than with row storage.
In the code above, the create() function that generates the group table produces a column-storage file when the @r option is omitted. Repeating the earlier test, fetching 10,000 rows out of 600 million with a single thread takes 129,120 milliseconds, more than 6 times slower than row storage. Even so, the average is only about 13 milliseconds per row, which is still practical for most single-record lookup scenarios.
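For reference, the only change against the script in 3.1.1 is dropping @r from create(); the separate file name is assumed:
A1  =file("id_600m_col.ctx")
A2  =A1.create(#id,data)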
The same copy of data cannot deliver optimal performance for both traversal and random access, so in practice a choice has to be made according to requirements. When the limit performance of both kinds of operation is needed, the data may have to be kept in multiple copies.
VI. Index redundancy mechanism
The aggregator does provide a redundant index mechanism that can be used to improve the random access performance of column-storage data, as follows:
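A sketch of building the redundant index on the column-storage group table, following the index() syntax described below (the file name is assumed):
A1  =file("id_600m_col.ctx").open()
A2  =A1.index(id_idx;id;data)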
When indexing a group table, if the index() function is given data column name parameters, such as data in cell A2 of this example, those data columns are copied into the index while it is built. With multiple data columns this is written as: index(id_idx;id;data1,data2,...)
Because of this redundancy, the index file is naturally larger. The file sizes of the column-storage group table and the redundant index tested in this article are as follows:
Once the data columns are copied into the index, the original data file is no longer accessed during reads.
Taking 10,000 random keys from 600 million records, the complete comparative test results are: