SLTechnology News&Howtos > Servers — Shulou (Shulou.com), updated 2025-03-30
This article looks at a simple way to performance-test GreenPlum. The method described here is simple, fast, and practical, so let's walk through it.
I. Brief introduction to TPC-H
TPC-H is a decision-support benchmark published by the TPC (Transaction Processing Performance Council). It measures a database's overall processing capability by simulating complex, business-oriented queries and concurrent data modifications, and reports query response times and the number of queries executed per hour (QphH@Size).
The TPC-H benchmark defines a database model whose size can be chosen from eight scale levels between 1 GB and 10,000 GB. The schema consists of 8 tables (CUSTOMER, LINEITEM, NATION, ORDERS, PART, PARTSUPP, REGION, and SUPPLIER), and the workload comprises 22 complex SELECT queries plus 2 refresh streams containing INSERT and DELETE steps.
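A minimal timing harness for such a benchmark can be sketched as follows. This is my own illustration, not the official TPC-H kit: the query texts and the in-memory sqlite3 target are placeholders standing in for the 22 real queries and the database under test.

```python
import sqlite3
import time

# Placeholder stand-ins for the 22 TPC-H query streams (assumption:
# in a real run these would be the actual Q1..Q22 texts).
queries = {
    "Q1": "SELECT 1;",
    "Q2": "SELECT 2;",
}

conn = sqlite3.connect(":memory:")  # placeholder for the DB under test
timings = {}
for name, sql in queries.items():
    start = time.perf_counter()
    conn.execute(sql).fetchall()          # run and fully fetch the query
    timings[name] = time.perf_counter() - start

total = sum(timings.values())
# Crude queries-per-hour figure from the measured response times
qph = len(queries) * 3600 / total if total > 0 else float("inf")
print(f"total: {total:.4f}s, queries/hour: {qph:.0f}")
```

The real QphH@Size metric is a composite of power and throughput runs; this sketch only captures the response-time side.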
II. Purpose
1. Compare the performance of distributed GreenPlum against single-server MySQL on a TPC-H-style workload under equivalent resources.
2. Analyze the reasons for the performance differences between the two databases.
III. Test environment and configuration
Test environment: Tencent Cloud
Test subjects: GreenPlum and MySQL. The MySQL server was configured as follows:

Operating system:          CentOS 6.7 64-bit
CPU:                       Intel(R) Xeon(R) CPU E5-26xx v3, 8 cores
Memory:                    24 GB
Public network bandwidth:  100 Mbps
IP:                        123.207.228.51
Version:                   MySQL 5.6

Table 2: MySQL server configuration
IV. Test data volumes

Table      Rows
customer   150000
lineitem   6001215
nation     25
orders     1500000
part       200000
partsupp   800000
region     5
supplier   10000

Table 3: Row counts of the test tables
V. Execution time statistics

Query   GreenPlum (s)   MySQL (s)
Q1      4.01            12.66
Q2      0.50            3.27
Q3      1.35            5.06
Q4      0.11            0.01
Q5      0.19            27.29
Q6      0.01            2.50
Q7      6.06            10.79
Q8      1.46            39.78
Q9      4.00            > 12 hours
Q10     0.14            4.74
Q11     0.30            7.90
Q12     0.08            2.35
Q13     1.04            > 12 hours
Q14     0.04            9.37
Q15     0.07            4.76
Q16     0.51            2.90
Q17     3.21            48697.95
Q18     14.23           > 12 hours
Q19     0.95            23.12
Q20     0.16            > 12 hours
Q21     7.23            > 12 hours
Q22     0.68            540.22

Table 4: Execution times of the 22 SQL queries
VI. Performance comparison and analysis
The execution time statistics show that the two databases differ greatly on the TPC-H tests. Below we pick two representative queries and analyze why GreenPlum and MySQL perform so differently on them.
Example one
We choose Q3. The execution time statistics show that GreenPlum runs it roughly 4 times faster than MySQL. First, look at the Q3 statement, shown in Figure 1 below.
Figure 1: The Q3 statement
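Since the figure is an image, here is the shape of Q3 illustrated on a toy sqlite3 dataset (my own made-up rows, not TPC-H data): a three-way join of customer, orders, and lineitem, grouped and ordered by a computed revenue with LIMIT 10.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Miniature versions of the three tables involved in Q3 (toy data)
conn.executescript("""
CREATE TABLE customer (c_custkey INT, c_mktsegment TEXT);
CREATE TABLE orders   (o_orderkey INT, o_custkey INT, o_orderdate TEXT,
                       o_shippriority INT);
CREATE TABLE lineitem (l_orderkey INT, l_extendedprice REAL,
                       l_discount REAL, l_shipdate TEXT);
INSERT INTO customer VALUES (1, 'BUILDING'), (2, 'MACHINERY');
INSERT INTO orders   VALUES (10, 1, '1995-03-01', 0), (11, 2, '1995-03-02', 0);
INSERT INTO lineitem VALUES (10, 1000.0, 0.1, '1995-03-20'),
                            (11, 2000.0, 0.0, '1995-03-21');
""")
rows = conn.execute("""
SELECT l_orderkey,
       SUM(l_extendedprice * (1 - l_discount)) AS revenue,
       o_orderdate, o_shippriority
FROM customer, orders, lineitem
WHERE c_mktsegment = 'BUILDING'
  AND c_custkey = o_custkey
  AND l_orderkey = o_orderkey
  AND o_orderdate < '1995-03-15'
  AND l_shipdate  > '1995-03-15'
GROUP BY l_orderkey, o_orderdate, o_shippriority
ORDER BY revenue DESC, o_orderdate
LIMIT 10
""").fetchall()
print(rows)  # only customer 1's order qualifies: [(10, 900.0, '1995-03-01', 0)]
```

The revenue of 900.0 is 1000 × (1 − 0.1); the MACHINERY customer is filtered out.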
Then run explain on Q3 in each database; the results are shown in Figures 2 and 3 respectively.
Figure 2: GreenPlum explain output for Q3
Figure 3: MySQL explain output for Q3
From the explain output above, the main execution steps on GreenPlum are:
Apply the filter conditions on all segments (4 in this case) simultaneously.
To join two tables, broadcast the data: each segment broadcasts its filtered rows to every other segment, so each segment ends up with the full filtered result of the table; a hash follows.
Perform the hash join on all segments simultaneously; because a further join with another table follows, the results are again broadcast to all segments.
Perform the GROUP BY aggregation: each segment first runs a HashAggregate on the grouping keys (to reduce the volume of data to be redistributed), then the partial results are redistributed by the grouping keys, and finally each segment aggregates again to produce the final groups.
Sort on all segments simultaneously according to the ORDER BY clause and apply the LIMIT; here it is LIMIT 10, so each segment picks its top 10 rows to send to the master.
Merge: every segment sends its rows to the master, which merges them once and returns the first 10 rows according to the LIMIT.
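The two-phase aggregation and the limit-then-merge pattern in the steps above can be sketched in plain Python. This is a toy single-process simulation of the idea, not GreenPlum code; the data, key function, and segment count are all made up.

```python
from collections import defaultdict
import heapq

NSEG = 4
# Toy data spread across 4 "segments": (group_key, value) pairs
segments = [[(k % 7, 1.0) for k in range(s, 100, NSEG)] for s in range(NSEG)]

# Phase 1: local HashAggregate on every segment (reduces rows before motion)
local = []
for seg in segments:
    agg = defaultdict(float)
    for key, val in seg:
        agg[key] += val
    local.append(agg)

# Phase 2: redistribute partial results so each group key lands on one segment
redistributed = [defaultdict(float) for _ in range(NSEG)]
for agg in local:
    for key, val in agg.items():
        redistributed[hash(key) % NSEG][key] += val

# Phase 3: each segment sorts its final groups and keeps only its top 3;
# the master merges the small candidate lists and keeps the global top 3
candidates = []
for agg in redistributed:
    candidates.extend(heapq.nlargest(3, agg.items(), key=lambda kv: kv[1]))
top3 = heapq.nlargest(3, candidates, key=lambda kv: kv[1])
print(top3)
```

Each segment ships at most 3 rows to the master, mirroring how LIMIT 10 keeps the final merge cheap regardless of the total group count.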
The main time sinks in this process are:
Two broadcasts, moving 30178 + 144314 = 174492 rows in total (about 170,000).
One redistribution on the GROUP BY keys, about 80,000 rows.
Two hash joins, each between two tables; on a single segment the row counts involved are roughly 180,000 against 30,000, and 840,000 against 140,000.
One sort, in which each segment picks the top 10 rows out of about 80,000.
MySQL's execution is comparatively simple. It first applies the WHERE filter to the lineitem table and computes the ORDER BY value. Because the ORDER BY key is revenue, a computed (non-indexed) value, MySQL must sort 3,271,974 rows (about 3.2 million), which is very time-consuming. It then applies the remaining WHERE filters to the orders and customer tables.
From this execution process, the main time sink should be the sort. GreenPlum sorts about 80,000 rows per segment on all segments at once, while MySQL sorts 3.2 million rows outright. Since MySQL runs on a single server whose specs (8-core CPU, 24 GB RAM) far exceed those of a single GreenPlum segment (1 CPU core, 4 GB RAM), one might expect that fully utilizing the server would make the two sort times comparable. Is that really the case? Let's look at the CPU usage on the MySQL server; the readings before and after executing Q3 are shown in Figure 4:
Figure 4: CPU usage on the MySQL server before and after executing Q3
Before and after Q3, only CPU6's usage time changed significantly, by about 500 jiffies, i.e. roughly 5 seconds, which is essentially the total SQL execution time (5.06 seconds). So while executing Q3, the MySQL server used only one CPU core for computation.
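The jiffy-to-seconds conversion above can be checked with a line of arithmetic, assuming the common HZ=100 tick rate (1 jiffy = 10 ms); the actual HZ value depends on the kernel build.

```python
HZ = 100            # assumed kernel tick rate (jiffies per second)
jiffies_delta = 500  # observed change in CPU6's usage time
cpu_seconds = jiffies_delta / HZ
print(cpu_seconds)   # 5.0, close to the 5.06 s query time
```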
To sum up, the time difference between MySQL and GreenPlum here comes down mainly to the sort. MySQL sorts 3.2 million rows but can use only a single CPU core, so it cannot exploit the server's multi-core CPU and takes longer overall. GreenPlum's distributed architecture assigns one CPU core per segment; the data is spread evenly across segments, and Filter, hash join, GROUP BY, sort, and other operations run in parallel on every node, making full use of each CPU. Results are then broadcast, or redistributed and recomputed, and the master finally merges each segment's results. Broadcasting and redistribution move data between segment nodes and cost some time, but because GreenPlum optimizes the SQL better and computes in parallel, its total execution time is shorter than MySQL's.
Example two
Let's choose another representative query, Q17. The execution time statistics show MySQL taking about 15,000 times as long as GreenPlum, a remarkable gap. What causes such a large difference? First, look at the Q17 statement in Figure 5 below.
Figure 5: The Q17 statement
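Since this figure is also an image, here is the shape of Q17 on a toy sqlite3 dataset (my own made-up rows): a join of lineitem and part with a correlated subquery computing 0.2 × the average quantity per part.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Miniature part and lineitem tables (toy data)
conn.executescript("""
CREATE TABLE part     (p_partkey INT, p_brand TEXT, p_container TEXT);
CREATE TABLE lineitem (l_partkey INT, l_quantity REAL, l_extendedprice REAL);
INSERT INTO part VALUES (1, 'Brand#23', 'MED BOX');
INSERT INTO lineitem VALUES (1, 1.0, 100.0), (1, 10.0, 500.0), (1, 10.0, 500.0);
""")
avg_yearly = conn.execute("""
SELECT SUM(l_extendedprice) / 7.0
FROM lineitem, part
WHERE p_partkey = l_partkey
  AND p_brand = 'Brand#23'
  AND p_container = 'MED BOX'
  AND l_quantity < (SELECT 0.2 * AVG(l_quantity)
                    FROM lineitem
                    WHERE l_partkey = p_partkey)
""").fetchone()[0]
print(avg_yearly)
```

Here the average quantity for part 1 is 7.0, so the threshold is 1.4 and only the 1.0-quantity row qualifies, giving 100.0 / 7.0. The subquery quoted in the text below is the common rewrite of this correlated form into a single GROUP BY.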
Unlike Q3, Q17 contains a subquery. As before, we run explain on MySQL and GreenPlum; the results are shown in Figures 6 and 7.
Figure 6: GreenPlum explain output for Q17
Figure 7: MySQL explain output for Q17
The subquery (select l_partkey as agg_partkey, 0.2 * avg (l_quantity) as avg_quantity from lineitem group by l_partkey) involves a GROUP BY. Let's compare how the two databases aggregate:
MySQL: because the GROUP BY key is not indexed, it filesorts the entire lineitem table (6 million rows).
GreenPlum: each segment first runs a GROUP BY on l_partkey over its local rows (about 1.5 million per segment), using the more efficient HashAggregate. So that all segments can then join in parallel, the aggregated lineitem data is redistributed once (about 50,000 rows), each segment receiving the rows hashed to it.
Clearly, MySQL's aggregation is markedly less efficient than GreenPlum's.
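The two aggregation strategies can be contrasted with a toy implementation of mine (not either database's actual code): a hash aggregate makes one O(n) pass over the rows, while a sort-based (filesort-style) aggregate must first order all rows, an O(n log n) step.

```python
from collections import defaultdict
from itertools import groupby

# Toy (l_partkey, l_quantity) rows
rows = [(pk % 5, float(pk)) for pk in range(1000)]

def hash_aggregate(rows):
    """One pass, no sort: accumulate sums/counts per key in a hash table."""
    sums, counts = defaultdict(float), defaultdict(int)
    for key, qty in rows:
        sums[key] += qty
        counts[key] += 1
    return {k: 0.2 * sums[k] / counts[k] for k in sums}

def sort_aggregate(rows):
    """Filesort-style: sort all rows first, then scan adjacent groups."""
    out = {}
    for key, grp in groupby(sorted(rows), key=lambda r: r[0]):
        qtys = [q for _, q in grp]
        out[key] = 0.2 * sum(qtys) / len(qtys)
    return out

assert hash_aggregate(rows) == sort_aggregate(rows)  # same answer, different cost
```

Both compute the subquery's 0.2 × avg(l_quantity) per key; only the access pattern differs, which is where the 6-million-row filesort hurts MySQL.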
Next, the subquery result is joined with the outer table, so let's compare the two on the join:
MySQL: it joins the subquery result, materialized as a temporary table (200,000 rows), directly against lineitem (6 million rows), producing 6,000,000 × 200,000 = 1.2 trillion row pairs.
GreenPlum: the SQL is optimized first. The WHERE condition is applied up front, shrinking the part table to about 260 rows per segment (4 × 260 rows in total), so the per-segment data volume is small. It then uses the more efficient join method, the hash join. Hash-joining the temporary table directly against lineitem would still produce about 500,000 rows, but GreenPlum does not do that; instead it joins through the part table, because the WHERE-filtered part table is tiny and any hash join involving it stays small. Two hash joins are performed in total:
The part table joined with the temporary table part_agg, producing 246 rows.
The part table joined with the lineitem table, producing 2598 rows.
Comparing the two, GreenPlum's join volume is (246 + 2598) × 4 = 11376 rows, vastly less than MySQL's 1.2 trillion; the performance difference speaks for itself.
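The arithmetic behind that comparison, using the row counts quoted above (the grouping of the garbled totals in the source is my reading):

```python
# MySQL: naive join pairs every lineitem row with every temp-table row
mysql_pairs = 6_000_000 * 200_000        # 1.2 trillion

# GreenPlum: two small hash joins per segment, across 4 segments
gp_rows_per_segment = 246 + 2598         # part x part_agg, part x lineitem
gp_total = gp_rows_per_segment * 4       # 11376 rows in total
print(mysql_pairs, gp_total)
```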
In summary, when executing Q17, the efficiency gap between MySQL and GreenPlum comes not only from GreenPlum's parallel computation but also from its optimization of the aggregation and join operations.
At this point you should have a deeper understanding of this simple approach to performance-testing GreenPlum. Try it out in practice, and follow us to keep learning!