Parquet Performance Testing, Tuning, and Optimization Suggestions

I. Why we chose Parquet

1. External factors for choosing Parquet

(1) We already run Spark clusters; Spark supports Parquet out of the box and uses it as its recommended (and default) storage format.

(2) Hive supports Parquet storage, and HiveSQL queries are fully compatible with it.

2. Intrinsic reasons for choosing Parquet

(1) Because the members of each column share the same type, Parquet can apply compression algorithms suited to each data type. CSV files are generally stored uncompressed, so storing the data as Parquet saves space effectively; without counting replication, we observed a compression ratio of nearly 27x. (Parquet offers four compression options: lzo, gzip, snappy, and uncompressed; the default gzip gives the highest compression ratio, while snappy is typically the fastest to compress and decompress.) A short Spark sketch follows this list.

(2) Queries do not need to scan all the data; only the columns involved in each query are read, which can cut I/O consumption by a large factor, and per-column statistics (min, max, etc.) are stored alongside the data.

(3) Partition filtering and column pruning: together with Spark, Parquet supports partition filtering (the filter and where operators of Spark SQL and RDDs). Column pruning means fetching only the required columns; the fewer columns requested, the faster the query.

(4) Because the members of each column share the same type, an encoding scheme better suited to the CPU pipeline can be used, reducing CPU cache misses.
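
To make the compression and column-pruning points concrete, here is a minimal Spark 1.6 (Scala) sketch; the table name c_port, the columns port_id and day_hour, and the output path are hypothetical placeholders, not part of the original test.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("parquet-demo"))
val sqlContext = new HiveContext(sc)

// Pick the Parquet compression codec (gzip is the Spark 1.x default;
// snappy trades some compression ratio for speed).
sqlContext.setConf("spark.sql.parquet.compression.codec", "gzip")

// Write an existing Hive table out as Parquet.
sqlContext.table("c_port").write.parquet("/tmp/c_port_parquet")

// Read it back: only the referenced columns are scanned (column pruning)
// and the where condition is applied as a filter.
val pruned = sqlContext.read.parquet("/tmp/c_port_parquet")
  .select("port_id", "day_hour")
  .where("day_hour = 10")
pruned.show(5)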

II. Parquet's columnar storage format (overview only)

On disk, a Parquet file is laid out as follows. All data is split horizontally into row groups; a row group contains the column chunks of every column for that slice of rows. A column chunk stores one column's data, namely its repetition levels, definition levels, and values, and is itself made up of pages; the page is the unit of compression and encoding and is transparent to the data model. A Parquet file ends with a footer, which stores the file's metadata and statistics. The row group is the unit of buffering for reads and writes, so a larger row group is recommended to allow larger sequential I/O, at the cost of more memory. In general, it is recommended to configure a row group size of 1 GB, an HDFS block size of 1 GB, and one block per HDFS file, as sketched below.
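
As a rough illustration of the 1 GB row-group and block-size recommendation, the following continues the earlier sketch; parquet.block.size and dfs.blocksize are the standard Parquet/HDFS property names, and whether they take effect depends on the writer and cluster configuration.

// Row-group size used by the Parquet writer (1 GB, per the recommendation above).
val oneGB = (1024L * 1024 * 1024).toString
sc.hadoopConfiguration.set("parquet.block.size", oneGB)

// HDFS block size for newly written files, so one file holds one block.
sc.hadoopConfiguration.set("dfs.blocksize", oneGB)

sqlContext.table("c_port").write.parquet("/tmp/c_port_parquet_1g")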

III. Parquet performance tests

(1) Column-read performance of ordinary files vs. Parquet files

① Test environment: the 58.56 machine, Spark 1.6, STS, Hive, etc.

② Test purpose: to compare Spark's performance when reading ordinary files and Parquet files. The expectation is that Parquet reads the same columns faster than an ordinary file, with the advantage shrinking as the number of columns read grows.

③ Test principle:

Column storage runs faster than row storage for some operations due to the following characteristics:

(1) Because the members of each column share the same type, a more efficient compression algorithm can be used for each data type.

(2) For the same reason, an encoding scheme better suited to the CPU pipeline can be used, reducing CPU cache misses.

④ Test steps

(1) Create a Hive table from the C_PORT data, and also create C_PORT_PARQUET using STORED AS PARQUET so the table is stored in Parquet format.

(2) Write Spark read statements that query a given number of columns.

(3) Increase the number of columns read, submit the Spark job on the machine, and record the running time.

(4) Compare the running times and draw a conclusion (a sketch of these steps follows).
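
A minimal sketch of steps (1)-(3), reusing the sqlContext from the earlier sketch; the column names col1 ... col5 stand in for whichever subset of C_PORT columns is being timed.

// Step (1): create the Parquet copy of the table via a HiveQL CTAS.
sqlContext.sql(
  """CREATE TABLE c_port_parquet STORED AS PARQUET
    |AS SELECT * FROM c_port""".stripMargin)

// Steps (2)-(3): run the same column selection against both tables and time it.
def timeQuery(sql: String): Long = {
  val start = System.currentTimeMillis()
  // Go through the RDD of rows so the projected columns are actually materialized.
  sqlContext.sql(sql).rdd.count()
  System.currentTimeMillis() - start
}

val plainMs   = timeQuery("SELECT col1, col2, col3, col4, col5 FROM c_port")
val parquetMs = timeQuery("SELECT col1, col2, col3, col4, col5 FROM c_port_parquet")
println(s"plain: $plainMs ms, parquet: $parquetMs ms")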

⑤ Test results

Results for the request table with about 270 million rows (27005w):

Columns read | Ordinary Hive table | Parquet table
1 column     | 2 min 53 s          | 2 min 42 s
5 columns    | 3 min 53 s          | 1 min 27 s
20 columns   | 5 min 58 s          | 3 min 56 s
35 columns   | 9 min 16 s          | 9 min 36 s
50 columns   | 13 min 19 s         | 8 min 11 s

⑥ Conclusion

Across the five column-read tests above, read time increases with the number of columns read. Parquet's read times are comparable to or better than the ordinary Hive table's in most cases, so when tables have many columns it is recommended to store them as Parquet to improve read efficiency.

(2) Column-computation efficiency of Parquet storage

① Test environment: the 58.56 machine, Spark 1.6, STS, Hive, etc.

② Test purpose: to compare Spark's performance reading ordinary files and Parquet files, verifying that columnar storage performs better, i.e., reads and computations complete faster.

③ Test principle:

Column storage runs faster than row storage for some operations due to the following characteristics:

(1) Queries do not need to scan all the data; only the columns involved in each query are read, which can cut I/O consumption by a large factor. In addition, per-column statistics (min, max, etc.) are stored, enabling partial predicate pushdown (see the sketch after this list).

(2) Because the members of each column share the same type, a more efficient compression algorithm can be used for each data type, further reducing I/O.

(3) For the same reason, an encoding scheme better suited to the CPU pipeline can be used, reducing CPU cache misses.
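
As a small illustration of predicate pushdown, the sketch below reuses the earlier sqlContext; spark.sql.parquet.filterPushdown is a real Spark 1.6 setting (enabled by default), while the path and column names are placeholders.

// With filter pushdown enabled, row groups whose min/max statistics cannot
// satisfy the predicate can be skipped by the Parquet reader.
sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")

val hits = sqlContext.read.parquet("/tmp/c_port_parquet")
  .where("day_hour >= 8 AND day_hour < 10")
  .select("port_id")
hits.explain()        // the physical plan lists the pushed-down filters
println(hits.count())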

④ Test steps

(1) Create a Hive table from the C_PORT data, and also create C_PORT_PARQUET using STORED AS PARQUET so the table is stored in Parquet format.

(2) Write Spark read statements that perform column calculations: sum, avg, max, and min.

(3) Submit the Spark jobs on the machine and record the running times.

(4) Compare the running times and draw a conclusion (a sketch of such a query follows).
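
A sketch of the kind of aggregation used in this test, again reusing the earlier sqlContext; the grouping column day_hour and the measure columns in_bytes, out_bytes, and delay_ms are hypothetical stand-ins for the real C_PORT columns.

val aggSql =
  """SELECT day_hour,
    |       SUM(in_bytes)  AS sum_in,
    |       SUM(out_bytes) AS sum_out,
    |       AVG(delay_ms)  AS avg_delay,
    |       MAX(in_bytes)  AS max_in,
    |       MIN(in_bytes)  AS min_in
    |FROM c_port_parquet
    |GROUP BY day_hour""".stripMargin

val start = System.currentTimeMillis()
sqlContext.sql(aggSql).collect()   // small result: one row per hour
println(s"aggregate query took ${System.currentTimeMillis() - start} ms")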

⑤ Test results

Group 1:

Request table with about 270 million rows (27005w), grouped by hour of day, with 2 sums and 3 averages.

Test results:

Run     | Ordinary Hive table | Parquet table
Run 1   | 2 min 14 s          | 1 min 37 s
Run 2   | 2 min 24 s          | 1 min 08 s
Run 3   | 2 min 27 s          | 1 min 36 s
Average | 2 min 33 s          | 1 min 27 s

Group 2:

Request table with about 270 million rows (27005w), grouped by hour of day, with 2 sums, 3 averages, 2 maximums, and 2 minimums.

Test results:

Run     | Ordinary Hive table | Parquet table
Run 1   | 2 min 22 s          | 1 min 38 s
Run 2   | 2 min 58 s          | 1 min 51 s
Run 3   | 2 min 31 s          | 1 min 38 s
Average | 2 min 37 s          | 1 min 42 s

Group 3:

Request table with about 270 million rows (27005w), grouped by hour of day, with 4 sums, 4 averages, 4 maximums, and 4 minimums.

Test results:

Run     | Ordinary Hive table | Parquet table
Run 1   | 3 min 03 s          | 1 min 58 s
Run 2   | 2 min 45 s          | 2 min 03 s
Run 3   | 2 min 48 s          | 2 min 06 s
Average | 2 min 52 s          | 2 min 02 s

⑥ Conclusion

Comparing the three groups of results, the columnar Parquet format has a clear advantage over ordinary row storage for column computations, improving execution efficiency by roughly 30-40%.

(3) Compression efficiency of ordinary files vs. Parquet files

① Test environment: the 58.56 machine, Spark 1.6, STS, Hive, etc.

② Test purpose: to compare the compression efficiency of ordinary files and Parquet files. When the same data is stored, Parquet files should compress better and occupy less space.

③ Test principle:

(1) Because the members of each column share the same type, a more efficient compression algorithm can be used for each data type.

(2) For the same reason, an encoding scheme better suited to the CPU pipeline can be used, reducing CPU cache misses.

④ Test steps

(1) Run the same Spark SQL job with different storage formats, generating a Parquet file and an ordinary file that contain the same data.

(2) Check the sizes of the generated Parquet file and the ordinary file, and compare them (a sketch follows).
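
A sketch of the size check in step (2), using the Hadoop FileSystem API; the two output directories are assumed to have been produced by the same job in text and Parquet form.

import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(sc.hadoopConfiguration)
def sizeOf(dir: String): Long = fs.getContentSummary(new Path(dir)).getLength

val plainBytes   = sizeOf("/tmp/c_port_text")
val parquetBytes = sizeOf("/tmp/c_port_parquet")
println(f"plain: ${plainBytes / 1e9}%.1f GB, parquet: ${parquetBytes / 1e9}%.1f GB, " +
        f"saved: ${100.0 * (plainBytes - parquetBytes) / plainBytes}%.1f%%")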

⑤ Test results

The result is as follows:

In the final run, the ordinary files totaled 12.6 GB while the Parquet files totaled 3.6 GB, a reduction of about 70%, so Parquet storage takes up considerably less space.

IV. Suggestions for applying Parquet in real projects

(1) When queries do not read all columns, Parquet storage is recommended (use STORED AS PARQUET when creating the table).

(2) For column or vectorized computations, Parquet storage is also recommended, as it improves execution efficiency.

(3) For data that needs to be backed up and archived, compressed Parquet files can be used, which saves space effectively and compresses quickly (a sketch follows).
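
A minimal sketch of suggestions (1) and (3) combined, reusing the earlier sqlContext; table names and the backup path are placeholders.

// (1) Create the table in Parquet format.
sqlContext.sql(
  """CREATE TABLE IF NOT EXISTS c_port_parquet STORED AS PARQUET
    |AS SELECT * FROM c_port""".stripMargin)

// (3) Keep a compressed Parquet copy as a backup.
sqlContext.setConf("spark.sql.parquet.compression.codec", "gzip")
sqlContext.table("c_port").write.mode("overwrite").parquet("/backup/c_port")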
