Application of Parquet in Spark
Introduction to Parquet
Parquet is a columnar storage format designed for analytical workloads, jointly developed by Twitter and Cloudera. It graduated from the Apache Incubator and became a top-level Apache project in May 2015.
http://parquet.apache.org/
Spark support for Parquet
The version used here is Spark 2.0.1, released on October 3, 2016 and the latest release at the time of writing.
Spark reads and writes Parquet files natively. The example below follows the official documentation.
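A minimal Scala sketch of that example, using the paths and column names from the documentation:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ParquetExample").getOrCreate()

// load() uses Parquet as the default data source format
val usersDF = spark.read.load("examples/src/main/resources/users.parquet")

// Select two columns and save the result as Parquet; the output path is a directory
usersDF.select("name", "favorite_color").write.save("namesAndFavColors.parquet")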
In this example, Spark reads the Parquet file at examples/src/main/resources/users.parquet, selects a subset of the data, and saves the result to namesAndFavColors.parquet. Note that although the official documentation names the output path with a .parquet suffix, which can be mistaken for a single file, it is actually a directory, as you can confirm by trying it yourself.
Spark also supports converting JDBC data into Parquet files. In the following example, we convert Test Table 1 from SQL Server into Parquet files.
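A minimal sketch of the conversion, assuming a local SQL Server instance; the JDBC URL, credentials, table name, and output path are hypothetical placeholders rather than the values from the original experiment:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("JdbcToParquet").getOrCreate()

// Hypothetical SQL Server connection details; replace with your own instance and table
val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:sqlserver://localhost:1433;databaseName=TestDB")
  .option("dbtable", "dbo.TestTable1")
  .option("user", "sa")
  .option("password", "your_password")
  .load()

// Write the table out as Parquet; Spark compresses with snappy by default
jdbcDF.write.parquet("D:/data/TestTable1.parquet")

Running this requires the Microsoft JDBC driver for SQL Server on the Spark classpath (for example via --jars).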
The generated files are shown in the figure below. The test environment here is Spark local mode on Windows. You can see that the file names follow the pattern *.snappy.parquet, where snappy indicates the compression codec. Several codecs are available, but Spark compresses Parquet files with snappy by default.
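If a different codec is preferred, it can be selected through the spark.sql.parquet.compression.codec setting; a short sketch, with the output path again a hypothetical placeholder:

// snappy is the default codec in Spark 2.0; gzip, lzo, or uncompressed are also accepted
spark.conf.set("spark.sql.parquet.compression.codec", "gzip")
jdbcDF.write.parquet("D:/data/TestTable1_gzip.parquet")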
Let's take a look at the information for Test Table 1 in SQL Server, shown in the figure below:
You can see that this table has about 70 million rows and a size of 6.5 GB; after conversion, the Parquet data is 768 MB, about 11% of the original size, saving roughly 89% of the space. The whole conversion took about 11 minutes. For a big data platform, storage space is an important resource, and smaller files also greatly help network transmission; in distributed computing, the network can sometimes become the performance bottleneck.
Let's repeat the experiment with Test Table 2.
This table has about 250 million rows and a size of 9 GB; the Parquet output is 3.99 GB, saving about 56% of the space, and the conversion took about 17 minutes. The compression ratio differs because the size of a columnar file depends not only on the number of rows but also on the data itself; different data compresses at different ratios.
Spark SQL also supports querying Parquet files directly inside a SQL statement. Note that this syntax is a feature available since Spark 2.0; with it, we can skip creating a table and read the file's data directly.
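A minimal sketch of the syntax, reusing the users.parquet path from the earlier example:

// Query the Parquet file directly by path, without registering a table first
val df = spark.sql("SELECT * FROM parquet.`examples/src/main/resources/users.parquet`")
df.show()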
Performance
Let's talk about performance.
We tested the performance of Parquet in three scenarios. Instead of querying the Parquet files directly, we registered them as a view with the createOrReplaceTempView method and ran the test query shown below.
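A minimal sketch of that setup; the Parquet path and the view name test_table_1 are hypothetical placeholders:

// Read the converted Parquet data and register it as a temporary view
val logsDF = spark.read.parquet("D:/data/TestTable1.parquet")
logsDF.createOrReplaceTempView("test_table_1")

The query below (with Test Table 1 replaced by the view name) is then executed against this view with spark.sql(...).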
SELECT YEAR(LOGOUT_DT) YR, MONTH(LOGOUT_DT) MTH, Modename,
       SUM(WinCount + LoseCount + DrawCount) GameCount,
       SUM(GameTime) GameTime,
       SUM(GameTime) / SUM(WinCount + LoseCount + DrawCount) Avg_GameTime
FROM Test Table 1
WHERE LOGOUT_DT BETWEEN '2015-01-01' AND '2016-01-01'
GROUP BY YEAR(LOGOUT_DT), MONTH(LOGOUT_DT), Modename
LIMIT 1000
Let's take another look at a comparative experiment:
Environment | Time
Phoenix (PoC environment, 10 Aliyun nodes, cluster) | 110 s
Spark, local (8 GB RAM, 4 cores, i3-4170, standalone mode) | 52 s
Spark, 3 nodes (8 GB RAM, 4 cores, i3-4170, cluster) | 12 s
Spark, 5 nodes (8 GB RAM, 4 cores, i3-4170, cluster) | 12 s
Hive, ordinary storage, 5 nodes (8 GB RAM, 4 cores, i3-4170, cluster) | 133 s
Hive, columnar storage (Parquet), 5 nodes (8 GB RAM, 4 cores, i3-4170, cluster) | 43 s
Parquet improves query speed not only in Spark but also in Hive.
With the same machine configuration, the cluster computes faster than a single machine.
Increasing the number of computing nodes does not necessarily improve the computing speed.