This article introduces the five major reasons for choosing Parquet for Spark SQL. The content is quite detailed, and I hope interested readers will find it helpful.
Column (columnar) storage can save significant time and space when dealing with big data. For example, Parquet improves Spark SQL performance by an average of 10x compared with text, thanks to low-level reader filters, efficient execution plans, and the improved scan throughput in Spark 1.6.0. Below is a detailed look at the five reasons why Parquet is advantageous for Spark SQL.
To see how powerful Parquet is, we selected 24 TPC-DS-derived queries from spark-sql-perf for the comparison (out of 99 queries in total; some could not run against flat CSV data files at the 1TB scale; see below for more details). These queries represent all the categories in TPC-DS: reporting, ad hoc, iterative, and data mining. We also made sure to include both short queries (queries 12 and 91) and long-running queries (queries 24a and 25), as well as a well-known query that uses 100% of CPU (query 97).
We used a 6-node on-premises Cisco UCS cluster, configured similarly to the Cisco Validated Designs. We tuned the underlying hardware so that there were no network or disk IO bottlenecks in any test. The focus of this article is to understand the performance difference between running these queries against text and Parquet storage formats, on Spark 1.5.1 and on the just-released Spark 1.6.0. Total Spark working memory is 500 GB, and the TPC-DS scale factor is 1TB.
1. Spark SQL is faster with Parquet!
The following figure compares the sum of the execution times of the 24 queries in Spark 1.5.1. With flat CSV files, the queries took about 12 hours to complete; with Parquet, they took less than one hour, an 11x improvement in performance.
Figure 1. Total query time with text and with Parquet (in seconds); smaller is better.
2. Spark SQL performs better at large scale with Parquet
Problems caused by a poor choice of storage format are often hard to diagnose and fix. For example, with flat CSV files at the 1TB scale, at least one of the runnable queries could not complete, while with Parquet the same queries all completed.
Some of the errors and exceptions are quite cryptic. Here are three examples:
Error example 1:
WARN scheduler.TaskSetManager: Lost task 145.0 in stage 4.0 (TID 4988, rhel8.cisco.com): FetchFailed(BlockManagerId(2, rhel4.cisco.com, 49209), shuffleId=13, mapId=47, reduceId=145, message= org.apache.spark.shuffle.FetchFailedException: java.io.FileNotFoundException: /data6/hadoop/yarn/local/usercache/spark/appcache/application_1447965002296_0142/blockmgr-44627d4c-4a2b-4f53-a471-32085a252cb0/15/shuffle_13_119_0.index (No such file or directory) at java.io.FileInputStream.open0(Native Method) at java.io.FileInputStream.open(FileInputStream.java:195)
Error example 2:
WARN scheduler.TaskSetManager: Lost task 1.0 in stage 13.1 (TID 13621, rhel7.cisco.com): FetchFailed(null, shuffleId=9, mapId=-1, reduceId=148, message= org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 9 at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:460) at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:456) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
Error example 3:
ERROR cluster.YarnScheduler: Lost executor 59 on rhel4.cisco.com: remote Rpc client disassociated
Most of these query failures force Spark to retry by re-queuing tasks, or even restarting a stage. From there things only get worse; eventually the application fails, as if it would never finish.
By switching to Parquet, these issues were resolved without any other Spark configuration changes. Compression reduces file size, the columnar format allows only the selected columns to be read, and the reduced input data directly affects the decisions the Spark DAG scheduler makes about the execution graph (more on this below). All of these advantages of Parquet are critical to completing the queries quickly.
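In practice, the switch boils down to writing the flat text tables out once as Parquet. Below is a minimal sketch using the Spark 1.5/1.6-era API (the spark-csv package, which produces the CsvRelation shown later); the paths, delimiter, and compression setting are illustrative assumptions, not the exact setup used in this benchmark.

// Run e.g. in spark-shell, where sqlContext is provided.
// Paths are illustrative; schema handling is omitted for brevity.
sqlContext.setConf("spark.sql.parquet.compression.codec", "gzip")

val dateDimCsv = sqlContext.read
  .format("com.databricks.spark.csv")   // spark-csv package (CsvRelation)
  .option("delimiter", "|")
  .option("header", "false")
  .load("hdfs:///user/spark/hadoopds1000g/date_dim/")

// Write the same data once as compressed Parquet; Spark SQL queries then
// point at the Parquet directory instead of the flat text files.
dateDimCsv.write.parquet("hdfs:///user/spark/hadoopds1tbparquet/date_dim/")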
3. Less disk IO
Parquet with compression reduces data storage by 75% on average: 1TB of data files takes up only about 250 GB of disk space once compressed. This significantly reduces the input data for the Spark SQL application. In addition, in Spark 1.6.0 the Parquet reader uses push-down filters to further reduce disk IO. Push-down filters allow data-selection decisions to be made before the data is read into Spark. For example, the between clause in query 97 is handled as follows:
select cs_bill_customer_sk customer_sk, cs_item_sk item_sk
from catalog_sales, date_dim
where cs_sold_date_sk = d_date_sk
and d_month_seq between 1200 and 1200 + 11
Spark SQL shows the following scan statement in the physical plan of the query:
+- Scan ParquetRelation[d_date_sk#141,d_month_seq#144L] InputPaths: hdfs://rhel10.cisco.com/user/spark/hadoopds1tbparquet/date_dim/_SUCCESS, hdfs://rhel10.cisco.com/user/spark/hadoopds1tbparquet/date_dim/_common_metadata, hdfs://rhel10.cisco.com/user/spark/hadoopds1tbparquet/date_dim/_metadata, hdfs://rhel10.cisco.com/user/spark/hadoopds1tbparquet/date_dim/part-r-00000-4d205b7e-b21d-4e8b-81ac-d2a1f3dd3246.gz.parquet, hdfs://rhel10.cisco.com/user/spark/hadoopds1tbparquet/date_dim/part-r-00001-4d205b7e-b21d-4e8b-81ac-d2a1f3dd3246.gz.parquet, PushedFilters: [GreaterThanOrEqual(d_month_seq,1200), LessThanOrEqual(d_month_seq,1211)]
The PushedFilters entry returns only records whose d_month_seq value falls in the range 1200 to 1211, that is, only a handful of records. By contrast, with a flat file the entire table (every column of every row) is read, as shown in its physical plan:
[Scan CsvRelation(hdfs://rhel10.cisco.com/user/spark/hadoopds1000g/date_dim/*,false,|,",null,PERMISSIVE,COMMONS,false,false,StructType(StructField(d_date_sk,StringType,false), StructField(d_date_id,StringType,true), StructField(d_month_seq,LongType,true), StructField(d_quarter_seq,LongType,true), ..., StructField(d_current_year,StringType,true)))[d_date_sk#141,d_date_id#142,d_month_seq#144L, ..., d_holiday#157,d_weekend#158,d_following_holiday#159,d_first_dom#160L,d_last_dom#161L,d_same_day_ly#162L,d_same_day_lq#163L,d_current_day#164,d_current_week#165,d_current_month#166,d_current_quarter#167,d_current_year#168]]
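You can check this behavior yourself by asking Spark SQL to print the physical plan. Here is a minimal sketch, again with the Spark 1.5/1.6-era API (run e.g. in spark-shell; the path and temporary table name are assumptions for illustration). The exact plan text varies by Spark version, but with Parquet input a PushedFilters entry like the one above should appear.

// Load the Parquet version of date_dim and expose it to SQL (path is illustrative).
val dateDim = sqlContext.read.parquet("hdfs:///user/spark/hadoopds1tbparquet/date_dim/")
dateDim.registerTempTable("date_dim")

val q = sqlContext.sql(
  "select d_date_sk, d_month_seq from date_dim where d_month_seq between 1200 and 1211")

// Prints the physical plan; with the Parquet reader the between predicate shows up
// as GreaterThanOrEqual/LessThanOrEqual pushed filters, so unneeded data is skipped at read time.
q.explain()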
4. Spark 1.6.0 provides higher scanning throughput
Databricks' Spark 1.6.0 release blog called out significantly improved flat-schema scan throughput, attributed to a "more optimized code path". To see this in the real world, we ran query 97 on Spark 1.5.1 and on 1.6.0 and captured nmon data. The improvement is very obvious.
First, query response time was cut by more than half: query 97 took 138 seconds in Spark 1.5.1, versus 60 seconds in Spark 1.6.0.
Figure 2. Time spent on query 97 with Parquet (in seconds)
Second, in Spark 1.6.0, CPU usage on the worker node is lower, mainly due to SPARK-11787:
Figure 3. CPU utilization for query 97 in Spark 1.6.0, peaking at about 70%
Figure 4. CPU utilization for query 97 in Spark 1.5.1, peaking at 100%
Consistent with the above, disk read throughput is 50% higher in Spark 1.6.0:
Figure 5. Disk read throughput in Spark 1.5.1 and 1.6.0
5. A more efficient Spark execution graph
Beyond enabling smarter readers, such as the one for Parquet, the data format also directly affects the Spark execution graph, because the RDD count is one of the scheduler's main inputs. In our example, we ran the same query 97 in Spark 1.5.1 with text and with Parquet, and obtained the following stage execution patterns.
With text, there are many long-running stages (note that the y-axis is in milliseconds).
Figure 6. Execution stages with text
With Parquet, although there are more stages, the work executes quickly, and only two long-running stages appear, near the end of the job. This indicates that the "parent-child" stage boundaries are clearer, so less intermediate data needs to be persisted to disk and/or shuffled across the network, which speeds up end-to-end execution.
Figure 7. Execution stages with Parquet
Parquet works very well with Spark SQL. It not only provides a higher compression ratio, but also lets only the records of interest be read, through column selection and low-level reader filters. So if you need to pass over the data multiple times, it may well be worth spending the time to convert existing flat files to Parquet.