What are the techniques for improving the performance of Apache Spark? Many inexperienced users are not sure, so this article summarizes common causes of slow Spark jobs and their solutions. I hope it helps you solve the problem.
Make your Apache Spark application run faster with minimal code changes!
Introduction
One of the most time-consuming parts of developing Spark applications is optimization. In this blog post, I will share some performance tips and configuration parameters that were (at least to me) previously unknown and that are easy to apply.
I will cover the following topics:
Multiple small files as source
The shuffle partitions parameter
Forcing broadcast joins
Repartition vs. coalesce vs. the shuffle partitions setting
What can we improve?
1. Multiple small files as source
openCostInBytes (from the docs): the estimated cost to open a file, measured by the number of bytes that could be scanned in the same time. It is used when putting multiple files into a partition. It is better to over-estimate it; partitions with small files will then be faster than partitions with bigger files (which are scheduled first). The default is 4 MB.
spark.conf.set("spark.files.openCostInBytes", SOME_COST_IN_BYTES)
I tested a 1 GB folder containing 12,000 files, a 7.8 GB folder containing 800 files, and an 18 GB folder containing 1.6k files. My goal was to find out whether, when the input files are small, it is best to use a value lower than the default.
For the 1 GB and 7.8 GB folders, lower values were clearly better; but for files of about 11 MB, larger parameter values worked better.
Use an openCostInBytes value close to the size of your small files. It will be more efficient!
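To make this concrete, here is a minimal PySpark sketch of the idea, assuming a folder of many small Parquet files. The path and the 1 MB value are hypothetical placeholders, and the sketch uses spark.sql.files.openCostInBytes, the variant of this setting read by the DataFrame file sources.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("small-files-demo").getOrCreate()

# Set the open cost roughly to the size of the small files (here ~1 MB).
spark.conf.set("spark.sql.files.openCostInBytes", str(1 * 1024 * 1024))

# Hypothetical folder containing thousands of small files.
df = spark.read.parquet("/data/many_small_files")
print(df.rdd.getNumPartitions())  # small files are now packed into fewer input partitions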
2. Shuffle partitions
When I started using Spark, I somehow assumed that the configuration set when creating the Spark session was immutable. Oh my, how wrong I was.
In general, when aggregating or joining, Spark shuffles the data into a static number of partitions (200 by default). Depending on the size of your data, this can cause two problems:
The dataset is small: 200 partitions are too many, the data is scattered across tiny partitions, and processing is inefficient.
The dataset is huge: 200 partitions are too few, and we do not make full use of all the resources we have.
I ran into this kind of problem myself, spent a lot of time on Google, and discovered this beautiful thing:
spark.conf.set("spark.sql.shuffle.partitions", X)
You can change this neat configuration at any point in the middle of a run; it affects the stages triggered after you set it. You can also use it when creating the Spark session. This number of partitions is used whenever data is shuffled for joins or aggregations. You can also get a DataFrame's current partition count:
df.rdd.getNumPartitions()
From this you can estimate the most appropriate number of shuffle partitions for further joins and aggregations.
Say you have a huge DataFrame that you want to join or aggregate. Get the number of partitions of the big DataFrame and set the shuffle partitions parameter to that value. That way the result will not fall back to the default of 200 partitions after the join. More parallelism, here we come!
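As an illustration, here is a minimal PySpark sketch of this pattern, assuming an existing SparkSession named spark; the paths, the user_id column, and the table contents are hypothetical placeholders.

big_df = spark.read.parquet("/data/all_events")      # hypothetical large input
lookup_df = spark.read.parquet("/data/user_lookup")  # hypothetical smaller input

# Match the shuffle partition count to the big DataFrame's current partitioning,
# so the join result does not fall back to the default of 200 partitions.
n = big_df.rdd.getNumPartitions()
spark.conf.set("spark.sql.shuffle.partitions", str(n))

joined = big_df.join(lookup_df, on="user_id", how="inner")
# Note: with adaptive query execution enabled (the default in newer Spark versions),
# shuffle partitions may still be coalesced automatically at runtime.
print(joined.rdd.getNumPartitions())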
3. Broadcast Join
A very simple case: we have a huge table containing all users, and a smaller table containing internal users, QA testers, and other users who should be excluded. The goal is to keep only the non-internal users.
Read both tables
Left anti-join the huge table with the small table
It looks like a simple and reasonably smart solution. If your small table is under 10 MB, the small dataset will be broadcast without any hint. With a broadcast hint in your code, the optimizer may broadcast somewhat larger datasets as well, depending on its behavior.
However, suppose the small table is 100-200 MB and the hint does not force it to be broadcast. If you are sure this will not hurt the performance of your code (or raise OOM errors), you can override the default threshold:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", SIZE_OF_SMALLER_DATASET)
In this case the smaller dataset will be broadcast to all executors, and the join should run faster.
Watch out for OOM errors!
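Below is a minimal PySpark sketch of this left anti-join scenario, assuming an existing SparkSession named spark; the paths, the user_id column, and the 200 MB threshold are hypothetical placeholders, not values from the original post.

from pyspark.sql.functions import broadcast

huge_table = spark.read.parquet("/data/all_users")            # hypothetical
internal_users = spark.read.parquet("/data/internal_users")   # hypothetical, ~100-200 MB

# Raise the auto-broadcast threshold above the size of the smaller dataset...
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(200 * 1024 * 1024))

# ...and/or hint the optimizer explicitly. Watch executor memory:
# broadcasting a dataset this large can cause OOM errors.
non_internal = huge_table.join(broadcast(internal_users), on="user_id", how="left_anti")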
4. Repartition vs. coalesce vs. the shuffle partitions setting
If you use Spark, you probably know the repartition method. Coming from a SQL background, coalesce meant something different to me! In Spark, coalescing partitions behaves differently: it moves and groups multiple existing partitions together, keeping data reshuffling and movement to a minimum.
If we only need to reduce the number of partitions, we should use coalesce instead of repartition, because it minimizes data movement and does not trigger an exchange (shuffle). If we want the data divided more evenly between partitions, we should repartition.
However, suppose we have a recurring pattern: we perform a join or transformation and end up with 200 partitions, but we do not need 200; we need, say, 100 or even 1.
Let's compare the options. We will read the folder with the roughly 11 MB files and aggregate it as before.
By persisting the DataFrame with the DISK_ONLY storage level, we can estimate its size. It turns out small_df is only about 10 MB, yet it has 200 partitions. Wait, what? That averages about 50 KB of data per partition, which is not efficient. So we will read the big DataFrame, reduce the partition count of the aggregated result to 1, force Spark to execute, and use count() as the final action.
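Here is a minimal PySpark sketch of this size-estimation trick, assuming an existing SparkSession named spark; the path, the aggregation, and the column name are hypothetical stand-ins for the original test.

from pyspark import StorageLevel

df = spark.read.parquet("/data/many_11mb_files")   # hypothetical input folder
small_df = df.groupBy("some_key").count()          # hypothetical aggregation

small_df.persist(StorageLevel.DISK_ONLY)
small_df.count()                                   # materialize the persisted data

# The on-disk size now appears under the "Storage" tab of the Spark UI,
# which gives a rough estimate of the DataFrame's size.
print(small_df.rdd.getNumPartitions())             # 200 by default after the aggregation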
Here are the execution plans for the three approaches:
> Setting the shuffle partitions parameter
> Coalescing
> Repartitioning
As you can see, with the shuffle partitions setting we do not invoke the extra Coalesce / Exchange (repartition) step, so we save some execution time by skipping it. Looking at the execution times: the shuffle partitions setting finished in 7.1 minutes, coalesce in 8.1, and repartition in 8.3.
This is just a simple example, but it shows how much time can be saved simply by setting one configuration parameter!
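For reference, here is a minimal PySpark sketch of the three variants compared above, assuming an existing SparkSession named spark; the path, the column name, and the aggregation are hypothetical stand-ins, and the timings quoted above come from the author's test, not from this sketch.

df = spark.read.parquet("/data/many_11mb_files")   # hypothetical path
agg = df.groupBy("some_key").count()               # hypothetical aggregation; count() below is the action

# 1) Shuffle partitions setting: the aggregation itself produces a single partition,
#    so no extra Coalesce / Exchange step appears in the plan.
spark.conf.set("spark.sql.shuffle.partitions", "1")
agg.count()

# 2) Coalesce after the aggregation: adds a Coalesce step to the plan.
spark.conf.set("spark.sql.shuffle.partitions", "200")
agg.coalesce(1).count()

# 3) Repartition after the aggregation: adds a full Exchange (shuffle) to the plan.
agg.repartition(1).count()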
There are many small and simple tips and tricks that can make your Apache Spark applications run faster and more efficiently. Unfortunately, with Spark the solutions are case-specific most of the time. To make them work, you usually need to understand Spark's internals and read the documentation from beginning to end more than once.
I covered how to read many small files faster, how to force broadcast joins, and how to choose between the shuffle partitions parameter, coalesce, and repartition.