Today I would like to talk about how data skew manifests in Spark jobs in big data workloads and what the solutions are. Many people may not know much about it, so I have summarized the main points below; I hope you get something out of this article.
Data skew description
How data skew manifests in Spark
Most tasks in a Spark stage finish in roughly the same time, but a few take far longer. For example, out of 500 tasks, 498 finish within 10 minutes while the remaining two take more than half an hour.
A job that normally runs fine suddenly fails with an OOM error one day; there is a high probability that data skew is the cause.
Data skew arises because a shuffle pulls all records with the same key from every node onto the same node. If the volume of data for one key is very large, that node is overloaded and skew occurs.
Data skew only occurs during a shuffle. The Spark RDD operators that trigger a shuffle include distinct, repartition, reduceByKey, groupByKey, aggregateByKey, and join.
Common solutions
Adjust parallelism
Set the parallelism directly on the shuffle operator, or via the spark.default.parallelism setting. For Spark SQL, you can also set it with SET spark.sql.shuffle.partitions=num_tasks.
This method applies in few scenarios; it can only alleviate data skew, not eliminate it, because all records with the same key still land in the same partition.
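A minimal sketch of this approach, assuming a pair RDD built from a hypothetical CSV input (the path, key layout, and the value 400 are illustrative, not from the article):

```scala
import org.apache.spark.sql.SparkSession

object RaiseParallelism {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("skew-parallelism")
      // Default parallelism for RDD shuffles that don't specify a count
      .config("spark.default.parallelism", "400")
      // Partition count used by Spark SQL / DataFrame shuffles
      .config("spark.sql.shuffle.partitions", "400")
      .getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical input: CSV lines whose first field is the key
    val pairs = sc.textFile("hdfs:///tmp/events")
      .map(line => (line.split(",")(0), 1L))

    // The partition count can also be passed directly to the shuffle operator
    val counts = pairs.reduceByKey(_ + _, 400)
    counts.take(10).foreach(println)

    spark.stop()
  }
}
```

More partitions spread the data thinner, but one giant key still ends up in a single partition, which is why this only alleviates skew.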
Map-side join
Using Spark's broadcast mechanism, a reduce-side join is converted into a map-side join. The shuffle is avoided entirely, so the data skew that the shuffle would cause is completely eliminated.
This requires the dataset on one side of the join to be small enough to broadcast. It is mainly suited to join scenarios rather than aggregation scenarios, so the applicable conditions are limited.
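A minimal sketch with RDDs, assuming smallRdd fits comfortably in driver and executor memory (smallRdd, bigRdd, and sc are illustrative names):

```scala
// Collect the small side to the driver and broadcast it to every executor,
// so the join happens map-side with no shuffle at all.
val smallAsMap = sc.broadcast(smallRdd.collectAsMap())

val joined = bigRdd.mapPartitions { iter =>
  val lookup = smallAsMap.value
  // Inner join: emit a row only when the key exists on the small side
  iter.flatMap { case (k, v) =>
    lookup.get(k).map(w => (k, (v, w)))
  }
}
```

With DataFrames, the same effect can be requested by wrapping the small side in broadcast(smallDf) from org.apache.spark.sql.functions before joining.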
Outlier filtering
Use Spark's reduceByKey to count the occurrences of each key, and treat keys whose count exceeds a threshold, or the top-N keys, as outlier keys. You can also sample the RDD with sample first and run the key counts on the sample.
This method is simple and crude, but it has its applicable scenarios.
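A sketch of the sampling variant, assuming pairs is a pair RDD such as RDD[(String, Long)] (the 10% fraction and top-10 cutoff are illustrative tuning choices):

```scala
// Sample the pair RDD, count keys on the sample, and take the heaviest
// keys as skew candidates.
val skewedKeys = pairs
  .sample(withReplacement = false, fraction = 0.1)
  .map { case (k, _) => (k, 1L) }
  .reduceByKey(_ + _)
  .sortBy(_._2, ascending = false)
  .take(10)            // top-10 hottest keys in the sample
  .map(_._1)
  .toSet

// Either drop the outlier keys or route them to a dedicated code path
val normal = pairs.filter { case (k, _) => !skewedKeys.contains(k) }
```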
Key conversion: adding random numbers
This can be understood as the heavy-artillery option.
For shuffle operations on a single RDD, such as groupByKey, prefix each key with a random number. This requires a two-stage aggregation: aggregate on the salted keys first, strip the prefix, then aggregate again.
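A sketch of the two-stage aggregation, again assuming pairs: RDD[(String, Long)]; the salt range of 10 is an illustrative value to tune per workload:

```scala
import scala.util.Random

// Stage 1: spread each hot key across 10 salted variants, then aggregate.
// The salt contains no underscore, so split("_", 2) below recovers the
// original key even if the key itself contains underscores.
val partial = pairs
  .map { case (k, v) => (s"${Random.nextInt(10)}_$k", v) }
  .reduceByKey(_ + _)

// Stage 2: strip the salt and aggregate again to get true per-key totals
val totals = partial
  .map { case (k, v) => (k.split("_", 2)(1), v) }
  .reduceByKey(_ + _)
```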
For shuffle operations across multiple RDDs, such as join, prefix the keys of the RDD with obvious skew with a random number in [0, n), and prefix every key of the other RDD with each of the values 0 through n-1, which expands that RDD n times so every salted key still finds its match.
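A sketch of the salted join, assuming skewedRdd is the side with the hot keys and otherRdd the side being expanded (n = 10 is illustrative):

```scala
import scala.util.Random

val n = 10

// Salt the skewed side with a random prefix in [0, n)
val saltedSkewed = skewedRdd.map { case (k, v) =>
  (s"${Random.nextInt(n)}_$k", v)
}

// Expand the other side n times: every key appears under all n prefixes,
// so each salted key on the skewed side still has a matching partner
val expandedOther = otherRdd.flatMap { case (k, v) =>
  (0 until n).map(i => (s"${i}_$k", v))
}

val joined = saltedSkewed
  .join(expandedOther)
  .map { case (saltedKey, vw) => (saltedKey.split("_", 2)(1), vw) } // drop the salt
```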
Real workloads may require combining the solutions above, for example outlier filtering plus key salting: split the RDD into two RDDs based on the outlier keys, process the part without skew normally, and add random prefixes to the skewed part before running the shuffle operation.
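A sketch of the combined recipe, reusing skewedKeys from the outlier-filtering sketch above (the names and the salt range of 10 remain illustrative):

```scala
import scala.util.Random

// Split by outlier keys: only the skewed part pays for salting
val skewed = pairs.filter { case (k, _) => skewedKeys.contains(k) }
val normal = pairs.filter { case (k, _) => !skewedKeys.contains(k) }

// Normal part: a plain shuffle is fine
val normalAgg = normal.reduceByKey(_ + _)

// Skewed part: salt, aggregate, strip the salt, aggregate again
val skewedAgg = skewed
  .map { case (k, v) => (s"${Random.nextInt(10)}_$k", v) }
  .reduceByKey(_ + _)
  .map { case (k, v) => (k.split("_", 2)(1), v) }
  .reduceByKey(_ + _)

// Union the two branches back into one result
val result = normalAgg.union(skewedAgg)
```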
After reading the above, do you have a better understanding of how data skew manifests in Spark and how to resolve it? I hope you got something out of this article.