

An Example Analysis of Whether a Spark Task Has Data Skew

2025-01-18 Update From: SLTechnology News&Howtos



What this article shares is an example analysis of whether a Spark task has data skew. The editor thinks it is very practical and shares it here for reference; I hope you get something out of it after reading.


On the way back from the gym, I saw a WeChat group chatting about technology, and one member asked a puzzling question. For details, see the screenshot below:

My buddy's conclusion was that this is data skew caused by repartition. I replied to him in detail that it is not data skew. Next, let's carefully analyze the reasons.

To give you a more thorough understanding of this topic, a short video is also included at the bottom of the article.

His reasoning was that repartition caused data skew: in the screenshot, the first three rows show hundreds of MB of input and output, while the remaining rows show only a few MB of input and 0 B of output, so he concluded it was data skew.

The reason I corrected him is that data skew refers to tasks within the same stage: some tasks process a lot of data, some process very little, and the gap in data size between tasks is large. That is clearly not the case here. The screenshot is the Executors page; if you look at the Complete Tasks column, you will find that the first three executors ran almost all of the tasks, 10 to 20 times as many as the rest. That is why their input and output totals are so large.
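To make the distinction concrete, here is a minimal sketch with hypothetical numbers (not taken from the screenshot): skew is about per-task data size within one stage, while the situation above is about per-executor task counts.

```python
# Hypothetical per-task input sizes (MB) within ONE stage.
# Real data skew would show up here as a few huge tasks.
stage_task_input_mb = [120, 115, 118, 121, 119, 117]  # roughly even: no skew

def skew_ratio(sizes):
    """Ratio of the largest task's input to the median task's input."""
    s = sorted(sizes)
    median = s[len(s) // 2]
    return max(s) / median

print(round(skew_ratio(stage_task_input_mb), 2))  # → 1.02, close to 1.0: no skew

# Hypothetical per-EXECUTOR totals: very uneven, but only because a few
# executors ran many more tasks -- the per-task size is similar everywhere,
# so this imbalance is a task-count (locality) problem, not data skew.
executor_totals_mb = {"exec-1": 800, "exec-2": 790, "exec-3": 750,
                      "exec-4": 5, "exec-5": 3}
executor_task_counts = {"exec-1": 40, "exec-2": 39, "exec-3": 38,
                        "exec-4": 2, "exec-5": 1}
```

A `skew_ratio` far above 1 within a stage would indicate real skew; here it stays near 1, matching the diagnosis above.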

Data locality is the root cause of this problem. With locality-aware scheduling, a task is first dispatched to an executor on the machine where its data resides. If that executor is busy, the new task waits for a configured time; if a running task finishes within that window, the new task is dispatched to that same executor. Repeated over and over, this produces a large gap in the number of tasks handled by each executor.

The official documentation provides a wait-time configuration that controls how long Spark waits before degrading data locality when scheduling tasks.

Quite simply, change the 3s default to 0s; tasks will then not wait for data locality, and will immediately degrade locality and be scheduled for execution.
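As a sketch, the setting in question is `spark.locality.wait` (default 3s); the application name and jar below are placeholders:

```shell
# Disable the locality wait so tasks are scheduled immediately at a
# degraded locality level instead of queuing for the "ideal" executor.
spark-submit \
  --conf spark.locality.wait=0s \
  --class com.example.App app.jar   # placeholder app, not from the article
```

Spark also exposes per-level variants (`spark.locality.wait.process`, `spark.locality.wait.node`, `spark.locality.wait.rack`) if you want finer control than disabling the wait entirely.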

In fact, the deeper root cause is that the Kafka topic was created with too few partitions. A single partition can reach tens of thousands of QPS of raw throughput, but once business logic is applied and data is written out to different destinations, throughput drops sharply. Therefore, the number of topic partitions should be set according to the processing logic, the output destination, and the number of disks.
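For illustration, a topic with more partitions can be created with the standard Kafka CLI; the topic name, broker address, and counts below are hypothetical, to be sized per the considerations above:

```shell
# Create a topic with enough partitions for the downstream Spark job's
# parallelism (12 here is an example value, not a recommendation).
kafka-topics.sh --create \
  --bootstrap-server localhost:9092 \
  --topic events \
  --partitions 12 \
  --replication-factor 3
```

With enough source partitions, the Spark job starts with adequate parallelism and does not need a `repartition` that then interacts badly with locality scheduling.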

The above is an example analysis of whether a Spark task has data skew. There are some knowledge points here that you may see or use in daily work; I hope you can learn more from this article.
