2025-02-24 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/01 Report--
Many newcomers are unclear about the higher-level, general aspects of Spark tuning. To help, this article explains them in detail; readers who need this material can work through it and hopefully take something away.
First, parallelism
A cluster is not fully utilized unless the level of parallelism for each operation is set high enough. Spark sets the number of map tasks automatically according to the file's size and whether it is splittable (input formats, and how they determine the number of map tasks, will be explained in detail later). For distributed reduce operations such as groupByKey and reduceByKey, it defaults to the largest partition count among the parent RDDs. You can change this default by setting spark.default.parallelism. The recommended value is 2-3 tasks per CPU core.
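As a configuration sketch, the settings above look like this in PySpark (the core count, partition counts, and input path are illustrative assumptions, not recommendations):

```python
from pyspark import SparkConf
from pyspark.sql import SparkSession

# Assume an 8-core executor pool; 2-3 tasks per core suggests 16-24 partitions.
conf = (SparkConf()
        .setAppName("parallelism-tuning")
        .set("spark.default.parallelism", "24"))  # default partition count for shuffles

spark = SparkSession.builder.config(conf=conf).getOrCreate()
rdd = spark.sparkContext.textFile("hdfs:///path/to/input")  # hypothetical path

# Parallelism can also be overridden per operation via the numPartitions argument:
counts = rdd.map(lambda line: (line, 1)).reduceByKey(lambda a, b: a + b, numPartitions=24)
```

The per-operation argument takes precedence over spark.default.parallelism, so hot shuffles can be tuned individually.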
Second, memory usage of Reduce tasks
Sometimes a memory overflow is not because your RDD does not fit in memory, but because the working set of one of your tasks is too large; for example, with groupByKey the dataset handled by a single reduce task can be huge. Spark's shuffle operations (sortByKey, groupByKey, reduceByKey, join, etc.) build a hash table per task, and the group of data a task processes can be large. The easiest fix is to increase parallelism so that the input to each task becomes smaller. Spark can efficiently support tasks as short as 200ms because it reuses the Executor's JVM, which reduces startup cost, so you can safely increase parallelism beyond the number of cores in your cluster.
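The effect of parallelism on the per-task working set can be seen with a plain-Python simulation of hash partitioning, no Spark required (the key distribution here is made up purely for illustration):

```python
from collections import defaultdict

def max_partition_size(keys, num_partitions):
    """Hash-partition keys the way a shuffle would and return the size of
    the largest partition, i.e. the biggest working set any one task sees."""
    buckets = defaultdict(int)
    for k in keys:
        buckets[hash(k) % num_partitions] += 1
    return max(buckets.values())

keys = [f"key-{i}" for i in range(100_000)]
# With more partitions, each reduce task holds a smaller slice of the data.
small = max_partition_size(keys, 4)   # few tasks, big working sets
large = max_partition_size(keys, 64)  # many tasks, small working sets
```

With 4 partitions each task must hold roughly a quarter of all keys; with 64 partitions the largest working set shrinks by more than an order of magnitude, which is exactly why raising parallelism relieves per-task memory pressure.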
Third, broadcast variables
Using Spark's broadcast capability can significantly reduce the serialized size of each task, as well as the cost of launching a job on a cluster. If your tasks use a large object, such as a static lookup table, consider declaring it as a broadcast variable. On the driver node, Spark prints the serialized size of each task, so you can tell whether your tasks are too large by checking that output. As a rule of thumb, tasks larger than about 20KB are worth tuning.
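The 20KB rule of thumb can be checked locally: the closure Spark ships with each task is roughly the pickled size of what it captures. A plain-Python sketch (the table contents and size threshold are illustrative):

```python
import pickle

# A static lookup table like one you might accidentally capture in a task closure.
lookup_table = {i: f"value-{i}" for i in range(10_000)}

serialized = pickle.dumps(lookup_table)
TASK_SIZE_HINT = 20 * 1024  # ~20KB: above this, consider a broadcast variable

if len(serialized) > TASK_SIZE_HINT:
    print(f"table is {len(serialized)} bytes serialized; broadcast it once "
          "instead of shipping it inside every task")
```

In PySpark the fix would be to register the table once with `sc.broadcast(lookup_table)` and have tasks read it through the broadcast handle's `.value`, so the object is shipped to each Executor once rather than with every task.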
Fourth, data locality
Data locality can have a significant impact on Spark jobs. If the data is co-located with the code that operates on it, computation is fast. If code and data are separated, one of them must move; typically the serialized code is moved to where the data is, because the data tends to be much larger than the code. Data locality is the guiding principle Spark uses when building a scheduling plan.
Data locality describes how far the data is from the code that processes it. The level is determined by the current locations of the data and the code. From closest to farthest, the levels are:
1,PROCESS_LOCAL
Data and code are in the same JVM, which is the best data locality.
2,NODE_LOCAL
The data and code are on the same node. For example, the data is in HDFS on the same node, or in another Executor on the same node. Because the data must move between processes, this is slightly slower than PROCESS_LOCAL.
3,NO_PREF
Data can be accessed quickly from anywhere without data locality.
4,RACK_LOCAL
The data and code are in the same rack. The data is on a different server in the same rack, so it must be sent over the network, typically through a single switch.
5,ANY
The data is elsewhere on the network, not in the same rack.
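The five levels and their ordering can be mirrored with an ordered enum; this is an illustration of the ranking described above, not Spark's actual TaskLocality class:

```python
from enum import IntEnum

class Locality(IntEnum):
    """Data-locality levels, best (lowest value) to worst (highest)."""
    PROCESS_LOCAL = 0  # data in the same JVM as the code
    NODE_LOCAL = 1     # data on the same node, different process
    NO_PREF = 2        # data equally fast from anywhere
    RACK_LOCAL = 3     # data on another server in the same rack
    ANY = 4            # data elsewhere on the network

# A scheduler preferring locality picks the lowest level available for a task.
best = min(Locality.RACK_LOCAL, Locality.PROCESS_LOCAL, Locality.ANY)
```

Modeling the levels as ordered values makes "fall back to lower data locality" a simple comparison: the scheduler moves to the next-higher value when the current one times out.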
Spark prefers to schedule tasks at the highest available data locality, but that is not always possible. When there is no unprocessed data on any idle Executor, Spark falls back to lower data locality. There are then two options:
1) wait for a CPU on a server holding the data to become free, then launch the task there; or
2) immediately launch a new task at a remote location, migrating the data to it.
Spark's typical strategy is to wait briefly for a busy CPU to be released; once the wait times out, the data is moved to wherever a free CPU is available and the task runs there. The fallback wait timeout between levels can be configured individually or all at once through a single parameter. If tasks are long and data locality is poor, the spark.locality.wait settings can be raised accordingly. The specific configuration is as follows:
Attribute                      Default value          Meaning
spark.locality.wait            3s                     How long to wait before giving up and launching the task at a lower data-locality level.
spark.locality.wait.node       spark.locality.wait    NODE_LOCAL wait timeout.
spark.locality.wait.process    spark.locality.wait    PROCESS_LOCAL wait timeout.
spark.locality.wait.rack       spark.locality.wait    RACK_LOCAL wait timeout.
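Putting the table above into configuration, a PySpark sketch for a job with long tasks and poor locality (the specific timeout values are examples only, not recommendations):

```python
from pyspark import SparkConf

# Relax the locality fallback waits; per-level settings default to
# spark.locality.wait unless overridden individually.
conf = (SparkConf()
        .set("spark.locality.wait", "6s")        # base fallback timeout (default 3s)
        .set("spark.locality.wait.node", "10s")  # wait longer for NODE_LOCAL slots
        .set("spark.locality.wait.rack", "3s"))  # give up on RACK_LOCAL sooner
```

Raising the waits trades scheduling delay for better locality, which usually pays off only when tasks are long enough to amortize the extra wait.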
Beyond the points above, the other major areas of tuning are serialization and memory tuning.
Hopefully reading the above has been helpful. Thank you for your support.