What are the common Spark tuning techniques in big data development? This article walks through them in detail, with the goal of giving readers a simple, practical approach to each one.
Spark tuning
All kinds of problems come up with Spark jobs in production, some caused by conditions in the environment, some by non-standard code. Summed up, Spark tuning can be approached from the following angles.
1. Allocate more resources

Allocating more resources is the king of performance optimization: giving a job more resources has an obvious effect on performance and speed, and within a certain range the increase in resources is roughly proportional to the improvement in performance. After writing a complex Spark job, the first step of performance tuning is to adjust resource allocation to the optimum. Only once the job has been given as many resources as your environment allows (company resources are limited, after all) does it make sense to move on to the other tuning techniques below.

Related questions: (1) Which resources are allocated? (2) Where are these resources set? (3) Why does performance improve after allocating more resources?

1.1 Which resources are allocated

executor-memory, executor-cores, driver-memory
1.2 Where are these resources set?
In a real production environment, resources are set when submitting the Spark job with the spark-submit shell script, for example:

spark-submit \
--master spark://node1:7077 \
--class com.hoult.WordCount \
--num-executors 3 \       (number of executors)
--driver-memory 1g \      (driver memory; has little impact)
--executor-memory 1g \    (memory of each executor)
--executor-cores 3 \      (number of CPU cores per executor)
/export/servers/wordcount.jar

1.3 How large should these parameters be set?
== Standalone mode ==
First work out the total resources of the company's Spark standalone cluster: the memory size and number of CPU cores of each node. For example, with 20 worker nodes, each with 8 GB of memory and 10 CPU cores, the job can be given 20 executors, 8 GB of memory per executor, and 10 CPU cores per executor.
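A hedged sketch of a submission matching those numbers (reusing the class and jar names from the script above purely as placeholders):

spark-submit \
--master spark://node1:7077 \
--class com.hoult.WordCount \
--executor-memory 8g \
--executor-cores 10 \
--total-executor-cores 200 \
/export/servers/wordcount.jar

With 200 total cores and 10 cores per executor, standalone mode ends up launching the 20 executors described above.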
== Yarn mode ==
First work out the total size of the YARN queue the job runs in, for example 500 GB of memory and 100 CPU cores of allocatable resources; the job could then be given 50 executors, 10 GB of memory per executor, and 2 CPU cores per executor.
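A hedged sketch for the YARN case (class and jar names are placeholders again); in YARN mode the executor count is stated explicitly:

spark-submit \
--master yarn \
--deploy-mode cluster \
--class com.hoult.WordCount \
--num-executors 50 \
--executor-memory 10g \
--executor-cores 2 \
/export/servers/wordcount.jar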
Principle of use
When resources are sufficient, use as much of the available compute as possible and adjust each parameter toward the largest values the cluster can provide.

1.4 Why does performance improve after allocating more resources?

The relevant flags are --executor-memory and --total-executor-cores: more cores mean more tasks can run in parallel, and more memory leaves more room for caching, shuffle aggregation, and the objects tasks create, so less time is lost to disk spilling and GC.

2. Increase parallelism

2.1 What parallelism means in Spark

Parallelism refers to the number of tasks in each stage of a Spark job; it represents how parallel each stage is. Once the maximum resources have been allocated, adjust the program's parallelism to match them; if parallelism does not match the resources, the resources you allocated are wasted. Running tasks in parallel at the same time also reduces the amount of data each task processes (a simple principle): setting parallelism reasonably makes full use of cluster resources, reduces the data handled by each task, and so speeds up the job.

2.2 How to increase parallelism

2.2.1 Choose a sensible number of tasks

Set the number of tasks to at least the total number of CPU cores of the Spark application. Ideally, with 150 cores you would run 150 tasks together and they would all finish at roughly the same time. The official recommendation is to set the number of tasks to 2 to 3 times the total number of CPU cores, e.g. with 150 cores set roughly 300 to 500 tasks. Reality differs from the ideal case: some tasks run faster, say in 50s, and some run slower, taking a minute and a half, so if the number of tasks exactly equals the number of cores, resources are wasted. For example, if 10 of the 150 tasks finish first while the remaining 140 are still running, those 10 cores sit idle. With 2 to 3 times as many tasks, as soon as one task finishes another takes its place, keeping the cores as busy as possible and improving the speed and efficiency of the Spark job.

2.2.2 Set the number of tasks via spark.default.parallelism

The parameter spark.default.parallelism has no value by default. If it is set, say to 10, it only takes effect during shuffles; for example, in val rdd2 = rdd1.reduceByKey(_ + _) the number of partitions of rdd2 would be 10. It can be set when building the SparkConf object, for example: new SparkConf().set("spark.default.parallelism", "500")

2.2.3 Repartition the RDD

rdd.repartition re-partitions an RDD, producing a new RDD with more partitions. Since one partition corresponds to one task, more partitions mean more tasks, which is another way to increase parallelism.

2.2.4 Increase the number of tasks used by Spark SQL
http://spark.apache.org/docs/2.3.3/sql-programming-guide.html
Parallelism for Spark SQL is controlled by the parameter spark.sql.shuffle.partitions, which defaults to 200; it can be increased appropriately, for example spark.sql.shuffle.partitions=500.
This setting applies specifically to Spark SQL.
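To tie section 2 together, here is a minimal Scala sketch of the parallelism settings discussed above; the application name, the toy data, and the values 500/1000 are illustrative only, not recommendations:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("parallelism-demo")
  .set("spark.default.parallelism", "500")      // applies to RDD shuffles such as reduceByKey
  .set("spark.sql.shuffle.partitions", "500")   // applies to Spark SQL shuffles
val sc = new SparkContext(conf)

val data = sc.parallelize(1 to 1000000)         // toy data, just for illustration
val repartitioned = data.repartition(1000)      // more partitions -> more tasks -> higher parallelism
val sums = repartitioned.map(n => (n % 10, n)).reduceByKey(_ + _)  // this shuffle honors the settings above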
3. Reuse and persist RDDs

3.1 A situation that comes up in real development

The computation logic shown in the figure above illustrates the problem: (1) The first time rdd3 is obtained by applying an operator to rdd2, the lineage is computed from rdd1: the file is read from HDFS, the operator is applied to rdd1 to get rdd2, and rdd2 is then computed to get rdd3. (2) To compute rdd4, the same earlier logic is recomputed all over again. (3) By default, whenever operators are applied to one RDD several times to obtain different RDDs, that RDD and all of its parent RDDs are recomputed each time. This comes up often in real code, and repeated computation of the same RDD must be avoided, otherwise performance drops sharply.

Summary: persist the RDDs that are used repeatedly, i.e. the shared RDDs, so that later uses do not recompute them; this improves efficiency.

3.2 How to persist an RDD
You can call the RDD's cache or persist method, as in the sketch below.
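A minimal sketch of persisting a shared RDD, assuming sc is an existing SparkContext and the HDFS path is a placeholder:

import org.apache.spark.storage.StorageLevel

val lines = sc.textFile("hdfs://node1:9000/some/input")          // placeholder path
val words = lines.flatMap(_.split(" "))
words.persist(StorageLevel.MEMORY_ONLY_SER)                      // serialized in memory to reduce the footprint
val wordCounts = words.map((_, 1)).reduceByKey(_ + _).collect()  // the first action computes words and caches it
val distinctWords = words.distinct().count()                     // reuses the cached partitions instead of re-reading HDFS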
(1) The cache method persists data to memory by default, e.g. rdd.cache; under the hood it calls the persist method. (2) The persist method offers a rich set of storage levels, defined in the StorageLevel object, which can be chosen according to the actual application scenario, e.g. rdd.persist(StorageLevel.MEMORY_ONLY), which is exactly what cache does.

Serialization can be used when persisting an RDD: (1) Persisting data in memory in its normal, deserialized form may take up too much memory and lead to an OOM (out-of-memory) error. (2) When pure memory cannot hold all of the shared RDD's data, prefer serialized storage in pure memory: the data of each RDD partition is serialized into a byte array, which greatly reduces the memory footprint. (3) The only downside of serialization is that the data must be deserialized when it is read; in exchange it takes less space and is easier to transfer over the network. (4) If even serialized pure-memory storage still leads to OOM, the only option left is a disk-based level, normally memory plus disk (without serialization). (5) For high data reliability, and when memory is plentiful, a double-replica mechanism can be used for persistence. Without it, if a machine goes down and the copy is lost, the data has to be recomputed. With double replicas, each persisted data unit has an extra copy stored on another node for fault tolerance; if one copy is lost, the other can be used without recomputation. Use this only when memory resources are extremely abundant, for example: StorageLevel.MEMORY_ONLY_2

4. Use broadcast variables

4.1 Scenario description

The following situation may come up in practice: because the amount of data to process is large, a stage may contain a large number of tasks, say 1000, and all of them need the same piece of data, say 100 MB, for their business logic. That data would be copied 1000 times and sent over the network to each task, which means a lot of network transfer overhead and at least 1000 * 100 MB = 100 GB of memory, an enormous and unnecessary cost. This memory pressure means that when RDDs are persisted to memory they may no longer fit completely and have to be written to disk, so subsequent operations pay for disk I/O; for a Spark job this is a disaster. And because of the large memory overhead, when tasks create objects the heap may not be able to hold them all, which triggers frequent garbage collection. GC stops the worker threads, i.e. it pauses the Spark job for a moment, and frequent GC has a considerable impact on the job's speed.

4.2 Introducing broadcast variables

Code executed in a distributed fashion in Spark has to be shipped to the tasks of each executor to run. For data that is read-only and fixed, having the Driver send it to every task is inefficient. Broadcast variables allow such data to be broadcast only once per executor. Each task on an executor then obtains the variable from the BlockManager of its own node (the component responsible for managing an executor's memory and disk data), rather than from the Driver, which improves efficiency. A broadcast variable initially has one copy on the Driver: the shared data is turned into a broadcast variable on the Driver side.
When a task runs and needs the data in a broadcast variable, it first tries to obtain a copy from the BlockManager of its local executor; if it is not available locally, the copy of the broadcast variable is pulled remotely from the Driver and saved in the local BlockManager. From then on, the tasks on that executor use the copy in the local BlockManager directly; in other words, an executor only needs to fetch the broadcast data once, when its first task starts, and subsequent tasks read it from the node's BlockManager. Besides pulling from the Driver's BlockManager, an executor's BlockManager may also pull the copy from the BlockManager of another node, preferring whichever is closer on the network.

4.3 Performance analysis of broadcast variables

Take a job with 50 executors, 1000 tasks, and 100 MB of shared data. (1) Without broadcast variables, 1000 tasks need 1000 copies of the shared data, which means a great deal of network transfer and memory for storage: 1000 * 100 MB = 100 GB. (2) With broadcast variables, the 50 executors only need 50 copies, and not all of them have to be transferred from the Driver to each node; a copy can also be pulled from the BlockManager of the nearest executor, which greatly speeds up network transfer. The memory cost is 50 * 100 MB = 5 GB.

Summary: the memory cost is 100 GB without broadcast variables and 5 GB with them, a difference of about 20x in network transfer and memory overhead, so the improvement from using broadcast variables is considerable. That said, broadcast variables are not necessarily decisive on their own: a Spark job that runs for 30 minutes might become 2 or 5 minutes faster after introducing them. But these small tunings add up, and in the end they pay off.

4.4 Notes on using broadcast variables

(1) Can an RDD be broadcast as a broadcast variable? No, because an RDD does not itself hold the data; the result collected from an RDD can be broadcast. (2) Broadcast variables can only be defined on the Driver side, not on the Executor side. (3) The value of a broadcast variable can be modified on the Driver side but not on the Executor side. (4) If a Driver-side variable is used on the Executor side without a broadcast variable, there is one copy of the variable per task. (5) If a Driver-side variable is used on the Executor side as a broadcast variable, there is only one copy per Executor.

4.5 How to use broadcast variables
For example
(1) Use sparkContext's broadcast method to turn the data into a broadcast variable of type Broadcast:

val broadcastArray: Broadcast[Array[Int]] = sc.broadcast(Array(...))

(2) The BlockManager on each executor can then pull a copy of the broadcast variable, and the value inside it is obtained by calling its value method:

val array: Array[Int] = broadcastArray.value

5. Avoid shuffle operators where possible

5.1 What shuffle means

A shuffle in Spark involves a large amount of network transfer: the downstream tasks need to pull the output of the previous stage's tasks over the network. Put simply, the shuffle process pulls records with the same key, distributed across multiple nodes of the cluster, onto the same node for aggregation, join, or similar operations; operators such as reduceByKey and join trigger shuffles. If at all possible, avoid shuffle operators, because the shuffle is the most performance-consuming part of a Spark job.

5.2 Which operators produce a shuffle

Using operators such as reduceByKey, join, distinct, or repartition in a Spark program produces a shuffle. Since the shuffle is so expensive, prefer non-shuffle operators of the map family wherever possible in real development; a Spark job with no shuffle, or with fewer shuffles, has much lower performance overhead.

5.3 How to avoid producing a shuffle
A small example:
// Wrong approach:
// A traditional join triggers a shuffle, because the records with the same key in both RDDs
// have to be pulled over the network onto one node and joined by a single task.
val rdd3 = rdd1.join(rdd2)

// Correct approach:
// A Broadcast + map "join" does not trigger a shuffle.
// Use Broadcast to turn the RDD with the smaller amount of data into a broadcast variable.
val rdd2Data = rdd2.collect()
val rdd2DataBroadcast = sc.broadcast(rdd2Data)

// Inside the rdd1.map operator, all of rdd2's data can be obtained from rdd2DataBroadcast.
// Traverse it, and whenever a record in rdd2 has the same key as the current record of rdd1,
// the two can be considered joinable.
// At that point the current rdd1 record and the matching rdd2 data can be stitched together
// in whatever form is needed (String or Tuple).
val rdd3 = rdd1.map(rdd2DataBroadcast...)

// Note: this approach is recommended only when rdd2 is relatively small
// (say a few hundred MB, or one or two GB),
// because a full copy of rdd2's data resides in the memory of every Executor.

5.4 Use shuffle operations with map-side pre-aggregation
Map-side pre-aggregation
If a shuffle is unavoidable because of business requirements and cannot be replaced by a map-family operator, try to use operators that pre-aggregate on the map side. Map-side pre-aggregation means performing an aggregation for identical keys locally on each node, similar to the local combiner in MapReduce. After map-side pre-aggregation, each node holds only one record per key locally, because multiple records with the same key have been merged. When other nodes then pull that key from all nodes, the amount of data to pull is greatly reduced, which cuts down disk I/O and network transfer overhead.

In general, where possible, prefer the reduceByKey or aggregateByKey operators over groupByKey: the first two pre-aggregate identical keys locally on each node using a user-defined function, whereas groupByKey does no pre-aggregation and ships the full data set between the nodes of the cluster, so its performance is comparatively poor. The two diagrams below show typical word counts based on groupByKey and reduceByKey respectively. The first is the groupByKey scheme: with no local aggregation, all data is transferred between cluster nodes. The second is the reduceByKey scheme: the data with the same key on each node is pre-aggregated locally before being transferred to other nodes for global aggregation.
== groupByKey word counting principle ==
== reduceByKey word counting principle ==
6. Use high-performance operators

6.1 Use reduceByKey/aggregateByKey instead of groupByKey
reduceByKey/aggregateByKey perform map-side pre-aggregation, which reduces the amount of data transferred and improves performance.

groupByKey does no pre-aggregation and pulls the full data set, so its performance is comparatively low. The sketch below contrasts the two.
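A minimal word-count sketch contrasting the two operators; the input data is made up for illustration:

val textWords = sc.parallelize(Seq("spark", "flink", "spark", "hadoop", "spark"))
val pairs = textWords.map(word => (word, 1))

// groupByKey: every (word, 1) record is shuffled across the network, then summed after the shuffle
val countsViaGroup = pairs.groupByKey().mapValues(_.sum)

// reduceByKey: identical keys are combined map-side first, so far less data crosses the network
val countsViaReduce = pairs.reduceByKey(_ + _)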
6.2 Use mapPartitions instead of plain map

With mapPartitions-style operators, one function call processes all the data of a partition instead of one record at a time, so performance is relatively higher. However, mapPartitions can sometimes cause OOM (out-of-memory) problems: because a single call handles all the data of a partition, if memory is insufficient the objects cannot be reclaimed by garbage collection during the call and an OOM exception is likely. So use this kind of operation with care.

6.3 Use foreachPartition instead of foreach

The principle is the same as "mapPartitions instead of map": all the data of a partition is processed in one call rather than one record at a time. In practice, foreachPartition-style operators are very helpful for performance. For example, if the foreach function writes all the data of an RDD to MySQL, a plain foreach operator writes record by record, and each function call may create a database connection, so connections are created and destroyed constantly and performance is very poor. With foreachPartition, the data of a whole partition is handled at once: only one database connection is created per partition and a single batch insert is performed, which gives much better performance. In practice, for around 10,000 records, this can improve MySQL write performance by more than 30% (a sketch of this pattern follows at the end of this section).

6.4 Use coalesce after filter

When a filter operator removes a large portion of an RDD's data (say more than 30%), it is recommended to use coalesce to manually reduce the number of partitions and compact the remaining data into fewer partitions. After the filter, every partition of the RDD has lost a lot of data; continuing as before means each task processes only a small amount of data, which wastes resources, and the more tasks there are at this point, the slower things may get. So reduce the number of partitions with coalesce, compact the RDD's data into fewer partitions, and process all of them with fewer tasks. In some scenarios this helps performance.

6.5 Use repartitionAndSortWithinPartitions instead of repartition plus sort

repartitionAndSortWithinPartitions is an operator recommended on the Spark official website: if you need to sort after repartitioning, use repartitionAndSortWithinPartitions directly, because it sorts while it performs the repartition shuffle. Shuffling and sorting at the same time can perform better than shuffling first and sorting afterwards.
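The following is a minimal sketch of the foreachPartition pattern from 6.3, with made-up data, a made-up table name, and placeholder JDBC connection details, not the article's original code:

import java.sql.DriverManager

val wordCounts = sc.parallelize(Seq(("spark", 3), ("flink", 1)))
wordCounts.foreachPartition { records =>
  // One connection and one batch per partition, instead of one connection per record.
  val conn = DriverManager.getConnection("jdbc:mysql://dbhost:3306/demo", "user", "password")
  val stmt = conn.prepareStatement("INSERT INTO word_count(word, cnt) VALUES (?, ?)")
  records.foreach { case (word, cnt) =>
    stmt.setString(1, word)
    stmt.setInt(2, cnt)
    stmt.addBatch()
  }
  stmt.executeBatch()        // a single batched insert for the whole partition
  stmt.close()
  conn.close()
}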
7. Use Kryo to optimize serialization performance

7.1 Serialization in Spark

During task computation, Spark ships data across processes over the network and persists data, both of which require serialization. Spark uses Java serialization by default. Its advantage is that it is easy to use: nothing has to be done manually other than having the objects and variables involved implement the Serializable interface. Its disadvantages are that the default mechanism is not efficient: serialization is relatively slow, and the serialized data takes up relatively more memory.

Spark also supports the Kryo serialization mechanism. Kryo is faster than the default Java serialization and produces smaller output, roughly 1/10 the size of Java-serialized data. With Kryo enabled, less data is transferred over the network and the memory consumed across the cluster is greatly reduced.

7.2 Where Kryo takes effect

Once Kryo serialization is enabled, it takes effect in several places: (1) External variables used inside operator functions, which may be shipped from the driver over the network and therefore need serialization; here Kryo optimizes network transfer performance and memory usage in the cluster. (2) Persisting RDDs with a serialized storage level, e.g. StorageLevel.MEMORY_ONLY_SER; here Kryo reduces memory consumption, and the less memory persisted RDDs occupy, the more is left for task execution, so newly created objects do not fill up memory and trigger frequent GC. (3) Shuffles, where the tasks of the downstream stage depend on the upstream stage via wide dependencies and the result data of the upstream tasks is pulled across the network; here Kryo optimizes network transfer performance.

7.3 How to enable Kryo serialization

// Create the SparkConf object.
val conf = new SparkConf().setMaster(...).setAppName(...)
// Set the serializer to KryoSerializer.
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
// Register the custom types to be serialized.
conf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2]))

8. Use fastutil to optimize data formats

8.1 What fastutil is

fastutil is a library that extends the Java standard collections framework (Map, List, Set; HashMap, ArrayList, HashSet), providing special-typed maps, sets, lists and queues. fastutil offers a smaller memory footprint and faster access; we use the collection classes provided by fastutil in place of the native Map, List and Set.

8.2 Benefits of fastutil

Compared with the JDK collection classes, fastutil collections reduce memory consumption and are faster when traversing a collection, getting an element's value by index (or key), and setting an element's value.

8.3 Where fastutil is used in Spark tuning

8.3.1 Operator functions that use external variables

(1) The external variable can be optimized with a Broadcast broadcast variable; (2) Kryo serialization can be used to improve serialization performance and efficiency; (3) if the external variable is some kind of large collection, consider rewriting it with fastutil: reduce the memory footprint at the source (fastutil), reduce it further with the broadcast variable, and further still with Kryo serialization (a combined sketch follows).
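As a hedged sketch of how the three optimizations in 8.3.1 can be combined for a large read-only lookup collection (IntOpenHashSet is one of fastutil's primitive-typed sets; the data and names here are made up):

import it.unimi.dsi.fastutil.ints.IntOpenHashSet
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("fastutil-demo")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[IntOpenHashSet]))   // register the fastutil type with Kryo
val sc = new SparkContext(conf)

val validIds = new IntOpenHashSet()                      // primitive ints, far smaller than a HashSet[Integer]
(1 to 100000).foreach(id => validIds.add(id))
val validIdsBroadcast = sc.broadcast(validIds)           // one copy per executor instead of one per task

val events = sc.parallelize(1 to 1000000)                // toy data standing in for real records
val kept = events.filter(id => validIdsBroadcast.value.contains(id))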
8.3.2 Operator functions that use large collections (Map/List)

If the computation logic executed by a task creates large collections such as Map or List, which can occupy a lot of memory and involve performance-consuming operations like traversal and lookups, consider rewriting those collection types with the fastutil library. Using fastutil collections reduces, to a certain extent, the memory footprint of the collections created inside tasks, and avoids frequently filling up executor memory and triggering GC, which would degrade performance.

8.3.3 How to use fastutil

Step 1: add the fastutil dependency to pom.xml (groupId fastutil, artifactId fastutil, version 5.0.9).
Step 2: replace List<Integer> with IntList. The fastutil counterpart of List<Integer> is the IntList type. Naming convention: the fastutil types are prefixed with the element type of the collection, as in IntList; the special case is Map, where a name like Int2IntMap expresses the element types of the key-value mapping.

9. Tune the data locality wait time

Before the Driver distributes the tasks of each stage of an application, Spark computes which shard of data each task is to process, i.e. which partition of an RDD. Spark's task placement algorithm first tries to assign each task to exactly the node that holds the data it needs, so that no data has to be transferred over the network. But things do not always go as hoped: a task may have no opportunity to be assigned to the node holding its data, for example because that node's compute resources and capacity are already used up. In that case Spark generally waits for a while, 3 seconds by default (not absolute; it waits differently for the different locality levels), and when it really cannot wait any longer, it picks a worse locality level, e.g. assigning the task to a node closer to the one holding the data, and computes there.

9.1 Locality levels

(1) PROCESS_LOCAL: process-local. Code and data are in the same process, i.e. the same executor; the task computing the data is executed by that executor, and the data is in its BlockManager. Best performance.
(2) NODE_LOCAL: node-local. Code and data are on the same node; e.g. the data is an HDFS block on the node while the task runs in an executor on that node, or the data and the task are in different executors on one node; the data has to be transferred between processes. Second-best performance.
(3) RACK_LOCAL: rack-local. The data and the task are on two nodes of the same rack; the data has to be transferred between nodes over the network. Poorer performance.
(4) ANY: unrestricted. The data and the task may be anywhere in the cluster, not even on the same rack. Worst performance.

9.2 The data locality wait time

spark.locality.wait, 3s by default. Spark first tries the best level, waits 3 seconds, and if that does not work it downgrades; if that still does not work it keeps downgrading until only the worst level remains.

9.3 How to adjust the parameter, and how to test it

Modify the spark.locality.wait parameter; the default is 3s and it can be increased. The wait times of the individual locality levels are, by default, the same as spark.locality.wait, i.e. 3s (see the descriptions of the corresponding parameters on the Spark official website):
spark.locality.wait.process
spark.locality.wait.node
spark.locality.wait.rack

Set it in code, for example:

new SparkConf().set("spark.locality.wait", "10")

Then submit the program to the Spark cluster and watch the run log of the Spark job; it is recommended to use client mode when testing, so that a fairly complete log can be seen locally. The log shows "Starting task ..." lines with PROCESS_LOCAL, NODE_LOCAL, and so on, for example:

Starting task 0.0 in stage 1.0 (TID 2, 192.168.200.102, partition 0, NODE_LOCAL, 5254 bytes)

If most tasks have a data locality level of PROCESS_LOCAL, there is no need to adjust anything. If many are NODE_LOCAL or ANY, it is worth adjusting the data locality wait time. Adjust it repeatedly, and after each adjustment run again to see whether the locality level of most tasks has improved and whether the total run time of the Spark job has become shorter. Note: do not put the cart before the horse; if the locality levels have gone up but the job's run time has increased because of all the extra waiting, then do not keep that adjustment.

10. Tune executor memory based on the Spark memory model

10.1 How executor memory is partitioned in Spark
The memory of Executor is mainly divided into three blocks.
The first block is used by tasks to execute the code we wrote ourselves.
The second block is used by tasks to pull the output of the previous stage's tasks during the shuffle and then aggregate it or perform other operations.
The third block is used for RDD caching.
10.2 Spark's memory models

Before Spark 1.6, executors used the static memory model; starting with Spark 1.6 a unified memory model was added. The parameter spark.memory.useLegacyMode defaults to false, which means the new dynamic (unified) memory model is used; to use the old static memory model, set this value to true.

10.2.1 The static memory model

The static model divides an executor's memory into three parts: a Storage area, an execution area, and an "other" area. When the static memory model is used, these parameters control the split: spark.storage.memoryFraction (default 0.6) and spark.shuffle.memoryFraction (default 0.2), so the third part is 0.2. If there is a large amount of cached data, or the broadcast variables are relatively large, increase spark.storage.memoryFraction. But if the code has no broadcast variables and little cached data while shuffles are heavy, increase spark.shuffle.memoryFraction instead. A minimal configuration sketch follows.
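A minimal configuration sketch for the static model, assuming (purely for illustration) a cache-heavy job; the fractions are placeholders to tune against your own workload:

val conf = new SparkConf()
  .set("spark.memory.useLegacyMode", "true")       // opt back into the static memory model
  .set("spark.storage.memoryFraction", "0.7")      // more room for cached RDDs and broadcast variables
  .set("spark.shuffle.memoryFraction", "0.2")      // memory for shuffle aggregation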
Disadvantages of static memory model
Once the Storage area and the execution area have been configured, a task may have used up its execution memory while its Storage area still has free space, yet the two cannot borrow from each other. This is not flexible, which is why the new unified memory model was introduced.

10.2.2 The unified memory model

The dynamic (unified) memory model first reserves 300 MB of memory to guard against memory overflow. It divides the remaining memory into two parts: spark.memory.fraction (default 0.6) covers one part, which means the other part is 0.4. The spark.memory.fraction part is in turn divided into two smaller parts, which together account for 0.6 of the total: Storage memory and execution memory. Their split is governed by spark.memory.storageFraction; since the two share the 0.6, if spark.memory.storageFraction is 0.5, storage gets 0.5 of the 0.6, i.e. 0.3 of the total, and execution gets the other 0.3. The arithmetic is worked through in the sketch below.
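A worked example of that split for a hypothetical 10 GB executor with the default fractions mentioned above:

val executorMemoryMB = 10 * 1024                 // hypothetical executor size: 10 GB
val reservedMB = 300                             // reserved to guard against OOM
val usableMB = executorMemoryMB - reservedMB     // 9940 MB
val unifiedMB = usableMB * 0.6                   // spark.memory.fraction = 0.6  -> 5964 MB shared region
val storageMB = unifiedMB * 0.5                  // spark.memory.storageFraction = 0.5 -> 2982 MB for storage
val executionMB = unifiedMB - storageMB          // 2982 MB for execution (the two can borrow from each other)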
What are the characteristics of the unified memory model?
Storage memory and execution memory can borrow from each other; it is not as rigid as the static memory model. But there are rules: when memory is tight, why is it always storage that gives way? Because the data used by execution is needed immediately, whereas the data in storage is not necessarily needed right away.

10.2.3 A reference task-submission script
The following is an example of a spark-submit command, which you can refer to and adjust according to your actual situation.
bin/spark-submit \
--master yarn-cluster \
--num-executors 100 \
--executor-memory 6G \
--executor-cores 4 \
--driver-memory 1G \
--conf spark.default.parallelism=1000 \
--conf spark.storage.memoryFraction=0.5 \
--conf spark.shuffle.memoryFraction=0.3 \

10.2.4 Personal experience

java.lang.OutOfMemoryError
ExecutorLostFailure
Executor exit code is 143
executor lost
heartbeat time out
shuffle file lost

If you run into the errors above, it is very likely a memory problem; try increasing memory first. If that still does not solve it, move on to the next topic, data-skew tuning. That covers the common Spark tuning techniques used in big data development.