
VI. Spark--spark tuning



[TOC]

I. Introduction to Spark tuning

1.1 What is Spark tuning

The essence of Spark computation is distributed computing, so program performance is affected by every factor in the cluster: CPU, network bandwidth, memory, and so on. In general, if memory is large enough, the other factors rarely become the bottleneck. Tuning is therefore usually needed when resources are insufficient, and the goal is to adjust how resources are used so that they are used more efficiently. For example, if memory is too tight to hold all the data (say, 1 billion records), you need to adjust memory usage to reduce memory consumption.

1.2 Main directions of Spark tuning

Most of the work of optimizing Spark performance is tuning memory usage. Normally, when Spark processes a modest amount of data and memory is sufficient, there are no major performance problems as long as the network is healthy. Performance problems usually appear when a Spark application has to compute over a large amount of data (for example, a sudden surge of data). The available resources often cannot satisfy that demand, and in the worst case the cluster may even crash.

Besides memory tuning, there are several other ways to optimize performance. For example, if the Spark job interacts with MySQL, tuning should also take the performance of MySQL into account.

1.3 Main technical means of Spark tuning

1. Use a high-performance serialization library, in order to reduce serialization time and the size of the serialized data.

2. Optimize data structures, in order to reduce memory footprint.

3. Persist (RDD cache) and checkpoint RDDs that are used multiple times.

4. Use a serialized persistence level: MEMORY_ONLY does not serialize, MEMORY_ONLY_SER does.

MEMORY_ONLY takes up more memory space than MEMORY_ONLY_SER.

Note, however, that serialization increases CPU cost, so there is a tradeoff to make.

5. Java virtual machine garbage collection tuning.

6. Shuffle tuning: up to 90% of performance problems are caused by shuffle (this was serious in Spark 1.x; in 2.x it has largely been optimized upstream, so it is much less of a concern).

Other ways to optimize performance:

Improve the parallelism of calculation

Broadcast shared data

The following sections analyze these six tuning techniques.
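Broadcast variables are only listed above and are not analyzed in the sections that follow, so here is a minimal sketch of the standard SparkContext.broadcast API; the app name and lookup data are made-up examples:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch: broadcast a small lookup table once per executor
// instead of shipping it inside every task closure.
val conf = new SparkConf().setAppName("broadcast-demo")    // placeholder app name
val sc   = new SparkContext(conf)

val cityNames   = Map(1 -> "Beijing", 2 -> "Shanghai")     // made-up shared data
val cityNamesBc = sc.broadcast(cityNames)                   // sent to each executor only once

val named = sc.parallelize(Seq(1, 2, 1))
  .map(id => cityNamesBc.value.getOrElse(id, "unknown"))    // tasks read the broadcast copy
println(named.collect().mkString(", "))
```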

II. Diagnosis of Spark memory usage

2.1 Memory cost of objects

1. Every Java/Scala object consists of two parts: an object header, which occupies 16 bytes and mainly holds meta-information such as a pointer to the object's class, and the object data itself. If the object is small, for example an int, the object header can be larger than the data itself.

2. A String object takes about 40 bytes more than its raw data, used to store the string's meta-information. Internally, String keeps the character sequence in a char array, along with information such as the array length. String uses UTF-16 encoding, so each character occupies 2 bytes. For example, a String containing 10 characters occupies roughly 2 × 10 + 40 bytes.

3. Collection types such as HashMap and LinkedList use linked data structures internally, and each element is wrapped in an Entry object. An Entry object not only has its own object header but also a pointer (8 bytes) to the next Entry. In short, such types contain many internal objects; besides the data itself, every extra object adds another object header, so they consume much more memory.

4. Collections of primitive types, such as a collection of int values, internally use the wrapper class Integer to store the elements, which again adds object overhead.

2.2 Getting the memory usage of a Spark program

Go to the driver log directory to view the program running log

less ${SPARK_HOME}/work/app-xxxxxx/0/stderr

You will see lines similar to the following:

19-07-05 05:57:47 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 320.9 KB, free 366.0 MB)
19-07-05 05:57:47 INFO MemoryStore: Block rdd_3_1 stored as values in memory (estimated size 320.9 MB, free 339.9 MB)
19-07-05 05:57:47 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). ... bytes result sent to driver
19-07-05 05:57:48 INFO MemoryStore: Block rdd_3_0 stored as values in memory (estimated size 313.2 MB, free 313.2 MB)

estimated size 320.9 KB: the approximate amount of memory currently used by the block
free 366.0 MB: the remaining free storage memory

From these lines you can see how the tasks are using memory.
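Another way to gauge the object costs described in 2.1 is Spark's built-in org.apache.spark.util.SizeEstimator utility, which estimates the in-memory size of an object graph. A small sketch, with made-up sample objects:

```scala
import java.util.{HashMap => JHashMap}
import org.apache.spark.util.SizeEstimator

// Minimal sketch: estimate the heap footprint of the kinds of objects discussed in 2.1.
val s = "abcdefghij"                                   // a 10-character String
val map = new JHashMap[Integer, String]()
(1 to 1000).foreach(i => map.put(i, "name" + i))       // a collection of boxed keys and Strings

println(s"String (10 chars) ~ ${SizeEstimator.estimate(s)} bytes")
println(s"HashMap (1000)    ~ ${SizeEstimator.estimate(map)} bytes")
```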

III. Spark tuning techniques

2.1 Using a high-performance serialization library

2.1.1 How Spark uses serialization

As a distributed system, Spark, like any other distributed system, needs serialization. Serialization is an important part of any distributed framework: if the serialization technology in use is slow, or the serialized data is large, the performance of the distributed application drops significantly. Therefore, the first step in Spark performance optimization is to optimize serialization.

Spark uses serialization in several places, such as shuffle, and it makes a trade-off between convenience and performance: by default it uses Java's serialization mechanism for convenience. As mentioned before, Java serialization has poor performance: serialization is slow and the serialized data is large. Therefore, in production it is usually best to change the serialization mechanism Spark uses.

2.1.2 Configuring Spark to use Kryo serialization

Spark supports serialization with Kryo. Kryo serializes faster than Java serialization and produces output that is roughly 10 times smaller, but it is somewhat less convenient to use.

Configure spark to use kryo:

When Spark starts, it reads the configuration files in the conf directory. One of them, spark-defaults.conf, is used to specify Spark's working parameters:

vim spark-defaults.conf
spark.serializer org.apache.spark.serializer.KryoSerializer

This configures Spark to use Kryo. You can also set it on the conf object inside a Spark program:

conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

2.1.3 Optimizing the Kryo class library

(1) Optimize the buffer size

If a registered custom type to be serialized is very large, for example containing more than 100 fields, the serialized object becomes very large. In that case Kryo itself needs to be tuned, because Kryo's internal buffer may not be large enough to hold such a large object.

Setting: the spark.kryoserializer.buffer.max parameter can be adjusted to a higher value.

(2) register custom types in advance

When using kryo, it is best to register classes that need to be serialized in advance for higher performance, such as:

Register the class on the conf object: conf.registerKryoClasses(Array(classOf[Teacher]))

Note: this mainly applies to custom classes. When writing Spark projects in Scala, not many custom classes are usually involved, unlike in Java.
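Pulling the Kryo settings from this section together, a minimal sketch of a SparkConf set up for Kryo; the Teacher class, app name, and buffer size are illustrative assumptions:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical custom class that Kryo will serialize.
case class Teacher(id: Int, name: String)

val conf = new SparkConf()
  .setAppName("kryo-demo")                                               // placeholder app name
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer.max", "128m")                        // raise if registered objects are large
  .registerKryoClasses(Array(classOf[Teacher]))                          // register custom types in advance

val sc = new SparkContext(conf)
```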

2.2 Optimizing data structures

2.2.1 Overview

The main purpose of optimizing data structures is to avoid the extra memory overhead caused by language features such as object wrappers and collection classes.

Core: optimize the local data used inside operator functions, as well as the external data referenced by the operators.

Objective: to reduce memory consumption and footprint.

2.2.2 Specific techniques

(1) Prefer arrays and strings over collection classes.

That is, use arrays rather than ArrayList, LinkedList, or HashMap; for example, int[] saves more memory than List<Integer>. As mentioned earlier, collection classes carry extra metadata and have complex class structures, so they take up a lot of memory. The aim is to simplify the structure: as long as it meets the need, the simpler the better.
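A tiny sketch of the idea in Scala (a primitive array versus a boxed collection); the element values are arbitrary:

```scala
// Prefer primitive arrays over boxed collections when the structure allows it.
val idsCompact: Array[Int] = Array(1, 2, 3, 4, 5)  // unboxed ints in one flat array object
val idsBoxed:   List[Int]  = List(1, 2, 3, 4, 5)   // every element boxed, plus one cons cell per element
```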

(2) Convert objects to strings.

In practice, data structures such as a HashMap or List of objects are often concatenated into strings in a special format. For example, Map persons = new HashMap() can be replaced by a single string such as: id:name,address,idCardNum,family...|id:name,address,idCardNum,family...
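A minimal sketch of this concatenation approach, assuming a hypothetical Person record with the fields mentioned above:

```scala
// Hypothetical record that would otherwise be stored as objects in a HashMap or List.
case class Person(id: Int, name: String, address: String, idCardNum: String, family: String)

// Concatenate each record as id:name,address,idCardNum,family and join records with '|'.
def pack(persons: Seq[Person]): String =
  persons.map(p => s"${p.id}:${p.name},${p.address},${p.idCardNum},${p.family}").mkString("|")

// Parse one record back out when it is actually needed.
def unpack(record: String): Person = {
  val idAndRest = record.split(":", 2)
  val fields    = idAndRest(1).split(",", 4)
  Person(idAndRest(0).toInt, fields(0), fields(1), fields(2), fields(3))
}
```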

(3) Avoid using multi-layer nested object structures.

For example, public class Teacher { private List students = new ArrayList() } is not ideal, because a large number of small Student objects are nested inside the Teacher class. Improvement: convert it to JSON and process it as a string, e.g. {"teacherId": 1, "students": [{"studentId": 1}, ...]}
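A small sketch of flattening the nested structure into a JSON string; the case classes follow the example above, and the JSON is built by hand so no particular JSON library is assumed:

```scala
// Hypothetical nested classes from the example above.
case class Student(studentId: Int)
case class Teacher(teacherId: Int, students: Seq[Student])

// Flatten the nested objects into one JSON string per teacher, so Spark holds a
// single String instead of many small nested Student objects.
def toJson(t: Teacher): String = {
  val studentsJson = t.students.map(s => s"""{"studentId": ${s.studentId}}""").mkString(", ")
  s"""{"teacherId": ${t.teacherId}, "students": [$studentsJson]}"""
}

// toJson(Teacher(1, Seq(Student(1), Student(2))))
// => {"teacherId": 1, "students": [{"studentId": 1}, {"studentId": 2}]}
```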

(4) Where possible, use int instead of String.

Although String performs better than a List of objects, int consumes even less memory. For example, for a database primary key such as id, a self-incrementing numeric key is recommended instead of a UUID string.

2.3 RDD caching

This one is simple: cache RDDs that are used repeatedly in memory, to avoid recomputing them each time they are reused. For the implementation, see the previous article on Spark core.

2.4 Using serialization when caching

By default, when an RDD is cached, its objects are not serialized; that is, the persistence level is MEMORY_ONLY. It is recommended to use MEMORY_ONLY_SER instead, because it serializes the data and therefore takes up less memory. For the implementation, see the previous article on Spark core.
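A minimal sketch of switching a reused RDD from the default MEMORY_ONLY caching to the serialized level; the input path is a placeholder and an existing SparkContext sc is assumed:

```scala
import org.apache.spark.storage.StorageLevel

// Assumes an existing SparkContext `sc`; the input path is a placeholder.
val lines = sc.textFile("hdfs:///path/to/input")

// Default caching (lines.cache()) is equivalent to persist(StorageLevel.MEMORY_ONLY)
// and stores deserialized objects; MEMORY_ONLY_SER trades some CPU for less memory.
val cached = lines.persist(StorageLevel.MEMORY_ONLY_SER)

println(cached.count())   // the first action materializes the cached blocks
println(cached.count())   // later actions reuse the serialized blocks
```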

2.5 JVM tuning

2.5.1 Background

If a large amount of data is persisted when persisting RDDs, garbage collection in the Java virtual machine may become a performance bottleneck. The JVM performs garbage collection periodically; it traces all Java objects, finds those that are no longer in use, cleans up the old objects, and makes room for new ones.

The performance overhead of garbage collection is proportional to the number of objects in memory. Also note that before tuning the JVM, you should first do the other tuning work above, otherwise JVM tuning makes little sense: the work above saves memory and uses it more efficiently, and its benefit is much larger than that of JVM tuning. Conversely, even with a well-tuned JVM, if the application does not use memory well, JVM tuning helps little.

2.5.2 GC principles

GC principles are mentioned here only to give readers context; they are easy to look up elsewhere, so they are not repeated in this article.

2.5.3 Detecting garbage collection

We can monitor garbage collection, including how frequently it occurs and how long each collection takes.

In the spark-submit script, add a configuration:

-- conf "spark.executor.extraJavaOptions=-verbose:gc-XX:+PrintGCDetails-XX:+PrintGCTimesStamps" Note: output to worker log, not driver log. / usr/local/spark-2.1.0-bin-hadoop2.7/work/app-20190705055405-0000amp 0 this is the driver log / usr/local/spark-2.1.0-bin-hadoop2.7/logs this is the worker log 2.5.4 optimized Executor memory ratio

The most important knob in GC tuning is the ratio between the memory used for the RDD cache and the memory used by operator execution, that is, the memory used to create objects during tasks. By default, Spark uses 60% of each Executor's memory to cache RDDs, so objects created during task execution have only 40% of the memory available.

In this case, if the objects created by tasks are large, the 40% of memory may be insufficient, which triggers JVM garbage collection. In extreme cases, garbage collection is triggered frequently.

Depending on the actual situation, you can increase the space available for task objects and thereby reduce the probability of GC.

Conf.set ("spark.storage.memoryFraction", 0.5) reduces RDD cache footprint to 50% 2.6 shuffle

2.6 Shuffle tuning

In earlier Spark 1.x versions, when a shuffle occurred, each map task partitioned its output according to the number of result tasks (also called reduce tasks), producing one file per partition for the corresponding result task to process. With a large number of map tasks, a huge number of files were generated, which caused performance problems.

In Spark 2.x, all the data output by a map task is written to a single file, and an index file is added to record where each partition's data is located within that file. This guarantees that each task produces only one data file, which reduces the IO pressure.
