2025-02-22 Update From: SLTechnology News & Howtos
Shulou(Shulou.com)06/03 Report--
If serialization is slow, or the serialized data is still very large, the performance of a distributed application will be greatly degraded. Spark itself serializes data in several places, such as when writing shuffle data to disk, and if your operator functions use external data (whether Java built-in types or custom types), that data must also be serializable.
By default, Spark uses the serialization mechanism provided by Java itself, based on ObjectOutputStream and ObjectInputStream. Because this approach is natively provided by Java, it is easy to use: as long as your class implements the Serializable interface, it can be serialized. However, the performance of the Java serialization mechanism is not high: it is slow, and the serialized data is still relatively large.
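To make that overhead concrete, here is a minimal, self-contained sketch (the class and field names are my own illustration, not from the article) that serializes a tiny object with ObjectOutputStream and prints how many bytes it produces. The payload is only two ints (8 bytes), but the serialized form is far larger because the stream also records the fully qualified class name, the serialVersionUID, and field descriptors:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerSizeDemo {
    // Any class implementing Serializable can be written with ObjectOutputStream.
    static class Point implements Serializable {
        private static final long serialVersionUID = 1L;
        int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    // Serialize an object to a byte array using Java's built-in mechanism.
    static byte[] javaSerialize(Object o) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(o);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = javaSerialize(new Point(1, 2));
        // Much larger than the 8 bytes of actual field data,
        // because class metadata is embedded in the stream.
        System.out.println("serialized size = " + bytes.length + " bytes");
    }
}
```

Kryo avoids much of this per-object metadata, which is why registering classes (so it can use a small numeric ID instead of the class name) matters.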
Spark also supports serialization with Kryo, which is faster than the Java serialization mechanism and produces smaller serialized data. The reason the Kryo serialization mechanism is not the default is that, although some types implement the Serializable interface, they may not necessarily be serializable by Kryo; in addition, to get the best performance, Kryo requires you to register all the types you need to serialize in your Spark application.
To use the Kryo serialization mechanism, first set a parameter on SparkConf: new SparkConf().set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"). This sets Spark's serializer to KryoSerializer, so that some of Spark's internal operations, such as shuffle, use the Kryo library for high-performance, fast, low-memory serialization.
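Putting the pieces together, a minimal setup might look like the following sketch (it assumes a Spark dependency on the classpath; the application name is illustrative):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class KryoSetup {
    public static void main(String[] args) {
        // Switch Spark's serializer from the Java default to Kryo.
        SparkConf conf = new SparkConf()
            .setAppName("kryo-demo")
            .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... run jobs; shuffle data is now serialized with Kryo ...
        sc.stop();
    }
}
```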
For best performance, Kryo requires the classes to be serialized to be registered in advance: if you do not register them, Kryo must store the fully qualified class name with every serialized object, which takes up a lot of extra space. By default, Spark automatically registers the types commonly used in Scala with Kryo (via the AllScalaRegistrar class).
However, if you use an externally defined custom type in your own operators, you still need to register it yourself.
(Note: the following example is actually broken as written, because counter is not shared across executors; each task gets its own serialized copy, so the accumulation does not work. It is shown only to illustrate that the custom type must be serializable.)
val counter = new Counter()
val numbers = sc.parallelize(Array(1, 2, 3, 4, 5))
numbers.foreach(num => counter.add(num))
If you want to register a custom type, you can use the following code:
Scala version:
val conf = new SparkConf().setMaster(...).setAppName(...)
conf.registerKryoClasses(Array(classOf[Counter]))
val sc = new SparkContext(conf)
Java version:
SparkConf conf = new SparkConf().setMaster(...).setAppName(...);
conf.registerKryoClasses(new Class<?>[]{Counter.class});
JavaSparkContext sc = new JavaSparkContext(conf);
Optimize the use of Kryo class libraries
1. Optimize the cache size
If the custom type you register for serialization is very large, for example containing more than 100 fields, then the serialized object will be very large too. At this point, Kryo itself needs to be tuned, because Kryo's internal buffer may not be big enough to hold such a large object. You need to call the SparkConf.set() method to increase the value of the spark.kryoserializer.buffer.mb parameter.
By default, its value is 2, meaning the buffer can hold at most 2 MB of object data for serialization. You can turn it up if necessary, for example to 10.
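As a sketch, raising the buffer might look like this (spark.kryoserializer.buffer.mb is the legacy parameter name this article uses; recent Spark versions replace it with spark.kryoserializer.buffer and spark.kryoserializer.buffer.max):

```java
import org.apache.spark.SparkConf;

public class KryoBufferTuning {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
            .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
            // Raise the Kryo buffer from the default 2 MB to 10 MB so that
            // very large registered objects still fit during serialization.
            .set("spark.kryoserializer.buffer.mb", "10");
    }
}
```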
2. Pre-register custom types
Although the Kryo library works without registering custom types, it then has to save a copy of the fully qualified class name with each object it serializes, which consumes a lot of memory. Therefore, it is usually recommended to pre-register the custom classes you intend to serialize.
In what scenarios should you use the Kryo serialization library?
First of all, what we discuss here covers both common Spark scenarios and some special ones, such as the persistence of RDDs.
The main use case for the Kryo serialization library here is when an operator function uses a large external object. For example, suppose we define an external object that encapsulates all of the application's configuration, such as a custom MyConfiguration object containing 100 MB of data, and then use this large external object inside an operator function.
In that case, if Spark serializes such a large external object with the default Java serialization mechanism, serialization will be slow, and the serialized data will still be relatively large, taking up more memory.
Therefore, in this situation it is more appropriate to switch to the Kryo serialization library for serializing large external objects: first, serialization will be faster; second, the serialized data will take up less memory.