
How to Tune Serialization in Spark

2025-02-28 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/01 Report--

In this article, the editor shares how to tune serialization in Spark. Most people do not understand this topic well, so it is shared here for your reference. I hope you will learn a lot from reading it; let's get to know it!

Serialization plays an important role in any distributed application. A very slow serialization process, or a serialization format that consumes a lot of bytes, can greatly slow down computation, so serialization is usually the first thing to optimize in a Spark application. Spark aims to strike a balance between convenience (letting you work with Java types directly and easily in your operations) and performance. Currently, Spark provides two serialization libraries:

1. Java serialization: by default, Spark serializes objects using Java's ObjectOutputStream framework. It works with any class you create that implements java.io.Serializable. You can also control the performance of serialization more closely by implementing java.io.Externalizable. Java serialization is flexible, but it is often very slow and produces large serialized formats for many classes.
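The default mechanism described above can be sketched in plain Scala, outside of Spark. This is a minimal round trip through ObjectOutputStream; the Point class is a hypothetical example, and the size printed will vary by JVM:

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// A minimal sketch: any class that implements java.io.Serializable
// (Scala case classes do so automatically) can be written with
// ObjectOutputStream, the default mechanism Spark uses.
case class Point(x: Int, y: Int)

object JavaSerDemo {
  def main(args: Array[String]): Unit = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(Point(1, 2))
    out.close()
    // The stream carries full class metadata alongside the two ints,
    // which is why Java serialization tends to be large and slow.
    println(s"serialized size: ${bytes.size()} bytes")
  }
}
```

Running this shows the serialized form is far larger than the 8 bytes of payload, which is the overhead Kryo is designed to reduce.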

2. Kryo serialization: Spark can also use the Kryo library (version 2) to accelerate serialization. Kryo is significantly faster and more compact than Java serialization (often around 10x), but it does not support all Serializable types, and for best performance you need to register in advance the classes you use in your program.

You can change Spark's serializer through SparkConf. This setting affects not only the shuffle data transferred between worker nodes but also RDDs that are serialized to disk. The main reason Kryo is not the default serializer is that it requires custom registration, but we recommend using it in any network-intensive application.
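Switching the serializer looks like the following sketch; "MyApp" is a placeholder application name, while spark.serializer and the KryoSerializer class name are the documented Spark configuration values:

```scala
import org.apache.spark.SparkConf

// A minimal sketch of enabling Kryo via SparkConf. This one setting
// changes serialization for both shuffle data and RDDs spilled/persisted
// in serialized form.
val conf = new SparkConf()
  .setAppName("MyApp")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
```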

Spark automatically includes Kryo serializers for most commonly used core Scala classes.

To register your own classes with Kryo, use the registerKryoClasses method:

val conf = new SparkConf().setMaster(...).setAppName(...)
conf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2]))
val sc = new SparkContext(conf)

The Kryo documentation (https://github.com/EsotericSoftware/kryo) describes more advanced registration options, such as adding custom serialization code.

If your objects are very large, you may also need to increase the spark.kryoserializer.buffer setting. This value needs to be large enough to hold the largest object you will serialize.

Finally, if you don't register your custom types with Kryo, Kryo will still work, but it will have to store the full class name with every object it writes, which is wasteful.
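One way to catch this waste during development is the spark.kryo.registrationRequired setting, which makes Kryo fail fast on unregistered classes instead of silently writing full class names. A sketch, reusing the MyClass1 placeholder from the example above:

```scala
import org.apache.spark.SparkConf

// A sketch: with spark.kryo.registrationRequired set to "true",
// serializing any class that was not registered throws an error,
// so missing registrations surface immediately rather than as
// silently bloated serialized output.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrationRequired", "true")
  .registerKryoClasses(Array(classOf[MyClass1]))
```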

For Spark's Kryo configuration options, please refer to: http://spark.apache.org/docs/1.6.0/configuration.html#compression-and-serialization

That is all the content of the article "How to Tune Serialization in Spark". Thank you for reading! I hope the content shared here is helpful to you; if you want to learn more, welcome to follow the industry information channel!


