

How to Implement Spark Kryo Serialization




This article explains how to implement Spark Kryo serialization. Many people run into questions about this in day-to-day work, so the sections below walk through a simple, practical way to set it up. Read on and follow along.

When broadcasting large variables, each executor keeps a single copy of the variable in its BlockManager. That alone reduces network transmission and overall memory usage, but we can go further: by enabling the Kryo serialization mechanism, the serialization format itself is optimized, cutting network transfer and memory consumption even more.
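As a rough illustration (the lookup table, data, and app name are invented for the example, not taken from the article), a broadcast variable combined with the Kryo setting might look like this:

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

public class BroadcastWithKryoExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("BroadcastWithKryo")   // placeholder app name
                .setMaster("local[*]")             // placeholder master
                .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Imagine this map holds many entries; after the broadcast, each
        // executor keeps one copy of it in its BlockManager.
        Map<String, String> largeLookup = new HashMap<>();
        largeLookup.put("1", "electronics");
        Broadcast<Map<String, String>> lookup = sc.broadcast(largeLookup);

        JavaRDD<String> categories = sc.parallelize(Arrays.asList("1", "2", "3"))
                .map(id -> lookup.value().getOrDefault(id, "unknown"));

        System.out.println(categories.collect());
        sc.stop();
    }
}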

By default, Spark uses Java's serialization mechanism (ObjectOutputStream / ObjectInputStream, the object I/O stream mechanism). The advantage of this default is convenience: nothing has to be done manually, as long as every variable used in an operator function is serializable, i.e. implements the Serializable interface. The disadvantage is efficiency: serialization is relatively slow, and the serialized data still occupies a relatively large amount of memory.
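For instance, here is a minimal sketch of the default behaviour; the Product class and its fields are hypothetical, but the point is that any custom type captured by an operator function must implement Serializable when the default Java mechanism is used:

import java.io.Serializable;
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

// Hypothetical custom type; under default Java serialization it must implement Serializable.
class Product implements Serializable {
    String name;
    double price;
    Product(String name, double price) { this.name = name; this.price = price; }
}

public class DefaultSerializationExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("DefaultSerialization").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // An external variable of a custom type used inside the operator
        // function; it is shipped to executors, so it must be serializable.
        Product sample = new Product("sample", 9.99);
        JavaRDD<String> names = sc.parallelize(Arrays.asList("a", "b", "c"))
                .map(s -> s + ":" + sample.name);

        System.out.println(names.collect());
        sc.stop();
    }
}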

Spark supports the Kryo serialization mechanism as a way to manually optimize the serialization format. Kryo is faster than the default Java serialization mechanism, and the serialized data is smaller, roughly 1/10 the size produced by Java serialization. After switching to Kryo, less data needs to travel over the network, and the memory consumed in the cluster can be greatly reduced.

The Kryo serialization mechanism, once enabled, will take effect in several places:

External variables used in operator functions

External variables referenced inside operator functions must be serialized before they can be shipped to the executors. With Kryo enabled, those variables are serialized with Kryo, which improves network-transfer performance and reduces memory occupation in the cluster.

Serialization when an RDD is persisted with a serialized storage level, e.g. StorageLevel.MEMORY_ONLY_SER

Persisting RDDs in serialized form optimizes memory usage and consumption: the less memory persisted RDDs occupy, the less likely the objects created during task execution are to fill up memory and trigger frequent GC. When a serialized persistence level is used, each RDD partition is serialized into a single large byte array, and Kryo further improves the efficiency and performance of that serialization.
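A small, hedged sketch of serialized persistence (app name and data are placeholders):

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

public class SerializedPersistExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("SerializedPersist")
                .setMaster("local[*]")
                .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

        // Each partition is stored as a serialized byte array; with Kryo
        // enabled, that serialization is faster and more compact.
        numbers.persist(StorageLevel.MEMORY_ONLY_SER());

        System.out.println(numbers.count());
        sc.stop();
    }
}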

Shuffle

Shuffle: Kryo can optimize network-transfer performance here as well. In the shuffle operations between stages, tasks on different nodes pull and transfer large amounts of data over the network. Because that data travels across the network, it has to be serialized, so Kryo is used for it too.
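As an illustrative sketch (not code from the article), a word count with reduceByKey shuffles records between stages, and those records are serialized with whatever serializer is configured:

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class ShuffleExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("ShuffleWithKryo")
                .setMaster("local[*]")
                .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // reduceByKey shuffles data between stages; the shuffled records
        // are serialized with Kryo because of the setting above.
        JavaPairRDD<String, Integer> counts = sc
                .parallelize(Arrays.asList("a", "b", "a", "c", "b", "a"))
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey(Integer::sum);

        System.out.println(counts.collectAsMap());
        sc.stop();
    }
}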

SparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

That is, set the spark.serializer property on the SparkConf to the org.apache.spark.serializer.KryoSerializer class.
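In context, the setting looks roughly like this (app name and master are placeholders):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class KryoConfigExample {
    public static void main(String[] args) {
        // Switch the serializer from the Java default to Kryo.
        SparkConf conf = new SparkConf()
                .setAppName("KryoConfig")          // placeholder app name
                .setMaster("local[*]")             // placeholder master
                .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");

        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... build and run the job as usual ...
        sc.stop();
    }
}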

The reason Kryo is not the default serialization library: to reach its best performance, Kryo requires you to register the custom classes you use (for example, if an operator function uses a variable of an externally defined custom type, that class must be registered, otherwise Kryo cannot achieve its best performance). Because this feels cumbersome, Spark does not enable Kryo by default.

Register the custom classes that need to be serialized through Kryo with SparkConf.registerKryoClasses().
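A hedged sketch of the registration call; CategorySortKey is the class named in the article's own snippet, but the field shown here is made up for illustration:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

// Custom type used inside operator functions; its field is illustrative.
class CategorySortKey implements java.io.Serializable {
    long clickCount;
    CategorySortKey(long clickCount) { this.clickCount = clickCount; }
}

public class KryoRegistrationExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("KryoRegistration")
                .setMaster("local[*]")
                .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
                // Registering the class lets Kryo write a compact id instead
                // of the full class name, which is what gives Kryo its best
                // performance.
                .registerKryoClasses(new Class<?>[]{CategorySortKey.class});

        JavaSparkContext sc = new JavaSparkContext(conf);
        sc.stop();
    }
}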

Usage in a project:

new SparkConf()
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .registerKryoClasses(new Class[]{CategorySortKey.class})

This concludes the study of "How to Implement Spark Kryo Serialization"; hopefully it has resolved your doubts. Pairing theory with practice is the best way to learn, so go and try it. If you want to keep learning more related knowledge, please continue to follow the site, where more practical articles are on the way.


