

How to understand Spark tuning


How should we understand Spark tuning? This article analyzes the question in detail and gives corresponding solutions, in the hope of helping readers who face this problem find a simple, practical approach.

Data Serialization

Spark currently provides two serialization libraries: Java serialization and Kryo serialization. Java serialization is flexible but slow; Kryo serialization is considerably faster and produces more compact output.
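As a minimal sketch of how Kryo is typically enabled (MyRecord and MyKey are hypothetical classes standing in for your own types):

    import org.apache.spark.SparkConf

    // hypothetical record types, used only for illustration
    case class MyRecord(id: Long, value: Double)
    case class MyKey(id: Long)

    val conf = new SparkConf()
      .setAppName("kryo-example")
      // switch from the default Java serializer to Kryo
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // registering classes up front lets Kryo write compact class identifiers
      .registerKryoClasses(Array(classOf[MyRecord], classOf[MyKey]))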

Memory Tuning

Determining Memory Consumption

The best way to measure how much memory a dataset consumes is to create an RDD from it, put it into the cache, and check the storage information reported in your driver program's SparkContext logs.
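A minimal sketch of that procedure, assuming an existing SparkContext named sc (as in spark-shell) and a placeholder input path:

    import org.apache.spark.storage.StorageLevel

    // cache the RDD and force evaluation so it is actually materialized
    val lines = sc.textFile("hdfs:///path/to/data").persist(StorageLevel.MEMORY_ONLY)
    lines.count()
    // the driver's SparkContext log (and the Storage page of the web UI)
    // now shows how much memory the cached partitions occupy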

Tuning Data Structures

The first way to reduce memory consumption is to avoid expensive Java features such as pointer-based data structures and wrapper objects. There are several ways to do this:

1. Design your data structures as arrays of objects or primitive types rather than Java or Scala collection classes (such as HashMap).

The fastutil library provides convenient collection classes for primitive types that are compatible with the Java standard library (http://fastutil.di.unimi.it/).

2. If possible, avoid nested data structures with many small objects.

3. Consider using numeric IDs or enumeration objects instead of strings as keys.

4. If less than 32 GB of memory is used for Spark, set the JVM flag -XX:+UseCompressedOops to reduce pointer size from 8 bytes to 4 bytes. You can add this option to spark-env.sh (a configuration sketch follows this list).
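As a rough illustration of points 3 and 4 (the User class is hypothetical, and the extraJavaOptions settings are an alternative to the spark-env.sh approach the article mentions, not something it prescribes):

    import org.apache.spark.SparkConf

    // point 3: key records by a numeric id instead of a string
    case class User(id: Long, name: String)   // use user.id, not user.name, as the key

    // point 4: pass the compressed-oops flag to the executor and driver JVMs
    val conf = new SparkConf()
      .set("spark.executor.extraJavaOptions", "-XX:+UseCompressedOops")
      .set("spark.driver.extraJavaOptions", "-XX:+UseCompressedOops")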

Serialized RDD Storage

When your objects are still large after this tuning, a simpler way to reduce memory usage is to store them in serialized form, using one of the serialized storage levels in the RDD persistence API, such as MEMORY_ONLY_SER.

Spark then stores each RDD block as one large byte array. The only disadvantage of storing data in serialized form is slower access times, since each object must be deserialized on the fly.

If you want to keep your data in memory in serialized form, we highly recommend using the Kryo library, because it produces much smaller output than Java serialization (and certainly smaller than the raw, unserialized Java objects).
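Putting the two together, a minimal sketch of serialized caching (the input path is a placeholder, and Kryo is assumed to be configured as shown earlier):

    import org.apache.spark.storage.StorageLevel

    val events = sc.textFile("hdfs:///path/to/events")   // placeholder input path
    // persist in serialized form; with KryoSerializer configured, the cached
    // byte arrays are much smaller than Java-serialized ones
    events.persist(StorageLevel.MEMORY_ONLY_SER)
    events.count()   // force evaluation so the data is actually cached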

This concludes the discussion of how to understand Spark tuning. Hopefully the content above is of some help; if you still have questions, follow the industry information channel for more related knowledge.



