Many newcomers to big data development are unsure how to implement Spark RDD persistence and caching. This article walks through the mechanisms involved and how to use them, so that by the end you should be able to apply them yourself.
1. RDD caching mechanism: cache and persist
One reason Spark is so fast is that RDDs support caching. Once an RDD is cached successfully, later operations that use the dataset read it directly from the cache. Cached data can be lost, but because each RDD records its dependencies, only the lost partitions need to be recomputed rather than the whole dataset.
The operators involved are persist, cache, and unpersist. Like transformations, persist and cache are lazy: they only mark the RDD for caching, and nothing is materialized until an action runs.
Caching writes computed results to a storage medium chosen by the user-specified storage level (the storage level defines where cached data lives; memory, off-heap memory, and disk are currently supported).
By caching, Spark avoids recomputing RDDs and can greatly speed up computation. RDD persistence (caching) is one of Spark's most important features and a key enabler of iterative algorithms and fast interactive queries.
When an RDD is persisted, each node keeps the partitions it computes in memory and reuses them in other actions on that dataset (or datasets derived from it), which makes subsequent actions faster. Use the persist() method to mark an RDD as persistent. It is only a marker: the RDD is not computed and persisted at the point where persist() appears; the data is persisted only when the first action triggers the actual computation. An RDD can be marked for persistence with either persist() or cache(); once persistence is triggered, the RDD is kept in the memory of the compute nodes and reused.
Deciding when to cache data is a trade-off between space and speed. As a rule of thumb, if an RDD is used by multiple actions and is expensive to compute, it should be cached, as the sketch below illustrates.
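As a minimal sketch (the input path and the word-count pipeline are hypothetical, not from this article), the following shows that persist() only marks the RDD, that the first action materializes the cache, and that later actions reuse it:

// Hypothetical spark-shell example: an expensive RDD reused by several actions
import org.apache.spark.storage.StorageLevel
val lines = sc.textFile("/tmp/data.txt")          // assumed input path
val wordCounts = lines.flatMap(_.split("\\s+")).map(word => (word, 1)).reduceByKey(_ + _)
wordCounts.persist(StorageLevel.MEMORY_ONLY)      // only a marker; nothing is computed yet
wordCounts.count()                                // first action: computes and caches the partitions
wordCounts.take(10)                               // reuses the cached data instead of recomputing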
The cache may be lost, or data held in memory may be evicted when memory runs low. The cache's fault-tolerance mechanism guarantees that computation still completes correctly even if cached data is lost: the missing data is recomputed through the chain of RDD transformations. Because each partition of an RDD is relatively independent, only the lost partitions need to be recomputed, not all of them.
1.1 Cache levels
Spark supports multiple cache levels:
- MEMORY_ONLY (the default cache level): stores the RDD in the JVM as deserialized Java objects. If there is not enough memory, some partitions are not cached and are recomputed when needed.
- MEMORY_AND_DISK: stores the RDD in the JVM as deserialized Java objects; partitions that do not fit in memory are spilled to disk and read from disk when needed.
- MEMORY_ONLY_SER: stores the RDD as serialized Java objects (one byte array per partition). More space-efficient than deserialized objects, but more CPU-intensive to read. Only available in Java and Scala.
- MEMORY_AND_DISK_SER: like MEMORY_ONLY_SER, but partitions that do not fit in memory are spilled to disk instead of being recomputed when needed. Only available in Java and Scala.
- DISK_ONLY: caches the RDD partitions on disk only.
- MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.: same as the corresponding levels above, but each partition is replicated on two cluster nodes.
- OFF_HEAP: similar to MEMORY_ONLY_SER, but stores the data in off-heap memory. This requires off-heap memory to be enabled.
Two parameters must be configured to enable off-heap memory:
spark.memory.offHeap.enabled: whether off-heap memory is enabled. Defaults to false; set it to true.
spark.memory.offHeap.size: the size of the off-heap memory space. Defaults to 0; set it to a positive value.
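As an illustrative sketch only (the application name and the 2g size are arbitrary example values), off-heap memory can be enabled through SparkConf when building an application, after which the OFF_HEAP storage level becomes usable:

// Illustrative standalone-application configuration (not for an already-running spark-shell)
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel
val conf = new SparkConf()
  .setAppName("offheap-cache-demo")                // hypothetical name
  .set("spark.memory.offHeap.enabled", "true")     // default false
  .set("spark.memory.offHeap.size", "2g")          // default 0; must be positive
val sc = new SparkContext(conf)
sc.parallelize(1 to 1000).persist(StorageLevel.OFF_HEAP).count()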
1.2 Using the cache
There are two ways to cache data: persist and cache. Internally, cache simply calls persist; it is a specialized form equivalent to persist(StorageLevel.MEMORY_ONLY). For example:
// All storage levels are defined in the StorageLevel object
import org.apache.spark.storage.StorageLevel
fileRDD.persist(StorageLevel.MEMORY_AND_DISK)
fileRDD.cache()   // equivalent to persist(StorageLevel.MEMORY_ONLY)
A cached RDD is marked with a green dot in the DAG visualization in the Spark web UI.
1.3 Removing the cache
Spark automatically monitors cache usage on each node and evicts old data partitions following a least-recently-used (LRU) policy. Cached data can also be removed manually with the RDD.unpersist() method, as sketched below.
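A minimal sketch, reusing the fileRDD placeholder from the example above:

fileRDD.cache()            // mark the RDD for caching
fileRDD.count()            // an action materializes the cache
fileRDD.unpersist()        // manually remove the cached blocks when they are no longer needed
// unpersist(blocking = true) waits until the blocks are removed; the default differs across Spark versions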
2. RDD fault-tolerance mechanism: Checkpoint
2.1 Checkpoint
The operator involved is checkpoint; like persist and cache, it is lazy.
In addition to persistence, Spark provides a checkpointing mechanism, which essentially writes an RDD to highly reliable storage; its main purpose is fault tolerance. Checkpointing is implemented by writing the data to the HDFS file system.
Checkpointing addresses the cost of long lineage: if an RDD's lineage is too long, recovering from a failure becomes too expensive, so it pays to checkpoint at an intermediate stage. If a node fails and a partition is lost, recomputation can restart from the checkpointed RDD instead of replaying the whole lineage, which reduces the overhead.
2.2 Differences between cache and checkpoint
There is a significant difference between cache and checkpoint. Cache computes the RDD and keeps it in memory, but the RDD's dependency chain (lineage) cannot be discarded: if an executor crashes, the cached partitions are lost and must be recomputed by replaying that chain. Checkpoint, by contrast, stores the RDD in HDFS, which provides reliable multi-replica storage, so the dependency chain can be discarded; the lineage is cut at the checkpoint.
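A hedged sketch of combining the two (paths and names are illustrative): the RDD API documentation recommends persisting an RDD before checkpointing it, because the checkpoint is written by a separate job that would otherwise recompute the RDD from scratch:

sc.setCheckpointDir("hdfs:///tmp/checkpoints")       // assumed HDFS path
import org.apache.spark.storage.StorageLevel
val expensive = sc.textFile("hdfs:///tmp/input")     // assumed input
  .map(line => (line, line.length))
expensive.persist(StorageLevel.MEMORY_AND_DISK)      // cache first so the checkpoint job reuses the data
expensive.checkpoint()                               // lazy: takes effect on the next action
expensive.count()                                    // triggers both the caching and the checkpoint write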
2.3 Scenarios where checkpoint fits
The following scenarios are suitable for using checkpoint mechanisms:
The lineage in the DAG is too long, and recomputing it would be too expensive.
Checkpointing on a wide dependency yields greater benefits.
Like cache, checkpoint is lazy: it takes effect only when an action runs. For example:
val rdd1 = sc.parallelize(1 to 100000)
// Set the checkpoint directory
sc.setCheckpointDir("/tmp/checkpoint")
val rdd2 = rdd1.map(_ * 2)
rdd2.checkpoint()
// checkpoint is lazy
rdd2.isCheckpointed
// View the RDD's dependencies before the checkpoint
rdd2.dependencies(0).rdd
rdd2.dependencies(0).rdd.collect()
// Execute an action once to trigger the checkpoint
rdd2.count()
rdd2.isCheckpointed
// View the RDD's dependencies again: after the checkpoint, the lineage is truncated and now starts from the checkpointed data
rdd2.dependencies(0).rdd
rdd2.dependencies(0).rdd.dependencies
// View the checkpoint file that the RDD depends on
rdd2.getCheckpointFile

After reading this, you should have a good grasp of how RDD persistence, caching, and checkpointing are implemented in Spark for big data development. Thank you for reading!