BlockManager
BlockManager is started on the driver and on each executor. The driver holds references to the blockManagers of all executors, and the blockManager of every executor holds a reference to the blockManager on the driver; the blockManagerSlaves constantly send heartbeats to the blockManagerMaster to report updated block information.

When a BlockManager object is created, it creates a MemoryStore and a DiskStore to store and retrieve blocks. If there is enough memory, a block is stored through MemoryStore; if not, it is spilled to disk and stored through DiskStore. DiskStore holds a DiskBlockManager, which is mainly used to create and maintain the mapping between logical blocks and the blocks on disk: a logical block is mapped to a file on disk through its BlockId. DiskStore calls diskManager.getFile, creating the subfolder if it does not exist; the local folder is named spark-local-yyyyMMddHHmmss-xxxx (where xxxx is a random number), and all blocks are stored under it. Compared with MemoryStore, DiskStore has to compute a file path from the hash of the block id and store the block in the corresponding file (a simplified sketch of this mapping is shown below). MemoryStore manages blocks far more simply: it maintains an internal hash map holding all blocks, keyed by block id, so getting a block from MemoryStore is just a matter of looking up the value for that block id in the hash map.

GET: if the block exists locally, it is returned directly. When a block is read locally, the storage level is checked first: if useMemory is set, the block is taken straight from memory; if useDisk is set, it is read from disk and then, depending on useMemory, may be cached in memory to speed up the next access. If the block does not exist locally, it is fetched from another node. The meta information lives on the driver, so the node holding the block is looked up through the GetLocations message mentioned earlier, and the block is then fetched from that node.

PUT: a lock is taken before the PUT operation to avoid multi-threading problems. Depending on the storage level, the corresponding memoryStore or diskStore is chosen and its storage API is called; if replication is required, the data is also backed up to other machines.

Cache, persist, checkpoint

To persist an RDD, simply call cache() or persist() on it. cache() means that the RDD's data is, as far as possible, kept in memory in non-serialized form; persist() means that the persistence level is chosen manually and the data is persisted in the specified way. The default level used by cache() is StorageLevel.MEMORY_ONLY. Checkpoint persists the data to HDFS or to hard disk.

There is also a difference between rdd.persist(StorageLevel.DISK_ONLY) and checkpoint. The former does persist the RDD's partitions to disk, but those partitions are managed by the blockManager: once the driver program finishes, the process the executor runs in (CoarseGrainedExecutorBackend) stops, its blockManager stops with it, and the RDD that was cached to disk is cleared (the whole local folder used by the blockManager is deleted). Checkpoint, on the other hand, persists the RDD to HDFS or to a local folder, and unless it is removed manually (how does one remove a checkpointed RDD, by the way?) it stays there, which means it can be used by the next driver program, while a cached RDD cannot be used by another driver program.
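The caching and checkpointing behaviour described above can be exercised with a minimal sketch like the following; the input path, the checkpoint directory and the sample job are placeholders for illustration, not taken from the original article.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

object PersistVsCheckpointSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "persist-vs-checkpoint")

    // cache(): try to keep the deserialized data in memory (MEMORY_ONLY).
    val counts = sc.textFile("/tmp/input.txt")       // hypothetical input path
      .flatMap(_.split("\\s+"))
      .map((_, 1L))
      .reduceByKey(_ + _)
    counts.cache()                                   // same as persist(StorageLevel.MEMORY_ONLY)

    // persist(): choose the storage level explicitly, e.g. disk only.
    val onDisk = counts.map { case (w, c) => s"$w\t$c" }
    onDisk.persist(StorageLevel.DISK_ONLY)           // managed by the blockManager, gone when the app exits

    // checkpoint(): write to HDFS (or a local dir) so the data outlives this driver program.
    sc.setCheckpointDir("hdfs:///tmp/checkpoints")   // hypothetical checkpoint directory
    onDisk.checkpoint()                              // truncates the lineage once materialized
    onDisk.count()                                   // action triggers both persistence and checkpointing

    sc.stop()
  }
}
```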
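The block-to-file mapping mentioned above can also be illustrated with a short sketch. This is a simplification for illustration only, not Spark's actual DiskBlockManager code; the localDirs value, the subDirsPerLocalDir constant and the directory names are assumptions made for the example.

```scala
import java.io.File

// Simplified sketch of how a block id can be hashed to a file under the
// spark-local-* folders created by an executor.
object DiskBlockMappingSketch {
  // Hypothetical configuration: one local dir and 64 subfolders per dir.
  val localDirs: Array[File] = Array(new File("/tmp/spark-local-20250116-0001"))
  val subDirsPerLocalDir: Int = 64

  // Map a block id (e.g. "rdd_3_7") to a deterministic file path.
  def getFile(blockId: String): File = {
    val hash = math.abs(blockId.hashCode)                          // non-negative hash of the block id
    val dirId = hash % localDirs.length                            // pick one of the local dirs
    val subDirId = (hash / localDirs.length) % subDirsPerLocalDir  // pick a subfolder inside it
    val subDir = new File(localDirs(dirId), "%02x".format(subDirId))
    if (!subDir.exists()) subDir.mkdirs()                          // create the subfolder on demand
    new File(subDir, blockId)                                      // one file per logical block
  }
}
```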
Broadcast, accumulator

Broadcast variables allow the programmer to cache a read-only variable on each machine instead of shipping a copy of it with every task (note that it is a large read-only value and cannot be modified). An Accumulator is, as its name implies, a variable that can only be added to. Only the driver can read the value of an accumulator (through the value method); tasks can only increase it (with +=). The result is only guaranteed to be accurate if a single action is used while the accumulator is being updated; if it has to be used across more than one action, use cache or persist to break the lineage so that the updates are not recomputed.
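A minimal sketch of a broadcast variable, assuming a small hypothetical lookup table; the data and the application name are placeholders.

```scala
import org.apache.spark.SparkContext

object BroadcastSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "broadcast-sketch")

    // A read-only lookup table that every task needs: broadcast it once per
    // executor instead of shipping it with every task.
    val countryNames = Map("CN" -> "China", "US" -> "United States")        // hypothetical data
    val lookup = sc.broadcast(countryNames)

    val codes = sc.parallelize(Seq("CN", "US", "CN"))
    val names = codes.map(code => lookup.value.getOrElse(code, "unknown"))  // read-only access in tasks
    names.collect().foreach(println)

    lookup.destroy()   // optional: release the broadcast data on the executors
    sc.stop()
  }
}
```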
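And a minimal accumulator sketch illustrating the caveat above: cache the RDD before running a second action so that the accumulator updates are not recomputed. The badRecords name and the sample data are assumptions for the example.

```scala
import org.apache.spark.SparkContext

object AccumulatorSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "accumulator-sketch")

    val badRecords = sc.longAccumulator("badRecords")   // only the driver can read its value

    val parsed = sc.parallelize(Seq("1", "2", "x", "3"))
      .flatMap { s =>
        try Some(s.toInt)
        catch { case _: NumberFormatException => badRecords.add(1L); None }  // tasks can only add
      }

    // Cache before running more than one action; otherwise the flatMap (and the
    // accumulator updates) would be recomputed and the count inflated.
    parsed.cache()
    parsed.count()
    println(s"bad records: ${badRecords.value}")   // read the value on the driver

    parsed.collect()                               // second action: no double counting thanks to cache()
    sc.stop()
  }
}
```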