Analysis of the Spark 2.x BlockManager Principle

This article explains the principles behind the Spark 2.x BlockManager in detail. The editor shares it as a reference, and hopefully you will have a good understanding of the relevant concepts after reading it.

I. Overview

BlockManager is the module responsible for reading, writing and managing block data in Spark's underlying storage layer.

For each Spark application, the Driver node holds a BlockManagerMaster instance, and each Executor holds a corresponding BlockManager instance; together they form a data management system with a Master/Slave architecture. For example, a ShuffleWriter writes data to disk or memory through BlockManager, and each Task that needs to pull data first establishes a connection through BlockManager and then fetches the data.

Below is a brief introduction to how BlockManager works.

II. The overall architecture of BlockManager

The architecture diagram is explained in detail below:

1. As the BlockManager architecture diagram shows, for each Spark application the Driver initializes a BlockManagerMaster instance, which in turn creates a BlockManagerMasterEndpoint instance. BlockManagerMasterEndpoint is a ThreadSafeRpcEndpoint that receives message requests from the BlockManager on each Executor and processes them accordingly. The relevant code in the SparkEnv class is as follows:

val blockManagerMaster = new BlockManagerMaster(registerOrLookupEndpoint(
  BlockManagerMaster.DRIVER_ENDPOINT_NAME,
  new BlockManagerMasterEndpoint(rpcEnv, isLocal, conf, listenerBus)),
  conf, isDriver)

2. BlockManagerMasterEndpoint manages a HashMap of BlockManagerInfo, which stores the mapping from BlockManagerId to BlockManagerInfo. This is effectively the metadata for the blocks held by each Executor: whenever a BlockManager adds or removes a block, the corresponding metadata here must be updated. The relevant code in the BlockManagerMasterEndpoint class is as follows:

// Mapping from block manager id to the block manager's information.
private val blockManagerInfo = new mutable.HashMap[BlockManagerId, BlockManagerInfo]

3. BlockManagerInfo stores the status information of all the blocks on its Executor. This is again a HashMap structure, mapping each BlockId to its BlockStatus. The relevant code in the BlockManagerInfo class is as follows:

// Mapping from block id to its status.
private val _blocks = new JHashMap[BlockId, BlockStatus]

To sum up, the Driver side maintains the block metadata of every node through BlockManagerMaster: whenever a block is added, removed or changed on any BlockManager, the metadata kept here is updated accordingly.
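
As a concrete illustration of that bookkeeping, here is a minimal, self-contained Scala sketch. The types below are simplified stand-ins that borrow Spark's names (BlockManagerId, BlockId, BlockStatus) but are not Spark's real classes, and the update rule is deliberately reduced to "drop the entry when both sizes are zero".

import scala.collection.mutable

// Simplified stand-ins for Spark's internal types; the field lists are illustrative only.
case class BlockManagerId(executorId: String, host: String, port: Int)
case class BlockId(name: String)
case class BlockStatus(storageLevel: String, memSize: Long, diskSize: Long)

// Analogous to BlockManagerInfo: per-executor map from block id to block status.
class BlockManagerInfoSketch {
  private val _blocks = mutable.HashMap.empty[BlockId, BlockStatus]

  def updateBlockInfo(blockId: BlockId, status: BlockStatus): Unit =
    if (status.memSize == 0 && status.diskSize == 0) _blocks.remove(blockId) // block was dropped
    else _blocks(blockId) = status                                           // block added or changed

  def blocks: collection.Map[BlockId, BlockStatus] = _blocks
}

// Analogous to BlockManagerMasterEndpoint: map from block manager id to its info.
object MasterMetadataSketch extends App {
  private val blockManagerInfo = mutable.HashMap.empty[BlockManagerId, BlockManagerInfoSketch]

  val bmId = BlockManagerId("executor-1", "host-a", 7337)
  blockManagerInfo.getOrElseUpdate(bmId, new BlockManagerInfoSketch)
    .updateBlockInfo(BlockId("rdd_0_1"), BlockStatus("MEMORY_ONLY", memSize = 1024L, diskSize = 0L))

  println(blockManagerInfo(bmId).blocks) // Map(BlockId(rdd_0_1) -> BlockStatus(MEMORY_ONLY,1024,0))
}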

4. On the Executor side there is a BlockManager instance, which contains four important components. They are introduced briefly here, with a simplified sketch of how they fit together after the list; the later source code analysis describes them in detail:

1). DiskStore

Responsible for reading and writing disk data

2). MemoryStore

Responsible for reading and writing memory data

3). ConnectionManager

Responsible for establishing connections to other BlockManagers. For example, the ShuffleReader phase needs to pull data from remote nodes, and ConnectionManager is responsible for those remote connections.

4). BlockTransferService

Responsible for transferring the data once ConnectionManager has successfully connected to another BlockManager.
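
To make the relationship between these pieces concrete, the following schematic Scala skeleton composes the four components on a single executor-side BlockManager. The interfaces are illustrative assumptions, not Spark's real signatures; the write and read sketches further below build on the same idea.

// Illustrative interfaces only; Spark's real classes have much richer signatures.
trait DiskStore {
  def putBytes(blockId: String, bytes: Array[Byte]): Unit        // write a block to local disk
  def getBytes(blockId: String): Option[Array[Byte]]             // read a block from local disk
}
trait MemoryStore {
  def putBytes(blockId: String, bytes: Array[Byte]): Boolean     // false if there is not enough memory
  def getBytes(blockId: String): Option[Array[Byte]]             // read a block from memory
}
trait ConnectionManager {
  def connect(host: String, port: Int): Unit                     // open a connection to a remote BlockManager
}
trait BlockTransferService {
  def fetchBlock(host: String, port: Int, blockId: String): Array[Byte] // move block bytes once connected
}

// The executor-side BlockManager simply composes the four components.
class BlockManagerSketch(
    val memoryStore: MemoryStore,
    val diskStore: DiskStore,
    val connectionManager: ConnectionManager,
    val blockTransferService: BlockTransferService)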

5. The first thing a BlockManager does after it is created is to register with BlockManagerMaster, which then adds the corresponding BlockManagerInfo entry to its blockManagerInfo map.
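
A minimal sketch of that registration step, with hypothetical message and class names (in Spark this is carried out via an RPC message from the executor's BlockManager to the driver's BlockManagerMasterEndpoint):

import scala.collection.mutable

// Illustrative only: when a new BlockManager registers, the master records it and
// creates the metadata entry that later block status updates will modify.
case class RegisterBlockManager(executorId: String, host: String, port: Int, maxMemSize: Long)

class RegistrationSketch {
  private val registered = mutable.HashMap.empty[String, RegisterBlockManager]
  def register(msg: RegisterBlockManager): Unit = {
    registered(msg.executorId) = msg
  }
}

object RegistrationDemo extends App {
  val master = new RegistrationSketch
  master.register(RegisterBlockManager("executor-1", "host-a", 7337, maxMemSize = 512L * 1024 * 1024))
}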

6. One thing to note here: when BlockManager writes data, it writes to memory first. If there is not enough memory, it spills part of the in-memory data to disk according to its own policy. In addition, if replication is specified, the data is copied to another BlockManager, so one block may be held by two BlockManagers.
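
Below is a rough, self-contained Scala sketch of that memory-first write path, using deliberately simplified stores; Spark's real put path additionally handles storage levels, serialization, eviction of existing blocks and the replication mentioned above, all of which are omitted here.

import scala.collection.mutable

// A fixed-budget in-memory store and a map standing in for files on disk (illustrative only).
class SimpleMemoryStore(maxBytes: Long) {
  private val data = mutable.HashMap.empty[String, Array[Byte]]
  private var used = 0L
  def putBytes(blockId: String, bytes: Array[Byte]): Boolean =
    if (used + bytes.length > maxBytes) false                    // not enough memory, caller falls back
    else { data(blockId) = bytes; used += bytes.length; true }
}

class SimpleDiskStore {
  private val data = mutable.HashMap.empty[String, Array[Byte]]
  def putBytes(blockId: String, bytes: Array[Byte]): Unit = { data(blockId) = bytes }
}

class WritePathSketch(memory: SimpleMemoryStore, disk: SimpleDiskStore) {
  // Memory first; if the block does not fit, write it to disk instead.
  def putBlock(blockId: String, bytes: Array[Byte]): Unit =
    if (!memory.putBytes(blockId, bytes)) disk.putBytes(blockId, bytes)
}

object WritePathDemo extends App {
  val bm = new WritePathSketch(new SimpleMemoryStore(maxBytes = 16L), new SimpleDiskStore)
  bm.putBlock("rdd_0_0", Array.fill(8)(0.toByte))   // fits within the 16-byte memory budget
  bm.putBlock("rdd_0_1", Array.fill(32)(0.toByte))  // does not fit, written to disk instead
}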

7. When BlockManager reads data, for example in the ShuffleReader phase, it reads locally if the data is available locally; otherwise it establishes a connection to the remote BlockManager node through ConnectionManager, and once the connection succeeds, BlockTransferService fetches the data from that node.
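
A corresponding sketch of the read path, again with assumed, simplified interfaces rather than Spark's API: try local memory, then local disk, and only then fetch from a remote BlockManager (the remoteFetch function below stands in for the ConnectionManager plus BlockTransferService pair).

// Illustrative read path: local memory, then local disk, then remote fetch.
class ReadPathSketch(
    localMemory: Map[String, Array[Byte]],
    localDisk: Map[String, Array[Byte]],
    remoteFetch: String => Option[Array[Byte]]) {

  def getBlock(blockId: String): Option[Array[Byte]] =
    localMemory.get(blockId)                // 1. local memory
      .orElse(localDisk.get(blockId))       // 2. local disk
      .orElse(remoteFetch(blockId))         // 3. connect to a remote BlockManager and fetch
}

object ReadPathDemo extends App {
  val reader = new ReadPathSketch(
    localMemory = Map("shuffle_0_0_0" -> Array[Byte](1, 2, 3)),
    localDisk = Map.empty,
    remoteFetch = id => { println(s"fetching block $id from a remote BlockManager"); None })

  println(reader.getBlock("shuffle_0_0_0").map(_.length)) // Some(3): served from local memory
  println(reader.getBlock("shuffle_0_1_0").map(_.length)) // triggers the remote-fetch stub, then None
}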

8. Whenever the BlockManager side adds, removes or modifies a block, it sends a BlockStatus change notification to BlockManagerMaster, and BlockManagerMaster then updates the BlockManagerInfo metadata it maintains.
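
A sketch of that notification loop with hypothetical names (in Spark 2.x this corresponds to block status update messages handled by BlockManagerMasterEndpoint): every put or remove on the executor side is followed by a report to the master, which updates its own metadata.

import scala.collection.mutable

// Illustrative message and bookkeeping; not Spark's real RPC types.
case class BlockUpdate(executorId: String, blockId: String, memSize: Long, diskSize: Long)

class MasterSketch {
  private val statuses = mutable.HashMap.empty[(String, String), BlockUpdate]
  def updateBlockInfo(u: BlockUpdate): Unit =
    if (u.memSize == 0 && u.diskSize == 0) statuses.remove((u.executorId, u.blockId)) // block removed
    else statuses((u.executorId, u.blockId)) = u                                      // block added or changed
}

class ExecutorBlockManagerSketch(executorId: String, master: MasterSketch) {
  // Called after every local put/remove so the driver's metadata stays current.
  def reportBlockStatus(blockId: String, memSize: Long, diskSize: Long): Unit =
    master.updateBlockInfo(BlockUpdate(executorId, blockId, memSize, diskSize))
}

object ReportDemo extends App {
  val master = new MasterSketch
  new ExecutorBlockManagerSketch("executor-1", master).reportBlockStatus("rdd_0_0", 1024L, 0L)
}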

That concludes this analysis of how the Spark 2.x BlockManager works. I hope the above content has been helpful and that you have learned something from it. If you think the article is good, feel free to share it so that more people can see it.
