2025-02-22 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/03 Report--
I. A complete explanation of Broadcast
1. Broadcast sends data from one node to the others; in Spark, from the Driver to the Executors.
2. Broadcast is distributed shared data. By default a Broadcast variable exists for as long as the program runs, because Broadcast is managed by BlockManager underneath, but a Broadcast variable can also be destroyed manually.
3. Broadcast is generally used for shared configuration files, common Datasets, frequently used data structures, and so on. It is not suitable for very large data. Broadcast itself will not run out of memory, because its data is stored with StorageLevel MEMORY_AND_DISK and can spill to disk. Even so, do not broadcast very large data, because the network I/O and the possible single-point pressure can be severe.
4. A Broadcast variable is read-only, which keeps the data consistent across nodes.
5. Using Broadcast:
    scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))
    broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)

    scala> broadcastVar.value
    res0: Array[Int] = Array(1, 2, 3)
6. Broadcast in HttpBroadcast mode (available in Spark 1.x; removed in Spark 2.0): at first the data is stored in the file system of the Driver. The Driver creates a local folder to hold the Broadcast data and starts an HttpServer to serve the contents of that folder. At the same time, the data is written to the BlockManager, which yields a BlockId (BroadcastBlockId). When a Task in the first Executor wants to access the Broadcast variable, it fetches the data from the Driver through the HttpServer and then registers it in the Executor's BlockManager. When a later Task needs the Broadcast variable, it first checks whether the data already exists in the BlockManager of the current Executor and, if so, reads it directly.
7. BroadcastManager manages Broadcast. It is created when SparkContext creates SparkEnv. When BroadcastManager is instantiated, it creates a BroadcastFactory to build the concrete Broadcast type, which defaults to TorrentBroadcastFactory.
8. HttpBroadcast suffers from a single point of failure and poor network I/O performance, so TorrentBroadcast is used by default. The data is initially stored on the Driver. When node A needs the data, it fetches it from the Driver and stores a copy locally. Node A then becomes a data source as well, which reduces the pressure on the Driver.
9. TorrentBroadcast splits the Broadcast data into blocks according to BLOCK_SIZE (4 MB by default), stores the block information (that is, the meta information) in the BlockManager on the Driver, and informs the BlockManagerMaster that the meta information has been stored.
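The block splitting described in item 9 can be illustrated with a small sketch. This is not Spark's code: `BlockifySketch` and `blockify` are hypothetical names, and the sketch only assumes that the serialized value is a byte array cut into pieces of at most the block size.

```scala
object BlockifySketch {
  // Hypothetical helper: cut a serialized payload into pieces of at most blockSize bytes.
  def blockify(bytes: Array[Byte], blockSize: Int): Vector[Array[Byte]] = {
    require(blockSize > 0, "blockSize must be positive")
    bytes.grouped(blockSize).toVector
  }

  def main(args: Array[String]): Unit = {
    val blockSize = 4 * 1024 * 1024                    // 4 MB, the default mentioned above
    val payload   = new Array[Byte](10 * 1024 * 1024)  // 10 MB of dummy data
    val blocks    = blockify(payload, blockSize)

    // 10 MB yields two full 4 MB blocks plus one 2 MB remainder.
    println(blocks.map(_.length).mkString(", "))

    // Concatenating the pieces in order gives back the original payload.
    assert(Array.concat(blocks: _*).sameElements(payload))
  }
}
```

Smaller blocks let more nodes serve different pieces at the same time, at the cost of more block metadata to track.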
II. A walkthrough of the Broadcast source code
When data is broadcast, the broadcast method of SparkContext is called. Inside that method, the Broadcast is created by BroadcastManager, and BroadcastManager is in turn managed by SparkEnv.
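From the caller's side, that entry point might be used as in the following minimal sketch. It assumes spark-core is on the classpath and runs with a local master; `BroadcastUsageSketch`, `resolveCodes`, and the sample lookup table are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastUsageSketch {
  // Hypothetical helper: translate country codes using a broadcast lookup table.
  def resolveCodes(sc: SparkContext, codes: Seq[String]): Array[String] = {
    val countryNames = Map("cn" -> "China", "us" -> "United States")
    val lookup = sc.broadcast(countryNames)   // created through BroadcastManager

    val resolved = sc.parallelize(codes)
      .map(code => lookup.value.getOrElse(code, "unknown"))  // read-only access inside tasks
      .collect()

    lookup.unpersist()  // drop the cached copies on the Executors
    lookup.destroy()    // release all state (public API in recent Spark versions); unusable afterwards
    resolved
  }

  def main(args: Array[String]): Unit = {
    // Local master is an assumption so the sketch runs without a cluster.
    val conf = new SparkConf().setAppName("broadcast-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try println(resolveCodes(sc, Seq("cn", "us", "fr")).mkString(", "))
    finally sc.stop()
  }
}
```

Calling destroy only after the last job that reads the variable has finished avoids failures from tasks touching a destroyed Broadcast.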
SparkEnv is created by createSparkEnv in SparkContext, which calls the createDriverEnv method of SparkEnv, which in turn calls SparkEnv's own create method to build the required components. The BroadcastManager that manages Broadcast is created in this method.
When a BroadcastManager instance is created, it calls its initialize method to create a BroadcastFactory. The default is TorrentBroadcastFactory.
After the BroadcastManager is initialized, its newBroadcast method can be called to create the corresponding Broadcast (TorrentBroadcast) through the BroadcastFactory and broadcast the data.
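The relationship between the manager, the factory, and the concrete Broadcast can be modelled roughly as follows. This is a simplified sketch, not the Spark source: every `Sketch*` type is a hypothetical stand-in, and the incrementing id is an assumption chosen to match the `Broadcast(0)` output shown earlier.

```scala
import java.util.concurrent.atomic.AtomicLong

// Simplified stand-ins; the names mirror the text but the bodies are hypothetical.
trait SketchBroadcast[T] { def id: Long; def value: T }

trait SketchBroadcastFactory {
  def initialize(): Unit
  def newBroadcast[T](value: T, id: Long): SketchBroadcast[T]
}

final class SketchTorrentBroadcast[T](val id: Long, val value: T) extends SketchBroadcast[T]

final class SketchTorrentBroadcastFactory extends SketchBroadcastFactory {
  def initialize(): Unit = ()  // nothing to set up in this sketch
  def newBroadcast[T](value: T, id: Long): SketchBroadcast[T] =
    new SketchTorrentBroadcast[T](id, value)
}

final class SketchBroadcastManager {
  private var factory: SketchBroadcastFactory = null
  private val nextId = new AtomicLong(0)

  initialize()  // runs on construction, after the fields above are set

  // Create the factory once; the torrent-style factory is the default here.
  private def initialize(): Unit = synchronized {
    if (factory == null) {
      factory = new SketchTorrentBroadcastFactory
      factory.initialize()
    }
  }

  // Delegate to the factory, handing out a fresh id per broadcast.
  def newBroadcast[T](value: T): SketchBroadcast[T] =
    factory.newBroadcast(value, nextId.getAndIncrement())
}

object BroadcastManagerSketch {
  def main(args: Array[String]): Unit = {
    val manager = new SketchBroadcastManager
    val first   = manager.newBroadcast(Array(1, 2, 3))
    val second  = manager.newBroadcast("config")
    println(s"ids: ${first.id}, ${second.id}")  // ids: 0, 1
  }
}
```

Hiding the concrete type behind a factory is what allows the implementation to be swapped without changing the code that calls newBroadcast.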
The newBroadcast method of TorrentBroadcastFactory creates an instance of TorrentBroadcast. When the data is broadcast, the writeBlocks method is called to split the broadcast data into multiple blocks (4 MB each by default) and store these blocks on the Driver.
When the value of the broadcast variable is read, the getValue method of the corresponding Broadcast is called. In TorrentBroadcast, the readBroadcastBlock method first looks up the data in the local BlockManager by BroadcastBlockId. If the data is not found there, the readBlocks method is called.
The readBlocks method in TorrentBroadcast fetches the corresponding blocks from the Driver or from other Executors, and then saves the fetched blocks to the BlockManager of the Executor.
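The read path (local lookup first, then a remote fetch that is cached locally) can be sketched with in-memory maps standing in for each node's BlockManager. `ReadPathSketch`, `NodeStore`, and `readBlock` are hypothetical names, and the fixed order in which remote nodes are tried is an assumption of this sketch.

```scala
import scala.collection.mutable

object ReadPathSketch {
  type BlockId = String

  // Hypothetical stand-in for one node's BlockManager: an in-memory map.
  final class NodeStore(val name: String) {
    private val blocks = mutable.Map.empty[BlockId, Array[Byte]]
    def getLocal(id: BlockId): Option[Array[Byte]] = blocks.get(id)
    def put(id: BlockId, data: Array[Byte]): Unit = blocks.update(id, data)
  }

  // Try the local store first; otherwise fetch from the first remote node that
  // holds the block and keep a local copy. Returns the data and who served it.
  def readBlock(id: BlockId, local: NodeStore, remotes: Seq[NodeStore]): Option[(Array[Byte], String)] =
    local.getLocal(id) match {
      case Some(data) => Some((data, local.name))
      case None =>
        remotes.find(_.getLocal(id).isDefined).map { node =>
          val data = node.getLocal(id).get
          local.put(id, data)  // this node can now serve the block to others
          (data, node.name)
        }
    }

  def main(args: Array[String]): Unit = {
    val driver = new NodeStore("driver")
    val execA  = new NodeStore("executor-A")
    val execB  = new NodeStore("executor-B")
    val id     = "broadcast_0_piece0"
    driver.put(id, Array[Byte](1, 2, 3))

    // A finds nothing on B, so the Driver serves it.
    println(readBlock(id, execA, Seq(execB, driver)).map(_._2))
    // B can now get the block from A without touching the Driver.
    println(readBlock(id, execB, Seq(execA, driver)).map(_._2))
    // A's second read is served from its own store.
    println(readBlock(id, execA, Seq(execB, driver)).map(_._2))
  }
}
```

Because every node that has fetched a block keeps a copy, later readers have more sources to choose from and the Driver handles a smaller share of the traffic.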