What's the difference between Spark shuffle and Hadoop shuffle?


Updated: 2025-07-19. From: SLTechnology News&Howtos

Shulou (Shulou.com), 05/31 report

This article answers three common questions about Spark: how AppClient, Worker, and Master relate to each other in Standalone mode, how Spark's shuffle differs from Hadoop's shuffle, and how high availability (HA) is handled in Spark.

Q1: What is the relationship between AppClient, Worker, and Master?

In Standalone mode, AppClient represents the application on the client machine when SparkContext.runJob is called. It is responsible for registering the application with the Master (registerApplication) and for related functions.

Once the application is registered, the Master sends a message to the client through Akka to start the Driver.

The Driver manages the Tasks and coordinates the Executors on the Workers so that they work together.

Q2: Is there a big difference between Spark's shuffle and Hadoop's shuffle?

Spark's shuffle is a shuffle in the strict sense. In Spark, a shuffle is a kind of dependency between RDDs: the elements of each partition of the parent RDD in the lineage are handed over to multiple partitions of the child RDD.
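The idea of a shuffle dependency can be illustrated with a minimal, pure-Python sketch. It does not use Spark itself: the `shuffle` function, the `key % n` partitioning rule, and the sample data are all assumptions made for illustration. It only shows that each parent partition ends up feeding several child partitions.

```python
from collections import defaultdict

def shuffle(parent_partitions, num_child_partitions):
    """Redistribute (key, value) records from parent partitions into child partitions by key."""
    child_partitions = [[] for _ in range(num_child_partitions)]
    # Track which child partitions each parent partition contributes to.
    fan_out = defaultdict(set)
    for parent_idx, partition in enumerate(parent_partitions):
        for key, value in partition:
            # Hypothetical stand-in for a hash partitioner (integer keys only).
            child_idx = key % num_child_partitions
            child_partitions[child_idx].append((key, value))
            fan_out[parent_idx].add(child_idx)
    return child_partitions, dict(fan_out)

# Two parent partitions, redistributed into two child partitions.
parent = [
    [(0, "a"), (1, "b"), (2, "c")],
    [(3, "d"), (4, "e"), (5, "f")],
]
child, fan_out = shuffle(parent, 2)
print(child)
print(fan_out)
```

In this toy run, every parent partition contributes records to both child partitions, which is what distinguishes a shuffle dependency from a one-to-one dependency where each parent partition feeds exactly one child partition.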

Shuffle in Hadoop is a looser concept. After the Mapper phase finishes, handing the data to the Reducers triggers the shuffle, which is the first of the Reducer's three phases (shuffle, sort, and reduce).
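The three Reducer phases can be sketched in plain Python with a word count. This is a simplified model, not Hadoop code: `map_phase`, `run_reducer`, `partition_for`, and the CRC32-based partitioning are hypothetical names and choices for illustration. In real Hadoop, the shuffle and sort phases overlap, because map outputs are merged while they are being fetched.

```python
import zlib
from itertools import groupby
from operator import itemgetter

NUM_REDUCERS = 2  # assumed number of reducers for this sketch

def partition_for(key):
    # Deterministic stand-in for a hash partitioner.
    return zlib.crc32(key.encode("utf-8")) % NUM_REDUCERS

def map_phase(line):
    # Mapper: emit (word, 1) for every word in the line.
    return [(word, 1) for word in line.split()]

def run_reducer(reducer_id, map_outputs):
    # Phase 1, shuffle: fetch this reducer's share of every mapper's output.
    fetched = [kv for output in map_outputs for kv in output
               if partition_for(kv[0]) == reducer_id]
    # Phase 2, sort: order by key so that equal keys are adjacent.
    fetched.sort(key=itemgetter(0))
    # Phase 3, reduce: aggregate the values of each key.
    return {key: sum(value for _, value in group)
            for key, group in groupby(fetched, key=itemgetter(0))}

lines = ["spark shuffle", "hadoop shuffle", "spark hadoop spark"]
map_outputs = [map_phase(line) for line in lines]

word_counts = {}
for reducer_id in range(NUM_REDUCERS):
    word_counts.update(run_reducer(reducer_id, map_outputs))
print(word_counts)
```

Here the shuffle is just one step inside the Reducer, whereas in Spark it is expressed as a dependency between RDDs.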

Q3: What about Spark's HA?

In Standalone mode, Worker nodes are highly available automatically, while HA for the Master is generally provided by ZooKeeper.

Utilizing ZooKeeper to provide leader election and some state storage, you can launch multiple Masters in your cluster connected to the same ZooKeeper instance. One will be elected "leader" and the others will remain in standby mode. If the current leader dies, another Master will be elected, recover the old Master's state, and then resume scheduling. The entire recovery process (from the time the first leader goes down) should take between 1 and 2 minutes. Note that this delay only affects scheduling new applications; applications that were already running during Master failover are unaffected.
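The failover sequence described above can be modelled with a small simulation. This is a toy sketch only: the `Master` and `Coordinator` classes are hypothetical stand-ins, the "first live Master wins" rule is an assumption, and nothing here uses the real ZooKeeper or Spark APIs.

```python
import copy

class Master:
    def __init__(self, name):
        self.name = name
        self.alive = True
        self.state = None  # state recovered when this Master becomes leader

class Coordinator:
    """Hypothetical stand-in for ZooKeeper: stores state and picks a leader."""

    def __init__(self, masters):
        self.masters = masters
        self.persisted_state = {"apps": []}
        self.leader = None
        self.elect()

    def elect(self):
        # Simplified rule: the first live Master becomes leader.
        # Real ZooKeeper leader election works differently.
        for master in self.masters:
            if master.alive:
                # The new leader recovers the persisted state before scheduling.
                master.state = copy.deepcopy(self.persisted_state)
                self.leader = master
                return master
        self.leader = None
        return None

    def register_app(self, app_id):
        # Persist application state so that a future leader can recover it.
        self.persisted_state["apps"].append(app_id)

    def fail_leader(self):
        # Simulate the current leader dying, then elect a replacement.
        self.leader.alive = False
        return self.elect()

coordinator = Coordinator([Master("master1"), Master("master2")])
first_leader = coordinator.leader.name
coordinator.register_app("app-1")
new_leader = coordinator.fail_leader()
print(first_leader, "->", new_leader.name, new_leader.state)
```

The standby Master takes over with the application list that was persisted before the failure, which mirrors the "recover the old Master's state, and then resume scheduling" step.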

For YARN and Mesos modes, the cluster's resource manager generally uses ZooKeeper for HA.

