2025-01-17 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/01 Report--
This article explains the advantages of Spark over Hadoop MapReduce. The explanation is simple and clear and easy to follow; please read along with the editor to learn where Spark's advantages lie.
1. Fast computation
The first thing big data work pursues is speed. How fast is Spark? Officially, Spark can run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. Some readers may sigh at this, but indeed, in iterative computing Spark is much faster than MapReduce, and the more iterations there are, the more pronounced Spark's advantage becomes. This is because Spark takes advantage of the ever-growing memory of modern servers and gains performance by reducing disk I/O: it keeps intermediate data in memory and spills it to disk only when necessary. Readers may ask: if the application is very large, how can it fit in a few GB of memory? Answer: what, GB? Server memory (on IBM machines, for instance) has already been expanded to several TB.
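The in-memory iterative pattern described above can be sketched with the Spark RDD API. This is a minimal, illustrative example: the input path, the iteration count, and the update rule are all hypothetical, chosen only to show how `cache()` lets repeated passes reuse in-memory data instead of re-reading from disk.

```scala
import org.apache.spark.sql.SparkSession

object IterativeDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("IterativeDemo")
      .master("local[*]")            // illustrative: run locally
      .getOrCreate()
    val sc = spark.sparkContext

    // Load once and cache; later iterations read the in-memory
    // partitions instead of going back to disk each pass.
    val points = sc.textFile("data/points.txt")   // hypothetical path
      .map(_.toDouble)
      .cache()

    var estimate = 0.0
    for (_ <- 1 to 10) {
      // Each pass scans the cached RDD; a MapReduce job chain
      // would re-read HDFS and re-write intermediate files here.
      estimate = points.map(p => p - estimate).mean()
    }
    println(s"final estimate: $estimate")

    spark.stop()
  }
}
```

The key call is `cache()`: without it, each of the ten passes would re-read the file, which is exactly the disk I/O MapReduce pays on every iteration.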
2. Flexible and easy to use
Do you know why AMPLab's Lester gave up MapReduce? Because he had to devote too much energy to squeezing problems into the Map and Reduce programming model, which is extremely inconvenient. Beyond simple Map and Reduce operations, Spark also supports SQL queries, streaming computation, and complex analytics such as out-of-the-box machine learning algorithms. Users can combine these capabilities seamlessly in a single workflow, which makes applications very flexible.
The core of Spark is just 63 Scala files, which is very lightweight. It lets Java, Scala, and Python developers work in the language they already know, providing standard APIs for Java, Scala, Python, and SQL (for interactive queries), along with a large set of out-of-the-box machine learning libraries. It ships with more than 80 high-level operators and allows interactive queries in the shell, so even a novice can use it easily.
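To illustrate those high-level operators, here is a word count as it might be typed into `spark-shell`, where `sc` (the SparkContext) is predefined. The HDFS path is hypothetical; the point is that a complete distributed job needs only a handful of operator calls.

```scala
// In spark-shell, `sc` is already available; the input path is illustrative.
val counts = sc.textFile("hdfs:///logs/app.log")
  .flatMap(line => line.split("\\s+"))   // split lines into words
  .map(word => (word, 1))                // pair each word with a count of 1
  .reduceByKey(_ + _)                    // sum counts per word

counts.take(10).foreach(println)         // inspect results interactively
```

The same job in raw MapReduce would require a mapper class, a reducer class, and a driver, which is precisely the inconvenience the text describes.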
3. Compatibility with competitors
Spark can run standalone or under YARN cluster management, and it can read any existing Hadoop data: it runs against any Hadoop data source, such as HBase and HDFS. This makes life much more convenient for users who want to migrate Hadoop applications to Spark. With such breadth of mind toward its competitors, why should Spark worry about achieving great things?
4. Excellent real-time processing performance
MapReduce is better suited to offline data (although, since YARN, Hadoop can also use other tools for stream computation). Spark supports real-time stream computation, relying on Spark Streaming for real-time data processing. Spark Streaming has a powerful API that lets users develop streaming applications quickly, and unlike other streaming solutions such as Storm, Spark Streaming provides recovery and delivery guarantees without extra code or configuration.
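A minimal Spark Streaming sketch, under illustrative assumptions: words are counted from a local socket source on port 9999, in one-second micro-batches, and the checkpoint directory enables the recovery behavior mentioned above.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Host, port, and checkpoint path are illustrative.
val conf = new SparkConf().setAppName("StreamDemo").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(1))  // 1-second micro-batches
ssc.checkpoint("checkpoint/")                     // enable state recovery

val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()                                    // emit each batch's counts

ssc.start()
ssc.awaitTermination()
```

Note that the streaming code reuses the same operators (`flatMap`, `map`, `reduceByKey`) as the batch API, which is part of what makes the workflow seamless.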
5. A strong and active community
The evolution of Spark's releases alone shows the vitality of the platform and the activity of its community. Since 2013 in particular, Spark has entered a period of rapid development, with code-base commits and community activity increasing significantly. By activity, Spark ranks in the top three of all Apache Foundation open source projects, and its code base is the most active among big data platforms and frameworks.
Spark places great importance on community activities and organizes them in a very standardized way, holding Spark-related meetings both regularly and irregularly. These meetings come in two kinds: one is the Spark Summit, influential enough to be called the global summit of Spark technologists, which was held in San Francisco for three consecutive years from 2013 to 2015; the other is the smaller Meetup events held from time to time by the Spark community around the world. Spark Meetups are also held regularly in major Chinese cities such as Beijing, Shenzhen, and Xi'an; readers can follow the local official WeChat account to participate.
Applicable scenarios for Spark
From the perspective of big data processing requirements, big data workloads can be divided into the following three categories:
(1) Complex batch data processing, with a time span typically from tens of minutes to hours.
(2) Interactive queries over historical data, with a time span typically from tens of seconds to minutes.
(3) Processing of real-time data streams, with a time span typically from hundreds of milliseconds to seconds.
There are already fairly mature open source and commercial tools for each of the three scenarios above: for the first, MapReduce can handle batch data processing; for the second, Impala can serve interactive queries; for the third, the dedicated stream processing tool Storm comes to mind. But here is a very important problem: most Internet companies encounter all three scenarios at the same time. If a different technology is used for each, the input/output data of the three scenarios cannot be shared seamlessly and may need format conversion between systems, and each open source product requires its own development and maintenance team, increasing cost. A further inconvenience is that it is difficult to coordinate resource allocation among the systems within the same cluster.
So, is there a single system that can handle all three scenarios at once? Spark can, or at least has that potential. Spark supports complex batch processing, interactive queries, and stream computation; it is compatible with distributed file systems such as HDFS and Amazon S3; and it can be deployed on popular cluster resource managers such as YARN and Mesos.
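The unified-engine point can be sketched concretely: one SparkSession serves both a batch load and interactive-style SQL over the same dataset, with no format conversion between systems. The S3 path and the `user` column are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// One session for batch and SQL; path and schema are illustrative.
val spark = SparkSession.builder()
  .appName("UnifiedDemo")
  .master("local[*]")
  .getOrCreate()

val events = spark.read.json("s3a://bucket/events/")  // batch load
events.createOrReplaceTempView("events")              // expose to SQL

// Interactive-style query over the same in-memory dataset.
spark.sql("SELECT user, COUNT(*) AS n FROM events GROUP BY user")
  .show(10)

spark.stop()
```

With MapReduce plus Impala plus Storm, each of these steps would live in a different system with its own data formats and operations team.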
Starting from Spark's design concept (a memory-based iterative computing framework), it is best suited to applications that iterate or that must operate on a specific data set many times; the more iterations and the more data read per pass, the more pronounced Spark's benefit. Spark is therefore good at "iterative" applications such as machine learning, running dozens of times faster than Hadoop MapReduce. In addition, because intermediate data is kept in memory, Spark Streaming processes data very quickly, so Spark can also be used where real-time processing of big data is required.
Of course, there are cases where Spark does not apply. Applications that make asynchronous, fine-grained state updates, such as the storage layer of a web service or an incremental web crawler and index, do not fit Spark, whose model is not designed for incrementally modified state. Spark is also ill-suited to extremely large data volumes, where "extremely large" is relative to the cluster's memory capacity, since Spark keeps data in memory; generally speaking, more than about 10 TB in a single analysis can be regarded as "extremely large".
Generally speaking, for the data center of a small or medium-sized enterprise, Spark is a good choice when the amount of data processed in a single job is modest. Spark is also not well suited to hybrid cloud platforms, because network transfer is a major problem there: even with dedicated bandwidth for moving data between the cloud cluster and the local cluster, the transfer speed still falls far short of memory read speed.
Thank you for reading. That concludes "what are the advantages of Spark and Hadoop MapReduce". After studying this article, you should have a deeper understanding of Spark's advantages over Hadoop MapReduce; specific usage still needs to be verified in practice. The editor will push more articles on related topics to you; you are welcome to follow!