Do you know the advantages of Spark compared with Hadoop MapReduce?

2025-01-18 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/03 Report--

When it comes to big data processing, many people immediately think of Hadoop MapReduce. Indeed, Hadoop MapReduce laid the foundation for big data processing technology. In recent years, as Spark has matured, it has been mentioned more and more often. So what advantages does Spark have over Hadoop MapReduce?

There are two views in the industry on the relationship between Spark and Hadoop MapReduce:

First, Spark will replace Hadoop MapReduce and become the future direction of big data processing.

Second, Spark will combine with Hadoop to form a larger ecosystem. In fact, Spark and Hadoop MapReduce focus on different application scenarios.

Compared with Hadoop MapReduce, Spark is something of a student who has surpassed the master. Spark grew out of the Hadoop MapReduce model, and the shadow of MapReduce is clearly visible in it. So Spark did not innovate from scratch; it stands on the shoulders of the giant that is MapReduce. History will be the judge of their respective merits. Let's set the controversy aside for now and look at the advantages Spark has over Hadoop MapReduce.

Spark and Hadoop MapReduce

1. Fast computation

The first thing big data processing pursues is speed. How fast is Spark? In the official wording, Spark lets applications in a Hadoop cluster run up to 100x faster in memory, and 10x faster even on disk. Some readers may be amazed by this. Indeed, in iterative computing Spark is much faster than MapReduce, and the more iterations there are, the more obvious Spark's advantage becomes. This is because Spark takes advantage of ever-growing server memory and improves performance by reducing disk I/O. It keeps intermediate data in memory and writes it to disk in batches only when necessary. Readers may ask: if the application is very large, how many GB can memory hold? The answer: GB? IBM servers can now be configured with several TB of memory.
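The effect of keeping intermediate data in memory can be illustrated without Spark at all. What follows is a minimal pure-Python sketch, not Spark code: `DiskSource`, `run_iterations` and the toy update rule are hypothetical names invented for this illustration. It runs the same iterative job twice, once re-reading the input from disk on every pass (roughly how a chain of MapReduce jobs behaves) and once loading it a single time and reusing it from memory.

```python
import os
import tempfile

def write_dataset(path, n):
    """Write the integers 1..n to a text file, one per line."""
    with open(path, "w") as f:
        for i in range(1, n + 1):
            f.write(f"{i}\n")

class DiskSource:
    """Hypothetical data source that counts how often the file is read."""
    def __init__(self, path):
        self.path = path
        self.reads = 0

    def load(self):
        self.reads += 1
        with open(self.path) as f:
            return [int(line) for line in f]

def run_iterations(source, iterations, cache):
    """Run a toy iterative job: each pass updates a weight from the data mean.

    cache=False re-reads the input from disk on every pass;
    cache=True loads it once and keeps it in memory.
    """
    cached = source.load() if cache else None
    weight = 1.0
    for _ in range(iterations):
        data = cached if cache else source.load()
        weight = 0.5 * weight + sum(data) / len(data)
    return weight

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "numbers.txt")
    write_dataset(path, 1000)

    uncached = DiskSource(path)
    result_uncached = run_iterations(uncached, iterations=10, cache=False)

    cached = DiskSource(path)
    result_cached = run_iterations(cached, iterations=10, cache=True)

print("disk reads without cache:", uncached.reads)
print("disk reads with cache:   ", cached.reads)
```

Both runs compute the same result; only the number of disk reads differs, and the gap widens with every additional iteration. In Spark itself, the comparable step is asking the engine to keep a dataset in memory between passes.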

2. Flexible and easy to use

Do you know why Lester of AMPLab gave up on MapReduce? Because he had to spend a great deal of effort fitting his work into the Map and Reduce programming model, which was extremely inconvenient. In addition to simple Map and Reduce operations, Spark supports SQL queries, streaming queries and complex analytics such as out-of-the-box machine learning algorithms. Users can also combine these capabilities seamlessly in the same workflow, which makes applications very flexible.

The core of Spark consists of just 63 Scala files, which makes it very lightweight. It lets Java, Scala and Python developers work in the language environment they already know, providing standard APIs for Java, Scala, Python and SQL (for interactive queries) so that it can be adopted across industries, and it ships with a large number of out-of-the-box machine learning libraries. It also comes with more than 80 high-level operators and supports interactive queries from the shell. Even a novice can pick it up easily.
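The difference between hand-writing Map and Reduce phases and composing high-level operators can be seen in a small word count. This is a plain-Python sketch rather than Spark's API; the function names `map_phase`, `shuffle_phase` and `reduce_phase` are made up for the illustration, and the sample lines are invented.

```python
from collections import Counter, defaultdict

lines = ["spark and hadoop", "spark streaming", "hadoop mapreduce"]

# Style 1: spell out the map, shuffle and reduce phases by hand.
def map_phase(line):
    """Emit a (word, 1) pair for every word in the line."""
    return [(word, 1) for word in line.split()]

def shuffle_phase(pairs):
    """Group the emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the grouped values for each key."""
    return {key: sum(values) for key, values in groups.items()}

mapped = [pair for line in lines for pair in map_phase(line)]
counts_mapreduce = reduce_phase(shuffle_phase(mapped))

# Style 2: one expression built from higher-level operations.
counts_chained = dict(Counter(word for line in lines for word in line.split()))

print(counts_mapreduce)
print(counts_chained)
```

The two styles produce the same counts. In Spark the second style corresponds to chaining operators such as `flatMap`, `map` and `reduceByKey` on a dataset, which is where much of the ease of use comes from.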

3. Compatible with its competitors

Spark can run standalone or under existing YARN cluster management, and it can read any existing Hadoop data. It works with any Hadoop data source, such as HBase and HDFS. This makes migration much more convenient for users who want to move Hadoop applications to Spark. With the breadth of mind to be compatible with its competitors, how could Spark fail to achieve great things?

4. Excellent real-time processing performance

MapReduce is better suited to processing offline data (although, since YARN, Hadoop can also use other tools for stream computing). Spark supports real-time stream computing well, relying on Spark Streaming to process data in real time. Spark Streaming has a powerful API that lets users develop streaming applications quickly. And unlike other streaming solutions such as Storm, Spark Streaming can handle much of the recovery and delivery work without additional code or configuration.
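Spark Streaming treats a live stream as a sequence of small batches, which is why batch-style logic can be reused for streaming. The sketch below imitates that idea in plain Python under simplifying assumptions: batches are cut by record count rather than by a time interval, and `micro_batches`, `count_words` and the sample stream are hypothetical names and data, not part of any Spark API.

```python
from collections import Counter
from itertools import islice

def micro_batches(events, batch_size):
    """Yield successive lists of at most batch_size events from an iterable."""
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

def count_words(records):
    """The same counting logic a batch job would apply to a whole dataset."""
    return Counter(word for record in records for word in record.split())

stream = ["error disk", "ok", "error net", "ok", "ok", "error disk", "ok"]

running_total = Counter()
for batch in micro_batches(stream, batch_size=3):
    running_total.update(count_words(batch))  # reuse the batch logic per batch
    print(dict(running_total))
```

After the last batch, the running total equals what a single batch job over the whole input would have produced, while intermediate results were available along the way.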

5. Strong community contributions

The evolution of Spark's releases is enough to show the platform's vigor and the community's activity. Since 2013 in particular, Spark has been in a period of rapid development, with significant growth in both code commits and community activity. In terms of activity, Spark ranks among the top three of all Apache Software Foundation open source projects, and its code base is the most active among big data platforms and frameworks.

Spark attaches great importance to community events, which are well organized, and Spark-related meetings are held both regularly and ad hoc. There are two kinds of meetings. One is the Spark Summit, influential enough to be called the summit of the world's top Spark engineers; it was held in San Francisco three years running, from 2013 to 2015. The other is the small Meetup events that Spark communities hold around the world from time to time. Spark Meetups are also held regularly in some big Chinese cities, such as Beijing, Shenzhen and Xi'an. Readers can follow the local official WeChat accounts to take part.

Applicable scenarios for Spark

In terms of processing requirements, big data workloads can be divided into three categories:

(1) Complex batch data processing, with a time span usually between tens of minutes and several hours.

(2) Interactive queries on historical data, with a time span usually between tens of seconds and several minutes.

(3) Processing of real-time data streams, with a time span usually between hundreds of milliseconds and several seconds.

Many relatively mature open source and commercial tools already address these three scenarios: for the first, MapReduce can be used for batch processing; for the second, Impala can be used for interactive queries; for the third, streaming data, there is the specialized stream processing tool Storm. But there is an important problem here: most Internet companies face all three scenarios at once. If a different technology is used for each, input and output data cannot be shared seamlessly among them and may need format conversion, and each piece of open source software needs its own development and maintenance team, which raises costs. Another inconvenience is that it is difficult to coordinate resource allocation among the systems in the same cluster.

So, is there one piece of software that can handle all three scenarios? Spark can, or at least has the potential to. Spark supports complex batch processing, interactive queries and stream computing, is compatible with distributed file systems such as HDFS and Amazon S3, and can be deployed on popular cluster resource managers such as YARN and Mesos.

Given Spark's design concept (a memory-based iterative computing framework), it is best suited to applications that iterate or that operate on a specific data set many times. The more iterations and the more data read, the more pronounced Spark's benefit. As a result, Spark excels at iterative applications such as machine learning, where it can be dozens of times faster than Hadoop MapReduce. In addition, because intermediate data is kept in memory, Spark Streaming processes data very quickly, so Spark can also be used where big data must be processed in real time.

Of course, there are cases where Spark is not a good fit. It is not suited to applications that make asynchronous, fine-grained state updates, such as the storage layer of a web service or an incremental web crawler and indexer; in other words, application models based on incremental modification. Spark is also not suited to processing extremely large amounts of data. "Extremely large" here is relative to the memory capacity of the cluster, because Spark keeps data in memory. Generally speaking, more than 10 TB of data in a single analysis can be regarded as extremely large.
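Whether a data set counts as "too large" in this sense comes down to simple arithmetic against the cluster's memory. The following is a rough sketch only: the function `fits_in_memory`, the cluster size and the `usable_fraction` of RAM available for data are all assumptions chosen for illustration, not figures from Spark or from the article.

```python
def fits_in_memory(dataset_tb, nodes, memory_per_node_gb, usable_fraction=0.5):
    """Return True if the dataset fits in the memory set aside for data.

    usable_fraction is an assumed share of each node's RAM available for
    in-memory data; the rest is left for the OS, execution and overhead.
    """
    dataset_gb = dataset_tb * 1024
    usable_gb = nodes * memory_per_node_gb * usable_fraction
    return dataset_gb <= usable_gb

# Hypothetical cluster: 20 nodes with 256 GB of RAM each.
print(fits_in_memory(dataset_tb=1, nodes=20, memory_per_node_gb=256))
print(fits_in_memory(dataset_tb=10, nodes=20, memory_per_node_gb=256))
```

With these assumed numbers the cluster has 2,560 GB available for data, so a 1 TB analysis fits while a 10 TB analysis does not.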

Broadly, for the data centers of small and medium-sized enterprises, Spark is a good choice when the amount of data processed in a single job is not large. In addition, Spark is not suited to hybrid cloud platforms, because network transfer is a major bottleneck there. Even with a dedicated high-bandwidth link between the cloud cluster and the local cluster, transfer is still far slower than reading from memory.
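The gap between a network link and local memory can be estimated with back-of-the-envelope arithmetic. The figures below are assumptions picked purely for illustration (a 1 TB working set, a 10 Gbit/s dedicated line and roughly 20 GB/s of memory read bandwidth); real values vary widely by hardware and provider.

```python
def transfer_seconds(data_gb, bandwidth_gb_per_s):
    """Time to move data_gb gigabytes at a sustained rate in GB/s."""
    return data_gb / bandwidth_gb_per_s

DATA_GB = 1024            # assumed 1 TB working set
LINK_GB_PER_S = 10 / 8    # assumed 10 Gbit/s dedicated line, i.e. 1.25 GB/s
MEMORY_GB_PER_S = 20      # assumed order of magnitude for local memory reads

over_link = transfer_seconds(DATA_GB, LINK_GB_PER_S)
from_memory = transfer_seconds(DATA_GB, MEMORY_GB_PER_S)

print(f"over the link: {over_link:.0f} s")
print(f"from memory:   {from_memory:.0f} s")
print(f"slowdown:      {over_link / from_memory:.0f}x")
```

Under these assumptions, every pass that has to cross the link costs many times what an in-memory pass costs, which is the core of the argument against splitting a Spark job across a hybrid cloud.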

Conclusion

Thank you for reading. If there are any shortcomings, criticism and corrections are welcome.

Finally, I hope every big data programmer who has hit a bottleneck manages to break through, and I wish you all the best in your future work and interviews.
