What's the difference between Hadoop and Spark?

2025-01-28 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/01 Report--

The editor shares here what the difference between Hadoop and Spark is. Most people do not know much about the topic, so this article is offered for your reference; I hope you learn a lot from reading it. Let's get into it!

They solve problems at different levels.

First of all, Hadoop and Apache Spark are both big data frameworks, but they exist for different purposes. Hadoop is essentially a distributed data infrastructure: it spreads large data sets across multiple nodes in a cluster of ordinary computers for storage, which means you do not need to buy and maintain expensive server hardware.

At the same time, Hadoop also indexes and tracks that data, making big data processing and analysis much more efficient. Spark, by contrast, is a tool specifically for processing big data that is already in distributed storage; it does not do distributed storage itself.

The two can be used together or separately.

Hadoop provides not only the widely used HDFS distributed storage layer, but also a data processing engine called MapReduce. So we can set Spark aside entirely and use Hadoop's own MapReduce to handle the data processing.

Conversely, Spark does not have to attach itself to Hadoop to survive. But as mentioned above, it does not provide a file management system of its own, so it must be integrated with a distributed file system to work. That can be Hadoop's HDFS or another cloud-based data platform. By default, though, Spark is still run on top of Hadoop, since most people consider the combination the best fit.

The following is one of the most concise and clear explanations of MapReduce, excerpted from the web:

Suppose we need to count all the books in the library. You count shelf 1, I count shelf 2. That is "Map". The more of us there are, the faster the counting goes.

Now we get together and add up everyone's tallies. That is "Reduce".
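The library analogy maps directly onto code. Below is a toy, single-machine sketch of the two phases using a word count (the classic MapReduce example) instead of counting books; the function names are illustrative, not part of Hadoop's API.

```python
# Toy sketch of the Map and Reduce phases from the analogy above.
from collections import defaultdict

def map_phase(documents):
    # "Map": each worker scans its own shelf and emits (word, 1) pairs.
    pairs = []
    for doc in documents:
        for word in doc.split():
            pairs.append((word, 1))
    return pairs

def reduce_phase(pairs):
    # "Reduce": everyone gets together and adds up the per-word tallies.
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

docs = ["big data", "big cluster"]
print(reduce_phase(map_phase(docs)))  # {'big': 2, 'data': 1, 'cluster': 1}
```

In a real cluster, the map work runs in parallel on the nodes holding each data block, and the framework shuffles the pairs to reducers by key.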

In data processing speed, Spark leaves MapReduce in the dust.

Spark is much faster than MapReduce because of the way it handles data. MapReduce works step by step: "read the data from the cluster, run one processing pass, write the results to the cluster; read the updated data from the cluster, run the next pass, write the results back to the cluster; and so on." That is how Booz Allen Hamilton data scientist Kirk Borne describes it.

Spark, by contrast, does all of its data analysis in memory in near real time: "read the data from the cluster, do all the necessary analytical processing, write the results back to the cluster, done," Borne said. Spark's batch processing is nearly 10 times faster than MapReduce's, and its in-memory data analysis is nearly 100 times faster.
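The two execution styles Borne describes can be caricatured in a few lines of plain Python. This is only an illustration of the pattern, not real Hadoop or Spark code: the `disk` dict stands in for the cluster's file system, and the step functions are hypothetical placeholders.

```python
# Toy contrast: write-back-after-every-step vs. keep-it-in-memory.
disk = {}  # stand-in for the cluster's distributed file system

def mapreduce_style(data, steps):
    # MapReduce pattern: read from the cluster, process once,
    # write results back, then read again for the next step.
    disk["input"] = data
    current = "input"
    for i, step in enumerate(steps):
        loaded = disk[current]             # read from "cluster"
        result = [step(x) for x in loaded]
        current = f"step{i}"
        disk[current] = result             # write back to "cluster"
    return disk[current]

def spark_style(data, steps):
    # Spark pattern: keep the working set in memory and apply
    # every step before writing anything back.
    current = data
    for step in steps:
        current = [step(x) for x in current]
    return current

steps = [lambda x: x * 2, lambda x: x + 1]
assert mapreduce_style([1, 2], steps) == spark_style([1, 2], steps) == [3, 5]
```

Both styles produce the same answer; the difference is how many round trips to storage happen along the way, which is where the speed gap comes from.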

If the data to be processed and the required results are mostly static, and you have the patience to wait for batch jobs to finish, MapReduce's approach is perfectly acceptable.

But if you need to analyze streaming data, such as sensor readings collected from a factory floor, or your application requires multiple passes over the data, then Spark is likely the better choice.

Most machine learning algorithms require multiple passes over the data. Typical Spark use cases also include real-time marketing campaigns, online product recommendation, network security analytics, machine log monitoring, and so on.
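To see why "multiple passes over the data" matters, here is a minimal sketch of an iterative algorithm: fitting y ≈ w·x by gradient descent, re-scanning the same dataset once per iteration. The dataset and learning rate are made up for illustration; the point is that each iteration reads the whole dataset again, so keeping it in memory (as Spark does) pays off.

```python
# Iterative fit of y ≈ w * x by gradient descent on a cached dataset.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # (x, y) pairs, true w = 2

w = 0.0
lr = 0.05
for _ in range(100):  # each iteration is one full pass over the data
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= lr * grad

print(round(w, 3))  # converges to 2.0
```

Under MapReduce, every one of those 100 passes would re-read the input from disk; under Spark, the dataset stays cached in memory across iterations.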

Disaster recovery

The two take very different approaches to disaster recovery, but both are very good. Because Hadoop writes data to disk after each processing step, it is inherently resilient to system failures.

Spark stores its data objects in resilient distributed datasets (RDD: Resilient Distributed Dataset) spread across the cluster. "These data objects can be placed either in memory or on disk, so RDDs can provide complete disaster recovery as well," Borne pointed out.
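A key idea behind RDD fault tolerance is lineage: rather than replicating every intermediate result, Spark remembers the source data plus the chain of transformations applied to it, so a lost partition can simply be recomputed. The toy class below sketches that idea in plain Python; the names are illustrative and not Spark's actual API.

```python
# Toy sketch of lineage-based recovery, the idea behind RDD fault tolerance.
class ToyRDD:
    def __init__(self, source, transforms=()):
        self.source = source          # durable input, split into partitions
        self.transforms = transforms  # lineage: the functions applied so far

    def map(self, fn):
        # Record the transformation instead of materializing results.
        return ToyRDD(self.source, self.transforms + (fn,))

    def compute(self, partition_index):
        # Rebuild one partition from scratch using its lineage,
        # e.g. after the node holding it in memory has failed.
        data = self.source[partition_index]
        for fn in self.transforms:
            data = [fn(x) for x in data]
        return data

rdd = ToyRDD(source=[[1, 2], [3, 4]]).map(lambda x: x * 10)
assert rdd.compute(1) == [30, 40]  # partition 1 rebuilt from lineage
```

This is why RDDs can live purely in memory without losing durability: the recipe for the data survives even when the data itself does not.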

That is all of the article "What's the difference between Hadoop and Spark". Thank you for reading! I hope the content shared here has helped you; if you want to learn more, welcome to follow the industry information channel!
