Comparison between spark and hive storm mapreduce

2025-07-12 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/03 Report--

Both Spark Streaming and Storm can be used for real-time stream computing, but the two differ greatly. One difference is that the computing model of Spark Streaming is completely different from that of Storm. Spark Streaming is based on RDDs, so it has to collect the data that arrives within a short interval, such as 1 second, into an RDD and then process that batch. Storm, by contrast, can process and compute on every piece of data as soon as it arrives. Therefore, strictly speaking, Spark Streaming can only be called a quasi-real-time stream computing framework, while Storm is a true real-time computing framework.
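The difference between the two models can be sketched in plain Python. This is only an illustration under stated assumptions, not the API of Spark Streaming or Storm: the event list, the function names `per_record` and `micro_batch`, and the 1-second interval are all made up for the example.

```python
from collections import defaultdict

# Hypothetical event stream: (arrival_time_in_seconds, payload).
events = [(0.1, "a"), (0.4, "b"), (1.2, "c"), (1.9, "d"), (2.5, "e")]


def per_record(stream):
    """Storm-style model: handle every record the moment it arrives."""
    results = []
    for arrival, payload in stream:
        # The record is processed at its own arrival time, with no waiting.
        results.append((arrival, payload.upper()))
    return results


def micro_batch(stream, interval=1.0):
    """Spark Streaming-style model: collect records per interval, then process each batch."""
    batches = defaultdict(list)
    for arrival, payload in stream:
        batches[int(arrival // interval)].append(payload)
    results = []
    for index in sorted(batches):
        # A batch can only be processed once its interval has closed.
        batch_close = (index + 1) * interval
        results.append((batch_close, [p.upper() for p in batches[index]]))
    return results


record_results = per_record(events)
batch_results = micro_batch(events)
print(record_results)
print(batch_results)
```

In the per-record version each result carries the arrival time of its own record, while in the micro-batch version a record that arrives at 0.1 s is not processed until its batch closes at 1.0 s. That waiting time is the reason the batch model is described as quasi-real-time.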

In addition, Storm supports an advanced feature that Spark Streaming does not have for the time being: Storm can dynamically adjust the parallelism of a distributed stream computing program (a Topology) while it is running, so its concurrent processing capacity can be raised on the fly. Spark Streaming cannot dynamically adjust the degree of parallelism.

But Spark Streaming also has its advantages. First of all, because Spark Streaming processes data in batches, its throughput is several times or even ten times that of Storm, which processes one piece of data at a time.

In addition, Spark Streaming is part of the Spark ecosystem, so it integrates seamlessly with Spark Core and Spark SQL, and even with Spark MLlib and Spark GraphX. Once the data has been streamed in, you can immediately apply all kinds of map and reduce transformations to it, query it immediately with SQL, or even process it immediately with machine learning or graph computing algorithms. Storm cannot match this kind of one-stop big data processing.

Therefore, based on the above, Storm is the better choice when real-time requirements are very strict and the volume of real-time data is unstable, for example when there is a peak in the daytime. However, if real-time requirements are moderate, quasi-real-time processing with a delay of about 1 second is acceptable, and dynamic adjustment of parallelism is not required, Spark Streaming is the better choice.

In fact, Spark SQL cannot completely replace Hive, because Hive is a data warehouse built on HDFS that also provides a query engine, based on an SQL model, for distributed interactive queries against that big data warehouse.

Strictly speaking, what Spark SQL can replace is the query engine of Hive, not Hive itself. In fact, even in production environments, Spark SQL queries the data stored in the Hive data warehouse. Spark itself does not provide storage, so it naturally cannot replace Hive's function as a data warehouse.

One of the advantages of Spark SQL is that it is faster than the Hive query engine. For the same SQL statement, Hive's query engine, whose underlying layer is based on MapReduce, has to go through a shuffle process that writes to disk, so it is very slow. Many complex SQL statements take more than an hour to execute in Hive. Because its underlying engine, Spark, is memory-based, Spark SQL is several times faster than the Hive query engine.

But Spark SQL, like Spark itself, is a newcomer in the field of big data, so it is not yet mature. A few advanced features supported by Hive are not supported by Spark SQL. As a result, Spark SQL cannot completely replace the Hive query engine for the time being. It can only be used in scenarios where the features Spark SQL does support are sufficient.

Another advantage of Spark SQL over Hive is that it supports a large number of different data sources, including Hive, JSON, Parquet, JDBC and so on. In addition, because Spark SQL is part of the Spark technology stack and works on top of RDDs, it can be seamlessly integrated with the other Spark components to achieve many complex functions. For example, Spark SQL can execute SQL statements directly against files on HDFS.

The various offline batch processing functions that MapReduce can perform, as well as common algorithms (such as secondary sorting, top-N, etc.), can all be implemented with Spark RDD core programming, and implemented better and more easily. An offline batch program based on Spark RDD also runs several times faster than its MapReduce equivalent, which is a very obvious advantage in speed.
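As a rough illustration of the two algorithms named above, here is a minimal sketch in plain Python rather than the Spark RDD or MapReduce API. The sample records and the helper name `top_n_per_key` are hypothetical.

```python
import heapq
from itertools import groupby

# Hypothetical (key, value) records, as a batch job might read them.
records = [("b", 3), ("a", 5), ("b", 1), ("a", 2), ("c", 9), ("a", 7)]

# Secondary sort: order by key first, then by value within each key.
secondary_sorted = sorted(records, key=lambda kv: (kv[0], kv[1]))


def top_n_per_key(pairs, n):
    """Top-N: keep the n largest values for every key."""
    by_key = sorted(pairs, key=lambda kv: kv[0])  # groupby needs sorted input
    grouped = groupby(by_key, key=lambda kv: kv[0])
    return {
        key: heapq.nlargest(n, (value for _, value in group))
        for key, group in grouped
    }


top2 = top_n_per_key(records, 2)
print(secondary_sorted)
print(top2)
```

Both operations come down to a sort or a grouping by key, which is why they fit naturally into a key-value batch model. In a distributed setting the same grouping is what triggers a shuffle.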

The main reason why Spark is faster than MapReduce is that the computing model of MapReduce is too rigid: every job must follow the map-reduce pattern. Sometimes, even to complete simple operations such as filtering, you still have to go through the full map-reduce process, and therefore through the shuffle process. The shuffle process of MapReduce is the most expensive part in terms of performance, because the intermediate data of the shuffle must be read from and written to disk. Although Spark's shuffle is also disk-based, a large number of transformation operations, such as simple map or filter operations, can be pipelined directly in memory, so performance naturally improves greatly.
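The contrast between pipelining in memory and materializing every stage on disk can be sketched as follows. This is a plain Python analogy using generators and temporary files, not how Spark or MapReduce is actually implemented; the function names `pipelined` and `staged` and the file names are assumptions made for the example.

```python
import json
import os
import tempfile

numbers = range(1, 11)


def pipelined(data):
    """Chain map and filter lazily: each element flows through both steps in memory."""
    doubled = (x * 2 for x in data)            # map step, nothing computed yet
    kept = (x for x in doubled if x % 3 == 0)  # filter step, still lazy
    return sum(kept)                           # one pass, no intermediate result stored


def staged(data, workdir):
    """Write every intermediate result to disk, as a chain of separate jobs would."""
    stage1 = os.path.join(workdir, "stage1.json")
    with open(stage1, "w") as f:
        json.dump([x * 2 for x in data], f)    # map stage output goes to disk
    with open(stage1) as f:
        doubled = json.load(f)                 # the next stage reads it back
    stage2 = os.path.join(workdir, "stage2.json")
    with open(stage2, "w") as f:
        json.dump([x for x in doubled if x % 3 == 0], f)
    with open(stage2) as f:
        return sum(json.load(f))


with tempfile.TemporaryDirectory() as workdir:
    staged_total = staged(numbers, workdir)
pipelined_total = pipelined(numbers)
print(pipelined_total, staged_total)
```

Both functions return the same total, but the staged version pays for two disk writes and two disk reads along the way. On real data volumes that extra I/O, multiplied across many stages, is where most of the time goes.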

But Spark also has its disadvantages. Because Spark computes in memory, although it is easy to develop with, when you really face big data (for example, operations at a scale of 1 billion or more), various problems may occur without tuning, such as OOM (out of memory) errors. As a result, the Spark program may not be able to run to completion and simply fails with an error, while MapReduce will at least finish, even if slowly.

In addition, Spark is a rising newcomer, so its maturity in the field of big data is definitely not as good as that of MapReduce. For example, when HBase and Hive are used as the input and output of offline batch programs, Spark's support is far less mature than MapReduce's, and such programs are very troublesome to implement.

