Which is faster, Storm or Hadoop?

2026-02-12 Update. From: SLTechnology News&Howtos

Shulou(Shulou.com)06/02 Report--

This article explains which is "faster", Storm or Hadoop, and what makes the difference.

The word "fast" is ambiguous. In technical terms it can mean two different things:

1. Latency: the time from when a piece of data is generated to when its result is produced. The "fast" in the question mainly refers to this.

2. Throughput: the amount of data the system processes per unit of time.
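The two definitions above can be made concrete with a small sketch. The `Record` class and the sample timestamps are hypothetical and chosen only for illustration; they are not measurements from Storm or Hadoop.

```python
from dataclasses import dataclass


@dataclass
class Record:
    created_at: float   # seconds: when the data was generated
    finished_at: float  # seconds: when its result became available


def average_latency(records):
    """Mean time from data generation to result, in seconds."""
    return sum(r.finished_at - r.created_at for r in records) / len(records)


def throughput(records):
    """Records processed per second over the observed time span."""
    start = min(r.created_at for r in records)
    end = max(r.finished_at for r in records)
    return len(records) / (end - start)


# Hypothetical run: four records generated one second apart, with all
# results delivered together at t = 10 s (batch-like behaviour).
batch_like = [Record(created_at=float(t), finished_at=10.0) for t in range(4)]

print(average_latency(batch_like))  # average wait per record, in seconds
print(throughput(batch_like))       # records per second
```

The two numbers are independent of each other: a system can process a great deal of data per second and still make each individual record wait a long time for its result.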

First, to be clear: for the same amount of resources, Storm's latency is generally lower than MapReduce's, but its throughput is also lower than MapReduce's.

Storm passes data directly over the network and computes in memory, so its latency is bound to be much lower than that of Hadoop, which passes data through HDFS. When the computing model suits streaming, Storm's stream processing also saves the time that batch processing spends collecting data. And because a Storm job runs as a long-lived service, it saves the delay of job scheduling as well. So in terms of latency, Storm is faster than Hadoop.

Consider a typical scenario: thousands of log producers generate log files, which need some ETL processing before being stored in a database.

With Hadoop, the data first has to be stored in HDFS and processed at a granularity of one file per minute. (That granularity is already extremely fine; anything finer would leave HDFS full of small files.) By the time Hadoop starts computing, one minute has passed. Scheduling the job takes another minute. Then the job runs; assuming the cluster has plenty of machines, it finishes in a few seconds. Assume that writing to the database also takes very little time. Even so, at least two minutes have passed between the data being generated and the data being usable.

With stream computing, a program continuously monitors the logs as they are generated. Each line is sent through a transport system to the stream computing system as soon as it is produced; the stream computing system processes it immediately and writes the result straight to the database. When resources are sufficient, each record can go from generation to the database in milliseconds.
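The arithmetic behind the two scenarios can be sketched as follows. The one-minute window and one-minute scheduling delay come from the text; the 5-second compute time, 1-second write time and the millisecond figures on the streaming side are assumed values standing in for "a few seconds", "very little time" and "milliseconds".

```python
def batch_latency_seconds(window_s, scheduling_s, compute_s, write_s):
    """Delay for a record that arrives at the start of a collection window:
    it waits for the window to close, then for scheduling, compute and write."""
    return window_s + scheduling_s + compute_s + write_s


def streaming_latency_seconds(transport_s, process_s, write_s):
    """Delay for a record that is shipped, processed and written on arrival."""
    return transport_s + process_s + write_s


# Batch: 1-minute file granularity and 1-minute scheduling (from the text),
# plus assumed 5 s of compute and 1 s of database writing.
batch = batch_latency_seconds(window_s=60, scheduling_s=60, compute_s=5, write_s=1)

# Streaming: assumed 5 ms transport, 2 ms processing, 3 ms database write.
stream = streaming_latency_seconds(transport_s=0.005, process_s=0.002, write_s=0.003)

print(f"batch: {batch} s, streaming: {stream * 1000:.0f} ms")
```

Under these assumptions the batch path is dominated by waiting (collection window plus scheduling), not by the computation itself, which is why adding machines does little to reduce it.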

Of course, running a word count over one large file is inherently a batch computing job. If you insist on running it on Storm as a stream, you still have to wait until all of the existing data has been processed before Storm can output the final result. If you compare Storm with Hadoop on that kind of job, what you are comparing is no longer latency but throughput.
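A minimal sketch of that point, in plain Python rather than Storm or Hadoop code: the streaming version can emit a running count after every line, but the final answer for a fixed file is available only once the last line has been read, exactly as in the batch version. The function names and sample lines are illustrative.

```python
from collections import Counter


def batch_wordcount(lines):
    """Batch style: read all of the input, then emit one final result."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def streaming_wordcount(lines):
    """Streaming style: update running counts and emit a snapshot per line."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
        yield dict(counts)  # counts seen so far, not the final answer


lines = ["storm hadoop storm", "hadoop hdfs"]
snapshots = list(streaming_wordcount(lines))
final_batch = batch_wordcount(lines)

# The last streaming snapshot matches the batch result; earlier ones are partial.
print(snapshots[0])
print(snapshots[-1] == dict(final_batch))
```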

Storm is a typical stream computing system, and MapReduce is a typical batch processing system. The processing flow of stream computing and batch processing systems is described below.

The entire data processing process can be roughly divided into three stages:

1. Data acquisition and preparation

2. Data computation (including intermediate storage during the computation). The "which aspects determine" in the question title mainly refers to how this stage is handled.

3. Presentation of results (feedback)

1) In the data collection stage, the typical strategy today is as follows. The data generally comes from page tracking events and database logs. Stream computing collects the data into a message queue (such as Kafka, MetaQ, or TimeTunnel). Batch processing systems typically collect the data into a distributed file system (such as HDFS), although some also use message queues. For now, let's call both message queues and file systems "preprocessing storage".

Up to this point the two approaches differ little in latency or throughput. The big difference appears between the preprocessing storage and the data computation stage. A stream computing system (Storm) generally reads data from the message queue in real time and computes on it immediately. A batch processing system (Hadoop) generally accumulates a large amount of data and then imports it into the computing system in bulk. This is where the difference in latency arises.

2) In the data computation stage, the low latency of a stream computing system (Storm) comes mainly from the following:

A: Storm's processes are resident, so data is processed in real time as soon as it arrives.

With MapReduce, data is accumulated first; then the job management system launches the job, the JobTracker computes the task assignment, and the TaskTrackers start the relevant worker processes.

B: In Storm, data moves between computing units directly over the network (ZeroMQ).

With MapReduce, the output of a map task is first written to disk (in standard Hadoop, the map node's local disk rather than HDFS), and the reduce tasks then pull it across the network for processing. That means comparatively more disk reads and writes, so it is slower.

C: For complex computations:

Storm's computing model directly supports a DAG (directed acyclic graph).

MapReduce needs several chained MR jobs, and some of the map steps in them do no meaningful work.
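The difference described in points A to C can be illustrated with a toy pipeline in plain Python. This is only an analogy under stated assumptions: the two stages `parse` and `store` are hypothetical, lazy generator chaining stands in for direct transfer between resident computing units, and `list(...)` stands in for writing a complete intermediate result before the next stage starts. A linear chain is used as the simplest case of a DAG.

```python
def parse(records, log):
    """Hypothetical first stage: normalise each record."""
    for r in records:
        log.append(f"parse {r}")
        yield r.upper()


def store(records, log):
    """Hypothetical second stage: pretend to write each record out."""
    for r in records:
        log.append(f"store {r}")
        yield r


def run_streaming(records):
    """Stages are chained lazily: each record passes through every stage
    before the next record is read."""
    log = []
    for _ in store(parse(records, log), log):
        pass
    return log


def run_staged(records):
    """Each stage materialises its whole output before the next one starts."""
    log = []
    parsed = list(parse(records, log))  # complete intermediate result
    for _ in store(parsed, log):
        pass
    return log


print(run_streaming(["a", "b"]))
print(run_staged(["a", "b"]))
```

In the streaming run the first record is stored before the second is even parsed, while in the staged run nothing is stored until every record has been parsed, which is where the extra waiting comes from.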

3) Presentation of results

Stream computing results are generally written directly into the final result set (display pages, databases, search engine indexes). MapReduce generally imports its results into the result set in bulk only after the whole job has finished.

In practice there is no fundamental difference between stream computing and batch processing systems. For example, Storm Trident also has a concept of batches, and MapReduce can shrink the data set handled by each run (for example, by launching a job every few minutes). Facebook's Puma is a stream computing system built on top of Hadoop.
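The middle ground between the two models is small batches. The following is a minimal sketch of the idea in plain Python; it is not Trident's or Puma's API, and the `micro_batches` helper and the batch size of 3 are illustrative assumptions.

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Group a (possibly endless) stream into small fixed-size batches."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:  # stream exhausted
            return
        yield batch


# Seven incoming records grouped into batches of at most three.
batches = list(micro_batches(range(7), batch_size=3))
print(batches)
```

Shrinking the batch size moves the behaviour toward per-record streaming and lower latency, while growing it moves toward classic batch processing and higher throughput per run.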
