2025-01-30 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/01 Report--
This article explains why Storm is faster than Hadoop. The reasoning is simple and practical; interested readers may wish to follow along.
The word "fast" is ambiguous here; it actually covers two distinct metrics:
Latency: the time from when a piece of data is generated to when its result is available. The "fast" in the question mainly refers to this.
Throughput: the amount of data the system processes per unit of time.
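To make the distinction concrete, here is a minimal sketch with entirely hypothetical numbers (not benchmarks): the same pair of helper functions shows that a system can win on one metric while losing on the other.

```python
# Hypothetical illustration: latency and throughput are independent metrics.
# A batch system can have high throughput yet high latency, and vice versa.

def latency_s(collect_s, schedule_s, compute_s):
    """End-to-end delay from data generation to result, in seconds."""
    return collect_s + schedule_s + compute_s

def throughput_rps(records, wall_clock_s):
    """Records processed per second of wall-clock time."""
    return records / wall_clock_s

# Batch: waits 60 s to collect a file, 60 s to schedule, crunches 1M records in 30 s.
batch_lat = latency_s(60, 60, 30)            # 150 s before any result appears
batch_tp = throughput_rps(1_000_000, 150)

# Streaming: ~5 ms pipeline delay per record, but a lower sustained rate (assumed).
stream_lat = latency_s(0, 0, 0.005)
stream_tp = throughput_rps(200_000, 150)

print(f"batch:  latency={batch_lat}s, throughput={batch_tp:.0f} rec/s")
print(f"stream: latency={stream_lat}s, throughput={stream_tp:.0f} rec/s")
```

All four numbers are assumptions chosen only to show the shape of the trade-off the article describes.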
First, to be clear: given the same resources, Storm's latency is generally lower than MapReduce's, but its throughput is also generally lower than MapReduce's.
Storm's direct network transfer and in-memory computation give far lower latency than Hadoop's path through HDFS. When the workload fits the streaming model, Storm's record-at-a-time processing also saves the time spent collecting a batch of data, and because Storm runs as a long-lived service, it avoids job-scheduling delay as well. So in terms of latency, Storm is faster than Hadoop.
Consider a typical scenario: thousands of log producers generate log files that need some ETL before being stored in a database.
With Hadoop, you first have to land the data in HDFS, cutting files at some granularity, say one file per minute (an extremely fine granularity that already litters HDFS with small files). By the time Hadoop starts computing, a minute has passed; task scheduling takes roughly another minute; then the job runs, and even with so many machines that the computation itself finishes in seconds, plus a negligible time to write to the database, at least two minutes have passed between the data being generated and becoming usable.
Stream computing, by contrast, processes data as it is generated: a program continuously monitors log production, each new line is sent through a transport system to the streaming system, processed directly, and then written straight to the database. With sufficient resources, each record can go from generation to the database at the millisecond level.
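The two paths above can be sketched as a toy latency model. The timings and the `etl` record format are assumptions for illustration, not measurements of real systems: the point is that the batch path pays for the collection window and job scheduling before any record's result exists, while the streaming path pays only the pipeline delay.

```python
# Toy model of the log-ETL scenario (assumed timings, not measurements).

def etl(line):
    """Hypothetical ETL step: parse a 'ts,level,msg' log line into a dict."""
    ts, level, msg = line.split(",", 2)
    return {"ts": float(ts), "level": level, "msg": msg}

def batch_result_latency(line_ts, window_s=60.0, schedule_s=60.0, run_s=5.0):
    # The line sits in the current one-minute file until it closes, then
    # waits for job scheduling and the job run before reaching the database.
    window_close = (line_ts // window_s + 1) * window_s
    return (window_close - line_ts) + schedule_s + run_s

def stream_result_latency(line_ts, pipeline_s=0.005):
    # The line is picked up immediately and flows straight through.
    return pipeline_s

row = etl("12.5,INFO,user login")
print("parsed:", row)
print("batch  latency:", batch_result_latency(12.5), "s")
print("stream latency:", stream_result_latency(12.5), "s")
```

With these assumed numbers, a record produced 12.5 s into a window waits 47.5 s for the window to close plus 65 s of scheduling and running, versus 5 ms on the streaming path.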
Of course, if you take something like word count over a large file, which is inherently a batch computation, force it onto Storm as a stream, and then wait for all the existing data to be processed before Storm can emit a result, then when you compare it with Hadoop you are no longer comparing latency but throughput.
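Word count illustrates why this is a batch problem: the totals are only meaningful once all input has been seen. A minimal in-process sketch (an illustration of the map/reduce pattern, not Hadoop code):

```python
# Word count is inherently batch: the answer exists only after all input
# is consumed, so throughput, not latency, is the metric that matters.
from collections import Counter

def wordcount(lines):
    counts = Counter()
    for line in lines:          # "map" side: split each line into words
        counts.update(line.split())
    return dict(counts)         # "reduce" side: merged totals

corpus = ["to be or not to be", "to be is to do"]
print(wordcount(corpus))  # {'to': 4, 'be': 3, 'or': 1, 'not': 1, 'is': 1, 'do': 1}
```

A streaming system can emit running partial counts with low latency, but the final, complete counts still arrive no sooner than the last input record.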
Storm is a typical stream computing system, and MapReduce is a typical batch processing system. The comparison below walks through the stages of the data processing flow for each.
The whole data processing process can be roughly divided into three stages:
1. Data acquisition and preparation
2. Data computation (including intermediate storage during the computation); the design choices at this stage are what mainly account for the differences discussed here.
3. Presentation of data results (feedback)
1) In the data acquisition stage, the typical strategy today is as follows. The data usually originates from page instrumentation and parsed DB logs. Stream computing systems collect the data into a message queue (such as Kafka, MetaQ, or TimeTunnel), while batch processing systems generally collect it into a distributed file system (such as HDFS), though some also use message queues. For now, call both the message queue and the file system "pre-processing storage". The two differ little in latency and throughput at this stage; the big difference appears between pre-processing storage and computation. A stream system reads data from the message queue into the computing system (Storm) in near real time, whereas a batch system typically accumulates a large batch before importing it into the computing system (Hadoop). That is where a latency gap opens up.
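The two hand-off styles can be sketched with the standard library alone (a toy, not Kafka or HDFS code): the stream consumer takes each record off the queue the moment it is available, while the batch consumer buffers a full chunk, the stand-in for a file landed on HDFS, before running one "job" over it.

```python
# Sketch of the two acquisition styles; queue.Queue stands in for a
# message queue, and a buffered chunk stands in for a file on HDFS.
import queue

def stream_consume(q, handle):
    """Stream style: hand each record to compute the moment it arrives."""
    out = []
    while True:
        try:
            rec = q.get_nowait()
        except queue.Empty:
            break
        out.append(handle(rec))      # processed record by record
    return out

def batch_consume(records, handle, batch_size=3):
    """Batch style: buffer a whole chunk, then run one job over it."""
    out = []
    for i in range(0, len(records), batch_size):
        chunk = records[i:i + batch_size]     # stand-in for a landed file
        out.extend(handle(r) for r in chunk)  # one "job" per chunk
    return out

q = queue.Queue()
for i in range(5):
    q.put(i)
print(stream_consume(q, lambda r: r * 2))              # [0, 2, 4, 6, 8]
print(batch_consume(list(range(5)), lambda r: r * 2))  # [0, 2, 4, 6, 8]
```

Both styles produce the same results; the difference is when each result becomes available, per record versus per chunk.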
2) In the data computation stage, the low latency of a stream computing system (Storm) mainly comes from the following.
A: Storm's processes are long-lived and can process data in real time as it arrives.
MapReduce, after a batch of data has been stored, must have the job management system start the job, the JobTracker assign tasks, and the TaskTrackers launch the computing processes.
B: In Storm, data is transmitted between computing units directly over ZeroMQ.
In MapReduce, each map task's output is written to disk, and the reduce task then pulls it over the network. The extra disk reads and writes make this comparatively slow.
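The cost difference can be caricatured in a few lines (a toy contrast, not Storm or Hadoop code): one path hands map output to the next stage in memory, the other spills it to a file and reads it back, the way a shuffle does.

```python
# Toy contrast of the two hand-offs between computing stages.
import os
import tempfile

data = [str(i) for i in range(1000)]

def in_memory(records):
    # Storm-style: the downstream unit consumes upstream output directly.
    return sum(int(r) for r in records)

def via_disk(records):
    # MapReduce-style: map output hits disk; the reduce side reads it back.
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "w") as f:
        f.write("\n".join(records))           # "map" side: write to disk
    with open(path) as f:
        total = sum(int(line) for line in f)  # "reduce" side: read it back
    os.remove(path)
    return total

print(in_memory(data), via_disk(data))  # same answer, different I/O cost
```

Both compute the same sum; the second simply pays for a round trip through the filesystem, which is the overhead the article is pointing at.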
C: For complex computations:
Storm's computation model directly supports a DAG (directed acyclic graph) of processing steps.
MapReduce must chain multiple MR jobs together, and some of the intermediate map phases are pure pass-throughs that add nothing.
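A minimal DAG runner makes the point concrete. This is an assumed structure for illustration, not Storm's actual API: each node is a function fed by its parents' outputs, executed in topological order in a single pass, with no identity "map" stages wedged between steps.

```python
# Minimal DAG execution sketch (illustrative, not Storm's API).
def run_dag(nodes, edges, source_value):
    """nodes: name -> fn(list_of_parent_outputs) -> output; edges: (src, dst)."""
    parents = {n: [] for n in nodes}
    for src, dst in edges:
        parents[dst].append(src)
    done, out = set(), {}
    while len(done) < len(nodes):
        for n, fn in nodes.items():
            if n in done or any(p not in done for p in parents[n]):
                continue                      # wait until all parents ran
            inputs = [out[p] for p in parents[n]] or [source_value]
            out[n] = fn(inputs)               # run node once, keep output
            done.add(n)
    return out

nodes = {
    "spout":  lambda xs: xs[0],                  # emit the raw numbers
    "double": lambda xs: [v * 2 for v in xs[0]],
    "square": lambda xs: [v * v for v in xs[0]],
    "join":   lambda xs: sum(xs[0]) + sum(xs[1]),
}
edges = [("spout", "double"), ("spout", "square"),
         ("double", "join"), ("square", "join")]

result = run_dag(nodes, edges, [1, 2, 3])
print(result["join"])  # (2+4+6) + (1+4+9) = 26
```

Expressing the same fan-out/fan-in shape as chained MapReduce jobs would require two MR passes, with the intermediate results landing on storage in between.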
3) Delivery of results:
A stream computing system generally feeds results directly into the final result store (a display page, a database, a search engine's index) as they are produced. MapReduce generally imports results into the result store in one batch after the whole job completes.
In practice there is no hard line between stream computing and batch systems: Storm's Trident, for example, also has a batch concept, and MapReduce can shrink the dataset of each run (say, by launching a job every few minutes). Facebook's Puma is a stream computing system built on Hadoop.
At this point, you should have a deeper understanding of why Storm is faster than Hadoop. Try it out in practice, and follow us to continue learning!