What are the advantages of Storm
This article introduces what the advantages of Storm are. These are questions many people run into in practical cases, so the sections below walk through how to deal with them. I hope you read carefully and get something out of it!
The difference between Storm and Hadoop
Data source: Hadoop works on data already sitting under a folder on HDFS, possibly TBs of it; Storm works on each new piece of data as it arrives in real time.
Process: Hadoop is divided into a MAP stage and a REDUCE stage; a Storm process is user-defined and can contain multiple steps, each of which is either a data source (SPOUT) or processing logic (BOLT).
Whether it ends: a Hadoop job eventually finishes; a Storm topology never ends. After the last step it simply waits, and starts over when new data comes in.
Processing speed: Hadoop is built to process large volumes of data on HDFS, so it is slow; Storm only has to handle one new piece of data at a time, so it can be very fast.
Applicable scenarios: Hadoop is for processing a batch of data where timeliness does not matter; you submit a JOB and wait. Storm is for processing newly arriving data where timeliness is essential.
Compared with MQ: Hadoop has no real counterpart here. Storm can be seen as N steps where each step, after processing, sends a message to the next MQ, and consumers listening on that MQ continue the processing (a hand-rolled sketch of this queue+worker pattern follows below).
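To make the MQ comparison concrete, here is a minimal hand-rolled queue+worker pipeline in plain Java (all names and the processing steps are hypothetical): each worker consumes from one queue and produces to the next. This is roughly the pattern a Storm topology replaces, minus the distribution, fault tolerance, and scaling Storm adds.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class QueueWorkerPipeline {
    public static void main(String[] args) {
        BlockingQueue<String> stage1 = new LinkedBlockingQueue<>();
        BlockingQueue<String> stage2 = new LinkedBlockingQueue<>();

        // Worker 1: a "bolt" by hand, consuming from one queue, producing to the next.
        new Thread(() -> {
            try {
                while (true) {
                    String msg = stage1.take();
                    stage2.put(msg.toUpperCase()); // the "processing step"
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }).start();

        // Worker 2: the final consumer of the pipeline.
        new Thread(() -> {
            try {
                while (true) {
                    System.out.println(stage2.take());
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }).start();

        stage1.add("hello storm"); // the "spout" feeding data in
    }
}
```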
A good programming model lets developers focus on business logic; a bad one makes them spend their time on communication, exception handling, and other trivia.
Examples of programming models:
Compare Hadoop's MapReduce with MPI. In the MapReduce programming model you only write a map function, a reduce function, and a simple driver, and the program runs. You don't have to care how the input is split, how data moves from map to reduce, or how the reduce output is written to HDFS. Writing MapReduce feels like writing single-machine code, with no distributed details in it, yet a distributed framework runs underneath and does all of that for you.
Writing an MPI program, by contrast, is completely different: you clearly feel you are writing a parallel, distributed program. You must explicitly call data-transmission interfaces in many places and explicitly call synchronization interfaces before the MPI program will run. That is the difference in development experience caused by different programming models, and it is not just a question of ease of development but of development efficiency: simpler programs are easier to make robust. Writing a MapReduce program is very easy; writing a stable, reliable MPI program is much harder. (A minimal MapReduce sketch of this contrast follows.)
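As a sketch of that contrast, here is roughly what the developer writes in Hadoop's MapReduce model (standard org.apache.hadoop.mapreduce API; the word-count logic is illustrative): just a map function and a reduce function, while input splitting, the shuffle, and HDFS output stay inside the framework.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountMR {
    // The developer writes only map(); splitting the input and shuffling
    // by key are handled by the framework.
    public static class WCMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }

    // The developer writes only reduce(); writing results to HDFS is
    // handled by the framework.
    public static class WCReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```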
Architecture:
Nimbus: the master daemon; it distributes code and assigns work across the cluster.
Supervisor: runs on each node and starts or stops worker processes as Nimbus directs.
Worker: a JVM process that runs part of a topology.
Programming model:
DAG: a topology is a directed acyclic graph.
Spout: a data source.
Bolt: a step of processing logic.
Data transfer:
Zmq
Zmq (ZeroMQ) is also an open-source messaging framework. Despite the "mq" in its name it is not a message queue, but a better-encapsulated socket library.
Netty
Netty is a high-efficiency NIO network framework. Storm switched from zmq to Netty after moving to Apache, because zmq's license is not compatible with Storm's. After a bolt finishes processing a message, it notifies the spout.
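As an illustration of how the transport is chosen, a sketch assuming Storm 0.9.x, where backtype.storm.Config exposes a STORM_MESSAGING_TRANSPORT key (treat the key and the context class name as assumptions to verify against your Storm version):

```java
import backtype.storm.Config;

public class TransportConfig {
    public static Config nettyTransport() {
        Config conf = new Config();
        // Select the Netty-based transport (the default from Storm 0.9 onward);
        // older builds used a ZeroMQ-based context class here instead.
        conf.put(Config.STORM_MESSAGING_TRANSPORT,
                 "backtype.storm.messaging.netty.Context");
        return conf;
    }
}
```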
High availability
Exception handling
Message reliability guarantee mechanism
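To show what the message reliability guarantee mechanism looks like from the bolt side, here is a minimal sketch using the standard backtype.storm API (the processing logic is illustrative): the bolt anchors its output to the input tuple and then acks or fails it, so Storm can track the tuple tree.

```java
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
import java.util.Map;

public class AckingBolt extends BaseRichBolt {
    private OutputCollector _collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        _collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        try {
            // Anchoring: passing the input tuple ties the new tuple into the tuple tree.
            _collector.emit(input, new Values(input.getString(0).toLowerCase()));
            _collector.ack(input);   // tell Storm this tuple is fully processed here
        } catch (Exception e) {
            _collector.fail(input);  // triggers a replay from the spout
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}
```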
Maintainability:
Storm provides a web UI for monitoring the programs running on it.
Real-time request response service (synchronization)
A real-time request-response service (synchronous) often involves not one simple operation but a large amount of computation, and the DAG model can be used to speed up the processing of a single request.
DRPC
Real-time request processing
Example: send a picture, or a picture's address, and have the picture's features extracted.
What is the advantage of a DRPC Server here? Doesn't it look more troublesome than a plain server, going through a Spout and then Bolts? The point is that DRPC Server suits distributed processing: even a single request can be processed in a distributed way to speed it up.
```java
DRPCClient client = new DRPCClient("drpc-host", 3772);
String result = client.execute("reach", "http://twitter.com");
```

The server side consists of four parts: a DRPC Server, a DRPC Spout, a Topology, and a ReturnResults step.
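On the server side, a minimal sketch using LinearDRPCTopologyBuilder from backtype.storm.drpc (the feature-extraction bolt and topology name are hypothetical): the builder wires up the DRPC spout and the ReturnResults step for you; each request enters as a (request-id, arguments) tuple and the last bolt must emit (id, result).

```java
import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.drpc.LinearDRPCTopologyBuilder;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class FeatureDrpcTopology {
    // Hypothetical bolt: this is where the distributed work on one request happens.
    public static class ExtractFeatures extends BaseBasicBolt {
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            Object requestId = tuple.getValue(0);  // DRPC request id, passed through
            String imageUrl = tuple.getString(1);  // the argument from client.execute()
            collector.emit(new Values(requestId, "features-of:" + imageUrl));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("id", "result"));
        }
    }

    public static void main(String[] args) throws Exception {
        // The function name must match what the client calls: client.execute("reach", ...).
        LinearDRPCTopologyBuilder builder = new LinearDRPCTopologyBuilder("reach");
        builder.addBolt(new ExtractFeatures(), 4); // 4-way parallel work on one request
        StormSubmitter.submitTopology("drpc-demo", new Config(),
                                      builder.createRemoteTopology());
    }
}
```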
Streaming (asynchronous): "asynchronous" does not mean slow; it means the caller does not wait for a result.
Process messages one by one.
Example: ETL, which extracts the data you care about and stores it in a standard format. Its characteristic is: I give you the data and you do not have to return anything to me; that is what makes it asynchronous. (A minimal ETL bolt sketch follows.)
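A minimal sketch of such an ETL bolt (the log format and field names are hypothetical): it extracts only the fields of interest and emits them in a fixed schema; nothing is ever returned to the sender.

```java
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class EtlBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        // Hypothetical raw log line: "2015-05-31T10:00:00 GET /index.html 200"
        String[] parts = tuple.getString(0).split(" ");
        if (parts.length >= 4) {
            // Keep only the fields we care about, in a standard format; fire-and-forget.
            collector.emit(new Values(parts[0], parts[2], parts[3]));
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("time", "url", "status"));
    }
}
```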
Analysis and statistics
Examples: log PV/UV statistics and access-hotspot statistics. This kind of data has correlations, such as aggregating, summing, or averaging by certain fields.
Finally, the results are written to Redis, HBase, MySQL, or another MQ for other systems to consume. (A sketch of a Redis sink bolt follows.)
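As a sketch of that final write-out step, assuming the Jedis client for Redis (connection details, key names, and the per-tuple write policy are illustrative, not a production pattern):

```java
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Tuple;
import java.util.Map;
import redis.clients.jedis.Jedis;

public class RedisSinkBolt extends BaseBasicBolt {
    private transient Jedis jedis; // not serializable, so create it in prepare()

    @Override
    public void prepare(Map conf, TopologyContext context) {
        jedis = new Jedis("localhost", 6379); // illustrative connection
    }

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        // Aggregate per-word counts into a Redis hash for other systems to read.
        jedis.hincrBy("wordcount", tuple.getStringByField("word"), 1);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // Terminal bolt: emits nothing downstream.
    }
}
```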
```java
/*
 * shuffleGrouping("spout") subscribes to the spout's output. fieldsGrouping("split",
 * new Fields("word")) is in effect a hash: the same word always hashes to the same
 * WordCount bolt instance, so it can be counted there. The next lines build the
 * configuration (3 workers) and submit the topology to the Storm cluster through
 * StormSubmitter; the program is compiled and packaged. This code comes from
 * storm-starter. How does it actually run? With "storm jar", then the jar name,
 * the class name, and the topology name (that is args[0]). The code is very simple:
 * first build the DAG (a directed acyclic graph), then generate the configuration
 * and submit it to the Storm cluster.
 */
public static void main(String[] args) throws Exception {
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("spout", new RandomSentenceSpout(), 5);
    builder.setBolt("split", new SplitSentence(), 8).shuffleGrouping("spout");
    builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word"));
    Config conf = new Config();
    conf.setDebug(true);
    if (args != null && args.length > 0) {
        conf.setNumWorkers(3);
        StormSubmitter.submitTopologyWithProgressBar(args[0], conf, builder.createTopology());
    }
}
// Linux: storm jar storm-starter-0.9.4.jar storm.starter.WordCountTopology wordcount
```
Cluster Summary (for the entire cluster)
A slot corresponds to a worker; a worker is a JVM; a worker can have multiple executors; each executor is an execution thread; and each executor runs one or more tasks (by default, one). (A short parallelism sketch follows.)
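A short sketch of how these levels map to the API, reusing the RandomSentenceSpout and word-count bolt from this article's example (the numbers are illustrative):

```java
import backtype.storm.Config;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

public class ParallelismExample {
    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();
        // 5 executors (threads) for the spout.
        builder.setSpout("spout", new RandomSentenceSpout(), 5);
        // 12 executors for the bolt, spread over 24 tasks: 2 tasks per executor.
        builder.setBolt("count", new WordCount(), 12)
               .fieldsGrouping("spout", new Fields("word"))
               .setNumTasks(24);
        Config conf = new Config();
        conf.setNumWorkers(3); // 3 worker JVMs (slots) for the whole topology
    }
}
```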
Topology Summary (per application)
An application is a Topology. It has a name, an ID, and a state: ACTIVE means it is running, KILLED means it has been killed.
Topology actions let you operate on a Topology: Deactivate pauses it, Rebalance redistributes its load, and Kill kills the application.
How is this application actually running? Topology stats shows aggregated statistics over the last 10 minutes, 3 hours, 1 day, and all time. The key figure is Complete latency: how long it takes from the moment a tuple is emitted until it is fully processed. The other key figure is the acked count, which shows how many tuples stand behind that Complete latency number.
The Spout statistics show: the spout's name, which corresponds to the code; how many executors and tasks the spout has; how many tuples it emitted within the period; how many were actually transferred; its complete latency; how many tuples were acked; and finally any error information.
Likewise, Bolts expose these statistics in real time through the UI page, which is very useful for knowing which stage is slow, where the bottleneck is, and whether adding concurrency would solve it.
The most important method in a Spout is nextTuple(), the point where data is taken in from outside. You can take data from DRPC or from an MQ such as Kafka and pass it on to the downstream bolts. Wordcount is not that complex, so it simply generates its data randomly:
_collector.emit(new Values(sentence), new Object());
The new Object() at the end of this call amounts to generating a random message ID. What is it for? As we will see later, it is part of the message reliability guarantee: with this ID, Storm can track for you whether the message has been fully processed.
If it has, Storm calls ack to tell the spout it is done; here the ack method just prints the id, because the id carries no meaning in this demo. Conversely, if the message is not fully processed within a certain time, fail is called to report that processing failed. (A sketch of a spout that makes real use of these ids follows below.)
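Here is a minimal sketch of a spout that makes real use of those message ids (the pending map and replay-on-fail policy are illustrative choices, not Storm internals):

```java
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

public class ReliableSentenceSpout extends BaseRichSpout {
    private SpoutOutputCollector _collector;
    // Messages in flight, keyed by the id passed to emit().
    private final Map<Object, String> pending = new ConcurrentHashMap<>();

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        _collector = collector;
    }

    @Override
    public void nextTuple() {
        String sentence = "the cow jumped over the moon"; // stand-in for a real source
        Object msgId = UUID.randomUUID().toString();
        pending.put(msgId, sentence);
        _collector.emit(new Values(sentence), msgId);
    }

    @Override
    public void ack(Object msgId) {
        pending.remove(msgId); // fully processed, safe to forget
    }

    @Override
    public void fail(Object msgId) {
        String sentence = pending.get(msgId);
        if (sentence != null) {
            _collector.emit(new Values(sentence), msgId); // replay the failed message
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}
```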
```java
/**
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package storm.starter.spout;

import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import backtype.storm.utils.Utils;

import java.util.Map;
import java.util.Random;

public class RandomSentenceSpout extends BaseRichSpout {
    SpoutOutputCollector _collector;
    Random _rand;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        _collector = collector;
        _rand = new Random();
    }

    @Override
    public void nextTuple() {
        Utils.sleep(100);
        String[] sentences = new String[]{
                "the cow jumped over the moon",
                "an apple a day keeps the doctor away",
                "four score and seven years ago",
                "snow white and the seven dwarfs",
                "i am at two with nature"};
        String sentence = sentences[_rand.nextInt(sentences.length)];
        // The second argument is the message id used by the reliability mechanism.
        _collector.emit(new Values(sentence), new Object());
    }

    @Override
    public void ack(Object id) {
        System.out.println(id);
    }

    @Override
    public void fail(Object id) {
        System.out.println(id);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }

    @Override
    public Map getComponentConfiguration() {
        return null;
    }
}

// Counting bolt (from the topology class):
public static class WordCount1 extends BaseBasicBolt {
    Map<String, Integer> counts = new HashMap<String, Integer>();

    @Override
    public void execute(Tuple tuple, BasicOutputCollector basicOutputCollector) {
        String word = tuple.getStringByField("word");
        Integer count = counts.get(word);
        if (count == null) {
            count = 0;
        }
        count++;
        counts.put(word, count);
        System.out.println("word++" + word + "=" + count);
        basicOutputCollector.emit(new Values(word, count));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer outputFieldsDeclarer) {
        outputFieldsDeclarer.declare(new Fields("word", "count"));
    }
}
```
In the wordcount example there are two bolts: one bolt does the word splitting, the other does the counting. SplitSentence here demonstrates multi-language development: the code actually invokes Python's splitsentence.py through the ShellBolt component, as shown below.
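For reference, the SplitSentence wrapper in storm-starter looks roughly like this: a ShellBolt that launches the external Python process and exchanges tuples with it over stdin/stdout using Storm's multi-lang protocol.

```java
import backtype.storm.task.ShellBolt;
import backtype.storm.topology.IRichBolt;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.tuple.Fields;
import java.util.Map;

public class SplitSentence extends ShellBolt implements IRichBolt {
    public SplitSentence() {
        // Launches the external process; tuples go back and forth via the
        // multi-lang protocol over stdin/stdout.
        super("python", "splitsentence.py");
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }

    @Override
    public Map getComponentConfiguration() {
        return null;
    }
}
```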
The wordcount bolt itself is implemented in Java. The core of the implementation is two methods: execute and declareOutputFields. What do they do? execute takes the input Tuple, does the processing, and emits the result; declareOutputFields declares the fields of the tuples it emits.
The above is the simplest wordcount example in Storm: the code of its main function, the command line that submits it, what the Spout looks like, what the Bolt looks like, what state it is in after submission to the Storm cluster, and the core information you can see on the web UI. All of this is used heavily in later application development.
Comparison between Storm and other technologies
Storm: processes and threads run permanently; data never touches disk and is transmitted over the network.
MapReduce: designed for TB- and PB-scale data; a job is a batch.
Storm: pure streaming; the unit of processing is a Tuple. Storm was designed specifically for stream processing, so its data transmission is simpler and more efficient in many places. Batch processing is not impossible either: it can do micro-batching to improve throughput.
Spark Streaming: micro-batch processing. How can a batch system do streaming? It runs processing tasks quickly in memory over a DAG and keeps each RDD very small, approximating streaming with small batches.
Compared with other systems such as MPI
By comparison, you can better understand some of the characteristics of Storm:
First, compared with Queue+Worker, Storm is a general-purpose distributed system that shields you from distributed-systems details such as horizontal scaling and fault tolerance, so the application only needs to care about its own business logic. This matters greatly to application developers; otherwise the business logic gets entangled with low-level details.
Second, as a pure streaming system, Storm is quite different from MapReduce: one is streaming, the other batch processing. Storm runs permanently, and its message sending and receiving are very efficient.
Compared with a micro-batch system like Spark Streaming, Storm can process messages one at a time.
Generally speaking, Storm was defined from the start of its design as a distributed streaming system, so most streaming-computation requirements can be met well by Storm. Storm also does quite well on stability, which makes it a very good choice for real-time streaming computation.
This concludes "What are the advantages of Storm". Thank you for reading!