In this issue, the editor walks through the working principle of Storm. The article is rich in content and analyzes the topic from a professional point of view; I hope you get something out of it.
1 What is Storm?
Storm is a distributed real-time computing framework. According to its author, Nathan Marz, Storm means to real-time computing what Hadoop means to batch processing.
Storm is a streaming, real-time computing framework, whereas Hadoop is essentially a batch, offline computing framework.
Storm's core is written in Clojure and other parts are developed in Python; users can develop topologies in Java.
2 Workflow of Storm
There are two kinds of nodes in a Storm cluster: Nimbus (the control node) and Supervisor (the work nodes). They cooperate as follows:
1. The client submits a topology to Nimbus.
2. Nimbus creates a local directory for the topology and splits the topology into tasks.
3. Nimbus creates an assignments node on ZooKeeper to store the mapping between tasks and the workers on the Supervisor nodes.
4. Nimbus creates a taskbeats node on ZooKeeper to monitor the heartbeats of the tasks, and starts the topology.
5. Each Supervisor polls ZooKeeper, claims the tasks assigned to it, and starts one or more worker processes; each worker creates the corresponding task threads and initializes the connections between tasks based on the topology information, after which the whole topology is running.
Topology processing flowchart:
3 ZooKeeper Cluster
Storm uses ZooKeeper to coordinate the entire cluster, but note that Storm does not use ZooKeeper to pass messages, so the load on ZooKeeper is very low. A single ZooKeeper node is sufficient in most cases, but if you want to deploy a larger Storm cluster, you will need a somewhat larger ZooKeeper ensemble. For information on how to deploy ZooKeeper, refer to: http://zookeeper.apache.org/doc/r3.3.3/zookeeperAdmin.html.
There are some things to pay attention to when deploying ZooKeeper:
It is very important to supervise ZooKeeper: ZooKeeper is a fail-fast system and will exit as soon as something goes wrong. See http://zookeeper.apache.org/doc/r3.3.3/zookeeperAdmin.html#sc_supervision for more details.
In a production Storm environment, configure a cron job to compact ZooKeeper's data and transaction logs. The ZooKeeper daemon does not do this on its own, so if you don't set up the cron job, you will soon run out of disk space. More details can be found at http://zookeeper.apache.org/doc/r3.3.3/zookeeperAdmin.html#sc_maintenance.
4 The Program Structure of a Topology
Compared with MapReduce, another computing framework: a Job runs on a MapReduce cluster and a Topology runs on a Storm cluster, but whereas a Job ends on its own once it finishes, a Topology keeps running until it is killed manually.
Storm does not handle saving computation results; that is the responsibility of the application code. If the data volume is small, you can simply keep the results in memory; you can also update a database on every change, or use a NoSQL store. This part is entirely up to the application developers.
Structure diagram of Topology:
4.1 Component
In the figure above, Spout and Bolt are both Components, so Storm defines a root interface called IComponent. In the family class diagram, the green part is the most commonly used and relatively simple part, while the red part is related to transactions.
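For reference, the root interface itself is small. A sketch of IComponent as it appears in Storm's public API (assuming the org.apache.storm package namespace of Storm 1.x and later; older releases used backtype.storm):

package org.apache.storm.topology;

import java.io.Serializable;
import java.util.Map;

public interface IComponent extends Serializable {
    // Declare the fields of the tuples this component emits.
    void declareOutputFields(OutputFieldsDeclarer declarer);
    // Optional per-component configuration; may return null.
    Map<String, Object> getComponentConfiguration();
}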
4.2 Spout
A Spout is the message source of a Stream. A Spout component can be implemented by extending BaseRichSpout or another Spout class, or by implementing the IRichSpout interface:
public interface ISpout extends Serializable {
    void open(Map conf, TopologyContext context, SpoutOutputCollector collector);
    void close();
    void nextTuple();
    void ack(Object msgId);
    void fail(Object msgId);
}
open() - the initialization method.
close() - called when the spout is about to shut down. However, it is not guaranteed to be called, because a Supervisor node in the cluster can kill the worker process with kill -9; only when Storm runs in local mode and a stop command is sent is close() guaranteed to execute.
ack(Object msgId) - a callback invoked when a tuple has been processed successfully. Typically this method removes the message from the message queue to prevent it from being replayed.
fail(Object msgId) - a callback invoked when processing of a tuple fails. Typically this method puts the message back into the message queue so it can be replayed later.
nextTuple() - the most important method of a Spout. Emitting a tuple into the topology is done here: when this method is called, Storm is requesting that the spout emit a tuple to the output collector. The method should be non-blocking, so if the spout has no tuples to emit, it should simply return. nextTuple, ack, and fail are all called in a loop on the same thread of the spout task, so when there is nothing to emit, let nextTuple sleep for a short time (such as one millisecond) to avoid wasting too much CPU.
After extending BaseRichSpout, you do not have to implement close(), activate(), deactivate(), ack(), fail(), or getComponentConfiguration(), and can focus only on the core methods. In general (Shell and transactional spouts aside), to implement a Spout you can implement the IRichSpout interface directly; if you don't want to write the extra code, you can simply extend BaseRichSpout, as sketched below.
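To make this concrete, here is a minimal sketch of a spout built by extending BaseRichSpout. The class name RandomWordSpout and its word list are invented for the example, and the imports assume the org.apache.storm namespace of Storm 1.x and later:

import java.util.Map;
import java.util.Random;

import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class RandomWordSpout extends BaseRichSpout {
    private static final String[] WORDS = {"storm", "nimbus", "supervisor", "zookeeper"};
    private SpoutOutputCollector collector;
    private Random rand;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        // Initialization: keep a handle to the collector for use in nextTuple.
        this.collector = collector;
        this.rand = new Random();
    }

    @Override
    public void nextTuple() {
        // Non-blocking: emit one tuple and return. The second argument is the
        // msgId that Storm passes back to ack()/fail().
        String word = WORDS[rand.nextInt(WORDS.length)];
        collector.emit(new Values(word), word);
        // Sleep briefly so a quiet spout does not spin and waste CPU, as noted above.
        Utils.sleep(1);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}

BaseRichSpout supplies empty defaults for close(), ack(), fail(), and the rest, so only these three methods need to be written.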
4.3 Bolt
A Bolt receives the tuples sent by a Spout or an upstream Bolt and processes them. A Bolt component can be implemented by extending the BaseRichBolt class or implementing the IRichBolt interface.
prepare() - similar to the open method of a Spout; it is called when the task is initialized inside a worker in the cluster and provides the environment for the bolt to execute in.
declareOutputFields() - declares the fields contained in the tuples the Bolt emits, just as in a Spout.
cleanup() - like the close method of ISpout, called before shutdown; likewise, it is not guaranteed to be called.
execute() - the most critical method of a Bolt: tuples are processed here, and emission is done through the emit method. execute accepts one tuple to process and reports the result via the ack method (success) or fail method (failure) of the OutputCollector passed in through prepare.
Storm also provides the IBasicBolt interface; a Bolt that implements it does not need to report results in code, because Storm automatically reports success. If you really want to report a failure, you can throw a FailedException. In general, to implement a Bolt you can implement the IRichBolt interface or extend BaseRichBolt; if you don't want to handle the result feedback yourself, implement the IBasicBolt interface or extend BaseBasicBolt, which in effect automatically calls collector.ack(inputTuple) after execute returns, as the sketch below shows.
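As a sketch of this pattern, the bolt below extends BaseBasicBolt and counts the words emitted by the RandomWordSpout example above; the class name WordCountBolt and the in-memory map are invented for the example (a real application would persist the counts, as section 4 notes):

import java.util.HashMap;
import java.util.Map;

import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class WordCountBolt extends BaseBasicBolt {
    // In-memory state only; real code would write to a database or NoSQL store.
    private final Map<String, Integer> counts = new HashMap<>();

    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        String word = input.getStringByField("word");
        int count = counts.merge(word, 1, Integer::sum);
        collector.emit(new Values(word, count));
        // No explicit ack: with BaseBasicBolt, Storm acks the input tuple
        // automatically after execute returns; throw FailedException to fail it.
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
    }
}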
5 The Operation Modes of a Topology
Before you begin to create a project, it is important to understand Storm's operation modes. Storm can run in two ways: local mode and distributed (cluster) mode.
Submission in local mode:
LocalCluster cluster = new LocalCluster();
cluster.submitTopology(TOPOLOGY_NAME, conf, builder.createTopology());
Thread.sleep(2000);
cluster.shutdown();
Submission in distributed mode:
StormSubmitter.submitTopology(TOPOLOGY_NAME, conf, builder.createTopology());
Note that after writing the Topology code, you need to package it into a jar and run it on Nimbus. When packaging, do not include the dependent storm jar, otherwise you will get an error at runtime, because in cluster mode the topology runs against the cluster's own Storm environment (see the storm.yaml configuration file).
Run the command as follows:

storm jar StormTopology.jar mainclass [args]
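Tying the pieces together, here is a sketch of a main class that wires the example spout and bolt above into a topology and picks the submission mode from the command-line arguments; the class and component names are invented for the example:

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class WordCountTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("words", new RandomWordSpout(), 1);
        // fieldsGrouping routes tuples with the same "word" to the same task.
        builder.setBolt("count", new WordCountBolt(), 2)
               .fieldsGrouping("words", new Fields("word"));

        Config conf = new Config();
        if (args.length > 0) {
            // Distributed mode, e.g.: storm jar StormTopology.jar WordCountTopology myTopology
            StormSubmitter.submitTopology(args[0], conf, builder.createTopology());
        } else {
            // Local mode: run an in-process cluster briefly, then shut it down.
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("wordcount", conf, builder.createTopology());
            Thread.sleep(10000);
            cluster.shutdown();
        }
    }
}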
The above is an overview of how Storm works. If you have similar questions, you may find the analysis above helpful.