In this issue, the editor brings you an analysis of the concepts and working principles of Storm. The article is rich in content and analyzed from a professional point of view; I hope you get something out of reading it.
The structure of Storm
Storm and traditional relational databases
Traditional relational databases save first and compute later, whereas Storm computes first and then saves, or often does not save at all.
It is difficult for traditional relational databases to support real-time computation; they can only run scheduled tasks that statistically analyze windowed data.
Relational databases emphasize transactions and concurrency control; by comparison, Storm is much simpler in this respect.
Storm, Hadoop, and Spark are all popular big data solutions.
Languages closely related to Storm: the core code is written in Clojure, the utilities are developed in Python, and topologies are developed in Java.
Topology
There are two kinds of nodes in a Storm cluster: control nodes (Nimbus nodes) and worker nodes (Supervisor nodes). All Topology tasks must be submitted from a Storm client node (which requires a configured storm.yaml file); the Nimbus node then assigns them to Supervisor nodes for processing. The Nimbus node first splits the submitted Topology into Tasks and publishes the information about the Tasks and Supervisors to the zookeeper cluster. Each Supervisor claims its own Tasks from the zookeeper cluster and notifies its own Worker processes to process them.
Compared with MapReduce, which is also a computing framework: a Job runs on a MapReduce cluster and a Topology runs on a Storm cluster, but a Job ends by itself once it finishes running, whereas a Topology keeps running until it is manually killed.
Storm does not handle the persistence of computation results; that is the responsibility of the application code. If the data volume is small, you can simply keep it in memory, update a database on every change, or use NoSQL storage. This part is left entirely to the user.
Displaying the data after it is stored is also something you must handle yourself; the Storm UI only provides monitoring and statistics for topologies.
The overall flow of Topology processing is shown in the flow chart from the original post (not reproduced here).
Zookeeper cluster
Storm uses zookeeper to coordinate the whole cluster, but note that storm does not use zookeeper to pass messages, so the load on zookeeper is very low. A single zookeeper node is sufficient in most cases, but if you want to deploy a larger storm cluster you will need a somewhat larger zookeeper ensemble. For how to deploy zookeeper, see http://zookeeper.apache.org/doc/r3.3.3/zookeeperAdmin.html
There are some things to pay attention to when deploying zookeeper:
1. It is very important to monitor zookeeper well. Zookeeper is a fail-fast system and will exit as soon as any error occurs, so it must be supervised in production. For more details, see http://zookeeper.apache.org/doc/r3.3.3/zookeeperAdmin.html#sc_supervision.
2. In practice, you need to configure a cron job to compact zookeeper's data and transaction logs. Zookeeper does not clean these up by itself, so without such a cron job you will quickly run out of disk space. For more details, see http://zookeeper.apache.org/doc/r3.3.3/zookeeperAdmin.html#sc_maintenance.
Component
In Storm, Spouts and Bolts are both Components, so Storm defines a top-level interface called IComponent.
The whole family of interfaces and base classes is shown in a class diagram in the original post: the green part is the most commonly used and relatively simple part, while the red part is transaction-related.
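For reference, the IComponent interface declares just the two methods shared by all components (the method set here is recalled from the Apache Storm codebase, so treat it as a sketch; package names follow Storm 1.x, where older releases used backtype.storm):

import java.io.Serializable;
import java.util.Map;
import org.apache.storm.topology.OutputFieldsDeclarer;

public interface IComponent extends Serializable {
    // declare the fields of the tuples this component emits
    void declareOutputFields(OutputFieldsDeclarer declarer);
    // per-component configuration, or null if there is none
    Map<String, Object> getComponentConfiguration();
}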
Spout
A Spout is the message source of a Stream. A Spout component can be implemented by extending the BaseRichSpout class (or another Spout base class), or by implementing the IRichSpout interface.
public interface ISpout extends Serializable {
    void open(Map conf, TopologyContext context, SpoutOutputCollector collector);
    void close();
    void nextTuple();
    void ack(Object msgId);
    void fail(Object msgId);
}
open() method: the initialization method.
close(): called when the spout is about to shut down. It is not guaranteed to run, however, because the supervisor node in the cluster can kill the worker process with kill -9; only when Storm runs in local mode and a stop command is issued is close guaranteed to execute.
ack(Object msgId): the callback invoked when a tuple is processed successfully. Typically it removes the message from the message queue to prevent it from being replayed.
fail(Object msgId): the callback invoked when a tuple fails processing. Typically it puts the message back into the message queue so that it can be replayed later.
nextTuple(): the most important method in a Spout; emitting a Tuple into the Topology is done here. When this method is called, storm is asking the spout to emit a tuple to the output collector. The method should be non-blocking, so if the spout has no tuples to emit it should simply return. nextTuple, ack, and fail are all called in a loop on the same thread of the spout task, so when there is nothing to emit you should let nextTuple sleep for a short time (say, a millisecond) so as not to waste too much CPU.
After extending BaseRichSpout, you do not need to implement the close, activate, deactivate, ack, fail, and getComponentConfiguration methods; you only need to care about the core ones.
Usually (Shell spouts and transactional spouts aside), to implement a Spout you can implement the IRichSpout interface directly, or, if you do not want to write extra code, simply extend BaseRichSpout.
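A minimal sketch of such a spout, extending BaseRichSpout (the class name, sentences, and queue comments are illustrative assumptions; package names again follow Apache Storm 1.x):

import java.util.Map;
import java.util.Random;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class RandomSentenceSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private Random random;
    private String[] sentences;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        // initialization: keep a handle on the collector for nextTuple()
        this.collector = collector;
        this.random = new Random();
        this.sentences = new String[] {
            "the cow jumped over the moon",
            "an apple a day keeps the doctor away"
        };
    }

    @Override
    public void nextTuple() {
        // non-blocking: emit one tuple per call, sleeping briefly to avoid wasting CPU
        Utils.sleep(1);
        String sentence = sentences[random.nextInt(sentences.length)];
        // passing a message id as the second argument enables the ack/fail callbacks
        collector.emit(new Values(sentence), sentence);
    }

    @Override
    public void ack(Object msgId) {
        // success callback: a real spout would remove msgId from its pending queue here
    }

    @Override
    public void fail(Object msgId) {
        // failure callback: a real spout would re-enqueue msgId for replay here
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("sentence"));
    }
}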
Bolt
A Bolt receives the Tuples emitted by a Spout or by upstream Bolts and processes them. A Bolt component can be implemented by extending the BaseRichBolt class or by implementing the IRichBolt interface.
prepare method: analogous to a Spout's open method, it is called when the task is initialized in a worker on the cluster, and it provides the environment for the bolt's execution.
declareOutputFields method: declares the fields contained in the Tuples emitted by the current Bolt, just as in a Spout.
cleanup method: like ISpout's close method, it is called before shutdown, and likewise it is not guaranteed to run.
execute method: the most critical method in a Bolt; Tuple processing is done here, and emission happens through the emit method. execute accepts one tuple, processes it, and reports the result by calling ack (success) or fail (failure) on the OutputCollector passed in via the prepare method.
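A hedged sketch of this manual-feedback style, extending BaseRichBolt (names are illustrative, matching the spout sketch above):

import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class SplitSentenceBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        // like a spout's open(): keep the collector so execute() can emit and ack
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        for (String word : input.getStringByField("sentence").split(" ")) {
            // anchoring the emitted tuple to the input ties it into reliability tracking
            collector.emit(input, new Values(word));
        }
        collector.ack(input); // report success; call collector.fail(input) on failure
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}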
Storm also provides the IBasicBolt interface; a Bolt implementing it does not have to report results in code, because Storm automatically reports success. If you really do want to report a failure, you can throw a FailedException.
In general, to implement a Bolt you can implement the IRichBolt interface or extend BaseRichBolt. If you do not want to handle the result feedback yourself, implement the IBasicBolt interface or extend BaseBasicBolt instead, which in effect calls collector.ack(inputTuple) for you automatically.
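The same bolt in the automatic-feedback style, extending BaseBasicBolt (again a hedged sketch; the FailedException usage follows the note above):

import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.FailedException;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class BasicSplitSentenceBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        String sentence = input.getStringByField("sentence");
        if (sentence == null) {
            // throwing FailedException is the way to report failure; otherwise
            // Storm acks the input tuple automatically after execute() returns
            throw new FailedException("missing sentence field");
        }
        for (String word : sentence.split(" ")) {
            collector.emit(new Values(word));
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}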
Topology running process
(1) After a Topology is submitted, its code is first stored in the inbox directory of the Nimbus node; then a stormconf.ser file, generated from the Topology's run configuration, is placed in the stormdist directory of the Nimbus node, alongside the serialized Topology code file.
(2) When setting up the Spouts and Bolts associated with the Topology, you can set the number of executors and the number of tasks of each Spout and Bolt at the same time (see the sketch after this list). By default, the total number of tasks in a Topology equals the total number of executors. The system then distributes the tasks evenly across the workers; which supervisor node a worker runs on is decided by storm itself.
(3) After the tasks are assigned, the Nimbus node publishes the task information to the zookeeper cluster. A workerbeats node also exists in the zookeeper cluster, storing the heartbeat information of all worker processes in the current Topology.
(4) The Supervisor nodes constantly poll the zookeeper cluster. All the Topology's task assignments, code storage directories, relationships between tasks, and so on are stored in zookeeper's assignments node; each Supervisor polls this node to obtain its own tasks and starts worker processes to run them.
(5) Once a Topology is running, its Spouts keep emitting Streams, and its Bolts keep processing the Streams they receive. A Stream is unbounded.
This last step runs without interruption unless the Topology is terminated manually.
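As a hedged sketch of step (2), this wires the spout and bolt sketches above together; the component names are illustrative, the parallelism hint sets the executor count, and setNumTasks overrides the task count:

import org.apache.storm.topology.TopologyBuilder;

// inside a main() or setup method:
TopologyBuilder builder = new TopologyBuilder();
// 2 executors and 4 tasks for the spout
builder.setSpout("sentence-spout", new RandomSentenceSpout(), 2).setNumTasks(4);
// 4 executors; the task count defaults to the executor count
builder.setBolt("split-bolt", new SplitSentenceBolt(), 4)
       .shuffleGrouping("sentence-spout");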
Topology operation modes
Before you begin to create a project, it is important to understand Storm's operation modes. Storm can run in two ways.
Local-mode submission, for example:
LocalCluster cluster = new LocalCluster();
cluster.submitTopology(TOPOLOGY_NAME, conf, builder.createTopology());
Thread.sleep(2000);
cluster.shutdown();
Distributed submission, for example:
StormSubmitter.submitTopology(TOPOLOGY_NAME, conf, builder.createTopology());
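Putting the two modes together, a hedged sketch of a main() that submits to the cluster when a topology name is passed on the command line (a common convention, not mandated by this article) and otherwise runs locally:

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;

public class WordCountTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("sentence-spout", new RandomSentenceSpout(), 2);
        builder.setBolt("split-bolt", new SplitSentenceBolt(), 4)
               .shuffleGrouping("sentence-spout");
        Config conf = new Config();
        if (args.length > 0) {
            // distributed: the first argument supplies the unique topology name
            StormSubmitter.submitTopology(args[0], conf, builder.createTopology());
        } else {
            // local mode: run in-process briefly, then shut down
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("test-topology", conf, builder.createTopology());
            Thread.sleep(2000);
            cluster.shutdown();
        }
    }
}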
Note that once the Storm code is written, it must be packaged as a jar and run on Nimbus. When packaging, do not bundle in all the dependency jars: if the dependent storm jar is included, a duplicate-configuration-file error will occur at run time and the Topology will fail to run, because the local storm.yaml configuration file is loaded before the Topology runs.
The command to run is as follows: storm jar StormTopology.jar mainclass [args]
Commands for the storm daemon
Nimbus: storm nimbus starts the Nimbus daemon.
Supervisor: storm supervisor starts the Supervisor daemon.
UI: storm ui starts the Storm UI daemon, which provides a web-based user interface for monitoring the storm cluster.
DRPC: storm drpc starts the DRPC daemon.
Storm management commands
JAR: storm jar topology_jar topology_class [arguments...]
The jar command is used to submit a topology to the cluster. It runs the main() method of topology_class with the specified arguments, uploads topology_jar to nimbus, and nimbus publishes it to the cluster. Once submitted, storm activates the topology and starts processing. The main() method in topology_class is responsible for calling StormSubmitter.submitTopology() and supplying a topology name that is unique within the cluster; if a topology with that name already exists in the cluster, the jar command fails. It is common to take the topology name from a command-line parameter so that the topology is named at submission time.
KILL: storm kill topology_name [-w wait_time]
To kill a topology, use the kill command. It destroys the topology safely: the topology is first deactivated and allowed to finish the data currently in flight for the duration of the wait period. When executing the kill command, you can specify the wait time after deactivation with -w [wait seconds]. The same thing can also be done from the Storm UI.
Deactivate: storm deactivate topology_name
When a topology is deactivated, tuples that have already been distributed will still be processed, but the nextTuple method of its spouts will no longer be called. The same thing can also be done from the Storm UI.
Activate: storm activate topology_name
Restarts a deactivated topology. The same thing can also be done from the Storm UI.
Rebalance: storm rebalance topology_name [-w wait_time] [-n worker_count] [-e component_name=executor_count]...
Rebalance lets you redistribute the cluster's tasks. It is a powerful command; for example, after you add nodes to a running cluster, the rebalance command deactivates the topology, reassigns the workers once the corresponding timeout expires, and then restarts the topology.
Example: storm rebalance wordcount-topology -w 15 -n 5 -e sentence-spout=4 -e split-bolt=8
There are other management commands as well, such as remoteconfvalue, repl, and classpath.
Considerations for creating a new storm project
To develop storm projects, you need storm's jars on your classpath. The recommended way is to use Maven; if you do not use Maven, you can manually add all the jars from the storm distribution to the classpath.
The storm-starter project uses Leiningen as its build and dependency management tool. You can install Leiningen by downloading this script (https://raw.githubusercontent.com/technomancy/leiningen/stable/bin/lein), making it executable, and adding it to your PATH. Then, to pull in all of storm's dependencies, simply run lein deps in the root directory of the project.
The above is the analysis of the concepts and working principles of Storm. If you happen to have similar doubts, you may refer to the analysis above.