This article introduces the basics of Storm: what it is, how a Storm cluster is organized, and the core concepts of topologies, tuples, streams, spouts, bolts, and stream groupings.
What is Storm?
If Storm had to be described in a single sentence, it might be this: a distributed real-time computation system. In the words of Storm's author, Storm is to real-time computation what Hadoop is to batch processing. As we all know, Hadoop, modeled on Google's MapReduce, gives us the map and reduce primitives, which make batch programs simple and elegant. Similarly, Storm provides simple and elegant primitives for real-time computation.
Let's look at Storm's main use cases.
1. Stream processing. Storm can process a continuous stream of messages and write the results to a store.
2. Continuous computation. Storm can continuously compute results and push them to clients, letting them update and display results in real time, such as website metrics.
3. Distributed RPC. Because Storm's processing components are distributed and processing latency is very low, it can serve as a general-purpose distributed RPC framework. For that matter, our search engine is itself a distributed RPC system.
Characteristics of Storm
Storm is an open-source distributed real-time computation system that can process large volumes of streaming data simply and reliably. Storm has many use cases: real-time analytics, online machine learning, continuous computation, distributed RPC, ETL, and so on. Storm scales horizontally, is highly fault tolerant, and guarantees that every message is processed, while staying fast: a benchmark clocked it at over a million tuples processed per second per node. Storm is easy to deploy and operate, and, more importantly, applications can be written in any programming language.
Storm cluster architecture:
Nimbus (master node):
The master node runs a daemon called Nimbus, which distributes code around the cluster, assigns tasks to machines, and monitors for failures. This is very similar to Hadoop's JobTracker.
Supervisor (worker node):
Each worker node runs a daemon called Supervisor, which listens for work assigned to its machine and starts or stops worker processes as needed. Each worker process executes a subset of a topology. Coordination between Nimbus and the Supervisors goes through a Zookeeper cluster.
Zookeeper
Zookeeper is the service that coordinates the Supervisors and Nimbus. The application's real-time logic, for its part, is packaged into a Storm "topology": a graph of Spouts (data sources) and Bolts (data operations) connected by Stream Groupings.
Topology
A topology is a real-time application running in Storm; the flow of messages between its components forms a logical topology. Concretely, a topology is a graph of spouts and bolts connected by stream groupings.
A topology runs until you explicitly kill it. Storm automatically reassigns failed tasks and, if high reliability is enabled, guarantees that no data is lost. If some machines stop unexpectedly, all the tasks on them are transferred to other machines.
Running a topology is simple. First, package all your code and the jars it depends on into a single jar. Then run a command like the following:
storm jar all-my-code.jar backtype.storm.MyTopology arg1 arg2
This runs the main class backtype.storm.MyTopology with the arguments arg1 and arg2. The class's main function defines the topology and submits it to Nimbus; the storm jar command takes care of connecting to Nimbus and uploading the jar.
Since a topology is defined as a Thrift structure and Nimbus is a Thrift service, topologies can be created and submitted in any language. The above is simply the easiest way to do it from a JVM-based language.
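For concreteness, here is a minimal sketch of what such a main class might look like. It assumes the classic backtype.storm API (Storm 0.x; newer releases use the org.apache.storm package instead), and MySpout and MyBolt are hypothetical stand-ins for your own spout and bolt classes:

import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.topology.TopologyBuilder;

public class MyTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // MySpout and MyBolt are hypothetical placeholders for your
        // own spout and bolt implementations.
        builder.setSpout("source", new MySpout());
        builder.setBolt("processor", new MyBolt())
               .shuffleGrouping("source");
        Config conf = new Config();
        // Submit under the name given on the command line (arg1 above);
        // StormSubmitter uploads the jar and hands the topology to Nimbus.
        StormSubmitter.submitTopology(args[0], conf, builder.createTopology());
    }
}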
Tuple
Storm uses the tuple as its data model. Each tuple is a named list of values, and each value can be of any type; you can think of a tuple as a Java object with no methods. Out of the box, Storm supports all primitive types, strings, and byte arrays as tuple value types. You can also use your own types as values, as long as you implement a corresponding serializer.
A Tuple is the basic processing unit in a data stream, such as a single cookie log entry; it can contain multiple Fields, each Field representing an attribute.
A Tuple would naturally be a Key-Value Map, but since the field names of the tuples passed between components are declared in advance, a Tuple only needs to supply the Values in order, so it is effectively a Value List.
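A small sketch of what this means in code (backtype.storm API assumed; the field names "user-id" and "url" are illustrative, not from the article): fields are named once when the stream is declared, values are emitted as an ordered list, and a consumer can read them by position or by name:

import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;

public class TupleAccessSketch {
    // In a spout or bolt, the field names are declared once, in order:
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("user-id", "url"));
    }

    // Emitting later supplies only the values, in the declared order:
    //   collector.emit(new Values(42L, "http://example.com/page"));

    // A consuming bolt can read a value by position or by field name:
    public void printTuple(Tuple input) {
        long userId = input.getLong(0);               // by position
        String url  = input.getStringByField("url");  // by name
        System.out.println(userId + " -> " + url);
    }
}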
An unbounded, continuous sequence of Tuples makes up a Stream.
Stream
The stream is the key abstraction in Storm. A stream is an unbounded sequence of tuples. Storm provides primitives for transforming a stream into a new stream in a distributed and reliable way. For example, you can transform a stream of tweets into a stream of trending topics.
The most basic primitives Storm provides for working with streams are the spout and the bolt. You implement the corresponding Spout and Bolt interfaces to supply your application's logic.
Spout
A spout is the message producer in a Storm topology. In short, a spout reads data from a source and feeds it into the topology; it is the source of a stream. For example, a spout might read messages off a Kestrel queue and emit them as a stream, or call a Twitter API and emit the returned tweets as a stream.
Typically, a spout reads data from an external source (a queue, a database, etc.), wraps it in Tuples, and emits it into a Stream. A spout is an active component: its interface has a nextTuple method, which the Storm framework calls over and over.
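As a hedged illustration (backtype.storm API assumed, data hard-coded for brevity), a minimal spout might look like this; Storm drives the loop, and each call to nextTuple emits one tuple:

import java.util.Map;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;

public class SentenceSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private final String[] sentences = {
        "the cow jumped over the moon",
        "an apple a day keeps the doctor away"
    };
    private int index = 0;

    @Override
    public void open(Map conf, TopologyContext context,
                     SpoutOutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void nextTuple() {
        // Storm calls this in a loop; each call emits one tuple.
        collector.emit(new Values(sentences[index]));
        index = (index + 1) % sentences.length;
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // One declared field, so each tuple is a single-value list.
        declarer.declare(new Fields("sentence"));
    }
}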
Bolt
All processing in a topology is done in bolts; that is, all message-processing logic is encapsulated in bolts. A bolt can do anything: filtering, aggregation, joins, talking to files or databases, and so on.
A bolt consumes input Streams and produces new output Streams; it can run filters, functions, joins, database operations, and more. A bolt is a passive component: its interface has an execute(Tuple input) method, which is called whenever a message arrives and in which users run their own processing logic.
A bolt receives data from a spout and processes it; for complex stream processing, it can emit tuples on to another bolt, so a computation may pass through many bolts. For example, computing the most-retweeted images in a stream of images takes at least two steps: first count the retweets of each image, then pick the top 10 most-retweeted ones. (Making this computation more scalable may take more steps.)
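A minimal bolt sketch under the same assumptions: execute is called once per incoming tuple, and the bolt emits new tuples for whatever bolt comes next in the chain. This one splits the hypothetical "sentence" tuples from the spout sketch above into words:

import java.util.Map;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class SplitBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context,
                        OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        // Called once per incoming tuple: split the sentence and emit
        // one new tuple per word for the next bolt in the chain.
        for (String word : input.getStringByField("sentence").split(" ")) {
            collector.emit(new Values(word));
        }
        collector.ack(input); // tell Storm this tuple was fully handled
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}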
Stream Groupings
A Stream Grouping defines how a stream should be partitioned among a Bolt's tasks. Storm provides seven built-in Stream Grouping types, listed below (a wiring sketch follows the list):
1) Shuffle grouping: tuples are distributed randomly across the Bolt's tasks, so that each task receives roughly the same number of tuples.
2) Fields grouping: the stream is partitioned by the specified fields. For example, grouping on the "user-id" field sends tuples with the same "user-id" to the same task every time, while tuples with different "user-id" values may go to different tasks.
3) All grouping: each tuple is replicated to all of the Bolt's tasks. Use this type with care.
4) Global grouping: the entire stream goes to a single one of the Bolt's tasks, specifically the task with the lowest ID.
5) None grouping: you do not care how the stream is grouped. Currently this is equivalent to shuffle grouping, although eventually Storm will, where possible, run such bolts in the same thread as the bolt or spout they subscribe to.
6) Direct grouping: a special grouping in which the producer of a tuple decides which task of the consuming bolt receives it.
7) Local or shuffle grouping: if the target bolt has one or more tasks in the same worker process, tuples are shuffled among just those in-process tasks; otherwise this behaves like an ordinary shuffle grouping.
You can also implement the CustomStreamGrouping interface to define your own grouping.
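Here is a sketch of how groupings are chosen when wiring a topology, reusing the hypothetical SentenceSpout and SplitBolt from the sketches above; WordCountBolt is likewise a hypothetical per-word counting bolt:

import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

public class GroupingSketch {
    public static TopologyBuilder wire() {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("sentences", new SentenceSpout(), 2);

        // Shuffle grouping: sentences are spread randomly and evenly
        // across the SplitBolt tasks.
        builder.setBolt("split", new SplitBolt(), 4)
               .shuffleGrouping("sentences");

        // Fields grouping on "word": every occurrence of the same word
        // reaches the same task, which a per-word counter requires.
        builder.setBolt("count", new WordCountBolt(), 4)
               .fieldsGrouping("split", new Fields("word"));
        return builder;
    }
}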
A running topology consists of three kinds of components:
Worker processes
Executors (threads)
Tasks
Worker: a process that runs the logic of a subset of components. A topology may execute across one or more workers (worker processes); each worker is a separate JVM and executes a slice of the whole topology. Storm tries to spread the work evenly across all workers.
Executor: an executor is a single thread spawned by a worker process. It may run one or more tasks of the same component (spout or bolt).
Task: each spout or bolt runs as a number of tasks spread across the cluster, and each executor thread runs one or more of those tasks.
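A sketch of how the three levels are configured (values illustrative, backtype.storm API assumed): the worker count is set on the Config, the parallelism hint sets the executor count, and setNumTasks can raise the task count above it:

import backtype.storm.Config;
import backtype.storm.topology.TopologyBuilder;

public class ParallelismSketch {
    public static void configure(TopologyBuilder builder, Config conf) {
        conf.setNumWorkers(2);  // 2 worker processes (JVMs) for the topology

        // Parallelism hint of 2 = 2 executors (threads) for the spout.
        builder.setSpout("sentences", new SentenceSpout(), 2);

        // 2 executors running 4 tasks: each thread runs 2 SplitBolt tasks.
        builder.setBolt("split", new SplitBolt(), 2)
               .setNumTasks(4)
               .shuffleGrouping("sentences");
    }
}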