
A classic map of Storm basics: Topologies, Streams, Spouts, Bolts, Stream groupings, Reliability, Tasks, Workers, Configuration.

1 、 Topologies

A topology is a graph made up of spouts and bolts; the spouts and bolts are connected to one another by stream groupings.

A topology runs until you kill it manually. Storm automatically reassigns failed tasks and guarantees that no data is lost (when high reliability is enabled). If a machine goes down unexpectedly, all of the tasks running on it are moved to other machines.

Running a topology is simple. First, package all of your code and the jars it depends on into a single jar. Then run a command like the following:

storm jar all-my-code.jar backtype.storm.MyTopology arg1 arg2

This command runs the main class backtype.storm.MyTopology with the arguments arg1 and arg2. The main function of this class defines the topology and submits it to Nimbus. The storm jar command takes care of connecting to Nimbus and uploading the jar.

A topology is defined as a Thrift structure and Nimbus is a Thrift service, so you can submit topologies created in any language. The approach above is simply the easiest way to do it from a JVM-based language.
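
As an illustration, here is a minimal sketch of how such a main class might look. SentenceSpout and SplitBolt are hypothetical components (they are not defined in this article); the sketch wires them together with a stream grouping, builds the Thrift topology structure, and submits it to Nimbus via StormSubmitter:

    import backtype.storm.Config;
    import backtype.storm.StormSubmitter;
    import backtype.storm.topology.TopologyBuilder;

    public class MyTopology {
        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();

            // Wire the graph: a spout feeding a bolt through a shuffle grouping.
            builder.setSpout("sentences", new SentenceSpout(), 2);   // hypothetical spout
            builder.setBolt("split", new SplitBolt(), 4)             // hypothetical bolt
                   .shuffleGrouping("sentences");

            Config conf = new Config();
            conf.setNumWorkers(2);   // run the topology in two worker processes

            // "storm jar" has already uploaded the jar; this call submits the topology to Nimbus.
            StormSubmitter.submitTopology("my-topology", conf, builder.createTopology());
        }
    }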

2 、 Streams

The stream (message flow) is a key abstraction in Storm. A stream is an unbounded sequence of tuples that is created and processed in parallel in a distributed fashion. A stream is defined by naming the fields of the tuples it carries. By default, tuple fields can be of type integer, long, short, byte, string, double, float, boolean, or byte array. You can also use custom types, as long as you implement a corresponding serializer.

Each stream is given an id when it is declared. Because single-stream spouts and bolts are so common, OutputFieldsDeclarer has methods that let you declare a stream without specifying an id; in that case the stream receives the default id 'default'.
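
For example, a component's declareOutputFields can declare both an unnamed (default) stream and an explicitly named one. This is only a fragment from inside a spout or bolt, and the field and stream names are made up for illustration:

    // Inside a spout or bolt; field and stream names are illustrative.
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // No stream id given: tuples go to the stream with the default id "default".
        declarer.declare(new Fields("word", "count"));
        // A second, explicitly named stream.
        declarer.declareStream("errors", new Fields("reason"));
    }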

The most basic primitives Storm provides for processing streams are the spout and the bolt. You implement the interfaces they expose to plug in your business logic.

3 、 Spouts

A spout is a source of messages (a message producer) in a Storm topology. Generally, a spout reads data from an external source and emits it into the topology as tuples. Spouts can be reliable or unreliable: a reliable spout can replay a tuple if Storm fails to process it, while an unreliable spout forgets a tuple as soon as it has been emitted.

A spout can emit more than one stream. Use OutputFieldsDeclarer.declareStream to define multiple streams, and specify the target stream when emitting with SpoutOutputCollector.

The most important method of a spout is nextTuple: it either emits a new tuple into the topology or simply returns if there is nothing to emit. Note that nextTuple must not block, because Storm calls all of a spout's methods on the same thread.

The other two important spout methods are ack and fail. Storm calls ack when it detects that a tuple has been successfully processed by the entire topology, and fail otherwise. ack and fail are only called for reliable spouts.
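
Putting these pieces together, here is a rough sketch of a reliable spout. The SentenceSpout class and its readNextSentenceOrNull helper are hypothetical placeholders for reading from an external source:

    import backtype.storm.spout.SpoutOutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichSpout;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Values;
    import backtype.storm.utils.Utils;
    import java.util.Map;

    // Hypothetical reliable spout: emits sentences and tags each tuple with a message id.
    public class SentenceSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        private long msgId = 0;

        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        // Must not block: either emit one tuple or return right away.
        public void nextTuple() {
            String sentence = readNextSentenceOrNull();   // hypothetical external read
            if (sentence == null) {
                Utils.sleep(10);                          // nothing to emit right now
                return;
            }
            // Passing a message id makes the tuple trackable, so ack/fail will be called.
            collector.emit(new Values(sentence), msgId++);
        }

        public void ack(Object msgId) { /* the tuple tree was fully processed */ }

        public void fail(Object msgId) { /* replay the sentence identified by msgId */ }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("sentence"));
        }

        private String readNextSentenceOrNull() { return null; }  // placeholder
    }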

4 、 Bolts

All the message processing logic is encapsulated in bolts. Bolts can do many things: filtering, aggregating, querying databases, and so on.

A bolt can also simply pass streams along. Complex stream processing usually requires several steps, and therefore several bolts. For example, finding the most retweeted images in a set of pictures takes at least two steps: first, count the retweets of each picture; second, pick the 10 most retweeted ones. (Making the computation more scalable may require even more steps.)

A bolt can emit multiple streams: use OutputFieldsDeclarer.declareStream to define them, and choose the target stream with OutputCollector.emit.

The main method of a bolt is execute, which takes a tuple as input. A bolt emits tuples through an OutputCollector, and it must call the OutputCollector's ack method for every tuple it processes in order to notify Storm that the tuple has been handled and, in turn, the spout that emitted it. The typical flow is: process an input tuple, emit zero or more new tuples, then call ack. Storm also provides an IBasicBolt interface that calls ack automatically.
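
As a sketch of the IBasicBolt shortcut mentioned above, the hypothetical SplitBolt below extends BaseBasicBolt, so anchoring and acking of the input tuple happen automatically once execute returns:

    import backtype.storm.topology.BasicOutputCollector;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseBasicBolt;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Tuple;
    import backtype.storm.tuple.Values;

    // Hypothetical bolt: splits a sentence into words, one output tuple per word.
    public class SplitBolt extends BaseBasicBolt {
        public void execute(Tuple input, BasicOutputCollector collector) {
            for (String word : input.getString(0).split(" ")) {
                collector.emit(new Values(word));
            }
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }

With a plain IRichBolt you instead receive an OutputCollector in prepare and call ack yourself; that pattern is sketched in the Reliability section below.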

5 、 Stream groupings

One of the steps in defining a topology is specifying which streams each bolt receives as input. A stream grouping defines how a stream's tuples are distributed among the tasks of the bolt that subscribes to it.

There are seven types of stream grouping in Storm:

Shuffle Grouping: random grouping. Tuples in the stream are distributed randomly so that each task of the bolt receives roughly the same number of tuples.
Fields Grouping: grouping by field. For example, when grouping by userid, tuples with the same userid always go to the same task of the bolt, while tuples with different userids may go to different tasks.
All Grouping: broadcast. Every task of the bolt receives a copy of every tuple.
Global Grouping: the entire stream goes to a single task of the bolt, specifically the task with the lowest id.
Non Grouping: no grouping. The stream does not care which task receives its tuples. Currently this behaves the same as Shuffle Grouping, except that Storm will, where possible, run the bolt in the same thread as the component it subscribes to.
Direct Grouping: a special grouping in which the sender of a tuple decides which task of the receiving bolt processes it. It can only be declared on streams that have been declared as direct streams, and such tuples must be emitted with the emitDirect method. A component can obtain the task ids of its consumers through TopologyContext (the OutputCollector.emit method also returns the task ids that received the tuple).
Local or shuffle grouping: if the target bolt has one or more tasks in the same worker process, tuples are shuffled only among those tasks; otherwise it behaves like a normal Shuffle Grouping.
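
A sketch of how several of these groupings are declared when wiring a topology; the spout and bolts (SentenceSpout, SplitBolt, WordCountBolt, ReportBolt, SignalBolt) are hypothetical:

    // Inside the topology's main method.
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("sentences", new SentenceSpout());
    builder.setBolt("split", new SplitBolt(), 4)
           .shuffleGrouping("sentences");                    // random, roughly even distribution
    builder.setBolt("count", new WordCountBolt(), 4)
           .fieldsGrouping("split", new Fields("word"));     // same word always hits the same task
    builder.setBolt("report", new ReportBolt())
           .globalGrouping("count");                         // whole stream to the lowest-id task
    builder.setBolt("signals", new SignalBolt())
           .allGrouping("sentences");                        // broadcast to every task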

6 、 Reliability

Storm guarantees that every spout tuple is fully processed by the topology. It does this by tracking the tuple tree generated by each spout tuple (a bolt that processes a tuple may emit new tuples, forming a tree) and detecting when that tree has been completely processed. Every topology has a message timeout setting; if Storm cannot determine within that timeout that a tuple tree has been fully processed, it marks the tuple as failed and replays it later.

To take advantage of Storm's reliability features, you must tell Storm when you create a new tuple and when you finish processing one. Both are done through the OutputCollector: new tuples are reported with the emit method, and completed processing is reported with the ack method.
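
In a bolt that uses an OutputCollector directly, the pattern looks roughly like this (collector was saved in prepare, and transform is a hypothetical helper). Emitting with the input tuple as the first argument anchors the new tuple in the tuple tree, and ack or fail closes out the input:

    public void execute(Tuple input) {
        try {
            // Anchored emit: the new tuple becomes a child of `input` in the tuple tree.
            collector.emit(input, new Values(transform(input.getString(0))));
            collector.ack(input);    // report successful processing of the input tuple
        } catch (Exception e) {
            collector.fail(input);   // ask Storm to replay the tuple from its spout
        }
    }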

The reliability of Storm will be discussed in depth in Chapter 4.

7 、 Tasks

Each spout and bolt is executed as a number of tasks across the cluster. Each executor corresponds to a thread on which one or more tasks run, and stream groupings define how tuples are sent from one set of tasks to another. You set each component's parallelism (how many tasks it has) with the setSpout and setBolt methods of the TopologyBuilder class.
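
For example, a sketch of setting parallelism for the hypothetical SplitBolt: the third argument to setBolt is the parallelism hint (the number of executor threads), and setNumTasks controls how many task instances are spread across those threads:

    // Two executor threads running four task instances of the bolt (two tasks per thread).
    builder.setBolt("split", new SplitBolt(), 2).setNumTasks(4);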

8 、 Workers

A topology runs in one or more workers (worker processes). Each worker is a separate physical JVM that executes part of the topology. For example, if a topology has a combined parallelism of 300 and is run with 50 worker processes, each worker will handle about 6 of the tasks. Storm tries to spread the work evenly across all workers.
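
The number of worker processes is set on the topology configuration; a minimal sketch matching the example above:

    Config conf = new Config();
    conf.setNumWorkers(50);   // 50 worker JVMs; with a parallelism of 300, roughly 6 tasks land in each worker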

9 、 Configuration

Storm has a large number of parameters for adjusting the behavior of Nimbus, the Supervisors, and running topologies; some are system-level and some are topology-level. The defaults all live in defaults.yaml. You can override them by placing a storm.yaml on your classpath, and you can also set topology-specific configuration in code when submitting with StormSubmitter.
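
A sketch of topology-level configuration set in code at submission time; values passed here take precedence over storm.yaml, which in turn overrides defaults.yaml (the particular settings shown are illustrative):

    Config conf = new Config();
    conf.setDebug(true);                  // log every emitted tuple (topology-level)
    conf.setNumWorkers(4);
    conf.setMessageTimeoutSecs(30);       // the reliability timeout discussed above
    StormSubmitter.submitTopology("my-topology", conf, builder.createTopology());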
