
How to set the concurrency of Storm

2025-07-13 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/02 Report--

This article explains how to set the concurrency (parallelism) of a Storm topology, together with the architecture and reliability concepts you need in order to choose sensible values.

Storm Architecture: master/slave

Master node: Nimbus

Responsible for Topology distribution, resource scheduling and monitoring on the cluster

Worker node: Supervisor

After receiving a task assignment, it starts one or more Worker processes to handle it; by default, a Supervisor can start up to 4 Workers

Worker process: Worker

A child process started by the Supervisor. It runs several Spout and Bolt threads (in practice, executor threads) that carry out the processing tasks of the Spout and Bolt components

Job: Topology (runs in an endless loop and does not end on its own)

Spout: the component that gets the data

Bolt: a component that processes data

Stream: the channel through which data flows between Spout and Bolt

Tuple:

1) the smallest unit of a Stream. Each piece of data that a Spout sends to a Bolt is called a Tuple.

2) all Tuples in the same Stream have the same type; Tuples in different Streams may have the same or different types

3) a Map in key-value form

Data flow Distribution Policy (Stream groupings):

They solve the problem of how data (Tuples) is routed between Spout and Bolt

1) shuffleGrouping:

Distributes the Tuples of a Stream randomly across the Bolt tasks

2) fieldsGrouping:

Takes the hash value of the grouping field modulo the number of Bolt tasks and sends the Tuple to the resulting task, so Tuples with the same field value always reach the same task. Workers run on the cluster nodes, each Bolt instance is a task, and the total number of Spout or Bolt instances across all nodes is called the concurrency.
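The hash-then-modulo routing used by fieldsGrouping can be sketched in a few lines. This is a minimal illustration, not Storm's actual implementation: the class `FieldsGroupingSketch` and the method `chooseTask` are hypothetical names, and a plain `hashCode()` stands in for whatever hash the framework applies to the grouping fields.

```java
import java.util.List;

// Hypothetical sketch of fields grouping: hash of the grouping value modulo the task count.
class FieldsGroupingSketch {

    // Returns the index of the Bolt task that should receive a tuple with this field value.
    // Math.floorMod keeps the result in [0, numTasks) even when hashCode() is negative.
    static int chooseTask(Object fieldValue, int numTasks) {
        return Math.floorMod(fieldValue.hashCode(), numTasks);
    }

    public static void main(String[] args) {
        int numTasks = 4; // assumed parallelism of the downstream Bolt
        for (String word : List.of("storm", "kafka", "storm", "redis")) {
            // Equal field values always map to the same task index.
            System.out.println(word + " -> task " + chooseTask(word, numTasks));
        }
    }
}
```

Because the target depends only on the field value and the task count, changing the number of Bolt tasks changes where each key lands.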

Storm concurrency setting:

1. Worker concurrency:

Set it first of all according to the size and physical layout of the cluster.

Generally, Workers are spread evenly across the nodes, typically one Worker per Supervisor.

2. Spout quantity setting:

As a rule, make the total number of Spout equal to the number of partitions of the corresponding Topic in Kafka (message middleware), which improves throughput.

Generally, one Worker runs one Spout.

3. Bolt1 quantity setting:

Set it first of all according to the volume of data and the time it takes to process each piece of data.

In general, the number of Bolt1 is twice the number of Spout (adjust this for the project).

4. Bolt2 quantity setting:

Again, set it according to the volume of data and the processing time. Because the intermediate results emitted by Bolt1 are already much smaller than the input, the number of Bolt2 can be reduced as appropriate.
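The rules of thumb above amount to simple arithmetic, sketched here as a starting point for sizing. The class `ParallelismPlan` and its factory method are hypothetical; the inputs (number of Supervisors and Kafka partitions) are assumptions, and halving Bolt1 to get Bolt2 is only one possible reading of "reduced as appropriate".

```java
// Hypothetical helper that turns the rules of thumb into starting values.
class ParallelismPlan {
    final int workers;
    final int spouts;
    final int bolt1;
    final int bolt2;

    ParallelismPlan(int workers, int spouts, int bolt1, int bolt2) {
        this.workers = workers;
        this.spouts = spouts;
        this.bolt1 = bolt1;
        this.bolt2 = bolt2;
    }

    static ParallelismPlan fromRulesOfThumb(int supervisors, int kafkaPartitions) {
        int workers = supervisors;          // one Worker per Supervisor
        int spouts = kafkaPartitions;       // one Spout per Kafka partition
        int bolt1 = 2 * spouts;             // Bolt1 is twice the number of Spout
        int bolt2 = Math.max(1, bolt1 / 2); // assumption: Bolt2 sees less data, so halve it
        return new ParallelismPlan(workers, spouts, bolt1, bolt2);
    }

    public static void main(String[] args) {
        ParallelismPlan p = fromRulesOfThumb(3, 6);
        System.out.println("workers=" + p.workers + " spouts=" + p.spouts
                + " bolt1=" + p.bolt1 + " bolt2=" + p.bolt2);
        // prints: workers=3 spouts=6 bolt1=12 bolt2=6
    }
}
```

In a real topology these numbers would be passed as the parallelism hints of `TopologyBuilder.setSpout` and `TopologyBuilder.setBolt`, and to `Config.setNumWorkers`, then tuned against measured throughput.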

Fault-tolerance mechanism: XOR

TupleId: whenever a new piece of data is produced, a tupleId is generated for it

The tupleIds produced during the whole process are XORed together, pair by pair, until the end

If the final result is 0, the data was fully processed; otherwise processing failed

MessageId: identifies the whole message and is specified by the programmer through the API, usually as a long (the API accepts any Object)

RootId: also identifies a message, but is used by the Storm framework itself

There are two cases in which processing the data fails:

execute() {

1. An exception is thrown (bad data)

2. The task times out, which is treated as a processing failure

}

How to deal with duplicate data caused by retransmission?

Ⅰ.

1. For example, when processing order information: after an order is processed successfully, store its ID in Redis (a set).

2. When a message arrives, check whether it has already been processed.

execute() {

if (the order ID is already in the Redis set) { skip the message }

else { process the message, then add its ID to the set }

}

Ⅱ.

Do nothing. This is acceptable where occasional duplicates do not matter much, for example:

Clickstream and daily log analysis: pv, uv

Metric analysis: number of orders, amount of orders
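The check-before-process idea can be sketched without any external service. In this minimal example an in-memory `HashSet` stands in for the Redis set, and the class `IdempotentOrderHandler` with its `handle` method is a hypothetical name chosen for illustration; a real Bolt would query Redis instead.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: skip orders whose ID has already been processed.
class IdempotentOrderHandler {
    // Stand-in for the Redis set of processed order IDs (assumption: single process, in memory).
    private final Set<String> processedIds = new HashSet<>();
    private int processedCount = 0;

    // Returns true if the order was processed now, false if it was a duplicate and skipped.
    boolean handle(String orderId) {
        if (processedIds.contains(orderId)) {
            return false;             // already handled: ignore the replayed message
        }
        processedCount++;             // placeholder for the real order processing
        processedIds.add(orderId);    // record the ID only after processing succeeded
        return true;
    }

    int processedCount() {
        return processedCount;
    }

    public static void main(String[] args) {
        IdempotentOrderHandler handler = new IdempotentOrderHandler();
        System.out.println(handler.handle("order-1")); // true
        System.out.println(handler.handle("order-1")); // false (replayed message)
        System.out.println(handler.processedCount());  // 1
    }
}
```

Recording the ID only after successful processing means a failed attempt can still be retried, at the cost of a small window in which a crash between the two steps leads to reprocessing.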

Message reliability guarantee and acker mechanism: open / nextTuple / ack / fail / close

Ⅰ. Spout class:

When emitting a tuple, the Spout provides a msgId that is later used to identify that tuple. Storm uses the msgId to track the tuple tree that grows from it until the tuple is fully processed. It then calls the ack() method, with that msgId, on the Spout that originally emitted the tuple, or calls fail() when a timeout is detected. Both calls are always made on the Spout task that originally created the tuple. When a Spout takes a message from a message queue (Kafka/RocketMQ), the message is not actually removed; it is kept in a pending state, waiting for processing to complete, and pending messages are not delivered to other consumers. When a message is taken, the queue gives the client the message body and a unique msgId, and when the Spout's ack() / fail() method is called, the Spout uses that id to ask the queue to remove the message or put it back.

Ⅱ. acker task:

To implement reliability efficiently, every Bolt must explicitly ack or fail the tuples it receives; this is what eventually triggers the ack() and fail() methods defined in the Spout. A Storm topology has some special tasks called "acker", which are responsible for tracking the DAG of tuples that grows from each tuple emitted by a Spout. When an acker finds that a DAG is complete, it sends a message to the Spout task that created the Spout tuple, telling that task to ack the message. The acker does not track the tuple tree explicitly; instead it keeps a table that maps the id of each Spout tuple to a pair of values. The first value is the id of the task that created the tuple, and the second is a 64-bit number (the ack val), which is the XOR of the tuple ids of all tuples created or acked in the tree.
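The ack val bookkeeping can be illustrated with a small simulation. This is a sketch under stated assumptions rather than Storm's acker code: the class `AckerSketch` and its `update` method are hypothetical, the table stores only the ack val (the creating task id is omitted), and the tuple ids are fixed constants where Storm would use random 64-bit values.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical simulation of the acker's XOR bookkeeping.
class AckerSketch {
    // Maps the id of a Spout tuple (root id) to its current ack val.
    private final Map<Long, Long> ackVals = new HashMap<>();

    // XORs the given value into the ack val of the tree.
    // Returns true when the ack val reaches 0, meaning every created tuple was also acked.
    boolean update(long rootId, long xorOfTupleIds) {
        long val = ackVals.getOrDefault(rootId, 0L) ^ xorOfTupleIds;
        if (val == 0L) {
            ackVals.remove(rootId); // tree complete: the Spout task can be told to ack
            return true;
        }
        ackVals.put(rootId, val);
        return false;
    }

    public static void main(String[] args) {
        AckerSketch acker = new AckerSketch();
        long root = 1L;
        long a = 0x1111L, b = 0x2222L, c = 0x4444L; // illustrative tuple ids

        System.out.println(acker.update(root, a));         // false: Spout emitted a
        System.out.println(acker.update(root, a ^ b ^ c)); // false: a acked, b and c created
        System.out.println(acker.update(root, b));         // false: b acked, c outstanding
        System.out.println(acker.update(root, c));         // true: c acked, ack val is 0
    }
}
```

Each tuple id enters the XOR once when the tuple is created and once when it is acked, so the two occurrences cancel and the ack val returns to 0 only when nothing is outstanding. This is why a single 64-bit number per Spout tuple is enough, regardless of the size of the tree.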

Ⅲ. Turning reliability off:

1. Set Config.TOPOLOGY_ACKERS to 0

2. Omit the message id in the SpoutOutputCollector.emit method to turn off tracking for that Spout tuple

3. Emit the tuple as an "unanchored" tuple, so that it is not part of any tuple tree

