2025-01-16 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)05/31 Report--
This article mainly explains what points to pay attention to when using Storm. The content is simple and clear and easy to learn and understand. Please follow the editor's train of thought to study what to watch out for when using Storm.
My personal understanding: Storm is a distributed, real-time, streaming computation platform. Several of its characteristics can be seen in that name alone.
First, real-time. Put simply, data entering the system should be processed quickly, i.e. with low latency.
Second, what are the characteristics of a stream? Imagine standing on the bank of the Yangtze River: what do you feel? Awe? Mightiness? I haven't seen it myself. The stream as I understand it: ① it does not block; ② it has a direction, flowing only from high to low; ③ it flows without interference; ④ its direction can be changed flexibly (dig a channel, which corresponds to Storm's grouping mechanism); ⑤ it is uninterrupted.
Third, computation. The concept is actually quite broad; almost anything a computer does involves computation. Here I take it to mean non-blocking work that leans toward pure computation and does not exchange much with external resources such as the network or disk.
Fourth, distributed. Everyone has heard "distributed" until their ears grow calluses, and everyone has their own understanding. To me the more important indicators are scalability (linear or non-linear), resources being transparent to services and users, some load-balancing capability, and full use of resources (it seems many domestic Hadoop users run powerful servers, even though Hadoop came out to make use of outdated old machines); ideally, processing can also be out of order.
Fifth, the platform. This is easy to understand: it is a platform that provides the four main characteristics above. Services that fit this platform can then dance on it, and a good platform can change our destiny.
Reliability: a distributed system has many components, and when one has a problem, Storm can recover it automatically.
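Point ④ above mentions flexibly redirecting the flow via Storm's grouping mechanism. A minimal pure-JDK sketch of the idea behind a fields grouping (this is an illustration of the routing concept, not Storm's internal implementation; the class and method names are invented):

```java
// Sketch of fields-grouping routing: tuples carrying the same key always
// reach the same downstream task, so per-key state stays consistent.
// Illustration only -- not Storm's actual code.
public class FieldsGroupingSketch {
    // Pick a task index for a key; floorMod keeps the result non-negative.
    static int taskFor(String key, int numTasks) {
        return Math.floorMod(key.hashCode(), numTasks);
    }

    public static void main(String[] args) {
        int tasks = 4;
        // The same word is always routed to the same bolt task.
        System.out.println("\"storm\" -> task " + taskFor("storm", tasks));
        System.out.println("\"storm\" -> task " + taskFor("storm", tasks)); // same task again
    }
}
```

A shuffle grouping, by contrast, would pick the task round-robin or randomly, trading per-key locality for even load.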
Suitable scenarios
From the five points above, we can conclude that Storm emphasizes real-time processing, computation, distributed characteristics, linear scalability, and stream characteristics. Of course, with these features we ultimately still depend on the business logic. I once wrote an article about building a crawler on Storm, and I think a crawler is a good fit: a crawler usually fetches data from the network 24 hours a day, which forms a stream; it generally requires real-time behavior, i.e. getting the latest content into its pocket as soon as possible; the workload involves processing all kinds of data; when a bottleneck is hit it can scale linearly and smoothly without stopping the service; and, most importantly, the processing of crawler seeds is unrelated to one another, which fits the distributed characteristics perfectly. And so on.
What points should you pay attention to? (True knowledge comes from practice?)
If a worker is killed and it has no relationship with other workers, the other workers seem unaffected, and the killed worker is restarted by Storm. If it is related to other workers, those workers are also affected and their tasks restart (similar to the effect of a rebalance; it appears to be the same process as a rebalance).
The acker seems to follow a nearest-first principle: a spout's messages are tracked by the local acker first. Test results show that once an acker is killed, the messages it was tracking before will fail quickly after the acker is started again.
Ackers cannot be rebalanced.
We all know that nextTuple and ack/fail in Storm execute on a single thread. Testing shows that Storm first calls the spout's nextTuple() and then checks ack(mid) or fail(mid). Combined with Storm's pending mechanism, this means that under specific requirements you may run into deadlock-like problems. I will not go into detail here; leave me a message if necessary, since the problem only arose with our specific needs.
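The interaction described above can be pictured with a small pure-JDK simulation of the single-threaded spout executor loop, gated by a max-pending limit. This is a sketch of the mechanism only; the class, method names, and the cap value are invented, not Storm's code:

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Sketch of the single-threaded spout executor: nextTuple() and ack()
// run on the same thread, throttled by max.spout.pending. If downstream
// stops acking while pending is full, nextTuple() stops being called --
// the stall the text warns about. Illustration only, not Storm's code.
public class SpoutLoopSketch {
    static final int MAX_PENDING = 3;

    // Simulate `cycles` loop iterations where only the first `acksDelivered`
    // tuples are ever acked; returns how many tuples got emitted.
    static int runCycles(int cycles, int acksDelivered) {
        Queue<Integer> ackQueue = new ArrayDeque<>();
        int pending = 0, emitted = 0;
        for (int c = 0; c < cycles; c++) {
            if (pending < MAX_PENDING) {      // nextTuple() gated by pending
                emitted++;
                pending++;
                if (emitted <= acksDelivered) ackQueue.add(emitted);
            }
            Integer acked = ackQueue.poll();  // ack() handled on the SAME thread
            if (acked != null) pending--;
        }
        return emitted;
    }

    public static void main(String[] args) {
        // After the first 2 acks stop arriving, emission caps at 2 + MAX_PENDING.
        System.out.println("emitted = " + runCycles(10, 2)); // prints 5
    }
}
```

If whatever produces the acks is itself waiting on the spout thread, neither side can make progress, which is the deadlock-like situation mentioned above.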
Storm does not recommend creating new threads in a bolt or spout, but sometimes it is unavoidable for asynchronous processing. Testing shows that the threads Storm itself starts have relatively high priority. If you want your own thread to get scheduled fully, set its priority higher (I usually go straight to the maximum) and then use a locking mechanism. Note also that JVM thread priorities behave differently on Windows and Unix-like systems; you can read up on this.
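A minimal pure-JDK sketch of the advice above, i.e. raising the priority of a thread you create inside a bolt or spout (the class and method names are invented for illustration; whether the priority hint is honored depends on the OS and JVM, as the text notes):

```java
// Sketch: start an async worker thread at maximum priority so it is not
// starved by Storm's own (relatively high-priority) threads.
public class BoltWorkerThread {
    public static Thread startHighPriority(Runnable work) {
        Thread t = new Thread(work, "bolt-async-worker");
        t.setDaemon(true);                  // don't block worker shutdown
        t.setPriority(Thread.MAX_PRIORITY); // a hint only; the OS may ignore it
        t.start();
        return t;
    }

    public static void main(String[] args) throws InterruptedException {
        Thread t = startHighPriority(() -> System.out.println("async work running"));
        t.join();
        System.out.println("priority = " + t.getPriority()); // prints 10
    }
}
```

Any shared state between this thread and the bolt's execute() still needs the locking (or other synchronization) mentioned above.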
Because different machines may have different resources, while Storm's default scheduler is basically an even-distribution mechanism, serious data skew (uneven load) can result.
Storm starts your spout tasks and your bolt tasks; spouts on different machines do not wait for the bolts to finish starting before sending data. So if one of your bolts takes 1 minute to initialize while your message timeout is set to 30s, the messages sent before the bolt finishes starting will time out because the spout's timeout is relatively short, causing spout fail(mid) calls. Well, I admit I misunderstood this before: Storm executes the spout's open and each bolt's prepare, then activates the spouts in turn, and only then does the spout worker thread start executing nextTuple.
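One way to avoid the premature failures described above is simply to set the message timeout comfortably longer than the slowest bolt's initialization. A config fragment (it assumes storm-core on the classpath; on pre-1.0 Storm the package is backtype.storm rather than org.apache.storm):

```java
import org.apache.storm.Config;  // backtype.storm.Config on pre-1.0 Storm

// Make the tuple timeout longer than the slowest bolt's prepare() time,
// so tuples emitted during startup are not failed prematurely.
Config conf = new Config();
conf.setMessageTimeoutSecs(120); // comfortably above the ~60s initialization above
```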
Setting the topology's TOPOLOGY_RECEIVER_BUFFER_SIZE to 16 is faster; at least when our business runs at 10,000+ messages per second overall, 16 was faster than 8 or 32. Of course, this value may differ in other scenarios. (To emphasize: Storm has four buffers, which are awkward to understand and are not symmetrical, so it is recommended to read the official documentation or learn from an authoritative source.)
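For reference, a config fragment showing how such a value is set (it assumes storm-core on the classpath; the key is from older Storm releases, and later versions reworked these buffers, so check the documentation for your release as the text advises):

```java
import org.apache.storm.Config;  // backtype.storm.Config on pre-1.0 Storm

// Tune the receiver buffer benchmarked above; re-measure for your own load.
Config conf = new Config();
conf.put(Config.TOPOLOGY_RECEIVER_BUFFER_SIZE, 16); // "topology.receiver.buffer.size"
```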
That's all for now. If I remember anything else, I'll add it later.
A bolt processing error is not easily passed back to the spout. That is, when a tuple fails, the spout can only learn the id of the failed message, not the specific reason for the failure. I don't find this very flexible. Of course, this problem does not occur within a single worker; it only appears in a distributed environment.
When a bolt emits a tuple, it can be anchored to a list of input tuples, but the collector's ack or fail method only accepts a single tuple and cannot directly ack/fail a list, as shown in the following code:

    // Anchor the emitted tuple to every tuple in `list`, then ack each one.
    collector.emit(TopoUtil.StreamId.DEFAULT, list, new Values(sonSMessage));
    for (Tuple tuple : list) {
        collector.ack(tuple);
    }
Testing shows there is a problem with storm rebalance: the -n parameter can only rebalance down to fewer worker processes than the number submitted at the beginning, not up, e.g. from 2 to 3 or 4 or more. The -e parameter does not have this problem, but the number of executors (threads) cannot be greater than the number of tasks. Pay attention to this when using storm rebalance.
Thank you for reading. The above covers the points to pay attention to when using Storm. After studying this article, I believe you have a deeper understanding of them; specific usage still needs to be verified in practice. The editor will push more articles on related knowledge points for you; welcome to follow!
© 2024 shulou.com SLNews company. All rights reserved.