What are the entry-level knowledge points of storm

Updated 2025-07-06. From: SLTechnology News&Howtos (Shulou.com)

This article introduces the entry-level knowledge points of Storm: the background of real-time stream computation and Storm's main features.

1.1 Real-Time Stream Computation

Since the Internet was born, its biggest change to the world has been to let information be exchanged in real time, which greatly improves the efficiency of every link in the chain. Because of this demand for real-time response and interaction, databases (more precisely, relational databases) have probably been the fastest-growing and most profitable products in the software industry apart from personal operating systems. Ten years ago many banks could not even offer real-time inquiries, let alone real-time transfers; databases and high-speed networks changed that.

As the Internet developed further, from portals (information browsing) to search (information retrieval) to SNS (interactive propagation through relationships), and on to e-commerce and online travel and lifestyle products, the circulation links of everyday life moved online. The push for efficiency raised everyone's expectations of real-time behavior, and information exchange evolved from point-to-point into information chains and even information networks. This inevitably cross-correlates data across many dimensions, and a data explosion follows. Stream processing and NoSQL products emerged in response, addressing the real-time framework problem and the large-scale data storage and computation problem respectively.

As early as seven or eight years ago, universities such as UC Berkeley and Stanford began studying stream data processing. However, because the focus was on financial-industry business scenarios or Internet traffic monitoring, and because Internet data scenarios were limited at the time, most of that research built streaming on top of traditional database processing and paid little attention to the streaming framework itself. Such research has since faded, and industry effort has largely shifted to real-time databases.

Yahoo!'s open-sourcing of S4 in 2010 and Twitter's open-sourcing of Storm in 2011 changed that. Previously, Internet developers building a real-time application had to worry not only about the application's own computation logic but also about how data flowed, interacted, and was distributed in real time. Now the situation is very different. Taking Storm as an example, developers can quickly build a robust, easy-to-use real-time stream processing framework and, combined with SQL products, NoSQL products, or a MapReduce computing platform, build at low cost many real-time products that were hard to imagine before. For example, many products under the Quantum Hengdao brand of the Yitao Data Department are built on a real-time stream processing platform.

This tutorial is a basic introduction to Storm, but we hope it is more than a Storm manual. We will add our own experience from real data production and application architecture. The ultimate goal is to help every technical colleague who wants to use a real-time stream processing framework and, quietly, to change the world as well.

1.2 Storm Features

Storm is an open-source distributed real-time computing system that handles massive data streams simply and reliably. Storm has many usage scenarios: real-time analytics, online machine learning, continuous computing, distributed RPC, ETL, and more. Storm supports horizontal scaling, is highly fault-tolerant, guarantees that every message is processed, and is fast (in a small cluster, each node can reportedly process millions of messages per second). Storm is easy to deploy and operate and, more importantly, applications for it can be developed in any programming language.

Storm has the following characteristics:

Simple programming model

When it comes to big data processing, most people are familiar with Hadoop. Hadoop, based on Google's Map/Reduce, gives developers the map and reduce primitives, which make parallel batch processing programs simple and elegant. Storm likewise provides a few simple, elegant primitives for real-time big data computation, which greatly reduces the complexity of developing parallel real-time processing tasks and helps you build applications quickly and efficiently.
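
To make the idea of stream primitives concrete, here is a minimal single-process sketch in Python. It does not use Storm's API: the names `sentence_spout`, `split_bolt`, `count_bolt`, and `run_topology` are hypothetical and only illustrate a source of tuples (a spout) feeding processing steps (bolts).

```python
from collections import Counter
from typing import Iterable, Iterator


def sentence_spout() -> Iterator[str]:
    """Hypothetical spout: the source of the stream (here, a fixed list)."""
    yield from ["storm processes streams", "streams of tuples"]


def split_bolt(sentences: Iterable[str]) -> Iterator[str]:
    """Hypothetical bolt: consumes sentences and emits one word per tuple."""
    for sentence in sentences:
        yield from sentence.split()


def count_bolt(words: Iterable[str]) -> Counter:
    """Hypothetical bolt: keeps a running count per word."""
    counts: Counter = Counter()
    for word in words:
        counts[word] += 1
    return counts


def run_topology() -> Counter:
    """Wire spout -> split bolt -> count bolt, like a tiny topology."""
    return count_bolt(split_bolt(sentence_spout()))


if __name__ == "__main__":
    print(run_topology())
```

In a real cluster each step would run as many parallel tasks on different machines; the sketch only shows how little logic the developer has to write per step.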

Scalable

Three main entities actually run a topology in a Storm cluster: worker processes, threads, and tasks. Each machine in a Storm cluster can run multiple worker processes, each worker process can create multiple threads, and each thread can execute multiple tasks. Tasks are the entities that actually process data; the spouts and bolts we develop run as one or more tasks.

As a result, compute tasks are performed in parallel across multiple threads, processes, and servers, enabling flexible horizontal scaling.
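
The worker/thread/task hierarchy can be pictured with a small sketch. The round-robin placement below is an assumed, simplified policy chosen for illustration, not Storm's actual scheduler, and the helper names `assign` and `plan` are hypothetical.

```python
from typing import Dict, List


def assign(items: List[str], slots: List[str]) -> Dict[str, List[str]]:
    """Spread items over slots round-robin (an assumed, simplified policy)."""
    layout: Dict[str, List[str]] = {slot: [] for slot in slots}
    for index, item in enumerate(items):
        layout[slots[index % len(slots)]].append(item)
    return layout


def plan(num_workers: int, num_threads: int, num_tasks: int) -> Dict[str, Dict[str, List[str]]]:
    """Return {worker: {thread: [tasks]}} for one component."""
    workers = [f"worker-{i}" for i in range(num_workers)]
    threads = [f"thread-{i}" for i in range(num_threads)]
    tasks = [f"task-{i}" for i in range(num_tasks)]
    tasks_by_thread = assign(tasks, threads)
    threads_by_worker = assign(threads, workers)
    return {
        worker: {thread: tasks_by_thread[thread] for thread in worker_threads}
        for worker, worker_threads in threads_by_worker.items()
    }


if __name__ == "__main__":
    for worker, threads in plan(2, 4, 8).items():
        print(worker, threads)
```

Scaling out then amounts to raising one of the three numbers: more tasks per thread, more threads per worker, or more workers across machines.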

High reliability

Storm guarantees that every message sent by a spout is "fully processed," which directly distinguishes it from other real-time systems such as S4.

Note that a message sent by a spout may subsequently trigger thousands of other messages, which can be pictured as a message tree whose root is the message sent by the spout. Storm tracks the processing of this tree and considers the spout's message "fully processed" only when every message in the tree has been processed. If any message in the tree fails, or if the tree is not fully processed within a time limit, the spout's message is re-sent.

To keep memory consumption as low as possible, Storm does not track every message in the tree individually. Instead it tracks the message tree as a whole: it XORs together the unique ids of all messages in the tree and decides that the spout's message is "fully processed" when the result is zero. Each id enters the XOR once when the message is created and once when it is acknowledged, so the two cancel out. This saves a great deal of memory and simplifies the decision logic. The mechanism is described in detail later.
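
The XOR idea can be sketched in a few lines. This is a simplified illustration, not Storm's implementation: the class `AckTracker` and its methods are hypothetical, and the ids are passed in explicitly so the behavior is easy to follow.

```python
from typing import Dict, Iterable


class AckTracker:
    """Toy tracker: one integer of state per message tree, however large the tree."""

    def __init__(self) -> None:
        self._pending: Dict[int, int] = {}

    def start(self, root_id: int) -> None:
        """A spout emitted a new message; its id is the root of the tree."""
        self._pending[root_id] = root_id

    def ack(self, root_id: int, tuple_id: int, children: Iterable[int] = ()) -> bool:
        """Ack one message and register the messages it emitted, in one update.

        Returns True once the whole tree has been fully processed.
        """
        value = self._pending[root_id] ^ tuple_id
        for child_id in children:
            value ^= child_id
        if value == 0:
            del self._pending[root_id]
            return True
        self._pending[root_id] = value
        return False


if __name__ == "__main__":
    tracker = AckTracker()
    tracker.start(10)
    print(tracker.ack(10, 10, children=[3, 5]))  # root processed, two children emitted
    print(tracker.ack(10, 3))
    print(tracker.ack(10, 5))
```

Registering the children in the same update that acks their parent keeps the value from reaching zero while work is still outstanding. The sketch assumes ids are unique within a tree; with random 64-bit ids an accidental zero is extremely unlikely.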

In this mode, an ack or fail is sent along with every message, which consumes some network bandwidth. If your reliability requirements are not strict, you can turn this mode off by using a different emit interface.

As described above, Storm guarantees that each message is processed at least once. Some computations, however, strictly require that each message be processed exactly once. Fortunately, Storm 0.7.0 introduced transactional topologies, which solve this problem.
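
Why at-least-once delivery is not enough for counting, and how a transaction id helps, can be shown with a small sketch. It illustrates the general idea of storing the id of the last applied batch next to the state; it is not Storm's transactional topology API. The class `IdempotentCounter` is hypothetical, and it assumes batches are committed in increasing `txid` order.

```python
class IdempotentCounter:
    """Keeps a count plus the id of the last batch applied to it."""

    def __init__(self) -> None:
        self.count = 0
        self.last_txid = 0

    def apply(self, txid: int, batch_size: int) -> bool:
        """Apply a batch once; a replayed batch (same or older txid) is skipped."""
        if txid <= self.last_txid:
            return False
        self.count += batch_size
        self.last_txid = txid
        return True


if __name__ == "__main__":
    counter = IdempotentCounter()
    counter.apply(1, 3)
    counter.apply(1, 3)  # replay of batch 1 after a failure: ignored
    counter.apply(2, 2)
    print(counter.count)
```

Without the `txid` check, the replayed batch would be counted twice, which is exactly the at-least-once behavior described above.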

High fault tolerance

If an exception occurs during message processing, Storm reschedules the failed processing unit. Storm guarantees that a processing unit runs forever (unless you explicitly kill it).

Of course, if a processing unit stores intermediate state, it must restore that state itself when Storm restarts it.
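
One way a processing unit might recover its own intermediate state is to checkpoint it and reload it on startup. The sketch below is only an illustration under that assumption: the class `CheckpointedCounter`, the JSON file format, and the local file path are all hypothetical, and a production system would more likely use an external store.

```python
import json
import os
import tempfile
from typing import Dict


class CheckpointedCounter:
    """Hypothetical processing unit that persists its intermediate state."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.counts: Dict[str, int] = self._load()

    def _load(self) -> Dict[str, int]:
        """Restore state from the last checkpoint, or start empty."""
        if os.path.exists(self.path):
            with open(self.path, "r", encoding="utf-8") as handle:
                return json.load(handle)
        return {}

    def process(self, word: str) -> None:
        self.counts[word] = self.counts.get(word, 0) + 1

    def checkpoint(self) -> None:
        """Write to a temp file, then rename, so a crash never leaves a partial file."""
        tmp_path = self.path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as handle:
            json.dump(self.counts, handle)
        os.replace(tmp_path, self.path)


def demo() -> Dict[str, int]:
    """Simulate a restart: the second instance sees only checkpointed state."""
    with tempfile.TemporaryDirectory() as directory:
        path = os.path.join(directory, "state.json")
        first = CheckpointedCounter(path)
        first.process("x")
        first.process("x")
        first.checkpoint()
        first.process("y")  # never checkpointed, so lost on restart
        restarted = CheckpointedCounter(path)
        return restarted.counts


if __name__ == "__main__":
    print(demo())
```

Anything processed after the last checkpoint is lost on restart, so in practice the checkpoint interval has to be weighed against how much reprocessing is acceptable.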

Support for multiple programming languages

Besides implementing spouts and bolts in Java, you can do the job in any programming language you are familiar with, thanks to Storm's multi-language protocol. The multi-language protocol is a special protocol inside Storm that lets spouts and bolts pass messages over standard input and standard output, either as a single line of text or as multiple lines of JSON.

Storm supports multi-language programming primarily through the ShellBolt, ShellSpout, and ShellProcess classes, which implement the IBolt and ISpout interfaces, together with a protocol for running scripts or programs from a shell through Java's ProcessBuilder class.

As you can see, with this approach every tuple must be JSON-encoded and decoded during processing, which has a significant impact on throughput.
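
The following sketch shows what a bolt written in another language roughly has to do: read JSON messages from standard input and write JSON commands to standard output. The framing (each message followed by a line containing `end`) and the field names (`id`, `tuple`, `command`, `anchors`) are assumptions to verify against the official multi-language protocol documentation. The initial handshake is omitted, and in-memory streams stand in for the real stdin and stdout.

```python
import io
import json
from typing import Iterator, List, TextIO


def read_messages(stream: TextIO) -> Iterator[dict]:
    """Yield JSON messages; each message is assumed to end with an 'end' line."""
    lines: List[str] = []
    for line in stream:
        line = line.rstrip("\n")
        if line == "end":
            yield json.loads("\n".join(lines))
            lines = []
        else:
            lines.append(line)


def write_message(stream: TextIO, message: dict) -> None:
    """Encode one message as JSON followed by the assumed 'end' delimiter."""
    stream.write(json.dumps(message) + "\nend\n")


def split_sentence_bolt(stdin: TextIO, stdout: TextIO) -> None:
    """Hypothetical bolt: emit one tuple per word, then ack the input tuple."""
    for tup in read_messages(stdin):
        for word in tup["tuple"][0].split():
            write_message(stdout, {"command": "emit", "anchors": [tup["id"]], "tuple": [word]})
        write_message(stdout, {"command": "ack", "id": tup["id"]})


def demo() -> List[dict]:
    """Feed one tuple through the bolt and return the commands it wrote."""
    stdin = io.StringIO()
    write_message(stdin, {"id": "t1", "comp": "spout", "stream": "default",
                          "task": 1, "tuple": ["hello storm"]})
    stdin.seek(0)
    stdout = io.StringIO()
    split_sentence_bolt(stdin, stdout)
    stdout.seek(0)
    return list(read_messages(stdout))


if __name__ == "__main__":
    for command in demo():
        print(command)
```

Every tuple in and every command out passes through `json.dumps` and `json.loads`, which is where the throughput cost mentioned above comes from.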

Local mode support

Storm has a "local mode" that simulates all the functionality of a Storm cluster within a single process. Running a topology in local mode is similar to running it on a cluster, which is very useful for development and testing.

Efficient

Storm uses ZeroMQ as its underlying message queue, which ensures that messages are processed quickly.
