2025-04-01 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/03 Report--
Apache Storm
In Storm, you design a graph structure for real-time computation called a topology. The topology is submitted to the cluster, where the master node distributes the code and assigns tasks to worker nodes for execution. A topology contains two kinds of components, spouts and bolts: a spout is the message source, responsible for emitting the data stream as tuples, while a bolt transforms those streams, performing computations or filtering, and can in turn emit data to other bolts. The tuples emitted by a spout are immutable arrays corresponding to fixed key-value pairs.
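The spout-to-bolt flow can be illustrated with a minimal sketch. This is plain Python, not the actual Storm API: the generator pipeline below only mimics how a spout emits tuples and how bolts transform and filter them.

```python
from typing import Iterator, Tuple

# Conceptual sketch only -- NOT the Storm API. A "spout" emits
# immutable tuples; "bolts" transform or filter the stream.

def sentence_spout() -> Iterator[Tuple[str]]:
    """Spout: emits a stream of one-field tuples."""
    for line in ["storm processes tuples", "bolts filter and transform"]:
        yield (line,)

def split_bolt(stream: Iterator[Tuple[str]]) -> Iterator[Tuple[str]]:
    """Bolt: splits each sentence tuple into word tuples."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)

def filter_bolt(stream: Iterator[Tuple[str]], min_len: int = 6) -> Iterator[Tuple[str]]:
    """Bolt: keeps only words of a minimum length."""
    for (word,) in stream:
        if len(word) >= min_len:
            yield (word,)

words = list(filter_bolt(split_bolt(sentence_spout())))
print(words)  # [('processes',), ('tuples',), ('filter',), ('transform',)]
```

In real Storm, each spout and bolt runs as parallel tasks across worker nodes; here they are simply chained generators on one machine.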
Apache Spark
Spark Streaming is an extension of the core Spark API. Rather than processing the data stream one record at a time as Storm does, it splits the stream into batches at fixed intervals and processes each batch as a job. Spark's abstraction for a continuous data stream is called a DStream (Discretized Stream); a DStream is a sequence of micro-batch RDDs (Resilient Distributed Datasets), and an RDD is a distributed dataset that can be operated on in parallel in two ways: arbitrary function transformations and sliding-window operations.
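The micro-batching idea can be sketched without Spark itself. The function below (not the Spark Streaming API, just an illustration) cuts an incoming stream into fixed-size batches and processes each batch as one small job, the way a DStream is processed as a sequence of RDDs.

```python
# Conceptual sketch -- NOT the Spark Streaming API. The stream is
# split into fixed-size batches, and each batch is processed as a
# single small job, analogous to a DStream of micro-batch RDDs.

def micro_batches(stream, batch_size):
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch

events = range(7)
# One aggregate job per batch, instead of per record:
results = [sum(b) for b in micro_batches(events, batch_size=3)]
print(results)  # [3, 12, 6]
```

Real Spark Streaming batches by time interval rather than by record count, but the processing model is the same: latency is bounded below by the batch interval, in exchange for the throughput of batch execution.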
Apache Samza
Samza processes each message individually as it arrives. Samza's stream unit is neither a tuple nor a DStream, but a message. Streams are divided into partitions, each of which is an ordered sequence of read-only messages, and each message has a unique ID (its offset). The system also supports batching, that is, processing several messages from the same stream partition in sequence. Samza's execution and streaming modules are both pluggable, although Samza by default relies on Hadoop's YARN (Yet Another Resource Negotiator) and Apache Kafka.
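Samza's one-message-at-a-time model over a partitioned log can be sketched as follows. This is not Samza's API; it only illustrates a consumer walking an ordered, read-only partition and committing the offset after each message so it can resume after a failure.

```python
# Conceptual sketch -- NOT the Samza API. A partition is an ordered,
# read-only sequence of messages; each message's position is its
# offset. The consumer processes one message at a time and commits
# the offset so processing can resume from there after a crash.

partition = ["user_login", "page_view", "user_logout"]  # ordered log
committed_offset = 0  # where to resume after a restart

def process(message):
    """Per-message handler (here, a trivial transformation)."""
    return message.upper()

outputs = []
for offset in range(committed_offset, len(partition)):
    outputs.append(process(partition[offset]))
    committed_offset = offset + 1  # commit only after success

print(outputs, committed_offset)
```

Because the offset is committed only after a message is processed, a crash causes the last message to be re-read, giving at-least-once semantics.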
What they have in common
All three of these real-time computing systems are open-source distributed systems with the advantages of low latency, scalability, and fault tolerance. Their common feature is that they let you run your data-flow code by assigning tasks to a set of fault-tolerant machines that run in parallel. In addition, they all provide simple APIs that hide the complexity of the underlying implementation.
The terms of the three frameworks are different, but the concepts they represent are very similar:
Storm: the stream unit is the tuple, emitted by spouts and processed by bolts.
Spark Streaming: the stream unit is the DStream, a sequence of micro-batch RDDs.
Samza: the stream unit is the message, read from a partitioned stream and identified by an offset.
The main differences can be summarized as follows:
Processing model: Storm and Samza process one record or message at a time, while Spark Streaming works on micro-batches.
Delivery semantics: Storm guarantees at-least-once (exactly-once is possible with Trident); Spark Streaming provides exactly-once; Samza guarantees at-least-once.
State management: Storm leaves state to the application (or to Trident); Spark Streaming writes state to a distributed file system such as HDFS; Samza uses an embedded key-value store.
Message delivery guarantees fall into three categories:
At-most-once: messages may be lost. This is usually the least desirable outcome.
At-least-once: messages may be delivered again (nothing is lost, but duplicates can occur). This is sufficient for many use cases.
Exactly-once: each message is delivered once and only once (no loss, no duplication). This is the ideal, although it is difficult to guarantee in all use cases.
Another aspect is state management, where the storage strategies differ: Spark Streaming writes state to a distributed file system (such as HDFS); Samza uses an embedded key-value store; and in Storm, state management is either left to the application level or handled through the higher-level Trident abstraction.
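The embedded key-value approach can be sketched in a few lines. This is an illustration only (a dict standing in for Samza's local store, which is actually a persistent on-disk store backed by a changelog): the task updates its state incrementally per message rather than writing full snapshots to a distributed file system.

```python
# Conceptual sketch of embedded key-value state -- a plain dict
# stands in for a task-local store. State is updated incrementally,
# one message at a time, rather than snapshotted to HDFS.

state = {}  # stands in for an embedded key-value store

def update_count(state, key):
    """Incrementally update per-key state for each incoming message."""
    state[key] = state.get(key, 0) + 1
    return state[key]

for page in ["home", "about", "home", "home"]:  # incoming page views
    update_count(state, page)

print(state)  # {'home': 3, 'about': 1}
```

Keeping state local to the task is what lets per-message updates stay cheap; durability then comes from replicating the updates, not from rewriting the whole state.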
Use case
All three frameworks are excellent and efficient at handling large volumes of continuous real-time data, so which one should you use? There are no hard-and-fast rules when choosing; at most, there are a few guidelines.
If you want a high-speed event-processing system that allows incremental computation, Storm is a good choice. If you also need to run distributed computations on demand while a client waits synchronously for the results, its out-of-the-box distributed RPC (DRPC) covers that. Last but not least: because Storm uses Apache Thrift, you can write topologies in any programming language. If you need persistent state and/or exactly-once delivery, look at the higher-level Trident API, which also provides micro-batching.
Companies that use Storm include Twitter, Yahoo, Spotify and The Weather Channel.
Speaking of micro-batching: if you need stateful computation and exactly-once delivery, and you don't mind higher latency, consider Spark Streaming, especially if you also plan graph operations, machine learning, or SQL access. The Apache Spark stack lets you combine several libraries with streaming (Spark SQL, MLlib, GraphX), which provides a convenient unified programming model. In particular, streaming algorithms (for example, streaming k-means) allow Spark to facilitate real-time decision making.
Companies that use Spark include Amazon, Yahoo, NASA JPL, eBay, and Baidu.
If you have a large amount of state to maintain, for example many gigabytes per partition, Samza is a good choice. Because Samza co-locates storage and processing on the same machine, it keeps processing efficient without the overhead of loading state into memory from elsewhere. The framework provides a flexible pluggable API: its default execution, messaging, and storage engines can each be replaced according to your needs. Moreover, if you have many stages of stream processing owned by different teams with different codebases, Samza's fine-grained jobs are particularly useful, since they can be added or removed with minimal impact.
Companies that use Samza include LinkedIn, Intuit, Metamarkets, Quantiply, and Fortscale.
Conclusion
This article has given only a brief overview of these three Apache frameworks and does not cover their many features and subtler differences. The comparison is also necessarily limited, since all three frameworks are constantly evolving, which is worth keeping in mind.