An overview of real-time big data computation
1. Spark Streaming is, in essence, a framework that Spark provides for real-time big data computation. Its underlying layer is the Spark Core engine covered earlier: the basic computing model is still the in-memory big data model, and its core underlying component is the RDD that we use throughout Spark Core.
2. To match the characteristics of real-time computation, Spark Streaming adds a layer of encapsulation on top of the RDD, called the DStream. This kind of wrapping is easy to understand after studying Spark SQL, where we saw that Spark SQL likewise offers a new RDD-based abstraction, the DataFrame, for tasks such as data querying, while its underlying layer is still the RDD. The RDD is therefore the core of the entire Spark technology ecosystem.
Since there are many stream processing engines on the market, people often ask what unique advantages Spark Streaming has. The first point is that Apache Spark provides native support for both batch and streaming. This differs from other systems, whose engines either focus solely on stream processing, or handle only batch processing and merely expose stream-processing API interfaces that must be implemented externally. Spark handles both batch and stream processing with one execution engine and a unified programming model, and this is Spark Streaming's unique advantage over traditional streaming systems. It shows up in the following four important areas:
1. Fast recovery of state in the event of faults and stragglers.
2. Better load balancing and resource utilization.
3. Integration of static datasets with stream data, plus interactive queries.
4. Rich built-in libraries for advanced processing (SQL, machine learning, graph processing).

Stream processing architecture: past and present (1)
A distributed stream processing pipeline currently executes as follows:
1. Receive stream data from data sources (such as event logs, system telemetry data, IoT device data) and feed it into an ingestion system, such as Apache Kafka or Amazon Kinesis.
2. Process the data in parallel on a cluster. This is the key concern in designing a stream processing engine, and we discuss it in more detail below.
3. Store the output in downstream systems (such as HBase, Cassandra, or Kafka). A sketch of such a pipeline is shown below.
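As a concrete illustration, here is a minimal Spark Streaming sketch of such a pipeline, assuming a hypothetical Kafka topic named `events` on a local broker and the spark-streaming-kafka-0-10 integration; the downstream sink is simplified to console output rather than HBase or Cassandra.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

object PipelineSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setMaster("local[2]").setAppName("PipelineSketch"), Seconds(5))

    // Step 1: receive stream data from the ingestion system (hypothetical topic "events").
    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "localhost:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "pipeline-demo")
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("events"), kafkaParams))

    // Step 2: process the data in parallel on the cluster.
    val counts = stream.map(_.value)
      .flatMap(_.split(" "))
      .map((_, 1L))
      .reduceByKey(_ + _)

    // Step 3: store the results downstream (printed here for simplicity;
    // HBase or Cassandra would be written to inside foreachRDD).
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```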
To process these data, most traditional stream processing systems adopt a continuous operator model, which works as follows:
1. There is a set of worker nodes, each of which runs one or more continuous operators.
2. Each continuous operator processes one record of the stream at a time and forwards it to the next operators in the pipeline.
3. Source operators receive data from the ingestion system, and sink operators output results to downstream systems.
Stream processing architecture: past and present (2)
Continuous operators are a fairly simple and natural model. However, as data volumes keep growing and real-time analytics grow ever more complex in the big data era, this traditional architecture faces severe challenges. Spark Streaming was therefore designed to address the following requirements:
1. Fast failure recovery: the larger the data, the higher the probability of node failures and slow nodes (stragglers). If the system is to deliver results in real time, it must repair faults automatically. Unfortunately, in traditional stream processing systems, where continuous operators are statically assigned to worker nodes, completing this task quickly remains a challenge.
2. Load balancing: uneven load distribution among worker nodes in a continuous-operator system turns some nodes into performance bottlenecks. These problems are more common with large-scale data and dynamically changing workloads. To solve them, the system must be able to dynamically adjust resource allocation among nodes according to the workload.
3. Unified streaming, batch, and interactive processing: in many use cases it is necessary to interact with stream data (which, after all, streaming systems keep in memory) or to combine it with static datasets (such as a pre-computed model). This is hard to achieve in continuous-operator systems, which offer no ad-hoc query facility when new operators are added dynamically, greatly weakening the user's ability to interact with the system. We therefore need an engine that integrates batch processing, streaming, and interactive queries.
4. Advanced analytics (such as machine learning and SQL queries): some more complex tasks must constantly learn from and update data models, or use SQL to query the latest features in the stream data. These analysis tasks call for a common, integrated abstraction that makes developers' work easier.
To meet these requirements, Spark Streaming adopts a new architecture called discretized streams, which can directly use the rich libraries of the Spark engine and has an excellent fault-tolerance mechanism.
A brief introduction to Spark Streaming
1. Spark runs in a variety of flexible modes. Deployed on a single machine, it can run either in local mode or in pseudo-distributed mode. Deployed on a distributed cluster, there are many modes to choose from, depending on the cluster's actual situation: the underlying resource scheduling can rely on an external resource scheduling framework or on Spark's built-in Standalone mode. Among external frameworks, the current implementations include the relatively stable Mesos mode and the Hadoop YARN mode, which is still under continuous development and refinement.
2. Spark Streaming is an extension of the Spark Core API that can process large-scale, high-throughput, fault-tolerant real-time data streams. It supports reading from a variety of data sources, such as Kafka, Flume, Twitter, ZeroMQ, Kinesis, or TCP sockets, and can process the data with complex algorithms expressed through high-level functions such as map, reduce, join, and window. The processed data can be saved to file systems, databases, live dashboards, and other sinks.
Basic working principle of Spark Streaming
Spark Streaming receives the real-time input data stream and splits the data into batches. For example, each second of collected data is encapsulated into one batch; each batch is then handed to the Spark engine for processing. The final output is a result stream whose data likewise consists of batches.
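As a minimal sketch of this model, the classic word count below assumes a hypothetical TCP text source on localhost:9999 (fed, for example, by `nc -lk 9999`) and wraps each second of data into one batch:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object BatchWordCount {
  def main(args: Array[String]): Unit = {
    // Seconds(1): every second of collected data becomes one batch.
    val conf = new SparkConf().setMaster("local[2]").setAppName("BatchWordCount")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Hypothetical source: a TCP socket on localhost:9999.
    val lines = ssc.socketTextStream("localhost", 9999)

    // Each batch is handed to the Spark engine; the result is itself a stream of batches.
    val counts = lines.flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.print()

    ssc.start()            // start receiving and processing batches
    ssc.awaitTermination() // block until the job is stopped
  }
}
```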
Spark Streaming DStream
1. Spark Streaming provides a high-level abstraction called the DStream (Discretized Stream), which represents a continuous stream of data. A DStream can be created from an input data source, such as Kafka, Flume, ZMQ, or Kinesis, or by applying high-level functions such as map, reduce, join, and window to another DStream.
2. Internally, a DStream is a series of continuously produced RDDs. The RDD is Spark Core's central abstraction: an immutable, distributed dataset. Each RDD in a DStream contains the data of one time interval.
1. Operators applied to a DStream, such as map, are actually translated at the underlying level into operations on each RDD in the DStream. For example, performing a map on a DStream produces a new DStream; under the hood, the map is applied to the RDD of each time interval in the input DStream, and each resulting RDD becomes the RDD of the corresponding interval in the new DStream.
2. The underlying RDD transformations are carried out by the Spark Core engine. Spark Streaming wraps a layer around Spark Core that hides these details and provides developers with an easy-to-use high-level API, as the sketch below illustrates.
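The sketch below makes this translation concrete: a map applied to the DStream and an explicit per-RDD transform produce equivalent pipelines. The socket source is an illustrative assumption.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamTranslation {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setMaster("local[2]").setAppName("DStreamTranslation"), Seconds(1))
    val lines = ssc.socketTextStream("localhost", 9999) // hypothetical source

    // High-level form: map on the DStream itself.
    val upper1 = lines.map(_.toUpperCase)

    // Equivalent low-level form: transform exposes the RDD of each time interval,
    // showing that DStream operators become operations on each underlying RDD.
    val upper2 = lines.transform(rdd => rdd.map(_.toUpperCase))

    upper1.print()
    upper2.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```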
Comparative analysis with other streaming frameworks
Spark Streaming, Flink and Storm comparison

| Comparison point | Storm | Spark Streaming | Flink |
| --- | --- | --- | --- |
| Real-time computing model | Pure real-time: each record is processed as it arrives | Quasi-real-time: data collected over an interval forms an RDD, which is then processed | Streaming and batch, via DataStream and DataSet respectively |
| Latency | Millisecond level | Second level | Low |
| Throughput | Low | High | High |
| Transaction mechanism | Supported, complete | Supported, but imperfect | Supported, but imperfect |
| Robustness / fault tolerance | ZooKeeper, Acker; very strong | Checkpoint, WAL; average | Checkpoint; average |
| Dynamic adjustment of parallelism | Supported at run time | Not supported (though it supports streaming and offline processing together) | |
| Maturity | High | | |
| Streaming model | Native | Micro-batching | Native |
| API style | Compositional | Declarative | Compositional |

Compositional: more basic API operations that allow step-by-step fine-grained control, with individual components combined into a topology. Declarative: provides encapsulated high-level functions, so preliminary optimization can be applied behind the encapsulation, along with advanced operations such as window management and state management.

Advantage analysis of Spark Streaming, Flink and Storm
1. Spark Streaming is by no means simply better than Storm or Flink. Both of those frameworks are excellent in the field of real-time computation; they just excel in different niche scenarios.
2. Spark Streaming is better than Storm in throughput.
3. In terms of real-time latency, Storm is much better than Spark Streaming: the former is pure real-time while the latter is quasi-real-time. Moreover, Storm's transaction mechanism, robustness/fault tolerance, dynamic adjustment of parallelism, and other features are all superior to Spark Streaming's.
4. Spark Streaming, however, has one thing Storm can never match: it sits within the whole Spark technology stack, so it integrates seamlessly with Spark Core, Spark SQL, and Spark GraphX. In other words, the intermediate data produced by real-time processing can immediately and seamlessly feed delayed batch processing, interactive queries, and other operations in the same program. This characteristic greatly strengthens Spark Streaming's advantages and capabilities; a sketch of this integration follows below.
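A short sketch of that integration, using the common pattern of registering each micro-batch as a temporary view so it can be queried with Spark SQL; the socket source and the view name `words` are illustrative assumptions.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWithSql {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingWithSql")
    val ssc = new StreamingContext(conf, Seconds(1))
    val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))

    // Each micro-batch is an ordinary RDD, so Spark SQL can query it in the same program.
    words.foreachRDD { rdd =>
      val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
      import spark.implicits._
      rdd.toDF("word").createOrReplaceTempView("words")
      spark.sql("SELECT word, COUNT(*) AS cnt FROM words GROUP BY word").show()
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```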
Application scenarios of Spark Streaming, Flink and Storm
Storm
1. Storm is recommended for scenarios that require pure real-time processing and cannot tolerate a delay of more than one second, such as real-time computation systems for trading that demand pure real-time transaction processing and analysis.
2. If the real-time computation also requires a reliable transaction mechanism and reliability guarantees, i.e. the processing of the data must be exactly accurate, neither more nor less, Storm can also be considered, although Spark Streaming can likewise ensure that data is not lost.
3. If we need to dynamically adjust the parallelism of real-time programs between peak and off-peak periods to maximize the use of cluster resources (usually in small companies where cluster resources are tight), Storm can also be considered.
Spark Streaming
1. If none of the above three requirements applies, we can consider using Spark Streaming for real-time computation.
2. The most important factor in choosing Spark Streaming should be a macro-level consideration of the whole project: if, besides real-time computation, the project includes business functions such as offline batch processing, interactive queries, graph computation, and MLlib machine learning, and if the real-time computation may also involve high-latency batch processing, interactive queries, and similar functions, then the Spark ecosystem should be preferred. Use Spark Core to develop offline batch processing, Spark SQL for interactive queries, and Spark Streaming for real-time computation; the three integrate seamlessly and give the system very high scalability.
Flink
1. Supports high-throughput, low-latency, high-performance stream processing.
2. Supports window operations with event-time semantics.
3. Supports exactly-once semantics for stateful computation.
4. Supports highly flexible window operations: windows based on time, count, session, and data-driven triggers.
5. Supports a continuous-flow model with backpressure.
6. Supports fault tolerance based on lightweight distributed snapshots.
7. One runtime supports both batch-on-streaming processing and streaming processing.
8. Implements its own memory management inside the JVM.
9. Supports iterative computation.
10. Supports automatic optimization: avoids expensive operations such as shuffles and sorting in certain circumstances, and caches intermediate results when necessary.
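For comparison, here is a minimal Flink sketch written against the classic Scala DataStream API of older Flink releases; the socket source and the five-second window are illustrative assumptions, showing the keyed window operations mentioned above.

```scala
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

object FlinkWindowCount {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Hypothetical source: a TCP socket on localhost:9999.
    val counts = env.socketTextStream("localhost", 9999)
      .flatMap(_.split(" "))
      .map((_, 1))
      .keyBy(_._1)                 // key the stream by word
      .timeWindow(Time.seconds(5)) // tumbling 5-second window
      .sum(1)                      // sum the counts within each window

    counts.print()
    env.execute("FlinkWindowCount")
  }
}
```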