This article explains the two ways of integrating Spark Streaming with Kafka. The content is concise and easy to follow, and I hope you gain something from the detailed introduction below.
Spark Streaming is a micro-batch streaming computation engine, usually using Spark Core, or Spark Core together with Spark SQL, to process data. In enterprise real-time processing architectures, the integration of Spark Streaming with Kafka is often one of the core links of the whole big-data processing pipeline.
Depending on the Spark and Kafka versions, there are two integration approaches: the receiver-based approach and the direct approach. Which Spark/Kafka version combinations support which approach is documented in the official Spark Streaming + Kafka integration guide.
Receiver-based Approach
The receiver-based approach is implemented with the Kafka consumer high-level API.
Each receiver stores the data it pulls from Kafka on Spark's executors: under the hood the data is written to the BlockManager, and a new block is generated every 200 ms by default (controlled by spark.streaming.blockInterval). Spark Streaming then builds a BlockRDD from these blocks for each submitted job, which ultimately runs as ordinary Spark Core tasks.
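As a concrete reference point, here is a minimal receiver-based sketch against the spark-streaming-kafka-0-8 API; the ZooKeeper address, group id, and topic name are placeholders for illustration:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object ReceiverBasedExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("receiver-based-example")
    val ssc = new StreamingContext(conf, Seconds(5)) // 5-second batches

    // Placeholder addresses and names; one consumer thread for the topic.
    // The map value is the number of threads inside this single receiver,
    // not the processing parallelism of Spark itself.
    val stream = KafkaUtils.createStream(
      ssc, "zk-host:2181", "example-group", Map("example-topic" -> 1))

    stream.map(_._2).count().print() // messages arrive as (key, value) pairs

    ssc.start()
    ssc.awaitTermination()
  }
}
```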
Regarding receiver mode, there are several points to note:
The receiver runs as a long-lived task scheduled on an executor and occupies one CPU core.
The number of receivers is determined by the number of calls to KafkaUtils.createStream; each call creates one receiver.
Kafka topic partitions do not map to the RDD partitions in Spark Streaming.
Increasing the number of topic-specific consumer threads in KafkaUtils.createStream() only increases the number of threads inside a single receiver; it does not increase Spark's parallelism in processing the data (in topicMap[topic, num_threads], the map value is the number of consuming threads for each topic).
By default the receiver generates a block every 200 ms; it is advisable to tune the block interval to the data volume.
Data received by a receiver is put into the BlockManager, and every executor has one BlockManager instance. Because of data locality, executors hosting receivers are scheduled to run more tasks, which can leave other executors idle.
Data locality can be tuned with spark.locality.wait. If this parameter is set to 10 (seconds) while a task takes only 2 s to process, more and more tasks will be scheduled onto the executors that already hold the data, making execution slow or even causing failures (to be distinguished from data skew).
Multiple Kafka input DStreams can be created with different groups and topics, so that multiple receivers receive and process data in parallel, as in the sketch below.
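A minimal sketch of the multi-receiver pattern, reusing the ssc from the previous example (the receiver count, topic, and addresses are illustrative):

```scala
// Create several receivers for the same topic and group, then union them
// so that downstream processing sees a single DStream.
val numReceivers = 3
val kafkaStreams = (1 to numReceivers).map { _ =>
  KafkaUtils.createStream(ssc, "zk-host:2181", "example-group",
    Map("example-topic" -> 1))
}
val unified = ssc.union(kafkaStreams)
unified.map(_._2).count().print()
```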
There are two types of receivers in terms of reliability:
Reliable receiver: sends an acknowledgement to the (reliable) data source only after the data has been received and stored in Spark with replication.
Unreliable receiver: does not send an acknowledgement of receipt to the data source. This suits data sources that do not support acknowledgements. You can, of course, also implement a custom receiver.
Regarding how the receiver handles data reliability: by default, the receiver may lose data.
You can enable the write-ahead log by setting spark.streaming.receiver.writeAheadLog.enable to true, so that data is first written to a reliable distributed file system such as HDFS. This guarantees that data is not lost, at the cost of some performance.
Limiting the maximum consumption rate involves three parameters:
spark.streaming.backpressure.enabled: false by default; set it to true to enable the backpressure mechanism;
spark.streaming.backpressure.initialRate: not set by default; the maximum rate at which each receiver receives data when the application first starts;
spark.streaming.receiver.maxRate: not set by default; the maximum rate (records per second) at which each receiver receives data. Each stream consumes at most this many records per second; setting it to 0 or a negative value leaves the rate unlimited.
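A configuration sketch pulling these knobs together (the values are illustrative, not recommendations; the write-ahead log also needs a checkpoint directory set on the StreamingContext):

```scala
val conf = new SparkConf()
  .setAppName("receiver-reliability-example")
  // Persist received data to a write-ahead log before acknowledging,
  // trading some throughput for durability; also call
  // ssc.checkpoint("hdfs://...") so the WAL has somewhere to live.
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")
  // Let Spark adapt the ingestion rate to the observed processing rate.
  .set("spark.streaming.backpressure.enabled", "true")
  // Cap the rate at first startup, before backpressure has feedback.
  .set("spark.streaming.backpressure.initialRate", "1000")
  // Hard upper bound per receiver, in records per second.
  .set("spark.streaming.receiver.maxRate", "2000")
```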
When a job is generated, all blocks falling within the current job's batch are grouped into one BlockRDD, with one block corresponding to one partition.
The consumer high-level API in Kafka 0.8.2 has the concept of consumer groups. It is recommended to keep the number of threads (i.e. consumers) in a consumer group equal to the number of Kafka partitions; if there are more consumers than partitions, some consumers will be idle.
Direct Approach
The direct approach integrates Spark Streaming with Kafka without a receiver and is the one generally used in enterprise production environments. Compared with the receiver-based approach, it has the following characteristics:
1. It does not use a receiver:
There is no need to create multiple Kafka streams and aggregate them.
It avoids the unnecessary CPU usage of a dedicated receiver.
It removes the whole round trip of the receiver fetching data, writing it to the BlockManager, and then reading it back by blockId at runtime (network transfer, disk reads, and so on), which improves efficiency.
No WAL is required, further reducing disk I/O.
2. The RDD generated in direct mode is a KafkaRDD whose number of partitions equals the number of Kafka partitions being consumed, which makes it easier to control the parallelism.
Note: this correspondence no longer holds for RDDs produced by a shuffle or repartition operation.
3. Offsets can be maintained manually to implement exactly-once semantics (see the sketch after this list).
4. Data locality. In KafkaRDD's compute function, a SimpleConsumer reads data from Kafka according to the specified topic, partition, and offset.
However, from the 0.10 integration onward, data locality becomes a consideration when Kafka and Spark run in the same cluster.
5. Limiting the maximum consumption rate:
spark.streaming.kafka.maxRatePerPartition: the maximum rate (records per second) at which data is read from each Kafka partition. Because this limit applies per partition, you need to know the number of Kafka partitions in advance to estimate the system's throughput.
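A minimal direct-approach sketch using the spark-streaming-kafka-0-10 API with manually committed offsets; the broker addresses, group id, topic name, and rate limit are placeholders for illustration:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

object DirectApproachExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("direct-approach-example")
      // Per-partition rate limit in records per second (illustrative).
      .set("spark.streaming.kafka.maxRatePerPartition", "1000")
    val ssc = new StreamingContext(conf, Seconds(5))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker-1:9092,broker-2:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "example-group",
      "auto.offset.reset" -> "latest",
      // Offsets are committed manually after each batch succeeds.
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      // PreferConsistent spreads partitions evenly over executors;
      // PreferBrokers exploits locality when brokers and executors
      // share the same hosts.
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](
        Seq("example-topic"), kafkaParams)
    )

    stream.foreachRDD { rdd =>
      // One RDD partition per Kafka partition (before any shuffle).
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd.map(_.value).count() // stand-in for the real processing
      // Commit only after the batch has been processed.
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Note that committing offsets after processing gives at-least-once delivery; true exactly-once additionally requires storing the offsets atomically together with the results, for example in a transactional sink.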
These are the two ways of integrating Spark Streaming with Kafka. I hope you have gained some useful knowledge or skills from this walkthrough.