The way to optimize Spark Streaming-from Receiver to Direct Mode

2025-01-17 Update From: SLTechnology News&Howtos


Author: a senior research and development engineer at Getui

1 Business background

With the rapid development of big data, business scenarios are becoming increasingly complex, and the offline batch-processing framework MapReduce can no longer meet business needs: a large number of scenarios require real-time processing of data for analysis and decision-making. Spark Streaming is a distributed real-time computing framework for big data. It provides dynamic, high-throughput, fault-tolerant stream processing, which can be used not only for user behavior analysis but also in finance, public opinion analysis, network monitoring, and so on. Getui's message-push developer service uses Spark Streaming: based on big data analysis of crowd attributes, combined with LBS geofencing, it triggers precise message pushes in real time to achieve refined user operations. In addition, when using Spark Streaming to process Kafka data in real time, we switched from Receiver mode to Direct mode to optimize resource usage and improve program stability.

This article starts from the two modes in which Spark Streaming consumes Kafka data and, combined with Getui's practice, walks you through the principles and characteristics of Receiver and Direct mode, as well as the optimizations involved in moving from Receiver mode to Direct mode.

2 The principles and differences of the two modes

Receiver mode

1. Operating architecture in Receiver mode

1) InputDStream: the input stream of data received from the streaming data source.

2) Receiver: responsible for receiving the data stream and writing the data locally.

3) StreamingContext: represents Spark Streaming; responsible for task scheduling at the streaming level, it generates jobs and submits them to the Spark engine for processing.

4) SparkContext: represents Spark Core; responsible for task scheduling at the batch level, it is the Spark engine that actually executes the jobs.

2. The process by which Receiver pulls data from Kafka

In this mode:

1) On the executor, the receiver receives data from Kafka and stores it in Spark executor memory. After each batch interval, a job is triggered to process the received data. One receiver occupies one core.

2) To avoid data loss, the WAL (write-ahead log) mechanism must be enabled, which writes a backup of the data received by the receiver to a third-party system (such as HDFS).

3) Internally, the receiver uses the Kafka High Level API to consume data and updates offsets automatically.
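As a sketch, enabling the WAL mechanism from point 2 comes down to one standard configuration key, plus a checkpoint directory for the logs to live in (the app name and checkpoint path below are hypothetical placeholders):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Minimal sketch, assuming Spark Streaming's standard configuration keys;
// app name and checkpoint path are hypothetical placeholders.
val conf = new SparkConf()
  .setAppName("receiver-wal-demo")
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")
val ssc = new StreamingContext(conf, Seconds(10))
// WAL data is written under the checkpoint directory, so one must be set
ssc.checkpoint("hdfs:///tmp/streaming-checkpoint")
```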

Direct mode

1. Operating architecture in Direct mode

The architecture is similar to Receiver mode; the difference is that there is no receiver component in the executor, and data is pulled from Kafka in a different way.

2. The process by which Direct mode pulls data from Kafka

In this mode:

1) There is no receiver, so no extra core is needed to continuously receive data. Instead, the latest offset of each Kafka partition is queried periodically, and each batch pulls and processes the data in the range between the last processed offset and the currently queried offset.

2) To avoid data loss, there is no need to persist a backup of the data; you only need to save the offsets manually.

3) Internally, the Kafka Simple (low-level) API is used to consume data. Offsets must be maintained manually; Kafka will not automatically update them in ZooKeeper.
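The per-batch range computation in point 1 can be sketched in plain Scala. Note that OffsetRange and nextRanges here are simplified stand-ins for illustration, not the real spark-streaming-kafka classes:

```scala
// Simplified sketch of how Direct mode derives each batch's work:
// for every partition, the batch covers [last processed offset, latest queried offset).
case class OffsetRange(partition: Int, fromOffset: Long, untilOffset: Long)

def nextRanges(lastProcessed: Map[Int, Long], latest: Map[Int, Long]): Seq[OffsetRange] =
  latest.toSeq.sortBy(_._1).map { case (p, until) =>
    // a partition never seen before starts from offset 0
    OffsetRange(p, lastProcessed.getOrElse(p, 0L), until)
  }
```

After a batch succeeds, the `untilOffset` of each range becomes the new "last processed" offset, which is why saving offsets manually is enough to avoid data loss.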

The difference between Receiver and Direct mode

1. The former has a Receiver in the executor to receive data, and each Receiver occupies one core; the latter has no Receiver and therefore does not occupy an extra core.

2. In the former, the number of partitions of the InputDStream is num_receiver * batchInterval / blockInterval; in the latter, the number of partitions equals the number of partitions of the Kafka topic. In Receiver mode, an unreasonable num_receiver setting affects performance or wastes resources: if it is set too small, parallelism is insufficient and receiving data becomes the bottleneck of the whole pipeline; if it is set too large, resources are wasted.
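For example, with Spark's default blockInterval of 200 ms, a 10-second batch interval, and 2 receivers (illustrative values), the formula above gives:

```scala
// Partition count of a Receiver-mode InputDStream, per the formula in the text:
// num_receiver * batchInterval / blockInterval
val numReceivers = 2
val batchIntervalMs = 10000L // 10 s batch interval
val blockIntervalMs = 200L   // Spark's default block interval
val numPartitions = numReceivers * (batchIntervalMs / blockIntervalMs) // 2 * 50 = 100
```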

3. The former uses ZooKeeper to maintain consumer offsets, while the latter requires you to maintain offsets yourself.

4. To ensure that data is not lost, the former must enable the WAL mechanism, while the latter does not need WAL; it only needs to commit the offset after the data has been successfully processed.

3 Transforming Receiver mode into Direct mode

We use Spark Streaming to process Kafka data in real time, and previously used Receiver mode.

Receiver mode has the following characteristics:

1. In Receiver mode, each receiver occupies a separate core.

2. To avoid data loss, the WAL mechanism must be enabled and checkpoint used to save state.

3. When the rate at which the receiver accepts data is higher than the processing rate, a data backlog builds up and may eventually cause the program to crash.

Due to the above characteristics, Receiver mode wastes resources; if checkpoint is used to save state, the checkpoint becomes unusable once the program is upgraded; and because of point 3, the program is unstable in Receiver mode. If the number of receivers is set improperly, it also creates a performance bottleneck. Therefore, to optimize resource usage and improve program stability, Receiver mode should be transformed into Direct mode.

The modification method is as follows:

1. Modify the creation of InputDStream

Change the Receiver version:

val kafkaStream = KafkaUtils.createStream(streamingContext, [ZK quorum], [consumer group id], [per-topic number of Kafka partitions to consume])

to the Direct version:

val directKafkaStream = KafkaUtils.createDirectStream[[key class], [value class], [key decoder class], [value decoder class]](streamingContext, [map of Kafka parameters], [set of topics to consume])
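Filled in with hypothetical broker and topic names, and assuming the spark-streaming-kafka 0.8 integration with string keys and values, the Direct version might look like:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// Sketch only: the broker list and topic name are hypothetical placeholders,
// and streamingContext is assumed to exist as in the template above.
val kafkaParams = Map[String, String]("metadata.broker.list" -> "broker1:9092,broker2:9092")
val topics = Set("events")
val directKafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  streamingContext, kafkaParams, topics)
```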

2. Maintain offsets manually

Receiver mode code:

(Receiver mode does not require manual offset maintenance; offsets are committed internally to Kafka/ZooKeeper via the Kafka consumer High Level API.)

kafkaStream.map { ... }.foreachRDD { rdd =>
  // data processing
  doCompute(rdd)
}

Direct mode code:

directKafkaStream.map { ... }.foreachRDD { rdd =>
  // get the offset ranges
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // data processing
  doCompute(rdd)
  // save the offsets
  commitOffsets(offsetRanges)
}

4 Other optimization points

1. In Receiver mode:

1) Split the InputDStream and add Receivers to increase the parallelism of data reception.

2) Reduce the blockInterval appropriately to increase the number of tasks and thus the degree of parallelism (when the number of cores > the number of tasks).

3) If the WAL mechanism is enabled, set the storage level of the data to MEMORY_AND_DISK_SER.
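Points 2 and 3 are both configuration-level changes; a sketch with illustrative values (the 100 ms figure is an assumed example, not a recommendation):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel

// 2) Lower blockInterval (default 200ms) to get more tasks per batch
val conf = new SparkConf().set("spark.streaming.blockInterval", "100ms")

// 3) With WAL enabled, pass MEMORY_AND_DISK_SER as the storage level
//    when creating the stream, e.g. as the last argument of KafkaUtils.createStream
val storageLevel = StorageLevel.MEMORY_AND_DISK_SER
```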

2. Use Kryo for data serialization; it is faster than Java serialization and produces smaller serialized data.

3. It is recommended to use the CMS garbage collector to reduce GC overhead.
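Points 2 and 3 can likewise be expressed as Spark configuration (a sketch using the standard keys):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // 2) Kryo serialization: faster and more compact than Java serialization
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // 3) CMS garbage collector on the executors to reduce GC pause overhead
  .set("spark.executor.extraJavaOptions", "-XX:+UseConcMarkSweepGC")
```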

4. Prefer high-performance operators (mapPartitions, foreachPartition, aggregateByKey, etc.)
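As one example of why these operators help, here is a pure-Scala sketch of the pattern behind mapPartitions (initConnection and process are hypothetical stand-ins): expensive setup runs once per partition's iterator rather than once per record. In Spark, a function of this shape would be passed to rdd.mapPartitions.

```scala
// Hypothetical helpers standing in for expensive setup and per-record work
def initConnection(): String = "conn"
def process(conn: String, x: Int): Int = x * 2

// The mapPartitions pattern: setup is paid once per partition, not per element
def mapPartition(records: Iterator[Int]): Iterator[Int] = {
  val conn = initConnection()
  records.map(x => process(conn, x))
}
```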

5. Use of repartition: in streaming programs, the amount of data per batch is generally small because the batch time is very short, so a repartition is fast; it can mitigate data skew caused by uneven data distribution across topic partitions.

6. Because the jobs produced by Spark Streaming ultimately run on Spark Core, Spark Core tuning is also very important.

7. Backpressure flow control

1) Why is Backpressure introduced?

When the situation of batch processing time > batch interval persists for too long, data accumulates in memory, which can lead to problems such as memory overflow on the executor where the receiver resides.

2) Backpressure: dynamically adjusts the data reception rate according to job execution information fed back by the JobScheduler.

3) Configuration:

spark.streaming.backpressure.enabled: whether to enable Spark Streaming's internal backpressure mechanism. Default value: false (disabled).

spark.streaming.backpressure.initialRate: the initial rate at which each receiver receives data for the first batch (Receiver mode only).

spark.streaming.receiver.maxRate: the maximum rate at which a receiver receives data; if set, the receive rate will not exceed this value even when backpressure raises the rate.
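A configuration sketch combining these keys (the numeric rates are illustrative assumptions, not recommendations; spark.streaming.kafka.maxRatePerPartition is the Direct-mode counterpart of receiver.maxRate):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.streaming.backpressure.enabled", "true")     // turn backpressure on (default: false)
  .set("spark.streaming.backpressure.initialRate", "1000") // first-batch receive rate, Receiver mode
  .set("spark.streaming.receiver.maxRate", "2000")         // hard cap on receiver rate (records/sec)
  .set("spark.streaming.kafka.maxRatePerPartition", "500") // hard cap per Kafka partition, Direct mode
```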
