Supplementary notes on Spark Streaming's integration with Kafka

2025-02-24 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/03 Report--

(1) Comparison of the two ways of integrating Kafka with Spark Streaming

Analysis of the advantages and disadvantages of the Direct approach:

Advantages:
- Simplified parallelism: there is no need to create and union multiple input streams; the partitions of the Kafka topic map one-to-one to the partitions of the RDD.
- Efficiency: to guarantee zero data loss, the Receiver-based approach must set spark.streaming.receiver.writeAheadLog.enable=true, which stores a second copy of every record, wasting storage and hurting efficiency. The Direct approach has no such requirement.
- Exactly-once semantics: in the Receiver-based approach, data is consumed through Kafka's high-level API while the offsets are saved in ZooKeeper. With suitable configuration this gives at-least-once delivery, in which case records can be consumed repeatedly after a failure. The Direct approach manages offsets itself, so exactly-once semantics become achievable.
- Lower resource usage: Direct needs no Receivers, so all of its requested Executors take part in the computation, whereas the Receiver-based approach dedicates Executors to Receivers that read Kafka data but do not participate in computation. With the same resource request, Direct can therefore support a larger workload.
- Lower memory: a Receiver runs asynchronously from the other Executors and receives data continuously. For a workload that is normally small, a traffic burst forces you to enlarge the Receiver's memory even though the computing Executors do not need that much. Direct has no Receiver; it reads data at computation time and processes it directly, so its memory requirement is much lower.

Disadvantages:
- Higher development cost: Direct requires users to maintain offsets themselves with checkpoints or third-party storage, unlike the Receiver-based approach, where ZooKeeper maintains the offsets. This increases user development cost.
- Monitoring and visualization: with the Receiver-based approach, a consumer group's consumption of a topic can be monitored through ZooKeeper, while Direct lacks this convenience and does not automatically save offsets to ZooKeeper. If monitoring and visualization are needed, they must be developed by hand.
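As noted above, a Direct-mode job must persist offsets itself, and the usual pattern is to commit the offset to external storage only after a batch has been processed. The following is an illustrative pure-Python sketch, not Spark or Kafka API: the `batches` list stands in for a partition's records and the `store` dict for an external offset store. It shows why commit-after-processing yields at-least-once delivery.

```python
def process_stream(batches, store, process):
    """Process records from the last committed offset onward,
    committing the new offset only AFTER each record is processed.
    If the job crashes between processing and committing, that
    record is replayed on restart (at-least-once semantics)."""
    start = store.get("offset", 0)          # resume from committed offset
    for i in range(start, len(batches)):
        process(batches[i])                 # side-effecting work first
        store["offset"] = i + 1             # then commit the offset
```

A crash after `process` but before the commit leaves the offset pointing at the failed record, so a restarted job sees it again; committing before processing would instead risk losing it.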

Analysis of the advantages and disadvantages of the Receiver-based approach:

Advantages:
- Focus on computing: Kafka's high-level API lets users focus on the data they read without tracking or maintaining consumer offsets, which reduces the user's workload and code volume and is comparatively simple.

Disadvantages:
- Guarding against data loss: you must enable checkpointing and set spark.streaming.receiver.writeAheadLog.enable=true, so the logs of each batch are backed up to the checkpoint directory before processing. This lowers data-processing efficiency and in turn increases pressure on the Receiver side. The backup mechanism is also load-sensitive: under high load, delays build up and can crash the application.
- Single-Receiver memory: since the Receiver is also part of an Executor, raising throughput means enlarging the Receiver's memory.
- Repeated consumption: when the program recovers from a failure, some data may already have been written out while the offset was not yet updated at the moment of the crash, so those records are consumed again.
- Crash risk: the Receiver and the computing Executors run asynchronously. If computation falls behind because of network or other factors, the processing queue keeps growing while the Receiver keeps receiving data, which can easily crash the program.

(2) Management of the Kafka consumer offsets

- Spark's built-in checkpoint: enabling Spark Streaming checkpointing is the easiest way to store offsets. The streaming checkpoint saves the state of the user application, but the checkpoint directory cannot be shared or restored across applications, so checkpoints are generally not used to manage offsets.
- ZooKeeper: if no offset is saved in ZooKeeper, start from the latest or oldest offset according to the kafkaParam configuration; if ZooKeeper does hold a saved offset, use it as the starting position of the kafkaStream.
- HBase: rowkey design: topic name + groupId + streaming batchTime.milliSeconds.
- HDFS: possible, but not recommended, because it generates a large number of small files in HDFS, which makes HDFS performance drop sharply.

(3) HA of the Driver

Introduction: when the driver fails, it can restore the state of the current StreamingContext object by reading the metadata in the checkpoint directory; the abnormal exit of the driver process can also be detected and the driver restarted automatically.

The specific process: when the program runs for the first time and there is no data in the checkpoint, a StreamingContext object is created by the user-defined function. When the program exits abnormally, a StreamingContext object is restored from the metadata in the checkpoint, returning to the state before the abnormal exit. To achieve automatic restart on abnormal exit, the Spark Streaming application's driver must be monitored, its failure detected, and the driver restarted.

Requirements:

- When submitting the job with spark-submit, it must run in cluster mode on a Spark standalone cluster.

# Only Spark's standalone mode can be used here, so --master points at the Spark cluster.
# The --supervise flag must be present: it restarts the driver after an abnormal exit.
spark-submit \
  --class com.aura.mazh.spark.streaming.kafka.SparkStreamDemo_Direct \
  --master spark://hadoop02:7077,hadoop04:7077 \
  --driver-memory 512m \
  --total-executor-cores 3 \
  --executor-memory 512m \
  --supervise \
  --name SparkStreamDemo_Direct \
  --jars /home/hadoop/lib/kafka_2.11-0.8.2.1.jar,\
/home/hadoop/lib/metrics-core-2.2.0.jar,\
/home/hadoop/lib/spark-streaming_2.11-2.3.2.jar,\
/home/hadoop/lib/spark-streaming-kafka-0-8_2.11-2.3.2.jar,\
/home/hadoop/lib/zkclient-0.3.jar \
  /home/hadoop/original-spark-1.0-SNAPSHOT.jar \
  spark://hadoop02:7077,hadoop04:7077

- You need to add --supervise to achieve automatic restart on failure.

- The checkpoint directory needs to be configured and stored on HDFS, and the application jars should also be placed on HDFS.
