How to Use a Receiver-Based DStream
In this post the editor shares how to use a receiver-based DStream. Since many readers are not familiar with it, this article is offered for reference; I hope you learn a lot from reading it.
Points to note
1. Like a normal task, a receiver is scheduled onto an executor by the driver and occupies one CPU core. Unlike a normal task, a receiver is a long-running (resident) task.
2. The number of receivers is determined by the number of KafkaUtils.createStream calls: each call creates one receiver.
3. In val topicMap = Map("page_visits" -> 1), the map value is the number of consumer threads for that topic, not the number of partitions. (See the earlier post on the receiver-based Kafka high-level consumer API; a sketch follows below.)
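A minimal sketch of the receiver-based API; the ZooKeeper address, group id, and application name below are placeholder assumptions. Each createStream call starts one receiver, so this example pins two executor cores for receiving alone:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val conf = new SparkConf().setAppName("receiver-demo")
    val ssc = new StreamingContext(conf, Seconds(2))

    // the value 1 is the number of consumer threads inside each receiver,
    // not the number of Kafka partitions
    val topicMap = Map("page_visits" -> 1)

    // two createStream calls -> two receivers, each a resident task
    val streams = (1 to 2).map { _ =>
      KafkaUtils.createStream(ssc, "zk1:2181", "demo-group", topicMap)
    }
    val unified = ssc.union(streams)  // merge the receiver streams for processing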
4. The receiver groups incoming data into blocks. The block interval is set by spark.streaming.blockInterval, which defaults to 200ms; 50ms is the recommended minimum. Below that, performance suffers because task overhead dominates: at more than roughly 50 tasks per second, merely launching and distributing the tasks for execution becomes a burden.
Adjust the block interval to match the data volume, as in the sketch below.
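A sketch of tuning the block interval via SparkConf (the 100ms value is illustrative, chosen to stay above the recommended 50ms floor):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("receiver-demo")
      // one block every 100ms instead of the default 200ms; keep this >= 50ms
      .set("spark.streaming.blockInterval", "100ms")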
5. Blocks received by the receiver are put into the BlockManager (each executor holds one BlockManager instance). Because of data locality, the executor running the receiver is scheduled to run more tasks, leaving other executors idle. Remedies (sketched after this list):
a) Add executors.
b) Use repartition to increase the number of partitions.
c) Lower spark.locality.wait. With the default of 3s, if all tasks finish within that window, the scheduler keeps placing tasks on the executor that already holds the data, and the work across executors grows ever more unbalanced.
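A sketch of remedies (b) and (c); the values are illustrative, and "unified" refers to the stream from the earlier sketch:

    import org.apache.spark.SparkConf

    // (c) give up on data-local placement quickly so idle executors get tasks
    val conf = new SparkConf()
      .set("spark.locality.wait", "500ms")  // default is 3s

    // (b) spread the received blocks across more partitions before heavy processing
    val repartitioned = unified.repartition(12)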
6. The Kafka 0.8.2 high-level consumer API introduces the concept of consumer groups. Mind the relationship between the number of consumer threads in a group and the number of Kafka partitions: threads beyond the partition count receive no data.
7. The purpose of checkpointing is to recover state, such as that held by updateStateByKey, after a driver failure.
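A minimal sketch of driver-failure recovery through checkpointing (the checkpoint path and batch interval are assumptions):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "hdfs:///tmp/streaming-checkpoint"  // assumed path

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("receiver-demo")
      val ssc = new StreamingContext(conf, Seconds(2))
      ssc.checkpoint(checkpointDir)  // required for stateful ops like updateStateByKey
      // ... build the DStream graph here ...
      ssc
    }

    // on restart, getOrCreate rebuilds the context (and state) from the checkpoint
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)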
8. The WAL (write-ahead log) exists for fault recovery and gives at-least-once delivery. Two notes: first, with the WAL on HDFS-backed storage there is no need for extra in-memory replicas; second, if you can tolerate loss, leave the WAL off for efficiency. Enabling it only requires setting spark.streaming.receiver.writeAheadLog.enable to true (the default is false), as sketched below.
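A sketch of enabling the WAL and dropping the extra in-memory replica (the ZooKeeper address and group id are placeholders, "ssc" is the context from the earlier sketch, and a checkpoint directory on reliable storage is also required):

    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.kafka.KafkaUtils

    val conf = new SparkConf()
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")  // default: false

    // with the WAL on HDFS, in-memory replication is redundant, so use
    // MEMORY_AND_DISK_SER instead of the default MEMORY_AND_DISK_SER_2
    val stream = KafkaUtils.createStream(
      ssc, "zk1:2181", "demo-group", Map("page_visits" -> 1),
      StorageLevel.MEMORY_AND_DISK_SER)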
9. Limiting the maximum consumption rate
1. spark.streaming.backpressure.enabled
The default is false; set it to true to enable the backpressure mechanism.
2. spark.streaming.backpressure.initialRate
Not set by default. The initial rate: the maximum at which each receiver accepts data when it first starts.
3. spark.streaming.receiver.maxRate
Not set by default. The maximum rate (records per second) at which each receiver receives data; in effect, each stream consumes at most this many records per second. Setting it to 0 or a negative number imposes no limit. A combined sketch follows.
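A combined sketch of the three rate-related settings (the numeric values are illustrative):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.streaming.backpressure.enabled", "true")      // default: false
      .set("spark.streaming.backpressure.initialRate", "1000")  // records/sec at first start
      .set("spark.streaming.receiver.maxRate", "5000")          // per-receiver cap, records/sec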
10. spark.streaming.stopGracefullyOnShutdown
In yarn mode this setting is ineffective: kill terminates the program immediately.
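A sketch of the two ways to request a graceful stop ("ssc" is the context from the earlier sketch; note the caveat above about kill under yarn):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.streaming.stopGracefullyOnShutdown", "true")

    // or stop explicitly from application code:
    ssc.stop(stopSparkContext = true, stopGracefully = true)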
11. When a job is generated, all blocks within the current batch interval are assembled into one BlockRDD, with each block becoming one partition.
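A back-of-the-envelope example (the interval values are illustrative):

    // with a 2s batch interval and the default 200ms block interval,
    // each receiver contributes 10 blocks, i.e. 10 partitions, per batch
    val batchIntervalMs = 2000
    val blockIntervalMs = 200
    val partitionsPerReceiver = batchIntervalMs / blockIntervalMs  // = 10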
Graphic illustration
[Figure: receiver-based DStream without the WAL]
[Figure: receiver-based DStream with the WAL]
[Figure: how checkpoint and WAL data are stored]
[Figure: fault recovery]
That covers how to use a receiver-based DStream. Thank you for reading; I hope this article has been a useful reference.