What is direct DStream?

2025-10-22 Update From: SLTechnology News&Howtos


This article explains what a direct DStream is in Spark Streaming and how it compares with the receiver-based DStream.

Preface

Many readers lose interest as soon as they see the receiver-based DStream. That is a mistake. The receiver-based approach is the foundation of Spark Streaming, even though the direct stream is usually the better fit. Studying the receiver-based approach teaches a great deal, above all the core principles of how Spark Streaming is implemented, data locality, and so on.

Figure: direct DStream operational architecture diagram

Comparison

Compared with the receiver-based DStream, the direct DStream has the following advantages:

A. No receiver has to be started, which avoids unnecessary CPU usage.

B. It removes the whole path in which a receiver receives data and writes it to the BlockManager, and the job later fetches that data by block ID through network transfer and disk reads. This improves efficiency.

C. No WAL (write-ahead log) is needed, which further reduces disk reads and writes.

D. Exactly-once consumption can be achieved by maintaining offsets manually.

E. The RDD generated by the direct DStream is a KafkaRDD rather than a BlockRDD. KafkaRDD partitions correspond one-to-one to Kafka partitions, which makes parallelism easier to control.

F. It avoids the data locality problem of the receiver-based approach, in which the machines hosting receivers run too many tasks while some executors sit idle.
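Point D can be illustrated with a minimal sketch in plain Scala, with no Spark or Kafka dependency. It assumes that results and offsets live in the same store and are updated together, so a batch replayed after a failure is detected and skipped. All names here (BatchRange, OffsetTrackingStore) are hypothetical and exist only for illustration.

```scala
// Hypothetical names throughout: this models the idea without Spark or Kafka.

// The slice of one Kafka partition that a batch covers: [fromOffset, untilOffset).
final case class BatchRange(topic: String, partition: Int, fromOffset: Long, untilOffset: Long)

// Stands in for a transactional store that keeps results and offsets together.
final class OffsetTrackingStore {
  private var total: Long = 0L
  private var committed: Map[(String, Int), Long] = Map.empty

  // Next offset to read for a partition; 0 if nothing has been committed yet.
  def committedOffset(topic: String, partition: Int): Long =
    synchronized { committed.getOrElse((topic, partition), 0L) }

  def currentTotal: Long = synchronized { total }

  // Applies the batch result and advances the offset as one step.
  // Returns false and changes nothing when the range does not start at the
  // committed offset, which is what a replayed batch looks like.
  def commit(range: BatchRange, batchSum: Long): Boolean = synchronized {
    val key = (range.topic, range.partition)
    if (range.fromOffset != committed.getOrElse(key, 0L)) false
    else {
      total += batchSum
      committed = committed.updated(key, range.untilOffset)
      true
    }
  }
}
```

In a real job the same effect would typically come from writing the output and the offsets in one database transaction; the in-memory class above only stands in for that.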

In its compute function, KafkaRDD (in the 0.8 integration) uses SimpleConsumer to read data from Kafka according to the specified topic, partition, and offset range. From the 0.10 integration onward, there is also a notion of data locality when Kafka and Spark run in the same cluster.
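The one-to-one correspondence between Kafka partitions and KafkaRDD partitions, and the idea of reading by offset range, can be sketched in plain Scala as follows. This is a simplified model under stated assumptions, not Spark's own code: TopicPart, PlannedRange, and DirectBatchPlanner are hypothetical names, and the current and latest offsets are passed in as plain maps.

```scala
// Hypothetical names: a model of the planning step, not Spark's own classes.
final case class TopicPart(topic: String, partition: Int)

// Offsets [fromOffset, untilOffset) of one Kafka partition, read by one task.
final case class PlannedRange(tp: TopicPart, fromOffset: Long, untilOffset: Long) {
  def count: Long = untilOffset - fromOffset
}

object DirectBatchPlanner {
  // Builds exactly one range per Kafka partition, so the number of ranges
  // (and therefore tasks) equals the number of Kafka partitions.
  def plan(current: Map[TopicPart, Long], latest: Map[TopicPart, Long]): Seq[PlannedRange] =
    current.toSeq
      .sortBy { case (tp, _) => (tp.topic, tp.partition) }
      .map { case (tp, from) =>
        // Never plan a range that ends before it starts.
        val until = math.max(from, latest.getOrElse(tp, from))
        PlannedRange(tp, from, until)
      }
}
```

Because each range maps to one task, adding Kafka partitions is the natural way to raise parallelism in this model.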

Data locality

For the RDD generated by the Spark Streaming and Kafka 0.8.2 integration, data locality is computed as follows:

    override def getPreferredLocations(thePart: Partition): Seq[String] = {
      val part = thePart.asInstanceOf[KafkaRDDPartition]
      // TODO is additional hostname resolution necessary here
      Seq(part.host)
    }

For the RDD generated by the Spark Streaming and Kafka 0.10 integration, data locality is computed as follows:

    override def getPreferredLocations(thePart: Partition): Seq[String] = {
      // The intention is best-effort consistent executor for a given topicpartition,
      // so that caching consumers can be effective.
      // TODO what about hosts specified by ip vs name
      val part = thePart.asInstanceOf[KafkaRDDPartition]
      val allExecs = executors()
      val tp = part.topicPartition
      val prefHost = preferredHosts.get(tp)
      val prefExecs = if (null == prefHost) allExecs else allExecs.filter(_.host == prefHost)
      val execs = if (prefExecs.isEmpty) allExecs else prefExecs
      if (execs.isEmpty) {
        Seq.empty
      } else {
        // execs is sorted, tp.hashCode depends only on topic and partition, so consistent index
        val index = Math.floorMod(tp.hashCode, execs.length)
        val chosen = execs(index)
        Seq(chosen.toString)
      }
    }
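The selection logic in the 0.10 version can be tried out in isolation with the following standalone sketch. It mirrors the same steps (sort, filter by preferred host, fall back to all executors, pick by hash modulo), but ExecutorSlot and LocalitySketch are hypothetical stand-ins, and the topic and partition are hashed as a plain tuple rather than as a Kafka TopicPartition.

```scala
// Hypothetical stand-in for an executor location; Spark has its own classes for this.
final case class ExecutorSlot(host: String, executorId: String) {
  // Illustrative string form only.
  override def toString: String = s"executor_${host}_$executorId"
}

object LocalitySketch {
  def choose(
      topic: String,
      partition: Int,
      executors: Seq[ExecutorSlot],
      preferredHost: Option[String]): Seq[String] = {
    // Sort so that every call sees the executors in the same order.
    val all = executors.sortBy(e => (e.host, e.executorId))
    // Narrow to the preferred host if one is given.
    val preferred = preferredHost.fold(all)(h => all.filter(_.host == h))
    // Fall back to all executors if none run on the preferred host.
    val candidates = if (preferred.isEmpty) all else preferred
    if (candidates.isEmpty) Seq.empty
    else {
      // The hash depends only on topic and partition, so the index is stable
      // as long as the candidate list does not change.
      val index = Math.floorMod((topic, partition).hashCode, candidates.length)
      Seq(candidates(index).toString)
    }
  }
}
```

Math.floorMod is used instead of % so that a negative hash code still yields a valid index. Keeping the choice stable per topic and partition is what lets a cached consumer on that executor be reused across batches.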

For points to watch out for when integrating with Kafka 0.10, see an article that Langjian translated earlier:

Must read: integration of Spark and Kafka 0.10

Rate limiting

Many people configure rate limiting incorrectly. For the detailed principles, see:

An appreciation of Spark's PIDController source code and a detailed explanation of backpressure

The relevant configuration parameters are:

spark.streaming.backpressure.enabled: defaults to false. Set it to true to enable the backpressure mechanism.

spark.streaming.backpressure.initialRate: not set by default. The initial rate, that is, the maximum rate at which each receiver accepts data when it first starts.

spark.streaming.receiver.maxRate: not set by default. The maximum rate (records per second) at which each receiver will receive data. In effect, each stream consumes at most this number of records per second. Setting this to 0 or a negative number puts no limit on the rate.

spark.streaming.kafka.maxRatePerPartition: the maximum rate (records per second) at which data is read from each Kafka partition when using the new Kafka direct API.
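As a worked example of what spark.streaming.kafka.maxRatePerPartition implies, the following sketch computes the upper bound on batch size for a direct stream. It assumes the cap is simply the per-second rate multiplied by the batch interval, and that a non-positive value means no limit; RateLimitSketch and its methods are hypothetical helpers, not part of Spark.

```scala
// Hypothetical helper for back-of-the-envelope sizing; not a Spark API.
object RateLimitSketch {
  // Upper bound on records read from one Kafka partition in one batch,
  // or None when the setting is not positive (treated here as "no limit").
  def maxRecordsPerPartition(maxRatePerPartition: Long, batchSeconds: Long): Option[Long] =
    if (maxRatePerPartition <= 0) None else Some(maxRatePerPartition * batchSeconds)

  // Upper bound for the whole batch across all partitions of the stream.
  def maxRecordsPerBatch(maxRatePerPartition: Long, batchSeconds: Long, partitions: Int): Option[Long] =
    maxRecordsPerPartition(maxRatePerPartition, batchSeconds).map(_ * partitions)
}
```

For instance, a rate of 1000 records per second per partition, a 5-second batch interval, and 4 partitions give at most 5000 records per partition and 20000 records per batch. With backpressure enabled, the effective rate may be lower than this cap.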
