Network Security Internet Technology Development Database Servers Mobile Phone Android Software Apple Software Computer Software News IT Information

In addition to Weibo, there is also WeChat

Please pay attention

WeChat public account

Shulou

What is the reverse pressure mechanism of Spark Streaming?

2025-04-04 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Internet Technology >

Share

Shulou(Shulou.com)06/01 Report--

This article shows you how the Spark Streaming reverse pressure mechanism is, the content is concise and easy to understand, it can definitely brighten your eyes. I hope you can get something through the detailed introduction of this article.

Background

By default, Spark Streaming receives data at the rate of producer production data through receivers (or Direct mode). When batch processing time > batch interval, that is, the processing time of each batch is longer than that of Spark Streaming batch processing; more and more data are received, but the data processing speed does not keep up, resulting in data accumulation in the system, which may further lead to OOM problems and failures on the Executor side.

Before Spark version 1.5, in order to solve this problem, for Receiver-based data receiver, we can configure the spark.streaming.receiver.maxRate parameter to limit the maximum number of records each receiver can receive per second; for Direct Approach data reception, we can configure the spark.streaming.kafka.maxRatePerPartition parameter to limit the maximum number of records read per Kafka partition per job. Although this method can adapt to the current processing capacity by limiting the receiving rate, it has the following problems:

We need to estimate the processing speed of the cluster and the speed of generating message data in advance.

These two methods require human participation. After modifying the relevant parameters, we need to restart the Spark Streaming application manually.

If the processing capacity of the current cluster is higher than that of the maxRate configured by us, and the data generated by producer is higher than that of maxRate, this will lead to low utilization of cluster resources and the data cannot be processed in a timely manner.

Reverse pressure mechanism

So is it possible that the Spark Streaming system automatically deals with these problems without human intervention? Of course there is! Spark 1.5 introduces the reverse pressure (Back Pressure) mechanism, which automatically adapts the cluster data processing capacity by dynamically collecting some data from the system. For a detailed record, see the instructions in SPARK-7398.

Architecture prior to Spark Streaming 1.5

Prior to Spark 1.5, the architecture of Spark Streaming was as follows:

Data is received continuously through receiver, and when it is received, it stores the data in Block Manager; in order not to lose the data, it also backs up the data to other Block Manager

Receiver Tracker receives the stored Block IDs, and then maintains a time-to-block IDs relationship internally

Job Generator receives an event every batchInterval, which generates a JobSet

Job Scheduler runs the JobSet generated above.

Architecture after Spark Streaming 1.5

In order to automatically adjust the data transmission rate, a component named RateController is added to the original architecture, which inherits from StreamingListener, which listens for onBatchCompleted events of all jobs, and estimates a rate based on processingDelay, schedulingDelay, the number of records currently processed by Batch and completed events; this rate is mainly used to update the maximum number of records that can be processed by the stream per second. The rate estimator (RateEstimator) can be implemented in many ways, but the current Spark 2.2 only implements a rate estimator based on PID.

The calculated maximum rate is stored in the RateController within the InputDStreams, and the calculated rate is pushed to the ReceiverSupervisorImpl after the onBatchCompleted event is processed, so that the receiver knows how much data to receive next.

If the user is configured with spark.streaming.receiver.maxRate or spark.streaming.kafka.maxRatePerPartition, the final amount of data received depends on the minimum of the three. This means that the data processed per second by each receiver or Kafka partition does not exceed the value of spark.streaming.receiver.maxRate or spark.streaming.kafka.maxRatePerPartition.

The detailed process is shown in the following figure:

The use of Spark Streaming reverse pressure mechanism

Enabling the reverse compression mechanism in Spark is simple, just set spark.streaming.backpressure.enabled to true, and the default value for this parameter is false. The reverse pressure mechanism also involves the following parameters, including those not listed in the document:

Spark.streaming.backpressure.initialRate: the initial maximum rate at which each receiver receives the first batch of data when the reverse pressure mechanism is enabled. The default value is not set.

Spark.streaming.backpressure.rateEstimator: speed estimator class. The default value is pid. Currently, Spark only supports this. You can implement it according to your own needs.

Spark.streaming.backpressure.pid.proportional: weight used to respond to errors (changes between the last batch and the current batch). The default value is 1 and can only be set to a non-negative value. Weight for response to "error" (change between last batch and this batch)

Spark.streaming.backpressure.pid.integral: response weight accumulated by errors, with inhibitory effect (effective damping). The default value is 0.2 and can only be set to a non-negative value. Weight for the response to the accumulation of error. This has a dampening effect.

Spark.streaming.backpressure.pid.derived: response weight to error trends. This may cause fluctuations in batch size and can help quickly increase / decrease capacity. The default value is 0 and can only be set to a non-negative value. Weight for the response to the trend in error. This can cause arbitrary/noise-induced fluctuations in batch size, but can also help react quickly to increased/reduced capacity.

Spark.streaming.backpressure.pid.minRate: what is the lowest rate that can be estimated? The default value is 100 and can only be set to a non-negative value.

The above is what the reverse pressure mechanism of Spark Streaming is like. Have you learned any knowledge or skills? If you want to learn more skills or enrich your knowledge reserve, you are welcome to follow the industry information channel.

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

Views: 0

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.

Share To

Internet Technology

Wechat

© 2024 shulou.com SLNews company. All rights reserved.

12
Report