
Deployment and Tuning of Spark Streaming Programs


(1) Deployment

Deployment modes: Spark standalone cluster, YARN cluster, or Mesos cluster.

HA for the driver: if you want the driver program to restart automatically after a failure, the program must rebuild the StreamingContext with the StreamingContext.getOrCreate method, and spark-submit must be given the appropriate restart option (for example --supervise when submitting in cluster mode to a standalone cluster).

Checkpoint directory: if the program uses checkpointing, an HDFS-compatible file system must be configured as the checkpoint directory, because the program is distributed and the checkpoint cannot live on a single node.
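
A minimal sketch of the two points above, assuming an illustrative HDFS checkpoint path and a simple socket source: the StreamingContext is always obtained through StreamingContext.getOrCreate so that a restarted driver rebuilds its state from the checkpoint.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object RecoverableApp {
      // Illustrative HDFS-compatible checkpoint directory (see the note on checkpoints above).
      val checkpointDir = "hdfs://namenode:8020/user/spark/streaming-checkpoint"

      // Factory used only when no checkpoint exists yet.
      def createContext(): StreamingContext = {
        val conf = new SparkConf().setAppName("RecoverableApp")
        val ssc = new StreamingContext(conf, Seconds(5))
        ssc.checkpoint(checkpointDir)
        // Placeholder source and logic; replace with the real input DStream and transformations.
        ssc.socketTextStream("localhost", 9999).count().print()
        ssc
      }

      def main(args: Array[String]): Unit = {
        // On restart the driver rebuilds the context from the checkpoint instead of calling createContext() again.
        val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
        ssc.start()
        ssc.awaitTermination()
      }
    }

Submitted with spark-submit in cluster mode together with a restart option such as --supervise on a standalone cluster, a failed driver is relaunched and picks up the same checkpoint directory.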

How data is received: Receiver mode and Direct mode.

  Receiver mode: enough resources must be allocated to the executors, because the data received by the receiver is stored in executor memory; in particular, when doing window operations, there must be enough memory to hold all the data for the window duration. Set the parameter spark.streaming.receiver.writeAheadLog.enable to true to enable the write-ahead log (WAL) in receiver mode and guarantee that received data is not lost.
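
A sketch of the receiver-mode settings just described, with illustrative values; the WAL flag is the one named above, and the single-copy serialized storage level is an optional choice that relies on the WAL for durability.

    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("ReceiverModeApp")
      // Enable the write-ahead log so data received by the receiver is not lost.
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")

    val ssc = new StreamingContext(conf, Seconds(5))
    // The WAL is written under the checkpoint directory, which must be HDFS-compatible.
    ssc.checkpoint("hdfs://namenode:8020/user/spark/streaming-checkpoint")

    // With the WAL providing durability, single-copy serialized storage is usually sufficient.
    val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER)
    lines.window(Seconds(60), Seconds(10)).count().print()

    ssc.start()
    ssc.awaitTermination()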

  Direct mode: the Kafka direct mode supports the backpressure mechanism, so there is no need to set spark.streaming.kafka.maxRatePerPartition; Spark automatically estimates a reasonable receiving rate and adjusts it dynamically according to load. You only need to set spark.streaming.backpressure.enabled to true.
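
A sketch of the direct approach with backpressure enabled, assuming the spark-streaming-kafka-0-10 integration and hypothetical broker and topic names.

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.KafkaUtils

    val conf = new SparkConf()
      .setAppName("DirectKafkaApp")
      // Let Spark estimate and adjust the ingestion rate instead of fixing
      // spark.streaming.kafka.maxRatePerPartition by hand.
      .set("spark.streaming.backpressure.enabled", "true")

    val ssc = new StreamingContext(conf, Seconds(5))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker1:9092,broker2:9092", // hypothetical brokers
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "direct-example",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    // Direct mode: no long-running receiver; each batch reads its own offset range from Kafka.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

    stream.map(_.value).count().print()
    ssc.start()
    ssc.awaitTermination()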

(2) Tuning

Reasonable CPU resource settings: in a streaming program, CPU usage falls into two categories, receiving data and processing data. Enough CPU cores must be allocated to both so that data can be received and processed promptly and efficiently.

Performance optimization of data reception: when data is received over the network, it is deserialized and stored in Spark memory. Reception can be parallelized by starting multiple receivers, that is, by setting up multiple DStream input sources. The block interval can also be adjusted: for most receivers, the received data is split into blocks before being saved; the number of blocks determines the number of partitions in each batch, and the number of partitions determines the number of tasks launched by the transformations. Blocks per batch = batch interval / block interval (spark.streaming.blockInterval, default 200ms, minimum 50ms).

Data processing parallelism tuning: if the number of parallel tasks used in any stage is too small, cluster resources are not well utilized. You can raise the default number of parallel tasks with spark.default.parallelism, or pass an explicit number of partitions when calling shuffle operators. See the sketch below.
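
A sketch of the two knobs just mentioned, with illustrative values: a smaller block interval yields more partitions per batch (batch interval / block interval), and shuffle operators accept an explicit partition count.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("ParallelismTuning")
      // 5s batch / 100ms block interval = roughly 50 partitions (tasks) per batch of receiver input.
      .set("spark.streaming.blockInterval", "100ms")
      // Default number of parallel tasks used by shuffle operators when no count is given.
      .set("spark.default.parallelism", "48")

    val ssc = new StreamingContext(conf, Seconds(5))
    val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))

    // An explicit partition count on a shuffle operator overrides the default parallelism.
    val counts = words.map((_, 1)).reduceByKey(_ + _, 48)
    counts.print()

    ssc.start()
    ssc.awaitTermination()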

Task launch overhead: if too many tasks are started per second, the overhead of shipping those tasks to the executors on the worker nodes becomes high and latency rises. Two things help: task serialization, using the Kryo serialization mechanism to reduce task size and therefore the time needed to send tasks to the executors; and the execution mode, since running Spark programs in Spark's native standalone mode gives shorter task startup times.

Serialization tuning: input data received by a receiver is serialized and stored in executor memory, and is also written to the WAL in serialized form when zero data loss must be guaranteed; persistent RDDs generated by streaming computations, such as the data operated on by window operations, likewise need to be persisted in serialized form.

Batch interval tuning: for streaming computation to be stable and efficient, the most important thing is that each batch is processed as soon as possible after it is generated. When building the StreamingContext, we pass in a parameter that sets the interval between Spark Streaming batches; Spark submits a job every batchDuration. If the job processing time exceeds the batchDuration, subsequent jobs cannot be submitted on time; as time passes, more and more jobs are delayed, the whole streaming application backs up, data can no longer be processed in real time, and the program eventually crashes. Setting a batch interval appropriate to your own workload is therefore particularly important.

Memory tuning: persisting a DStream serializes the data into byte arrays, which reduces the number of objects and the frequency of GC. To further reduce memory usage, you can enable compression by setting spark.rdd.compress to true. Old data is also cleaned up: data that is no longer needed is removed from memory to free it. Taking the window operation as an example, if the window length is 10 minutes, the data is kept in Spark for 10 minutes and then cleared once processing is complete. A configuration sketch follows.
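
A sketch tying together the batch interval, serialized persistence, Kryo, and RDD compression settings discussed above; all values and the socket source are illustrative.

    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("MemoryAndBatchTuning")
      // Compress serialized RDD partitions to further reduce memory usage.
      .set("spark.rdd.compress", "true")
      // Kryo reduces the size of serialized data and tasks.
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

    // Batch interval: each job should normally finish within this duration,
    // otherwise batches queue up and the application falls behind.
    val ssc = new StreamingContext(conf, Seconds(10))

    val events = ssc.socketTextStream("localhost", 9999)

    // Windowed data is kept in memory for the window length (10 minutes here),
    // so persist it in serialized form to keep object counts and GC pressure low.
    val windowed = events.window(Minutes(10), Seconds(30))
    windowed.persist(StorageLevel.MEMORY_ONLY_SER)
    windowed.count().print()

    ssc.start()
    ssc.awaitTermination()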
