How should we understand the implementation of Spark Streaming? This article works through the question step by step, in the hope of helping readers find a simple and practical way to reason about it.
To explain how Spark Streaming's micro-batch streaming works, we have to start with TCP streams. Typical TCP IO models include blocking IO (BIO), pseudo-asynchronous IO, NIO, AIO, the Reactor model, and so on. Here we focus on pseudo-asynchronous IO.
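As a concrete reference point, here is a minimal sketch of the pseudo-asynchronous IO model: a single thread blocks on accept(), and each accepted connection is handed to a bounded thread pool instead of getting a dedicated thread. The port and the per-connection handling are illustrative assumptions, not taken from the article.

```scala
import java.net.ServerSocket
import java.util.concurrent.Executors
import scala.io.Source

object PseudoAsyncServer {
  def main(args: Array[String]): Unit = {
    val pool   = Executors.newFixedThreadPool(8)   // bounded worker pool
    val server = new ServerSocket(9999)
    while (true) {
      val socket = server.accept()                 // blocking accept on the main thread
      pool.submit(new Runnable {
        override def run(): Unit = {
          // handle the connection on a pooled worker thread
          Source.fromInputStream(socket.getInputStream).getLines()
            .foreach(line => println(s"received: $line"))
          socket.close()
        }
      })
    }
  }
}
```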
In the following sections we transform this model, step by step, into Spark Streaming's SocketStream.
In pseudo-asynchronous mode, clients connect to the server over TCP. That does not work in a distributed deployment: with Spark Streaming's micro-batch processing we have no idea in advance where a Receiver will be scheduled to run, so an external client would not know where to send its request. We could, of course, add a complicated mechanism that reports the Receiver's location, but the simpler first step is to turn our backend into the TCP client: the client actively connects to the external data source (the server side) and pulls data, or has data pushed to it.
After this modification, the model becomes the following:
That is, the client actively establishes a connection to the data server and starts receiving data. Once it has received a certain number of records, say 1000 (with a timeout as a fallback), it wraps them into a task and throws the task into a thread pool for execution.
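A minimal sketch of this pull-based client, assuming a line-oriented data server: collect records until either 1000 have arrived or a timeout elapses, then submit the batch to a thread pool as a task. The host, port, batch size, and timeout value are illustrative, and a real implementation would also flush from a timer even when no data is arriving.

```scala
import java.net.Socket
import java.util.concurrent.Executors
import scala.collection.mutable.ArrayBuffer
import scala.io.Source

object PullClient {
  def main(args: Array[String]): Unit = {
    val pool      = Executors.newFixedThreadPool(4)
    val socket    = new Socket("data-server-host", 9999)   // assumed data server address
    val batchSize = 1000
    val timeoutMs = 500L
    val batch     = ArrayBuffer[String]()
    var lastFlush = System.currentTimeMillis()

    def flush(): Unit = {
      val records = batch.toArray
      batch.clear()
      lastFlush = System.currentTimeMillis()
      pool.submit(new Runnable {                            // the "task" thrown into the pool
        override def run(): Unit = println(s"processing ${records.length} records")
      })
    }

    for (line <- Source.fromInputStream(socket.getInputStream).getLines()) {
      batch += line
      if (batch.size >= batchSize || System.currentTimeMillis() - lastFlush > timeoutMs) flush()
    }
    if (batch.nonEmpty) flush()
  }
}
```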
We can improve this further. One thread is responsible only for receiving data and caching it into a map or array. A RecurringTimer, i.e. a timing thread, fires at a fixed interval such as 200 ms and packages whatever is currently in the map or array into a data block, called a block, which it appends to an in-memory array. A background thread then consumes the blocks from that array, blocking while it is empty, and stores each block in a data manager such as a BlockManager. Finally, another RecurringTimer fires at the batch interval, for example batch = 5 s, and generates a task; the task carries a description of the data it has to process and is put into the thread pool for execution. During execution the task fetches the blocks according to that description and processes them.
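The sketch below (not Spark's real classes; all names and intervals are illustrative) shows that pipeline end to end: a receiver thread fills a buffer, a 200 ms timer cuts the buffer into blocks, a background thread stores the blocks in a stand-in "block manager", and a 5 s batch timer turns the stored block ids into a job description.

```scala
import java.util.concurrent.{Executors, LinkedBlockingQueue, TimeUnit}
import scala.collection.mutable.ArrayBuffer

object BlockPipelineSketch {
  case class Block(id: Long, records: Array[String])

  private val buffer        = ArrayBuffer[String]()              // filled by the receiver thread
  private val pendingBlocks = new LinkedBlockingQueue[Block]()   // blocks waiting to be stored
  private val blockManager  = new LinkedBlockingQueue[Block]()   // stand-in for a BlockManager
  private var blockId       = 0L

  def main(args: Array[String]): Unit = {
    val timers  = Executors.newScheduledThreadPool(2)
    val workers = Executors.newFixedThreadPool(2)

    // 1. Receiver thread: pretend data keeps arriving and cache it in the buffer.
    workers.submit(new Runnable {
      override def run(): Unit = while (true) {
        buffer.synchronized { buffer += s"record-${System.nanoTime()}" }
        Thread.sleep(10)
      }
    })

    // 2. Block-generation timer: every 200 ms wrap the buffered records into a block.
    timers.scheduleAtFixedRate(new Runnable {
      override def run(): Unit = {
        val records = buffer.synchronized { val r = buffer.toArray; buffer.clear(); r }
        if (records.nonEmpty) { blockId += 1; pendingBlocks.put(Block(blockId, records)) }
      }
    }, 200, 200, TimeUnit.MILLISECONDS)

    // 3. Background thread: block on the queue and "store" each block as it appears.
    workers.submit(new Runnable {
      override def run(): Unit = while (true) blockManager.put(pendingBlocks.take())
    })

    // 4. Batch timer: every 5 s describe the stored blocks and hand them off as a job.
    timers.scheduleAtFixedRate(new Runnable {
      override def run(): Unit = {
        val ids = ArrayBuffer[Long]()
        var b = blockManager.poll()
        while (b != null) { ids += b.id; b = blockManager.poll() }
        println(s"batch job over blocks: ${ids.mkString(", ")}")
      }
    }, 5, 5, TimeUnit.SECONDS)
  }
}
```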
In fact, the steps above are already very close to Spark Streaming's Receiver-based model.
Spark Streaming must schedule and start the receivers before any task can execute; this scheduling is similar to the job scheduling of Spark Core. In receiver mode, therefore, when assigning cores to executors you must keep in mind that each receiver permanently occupies one CPU core.
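A small illustrative snippet of that core-allocation point: each receiver pins one core for its whole lifetime, so the application needs more cores than it has receivers. With a single receiver running locally, that means at least "local[2]": one thread for the receiver and one for processing tasks.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ReceiverCores {
  def main(args: Array[String]): Unit = {
    // "local[1]" would leave no core free for task execution once the receiver starts
    val conf = new SparkConf().setAppName("receiver-cores").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(5))
    // ... define receiver-based streams here, then call ssc.start()
  }
}
```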
Once a receiver is started, it behaves like the pseudo-asynchronous IO model above: the receiver collects and caches the data, and the timer thread packages it into blocks and stores them in the BlockManager.
Besides block generation, Spark Streaming also has a Spark Core style job-generation step. When a job is generated, the block information of the current batch (or of several batches, if a window is used) is wrapped into an object called a BlockRDD. The number of partitions of the BlockRDD equals the number of blocks, and from there computation proceeds exactly as in Spark Core: the task generated for each partition fetches its corresponding block and processes it. In a BlockRDD, each block corresponds to one partition.
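For completeness, here is a minimal receiver-based job, assuming a text source on localhost:9999 (for example `nc -lk 9999`). Each 5 s batch is computed over a BlockRDD whose partitions correspond to the blocks the receiver produced during that batch.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SocketWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("socket-word-count").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(5))

    val lines  = ssc.socketTextStream("localhost", 9999)    // starts one receiver
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```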
Of course, someone will ask: can Spark Streaming work without a receiver? What is the alternative?
Before answering, consider another issue. Some data sources, such as Kafka, have the concept of partitions and let us fetch data flexibly by offset, i.e. we can request exactly the data we want by controlling the requested offsets. For such a source there is no need to first pull the data, store it in the BlockManager, and then read it back out for processing (ignoring the write-ahead log for the moment); that round trip is an obvious waste of performance. A raw network IO stream, by contrast, cannot be stored at the source and re-read at will the way Kafka data can, so it has to be saved before it is processed. In fact, the receiver-based form covers the largest number of Spark Streaming scenarios.
For message queues such as Kafka, a different pattern has emerged: the direct pattern. Here we no longer use a Receiver, generate blocks, and build a BlockRDD with each block as a partition. Instead, when generating the job, we build an object called a KafkaRDD from the offset information. The partitions of the KafkaRDD correspond one-to-one to the partitions of the Kafka topic. The Spark Core job then executes as usual: the task generated for each partition goes to Kafka and fetches its data directly, using the information carried inside the KafkaRDD.
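A direct-mode sketch using the spark-streaming-kafka-0-10 connector; the broker addresses, group id, and topic name are placeholders. No receiver is started: each KafkaRDD partition maps one-to-one to a Kafka topic partition, and the tasks read the records straight from Kafka at execution time.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object DirectKafkaExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("direct-kafka").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(5))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker1:9092,broker2:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "spark-streaming-demo",
      "auto.offset.reset"  -> "latest"
    )

    // One KafkaRDD partition per Kafka topic partition; no receiver, no blocks.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

    stream.map(record => (record.key, record.value)).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```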
As you can see, the direct pattern removes everything related to the Receiver: there is no write-ahead log, and the data no longer has to be written out and read back before processing, so performance improves considerably.
That is how to understand the implementation of Spark Streaming. I hope the content above has been of some help.