This article mainly explains a case study of Spark Streaming. The explanation is simple and clear and easy to follow; please work through the case study along with the editor.
Why start with Spark Streaming?
Spark Streaming is a sub-framework built on Spark Core, so if we can fully master one sub-framework, we can better master Spark as a whole. Spark Streaming and Spark SQL are currently the most popular of the Spark frameworks. From a research point of view, Spark SQL is dominated by SQL-optimization concerns, which makes it less suitable for studying Spark in depth. Unlike the other frameworks, Spark Streaming is more like an application written on Spark Core; if we understand Spark Streaming deeply, we can write very complex applications of our own.
The advantage of Spark Streaming is that it can be combined with Spark SQL, graph computation, and machine learning, which makes it far more powerful. In this era, simple stream computation alone can no longer meet customers' needs. Spark Streaming is also the most problem-prone part of Spark, because it runs continuously and is intrinsically complex.
This lesson covers:
1. An unconventional online Spark Streaming experiment
What makes this online experiment unconventional is the large batchInterval setting, 5 minutes or more, chosen so that the various states a running Streaming application passes through can be seen more clearly.
The experiment uses Spark Streaming to count words online: Spark Streaming connects to a port, receives the word data sent to it, and outputs the statistics to the console. netcat is used to create a simple server that listens on a port and relays the words the user types at the keyboard.
2. Instantly grasping the essence of Spark Streaming
Combining this experiment with the Job, Stage, and Task information observed on the Web UI, and then with the Spark Streaming source code, we analyze how Spark Streaming actually works.
Lab environment description:
The experiment runs on three Ubuntu 14.04 virtual machines, one as the Spark Master and the other two as Spark Workers. The Spark version used is 1.6.1, and Spark checkpoints are stored on HDFS (Hadoop 2.6.0). To record the process information of a Spark Streaming run, Spark's HistoryServer must also be started. The following is the script that launches the Spark, HDFS, and HistoryServer services.
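The original launch script was an image that did not survive extraction. As a minimal sketch, assuming a standard installation with HADOOP_HOME and SPARK_HOME set (the paths are assumptions, not the author's originals), it would look something like:

    #!/bin/bash
    # Start HDFS (NameNode and DataNodes) -- the checkpoints are stored here.
    $HADOOP_HOME/sbin/start-dfs.sh
    # Start the Spark Master and Workers.
    $SPARK_HOME/sbin/start-all.sh
    # Start the HistoryServer so finished applications remain visible.
    $SPARK_HOME/sbin/start-history-server.sh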
The experimental code is as follows
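The code listing itself was not preserved, so what follows is a minimal sketch of the word-count application using the standard Spark 1.6 socketTextStream API; the object name, app name, and checkpoint path are illustrative, not the author's originals.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Minutes, StreamingContext}

    // Minimal sketch of the experiment (names and paths are illustrative).
    object NetworkWordCount {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("SparkStreamingOnlineExperiment")
        // A deliberately large batch interval (5 minutes), so that each phase
        // of the running Streaming application can be observed on the Web UI.
        val ssc = new StreamingContext(conf, Minutes(5))
        // Checkpoints go to HDFS; this path is an assumption.
        ssc.checkpoint("hdfs://Master:9000/checkpoint")

        // Connect to the netcat server on the Master node, port 9999.
        val lines = ssc.socketTextStream("Master", 9999)
        val wordCounts = lines.flatMap(_.split(" "))
          .map(word => (word, 1))
          .reduceByKey(_ + _)
        wordCounts.print() // output the statistics to the console

        ssc.start()
        ssc.awaitTermination()
      }
    }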
The script that submits the application to the Spark cluster is as follows
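The submit script was likewise an image. A sketch, assuming the application is packaged as a jar and the Master listens on the default standalone port 7077 (the class name, jar path, and host name are assumptions):

    #!/bin/bash
    # Submit the word-count application to the standalone cluster.
    $SPARK_HOME/bin/spark-submit \
      --class NetworkWordCount \
      --master spark://Master:7077 \
      /root/sparkApps/NetworkWordCount.jar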
First run nc -lk 9999 on the Master node to create a simple server, then run the script above to submit the Spark application.
Find an English article on the Internet to use as input, as follows
The word statistics are as follows
Four Jobs can be observed on the Spark UI.
First look at Job 0; it turns out that Spark Streaming submits a Job of its own as soon as it starts.
The start method of StreamingContext calls the start method of JobScheduler.
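The screenshot of this code is missing; heavily abridged from the Spark 1.6 source, the relevant path looks roughly like this:

    // StreamingContext.start, heavily abridged from the Spark 1.6 source:
    def start(): Unit = synchronized {
      state match {
        case INITIALIZED =>
          // The scheduler is started on a separate thread so that the
          // caller's thread-local properties do not leak into the jobs
          // started by the streaming machinery.
          ThreadUtils.runInNewThread("streaming-start") {
            scheduler.start()
          }
          state = StreamingContextState.ACTIVE
        case ACTIVE =>
          logWarning("StreamingContext has already been started")
        case STOPPED =>
          throw new IllegalStateException("StreamingContext has already been stopped")
      }
    }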
Continue into the start method of the JobScheduler class.
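Again abridged from the Spark 1.6 source, JobScheduler.start wires up an event loop and then starts the two components that matter for this experiment:

    // JobScheduler.start, abridged from the Spark 1.6 source:
    def start(): Unit = synchronized {
      if (eventLoop != null) return // scheduler has already been started

      eventLoop = new EventLoop[JobSchedulerEvent]("JobScheduler") {
        override protected def onReceive(event: JobSchedulerEvent): Unit = processEvent(event)
        override protected def onError(e: Throwable): Unit = reportError("Error in job scheduler", e)
      }
      eventLoop.start()

      receiverTracker = new ReceiverTracker(ssc)
      receiverTracker.start() // launches the receivers; the source of Job 0 and Job 1
      jobGenerator.start()    // generates a streaming job once per batchInterval
    }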
Inside the receiver-startup path there is a method whose purpose is to ensure that every slave has registered, preventing all the Receivers from ending up on the same node and enabling load-balanced computation later.
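That method is ReceiverTracker's dummy job, and it is what shows up as Job 0. Paraphrased from the Spark 1.6 source:

    // ReceiverTracker.runDummySparkJob, paraphrased from the Spark 1.6 source:
    private def runDummySparkJob(): Unit = {
      if (!ssc.sparkContext.isLocal) {
        // A trivial 50-partition shuffle job: it spreads tasks across the
        // cluster so that all executors register before receivers are placed.
        ssc.sparkContext.makeRDD(1 to 50, 50).map(x => (x, 1)).reduceByKey(_ + _, 20).collect()
      }
      assert(getExecutors.nonEmpty)
    }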
Job 1 keeps running the whole time because it continuously receives data from the stream. It runs on Worker1 as a single Task that receives data; its data locality is PROCESS_LOCAL, and the receiver stores the data it receives in memory.
The information for Job 2 is as follows
The Stage 3 information is as follows
The Stage 4 information is as follows
The information for Job 3 is as follows. Job 3's DAG is the same as Job 2's, but Stage 5 is skipped because its shuffle output has already been computed.
The Jobs that follow all perform the word splitting and counting.
Reviewing the four Jobs: two of them are run by the framework itself. Job 0 ensures that all slaves are registered, preventing all the Receivers from ending up on the same node and enabling load balancing in later computation. Job 1 starts the data receiver: it runs as a single Task on one Executor and continuously receives data, saving it to memory. Job 2 and Job 3 run the word-count statistics.
Thank you for reading. The above is the content of the "case study of Spark Streaming". After studying this article, I believe you have a deeper understanding of it; the specifics still need to be verified in practice.