What is the initial use and working principle of Spark Streaming

Many readers new to Spark have little idea about the initial use and working principle of Spark Streaming, so this article summarizes the basics. I hope that after reading it you will be able to get started with Spark Streaming.

I. Streaming Computing

1. What is streaming?

Streaming is a data transfer technique that delivers data to the client as a stable, continuous stream, sent without interruption, so that the audio the user hears or the video the user sees plays smoothly, and the user can start browsing the content on screen before the whole file has been transferred.

2. Common Streaming Computing Frameworks

Apache Storm

Spark Streaming

Apache Samza

All three of the above real-time computing systems are open-source distributed systems with many advantages, such as low latency, scalability, and fault tolerance. Their common feature is that they let you distribute tasks and run them in parallel across a set of fault-tolerant machines. In addition, they all provide a simple API that hides the complexity of the underlying implementation.

For a comparison of the three, you can refer to articles comparing the big data stream-processing frameworks Storm, Spark, and Samza.

II. Spark Streaming

1. Spark Streaming Introduction

Spark Streaming is an important framework in the Spark ecosystem and is built on top of Spark Core, sitting alongside components such as Spark SQL, MLlib, and GraphX.

The official explanation for Spark Streaming is as follows:

Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources like Kafka, Flume, Twitter, ZeroMQ, Kinesis or TCP sockets, and can be processed using complex algorithms expressed with high-level functions like map, reduce, join and window. Finally, processed data can be pushed out to filesystems, databases, and live dashboards. In fact, you can apply Spark's machine learning and graph processing algorithms on data streams.

Spark Streaming has the following characteristics:

Highly scalable and can run on hundreds of machines (Scales to hundreds of nodes)

Low latency, with data processed at second-level granularity (Achieves low latency)

High fault tolerance (Efficiently recover from failures)

Integration with Spark's batch and interactive processing, such as Spark Core (Integrates with batch and interactive processing)

2. How Spark Streaming Works

The core abstraction of Spark Core is the RDD; the core abstraction of Spark Streaming is the DStream. A DStream is similar to an RDD and is essentially a sequence of RDDs: it divides the data stream into batches by time, for example every few seconds. Spark Streaming first receives the stream data and splits it into multiple batches, then submits the batches to the Spark cluster for computation, and finally outputs the results to HDFS, a database, a front-end page for display, and so on.
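
To make the "a DStream is essentially a sequence of RDDs" point concrete, here is a minimal sketch (the HDFS path is only an example, and sc is the SparkContext provided by spark-shell); foreachRDD exposes the plain RDD behind each batch:

import org.apache.spark.streaming._

val ssc = new StreamingContext(sc, Seconds(5))
val lines = ssc.textFileStream("hdfs://namenode:8020/example/input/") // example path
lines.foreachRDD { rdd =>
  // rdd is an ordinary Spark Core RDD holding the data of one 5-second batch
  println(s"records in this batch: ${rdd.count()}")
}
ssc.start()
ssc.awaitTermination()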

We all know that during initialization Spark Core creates a SparkContext object for subsequent processing of the data; correspondingly, Spark Streaming creates a StreamingContext, whose underlying layer is a SparkContext. That is, the StreamingContext submits jobs to the SparkContext for execution, which also explains why a DStream is a series of RDDs. When a Spark Streaming application starts, a Receiver is first launched on the Executor of one node; when data is written from the data source, it is received by this Receiver. After receiving the data, the Receiver splits it into a number of blocks and replicates them to other nodes (Replicate Blocks, for fault tolerance). The Receiver then reports the blocks to the StreamingContext, indicating on which nodes' Executors the data resides. Finally, at a fixed interval, the StreamingContext turns the data into RDDs and hands them to the SparkContext, which divides the work among the nodes for parallel computation.
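
As a rough illustration of the receiver side (the host and port below are example values): a receiver-based input stream such as socketTextStream accepts a storage level, and the default, MEMORY_AND_DISK_SER_2, replicates each received block to a second node, which corresponds to the "Replicate Blocks" step described above.

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming._

val ssc = new StreamingContext(sc, Seconds(1))
// a Receiver runs on one Executor; received data is split into blocks,
// and with this storage level each block is replicated to one other node
val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER_2)
lines.print()
ssc.start()
ssc.awaitTermination()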

3. Spark Streaming Demo

After introducing the basic principles of Spark Streaming, let's look at how to run it. The official example collects data from a Socket source and runs a wordcount on it. The case is very simple and will not be explained in detail here; readers can refer to the official documentation: http://spark.apache.org/docs/1.3.0/streaming-programming-guide.html. A minimal sketch of that example follows.
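
The sketch below is adapted from the official guide's socket wordcount; the application name, host, and port are example values, and you would feed it data with a tool such as nc -lk 9999:

import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._

val sparkConf = new SparkConf().setAppName("NetworkWordCount").setMaster("local[2]")
val ssc = new StreamingContext(sparkConf, Seconds(1))

// receive lines of text from a TCP socket (example host and port)
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(word => (word, 1)).reduceByKey(_ + _)
wordCounts.print()

ssc.start()
ssc.awaitTermination()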

In the Spark Streaming programming model, there are two ways to create a StreamingContext.

The first is to create a StreamingContext from a SparkConf:

import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._

val conf = new SparkConf().setAppName("SparkStreamingDemo").setMaster("master")
val scc = new StreamingContext(conf, Seconds(1)) // detect data every 1 second

The second is to create it from an existing SparkContext, which is how it is done on the spark-shell command line:

import org.apache.spark.streaming._
val scc = new StreamingContext(sc, Seconds(1))

Of course, we can also collect data from the HDFS file system. Looking at the Spark source code, we can find the following method:

[Figure: the StreamingContext.textFileStream method in the Spark source]

This method monitors the specified HDFS directory for new files, but it ignores files whose names begin with "."; that is, files starting with "." are never picked up for processing.

The following case monitors data on the HDFS file system, runs a wordcount to count the words, and prints the results:

import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._

val ssc = new StreamingContext(sc, Seconds(5))
// read data
val lines = ssc.textFileStream("hdfs://hadoop-senior.shinelon.com:8020/user/shinelon/spark/streaming/")
// process
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.print()

ssc.start()             // Start the computation
ssc.awaitTermination()  // Wait for the computation to terminate

The above program checks every 5 seconds whether there is new data in the hdfs://hadoop-senior.shinelon.com:8020/user/shinelon/spark/streaming/ directory on the HDFS file system; if there is, it counts the words and prints the results to the console. There are two ways to run the above code. The first is to start the spark-shell client and paste the commands into the command line one by one, which is obviously cumbersome. The second is to write the program into a script file and load it into the spark-shell command line for execution. The second way is shown here:

Create a SparkStreamingDemo.scala file in some directory containing the code shown above, then start the spark-shell client:

$ bin/spark-shell --master local[2]

Then load the Spark Streaming application:

scala> :load /opt/cdh-5.3.6/spark-1.3.0-bin-2.5.0-cdh6.3.6/SparkStreamingDemo.scala

Then upload the data to the above HDFS file directory:

$ bin/hdfs dfs -put /opt/datas/wc.input /user/shinelon/spark/streaming/input7

The contents of the file are as follows:

hadoop hive
hadoop hbase
hadoop yarn
hadoop hdfs
hdfs spark

The word counts for the newly uploaded file are then printed to the console.

Generally speaking, writing a Spark Streaming application involves the following steps (a skeleton sketch follows the list):

Define an input stream source, for example data from HDFS or Kafka, data from sockets, etc.

Define a series of processing and transformation operations, such as the map and reduce operations above; Spark Streaming has transformation operations similar to those of Spark Core

Start the program to begin collecting data (start())

Wait for the program to stop (awaitTermination(), which returns when the computation terminates with an error or is stopped manually)

Manually terminate the application (stop())
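
A minimal skeleton tying these steps together (the HDFS path and 5-second batch interval are example values, and sc is the SparkContext provided by spark-shell):

import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._

val ssc = new StreamingContext(sc, Seconds(5))
val lines = ssc.textFileStream("hdfs://namenode:8020/example/input/")     // 1. define the input source (example path)
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)   // 2. define transformations
counts.print()                                                            //    output operation
ssc.start()                                                               // 3. start collecting data
ssc.awaitTermination()                                                    // 4. wait for the computation to stop
// ssc.stop()                                                             // 5. stop manually when needed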

You can use the saveAsTextFiles() method to output the results to the HDFS file system; readers can experiment with saving the results there, for example as sketched below.
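
A minimal sketch (wordCounts is the DStream from the wordcount example above, and the output prefix path is only an example); for each batch, saveAsTextFiles writes a directory named from the prefix, the batch timestamp, and the suffix:

// writes each batch to a directory like .../wordcount-<timestamp>.txt (example prefix)
wordCounts.saveAsTextFiles("hdfs://hadoop-senior.shinelon.com:8020/user/shinelon/spark/streaming/output/wordcount", "txt")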

Finally, let's look at several common ways of developing Spark Streaming applications:

Spark Shell Code: development and testing (pasting code line by line into the command line as above, which is only suitable for testing)

Spark Shell Load Scripts: development and testing (writing a Scala script and executing it in spark-shell with :load)

IDE Develop App: development, testing, and packaging a JAR (for production environments), submitted with spark-submit, for example:
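
A sketch of such a submission (the class name, JAR path, and master URL are hypothetical examples):

$ bin/spark-submit \
    --class com.example.SparkStreamingDemo \
    --master spark://master:7077 \
    /path/to/spark-streaming-demo.jar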

After reading the above, have you mastered the initial use and working principle of Spark Streaming? If you want to learn more skills or go deeper into the topic, you are welcome to follow further articles. Thank you for reading!
