This article explains what Spark Streaming is. The explanation is kept simple and concrete, so it should be easy to follow and learn from.
One: Overview of Spark Streaming
1.1 A simple understanding of Spark Streaming
Spark Streaming is an extension of the core Spark API. It is scalable, high-throughput, fault-tolerant, and operates in near real time.
Data can be ingested from many sources, such as Kafka, Flume, Twitter, ZeroMQ, Kinesis, or TCP sockets.
The data can then be processed with complex algorithms expressed through high-level functions such as map, reduce, join, and window.
Finally, the processed data can be pushed out to file systems and databases. In fact, Spark's machine learning and graph processing algorithms can also be applied to data streams.
Internally, it works as follows: Spark Streaming receives the real-time input data stream and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results, also in batches.
In addition, Spark Streaming provides a high-level abstraction called a discretized stream, or DStream, which represents a continuous stream of data. DStreams can be created either from input data streams from sources such as Kafka, Flume, and Kinesis, or by applying high-level operations to other DStreams.
Internally, a DStream is represented by a sequence of RDDs.
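To make this model concrete, here is a minimal sketch that builds one DStream from an input source and derives others from it with map and window operations. The class name DStreamWindowSketch is illustrative, and the master:9999 socket source is borrowed from the example in section 1.2 below:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class DStreamWindowSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("DStreamWindowSketch");
        // 10-second batch interval: each interval becomes one RDD in every DStream
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

        // A DStream created from an input source (a TCP socket here)
        JavaReceiverInputDStream<String> lines = jssc.socketTextStream("master", 9999);

        // A DStream derived from another DStream via a high-level operation (map)
        JavaDStream<Integer> lineLengths = lines.map(new Function<String, Integer>() {
            public Integer call(String s) {
                return s.length();
            }
        });

        // A windowed DStream: every 10 seconds, process the last 30 seconds of data
        JavaDStream<Integer> windowed = lineLengths.window(
                Durations.seconds(30), Durations.seconds(10));
        windowed.print();

        jssc.start();
        jssc.awaitTermination();
    }
}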
First, look at the Maven dependency (spark-streaming_2.10):
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.10</artifactId>
    <version>1.6.1</version>
</dependency>
1.2 Example: listening to a data server and counting the words in text data received over a TCP socket
package com.berg.spark.test5.streaming;

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

import scala.Tuple2;

public class SparkStreamingDemo1 {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount");
        conf.set("spark.testing.memory", "269522560000");
        // Create a streaming context with a 10-second batch interval
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));
        System.out.println("jssc: " + jssc);

        // Create a DStream that will connect to hostname:port, e.g. master:9999
        JavaReceiverInputDStream<String> lines = jssc.socketTextStream("master", 9999);
        System.out.println("lines: " + lines);

        // Split each line into words
        JavaDStream<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
            private static final long serialVersionUID = 1L;

            public Iterable<String> call(String x) {
                return Arrays.asList(x.split(" "));
            }
        });

        // Count each word in each batch
        JavaPairDStream<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
            public Tuple2<String, Integer> call(String s) {
                return new Tuple2<String, Integer>(s, 1);
            }
        });
        JavaPairDStream<String, Integer> wordCounts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
            public Integer call(Integer i1, Integer i2) {
                return i1 + i2;
            }
        });

        // Print the first ten elements of each RDD generated in this DStream to the console
        wordCounts.print();

        jssc.start();            // Start the computation
        jssc.awaitTermination(); // Wait for the computation to terminate
    }
}
To see the code in action, first start a netcat server under Linux: $ nc -lk 9999
Then run the code in Eclipse.
Then type an arbitrary line of text, with the words separated by spaces, as follows:
hadoop@master:~$ nc -lk 9999
berg hello world berg hello
You can see the following results in the Eclipse console:
-------------------------------------------
Time: 1465386060000 ms
-------------------------------------------
(hello,2)
(berg,2)
(world,1)
1.3 Example: treating files in an HDFS directory as the input data stream
package com.berg.spark.test5.streaming;

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.dstream.DStream;

import scala.Tuple2;

public class SparkStreamingDemo2 {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount");
        conf.set("spark.testing.memory", "269522560000");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));
        System.out.println("jssc: " + jssc);

        // Create a DStream that reads new files on HDFS as the data source
        JavaDStream<String> lines = jssc.textFileStream("hdfs://master:9000/txt/sparkstreaming/");
        System.out.println("lines: " + lines);

        // Split each line into words
        JavaDStream<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
            private static final long serialVersionUID = 1L;

            public Iterable<String> call(String x) {
                System.out.println(Arrays.asList(x.split(" ")).get(0));
                return Arrays.asList(x.split(" "));
            }
        });

        // Count each word in each batch
        JavaPairDStream<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
            private static final long serialVersionUID = 1L;

            public Tuple2<String, Integer> call(String s) {
                return new Tuple2<String, Integer>(s, 1);
            }
        });
        System.out.println(pairs);

        JavaPairDStream<String, Integer> wordCounts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
            public Integer call(Integer i1, Integer i2) {
                return i1 + i2;
            }
        });

        // Print the first ten elements of each RDD generated in this DStream to the console
        wordCounts.print();

        // Count the number of distinct words in each batch
        JavaDStream<Long> count = wordCounts.count();
        count.print();

        // Save the word counts of each batch to HDFS
        DStream<Tuple2<String, Integer>> dstream = wordCounts.dstream();
        dstream.saveAsTextFiles("hdfs://master:9000/bigdata/spark/xxxx", "sparkstreaming");
        // Equivalent one-liner:
        // wordCounts.dstream().saveAsTextFiles("hdfs://master:9000/bigdata/spark/xxxx", "sparkstreaming");

        jssc.start();
        jssc.awaitTermination(); // Wait for the computation to terminate
    }
}
The code above continuously monitors the HDFS directory hdfs://master:9000/txt/sparkstreaming/ for new files and, whenever one appears, counts the words it contains.
Run the program and then manually add a file to that directory; the word counts for the file's contents will appear on the console.
Note the meaning of the parameter:
public JavaDStream<String> textFileStream(java.lang.String directory)

Create an input stream that monitors a Hadoop-compatible filesystem for new files and reads them as text files (using key as LongWritable, value as Text and input format as TextInputFormat). Files must be written to the monitored directory by "moving" them from another location within the same file system. File names starting with . are ignored.

Parameters:
directory - HDFS directory to monitor for new files
Returns:
(undocumented)
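Because files only become visible to textFileStream once they are moved into the monitored directory, one workable pattern is to write the file to a staging path on the same HDFS instance and then rename it into place. The sketch below illustrates this with the Hadoop FileSystem API; the staging path /txt/staging/words.txt and the class name are hypothetical, while the monitored directory matches the example above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MoveIntoMonitoredDir {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://master:9000");
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical staging file, fully written before the move
        Path staging = new Path("/txt/staging/words.txt");
        // Destination inside the directory monitored by textFileStream
        Path monitored = new Path("/txt/sparkstreaming/words.txt");

        // rename is atomic within a single HDFS instance, so the
        // stream only ever sees a complete file
        boolean moved = fs.rename(staging, monitored);
        System.out.println("moved: " + moved);
        fs.close();
    }
}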
Thank you for reading. This concludes "what is Spark Streaming". After studying this article, you should have a deeper understanding of what Spark Streaming is; its concrete use still needs to be verified in practice.