Big Data Analytics Technology and Practice: Spark Streaming

With the rapid development of information technology, data volumes are growing explosively, and the variety and velocity of data far exceed what people once imagined, so higher demands are being placed on big data processing. More and more fields urgently need big data technology to solve their key problems. In some domains (such as finance and disaster warning), time is money, and time can even mean lives, yet traditional batch processing frameworks struggle to meet such real-time requirements. This has given rise to streaming computing frameworks such as S4 and Storm. Spark is a memory-based, general-purpose big data processing engine; its excellent job scheduling mechanism and fast distributed computing capability make it especially efficient for iterative computation, which allows Spark to support stream processing of big data to a certain extent.

Spark Streaming is the stream processing framework on top of Spark. It enables real-time computation over massive data with high throughput and high fault tolerance. Spark Streaming supports many types of data sources, including Kafka, Flume, Twitter, ZeroMQ, Kinesis, and TCP sockets, as shown in figure 1. Spark Streaming receives data streams in real time, divides the continuous stream into batches of discrete data sets according to a certain time interval, and then processes them with a rich set of APIs such as map, reduce, join, and window. Finally, the work is submitted to the Spark engine, which produces result data batch by batch, so Spark Streaming is also called a quasi-real-time processing system.

Figure 1 Spark Streaming supports multiple types of data sources

At present, the most widely used stream processing framework is Storm. Spark Streaming is not as good as Storm in real-time performance and fault tolerance: its minimum processing latency is on the order of 0.5 seconds, while Storm can reach about 0.1 seconds. However, Spark Streaming integrates very well with the rest of the ecosystem: through RDDs it can seamlessly share data with every component on Spark, and it also integrates easily with distributed log collection frameworks such as Kafka and Flume. At the same time, the throughput of Spark Streaming is much higher than that of Storm, as shown in figure 2. Therefore, although the processing latency of Spark Streaming is higher than Storm's, its advantages in integration and throughput make it well suited to big data scenarios.

Basic concepts of Spark Streaming

Batch processing interval

In Spark Streaming, data is collected in real time, record by record, but it is processed in batches. Therefore, Spark Streaming needs a time interval over which to gather the collected data together; this interval is called the batch processing interval.

In other words, for a continuous stream of data, Spark Streaming first discretizes it by segmentation. Each segmentation of the stream produces an RDD, and each RDD contains all the data received during one time interval, so the data stream is transformed into an ordered sequence of RDDs; the batch processing interval determines how often Spark Streaming splits the stream. Spark Streaming is a component on top of Spark, and the data it acquires, together with the operations on that data, are still executed in the underlying Spark kernel as Spark jobs, so the batch processing interval not only affects the throughput of data processing but also determines how often Spark Streaming submits jobs to Spark and the latency of data processing. Note that the batch interval applies for the whole lifetime of a Spark Streaming application and cannot be modified while the program is running, so it must be chosen by weighing factors such as the characteristics of the data stream in the actual application scenario and the processing capacity of the cluster.

Window interval

The window interval, also known as the window length, is an abstract time concept that determines the scope and granularity at which Spark Streaming processes the RDD sequence; that is, by setting the window length, users can analyze the data within a certain time range. Suppose the batch processing interval is set to 1 s and the window interval to 3 s, as shown in figure 3, where each solid rectangle represents an RDD produced by Spark Streaming every 1 second, a group of solid rectangles represents a time-ordered sequence of RDDs, and the transparent rectangle represents the window. It is easy to see that the window contains at most 3 RDDs, that is, Spark Streaming can analyze the data of up to 3 RDDs at a time. For window intervals, the following points also deserve attention:

Taking figure 3 as an example, in the first 3 seconds after the system starts, fewer than 3 RDDs have entered the window, but as time goes by the window eventually fills up.

The RDDs contained in different windows may overlap; that is, data in the current window may also be contained in several subsequent windows. Therefore, in some application scenarios, data that has already been processed cannot be deleted immediately, because later windows may still need it.

The window interval must be an integral multiple of the batch interval.

Fig. 3 schematic diagram of window interval

Sliding time interval

The sliding time interval determines how often Spark Streaming performs statistics and analysis on the data, and it usually appears in window-related operations. The sliding interval is based on the batch interval and must be an integral multiple of it; by default it is set equal to the batch interval. If the batch interval is 1 s, the window interval is 3 s, and the sliding interval is 2 s, as shown in figure 4, then every 2 s Spark Streaming statistically analyzes the three RDDs generated during the past 3 s (a minimal code sketch of these three intervals follows figure 4).

Fig. 4 Comprehensive schematic diagram of sliding interval, window interval and batch interval
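To make these intervals concrete, the following minimal sketch wires the values from figure 4 into the DStream API: the batch interval (1 s) is set on the StreamingContext, while the window interval (3 s) and sliding interval (2 s) are passed to the window() transformation. The socket source localhost:9999 is purely illustrative.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowIntervalExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("WindowIntervalExample")
    val ssc = new StreamingContext(conf, Seconds(1))      // batch interval: 1 s
    val lines = ssc.socketTextStream("localhost", 9999)   // illustrative source
    // Window interval 3 s, sliding interval 2 s: every 2 s, process the last 3 s of data
    val windowed = lines.window(Seconds(3), Seconds(2))
    windowed.count().print()
    ssc.start()
    ssc.awaitTermination()
  }
}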

Basic concepts of DStream

DStream is the basic abstraction of Spark Streaming; it approximately describes a continuous data stream as a sequence of discrete RDDs. A DStream is essentially a hash table with time as the key and RDDs as the values: it stores the RDDs generated in chronological order, and each RDD encapsulates the data received during one batch interval. Spark Streaming adds each newly generated RDD to this hash table and removes RDDs that are no longer needed, so a DStream can also be understood simply as a dynamic, time-keyed sequence of RDDs. Assuming a batch interval of 1 s, figure 5 shows the DStream generated within 4 s.

Spark Streaming programming model and case studies

Spark Streaming programming model

Let's take the official WordCount code provided by Spark Streaming as an example to introduce the use of Spark Streaming.

Example 1:
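The original listing appears only as a figure in the source, so the code below reproduces the official Scala NetworkWordCount example from the Spark documentation as a reference sketch. It matches the five-part description that follows (running mode local[2], application name NetworkWordCount, batch interval Seconds(1)); the hostname localhost and port 9999 are the documentation's defaults rather than values taken from this article.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NetworkWordCount {
  def main(args: Array[String]): Unit = {
    // Running mode local[2], application name NetworkWordCount, batch interval 1 s
    val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Create an InputDStream from a TCP socket source
    val lines = ssc.socketTextStream("localhost", 9999)

    // Split each line on spaces, map each word to (word, 1), and count by key
    val words = lines.flatMap(_.split(" "))
    val pairs = words.map(word => (word, 1))
    val wordCounts = pairs.reduceByKey(_ + _)
    wordCounts.print()

    ssc.start()             // start receiving and processing data
    ssc.awaitTermination()  // wait for the computation to terminate
  }
}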

The functional structure of a Spark Streaming application usually consists of the following five parts, as shown in example 1 above.

Import the Spark Streaming related packages: as a component of the Spark framework, Spark Streaming is well integrated with it. When developing a Spark Streaming application, you only need to import the Spark Streaming related packages, with no additional parameter configuration.

Create the StreamingContext object: like the SparkContext object in a Spark application, the StreamingContext object is the only channel through which a Spark Streaming application interacts with the cluster; it encapsulates the environment information of the Spark cluster and some attributes of the application. When creating this object you usually need to specify the running mode of the application (local[2] in example 1), set the application name (NetworkWordCount in example 1), and set the batch interval (Seconds(1) in example 1, i.e., 1 second). The batch interval should be chosen according to the needs of the user and the processing capacity of the cluster.

Create an InputDStream: Spark Streaming needs to choose an appropriate method to create the DStream according to the type of the data source. In example 1, Spark Streaming calls the socketTextStream method on the StreamingContext object to handle a socket-type data source, creating the DStream named lines. Spark Streaming also supports many other types of data sources, including Kafka, Flume, HDFS/S3, Kinesis, and Twitter.

Operate on the DStream: for a DStream obtained from a data source, the user can apply a rich set of operations. The series of operations on lines in example 1 is a typical WordCount flow: the text data within the current batch interval is split on spaces to obtain words; each word in words is then mapped to a (word, 1) tuple to obtain pairs; finally, the words are counted with the reduceByKey method.

Start and stop the Spark Streaming application: before the Spark Streaming application is started, all operations on the DStream merely define the data processing flow; the program does not actually connect to the data source or do anything with the data. Only when ssc.start() is called do the operations defined in the program actually begin.

A case study of text file data processing

Functional requirements

Monitor the local directory /home/dong/Streamingtext in real time, pick up newly created files (all English text files with words separated by spaces), and count the number of words in each file.

Code implementation
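The original program appears only as a screenshot (figure 7). As a reference, here is a minimal sketch of what such a StreamingFileWordCount object could look like, assuming a 20-second batch interval (to match the results described below) and using textFileStream to monitor the directory; treat it as an illustration rather than the book's exact code.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingFileWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingFileWordCount")
    // Batch interval of 20 s, matching the result description below
    val ssc = new StreamingContext(conf, Seconds(20))

    // Monitor the local directory for newly created text files
    val lines = ssc.textFileStream("file:///home/dong/Streamingtext")

    // Words are separated by spaces in this case
    val wordCounts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    wordCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}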

Run the demo

Step 1, start Hadoop and Spark.

Step 2, create the Streaming monitoring directory.

Create a directory named Streamingtext, to be monitored by Spark Streaming, under the dong user's home directory, as shown in figure 6.

Figure 6. Create a Streamingtext folder under the dong user's home directory

Step 3, edit and run the Streaming program in IntelliJ IDEA. Create the project StreamingFileWordCount in IntelliJ IDEA and edit the object StreamingFileWordCount, as shown in figure 7.

Fig. 7 schematic diagram of StreamingFileWordCount in IntelliJ IDEA

Since this example takes no input parameters, no run configuration is needed; simply right-click in the editor and select "Run 'StreamingFileWordCount'".

Step 4, create text files under the monitored directory. Create file1.txt and file2.txt in /home/dong/Streamingtext on the master node.

The file1.txt content is as follows:

The file2.txt content is as follows:

After creation, the content in /home/dong/Streamingtext is shown in figure 8.

Fig. 8 schematic diagram of Streamingtext folder contents

View the result

During each batch interval (20 seconds), the terminal window outputs the word counts for files newly created in /home/dong/Streamingtext, as shown in figure 9.

Fig. 9 schematic diagram of StreamingFileWordCount running result

Network data processing case

Functional requirements

Monitor the data stream sent to a designated port on the local node (in this case, English text, separated by commas, arriving at port 9999 of the master node), and count the number of words received every 5 seconds.

Code implementation

This case involves two parts: a data flow simulator and an analyzer. To get closer to a real network environment, we first define a data flow simulator that listens on a specified port of a specified node (port 9999 of the master node) in socket mode. When an external program connects to this port and requests data, the simulator periodically sends data randomly selected from a specified text file to that port (every 1 second it randomly picks one line of text from /home/dong/Streamingtext/file1.txt on the master node and sends it to port 9999 of the master node), thereby simulating a continuous stream of data in the network. For the real-time data obtained, an analyzer (a Spark Streaming application) is defined to count the number of words received during each interval (5 seconds).

The code implementation of the data flow simulator is as follows:
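The simulator listing itself is shown only as a figure, so the following is a minimal sketch of such a socket-based simulator; the object name StreamingSimulation and the argument order (text file, port, interval in milliseconds) are assumptions made for illustration.

import java.io.PrintWriter
import java.net.ServerSocket
import scala.io.Source
import scala.util.Random

object StreamingSimulation {
  def main(args: Array[String]): Unit = {
    if (args.length != 3) {
      System.err.println("Usage: StreamingSimulation <file> <port> <millisecond interval>")
      System.exit(1)
    }
    // Read all lines of the source text file once
    val lines = Source.fromFile(args(0)).getLines().toList
    val listener = new ServerSocket(args(1).toInt)
    while (true) {
      val socket = listener.accept()   // block until an analyzer connects
      new Thread() {
        override def run(): Unit = {
          val out = new PrintWriter(socket.getOutputStream, true)
          while (true) {
            Thread.sleep(args(2).toLong)
            // Send one randomly chosen line to the connected analyzer
            out.write(lines(Random.nextInt(lines.length)) + "\n")
            out.flush()
          }
        }
      }.start()
    }
  }
}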

The analyzer code is as follows:
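The analyzer listing appears only in figure 11. The sketch below is consistent with the functional requirement above: the socket hostname and port are taken from the command-line arguments (master and 9999 in this case), the batch interval is 5 seconds, and words are split on commas; running it in local[2] mode inside IntelliJ IDEA is an assumption.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NetworkWordCount {
  def main(args: Array[String]): Unit = {
    // args(0) = socket hostname (master), args(1) = port (9999)
    val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    val ssc = new StreamingContext(conf, Seconds(5))   // 5-second batch interval

    val lines = ssc.socketTextStream(args(0), args(1).toInt)
    // Words are separated by commas in this case
    val wordCounts = lines.flatMap(_.split(",")).map((_, 1)).reduceByKey(_ + _)
    wordCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}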

Run the demo

Step 1, edit the Streaming programs in IntelliJ IDEA. Start IntelliJ IDEA on the master node, create the project NetworkWordCount, and edit the simulator and the analyzer. The simulator is shown in figure 10 and the analyzer in figure 11.

Fig. 10 schematic diagram of data flow simulator in IntelliJ IDEA

Figure 11 schematic diagram of the analyzer in IntelliJ IDEA

Step 2, create the simulator data source file. Create the /home/dong/Streamingtext directory on the master node and create the text file file1.txt in it.

The file1.txt content is as follows:

Step 3, package the data flow simulator. The packaging process is detailed in Section 4.3.3 of this book. In the Artifacts packaging configuration dialog, add the required Scala dependency packages (taken from the user's actual Scala installation directory) to the Class Path, as shown in figure 12.

Figure 12 adding scala dependency packages to Class Path

After packaging, NetworkWordCount.jar is generated in the home directory, as shown in figure 13.

Figure 13 NetworkWordCount.jar generated under the dong user's home directory

Step 4, start the data flow simulator. Open a terminal on the master node and start the data flow simulator with the following command.
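The exact command is not reproduced in this copy of the article. Assuming the simulator class and argument order from the sketch above, and NetworkWordCount.jar packaged into the dong user's home directory, the invocation would look roughly like this:

java -cp /home/dong/NetworkWordCount.jar StreamingSimulation /home/dong/Streamingtext/file1.txt 9999 1000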

Every 1000 milliseconds, the data flow simulator randomly picks a line of text from /home/dong/Streamingtext/file1.txt and sends it to port 9999 of the master node. Before the analyzer connects, the simulator blocks and the terminal shows no output.

Step 5, run the analyzer. Start IntelliJ IDEA on master to edit the analyzer code, then configure the parameters the analyzer needs at run time through the Application run configuration option: the socket hostname is master, the port number is 9999, and the two parameters are separated by a space, as shown in figure 13.

Fig. 13 schematic diagram of analyzer parameter configuration

After configuring the parameters, return to the IntelliJ IDEA menu bar and click "Run" to start the analyzer.

View the result

Step 1, check the data flow simulator's output on master. When the analyzer is run in IntelliJ IDEA, it establishes a connection with the data flow simulator. Once the external connection is detected, the simulator randomly picks a line of text from /home/dong/Streamingtext/file1.txt every 1000 milliseconds and sends it to port 9999 on the master node. For ease of explanation and illustration, each line in file1.txt contains only one word, so the simulator sends only one word at a time, as shown in figure 14.

Figure 14 results of running the simulator on master

Step 2, check the analyzer's output in IntelliJ IDEA on master. The statistical results can be observed in IntelliJ IDEA's run log window. They show that Spark Streaming receives 5 words in each batch interval, exactly the number of words sent in 5 seconds, and that the words are counted correctly, as shown in figure 15.

Stateful application case

In many practical application scenarios involving data streams, analyzing the current data requires the results of earlier processing. For example, an e-commerce site reports the cumulative sales of a commodity every 10 minutes, or a station reports its cumulative passenger flow every 3 hours. Requirements of this kind can be implemented with Spark Streaming's stateful transformation operations.

Functional requirements

Listen to the data stream sent to a designated port on a node in the network (English text, separated by commas, arriving at port 9999 of the slave1 node), and count the cumulative number of occurrences of each word every 5 seconds.

Code implementation

Implementing this case again involves two parts: the data flow simulator and the analyzer.

Analyzer code:
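The analyzer listing appears only in figure 16. The following is a minimal sketch consistent with the requirement: a 5-second batch interval, comma-separated words, updateStateByKey to maintain the running totals, and a checkpoint directory matching the HDFS path shown in the results below; the hostname and port come from the command-line arguments.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StatefulWordCount {
  def main(args: Array[String]): Unit = {
    // Merge this batch's counts for a word into its running total kept as state
    val updateFunc = (values: Seq[Int], state: Option[Int]) => {
      Some(values.sum + state.getOrElse(0))
    }

    val conf = new SparkConf().setAppName("StatefulWordCount")
    val ssc = new StreamingContext(conf, Seconds(5))   // 5-second batch interval
    // Stateful operations require checkpointing (see the HDFS directory below)
    ssc.checkpoint("/user/dong/input/StatefulWordCountlog")

    // args(0) = hostname (slave1), args(1) = port (9999); words separated by commas
    val lines = ssc.socketTextStream(args(0), args(1).toInt)
    val wordCounts = lines.flatMap(_.split(","))
      .map((_, 1))
      .updateStateByKey[Int](updateFunc)
    wordCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}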

Run the demo

Step 1, start the data flow simulator on the slave1 node.

Step 2, package the analyzer. Start IntelliJ IDEA on the master node, create the project StatefulWordCount, and edit the analyzer, as shown in figure 16; then package the analyzer directly into the home directory of the dong user on the master node, as shown in figure 17.

Fig. 16 schematic diagram of StatefulWordCount in IntelliJ IDEA

Figure 17 schematic diagram of StatefulWordCount.jar on master

Step 3, run the analyzer. Open a terminal on the master node and submit the application to the Spark cluster with the following command.
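The original submission command is not shown here. Assuming a standalone cluster whose master URL is spark://master:7077, and the main class and jar location described above, the command would be roughly:

spark-submit --master spark://master:7077 --class StatefulWordCount /home/dong/StatefulWordCount.jar slave1 9999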

View the result

Step 1, check the data flow simulator's output on slave1. The analyzer runs on the cluster and establishes a connection with the data flow simulator running on slave1. Once the external connection is detected, the simulator randomly picks a line of text from /home/dong/Streamingtext/file1.txt every 1000 milliseconds and sends it to port 9999 on the slave1 node. Because each line in the text file contains only one word, only one word is sent per second, as shown in figure 18.

Fig. 18 schematic diagram of data flow simulator on slave1

Step 2, check the analyzer's output on master. The statistical results can be viewed in the submission window on the master node, as shown in figure 19.

Figure 19 schematic diagram of the operation of the analyzer on master

The figure shows that by 147920770500 ms the analyzer had received a total of 14 words, of which "spark" appeared 3 times, "hbase" 5 times, "hello" 3 times, and "world" 3 times. Because the batch processing interval is 5 s and the simulator sends one word per second, the analyzer receives 5 more words every 5 s. So by 147920771000 ms the analyzer had received a total of 19 words, of which "spark" appeared 5 times, "hbase" 7 times, "hello" 4 times, and "world" 3 times.

Step 3, check the persistence directory in HDFS. After the run, look at the persistence directory /user/dong/input/StatefulWordCountlog on HDFS, as shown in figure 20. The Streaming application persists the received network data to this directory for fault tolerance.

Figure 20 schematic diagram of persistent directory on HDFS

Window application case

In real production environments, window-related application scenarios are very common; for example, an e-commerce site counts the sales of a commodity over the previous 30 minutes every 10 minutes, or a station counts the passenger flow over the previous 3 hours every hour. Requirements of this kind can be implemented with the window-related operations in Spark Streaming. A window application involves the batch interval, the window interval, and the sliding interval.

Functional requirements

Listen to the data stream sent to a designated port on a node in the network (English text, separated by commas, arriving at port 9999 of the slave1 node), and every 10 seconds count the number of words that appeared during the previous 30 seconds.

Code implementation

The implementation of this example involves two parts: the data flow simulator and the analyzer.

Analyzer code:
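The analyzer listing appears only in figure 21. The sketch below matches the requirement by using reduceByKeyAndWindow with a 30-second window and a 10-second slide; the 10-second batch interval is an assumption (any common divisor of the window and sliding intervals would do), and the checkpoint directory matches the HDFS path shown in the results below.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WindowWordCount")
    // Batch interval assumed to be 10 s; the window and sliding intervals must be multiples of it
    val ssc = new StreamingContext(conf, Seconds(10))
    // Checkpoint directory (see the HDFS directory below)
    ssc.checkpoint("/user/dong/input/WindowWordCountlog")

    // args(0) = hostname (slave1), args(1) = port (9999); words separated by commas
    val lines = ssc.socketTextStream(args(0), args(1).toInt)
    val wordCounts = lines.flatMap(_.split(","))
      .map((_, 1))
      // Every 10 s, count the words received over the past 30 s
      .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
    wordCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}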

Run the demo

Step 1, start the data flow simulator on the slave1 node.

Step 2, package the analyzer. Start IntelliJ IDEA on the master node, create the project WindowWordCount, and edit the analyzer, as shown in figure 21; then package the analyzer directly into the home directory of the dong user on the master node, as shown in figure 22.

Fig. 21 schematic diagram of WindowWordCount in IntelliJ IDEA

Fig. 22 schematic diagram of WindowWordCount.jar on master

Step 3, run the analyzer. Open a terminal on the master node and submit the application to the Spark cluster with the following command.
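As with the stateful case, the submission command is not reproduced; under the same assumptions (standalone master URL spark://master:7077, jar in the dong user's home directory) it would be roughly:

spark-submit --master spark://master:7077 --class WindowWordCount /home/dong/WindowWordCount.jar slave1 9999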

View the result

Step 1, check the data flow simulator's output on slave1. The analyzer runs on the cluster and establishes a connection with the data flow simulator running on slave1. Once the external connection is detected, the simulator randomly picks a line of text from /home/dong/Streamingtext/file1.txt every 1000 milliseconds and sends it to port 9999 of the slave1 node. Because each line in the text file contains only one word and one comma, only one word and one comma are sent per second, as shown in figure 23.

Fig. 23 schematic diagram of data flow simulator running on slave1

Step 2, check the analyzer's output on master. The statistical results can be viewed in the submission window on the master node. Early in the run of the WindowWordCount application the window is not yet filled with received words, but as time goes by the number of words in each window eventually stabilizes at 30. Figure 24 shows only three batches of the results. Because the window interval is 30 s, the sliding interval is 10 s, and the data flow simulator sends one word per second, WindowWordCount counts, every 10 s, the words received over the past 30 s. In figure 24, at 1479276925000 ms the analyzer counted 30 words received in the past 30 seconds, of which "spark" appeared 5 times, "hbase" 8 times, "hello" 9 times, and "world" 8 times. Ten seconds later, at 1479276935000 ms, the analyzer again counted 30 words received in the past 30 seconds, including "spark" 8 times, "hbase" 9 times, "hello" 7 times, and "world" 6 times.

Figure 24 schematic diagram of the operation of the analyzer on master

Step 3, check the persisted data. After the run, look at the persistence directory /user/dong/input/WindowWordCountlog on HDFS, as shown in figure 25. The Streaming application persists the received network data to this directory for fault tolerance.
