
Lesson 82: Spark Streaming Lesson 1: hands-on case study and understanding how it works in a flash of lightning


The contents of this issue:

1. Spark Streaming hands-on demonstration

2. Understand how Spark Streaming works, in a flash of lightning

We put the case into practice and understand how it works in a flash of lightning.

Streaming: in the era of big data, data processing is like the flow of water, and data arrives as a continuous stream. Since we are processing a data stream, we naturally think of three stages: data flowing in, data being processed, and data flowing out.

Data comes from many different sources in daily work and life. In the industrial age, automobile manufacturing, monitoring devices and industrial equipment generate large amounts of source data; in the information age, e-commerce websites, log servers, social networks, financial trading systems, hacking attacks, spam and traffic monitoring do the same; and in the communication era, mobile phones, tablets, smart devices and the Internet of Things produce large amounts of real-time data. Data streams are everywhere.

What can Spark Streaming do in the era of big data?

Most users have shopped online. Every operation a user performs on the site can be monitored with Spark Streaming stream processing, and the user's shopping preferences, points of interest, transactions and so on can be analyzed. In the financial field, Spark Streaming can be used to monitor accounts with large transaction volumes to help prevent money laundering, illegal transfer of assets, fraud and so on. In network security, attacks happen from time to time; with Spark Streaming a certain class of suspicious IPs can be monitored and, combined with a trained machine learning model, each request can be matched against the model to decide whether it is an attack. In other areas, such as spam monitoring and filtering, traffic monitoring, network monitoring and industrial equipment monitoring, Spark Streaming is also doing strong stream processing work behind the scenes.

In the era of big data, how do we define the value of data?

Data that has not been processed as a stream is mostly stale and of little value: the value produced is greatest when data is processed immediately after it is generated, and the longer the data sits, the lower its usable value becomes. In the past, the vast majority of e-commerce websites made their profits from network traffic (the number of user visits). Today an e-commerce site needs to pay attention not only to traffic and transaction volume, but also to flowing all kinds of its data through stream processing technology, analyzing the data in time as it flows and mining all sorts of value from it. For example: building user profiles for users with different transaction volumes so as to provide different levels of service, and recommending relevant information in time to users visiting the site.

Spark Streaming vs. Hadoop MapReduce:

Spark Streaming is a quasi-real-time streaming framework, while Hadoop MapReduce is an offline, batch framework; in terms of extracting value from data in time, Spark Streaming clearly beats Hadoop MapReduce.

Spark Streaming vs. Storm:

Spark Streaming is a quasi-real-time stream processing framework: its latency for processing real-time data is on the order of seconds, while Storm is a true real-time stream processing framework with millisecond-level response. Which streaming framework to choose therefore depends on the concrete business scenario. One thing should be clarified: many people now believe that Spark Streaming stream processing is unstable, loses data and has poor transactional support, but this is usually because they have not mastered Spark Streaming and Spark itself. As for latency, DT_Spark Big Data DreamWorks's upcoming customized version of Spark will push Spark Streaming latency from seconds down to below 100 ms.

Advantages of Spark Streaming:

1. It provides a rich API, with which all kinds of complex business logic can be implemented quickly in the enterprise.

2. The data flowing into Spark Streaming can be combined with machine learning algorithms and with graph computation (see the sketch after this list).

3. Spark Streaming builds on Spark's excellent lineage.
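
As a hedged illustration of point 2 (not code from the original post), Spark's MLlib ships a StreamingKMeans model that can be trained directly on a DStream; the socket address, feature dimension and parameters below are assumptions chosen for the sketch.

import org.apache.spark.SparkConf
import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingClusteringSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setMaster("local[2]").setAppName("StreamingClusteringSketch"), Seconds(1))

    // Hypothetical socket source emitting comma-separated numeric features, e.g. "1.0,2.0,3.0"
    val points = ssc.socketTextStream("localhost", 9998)
      .map(line => Vectors.dense(line.split(",").map(_.toDouble)))

    // A k-means model that keeps updating its cluster centers as each batch arrives
    val model = new StreamingKMeans()
      .setK(2)                      // number of clusters (assumed)
      .setDecayFactor(1.0)          // how much weight historical data keeps vs. new data
      .setRandomCenters(3, 0.0)     // 3-dimensional features, randomly initialized centers

    model.trainOn(points)           // update the model on every batch
    model.predictOn(points).print() // print the cluster assigned to each incoming point

    ssc.start()
    ssc.awaitTermination()
  }
}

Graph computation follows a similar pattern: inside foreachRDD, each batch's RDD can be handed to GraphX to build or update a graph.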

Can Spark Streaming process data one record at a time, just like Storm?

Storm processes data one record at a time, while Spark Streaming processes data in small batches per unit of time. Can Spark Streaming behave like Storm? The answer is: yes.

The common practice in industry is to achieve this effect by pairing Spark Streaming with Kafka:

Kafka is widely recognized in the industry as the most mainstream distributed messaging framework; it supports both the message broadcast (publish-subscribe) pattern and the message queue pattern.

Technologies used internally by Kafka:

1. Cache

2. Interface

3. Persistence (messages are retained for one week by default; see the broker setting after this list)

4. Zero-copy: data only needs to be loaded into the kernel once and can then be served to other applications without extra copies, which lets Kafka move hundreds of megabits per second.
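
As an illustration of item 3, the one-week default corresponds to the standard broker-side retention setting in server.properties (the value shown is the stock default):

# Kafka broker configuration: keep log segments for 168 hours (7 days) by default
log.retention.hours=168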

External data sources push (Push) data into Kafka, and Spark Streaming then pulls (Pull) the data from Kafka; how much data is pulled can be decided according to the application's own situation, that is, how much data it can actually process each second.
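
Below is a hedged sketch of the Spark side of this Push/Pull arrangement, written against the spark-streaming-kafka-0-10 integration; the broker address, topic name and consumer group are placeholders, not values from the original post.

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

object KafkaPullDemo {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setMaster("local[2]").setAppName("KafkaPullDemo"), Seconds(1))

    // Producers Push data into Kafka; Spark Streaming Pulls it from these brokers
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",            // placeholder broker address
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "spark-streaming-demo",               // placeholder consumer group
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Array("demo-topic"), kafkaParams))

    // Word count over the pulled records, batch by batch
    stream.map(_.value)
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}

On the pull side, the number of records fetched per second can be capped with spark.streaming.kafka.maxRatePerPartition (see the tuning settings near the end of this post).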

Hands-on WordCount example with Spark Streaming

Here we run a Spark Streaming program that counts the words flowing in during each interval, that is, how many times each word appears within the specified time period.

1. Start the Spark cluster first:

Let's take the WordCount example from the official website and run it on the cluster.

Accepting this data and processing it is exactly what stream processing is. The WordCount just run uses 1 second as its unit (its batch interval).
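
A minimal sketch of such a socket-based WordCount in Scala is shown below; the host, port and 1-second batch interval are assumptions chosen to match the nc data source used in the next step.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NetworkWordCountSketch {
  def main(args: Array[String]): Unit = {
    // Batch interval of 1 second: the framework cuts the stream into 1-second batches
    val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCountSketch")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Listen on a socket; this matches the nc -lk 9999 data source started below
    val lines = ssc.socketTextStream("localhost", 9999)

    // Classic word count over each 1-second batch
    val wordCounts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    wordCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}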

Why was there no result when it was running just now? Because there was no data source yet.

2. Obtain the data source:

Open a new command terminal and enter:

$ nc -lk 9999

Now let's type some data into that data-source terminal so the program has something to process:

Then press Enter to send it.
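
For example (illustrative input and approximate output layout), typing this line into the nc terminal and pressing Enter:

hello spark hello streaming

should make the next one-second batch print word counts along these lines in the Spark console:

-------------------------------------------
Time: <batch timestamp> ms
-------------------------------------------
(hello,2)
(spark,1)
(streaming,1)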

Relationship between DStream and RDD:

With no input data, it just prints an empty result:

In fact, the execution of a Job is triggered by the Spark Streaming framework itself and has nothing to do with the business logic in the Spark code the developer writes. The interval at which the framework triggers execution can be configured, for example so that a Job is generated every second. Therefore, writing Spark code (such as flatMap, map, collect) does not by itself cause a Job to run; Jobs are generated by the Spark Streaming framework at the configured interval.

The data flowing into Spark Streaming is a DStream, yet the Spark Core framework only understands RDDs. Isn't that a contradiction?

In the Spark Streaming framework, Job instances are generated from RDD instances. The code you write is a template for the Job; more precisely, the DStream (discretized stream) is a template for RDDs, and the RDDs produced from it are the template for the Job. Once the framework instantiates the template into an RDD for a batch, that RDD is executed, and an action triggers the actual processing of the data. If the RDDs have dependencies between them, the corresponding DStreams have dependencies too, and together they form a DStream directed acyclic graph; this DAG is itself a template. Spark Streaming is only a thin layer on top of RDDs. The code you write cannot generate a Job on its own; only the framework generates Jobs.
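
To make this concrete, the hedged sketch below (assumed names, host and port, not the author's code) uses foreachRDD to expose the RDD that the framework instantiates from the DStream template at every batch interval:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamTemplateDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("DStreamTemplateDemo")
    val ssc = new StreamingContext(conf, Seconds(1))   // the framework generates a Job every second

    // Writing these transformations does NOT run anything: they only build the DStream
    // "template" and its lineage (the DStream DAG).
    val counts = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // At each batch interval the framework turns the template into a concrete RDD,
    // and only then does a Job actually run on it.
    counts.foreachRDD { (rdd, time) =>
      println(s"Batch at $time instantiated an RDD with ${rdd.count()} records")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}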

If the data cannot be processed within one second (one batch interval), the only option is to tune the application.
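
A few standard Spark configuration properties are commonly used for this kind of tuning; the values below are illustrative placeholders, not settings from the original post:

import org.apache.spark.SparkConf

// Illustrative tuning knobs attached to the SparkConf used to build the StreamingContext
val tunedConf = new SparkConf()
  .setAppName("TunedStreamingApp")                              // hypothetical app name
  .set("spark.streaming.backpressure.enabled", "true")          // let Spark throttle ingestion automatically
  .set("spark.streaming.kafka.maxRatePerPartition", "1000")     // cap records/sec pulled per Kafka partition
  .set("spark.streaming.receiver.maxRate", "10000")             // cap records/sec for receiver-based sources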

Author: Jiang Wei and his IMF-Spark Streaming Enterprise Development Practice Team

Chief Editor: Wang Jialin

Member blog address:

No. 1: Jiang Wei and his IMF-Spark Streaming enterprise development team http://www.cnblogs.com/sparkbigdata/p/5403963.html

Note:

Source: DT Big Data DreamWorks (IMF Legendary Action top-secret course)

For more exclusive content, please follow the WeChat official account: DT_Spark

If you are interested in big data and Spark, you can listen to the permanently free public Spark course offered by teacher Wang Jialin at 20:00 every evening; YY room number: 68917580
