From Storm to Spark Streaming and on to Flink, stream computing has made great progress. Spark Streaming, which is built on the Spark platform, has taken its own path: it borrows Spark's batch architecture and implements real-time processing as a sequence of small batches. To help people learn more about Spark Streaming, on the evening of March 20th, Pegasus invited Wang Fuping, a former senior big data engineer at Baidu, to share the advanced features of Spark Streaming and NDCG computation practice in an online live broadcast.
The following are the main contents of this live broadcast:
I. Introduction to Spark Streaming
1. What is Spark?
Spark is a batch processing framework; its advantages are high performance and a rich ecosystem.
How did we analyze big data before Spark? We used the Hadoop-based MapReduce framework. Even today, traditional MapReduce jobs have not completely left the stage: in some scenarios with very large data volumes, MapReduce performance is still quite stable.
2. What is Spark Streaming?
Spark Streaming is a framework that processes data in batches over time, and the advantages of the Spark platform make Spark Streaming development simple and widely applicable.
Spark Streaming is implemented on top of Spark's batch-processing philosophy, so it can directly use the tools and components provided by the Spark platform.
Conceptually, the input to Spark Streaming is a data stream that is cut into batches over time, with the batch size chosen according to the needs of the business.
3. A WordCount example:
In the WordCount example, the whole job takes only a few lines of code. Because the Spark platform integrates directly with Hadoop, we can easily save results to HDFS or a database. And by operating a single Spark platform we can run both real-time tasks and offline analysis tasks, which is convenient.
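The slide itself is not reproduced here; a minimal sketch of such a job, assuming a socket text source on an illustrative host and port, would look roughly like this:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCount")
    // One batch every second.
    val ssc = new StreamingContext(conf, Seconds(1))
    // Illustrative source: a plain text socket.
    val lines = ssc.socketTextStream("localhost", 9999)
    val pairs = lines.flatMap(_.split(" ")).map(word => (word, 1))
    val counts = pairs.reduceByKey(_ + _)
    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```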
II. Advanced features of Spark Streaming
1. The Window feature:
Let's upgrade the simple WordCount example above: suppose that every ten seconds we need to report how many times each word appeared during the previous minute. Plain WordCount cannot express this requirement, so we turn to the Window mechanism provided by Spark Streaming.
Three parameters matter for Spark Streaming's Window feature: the Batch Interval, the Window Length, and the Sliding Interval. For the requirement above, the window length is 60s, the sliding interval is 10s, and the batch interval is 1s. Note that the window length and the sliding interval must both be integer multiples of the batch interval. A sketch of this windowed count appears below.
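Continuing the WordCount sketch above (the `pairs` stream of `(word, 1)` tuples), a hedged sketch of how the three parameters map onto code:

```scala
import org.apache.spark.streaming.Seconds

// Count words seen in the last 60 seconds, recomputed every 10 seconds.
// Both durations are multiples of the 1s batch interval set earlier.
val windowedCounts = pairs.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b, // merge counts within the window
  Seconds(60),               // window length
  Seconds(10)                // sliding interval
)
windowedCounts.print()
```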
The description may make the Window feature sound a little complex, but creating a windowed stream is actually very simple. The two diagrams shown in the talk covered creating a windowed data stream and the window-related computation functions, and they are easy to follow.
The next example computes the request failure rate over a 30-second window. Looking at its parameters, the window length is set to 30s and the sliding interval to 2s. The whole thing is very short: only one extra line of code is needed to create the windowed stream, after which ordinary computations can be applied to it.
Reading the function briefly: it first creates a windowed stream, then counts the failed records in the window, divides by the total number of records, and obtains the request failure rate. A hedged reconstruction follows.
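The original code was shown as a slide; this is a sketch under the assumption that the input is a hypothetical DStream of HTTP-style status codes:

```scala
import org.apache.spark.streaming.Seconds

// Hypothetical input: statusCodes is a DStream[Int] of response codes.
val windowed = statusCodes.window(Seconds(30), Seconds(2))
windowed.foreachRDD { rdd =>
  val total  = rdd.count()
  val failed = rdd.filter(_ >= 400).count() // assume >= 400 means failure
  if (total > 0) println(s"failure rate: ${failed.toDouble / total}")
}
```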
2. The SQL feature:
The second feature of Spark Streaming is SQL: once the data is wrapped into a DataFrame, SQL can be used on it naturally.
To make full use of SQL, we first register a temporary table. The temporary tables we register can also be joined with other temporary tables we have built, which is quite practical. A sketch of the pattern follows.
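The exact tables and columns from the talk are not known; as an illustrative sketch, foreachRDD can bridge a stream into Spark SQL like this:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical shape of one record.
case class Event(user: String, status: Int)

// events is a hypothetical DStream[Event].
events.foreachRDD { rdd =>
  val spark = SparkSession.builder
    .config(rdd.sparkContext.getConf).getOrCreate()
  import spark.implicits._
  rdd.toDF().createOrReplaceTempView("events") // register the temp table
  spark.sql(
    """SELECT user, COUNT(*) AS requests
      |FROM events
      |GROUP BY user""".stripMargin).show()
}
```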
With SQL in place, user-defined functions bring a lot of extensibility. There are two ways to define a UDF: loading it from a jar package, or defining it dynamically; the dynamic route is sketched below.
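A minimal sketch of a dynamically defined UDF, reusing the illustrative `events` view and SparkSession from the previous sketch (the function name and logic are assumptions, not from the talk):

```scala
// Register an inline UDF; once registered it is callable from SQL.
spark.udf.register("is_failed", (status: Int) => status >= 400)

spark.sql(
  """SELECT SUM(CASE WHEN is_failed(status) THEN 1 ELSE 0 END) / COUNT(*)
    |         AS failure_rate
    |FROM events""".stripMargin).show()
```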
3. The CheckPoint mechanism:
Spark uses CheckPoint to save processing state and even the data currently being processed; if a task fails, the data can be recovered from the CheckPoint. In data processing, reliability matters greatly: we must make sure no data is lost, and Spark's CheckPoint mechanism is what helps guarantee data safety.
There are two main kinds of CheckPoint: metadata checkpointing and data checkpointing.
So how do we implement the CheckPoint mechanism? The talk listed three conditions; a minimal setup is sketched below.
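The slide with the conditions is not reproduced here; as a hedged sketch, enabling checkpointing usually involves setting a checkpoint directory and creating the context through StreamingContext.getOrCreate, so that a restarted driver recovers from the checkpoint (the path is illustrative):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///checkpoints/my-app" // illustrative path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("checkpointed-app")
  val ssc = new StreamingContext(conf, Seconds(1))
  ssc.checkpoint(checkpointDir) // where metadata and state are saved
  // ... build the DStream graph here ...
  ssc
}

// Fresh start: calls createContext(). After a driver failure:
// rebuilds the context from the checkpoint data instead.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
```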
Comparing the two pictures, with and without WAL (write-ahead log): with WAL enabled, received data is first saved to HDFS and the task logic is backed up before processing is performed, so when a task fails, the data saved on HDFS is read back according to the CheckPoint and the task is restored. This has drawbacks, though: it reduces receiver performance, and it can only guarantee At-Least-Once semantics, not Exactly-Once.
To address the shortcomings of WAL, Spark Streaming optimized its Kafka integration and provides the Kafka direct API, which greatly improves performance. A sketch of the direct API follows.
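Assuming the spark-streaming-kafka-0-10 integration (the broker address, group id, and topic below are illustrative), a direct stream is created like this:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "demo-group",
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

// No receiver, no WAL: each batch reads its Kafka offset range directly.
val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))
val values = stream.map(_.value)
```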
III. Calculating the NDCG metric
1. What is NDCG?
NDCG (Normalized Discounted Cumulative Gain) is a standard metric for evaluating ranking quality. The two pictures shown in the talk gave concrete examples of NDCG calculation; the standard definition is given below for reference.
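The example slides are not reproduced here. For reference, the usual definition of DCG and NDCG at cutoff k, where rel_i is the graded relevance of the result at position i:

```latex
\mathrm{DCG@}k = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i + 1)},
\qquad
\mathrm{NDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k}
```

IDCG@k is the DCG of the ideal, relevance-descending ordering, so NDCG always lies in [0, 1].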
2. Implementing NDCG in Spark Streaming:
How do we implement NDCG computation with Spark Streaming? First of all, we did a survey of the data.
Then the NDCG calculation itself starts; its per-query core is sketched below.
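The actual pipeline from the talk is not shown; as a sketch, this is the per-query computation such a job would run (plain Scala, names illustrative):

```scala
// DCG of a ranked list of graded relevance labels, cut off at k.
def dcg(rels: Seq[Double], k: Int): Double =
  rels.take(k).zipWithIndex.map { case (rel, i) =>
    (math.pow(2, rel) - 1) / (math.log(i + 2) / math.log(2)) // log2(pos + 1)
  }.sum

// NDCG = DCG of the actual ranking / DCG of the ideal ranking.
def ndcg(rels: Seq[Double], k: Int): Double = {
  val ideal = dcg(rels.sorted(Ordering[Double].reverse), k)
  if (ideal == 0.0) 0.0 else dcg(rels, k) / ideal
}

// e.g. ndcg(Seq(3.0, 2.0, 0.0, 1.0), 4) for one query's results
```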
3. Guaranteeing NDCG performance:
Developing a data task is not a one-off, static piece of work. To keep the data pipeline stable, we make a capacity estimate based on the shape of the data so that performance is guaranteed; capacity estimation is an essential step, and it drives our most common capacity adjustments, sketched below.
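The talk's actual numbers are not known; as an illustration of the kind of knobs such an estimate usually translates into (all values below are made up):

```scala
import org.apache.spark.SparkConf

// Illustrative capacity settings; real values come from the estimate.
val conf = new SparkConf()
  .setAppName("ndcg-job")
  .set("spark.executor.instances", "8") // parallel executors
  .set("spark.executor.cores", "4")     // cores per executor
  .set("spark.executor.memory", "8g")   // memory per executor
  // Cap per-partition Kafka ingest so a backlog cannot overwhelm a batch.
  .set("spark.streaming.kafka.maxRatePerPartition", "1000")
```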
In the course of NDCG computation we also run into another problem: NDCG has to support combinations of four dimensions, and the dimension combinations grow more and more complex.
At that point, multidimensional analysis has to rely on an OLAP engine; we are currently using Druid.
The three parts above are the main content of this online live share. At the end, Mr. Wang also answered the audience's questions one by one. What were the questions? Let's take a look.
1. We read a batch of data every 5 seconds and need to traverse the day's data for all kinds of calculation and analysis, and the calculation results also need to be cached as a reference for the next round. How should this be done?
Mr. Wang: this is a stateful real-time task. If you need to store state data, there are several ways to implement it. The first is Spark Streaming's own mechanism for saving state across batches. The second is to keep the state in an external KV database and maintain it yourself. Whichever way you choose, the key lies in how you implement it. The first option is sketched below.
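As a hedged sketch of the first option, Spark Streaming's updateStateByKey carries a running value across batches (the input stream and fold logic are illustrative; a checkpoint directory is required):

```scala
import org.apache.spark.streaming.dstream.DStream

// State keeping requires checkpointing (path is illustrative).
ssc.checkpoint("hdfs:///checkpoints/state")

// Hypothetical input: measurements is a DStream[(String, Double)].
val running: DStream[(String, Double)] =
  measurements.updateStateByKey[Double] { (batch: Seq[Double], prev: Option[Double]) =>
    // Fold this batch into the cached result from earlier batches.
    Some(prev.getOrElse(0.0) + batch.sum)
  }
running.print()
```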
2. Is there a recommended way to get started with Spark?
Mr. Wang: let's not look at Spark as something magical. The stream-processing support in Java 8 is not very different from writing Spark code; the principle is the same. Once you understand how Java 8 streams are written, with their various methods and computation logic, you can understand the computation logic in Spark Streaming. The only genuinely advanced thing about Spark Streaming is that it is distributed.
3. What technology is most likely to replace Spark Streaming in the future?
Mr. Wang: each platform has its own strengths and weaknesses. Although Flink is relatively popular at the moment, Storm still exists, Spark has scenarios that suit it, and Flink has its own advanced mechanisms, so each has its own advantages.
Finally, Mr. Wang recommended the classic Scala book, "Programming in Scala". This live broadcast on Spark Streaming was concise and targeted, and I believe everyone gained a lot from it. Those who want more details can follow the service account FMI Pegasus and click Pegasus Live in the menu bar to watch.