Sample Analysis of Lambda Architecture of Apache Spark

2025-01-19 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/01 Report--

This article walks through a sample analysis of the Lambda architecture with Apache Spark. Many people run into exactly these situations in real-world work, so let's go through how to handle them step by step. I hope you read it carefully and take something away!

A brief history of Apache Hadoop

Apache Hadoop was formally introduced by the Apache Software Foundation in the fall of 2005 as part of Nutch, then a sub-project of Lucene. It was inspired by MapReduce and the Google File System (GFS) developed at Google Labs, and has now been an independent project for over ten years.

At present, many customers have implemented Hadoop-based M/R pipelines and have been running them successfully, for example:

An Oozie workflow runs daily, processing more than 150 TB of data and generating analytics reports.

A Bash-driven workflow runs daily, processing more than 8 TB of data and generating analytics reports.

2016 is coming!

The business reality changed in 2016: the faster decisions are made, the more valuable they tend to be. The technology itself also kept evolving: Kafka, Storm, Trident, Samza, Spark, Flink, Parquet, Avro, cloud providers, and more became engineers' buzzwords.

As a result, a modern Hadoop-based M/R pipeline might look like this:

The diagram of this M/R pipeline looks good, but it is still essentially traditional batch processing, with all of the drawbacks of traditional batch processing: as new data keeps flowing into the system, it still takes a long time to process.

Lambda architecture

To solve the above problems, Nathan Marz proposed a generic, scalable, and fault-tolerant data processing architecture, the Lambda architecture, which handles large amounts of data by combining batch and streaming methods. Nathan Marz's book introduces the Lambda architecture in detail, starting from first principles.

Layer structure

This is the top-down layer structure of the Lambda architecture:

After entering the system, all data is dispatched to both the batch layer and the speed layer for processing. The batch layer manages the master dataset (an immutable, append-only dataset of raw data) and precomputes the batch views. The serving layer indexes the batch views so that low-latency ad-hoc queries can be made. The speed layer processes only the most recent data. Every query result must be produced by combining results from the batch views and the real-time views.

Key points

Many engineers think that the Lambda architecture consists only of the layer structure and the data flow definitions, but Nathan Marz's book highlights a few other important points:

Distributed thinking

Avoiding incremental architecture

Immutability of data

Creating recomputation algorithms

Data correlation

As mentioned earlier, any query result must be produced by merging results from the batch view and the real-time view, so these views must be mergeable. One thing to note here is that the real-time view is a function of the previous real-time view and the new data increment, so an incremental algorithm is used there, while the batch view is a function of all of the data, so a recomputation algorithm should be used there.
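A minimal plain-Python sketch of the property described above (no Spark involved; the function and variable names here are mine, not from the article): an incremental update of a view and a full recomputation over the complete dataset must agree, otherwise the merged query results would drift apart over time.

```python
from collections import Counter

def recompute_view(all_events):
    """Batch-layer style: the view is a function of the complete dataset."""
    return Counter(all_events)

def incremental_update(previous_view, new_events):
    """Speed-layer style: the view is a function of the previous view plus new data."""
    updated = Counter(previous_view)
    updated.update(new_events)
    return updated

history = ["spark", "hadoop", "spark"]
increment = ["spark", "kafka"]

# Incrementally updating the old view matches recomputing from scratch
# over the full history; this is what makes the two layers mergeable.
assert incremental_update(recompute_view(history), increment) \
    == recompute_view(history + increment)
```

Counting is trivially mergeable; for non-additive statistics this equivalence is exactly what has to be designed for.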

Tradeoff

Everything in the world develops through constant compromise and trade-off, and the Lambda architecture is no exception. In general, we need to address several major trade-offs:

Full recomputation vs. partial recomputation

In some cases, a Bloom filter can be used to avoid full recomputation.

Recomputation algorithms vs. incremental algorithms

Incremental algorithms are very attractive, but the guidelines sometimes require us to use recomputation algorithms, even when it is harder to achieve the same results.

Additive algorithms vs. approximate algorithms

Although the Lambda architecture works well with additive algorithms, some cases are better served by approximate algorithms, such as using HyperLogLog for count-distinct problems.
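To make the approximate-algorithm trade-off concrete, here is a toy HyperLogLog sketch in plain Python (my own illustrative code, not a production implementation and not from the article): it estimates the number of distinct items using a fixed 2**p registers instead of remembering every item.

```python
import hashlib
import math

def hll_estimate(items, p=8):
    """Toy HyperLogLog: estimate count-distinct with 2**p one-byte registers."""
    m = 1 << p
    registers = [0] * m
    for item in items:
        h = int(hashlib.md5(str(item).encode()).hexdigest(), 16)
        idx = h & (m - 1)          # low p bits choose a register
        rest = h >> p              # remaining bits determine the rank
        rank = 1                   # 1-based position of the first 1-bit
        while rest & 1 == 0 and rank < 64:
            rest >>= 1
            rank += 1
        registers[idx] = max(registers[idx], rank)
    alpha = 0.7213 / (1 + 1.079 / m)           # bias correction for large m
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    if raw <= 2.5 * m:                         # linear-counting correction
        zeros = registers.count(0)
        if zeros:
            raw = m * math.log(m / zeros)
    return raw

# 1000 distinct items, each seen 3 times: the true distinct count is 1000,
# and the estimate lands close to it with only 256 registers of memory.
estimate = hll_estimate(list(range(1000)) * 3)
```

Note that the register arrays of two sketches can be merged with an element-wise max, which is exactly the mergeability property the Lambda architecture asks for.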

Implementation

There are many ways to implement the Lambda architecture, because the underlying technology for each layer can be chosen independently. Each layer has specific requirements that help narrow the choices and avoid over-engineering the decision:

Batch layer: write once, read multiple times in batch

Service layer: supports random reads but not random writes; batch computing and bulk writes

Speed layer: random read and write; incremental calculation

For example, one of the implementations (using Kafka, Apache Hadoop, Voldemort, Twitter Storm, and Cassandra) might look like this:

Apache Spark

Apache Spark can be seen as an integrated solution for all layers of the Lambda architecture. Spark Core provides high-level APIs and an optimized engine that supports general execution graphs, Spark SQL handles SQL and structured data processing, and Spark Streaming supports scalable, high-throughput, fault-tolerant stream processing of live data streams. Of course, batch processing with Spark may be expensive, and not every scenario and dataset is a good fit. But overall, Apache Spark is a reasonable implementation of the Lambda architecture.

Sample application

Let's create a sample application to demonstrate the Lambda architecture. The main purpose of this example is to count the hashtags in #morningatlohika tweets from some point in time up to the present moment.

Batch view

For simplicity, assume that our master dataset contains all tweets since the beginning of time. In addition, we have implemented a batch job that creates the batch view needed for our business goal, so we have a precomputed batch view containing statistics for all hashtags used together with #morningatlohika:
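The batch job itself is just a recomputation over the whole (append-only) master dataset. A minimal plain-Python sketch (the tweet texts and function names below are made-up placeholders, not the article's actual data or code):

```python
import re
from collections import Counter

def build_batch_view(tweets):
    """Recompute hashtag statistics from scratch over all tweets."""
    counts = Counter()
    for text in tweets:
        counts.update(tag.lower() for tag in re.findall(r"#(\w+)", text))
    return dict(counts)

# Hypothetical master dataset of raw tweet texts.
master_dataset = [
    "Early start with #apache #spark at #morningatlohika",
    "Slides from #morningatlohika are up",
]
batch_view = build_batch_view(master_dataset)
# batch_view["morningatlohika"] == 2
```

In the real application this recomputation runs as a Spark batch job over the master dataset, and the resulting view is what the serving layer indexes.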

To keep the numbers easy to remember, I use the number of letters in the English word of the corresponding hashtag as its count.

Real-time view

While the application is up and running, someone posts the following tweet:

In this case, the correct real-time view should contain the following hashtags and their statistics (1 in our example, because each hashtag is used only once):

Query

When the end user queries the hashtag statistics, we only need to combine the batch view with the real-time view. So the output should look like this:

Scenario

The simplified steps for the sample scenario are as follows:

Create a batch view (.parquet) through Apache Spark

Cache batch views in Apache Spark

The streaming application connects to Twitter

Monitor #morningatlohika tweets in real time

Build incremental real-time view

Query, that is, merging batch view and real-time view

Technical details

The source code is based on Apache Spark 1.6.x (before Structured Streaming was introduced). The Spark Streaming architecture is a pure micro-batch architecture:

So for the streaming part of the application, I use a DStream that connects to Twitter through TwitterUtils:

In each micro-batch (with a configurable batch interval), the statistics of the hashtags in the new tweets are computed, and the state of the real-time view is updated with the updateStateByKey() state transition function. For simplicity, a temporary in-memory table stores the real-time view.
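Spark Streaming's updateStateByKey applies a user-supplied function of the shape (new values, previous state) -> new state to each key on every micro-batch. The plain-Python simulation below (no Spark required; the helper names are mine) mimics that contract for the running real-time view of hashtag counts:

```python
def update_count(new_values, running_count):
    """Same shape as an updateStateByKey update function: merge the
    counts from this micro-batch into the running state for one key."""
    return sum(new_values) + (running_count or 0)

def apply_micro_batch(state, batch):
    """Apply one micro-batch of (hashtag, count) pairs to the state dict."""
    grouped = {}
    for tag, n in batch:
        grouped.setdefault(tag, []).append(n)
    for tag, values in grouped.items():
        state[tag] = update_count(values, state.get(tag))
    return state

realtime_view = {}
apply_micro_batch(realtime_view, [("morningatlohika", 1)])
apply_micro_batch(realtime_view, [("morningatlohika", 1), ("spark", 1)])
# realtime_view == {"morningatlohika": 2, "spark": 1}
```

In real Spark the state lives in the DStream's checkpointed RDDs rather than a plain dict, but the per-key update logic is the same.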

The query service reflects the merge of batch and real-time views:
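The original query-service listing is not reproduced in this article, but the merge itself is a simple key-wise sum of the two views. A plain-Python sketch (the view contents below are hypothetical, not the article's actual numbers):

```python
def merge_views(batch_view, realtime_view):
    """Combine the precomputed batch view with the incremental real-time view."""
    merged = dict(batch_view)
    for tag, count in realtime_view.items():
        merged[tag] = merged.get(tag, 0) + count
    return merged

batch_view = {"morningatlohika": 15, "spark": 5}   # hypothetical batch counts
realtime_view = {"morningatlohika": 1}             # counts since the last batch run
merge_views(batch_view, realtime_view)
# → {"morningatlohika": 16, "spark": 5}
```

Every user-facing query goes through this merge, which is why both views must use the same key space and mergeable statistics.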

Output

The Hadoop-based M/R pipeline mentioned at the beginning of this article can be optimized with Apache Spark:

Postscript:

As mentioned earlier, the Lambda architecture has its advantages and disadvantages, and so it has both supporters and detractors. Some say that the batch and real-time views contain a lot of duplicated logic, because in the end both must produce mergeable views from the query's perspective. So they created the Kappa architecture and called it a simplified version of the Lambda architecture. The Kappa architecture removes the batch system and instead serves all data quickly through the streaming system:

But even in this case, Apache Spark can be used in the Kappa architecture, for example as the streaming system:

This concludes the "Sample Analysis of Apache Spark's Lambda Architecture". Thank you for reading!
