How to analyze Lambda Architecture in big data

2025-04-11 Update From: SLTechnology News&Howtos

In this issue, the editor explains how to analyze the Lambda architecture in big data. The article covers the topic in some depth and from a practitioner's point of view, and we hope you take something useful away from it.

How do we beat the CAP theorem? The CAP theorem in computer science states that a distributed data store cannot provide more than two of the following three guarantees at the same time.

Consistency:

Every read receives the most recent write or an error.

Availability:

Each request receives a (non-error) response, but it is not guaranteed to contain the latest writes.

Partition tolerance:

The system continues to operate even if the network between nodes drops (or delays) an arbitrary number of messages.

Brief history

In 2011, Nathan Marz proposed in his blog an important way to work around the limitations of the CAP theorem: the Lambda architecture.

Working principle

Let's take a closer look at the Lambda architecture. It is divided into three layers: the batch layer, the speed layer, and the serving layer.

It combines real-time and batch processing of the same data.

First, the incoming real-time data stream is stored both in the master dataset at the batch layer and in an in-memory cache at the speed layer. The data in the batch layer is then indexed and made available through batch views. The real-time data in the speed layer is exposed through real-time views. Finally, the batch views and real-time views can be queried independently or together to answer any historical or real-time question.
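The flow described above can be sketched in a few lines of Python. This is only a toy, in-memory model under stated assumptions: the class `LambdaSketch`, its method names, and the page-view counting use case are hypothetical and are not taken from any particular framework.

```python
from collections import Counter


class LambdaSketch:
    """Toy, in-memory model of the three layers (all names are hypothetical)."""

    def __init__(self):
        self.master_dataset = []        # batch layer: append-only raw events
        self.batch_view = Counter()     # serving layer: precomputed from the master dataset
        self.realtime_view = Counter()  # speed layer: events not yet covered by a batch view

    def ingest(self, page):
        # Every incoming event goes to both the batch layer and the speed layer.
        self.master_dataset.append(page)
        self.realtime_view[page] += 1

    def run_batch(self):
        # Recompute the batch view from all raw data, then drop the
        # real-time entries that the new batch view now covers.
        self.batch_view = Counter(self.master_dataset)
        self.realtime_view.clear()

    def query(self, page):
        # Answer a query by merging the batch view with the real-time view.
        return self.batch_view[page] + self.realtime_view[page]


system = LambdaSketch()
for page in ["/home", "/docs", "/home"]:
    system.ingest(page)
before_batch = system.query("/home")  # served entirely by the speed layer

system.run_batch()
system.ingest("/home")
after_batch = system.query("/home")   # batch view plus one fresh event
```

In a real deployment the batch run takes time and events keep arriving while it runs, so expiring real-time data is more involved than the single `clear()` call used here.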

Batch layer

This layer is responsible for managing the master dataset. The data in the master dataset must have the following three properties.

The data is raw.

The data is immutable.

The data is eternally true.

The master dataset is the source of truth. Even if you lose all serving-layer and speed-layer datasets, you can rebuild the application from the master dataset.
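As a rough illustration of why these properties matter, the sketch below models the master dataset as a tuple of frozen records and rebuilds a view after the derived data is lost. The `Event` type, its fields, and `build_view` are assumptions made up for this example.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class Event:
    """A raw, immutable fact (hypothetical schema)."""
    user: str
    amount: int
    timestamp: int


# The master dataset only ever grows; existing records are never edited.
master_dataset = (
    Event("alice", 30, 1),
    Event("bob", 5, 2),
    Event("alice", 12, 3),
)


def build_view(events):
    """Derive a per-user total from the raw events."""
    totals = {}
    for event in events:
        totals[event.user] = totals.get(event.user, 0) + event.amount
    return totals


view = build_view(master_dataset)
view.clear()                          # simulate losing the derived view
rebuilt = build_view(master_dataset)  # everything can be derived again

try:
    master_dataset[0].amount = 0      # attempts to mutate a record are rejected
    mutated = True
except FrozenInstanceError:
    mutated = False
```

Because the views are pure functions of the raw data, losing them costs recomputation time but no information.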

The batch layer also precomputes batch views from the master dataset so that queries can be answered with low latency.

Because the master dataset keeps growing, we need a strategy for maintaining the batch views as new data becomes available.

Recomputation algorithm:

Discard the old batch view and recompute the function over the entire master dataset.

Incremental algorithm:

Update the view directly when new data arrives.
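The two strategies can be compared with a minimal sketch. The function names and the key/value record shape are assumptions chosen for illustration; the view here is a simple sum per key.

```python
def recompute_view(master_dataset):
    """Recomputation: throw the old view away and rebuild it from all data."""
    view = {}
    for key, value in master_dataset:
        view[key] = view.get(key, 0) + value
    return view


def update_view_incrementally(view, new_records):
    """Incremental: fold only the new records into the existing view."""
    for key, value in new_records:
        view[key] = view.get(key, 0) + value
    return view


old_records = [("a", 1), ("b", 2)]
new_records = [("a", 3)]

full = recompute_view(old_records + new_records)
incremental = update_view_incrementally(recompute_view(old_records), new_records)
```

Both approaches end with the same view for this function. Recomputation does more work per run but is simpler to reason about, since a bug in the view logic can be fixed and the view rebuilt; the incremental version touches less data but its state has to be kept correct across every update.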

Speed layer

The speed layer stores real-time views and processes the incoming data stream to update those views, so that recent data can be queried quickly while the batch views are still being computed. Together with the indexed batch views, this supports fast ad hoc queries. The underlying storage layer must meet the following requirements.

Random reads:

Fast random reads must be supported so that queries can be answered quickly.

Random writes:

To support incremental algorithms, it must be possible to modify the real-time views with as little latency as possible.

Scalability:

Real-time views should scale with the amount of data they store and with the read/write rates the application requires.

Fault tolerance:

When a machine fails, the real-time views should continue to function normally.

Serving layer

This layer provides low-latency access to the results of the computations performed on the master dataset. Reads can be sped up by indexes built on the data. Like the speed layer, this layer must meet a set of requirements: random reads, batch writes, scalability, and fault tolerance.
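A minimal sketch of the "random reads plus batch writes" contract, assuming a hypothetical `ServingView` class backed by a plain dictionary index: each batch run replaces the whole index at once, and lookups by key stay cheap.

```python
class ServingView:
    """Toy serving-layer view: batch writes, random reads (hypothetical API)."""

    def __init__(self):
        self._index = {}

    def bulk_load(self, rows):
        # Build the new index off to the side, then swap it in with a single
        # assignment so readers see either the old view or the new one.
        new_index = {key: value for key, value in rows}
        self._index = new_index

    def get(self, key, default=None):
        # Random read: a direct lookup by key.
        return self._index.get(key, default)


view = ServingView()
view.bulk_load([("alice", 42), ("bob", 5)])
first = view.get("alice")

view.bulk_load([("alice", 50), ("carol", 7)])  # the next batch run replaces the view
second = view.get("alice")
missing = view.get("bob")
```

Since the serving layer never needs to modify individual records in place, its storage can be much simpler than the storage behind the speed layer.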

Properties of the Lambda architecture

The Lambda architecture is designed around several properties: fault tolerance, ad hoc queries, scalability, and extensibility.

Fault tolerance: the Lambda architecture gives big data systems a more forgiving kind of fault tolerance. When an error occurs, we can fix the algorithm or recompute the views from scratch.

Ad hoc queries: the batch layer allows ad hoc queries against any data.

Scalability: the batch, speed, and serving layers are all easy to scale. Because they are all fully distributed systems, we can scale out simply by adding new machines.

Extensibility: adding a view is easy; it only requires adding a new function over the master dataset.

Some open problems

How do we keep the code synchronized between layers?

One way to solve this problem is to give every layer a common code base, either by using a shared library or by introducing some kind of abstraction shared between the batch and streaming code. Frameworks such as Summingbird or Lambdoop (Casado) are examples of this approach.
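The shared-library idea can be illustrated with a small sketch in which one function holds the business logic and both the batch job and the streaming job call it. This is not the API of Summingbird or Lambdoop; `add_event`, `batch_job`, and `StreamJob` are hypothetical names used only for this example.

```python
from functools import reduce


def add_event(counts, word):
    """Shared logic: return a new count table that includes one more event."""
    updated = dict(counts)
    updated[word] = updated.get(word, 0) + 1
    return updated


def batch_job(all_words):
    """Batch layer: fold the shared function over the whole dataset."""
    return reduce(add_event, all_words, {})


class StreamJob:
    """Speed layer: apply the same function to one event at a time."""

    def __init__(self):
        self.counts = {}

    def on_event(self, word):
        self.counts = add_event(self.counts, word)


words = ["spark", "storm", "spark"]

stream = StreamJob()
for word in words:
    stream.on_event(word)

batch_counts = batch_job(words)
```

Because both paths go through `add_event`, a change to the counting rule is made once and cannot drift between the two layers.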

Can we remove the speed layer?

Yes, many applications do not need the speed layer. If we shorten the batch cycle, we reduce the latency before data becomes available. In addition, newer and faster tools for accessing data stored on Hadoop (such as Impala, Drill, or newer versions of Tez) make it possible to run certain operations on the data within a reasonable amount of time.

Can we discard the batch layer and process everything in the speed layer?

Yes. One example is the Kappa architecture proposed by Jay Kreps, which suggests processing incoming data as a stream and re-streaming it from the Kafka buffer whenever a larger window of history is needed, or from the historical data cluster if we have to go back even further.
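The replay idea behind the Kappa architecture can be sketched without any real Kafka client. In the toy below, a Python list stands in for the retained log; `Log`, `run_stream_job`, and the two processing functions are assumptions made up for illustration.

```python
class Log:
    """Toy append-only log with offsets (a stand-in for a retained Kafka topic)."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1  # offset of the new record

    def read_from(self, offset):
        return iter(self._records[offset:])


def run_stream_job(log, process, offset=0):
    """Run one streaming job over the log, starting at the given offset."""
    state = {}
    for record in log.read_from(offset):
        process(state, record)
    return state


def count_clicks(state, record):
    # Version 1 of the logic: clicks per page.
    state[record["page"]] = state.get(record["page"], 0) + 1


def count_clicks_per_user(state, record):
    # Version 2 of the logic: clicks per (user, page) pair.
    key = (record["user"], record["page"])
    state[key] = state.get(key, 0) + 1


log = Log()
for user, page in [("u1", "/home"), ("u2", "/home"), ("u1", "/docs")]:
    log.append({"user": user, "page": page})

v1 = run_stream_job(log, count_clicks)
# When the logic changes, replay the same log from offset 0 with the new code.
v2 = run_stream_job(log, count_clicks_per_user)
```

There is no separate batch code path here: reprocessing is just a second run of the streaming job over the retained history.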

How to implement the Lambda architecture?

In practice, this architecture can be implemented on a Hadoop data lake: HDFS stores the master dataset, Spark (or Storm) forms the speed layer, HBase (or Cassandra) acts as the serving layer, and Hive creates the queryable views.

Companies using the Lambda architecture

Yahoo

For analytics on its advertising data warehouse, Yahoo took a similar approach, using Apache Storm, Apache Hadoop, and Druid.

Netflix

Netflix's Suro project is the backbone of the Netflix data pipeline. It has separate processing paths for data but does not strictly follow the Lambda architecture, because the paths may serve different purposes and do not necessarily provide the same type of views.

LinkedIn

LinkedIn uses Apache Calcite to bridge offline and nearline computation.

That concludes the editor's overview of how to analyze the Lambda architecture in big data. If you have had similar questions, the analysis above may help you work through them.
