
How to analyze the Lambda architecture for real-time big data processing

2025-03-31 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/01 Report--

How should one analyze the Lambda architecture for real-time big data processing? Many inexperienced readers are at a loss about it, so this article walks through the architecture step by step. I hope it answers the question for you.

1. Background of the Lambda architecture

The Lambda architecture is a real-time big data processing framework proposed by Nathan Marz, the author of Storm. Marz developed the well-known real-time processing framework Storm while working at Twitter, and the Lambda architecture is grounded in his years of experience with distributed big data systems.

The goal of the Lambda architecture is to design an architecture that satisfies the key properties of a real-time big data system, including high fault tolerance, low latency, and scalability. The Lambda architecture unifies offline computation and real-time computation, incorporates architectural principles such as immutability, read-write separation, and complexity isolation, and can integrate big data components such as Hadoop, Kafka, Storm, Spark, and HBase.

2. Key characteristics of a big data system

Marz argues that a big data system should have the following key characteristics:

Robust and fault-tolerant: in a large-scale distributed system, machines are unreliable and may crash, yet the system must remain robust and behave correctly even in the presence of machine failures. Beyond machine failures, humans are even more likely to make mistakes. Software inevitably ships with bugs, so the system must be resilient to bad data written by buggy programs; tolerance of human error is therefore even more important than tolerance of machine error. In a large distributed system both kinds of error can occur every day, and how the system handles them so that it can recover quickly is especially important.

Low latency reads and updates: many applications require low latency for both reads and writes, i.e. fast responses to updates and queries.

Scalable (horizontal scaling): when the data volume or load grows, a scalable system maintains performance by adding more machine resources. In other words, the system should scale linearly, usually by scaling out (adding machines) rather than scaling up (upgrading individual machines).

General (versatility): the system should be applicable to a wide range of domains, including finance, social networks, e-commerce analytics, and so on.

Extensible: when new functions and features are needed, an extensible system can add them at minimal development cost.

Allows ad hoc queries: the data has value only if the data you need can be queried conveniently and quickly.

Minimal maintenance (easy to maintain): the key to an easy-to-maintain system is controlling its complexity; the more complex a system is, the more error-prone and harder to maintain it becomes.

Debuggable (easy to debug): when something goes wrong, the system must expose enough information to debug the error and find its source. The key is being able to trace each result back to the data that produced it.

3. The essence of a data system

To design a system that meets the key characteristics above, we need to understand the essence of a data system. A data system can be reduced to:

Data system = data + query

We can therefore understand the essence of a big data system from two angles: data and queries.

3.1. The nature of data

3.1.1. Characteristics of data: When & What

Let's start with the characteristics of data. A datum is an indivisible unit with two key properties: When and What.

When means that data is tied to time: every datum is generated at some point in time. For example, a log implicitly records data in chronological order; an entry earlier in the log must have been generated before a later one, and in a messaging system the receiver necessarily gets a message after the sender sent it. By contrast, the rows of a database table lose this chronological information: a row in the middle of a table may have been updated after a later row was created. For distributed systems the time property of data is especially important, because data may be generated on different machines and time determines the global order in which events occur. For example, applying first +2 and then *3 to a value gives a completely different result from applying *3 and then +2. The time property of data thus determines the global order of events, and with it the result of the computation.

What refers to the data itself. Because a datum is tied to a point in time, the datum itself is immutable: past data has become a fact, and you cannot go back to some point in time and change it. This means there are only two kinds of operations on data: reading existing data and adding new data. In database terms, CRUD becomes CR: Update and Delete are essentially new data (new facts), recorded through Create.
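The "CRUD becomes CR" idea can be sketched in a few lines. This is a minimal illustration with hypothetical names, not any specific system's API: every change is appended as a new immutable fact, and the current state is derived by reading the latest fact.

```python
import time

master_dataset = []  # append-only list of facts; nothing is ever updated in place

def record_fact(entity_id, field, value):
    """Create (C): append a new fact with its timestamp. An 'update' or
    'delete' is just a later fact about the same entity."""
    master_dataset.append({"ts": time.time(), "id": entity_id,
                           "field": field, "value": value})

def current_value(entity_id, field):
    """Read (R): derive the current state by taking the most recent fact."""
    facts = [f for f in master_dataset
             if f["id"] == entity_id and f["field"] == field]
    return facts[-1]["value"] if facts else None

record_fact("user-1", "city", "Beijing")
record_fact("user-1", "city", "Shanghai")   # an "update" is just a newer fact
print(current_value("user-1", "city"))      # Shanghai — and the old fact survives
```

Note that the earlier "Beijing" fact is still in the dataset, which is exactly what makes later recomputation and error tracing possible.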

3.1.2. Data storage: Store Everything Rawly and Immutably

Based on this analysis of the essential characteristics of data, the Lambda architecture stores data as follows: data is immutable, and all data is stored.

Storing all data in an immutable manner brings the following benefits:

Simplicity. With an immutable data model, storing data just means appending it to the master dataset. With a mutable data model, by contrast, data usually needs to be indexed so that records to be updated can be found quickly.

Tolerance of human and machine error. As noted above, people and machines make mistakes every day, so it is crucial that the system can recover from such errors quickly. Immutability and recomputation are the standard ways to achieve this. With a mutable data model, the data that caused an error may be overwritten and lost. With an immutable data model, all the data is still there, including the data that caused the error; the fix is simply to traverse the whole dataset, discard the bad data, and recompute the Views (see 4.1.2 for the View concept). The key to recomputation is that re-executing the data in the global order determined by its time property reproduces the correct results.

There are already many systems in industry that store all data with an immutable data model. For example, the distributed database Datomic stores data on an immutable model, which simplifies its design, and the distributed message broker Kafka stores messages as an append-only log.

3.2. Query

What is a query? Marz gives a simple definition:

Query = Function (All Data)

The equation means that a query is a function applied to the entire dataset. This definition looks simple, but it covers almost every area of databases and data systems: RDBMS, indexes, OLAP, OLTP, MapReduce, ETL, distributed file systems, NoSQL, and so on.

Let's look more closely at the characteristics of such functions, to see what properties of the function itself can be exploited when executing queries.

One widely useful class of functions are those with the Monoid property. The concept of a Monoid comes from category theory; one of its essential characteristics is that the operation satisfies the associative law. For example, integer addition is associative:

(a + b) + c = a + (b + c)

Functions that are not Monoids can often be decomposed into operations on several functions that are. For example, averages cannot be combined directly: the average of several averages is not, in general, the overall average. But the average can be split into a numerator (the sum) and a denominator (the count), both of which are integer additions and therefore satisfy the Monoid property.
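The average decomposition above can be sketched concretely. This is an illustrative example (names are my own): each "machine" summarizes its shard as a (sum, count) pair, the pairs are merged with an associative combine, and the division happens only at the end.

```python
def part(values):
    """Summarize one shard as a monoid element: (sum, count)."""
    return (sum(values), len(values))

def merge(a, b):
    """Associative combine of two partial results."""
    return (a[0] + b[0], a[1] + b[1])

# Three "machines" each summarize a shard of the data in parallel:
shards = [[1, 2, 3], [4, 5], [6]]
total = (0, 0)                         # identity element of the monoid
for s in shards:
    total = merge(total, part(s))

print(total[0] / total[1])             # 3.5 — the true average of all six values
```

Merging averages directly ((2 + 4.5 + 6) / 3 = 4.166…) would give the wrong answer; the (sum, count) encoding is what makes the computation distributable.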

The associativity of Monoids is very important in distributed computing. Satisfying the Monoid property means a computation can be decomposed into parallel operations on multiple machines whose partial results are then combined into the final result. It also means partial results can be stored and shared with other computations that contain the same sub-computation, reducing duplicated work.

4. Lambda architecture

Having discussed the nature of data systems, let's turn to the key question of big data systems: how can we query an arbitrarily large dataset in real time? Big data plus real-time computation makes this a hard problem.

The simplest approach, following the query equation Query = Function (All Data), is to run the query function over the entire dataset on the fly. But when the data volume is large, the computational cost of this method is far too high, so it is unrealistic.

The Lambda architecture solves this problem by decomposing the system into three layers: the Batch Layer, the Speed Layer, and the Serving Layer.

4.1. Batch Layer

The Batch Layer has two main responsibilities:

Storing the master dataset

Precomputing query functions over the dataset to build the Views corresponding to those queries

4.1.1. Storing the dataset

Following the earlier discussion of the When & What characteristics of data, the Batch Layer stores all data with an immutable model. Because the volume is large, a big data storage scheme such as HDFS can be used. If data must be stored in the chronological order in which it was generated, a time-series database (TSDB) such as InfluxDB is an option.

4.1.2. Building query Views

According to the equation Query = Function (All Data), running query functions over the full dataset online is too expensive. But if we compute and save the results of the query functions over the dataset in advance, a query can return those results directly (or after light post-processing) without an expensive full recomputation. The Batch Layer can thus be seen as a data preprocessing stage. The precomputed, saved results for a query are called a View. The View is a core concept of the Lambda architecture: it is an optimization of the query, allowing results to be obtained quickly.

If we use HDFS to store the data, we can use MapReduce to build the query Views over the dataset. The work of the Batch Layer can be expressed in simple pseudocode:
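As a minimal Python sketch of that idea (function and field names are illustrative; a real Batch Layer would run this as a MapReduce or Spark job over HDFS), the Batch Layer is just an endless loop that recomputes every Batch View from the full master dataset:

```python
def recompute_batch_views(all_data):
    """Run the precomputation over the full, immutable master dataset and
    emit fresh Batch Views (here: a per-key count, as a toy example)."""
    views = {}
    for record in all_data:
        views[record["key"]] = views.get(record["key"], 0) + record["value"]
    return views

def run_batch_layer(all_data, iterations=1):
    # In a real deployment this loop runs forever; it is bounded here only
    # so the sketch terminates.
    for _ in range(iterations):
        batch_views = recompute_batch_views(all_data)
    return batch_views

data = [{"key": "pageA", "value": 1}, {"key": "pageA", "value": 1},
        {"key": "pageB", "value": 1}]
print(run_batch_layer(data))   # {'pageA': 2, 'pageB': 1}
```

Because every run starts from the raw immutable data, discarding bad records and rerunning the loop is all it takes to correct an error.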

The work looks simple, but it is in fact very powerful: any human or machine error can be corrected by recomputing to obtain the correct result.

Understanding the View:

The View is a concept closely tied to the business, and Views must be designed from the needs of the business itself. In a general-purpose database system the query functions are endlessly varied and impossible to enumerate; but starting from the business's own needs, one finds that the queries it actually requires are limited. An important part of the Batch Layer's job is therefore to examine the kinds of queries the business may need and define the corresponding Views over the dataset.

4.2. Speed Layer

The Batch Layer handles offline data well, but many scenarios produce data continuously in real time and require real-time query processing. The Speed Layer handles this incremental real-time data.

The Speed Layer resembles the Batch Layer in that it computes over data and generates Realtime Views. The main differences are:

The Speed Layer processes only the most recent incremental data stream, while the Batch Layer processes the entire dataset.

For efficiency, the Speed Layer updates its Realtime Views incrementally as new data arrives, whereas the Batch Layer computes its Batch Views directly from the full offline dataset.
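The contrast between the two layers can be sketched as follows (illustrative names; a real Speed Layer would be a Storm topology or Spark Streaming job): instead of recomputing from all data, each arriving record is folded incrementally into the Realtime View.

```python
realtime_view = {}

def on_new_data(record):
    """Incrementally fold one new record into the Realtime View —
    no access to the full dataset is needed."""
    key = record["key"]
    realtime_view[key] = realtime_view.get(key, 0) + record["value"]

# Two events arrive on the real-time stream:
for rec in [{"key": "pageA", "value": 1}, {"key": "pageA", "value": 1}]:
    on_new_data(rec)

print(realtime_view)   # {'pageA': 2}
```

The incremental update is cheap but harder to make robust, which is exactly why the Lambda architecture lets the Batch Layer's recomputation eventually supersede it.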

Decomposing data processing into a Batch Layer and a Speed Layer brings the following advantages:

Fault tolerance. Data processed by the Speed Layer is also continuously written to the Batch Layer. Once a Batch Layer recomputation covers the data already handled by the Speed Layer, the corresponding Realtime View can be discarded. This means any errors introduced by Speed Layer processing are corrected the next time the Batch Layer recomputes, which can also be seen as an embodiment of eventual consistency as discussed in CAP theory.

Complexity isolation. The Batch Layer processes offline data and is easy to keep under control. The Speed Layer uses incremental algorithms to process real-time data, which is considerably more complex. By separating the two and isolating that complexity in the Speed Layer, the robustness and reliability of the whole system is improved.

4.3. Serving Layer

The Serving Layer of the Lambda architecture responds to user queries by merging the result datasets from the Batch Views and Realtime Views into the final result.

This raises the question of how to merge the data. As discussed earlier, if the query function has the Monoid property, i.e. satisfies the associative law, the result sets from the Batch View and the Realtime View can simply be combined. Otherwise, the query can be decomposed into several query functions that do satisfy the Monoid property, each merged across the Batch View and Realtime View separately, with the final result computed from those. Alternatively, business-specific rules can be used to merge the two result sets.
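For a Monoid query such as a per-key count, the Serving Layer merge reduces to key-wise addition. A minimal sketch under assumed data (the view contents here are made up for illustration):

```python
batch_view    = {"pageA": 100, "pageB": 40}   # precomputed offline by the Batch Layer
realtime_view = {"pageA": 3,   "pageC": 1}    # recent increments from the Speed Layer

def serve_query(key):
    """Merge the two views: counts form a monoid under addition,
    so combining them is a simple per-key sum."""
    return batch_view.get(key, 0) + realtime_view.get(key, 0)

print(serve_query("pageA"))   # 103 — batch result plus realtime increment
print(serve_query("pageC"))   # 1   — seen only by the Speed Layer so far
```

Once a Batch Layer recomputation absorbs the recent data, the realtime entries can be dropped without changing any query result.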

5. Big Picture

We have now discussed the three layers of the Lambda architecture: the Batch Layer, the Speed Layer, and the Serving Layer. The figure below shows the complete Lambda architecture and its data flow.

After data enters the system, the stream is sent to both the Batch Layer and the Speed Layer for processing. The Batch Layer stores the full dataset offline with an immutable model and continuously recomputes the Batch Views for each query over the whole dataset. The Speed Layer processes the incremental real-time stream and continuously updates the corresponding Realtime Views. The Serving Layer answers user queries by merging the results from the Batch Views and Realtime Views into the final result set.

5.1. Lambda architecture component selection

The figure below shows components commonly used at each layer of the Lambda architecture. The data stream can be stored in Kafka, a distributed messaging system built on an immutable log. For Batch Layer dataset storage, Hadoop's HDFS or Aliyun's ODPS are options. Batch View precomputation can use MapReduce or Spark. The Batch View result data can be stored in MySQL (for querying a small amount of recent result data) or HBase (for querying large volumes of historical results). Speed Layer incremental data can be processed with Storm or Spark Streaming, and to meet the efficiency demands of real-time updates, the incrementally updated Realtime View result set can live in an in-memory NoSQL store such as Redis.

5.2. Principles of Lambda architecture component selection

The Lambda architecture is a general framework; each layer is not limited to the components listed above, and this is especially true for Views. In my experience with the Lambda architecture, because the View is a concept closely tied to the business, the key is to select, for each View, the component best suited to its queries according to the business's needs. Component selection should dig deeply into the characteristics of the data and the computation, so as to choose the component that fits them best; different Views may well use different components.

6. Lambda architecture vs. Event Sourcing vs. CQRS

Many existing design ideas and architectures can be recognized in the Lambda architecture, such as Event Sourcing and CQRS. Comparing them with the Lambda architecture helps deepen our understanding of it.

6.1. Event Sourcing vs. the Lambda architecture

Event Sourcing is an architectural pattern proposed by Martin Fowler. It is essentially a way of persisting data: store the events that caused changes, rather than only the changed state. Traditional persistence stores the result caused by an event rather than the event itself, so while the result is preserved, the ability to trace its cause is lost.

The way the Lambda architecture stores its dataset follows exactly the same idea as Event Sourcing: use an immutable data model to store the events that cause changes rather than the resulting state. Thus, when an error occurs, we can trace it back to its source, discard the bad data, recompute to restore the system, and thereby achieve fault tolerance.

6.2. CQRS vs. the Lambda architecture

CQRS (Command Query Responsibility Segregation) separates data-modifying operations from query operations and, like the Lambda architecture, is a form of read-write separation. In the Lambda architecture, data is stored immutably (the write path) and transformed into the Views corresponding to each query, and queries read their results directly from those Views (the read path).

Separating reads from writes splits the system into two distinct perspectives, which isolates complexity and thereby simplifies the design. Compared with the traditional approach of handling reads and writes together, a system with very complex read and write operations would otherwise become extremely complicated and hard to maintain.

Having read the above, have you grasped how to analyze the Lambda architecture for real-time big data processing? If you want to learn more, you are welcome to follow the industry information channel. Thank you for reading!
