How to understand big data's Lambda architecture

2025-04-21 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

This article explains how to understand the Lambda architecture for big data.

1. The challenges big data deals with

As IT developed, the first stage was the emergence of large systems and platforms, which solved the efficiency problem of moving work from offline to online. The next stage is the data era: processing the data these platforms have accumulated, which is generally large in volume. Large-scale data processing was at first mainly offline, and the three basic components of Hadoop addressed big data storage, computation, and big-table storage respectively. That stage essentially solved big data computation: you could write a program to run large-scale operations over the data. Real-time processing came next. The first system to appear was Storm, which processes individual records in real time and therefore shows the latest data. But what if you want both the latest data and the historical data? Nathan Marz, the author of Storm, proposed the Lambda architecture, which addresses how to combine the results of offline computation with the results of real-time processing to produce a final answer.

2. What are the characteristics of big data's Lambda architecture?

First of all, what we want is a framework that combines online and offline computing results. Imagine a credit scenario: I want all the lending institutions a given user has transacted with, and suppose I use this result to compute a multi-lender score. The requirement is to get the latest data in real time. For example, if the user transacted with institution A one second ago, the next query must include institution A. Historical data requires a computation over the full existing stock, which takes time, and the transaction with institution A from a second ago will generally not land in the offline warehouse immediately, so it can only be handled by real-time processing. Thinking this structure through, it should have the following characteristics:

Exactly-once, at least offline: the environment is sometimes unreliable, especially for online systems, where guaranteeing exactly-once processing is even harder. Offline recomputation can overwrite the online results, that is, the data is backfilled.

Scalability: if offline computing is too slow, for example, it can be sped up by adding resources.

Maintainability: the Lambda architecture needs to keep the online and offline computing logic consistent, ideally by implementing both in the same way.

Queryability: the data calculated offline can be queried through a query interface.

In general, the essence is data records + query service.
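The exactly-once characteristic listed above can be sketched in a few lines. This is a minimal illustration, not the article's implementation: the event shape and the names `online_count` and `batch_recount` are hypothetical, and duplicate delivery is an assumed failure mode of the online path.

```python
from collections import Counter

# Hypothetical event log of (event_id, user_id) pairs. Event "e2" reached the
# online path twice (an assumed at-least-once delivery failure).
delivered_online = [("e1", "u1"), ("e2", "u1"), ("e2", "u1"), ("e3", "u2")]

def online_count(events):
    """Online path: counts every delivery, so duplicates inflate the result."""
    return Counter(user for _, user in events)

def batch_recount(events):
    """Offline path: deduplicates by event_id before counting."""
    unique = {event_id: user for event_id, user in events}
    return Counter(unique.values())

# The serving view starts with the (possibly wrong) online numbers...
serving_view = dict(online_count(delivered_online))
# ...and the offline recomputation later overwrites them (the backfill).
serving_view.update(batch_recount(delivered_online))
```

After the backfill, `serving_view` holds the deduplicated counts, which is why the offline path only needs to be correct eventually while the online path only needs to be fast.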

3. A brief introduction to big data's Lambda Architecture

From the requirements, we arrived at the model of data records + query service. Because data records are written in different ways, the Lambda architecture divides them into an offline batch computing layer and an online real-time computing layer.

We get the following formula
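A commonly cited way to write this relationship, given here as an assumption rather than as the article's own formula, is shown below. The function names $f$, $g$ and $h$ are placeholders.

```latex
\begin{aligned}
\text{batch view}     &= f(\text{all data}) \\
\text{real-time view} &= g(\text{real-time view}, \text{new data}) \\
\text{query}          &= h(\text{batch view}, \text{real-time view})
\end{aligned}
```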

To make querying easier, the query results are usually exposed as views. There are many ways to implement such a Lambda architecture. For the batch computing layer, you can use Spark or Hive to compute over large offline data sets. For the real-time layer, you can choose a framework such as Flink, or, if the logic is not complex, an ordinary program that produces the results directly. Storage can likewise be chosen to suit the implementation.
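The split described above can be sketched with the credit scenario from section 2. This is a toy, in-memory illustration under stated assumptions: `batch_view`, `RealtimeView`, and `query` are hypothetical names, and plain Python stands in for what would be Spark or Hive in the batch layer and Flink in the real-time layer.

```python
from collections import defaultdict

def batch_view(historical_records):
    """Batch layer: precompute user -> set of institutions from all history."""
    view = defaultdict(set)
    for user, institution in historical_records:
        view[user].add(institution)
    return view

class RealtimeView:
    """Real-time layer: updated incrementally as each new record arrives."""
    def __init__(self):
        self.view = defaultdict(set)

    def on_record(self, user, institution):
        self.view[user].add(institution)

def query(user, batch, realtime):
    """Query service: merge the batch view and the real-time view."""
    return batch.get(user, set()) | realtime.view.get(user, set())

history = [("u1", "A"), ("u1", "B"), ("u2", "C")]
batch = batch_view(history)
realtime = RealtimeView()
realtime.on_record("u1", "D")  # a transaction that happened a second ago

print(sorted(query("u1", batch, realtime)))  # ['A', 'B', 'D']
```

The merge here is a set union, which is safe even if a record appears in both views. Aggregations such as counts or sums need a cutoff between the two sides to avoid double counting.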

4. Layering of the Lambda architecture

We mentioned three modules in the Lambda architecture: the offline computing layer, the online computing layer, and the query service layer.

First, the offline computing layer. Because the historical data is large, it is stored on HDFS and computed with the MapReduce (MR) model. If something goes wrong, it can be fixed by recomputing in batch.

The second is the query view, which serves queries by merging the offline precomputed data with the online calculation results.

5. Implementation example of the Lambda architecture

In this implementation, offline computing uses Hive and Spark. To stay aligned with the online computing logic, it depends on the same jar, but the offline logic runs inside a UDF. An enable_time marks the point in time that separates online data from offline data. eggroll can be understood as an offline KV storage database similar to HBase.
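The two ideas in this implementation, shared logic and an enable_time cutoff, can be sketched as follows. Everything here is illustrative: `normalize_institution` and `query_institutions` are hypothetical names, the row layout is invented, and the boundary rule (offline strictly before enable_time, online from enable_time onward) is an assumption, since the article does not state it.

```python
from datetime import datetime

def normalize_institution(raw_name):
    """Shared business logic. In the article's setup this would live in the
    common jar and be called from both the offline UDF and the online job."""
    return raw_name.strip().upper()

def query_institutions(user, offline_rows, online_rows, enable_time):
    """Take offline rows before enable_time and online rows from enable_time
    onward (assumed rule), so no record is served from both sides."""
    result = set()
    for row_user, name, ts in offline_rows:
        if row_user == user and ts < enable_time:
            result.add(normalize_institution(name))
    for row_user, name, ts in online_rows:
        if row_user == user and ts >= enable_time:
            result.add(normalize_institution(name))
    return result

enable_time = datetime(2025, 4, 21, 0, 0, 0)
offline_rows = [("u1", "bank a ", datetime(2025, 4, 20, 9, 0))]
online_rows = [
    ("u1", "Bank A", datetime(2025, 4, 20, 9, 0)),   # older than enable_time: ignored
    ("u1", "bank b", datetime(2025, 4, 21, 8, 30)),  # newer than enable_time: used
]

institutions = query_institutions("u1", offline_rows, online_rows, enable_time)
```

Because both sides call the same `normalize_institution`, a change to the business rule only has to be made once, which is the point of sharing the jar.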

6. Problems with the Lambda architecture

The Lambda architecture has been developed over many years. Its advantage is stability: the cost of the real-time computing part can be controlled, and batch processing can use the night hours for the full batch computation, which separates the peaks of real-time and offline computing. This architecture supported the early development of the data industry, but it also has some fatal shortcomings and increasingly fails to meet the needs of data analysis in the big data 3.0 era. The disadvantages are as follows:

Inconsistent metrics caused by differing real-time and batch results: because batch and real-time computing run on two different frameworks and programs, their results often differ. It is common to see one figure on the day itself and then find the next day that yesterday's figure has changed.

Batch computing cannot finish within its window: in the IoT era, data volumes keep growing by orders of magnitude. Teams often find that the nightly window is only 4 or 5 hours, which is no longer enough to process the more than 20 hours of data accumulated during the day. Making sure the data is delivered on time before work starts in the morning has become a headache for every big data team.

Complexity of development and maintenance: the Lambda architecture requires implementing the same business logic twice against two different APIs (application programming interfaces): an ETL system for batch computing and a streaming system for stream computing. Two code bases are produced for the same business problem, each with its own defects. Such a system is very difficult to maintain.

Large server storage footprint: a typical data warehouse design produces a large number of intermediate result tables, so data volume expands rapidly and storage pressure on the servers keeps increasing.
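The batch window problem described above can be put in numbers. This is a back-of-the-envelope sketch that assumes data arrives at a uniform rate; `required_speedup` and `fits_in_window` are hypothetical helper names.

```python
def required_speedup(data_hours, window_hours):
    """How many times faster than the arrival rate the batch job must run."""
    return data_hours / window_hours

def fits_in_window(data_hours, window_hours, batch_speedup):
    """True if a job running at batch_speedup times the arrival rate
    finishes inside the window."""
    return data_hours / batch_speedup <= window_hours

print(required_speedup(20, 5))                    # 4.0
print(required_speedup(20, 4))                    # 5.0
print(fits_in_window(20, 4, batch_speedup=4.0))   # False: the job needs 5 hours
```

With 20 hours of accumulated data, a job that keeps up with a 5-hour window misses a 4-hour one, so any growth in volume or shrinkage of the window has to be matched by extra throughput.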

Thank you for reading. That is all for "how to understand big data's Lambda architecture". After studying this article you should have a deeper understanding of the Lambda architecture; its specific use still needs to be verified in practice.

