What is the Data Lake architecture like?


Many newcomers are unclear about what a data lake architecture actually looks like. To help, this article explains it in detail; I hope you gain something from it.

1. Introduction

Traditional decision-support architectures struggle to extract the maximum value from an organization's data environment. New architectural patterns are needed to unlock that value. To take full advantage of big data, organizations need a flexible data architecture that can get the most out of their data ecosystem.

The concept of the data lake has existed for some time. Yet many organizations still find it hard to grasp, because their understanding remains confined to the traditional enterprise data warehouse paradigm.

This article delves into the data lake architectural pattern and walks through a reference design.

2. Traditional data warehouse (DWH) architecture

The traditional enterprise DWH architecture pattern comprises data sources; extract, transform, and load (ETL) processes that impose structure, cleanse the data, and so on; and an EDW with a pre-defined data model (a dimensional model or a 3NF model), on top of which data marts are created for OLAP multidimensional analysis and self-service BI.
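To make the "model first, then load" flow concrete, here is a minimal schema-on-write ETL sketch in Python (the file name, columns, and table are hypothetical, chosen only for illustration):

```python
# Minimal schema-on-write ETL sketch: the target model (a fact table)
# must be defined before any data is loaded. File and column names are
# hypothetical.
import csv
import sqlite3

conn = sqlite3.connect("edw.db")

# 1. Model first: the warehouse schema is fixed up front.
conn.execute("""
    CREATE TABLE IF NOT EXISTS fact_sales (
        order_id   INTEGER PRIMARY KEY,
        order_date TEXT NOT NULL,     -- ISO-8601, cleaned during ETL
        amount_usd REAL NOT NULL
    )
""")

# 2. Extract and transform: rows that do not fit the model are rejected.
with open("sales_export.csv", newline="") as f:
    for row in csv.DictReader(f):
        try:
            record = (int(row["order_id"]),
                      row["order_date"].strip(),
                      float(row["amount"]))
        except (KeyError, ValueError):
            continue  # non-conforming data is discarded, not stored

        # 3. Load into the pre-defined structure.
        conn.execute("INSERT OR REPLACE INTO fact_sales VALUES (?, ?, ?)",
                     record)

conn.commit()
```

Note how anything that does not fit the pre-defined model is simply dropped; this is exactly the trade-off discussed below.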

This architecture has been in service for many years.

However, this architecture has inherent limitations that become acute in the big data era. Some of them are as follows:

It requires us to understand the data up front: the structure of the source systems, what data they hold, its cardinality, how it should be modeled against business requirements, whether the data contains anomalies, and so on. This is tedious and complex work; requirements analysis and data profiling alone can take months, and project timelines often stretch to months or even years.

We must also choose which data to store and which to discard. Deciding what to ingest, how to ingest it, how to store it, and how to transform it consumes most of the time, leaving little for the activities that actually add value: data discovery, data mining, and analysis.

3. Data definition

Now let's briefly discuss how the definition of data has changed. Big data's four Vs are well known: Volume, Velocity, Variety, and Veracity. The background is as follows:

Since the iPhone revolution, data volumes have soared. There are some six billion smartphones worldwide, creating nearly 1 PB of data every day.

Data is no longer static. IoT devices generate continuous streams of data.

Data has also become more varied: videos and photos are now data that need to be analyzed and used.

The explosive growth of data also challenges data quality. In the big data era, deciding which data to trust and which not to trust is a bigger challenge than ever.

In short, the definition of analyzable data is changing: it now includes not only structured data but all kinds of unstructured data as well. The challenge is to bring these data together and make them meaningful.

4. Moore's law

Since 2000, processing capacity, storage, and their cost structures have changed dramatically, following Moore's Law. The key points are as follows:

Since 2000, processing capacity has increased by roughly 10,000 times, which means far more data can be analyzed effectively.

The cost of storage has dropped dramatically: since 2000, storage costs have fallen by more than 1,000 times.

5. Data lake metaphor

An analogy helps explain the concept of a data lake.

Visiting the Great Lakes is always a pleasure. The water exists in its purest form, and different people use the lake for different activities: some fish, some go boating, and the lake also provides drinking water for the people of Ontario. In short, the same lake serves many uses.

With the change in data paradigm, a new architectural pattern has emerged: the data lake architecture. Just like the water in the lake, data in a data lake is stored in its most primitive form. And like the lake, it serves the needs of different people: those who want to fish, those who want to go boating, and those who want drinking water. It gives data scientists a way to explore the data and form hypotheses, business users a way to explore the data, data analysts a way to analyze it and find patterns, and report analysts a way to create reports and present them to stakeholders.

Comparing the data lake with a data warehouse or data mart:

The data lake stores data in its most primitive form, which can serve multiple stakeholders and can also be used to package data for end users. The data warehouse, on the other hand, stores data that has been distilled and packaged for specific purposes, like bottled mineral water.

6. Data lake architecture

With this background, let's look at the conceptual architecture of a data lake. Its key components are structured and unstructured data sources, which are ingested into a raw data store that keeps the data in its most primitive form, that is, without any transformation. The raw store is cheap, persistent storage that can hold data at scale. An analytics sandbox is then used to understand the data, build prototypes, do data science, and explore the data to form new hypotheses and use cases.

Then there is a batch engine, which processes the raw data into data that users can consume directly, that is, into structures suitable for reporting to end users; we call this the processed data store. A real-time processing engine captures and processes streaming data. All data in this architecture is cataloged and governed.
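As a rough illustration of these zones, here is a minimal Python sketch (the directory layout, file formats, and function names are assumptions made for the example, not a prescribed layout):

```python
# Conceptual sketch of the data lake zones described above.
from pathlib import Path
import json
import shutil

LAKE = Path("datalake")
RAW = LAKE / "raw"              # data landed as-is, no transformation
PROCESSED = LAKE / "processed"  # batch-engine output, user-ready
SANDBOX = LAKE / "sandbox"      # exploratory area for data scientists

for zone in (RAW, PROCESSED, SANDBOX):
    zone.mkdir(parents=True, exist_ok=True)

def ingest(source_file: str, dataset: str) -> Path:
    """Land a source file in the raw zone unchanged (schema-on-read)."""
    dest = RAW / dataset / Path(source_file).name
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy(source_file, dest)
    return dest

def batch_process(dataset: str) -> None:
    """Toy batch job: reshape raw JSON records into a user-ready form."""
    out = PROCESSED / dataset
    out.mkdir(parents=True, exist_ok=True)
    for f in (RAW / dataset).glob("*.json"):
        records = json.loads(f.read_text())
        # Structure is applied here, at processing time, not at ingest.
        summary = {"dataset": dataset, "count": len(records)}
        (out / f.name).write_text(json.dumps(summary))
```

The key design point the sketch makes is that ingestion never transforms data; structure is only applied downstream, in the batch job.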

Let's take a look at each group of components in this architecture.

7. Lambda architecture

The first group of components processes data. It follows the Lambda architecture, which takes two processing paths: a batch layer and a real-time (speed) layer. The batch layer stores data in its most primitive form in the raw data store, while the real-time layer processes data in near real time, holding transient data before loading it into the processed data store.
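To make the two paths concrete, here is a minimal in-memory Lambda-style sketch (the event shape and the stand-ins for the raw and processed stores are assumptions for illustration, not a production design):

```python
# Minimal Lambda-style sketch: the batch layer recomputes a view over all
# raw data, the speed layer keeps a small real-time view, and queries
# merge the two.
from collections import defaultdict

raw_store = []                    # batch path input: immutable, append-only
realtime_view = defaultdict(int)  # speed path: incremental and transient

def ingest(event: dict) -> None:
    raw_store.append(event)                        # batch path: store raw
    realtime_view[event["key"]] += event["value"]  # speed path: update now

def batch_recompute() -> dict:
    """Periodically rebuild the batch view from all raw data."""
    view = defaultdict(int)
    for event in raw_store:
        view[event["key"]] += event["value"]
    # The batch view now covers everything seen so far, so the transient
    # speed-layer view is reset (assumes no concurrent ingest).
    realtime_view.clear()
    return dict(view)

def query(key: str, batch_view: dict) -> int:
    """Serve queries by merging the batch and real-time views."""
    return batch_view.get(key, 0) + realtime_view.get(key, 0)
```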

8. Analysis sandbox

The analytics sandbox is one of the key components of the data lake architecture. It is an exploratory area where data scientists can develop and test new hypotheses, merge and explore data to form new use cases, and build rapid prototypes to validate those use cases and work out how to extract value from them.

Simply put, it is a place where data scientists can discover data, extract value, and help transform the business.
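As a small illustration of sandbox-style exploration, here is a sketch using pandas (the file path, columns, and hypothesis are all hypothetical):

```python
# Sandbox exploration over raw, untransformed data.
import pandas as pd

# Read data straight from the raw zone (hypothetical path and schema).
orders = pd.read_json("datalake/raw/orders/2024-06.json")

# Profile the data before committing to any model.
print(orders.describe(include="all"))
print(orders.isna().mean())   # share of missing values per column

# Test a quick hypothesis: do weekend orders have higher basket values?
orders["order_date"] = pd.to_datetime(orders["order_date"])
orders["is_weekend"] = orders["order_date"].dt.dayofweek >= 5
print(orders.groupby("is_weekend")["amount"].mean())
```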

9. Data cataloging and governance

Data cataloging is often neglected in traditional enterprise architectures, but in the big data world it is essential. An example illustrates why.

When I asked clients to guess the value of a painting without providing any catalog information, answers ranged from $100 to $100,000. When I provided the catalog information, the answers came much closer to reality. The painting, "The Old Guitarist" by Pablo Picasso, was created in 1903 and is estimated to be worth more than $100 million.

Data cataloging is very similar. Different blocks of data have different values, and that value varies with the data's lineage, quality, source of creation, and so on. Data needs to be classified so that data analysts and data scientists can decide which data to use for a specific analysis.

10. Catalog metadata

A catalog captures the metadata by which data can be classified. Cataloging is the process of recording valuable metadata so it can be used to characterize the data and decide whether to use it. There are basically two types of metadata: business and technical. Business metadata covers definitions, logical data models, logical entities, and so on, while technical metadata captures the physical implementation of the data structures: databases, schemas, columns, quality scores, and so on.

Based on catalog information, analysts can choose the right data points in the right context. For example, imagine a data scientist wants to do exploratory analysis of inventory turns, a measure defined in both the ERP and the inventory system. If the term is cataloged, the data scientist can decide, based on context, whether to use the column from the ERP or from the inventory system.
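A minimal sketch of what such a catalog entry might hold, using the inventory-turns example (the fields and values are illustrative assumptions, not a standard):

```python
# Catalog entry separating business and technical metadata.
from dataclasses import dataclass, field

@dataclass
class BusinessMetadata:
    definition: str        # what the data means to the business
    logical_entity: str    # e.g. "Inventory Turn"
    owner: str             # accountable business owner

@dataclass
class TechnicalMetadata:
    database: str
    schema: str
    column: str
    data_type: str
    quality_score: float   # 0.0 - 1.0, from profiling

@dataclass
class CatalogEntry:
    name: str
    business: BusinessMetadata
    technical: TechnicalMetadata
    lineage: list[str] = field(default_factory=list)  # upstream sources

entry = CatalogEntry(
    name="inventory_turns",
    business=BusinessMetadata(
        definition="Times inventory is sold and replaced per period",
        logical_entity="Inventory Turn",
        owner="Supply Chain"),
    technical=TechnicalMetadata(
        database="erp_prod", schema="mfg", column="inv_turns",
        data_type="DECIMAL(10,2)", quality_score=0.92),
    lineage=["erp_prod.mfg.stock_movements"],
)
```

With entries like this from both the ERP and the inventory system, the data scientist can compare definitions, lineage, and quality scores before choosing a source.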

11. Comparison between data lake and traditional data warehouse

The main differences are as follows.

First, the philosophy differs. In a data lake architecture, we load the data first, in its original form, and decide what to do with it later. In the traditional DWH architecture, we must first understand the data, model it, and only then load it.

Data in the lake is stored in its original format, while data in the DWH is stored in a structured format, like lake water versus distilled water.

The data lake supports a variety of users.

Analytics projects are inherently agile: once you see an output, you think of more questions and want more. Data lakes are agile by nature; because all the data is stored and cataloged, new requirements can be accommodated easily as they arise.

12. Data lake architecture on Azure

Cloud platforms are well suited to implementing a data lake architecture: they offer many composable services that can be combined to achieve the desired scale. Microsoft's Cortana Intelligence Suite provides services that map onto the components of the data lake architecture.
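One plausible mapping of this article's components to Azure services of the Cortana Intelligence Suite era is sketched below (this is an assumption for illustration, not an official reference architecture):

```python
# Hypothetical mapping of data lake components to Azure services.
AZURE_MAPPING = {
    "raw data store":       ["Azure Data Lake Store", "Azure Blob Storage"],
    "batch engine":         ["Azure Data Lake Analytics", "Azure HDInsight"],
    "real-time engine":     ["Azure Event Hubs", "Azure Stream Analytics"],
    "processed data store": ["Azure SQL Data Warehouse"],
    "analytics sandbox":    ["Azure Machine Learning", "Azure HDInsight"],
    "catalog & governance": ["Azure Data Catalog"],
    "orchestration":        ["Azure Data Factory"],
    "reporting / BI":       ["Power BI"],
}

for component, services in AZURE_MAPPING.items():
    print(f"{component:22} -> {', '.join(services)}")
```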

In summary, the data lake is a new architectural paradigm for big data.

A data lake can serve many kinds of data and users: storing data in its original format meets a wide range of needs and delivers insight faster.

Careful data cataloging and governance are key to a successful data lake implementation.

Cloud platforms provide end-to-end solutions for implementing an economical, scalable data lake architecture.
