Shulou (Shulou.com), 05/31 report. Updated 2025-01-17. From: SLTechnology News&Howtos.
This article explains the characteristics of the Data Lakehouse: what it is, which problems it solves, and which problems remain.
Background
Data Lake and Data Lakehouse have become two of the hottest buzzwords in the big data field. When we meet such buzzwords, as engineers we tend to ask: is this a new technology, or just old wine in a new bottle? What problems does it solve, and what new features does it have? What is its current state, and what problems remain?
With these questions in mind, this article looks at the Data Lakehouse from the author's understanding and explores the technology behind it.
Data Lakehouse (an integrated lake and warehouse) is a new data architecture that combines the advantages of the data warehouse and the data lake. Data analysts and data scientists can work on data in the same storage, and companies can manage their data more conveniently. So what is a Data Lakehouse, and what features does it have?
What are the features of Data Lakehouse?
For a long time, we have organized data using two kinds of data storage:
Data warehouse: a storage architecture that mainly holds structured data organized as in a relational database. The data is transformed, consolidated, cleaned, and loaded into target tables. In a data warehouse, the stored data strictly matches its defined schema.
Data lake: a storage architecture that can hold any type of data, including unstructured data such as pictures and documents. Data lakes are usually larger and cheaper to store data in. The stored data does not need to conform to a specific schema, and the data lake does not try to enforce one. Instead, the owner of the data usually applies the schema when reading the data (schema-on-read) and transforms the data when it is processed.
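The difference between the two storage styles can be sketched in a few lines of Python. This is a minimal illustration, not any real product's API: ORDER_SCHEMA, validate, Warehouse, and Lake are hypothetical names, and JSON strings stand in for raw files in the lake.

```python
import json

# Hypothetical schema for illustration: column name -> required Python type.
ORDER_SCHEMA = {"order_id": int, "amount": float}


def validate(record, schema):
    """Return the record unchanged if it matches the schema, else raise ValueError."""
    if not isinstance(record, dict):
        raise ValueError("record is not a mapping of columns to values")
    if set(record) != set(schema):
        raise ValueError(f"unexpected columns: {sorted(record)}")
    for column, expected_type in schema.items():
        if not isinstance(record[column], expected_type):
            raise ValueError(f"{column} is not {expected_type.__name__}")
    return record


class Warehouse:
    """Warehouse style: a record is checked against the schema before it is stored."""

    def __init__(self, schema):
        self.schema = schema
        self.rows = []

    def write(self, record):
        self.rows.append(validate(record, self.schema))


class Lake:
    """Lake style (schema-on-read): raw data is stored as-is and interpreted by the reader."""

    def __init__(self):
        self.blobs = []

    def write(self, raw):
        self.blobs.append(raw)

    def read(self, schema):
        # The reader applies its own schema and skips blobs that do not fit it.
        rows = []
        for raw in self.blobs:
            try:
                rows.append(validate(json.loads(raw), schema))
            except ValueError:  # also covers invalid JSON
                continue
        return rows
```

With this sketch, the lake accepts anything at write time and only the reader decides what counts as a valid order, while the warehouse rejects a bad record immediately:

```python
lake = Lake()
lake.write('{"order_id": 1, "amount": 19.99}')
lake.write("free-form notes, not an order")
orders = lake.read(ORDER_SCHEMA)  # only the first blob fits the schema

warehouse = Warehouse(ORDER_SCHEMA)
warehouse.write({"order_id": 1, "amount": 19.99})
```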
Today, many companies build both storage structures at the same time: one large data warehouse and several small data lakes. As a result, some data is stored redundantly across the two.
The Data Lakehouse tries to bridge the gap between the data warehouse and the data lake. Building the data warehouse on top of the data lake makes storage cheaper and more flexible. At the same time, a lakehouse can improve data quality and reduce data redundancy. ETL plays a very important role in building a lakehouse, because it can transform unstructured data into structured data.
The concept of the Data Lakehouse was put forward by Databricks in a paper [1], which also lists the following characteristics:
Transaction support: a Lakehouse can serve many different data pipelines. This means that it can support concurrent read and write transactions without compromising data integrity.
Schemas: data warehouses impose a schema on all the data they store, while data lakes do not. The Lakehouse architecture can apply a schema to most of the data, as the application requires, and standardize it.
Support for reporting and analytical applications: both reporting and analytical applications can use this storage architecture. The data stored in a Lakehouse has been cleaned and integrated, which speeds up analysis. Compared with a data warehouse, it can also hold more data, and the data is more timely, which can significantly improve report quality.
Data type extension: a data warehouse can only support structured data, while a Lakehouse can support many more types of data, including files, video, audio, and system logs.
End-to-end streaming support: a Lakehouse can support streaming analysis, which meets the need for real-time reports. Real-time reports are becoming increasingly important in more and more enterprises.
Separation of compute and storage: data lakes are often implemented on low-cost hardware with a cluster architecture, which provides very cheap, separate storage. A Lakehouse is built on top of the data lake, so it naturally adopts the same separation of storage and compute: data is stored in one cluster and processed in another.
Openness: a Lakehouse is usually built from components such as Iceberg, Hudi, and Delta Lake. First, these components are open source. Second, they use open, compatible formats such as Parquet and ORC as the underlying data storage format, so different engines and different languages can operate on the Lakehouse.
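To make the transaction-support feature more concrete, here is a toy, in-memory sketch of optimistic concurrency over an append-only commit log. It is an illustration under stated assumptions, not how any particular table format implements transactions: ToyTable, CommitConflict, and append_with_retry are hypothetical names, and a Python list stands in for the log that a real system would keep in storage.

```python
import threading


class CommitConflict(Exception):
    """Raised when another writer committed first."""


class ToyTable:
    """In-memory table whose state is an append-only list of commits."""

    def __init__(self):
        self._log = []  # each entry is the list of rows added by one commit
        self._lock = threading.Lock()

    def snapshot(self):
        """Return (version, rows) for a consistent view of the table."""
        with self._lock:
            version = len(self._log)
            rows = [row for commit in self._log for row in commit]
        return version, rows

    def commit(self, read_version, new_rows):
        """Append new_rows only if nobody has committed since read_version."""
        with self._lock:
            if read_version != len(self._log):
                raise CommitConflict(f"table moved past version {read_version}")
            self._log.append(list(new_rows))
            return len(self._log)


def append_with_retry(table, new_rows, max_attempts=5):
    """Optimistic writer: read a version, try to commit, retry on conflict."""
    for _attempt in range(max_attempts):
        version, _rows = table.snapshot()
        try:
            return table.commit(version, new_rows)
        except CommitConflict:
            continue
    raise CommitConflict("gave up after repeated conflicts")
```

Readers always see the rows of a whole number of commits, never a half-written one, and a writer that lost the race gets a conflict instead of silently overwriting another pipeline's data. A blind append can simply retry, as append_with_retry does; a writer whose new rows depend on what it read would have to recompute them before retrying.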
The concept of the Lakehouse was first put forward by Databricks; Azure Synapse Analytics is a similar product. Lakehouse technology is still developing, so the features described above will continue to be revised and improved.
What problems does Data Lakehouse solve?
Having covered the features of the Data Lakehouse, which problems does it solve?
For years, data warehouses and data lakes have coexisted and developed separately in many companies without running into serious problems. There is still room for improvement in some areas, however, such as:
Data duplication: if an organization maintains one data lake and multiple data warehouses at the same time, data redundancy is inevitable. At best this only makes data processing inefficient; at worst it leads to data inconsistencies. A Data Lakehouse unifies everything, removes the duplication, and achieves a true Single Version of Truth.
High storage costs: both data warehouses and data lakes aim to reduce the cost of data storage. Data warehouses often reduce costs by reducing redundancy and integrating heterogeneous data sources. Data lakes, on the other hand, often use big data file systems (such as Hadoop HDFS) and Spark to store and process data on cheap hardware. Combining these techniques is the cheapest approach, and that is the goal of the current Lakehouse architecture.
Differences between reporting and analytical applications: report analysts tend to use integrated data, such as data warehouses or data marts. Data scientists, on the other hand, are more likely to work with the data lake, applying a variety of analytical techniques to raw data. Within an organization the two teams often have little contact, yet their work overlaps to some degree and can even produce contradictory results. With a Data Lakehouse, both teams can work on the same data schema and avoid unnecessary duplication.
Data stagnation: data stagnation is the most serious problem in a data lake. If the data remains unmanaged, the lake quickly becomes a data swamp. It is easy to throw data into the lake, but effective governance is often lacking, and over time the timeliness of the data becomes harder and harder to trace. A Lakehouse introduces a catalog for massive data, which helps improve the timeliness of the data used for analysis.
Risk of potential incompatibility: data analytics is still an emerging field, and new tools and techniques appear every year. Some may only be compatible with data lakes, while others may only be compatible with data warehouses. The Lakehouse's flexible architecture means that companies are prepared for the future either way.
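As a rough illustration of how a catalog helps against data stagnation, the following sketch records when each dataset was last updated and lists the ones that have gone stale. It is a simplified assumption of what a catalog tracks: Catalog, CatalogEntry, and the dataset and team names are hypothetical and do not correspond to any real catalog service.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class CatalogEntry:
    name: str
    owner: str
    last_updated: datetime


class Catalog:
    """Toy catalog: one metadata entry per dataset in the lake."""

    def __init__(self):
        self._entries = {}

    def register(self, name, owner, updated_at):
        """Add a dataset, or refresh its metadata after a new load."""
        self._entries[name] = CatalogEntry(name, owner, updated_at)

    def stale(self, now, max_age):
        """Return the sorted names of datasets not updated within max_age."""
        return sorted(
            entry.name
            for entry in self._entries.values()
            if now - entry.last_updated > max_age
        )


catalog = Catalog()
catalog.register("clickstream_raw", "web-team", datetime(2024, 1, 1))
catalog.register("orders_clean", "sales-team", datetime(2024, 3, 1))
stale_names = catalog.stale(now=datetime(2024, 3, 2), max_age=timedelta(days=30))
```

Here only clickstream_raw is older than 30 days, so it is the only name in stale_names. Without such metadata, nobody can tell which files in the lake are still fresh enough to analyze.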
Problems in Data Lakehouse
The existing Lakehouse architecture still has some problems, the most significant of which are:
Unified architecture: the Lakehouse's unified architecture has many advantages, but it also introduces problems. In general, a unified architecture is inflexible, hard to maintain, and hard to fit to every user's needs, so architects tend to prefer a multi-model architecture that tailors a different paradigm to each scenario.
Not a fundamental improvement on the existing architecture: there are still doubts about whether a Lakehouse really brings additional value. Some argue that combining existing data warehouses and data lakes with appropriate tools would yield similar efficiency.
Immature technology: Lakehouse technology is not yet mature, and there is still a long way to go before it delivers all the capabilities described above.