What is LakeHouse?

2025-03-28 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/01 Report--

What is LakeHouse? This article answers that question in detail, in the hope of giving readers who are trying to understand the concept a simpler and easier way in.

1. Introduction

Over the past few years at Databricks, we have seen a new data management paradigm emerge independently across many customers and use cases: LakeHouse. In this article, we describe this new paradigm and its advantages over previous approaches.

Data warehouses have a long history in decision support and business intelligence applications. Data warehouse technology has kept evolving since its inception in the 1980s, and MPP architectures have enabled systems to handle larger data volumes. But while data warehouses are great for structured data, many modern enterprises must deal with unstructured data, semi-structured data, and data with high variety, velocity, and volume. A data warehouse is not suited to many of these scenarios, and it is not the most cost-effective option.

As companies began collecting large amounts of data from many different sources, architects began to envision a single system to house data for many different analytics products and workloads. About a decade ago, companies began building data lakes: repositories of raw data in a variety of formats. While data lakes are good for storing data, they lack some key features: they do not support transactions, they do not enforce data quality, and their lack of consistency and isolation makes it almost impossible to mix appends and reads, or batch and streaming jobs. For these reasons, many of the promises of data lakes have not been fulfilled and, in many cases, many of the benefits of data warehouses have been lost.

Companies' need for a flexible, high-performance system has not diminished. They require systems for a wide variety of data applications, including SQL analytics, real-time monitoring, data science, and machine learning. Most recent advances in AI are in models that better handle unstructured data (text, images, video, audio), precisely the types of data that data warehouses are not optimized for. A common approach is to use multiple systems: a data lake, several data warehouses, and other specialized systems such as streaming, time-series, graph, and image databases. Maintaining this many systems introduces additional complexity and, more importantly, delay, because data professionals must move or copy data between the different systems.

2. What is LakeHouse?

New systems are emerging to address the limitations of data lakes. LakeHouse is a new paradigm that combines the strengths of data lakes and data warehouses. It relies on a new system design: implementing data structures and data management features similar to those of a data warehouse directly on top of the kind of low-cost storage used for data lakes. A LakeHouse is what you would get if you had to redesign the data warehouse today, now that cheap and highly reliable storage (in the form of object storage) is available.

LakeHouse has the following key features:

Transaction support: In an enterprise, many data pipelines often read and write data concurrently. ACID transaction support ensures consistency when multiple parties read or write data concurrently, typically using SQL.

Schema enforcement and governance: LakeHouse should have a way to support schema enforcement and evolution, and should support data warehouse schema designs such as star and snowflake schemas. The system should be able to reason about data integrity and have robust governance and auditing mechanisms.

BI support: LakeHouse lets BI tools run directly on the source data. This improves data freshness, reduces latency, and avoids the cost of operating two copies of the data, one in a data lake and one in a data warehouse.

Separation of storage and compute: This means that storage and compute use separate clusters, so these systems can support more user concurrency and larger data volumes. Some modern data warehouses also have this property.

Openness: The storage formats used (such as Parquet) are open and standardized, and APIs are provided so that various tools and engines (including machine learning and Python / R libraries) can access the data directly and efficiently.

Support for multiple data types from unstructured to structured: LakeHouse can be used to store, optimize, analyze, and access data types required for many data applications including images, video, audio, semi-structured data, and text.

Support for a variety of workloads: including data science, machine learning, SQL, and analytics. Multiple tools may be required to support these workloads, but they all rely on the same underlying data repository.

End-to-end streaming: Real-time reporting is a standard application in many enterprises. Support for streaming eliminates the need to build a separate system dedicated to serving real-time data applications.
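The transaction support and schema enforcement features above can be illustrated with a toy in-memory table. This is only a minimal sketch under stated assumptions, not how Delta Lake, Apache Iceberg, or Apache Hudi are implemented: the names ToyTable, SchemaError, and CommitConflict are hypothetical, the "log" is a Python list rather than files on object storage, and any writer holding a stale version is simply rejected.

```python
from dataclasses import dataclass, field


class SchemaError(ValueError):
    """Raised when a row does not match the table schema."""


class CommitConflict(RuntimeError):
    """Raised when another writer committed first."""


@dataclass
class ToyTable:
    # Column name -> required Python type (hypothetical schema representation).
    schema: dict
    # Append-only commit log; entry i holds the rows added by version i + 1.
    log: list = field(default_factory=list)

    @property
    def version(self) -> int:
        """Number of commits applied so far."""
        return len(self.log)

    def _validate(self, row: dict) -> None:
        """Schema enforcement: exact column set and matching types."""
        if set(row) != set(self.schema):
            raise SchemaError(
                f"columns {sorted(row)} do not match schema {sorted(self.schema)}"
            )
        for name, expected in self.schema.items():
            if not isinstance(row[name], expected):
                raise SchemaError(f"column {name!r} expects {expected.__name__}")

    def commit(self, rows: list, read_version: int) -> int:
        """Append all rows as one new version, or fail without changing the table."""
        for row in rows:
            self._validate(row)  # reject the whole batch before writing anything
        if read_version != self.version:
            # Optimistic concurrency: the writer's snapshot is stale.
            raise CommitConflict(
                f"table moved from v{read_version} to v{self.version}"
            )
        self.log.append(list(rows))
        return self.version

    def read(self) -> list:
        """Return every row from every committed version."""
        return [row for batch in self.log for row in batch]
```

A writer would record table.version when it starts, then call table.commit(rows, read_version=...) with that value. A batch containing one bad row, or a commit based on an outdated version, leaves the table unchanged, which is the all-or-nothing behavior the feature list describes.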

These are the key features of LakeHouse. Enterprise-grade systems may require additional features. Tools for security and access control are basic requirements. Data governance capabilities, including auditing, retention, and lineage, have become critical, especially in light of recent privacy regulations. You may also need data discovery tools such as data catalogs and data usage metrics. With LakeHouse, such enterprise features only need to be implemented, tested, and administered for a single system.

3. Early examples

The Databricks platform has the architectural features of a LakeHouse. Microsoft's Azure Synapse Analytics service, which integrates with Azure Databricks, enables a similar LakeHouse pattern. Other managed services (such as BigQuery and Redshift Spectrum) have some of the LakeHouse features listed above, but they are primarily targeted at BI and other SQL applications. Companies that want to build their own system can draw on open source file formats suitable for building a LakeHouse (Delta Lake, Apache Iceberg, Apache Hudi).

Consolidating data lakes and data warehouses into one system means data teams can move faster, because they can use data without having to access multiple systems. The level of SQL support and BI tool integration in these early LakeHouses is generally sufficient for most enterprise data warehouse needs. Materialized views and stored procedures are available, but users may need to employ other mechanisms that are not equivalent to those found in traditional data warehouses. This matters most for "lift and shift" scenarios, which require a system whose semantics are nearly identical to those of older commercial data warehouses.

What about LakeHouse support for other types of data applications? LakeHouse users can use a variety of standard tools (Spark, Python, R, machine learning libraries) to handle non-BI workloads such as data science and machine learning. Data exploration and refinement are standard steps in many analytics and data science applications. Delta Lake lets users incrementally improve the quality of the data in their LakeHouse until it is ready for consumption.
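The idea of improving data quality step by step can be sketched in plain Python. This is an illustration only and does not use Delta Lake's API: the record fields (user_id, amount, ts), the three stages (raw, cleaned, aggregated), and the function names are all assumptions chosen for the example.

```python
from datetime import datetime


def parse_event(raw: dict):
    """Return a cleaned record, or None if the raw record cannot be repaired."""
    try:
        return {
            "user_id": int(raw["user_id"]),
            "amount": round(float(raw["amount"]), 2),
            "day": datetime.fromisoformat(raw["ts"]).date().isoformat(),
        }
    except (KeyError, TypeError, ValueError):
        # Missing field, wrong type, or unparseable value.
        return None


def refine(raw_records: list) -> tuple:
    """Split raw records into cleaned rows and rejected originals."""
    cleaned, rejected = [], []
    for raw in raw_records:
        row = parse_event(raw)
        if row is None:
            rejected.append(raw)  # kept for inspection rather than silently dropped
        else:
            cleaned.append(row)
    return cleaned, rejected


def daily_totals(cleaned: list) -> dict:
    """Aggregate cleaned rows into a report-ready table keyed by day."""
    totals = {}
    for row in cleaned:
        totals[row["day"]] = round(totals.get(row["day"], 0.0) + row["amount"], 2)
    return totals
```

Each stage reads the output of the previous one, so the raw data stays available while cleaner, more structured tables are derived from it. In a real LakeHouse each stage would be a table on shared storage rather than a Python list.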

Although distributed file systems can be used for the storage layer, object storage is more common in LakeHouse. Object storage provides low-cost, highly available storage that excels at massively parallel reads, a fundamental requirement of modern data warehouses.

4. From BI to AI

LakeHouse is a new data management paradigm that radically simplifies enterprise data infrastructure and promises to accelerate innovation in an era when machine learning is reshaping every industry. In the past, most of the data that went into a company's products or decisions was structured data from operational systems; today, many products incorporate AI in the form of computer vision and speech models, text mining, and more. Why use LakeHouse instead of a data lake for AI? LakeHouse can provide data versioning, governance, security, and ACID properties, even for unstructured data.
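One way to picture data versioning for unstructured data is a store that keeps immutable blobs keyed by content hash, plus one manifest per version. The sketch below is an assumption made for illustration (the class VersionedBlobStore and its methods are hypothetical) and does not describe any particular LakeHouse product.

```python
import hashlib


class VersionedBlobStore:
    """Toy store: immutable blobs keyed by content hash, plus versioned manifests."""

    def __init__(self):
        self._blobs = {}        # sha256 hex digest -> bytes
        self._manifests = [{}]  # index = version number, value = {name: digest}

    @property
    def latest(self) -> int:
        """Most recent version number (0 is the empty store)."""
        return len(self._manifests) - 1

    def put(self, name: str, data: bytes) -> int:
        """Store data under name and return the new version number."""
        digest = hashlib.sha256(data).hexdigest()
        self._blobs.setdefault(digest, data)  # identical content is stored once
        manifest = dict(self._manifests[-1])  # copy, so older versions stay intact
        manifest[name] = digest
        self._manifests.append(manifest)
        return self.latest

    def get(self, name: str, version=None) -> bytes:
        """Read name as of the given version (latest by default)."""
        manifest = self._manifests[self.latest if version is None else version]
        return self._blobs[manifest[name]]
```

Because blobs are never overwritten, a model trained against version 1 of an image set can be reproduced later by reading with version=1, even after the same file names have been updated.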

Current LakeHouses reduce costs, but their performance can still lag behind specialized systems (such as data warehouses) that have years of investment and real-world deployments behind them. Users may also prefer certain tools (BI tools, IDEs, notebooks), so LakeHouse needs to improve its UX and its connectors to popular tools in order to appeal to a wider range of users. These and other issues will be addressed as the technology matures and develops. Over time, LakeHouse will close these gaps while retaining its core properties: being simpler, more cost-effective, and better able to serve a variety of data applications.

That concludes this answer to the question of what LakeHouse is. I hope the content above is helpful.


© 2024 shulou.com SLNews company. All rights reserved.
