What is the relationship between Hadoop and data warehouse

2025-01-19 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/02 Report--

What is the relationship between Hadoop and the data warehouse? This article analyzes the question in detail, in the hope of helping readers who are wrestling with it find a simpler path to an answer.

The benefits of the RDBMS

Billions of dollars have been invested globally in infrastructure to run these databases, which people have operated and refined to serve applications in every vertical market. They remain the undisputed kings of transaction processing.

Other benefits of RDBMS include:

Excellent failure recovery; in most cases the database can be restored to its most recent consistent state

RDBMS deployments can be distributed across multiple physical locations with relative ease

RDBMS effectively guarantee a high degree of data consistency

SQL is easy to learn

A large pool of IT professionals familiar with installed RDBMS

Users can perform fairly complex data queries
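The strengths listed above, transactional consistency, recovery to a known state, and complex read-only queries, can be sketched with Python's built-in sqlite3 module, which embeds a small single-node RDBMS. The table and figures are illustrative only:

```python
import sqlite3

# In-memory RDBMS: structured tables, a fixed schema, and ACID transactions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (id INTEGER PRIMARY KEY, owner TEXT, balance REAL);
    INSERT INTO accounts VALUES (1, 'alice', 100.0), (2, 'bob', 50.0);
""")

# Transactional transfer: either both updates commit or neither does.
try:
    with conn:  # opens a transaction; rolls back automatically on exception
        conn.execute("UPDATE accounts SET balance = balance - 30 WHERE id = 1")
        conn.execute("UPDATE accounts SET balance = balance + 30 WHERE id = 2")
except sqlite3.Error:
    pass  # on failure the database is restored to its prior state

# A read-only aggregate query over the structured data.
total = conn.execute("SELECT SUM(balance) FROM accounts").fetchone()[0]
print(total)  # total balance is conserved: 150.0
```

The `with conn:` block is the key point: the RDBMS, not the application, guarantees that a partial failure leaves the data in its last consistent state.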

What are the disadvantages? The truth is that as long as the data being managed is structured and relational, there is little to complain about. Scalability is an issue, because most of these systems are proprietary and their core storage is expensive, especially as the database grows. But these venerable databases and their accompanying tools and applications are entrenched in every Fortune 1000 company, and for good reason: they deliver value.

But then came big data, much of it unstructured. It includes data from clickstreams, website logs, photos, videos, audio clips, XML documents, emails, tweets, and more.

Initially, to IT departments, most of this data resembled background noise emanating from the depths of the universe: just a lot of static. But remember this: a man named Arno Penzias detected that deep-space background noise in 1964 and eventually interpreted it as evidence confirming the Big Bang theory. He won the Nobel Prize.

The same goes for big data. Locked inside all these disparate big data sources are valuable insights into customer behavior, market trends, service needs, and much else. This is information technology's own Big Bang.

Big data has become the largest component of overall data growth, and traditional analytics platforms and solutions are comparatively poor at handling unstructured data, so the analytics field is undergoing profound change.

IT evolution, not revolution

But here's the important thing to remember. Big data analytics will not replace traditional structured data analytics, and certainly not in the foreseeable future.

On the contrary. As stated in The Executive's Guide to Big Data & Apache Hadoop, "When you combine big data with traditional information sources to come up with innovative solutions that generate significant business value, everything is fascinating."

So you might see a manufacturer link its inventory system (in an RDBMS) with the images and video descriptions of its product catalog held in a document store. This helps the customer select and order the right part immediately.
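A minimal sketch of that combination, with sqlite3 standing in for the inventory RDBMS and a list of JSON documents standing in for the document store; all part numbers and field names are hypothetical:

```python
import sqlite3, json

# Relational side: the inventory system (hypothetical schema for illustration).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE inventory (part_no TEXT PRIMARY KEY, qty INTEGER)")
db.executemany("INSERT INTO inventory VALUES (?, ?)",
               [("A-100", 12), ("B-200", 0)])

# Document-store side: catalog entries carrying unstructured media references.
catalog_docs = [
    json.dumps({"part_no": "A-100", "image": "a100.jpg", "video": "a100.mp4"}),
    json.dumps({"part_no": "B-200", "image": "b200.jpg"}),
]

# Combine the two worlds: enrich each in-stock part with its catalog media,
# so the customer only sees parts that can actually be ordered.
in_stock = {p for (p,) in db.execute(
    "SELECT part_no FROM inventory WHERE qty > 0")}
enriched = [doc for doc in map(json.loads, catalog_docs)
            if doc["part_no"] in in_stock]
print(enriched)
```

The point is not the specific storage engines but the join across them: structured stock levels filter and enrich unstructured catalog content.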

Alternatively, a hotel chain can combine web-based real-estate search results with its own historical occupancy metrics in an RDBMS to optimize nightly pricing and increase revenue through better yield management.

Coexistence, not substitution. This is the right way to see the relationship between Hadoop-based big data analytics and the RDBMS and MPP worlds. As a result, organizations wisely focus on Hadoop distributions to optimize data flow between Hadoop-based data lakes and legacy systems. In other words, keep the old and innovate with the new.

Which platform to use?

There are three basic data architectures in common use: data warehouses, massively parallel processing systems (MPP), and Hadoop. Each one accommodates SQL in a different way.

A data warehouse is essentially a large database management system optimized for read-only queries across structured data. Data warehouses are relational and therefore very SQL-friendly. They offer fast performance and relatively easy management, largely because their symmetric multiprocessing (SMP) architecture shares resources such as memory and the operating system and routes all operations through a single processing node.

The biggest downside is cost and flexibility. Most data warehouses are built on proprietary hardware and are orders of magnitude more expensive than other approaches. In a financial comparison conducted by Wikibon, it was found that traditional data warehouses take more than six times longer to break even than data lakes.

Traditional data warehouses can only operate on data they know about. They have fixed schemas and are inflexible when dealing with unstructured data. They are useful for transactional analysis, where decisions must be made quickly against a defined set of data elements, but less effective in applications where relationships are ambiguous, such as recommendation engines.

MPP data warehouses are an evolution of traditional warehouses that utilize multiple processors bundled together via a common interconnect. SMP architecture shares everything between processors, while MPP architecture shares nothing. Each server has its own operating system, processor, memory and storage. The activities of multiple processors are coordinated by a master processor that distributes data across nodes and coordinates actions and results.

MPP data warehouses are highly scalable, because adding processors yields almost linear growth in performance, typically at lower cost than scaling a single-node data warehouse. The MPP architecture is also well suited to working on multiple databases simultaneously, which makes it more flexible than a traditional data warehouse. However, just like traditional warehouses, MPP systems can generally only handle structured data organized according to a schema.

However, MPP architectures share some of the limitations of SMP data warehouses. Because they require complex engineering, most are proprietary to individual vendors, making them costly and relatively inflexible. They are also subject to the same ETL requirements as traditional data warehouses.

From an SQL perspective, MPP data warehouses have one major architectural difference: the rows of a table are distributed across processors for maximum performance. This means that queries must account for the fact that a single logical table is physically split across many nodes. Fortunately, most MPP vendors hide this detail inside their SQL implementations.
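How row distribution and the resulting two-phase query might look can be sketched in a few lines of Python; the node count and hash function are illustrative, not how any particular vendor does it:

```python
# Toy sketch of MPP-style row distribution (hash partitioning).
from collections import defaultdict

NUM_NODES = 4

def node_for(key: str) -> int:
    # A stable hash keeps the same key on the same node across queries.
    return sum(key.encode()) % NUM_NODES

nodes = defaultdict(list)
for row in [("alice", 3), ("bob", 7), ("carol", 1), ("dave", 9)]:
    nodes[node_for(row[0])].append(row)

# A query such as SUM over the whole table becomes a two-phase plan:
# each node aggregates its local partition, then a master combines them.
partials = [sum(v for _, v in rows) for rows in nodes.values()]
print(sum(partials))  # same answer as a single-node scan: 20
```

This is the detail the vendors hide: the user writes one SUM over one logical table, and the planner turns it into per-node partial aggregates plus a final merge.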

Hadoop is architecturally similar to an MPP data warehouse, but with some notable differences. Processors are not strictly bound into a parallel architecture; they are loosely coupled across a Hadoop cluster, and each can work on different data sources. The data-manipulation engine, data catalog, and storage engine can work independently of one another, with Hadoop acting as the collection point. Crucially, Hadoop easily accommodates both structured and unstructured data, which makes it an ideal environment for iterative queries. Instead of forcing the analysis output into a narrow structure defined in advance, business users can experiment to find the queries that matter most to them. Relevant data can then be extracted and loaded into a data warehouse for fast queries.
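The extract-and-load step described above can be sketched end to end: raw, schemaless log lines are given structure only at read time (tolerating noise, as a data lake does), and the refined records are then loaded into a relational table for fast SQL. The log format and table schema are invented for illustration:

```python
import re, sqlite3

# Raw, schemaless "data lake" content: web server log lines, one malformed.
raw_logs = [
    "10.0.0.1 GET /product/42 200",
    "10.0.0.2 GET /product/42 404",
    "malformed line with no structure",
    "10.0.0.1 POST /cart 200",
]

# Schema-on-read: impose structure at query time, silently skipping noise.
pattern = re.compile(r"(\S+) (GET|POST) (\S+) (\d{3})")
records = [m.groups() for line in raw_logs if (m := pattern.match(line))]

# Extract and load the refined records into a warehouse-style table.
wh = sqlite3.connect(":memory:")
wh.execute("CREATE TABLE hits (ip TEXT, method TEXT, path TEXT, status INTEGER)")
wh.executemany("INSERT INTO hits VALUES (?, ?, ?, ?)", records)
errors = wh.execute(
    "SELECT COUNT(*) FROM hits WHERE status >= 400").fetchone()[0]
print(errors)  # one request with an error status
```

Once the interesting structure has been discovered iteratively on the raw data, only the distilled records need to live in the warehouse.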

Let's look at the main differences between a data lake and a data warehouse (summarized from KDnuggets):

Data: A data warehouse holds only structured data, while a data lake supports all data types: structured, semi-structured, and unstructured.

Processing: A data warehouse uses schema-on-write; a data lake uses schema-on-read.

Storage: Storing large volumes of data in a data warehouse can be expensive, while data lakes are designed for low-cost storage.

Agility: In a data warehouse the data sits in a fixed configuration and is far less agile; data in a data lake can be reconfigured as needed.

Users: The data lake approach supports all users (data scientists, business professionals), while the data warehouse is primarily used by business professionals.
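The schema-on-write versus schema-on-read distinction in the list above can be made concrete with a toy sketch (not a real engine): the "warehouse" validates rows at load time and rejects anything off-schema, while the "lake" accepts everything and applies structure per query:

```python
# Schema-on-write: validate at load time, reject rows that do not fit.
def load_into_warehouse(table, row):
    if set(row) != {"id", "amount"} or not isinstance(row["amount"], (int, float)):
        raise ValueError("row rejected: does not match schema")
    table.append(row)

# Schema-on-read: the lake accepts anything; structure is applied per query.
def query_lake(lake, field):
    return [doc[field] for doc in lake if field in doc]

warehouse, lake = [], []
good = {"id": 1, "amount": 9.5}
odd = {"id": 2, "note": "free-form text"}

load_into_warehouse(warehouse, good)       # accepted
try:
    load_into_warehouse(warehouse, odd)    # rejected at write time
except ValueError:
    pass

lake.extend([good, odd])                   # the lake takes both as-is
print(query_lake(lake, "amount"))          # structure applied only on read
```

The trade-off follows directly: the warehouse pays the schema cost up front and queries fast; the lake defers it, keeping every record at the price of per-query interpretation.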

The primary use case for Hadoop remains the "data lake" because it stores a lot of unstructured data for refinement and extraction into relational "data marts" or data warehouses. In fact, Gartner says they are seeing a significant increase in customer queries for data lakes, as shown below:

Just look at the numbers: @Gartner_inc saw a 72% increase in data lake inquiries from 2014 to 2015.

- Nick Heudecker (@nheudecker)

Bringing SQL functionality to Hadoop has required many parallel efforts, and these projects all face the same structural barrier: Hadoop is schema-free and its data is unstructured. Applying a "structured" query language to unstructured data is a bit unnatural, but these projects are maturing rapidly. The architecture diagram below shows how some of these different approaches fit together in a modern data architecture.

That is the answer to the question of the relationship between Hadoop and the data warehouse. We hope the content above is helpful; if you still have unresolved questions, you can follow the industry information channel to learn more.
