
Mode Selection and Construction of a Cloud Data Warehouse

2026-04-21 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

The importance of data to an enterprise is self-evident. Making good use of internal data and extracting more value from it is a key concern for enterprise managers. As one of the most traditional data applications, the data warehouse plays an important role in the enterprise, and building and configuring it correctly matters a great deal for data analysis. A well-designed data warehouse lets data analysts work like fish in water; a poorly designed one can trap the enterprise in endless problems and put it at a disadvantage in future competition.

As more and more infrastructure migrates to the cloud, does the data warehouse also need to move to the cloud? Can moving to the cloud solve common problems such as performance, cost, ease of use and elasticity? If you are considering the cloud, what should you pay attention to? What are the characteristics of the mainstream cloud vendors' products today? This article tries to give some answers to these questions for your reference. Some of its content draws on the lecture materials of David J. DeWitt, a professor at MIT.

I. Construction of data warehouse

There are many ways to build a data warehouse (DW), and enterprises can choose according to their own needs. The following figure briefly lists the major DW construction schemes and compares them in more detail.

1.1 Construction plan

1) Commercial solution

The commercial solution is one of the most traditional and has been the mainstream for the past 20-30 years. The enterprise buys a data warehouse that is delivered as an integrated package of software and hardware. There are many typical products, most of them from well-known international vendors, along with some from domestic vendors.

2) Self-built + open source

This is a common approach among Internet companies: they build their own underlying infrastructure and deploy open source software on it. The enterprise keeps the whole scheme fully under its own control, but it places high technical demands on its own staff. A fairly typical product is Greenplum.

3) Cloud + open source

This is a variation of the previous solution, where the IaaS layer is provided by the cloud vendor while the rest is still self-built. When the enterprise's business is already on the cloud, this scheme is often adopted to make data integration and data migration easier.

4) DW Cloud

Enterprises directly choose a data warehouse cloud service instead of building the warehouse themselves. The rest of this article focuses on this option.

1.2 Comparison of schemes

The four schemes above are compared in terms of cost, operation and maintenance, delivery, scalability and performance.

Cost: the up-front purchase cost and the later operating cost, plus staff effort converted into cost.

Complexity of operation and maintenance: mainly the complexity of the operation and maintenance work that falls on the enterprise's own technical staff.

Delivery speed: the overall delivery speed of the scheme, including the purchase and construction of infrastructure.

Scalability: both capacity expansion and performance expansion of the data warehouse.

Performance: the overall performance of the warehouse.

1.3 Key comparison: cost and performance

As can be seen from the image above:

In schemes 1 and 2, cost and performance are relatively fixed. Scheme 1 (commercial) has a high cost but outstanding performance; scheme 2 (self-built) is medium on both.

In schemes 3 and 4, cost and performance each fall within a range, and the range is wide. Scheme 3 depends mainly on the capability of the infrastructure provided by the cloud vendor. Scheme 4 depends on the capability of the vendor's data warehouse cloud service. This makes the choice of cloud vendor product more demanding, as explained below.

II. Cloud data warehouse

2.1 Cloud solution advantages

Based on the above description, using a data warehouse cloud service has many advantages, including:

Better price-performance ratio (both at up-front purchase and in later operation)

Faster delivery (as fast as minutes)

Better elasticity (scale out or scale in, compute or storage)

Lower complexity of operation and maintenance (the user does not need to operate the system)

Easier data integration (if you are already on the same cloud)

Richer data ecosystem (depending on the cloud vendor's products)

2.2 Key factors of a data warehouse

A data warehouse differs from a transactional database: it is built to make analysis of large amounts of data easier, not to process transactions. This means that a data warehouse is often several orders of magnitude larger than the corresponding transactional databases, and that some key features of transactional databases (such as ACID and response time) may matter less to it. Instead, the data warehouse has its own requirements, and these can also serve as selection criteria when choosing a cloud service.

1) Multiple data integration methods

Getting data into a warehouse and formatting it correctly is often one of the biggest challenges a data warehouse faces. Traditionally, data warehouses have relied on batch extract, transform, load (ETL) jobs. ETL jobs are still important, but it is now also possible to ingest data from streams, and even to directly query data that is not in the warehouse.
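
To make the batch ETL step described above concrete, here is a minimal sketch in Python. It assumes a made-up CSV export as the source and uses the standard library's sqlite3 module as a stand-in for the warehouse; the table name, column names and sample values are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical raw export from a source system (names and values are made up).
RAW_CSV = """order_id,amount,country
1, 19.90 ,us
2,5.00,DE
3,,us
4,12.50,de
"""

def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop rows without an amount, fix types, normalize country codes."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # skip incomplete records
        cleaned.append((int(row["order_id"]), float(amount), row["country"].strip().upper()))
    return cleaned

def load(conn, records):
    """Load: write the cleaned records into the (stand-in) warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS orders (order_id INTEGER, amount REAL, country TEXT)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
    conn.commit()

def run_etl(conn, text=RAW_CSV):
    """Run the three steps in order: extract, transform, load."""
    load(conn, transform(extract(text)))

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    run_etl(conn)
    query = "SELECT country, COUNT(*), SUM(amount) FROM orders GROUP BY country ORDER BY country"
    for row in conn.execute(query):
        print(row)
```

In a real warehouse the load step would normally go through the vendor's bulk loading path rather than row inserts; the sketch only shows the shape of the job.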

2) Support for multiple query types

In addition to typical batch queries, today's data warehouses also need to support ad hoc queries. MapReduce, from the traditional Hadoop big data stack, is not suitable for this kind of query. Many data warehouses therefore turn to massively parallel processing (MPP) databases, which split a query and execute the pieces in parallel on multiple servers. There is also Spark, which uses in-memory parallel processing to complete queries.
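
The MPP idea above (split the work, run it on several servers, merge the results) can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: worker threads stand in for servers, rows are spread round-robin, and the function names are hypothetical rather than taken from any product.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def partition(rows, num_nodes):
    """Spread rows across nodes round-robin (a stand-in for a real distribution policy)."""
    return [rows[i::num_nodes] for i in range(num_nodes)]

def local_aggregate(rows):
    """Work done on one node: partial SUM(amount) GROUP BY region over local rows only."""
    partial = Counter()
    for region, amount in rows:
        partial[region] += amount
    return partial

def mpp_sum_by_region(rows, num_nodes=4):
    """Coordinator: scatter the rows, aggregate each slice concurrently, merge the partials."""
    slices = partition(rows, num_nodes)
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        partials = list(pool.map(local_aggregate, slices))
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds the partial sums key by key
    return dict(total)

if __name__ == "__main__":
    sales = [("east", 10), ("west", 5), ("east", 7), ("north", 1), ("west", 3)]
    print(mpp_sum_by_region(sales, num_nodes=2))
```

The result does not depend on the number of nodes, which is what allows such a system to add servers to shorten query time.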

3) Standard data access methods

Which languages does the data warehouse support for querying? Standard SQL is clearly the most user-friendly option and significantly lowers the barrier for users. High-level languages such as Python and R give users further ways to access the data. Some products, however, rely on their own SQL dialects, and this needs careful evaluation: migration is not a small expense.

4) Elastic resources

Data warehouses are designed to handle large amounts of data, but their scale can change greatly, and the demand for computing resources also changes with the business. A cloud data warehouse is therefore expected to offer highly elastic resources, which sets it apart from the traditional self-built approach. Resources here include not only compute but also data storage. It is also worth checking whether compute and storage can be scaled separately rather than being tightly coupled.

5) Low cost of operation and maintenance

A data warehouse is a complex system, spanning the underlying physical resources, the operating system and the warehouse software up to the data objects and access statements on top. A data warehouse on the cloud needs to provide simple, flexible, automated and even intelligent operation and maintenance capabilities, so that customers find it easy to use and their overall operation and maintenance costs go down.

6) Flexible use

A data warehouse is a resource-intensive application, so cloud vendors need to consider how to reduce the cost to users. Examples include support for pause and resume, and support for scaling compute and storage independently.
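
The effect of pause and resume on cost can be shown with a small worked calculation. The sketch below assumes a simple billing model (compute billed per hour while running, storage billed per TB per month); the prices and the function name are hypothetical and do not correspond to any vendor's price list.

```python
def monthly_cost(compute_price_per_hour, storage_price_per_tb_month, storage_tb,
                 active_hours, hours_in_month=720, can_pause=True):
    """Estimate a monthly bill under a simple model; all prices are hypothetical inputs."""
    if not 0 <= active_hours <= hours_in_month:
        raise ValueError("active_hours must be between 0 and hours_in_month")
    # Without pause/resume, compute is billed for the whole month even when idle.
    billed_hours = active_hours if can_pause else hours_in_month
    compute = compute_price_per_hour * billed_hours
    storage = storage_price_per_tb_month * storage_tb
    return compute + storage

if __name__ == "__main__":
    # Hypothetical workload: 10 TB stored, warehouse busy 8 hours on 22 working days.
    always_on = monthly_cost(2.0, 20.0, 10, active_hours=176, can_pause=False)
    pausable = monthly_cost(2.0, 20.0, 10, active_hours=176, can_pause=True)
    print(always_on, pausable)
```

Under these assumed numbers the storage charge is the same in both cases, and the whole saving comes from not paying for idle compute.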

2.3 Whether to go to the cloud / how to choose?

Data warehouse cloud services have many benefits, but do you have to go to the cloud? The decision has to take the needs of the enterprise into account, weighing the following factors.

1) Is there enough accumulated technical expertise?

A data warehouse has a high technical threshold. Even if you choose open source, you need time to explore and build up experience, unless you directly use an external commercial product.

2) Are you already using the cloud?

If you are already a customer of a cloud, it is easier to integrate data that is on that cloud. Otherwise, loading data across clouds or from on-premises systems will be a big project.

3) Is there a high requirement for availability?

Enterprises differ greatly in this respect. If the enterprise cares a great deal about availability, cloud vendors and commercial products clearly have the advantage.

4) Is the data scale very large?

One of the core difficulties of a data warehouse is the data scale it can support. If the enterprise's data is very large, the self-built approach faces great challenges.

5) Is there a strong need for expansion?

For example, during a period of rapid growth, the enterprise's data scale and usage complexity change greatly, so the data warehouse must scale well and adapt quickly to the enterprise's development. Cloud solutions clearly have the advantage over the other three here.

6) Do usage patterns change dramatically?

For example, if the enterprise's use of data is highly uncertain, the data warehouse must be elastic and able to scale out and in as needed, including in query capacity. None of the three non-cloud solutions can adapt to such rapid change.

7) Is short-term cost pressure high?

If the enterprise needs a data warehouse but, in the short term, both self-building and buying a commercial product require too large an investment, a cloud solution is also worth considering.

III. Two architectures of data warehouse

In terms of technical implementation, data warehouses fall into two categories. Before looking at the vendors' products below, here is a brief introduction to both.

3.1 Shared-Nothing

Nodes are interconnected through a high-speed network, but each node accesses its own local resources without going through the network. This design is simple and scales well. With a good data model, data does not need to move between nodes, and processing efficiency is high.

Each node has its own computing and storage resources, so the two are coupled together. This is the fundamental weakness of this architecture: storage and compute cannot be separated, so they cannot be scaled independently on demand.
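
The following Python sketch illustrates the two points above under simplifying assumptions: rows are placed on nodes by a stable hash of a distribution key, so a join on that key can run locally on each node, and resizing the cluster changes which node owns many of the keys. The helper names are hypothetical, and crc32 is used here only because it is a stable hash in the standard library.

```python
import zlib

def node_for(key, num_nodes):
    """Pick the node that owns a distribution key (stable hash, modulo node count)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_nodes

def distribute(rows, key_index, num_nodes):
    """Place each row on exactly one node; each node stores and processes its own slice."""
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        nodes[node_for(row[key_index], num_nodes)].append(row)
    return nodes

def local_join(customers, orders):
    """Join customers and orders that live on the same node; no network transfer needed."""
    names = {cust_id: name for cust_id, name in customers}
    return [(names[cust_id], amount) for cust_id, amount in orders if cust_id in names]

def moved_fraction(keys, old_nodes, new_nodes):
    """Fraction of keys whose owning node changes when the cluster is resized."""
    keys = list(keys)
    moved = sum(1 for k in keys if node_for(k, old_nodes) != node_for(k, new_nodes))
    return moved / len(keys)

if __name__ == "__main__":
    customers = [(1, "ann"), (2, "bob"), (3, "cy")]
    orders = [(1, 10), (2, 20), (1, 5), (3, 7)]
    cust_nodes = distribute(customers, 0, 3)
    order_nodes = distribute(orders, 0, 3)
    # Both tables share the distribution key, so every node can join its own slice.
    joined = [pair for c, o in zip(cust_nodes, order_nodes) for pair in local_join(c, o)]
    print(sorted(joined))
    print(moved_fraction(range(1000), 4, 5))
```

With plain modulo placement, adding one node relocates a large share of the rows, which is one reason resizing a coupled cluster is expensive. Real systems may use other placement schemes to reduce this.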

3.2 Shared Disk/Storage

Whether a node accesses another node or accesses storage, the traffic goes through the high-speed network. The data itself is kept in "remote storage" rather than locally. The network may become a bottleneck, since throughput is limited by the total IO it can carry: it handles not only the data exchange between nodes but also a large amount of data access traffic.

This approach is very flexible, and compute and storage can be scaled independently.

There is no settled conclusion on which of the two approaches is better, but the mainstream trend is toward shared disk/storage. With this approach, several questions are worth studying: How does remote storage perform? How can local storage be put to use? How does network performance affect the system as a whole? How are resources allocated dynamically? How are scale-out and scale-in implemented?
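
A small sizing calculation shows why independent scaling matters. The sketch below assumes a simplified model in which a shared-nothing node bundles a fixed amount of storage and CPU, while a shared-storage system buys storage by capacity and compute by node; the node sizes and workload figures are hypothetical.

```python
import math

def shared_nothing_nodes(data_tb, cpu_cores, tb_per_node, cores_per_node):
    """Nodes bundle storage and compute, so the larger of the two demands sets the node count."""
    return max(math.ceil(data_tb / tb_per_node), math.ceil(cpu_cores / cores_per_node))

def shared_storage_plan(data_tb, cpu_cores, cores_per_node):
    """Storage is bought by capacity; compute nodes are sized by CPU demand alone."""
    return {"storage_tb": data_tb, "compute_nodes": math.ceil(cpu_cores / cores_per_node)}

if __name__ == "__main__":
    # Hypothetical workload: 200 TB of data, but only 32 cores of query load.
    print(shared_nothing_nodes(200, 32, tb_per_node=10, cores_per_node=16))
    print(shared_storage_plan(200, 32, cores_per_node=16))
```

In this example the coupled design needs 20 nodes just to hold the data, although 2 nodes would cover the compute demand; the decoupled design provisions 200 TB of storage and 2 compute nodes.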

IV. Typical cloud services

4.1 Amazon (AWS) Redshift

Redshift is a typical shared-nothing design with locally attached storage. It makes full use of AWS basic services, with EC2 as compute nodes and S3 for storage and fault recovery. Its advantage is outstanding performance achieved through tuning and customization, but its architecture also means that compute and storage cannot be scaled independently.

It supports loading data from multiple data sources and integrating streaming data, but it supports only structured data. Data on S3 can be queried directly without ETL. It supports a PostgreSQL dialect, though some data types and functions are not supported. Redshift itself monitors the health of its components and recovers automatically; the user is responsible for other maintenance work, and day-to-day operation and maintenance is done manually on the console.

4.2 Snowflake

Snowflake is a shared-storage design that separates storage from compute. It is built on AWS and makes full use of AWS basic services: EC2 instances serve as compute nodes and support local caching, while table data is stored in S3. It introduces the concept of the "virtual warehouse". Queries can be assigned to different virtual warehouses, and each warehouse can be allocated different resources. Warehouses do not affect one another's performance, and each warehouse is highly elastic and can automatically provision additional computing resources.

It supports structured and semi-structured data, which can be ingested without ETL or preprocessing. Streaming data was not supported at first, but you can connect to Spark to receive streaming data. It uses standard SQL with some extensions. Maintenance is relatively simple: there is no need to maintain indexes, clean up data and so on.

4.3 Microsoft Azure SQL Data Warehouse

SQL Data Warehouse (SDW) uses a shared-storage design. It is based on Microsoft's SQL Server PDW software and takes advantage of the elasticity of Azure storage. It is fully compatible with T-SQL, its resources can be adjusted dynamically, and PolyBase lets it query data without loading it first.

4.4 Google BigQuery

BigQuery separates storage from compute. It builds on Google's basic services, with data stored in Colossus FS. It works by converting the SQL query into low-level instructions, which are then executed in sequence. It completely abstracts away the provisioning, allocation, maintenance, scale-out and scale-in of resources, all of which Google handles automatically. This makes it well suited to scenarios where ease of use is the top priority. Storage is sharded automatically according to processing scale, load and so on. Computing resources are not dedicated; they are shared among internal and external customers. You cannot explicitly control the resources used by a single query. Billing is by usage (TB "processed").

In use, it supports standard SQL, semi-structured data types and external tables. Data can be loaded or accessed directly from Google Cloud, and data streams can also be imported. It has no indexes and requires little maintenance apart from data management.
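
Billing by the amount of data processed lends itself to a quick estimate. The sketch below is a worked calculation under stated assumptions: the price per TB is a hypothetical input rather than a quoted price, and a TB is taken as 10**12 bytes, which may differ from the provider's own definition.

```python
TB = 10 ** 12  # assumption: decimal terabyte; check the provider's definition

def query_cost(bytes_processed, price_per_tb):
    """Cost of one query when billing is by data processed; price_per_tb is hypothetical."""
    return bytes_processed / TB * price_per_tb

def monthly_query_cost(queries_per_day, avg_bytes_per_query, price_per_tb, days=30):
    """Rough monthly bill for a steady workload of similar queries."""
    return queries_per_day * days * query_cost(avg_bytes_per_query, price_per_tb)

if __name__ == "__main__":
    # Hypothetical workload: 100 queries a day, each processing about 50 GB.
    print(query_cost(2 * TB, price_per_tb=5.0))
    print(monthly_query_cost(100, 50 * 10 ** 9, price_per_tb=5.0))
```

Because the bill follows the bytes each query processes rather than the size of a provisioned cluster, reducing the data a query touches directly reduces cost under this model.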
