This article explains what Cloudera Virtual Private Clusters (VPC) and SDX are, and how they change the way compute and storage are deployed in CDH.
1 Overview
A Virtual Private Cluster (VPC) uses the Cloudera Shared Data Experience (SDX) to simplify the deployment of both on-premises and cloud-based applications and to let workloads running in different clusters share data securely and flexibly. This architecture brings many advantages for deploying workloads and sharing data between applications, including shared metadata, unified security, consistent data governance, and data lifecycle management.
In a traditional CDH deployment, a single cluster contains storage nodes, compute nodes, and other services such as metadata and security services. That architecture has advantages of its own; for example, Impala and YARN can access the same data sources, such as HDFS or Hive.
With the VPC and SDX frameworks, CDH 6.2 introduces a new type of cluster called a Compute cluster. A Compute cluster runs computing services such as Impala, Hive Execution Service, Spark, or YARN, and is configured to access data in a regular CDH cluster, referred to as the Base cluster. With this architecture, compute and storage are separated, which improves overall resource utilization.
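To make the separation concrete, here is a minimal sketch (not from the original article) of a PySpark job submitted to a Compute cluster that reads and writes data held on the Base cluster's HDFS. The NameNode hostname, port, and paths are hypothetical placeholders.

```python
# Minimal sketch: a Spark job running on a Compute cluster that reads data
# stored on the Base cluster's HDFS. Hostnames and paths are assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("compute-cluster-etl-sketch")
    .getOrCreate()
)

# Data lives on the Base cluster; the Compute cluster reaches it over the
# network using a fully qualified HDFS URI (NameNode host/port are assumed).
orders = spark.read.parquet(
    "hdfs://base-cluster-nn.example.com:8020/data/warehouse/orders"
)

# The computation itself runs on the Compute cluster's Spark executors.
daily_totals = orders.groupBy("order_date").sum("amount")

# Results can be written back to the Base cluster's storage.
daily_totals.write.mode("overwrite").parquet(
    "hdfs://base-cluster-nn.example.com:8020/data/warehouse/daily_totals"
)

spark.stop()
```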
2 Advantages of Separating Storage and Compute
Separating storage and compute brings several advantages to a CDH deployment:
1. More options for deploying compute and storage resources
A) You can deploy resources to on-premises servers, containers, virtual machines, or the cloud, depending on which environment suits the workload. Compute clusters can be provisioned with hardware that favors computation, while Base clusters can use hardware with larger storage capacity. Cloudera recommends that the hosts within each cluster use similar hardware.
B) Software resources can be tuned to make the best use of the compute and storage hardware.
2. Temporary clusters
When clusters are deployed on cloud infrastructure, separating storage and compute lets you shut down Compute clusters when they are not needed, avoiding unnecessary cost while the data remains available to other applications (see the sketch after this list).
3. Workload isolation
A) Compute clusters reduce resource contention among users. Long-running or resource-intensive workloads can be isolated on a dedicated Compute cluster so that they do not affect other workloads.
B) Resources can be grouped by cluster, allowing IT to charge back the teams that use each cluster based on the resources they consume.
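Shutting down a temporary Compute cluster can be automated. The following is a minimal sketch, assuming a Cloudera Manager server at a hypothetical hostname and a Compute cluster named "compute1"; the API version and the cluster stop endpoint should be verified against your Cloudera Manager release before use.

```python
# Sketch only: stopping a Compute cluster via the Cloudera Manager REST API
# so it stops incurring cost while idle. Hostname, credentials, cluster name,
# and API version are placeholders; verify the endpoint for your CM release.
import requests

CM_HOST = "https://cm.example.com:7183"   # assumed Cloudera Manager URL
API_VERSION = "v33"                        # assumed; query /api/version to confirm
CLUSTER = "compute1"                       # assumed Compute cluster name

resp = requests.post(
    f"{CM_HOST}/api/{API_VERSION}/clusters/{CLUSTER}/commands/stop",
    auth=("admin", "admin-password"),      # placeholder credentials
    verify=False,                          # sketch only; use proper TLS in practice
)
resp.raise_for_status()
print("Stop command id:", resp.json().get("id"))
```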
3 Architecture
A Compute cluster is configured with compute services such as YARN, Spark, Hive Execution, or Impala. Workloads running on these clusters access data through a Data Context connected to the Base cluster. The Data Context is the connector to the Base cluster: it defines the data, metadata, and security services required to access the data stored on the Base cluster. Both the Compute and the Base cluster are managed by the same Cloudera Manager. The Base cluster must run the HDFS service and can also contain any other CDH services, but only HDFS, Hive, Sentry, Amazon S3, and Microsoft ADLS can be shared through the Data Context.
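In an actual VPC deployment, Cloudera Manager wires the shared services into the Compute cluster automatically through the Data Context. The sketch below is only an illustration of what that sharing amounts to: a Spark session on the Compute cluster resolving table metadata from the Base cluster's Hive metastore. The thrift URI, database, and table names are hypothetical.

```python
# Illustrative sketch: a Spark session on a Compute cluster that resolves table
# metadata from the Base cluster's shared Hive metastore. In a real VPC the
# Data Context supplies this configuration; the URI and table are assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("shared-metastore-sketch")
    # The Hive metastore runs on the Base cluster and is shared via the Data Context.
    .config("hive.metastore.uris", "thrift://base-cluster-hms.example.com:9083")
    .enableHiveSupport()
    .getOrCreate()
)

# Table definitions live in the shared metastore; the data files live on the
# Base cluster's HDFS, but the query executes on the Compute cluster.
spark.sql(
    "SELECT order_date, SUM(amount) FROM sales.orders GROUP BY order_date"
).show()

spark.stop()
```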
A Compute cluster requires an HDFS service to hold the temporary files used by multi-stage MapReduce jobs. In addition, deploy the following services as needed:
Hive Execution Service (this service provides only the HiveServer2 role)
Hue
Impala
Spark2
Oozie (Hue depends on this service)
YARN
HDFS (required)
The functionality available in a VPC is a subset of the features of a regular cluster, and only certain CDH versions are supported.
4 Performance Trade-offs
4.1 Throughput
Because data is accessed over the network between clusters, this architecture is not suitable for workloads that scan large amounts of data. Such workloads perform better on a regular cluster, where storage and compute are not separated and optimizations such as Impala short-circuit reads yield better performance.
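For comparison, on a regular (co-located) cluster HDFS short-circuit local reads let a client such as Impala or Spark read blocks directly from local disk instead of streaming them from the DataNode over TCP. The sketch below assumes short-circuit reads are also enabled on the DataNodes and uses a commonly seen domain socket path, which may differ in your deployment.

```python
# Minimal sketch, for a regular (co-located) cluster rather than a VPC:
# enabling HDFS short-circuit local reads on the client side of a Spark job.
# The same properties must also be set in hdfs-site.xml on the DataNodes;
# the domain socket path is a common default but is an assumption here.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("short-circuit-read-sketch")
    # spark.hadoop.* entries are passed through to the Hadoop Configuration.
    .config("spark.hadoop.dfs.client.read.shortcircuit", "true")
    .config("spark.hadoop.dfs.domain.socket.path", "/var/run/hdfs-sockets/dn")
    .getOrCreate()
)

# When executors run on the same hosts as the DataNodes, local blocks are read
# directly from disk; on a Compute cluster every read crosses the network.
df = spark.read.parquet("hdfs:///data/warehouse/orders")  # hypothetical path
print(df.count())

spark.stop()
```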
4.2 Temporary clusters
When a Compute cluster is shut down or suspended because it is not needed, services that collect history do not gather data while the cluster is offline, and users cannot access that history. This affects services such as the Spark History Server and the YARN JobHistory Server. Once the Compute cluster is restarted, the earlier history is accessible again.
4.3 Data governance and metadata in Compute clusters
In an environment with one Base cluster and multiple Compute clusters, Navigator provides data governance and metadata services for the Base cluster only; it does not extract metadata or audit events from transient Compute clusters. Navigator can still track metadata and audit events if operations on the Compute cluster run against services and data on the Base cluster and are executed under a controlled service account.
Because audit events for services running on the Compute cluster are not collected, if you need to audit user activity, ensure that workloads on the Compute cluster are executed by service users and strictly control access to those service accounts.
No metadata is collected for services running on a Compute cluster. To ensure that asset and operational metadata is collected in your environment, include the relevant services in the Data Context.
That concludes this overview of Cloudera Virtual Private Clusters and SDX; I hope it has been helpful.