
How to use GPU acceleration in Spark 3.0


Today, I would like to talk to you about how to use GPU acceleration in Spark 3.0. Many people may not know much about it, so to help you understand it better, the editor has summarized the following. I hope you can get something out of this article.

Overview

RAPIDS Accelerator for Apache Spark uses GPUs to accelerate data processing, which is achieved through the RAPIDS libraries.

When data scientists shift from traditional data analysis to AI applications to meet the needs of complex markets, traditional CPU-based processing can no longer meet their speed and cost requirements. The rapid growth of AI analytics calls for a new framework that processes data quickly and keeps costs down, and GPUs make that possible.

RAPIDS Accelerator for Apache Spark integrates the RAPIDS cuDF library with the Spark distributed computing framework. The RAPIDS Accelerator library also provides a built-in accelerated shuffle based on UCX, which can be configured for GPU-to-GPU communication and RDMA.

Spark RAPIDS download v0.4.1

RAPIDS Spark Package

cuDF Package (CUDA 11.0)

cuDF Package (CUDA 10.2)

cuDF Package (CUDA 10.1)

RAPIDS Notebooks

cuML Notebooks

cuGraph Notebooks

CLX Notebooks

cuSpatial Notebooks

cuxfilter Notebooks

XGBoost Notebooks

Introduction

These notebooks provide examples of using RAPIDS. They are designed to be self-contained when run in the runtime versions of the RAPIDS Docker containers and the RAPIDS nightly Docker containers, and they can run on air-gapped systems. You can quickly get a container, then install and use it by following the RAPIDS.ai Getting Started page.

Usage

To get the latest notebook repo updates, run ./update.sh or use the command:

git submodule update --init --remote --no-single-branch --depth 1

Download the CUDA Installer for Linux Ubuntu 20.04 x86_64

The basic installation is as follows:

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/11.1.0/local_installers/cuda-repo-ubuntu2004-11-1-local_11.1.0-455.23.05-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2004-11-1-local_11.1.0-455.23.05-1_amd64.deb
sudo apt-key add /var/cuda-repo-ubuntu2004-11-1-local/7fa2af80.pub
sudo apt-get update
sudo apt-get -y install cuda

The CUDA Toolkit contains open source software, whose source code can be found here.

You can find checksums for installers and patches in Installer Checksums.

Performance and cost benefits

The RAPIDS Accelerator for Apache Spark benefits from GPU performance while reducing cost. For example: ETL on the Fannie Mae mortgage dataset (~200 GB), as shown in our demo. Costs are based on cloud T4 GPU instance market prices and V100 GPU prices on the Databricks Standard edition.

Easy to use

No code changes are required to run existing Apache Spark applications. Start Spark with the RAPIDS Accelerator for Apache Spark plugin jar and enable the configuration, as follows:

spark.conf.set('spark.rapids.sql.enabled', 'true')

The physical plan then shows the operators running on the GPU.
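As a quick check, the hedged sketch below enables the plugin and prints a query's physical plan; operations the plugin supports typically appear with GPU-specific operator names instead of their CPU counterparts. The input path and column names here are illustrative assumptions, not part of the original demo.

# Hedged sketch: verify GPU operators in the physical plan.
# The parquet path and column names are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rapids-plan-check").getOrCreate()
spark.conf.set('spark.rapids.sql.enabled', 'true')

df = spark.read.parquet("/data/mortgage")                      # hypothetical input
result = df.filter(df["loan_age"] > 12).groupBy("state").count()
result.explain()   # supported steps show up as Gpu* operators in the printed plan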

A unified AI framework for ETL + ML/DL

Single pipeline, from data preparation to model training:
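As a rough illustration of such a pipeline, the sketch below prepares features with Spark DataFrame operations (the part the RAPIDS plugin can accelerate) and then trains a distributed XGBoost model on the same DataFrame. The input path, column names, and the use of the xgboost.spark estimator from recent XGBoost releases are assumptions for illustration; NVIDIA's own demos use the RAPIDS XGBoost4J-Spark packages.

# Hedged ETL -> training sketch; paths, columns, and the estimator choice are
# illustrative assumptions rather than the exact demo code.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml.feature import VectorAssembler
from xgboost.spark import SparkXGBClassifier   # available in recent XGBoost releases

spark = SparkSession.builder.appName("etl-plus-training").getOrCreate()

# Data preparation on the (potentially GPU-accelerated) DataFrame path.
raw = spark.read.parquet("/data/mortgage")                     # hypothetical input
feats = (raw.dropna(subset=["loan_age", "interest_rate", "delinquent"])
            .withColumn("rate_x_age", F.col("interest_rate") * F.col("loan_age")))

assembler = VectorAssembler(
    inputCols=["loan_age", "interest_rate", "rate_x_age"],
    outputCol="features")
train_df = assembler.transform(feats).select("features", "delinquent")

# Distributed model training on the same Spark DataFrame.
clf = SparkXGBClassifier(features_col="features", label_col="delinquent",
                         num_workers=2)
model = clf.fit(train_df)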

Start using RAPIDS Accelerator for Apache Spark

Apache Spark 3.0+ provides users with a plugin that replaces the backend for SQL and DataFrame operations. No API changes are required; supported SQL operations are replaced with GPU-accelerated versions. If an operation does not support GPU acceleration, the Spark CPU version is used instead.

⚠️ Note that the plugin does not accelerate operations performed directly on RDDs.
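To make the distinction concrete, here is a small hedged example (the path and column names are assumed): the DataFrame aggregation is a candidate for GPU execution, while the equivalent work expressed through the RDD API stays on the CPU.

# Hedged sketch: DataFrame/SQL operations can be offloaded to the GPU,
# direct RDD operations cannot. Path and column names are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-vs-dataframe").getOrCreate()
df = spark.read.parquet("/data/events")                        # hypothetical input

gpu_candidate = df.groupBy("user_id").count()                  # may run on the GPU
cpu_only = df.rdd.map(lambda row: (row["user_id"], 1))         # always runs on the CPU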

The accelerator library also provides an implementation of Spark's shuffle that can optimize data transfers for the GPU, keeping as much data on the GPU as possible and bypassing the CPU by doing GPU-to-GPU transfers over UCX.

The GPU-accelerated shuffle implementation does not require the plugin's SQL acceleration to be enabled. However, if accelerated SQL processing is not turned on, the shuffle implementation falls back to the default SortShuffleManager.
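As a hedged configuration sketch, the accelerated shuffle is typically enabled by swapping in the RAPIDS shuffle manager class; the exact class name is tied to your Spark release (the spark300 segment below assumes Spark 3.0.0), and additional UCX settings may be needed, so check the plugin documentation for your version.

# Hedged sketch: the shuffle-manager class name is version-specific and UCX
# transports may need extra environment settings; values below are assumptions.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("rapids-ucx-shuffle")
         .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
         .config("spark.rapids.sql.enabled", "true")
         .config("spark.shuffle.manager",
                 "com.nvidia.spark.rapids.spark300.RapidsShuffleManager")
         .getOrCreate())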

To enable GPU processing acceleration, you need:

Apache Spark 3.0+

A Spark cluster configured with GPUs that meet the requirements of the cudf version.

One GPU per executor.

The following jars:

A cudf jar that corresponds to the version of CUDA available on your cluster.

RAPIDS Spark accelerator plugin jar.

The config spark.plugins set to com.nvidia.spark.SQLPlugin (see the launch sketch below).
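Putting the requirements together, a minimal launch sketch might look like the following; the jar paths and file names are placeholders and should be replaced with the cudf jar matching your CUDA version and the RAPIDS plugin jar you downloaded.

# Hedged launch sketch: jar paths/names are placeholders, not real file names.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("rapids-enabled-app")
         .config("spark.jars", "/opt/jars/cudf.jar,/opt/jars/rapids-4-spark.jar")
         .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
         .config("spark.rapids.sql.enabled", "true")
         .config("spark.executor.resource.gpu.amount", "1")    # one GPU per executor
         .getOrCreate())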

Overview of Spark GPU scheduling

Apache Spark 3.0 now supports GPU scheduling, provided the cluster manager does. You can have Spark request GPUs and assign them to tasks. The exact configuration depends on the cluster manager. Here are some examples:

Request your executor to have GPUs:

--conf spark.executor.resource.gpu.amount=1

Specify the number of GPUs per task:

--conf spark.task.resource.gpu.amount=1

Specify a GPU discovery script (required on YARN and K8S):

--conf spark.executor.resource.gpu.discoveryScript=./getGpusResources.sh
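For reference, a discovery script only has to print a JSON description of the GPUs on a host; Spark ships a bash example (getGpusResources.sh), and the hedged Python sketch below does the equivalent, assuming nvidia-smi is installed.

#!/usr/bin/env python3
# Hedged sketch of a GPU discovery script: it must print a JSON object with the
# resource name and the list of GPU addresses. Assumes nvidia-smi is available.
import json
import subprocess

out = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=index", "--format=csv,noheader"], text=True)
addresses = [line.strip() for line in out.splitlines() if line.strip()]

print(json.dumps({"name": "gpu", "addresses": addresses}))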

Review the documentation for your specific deployment to understand the exact method and its limitations.

Note that spark.task.resource.gpu.amount can be a decimal. If you want multiple tasks to run on an executor at the same time while being assigned to the same GPU, set it to a decimal value less than 1 that corresponds to the spark.executor.cores setting. For example, spark.executor.cores=2 allows 2 tasks to run in each executor; if you want those 2 tasks to share the same GPU, set spark.task.resource.gpu.amount=0.5.
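A hedged sketch of that exact pairing, using the values from the paragraph above:

# Two cores per executor, each task claiming half a GPU, so two tasks share one GPU.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("two-tasks-per-gpu")
         .config("spark.executor.cores", "2")
         .config("spark.executor.resource.gpu.amount", "1")
         .config("spark.task.resource.gpu.amount", "0.5")
         .getOrCreate())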

After reading the above, do you have a better understanding of how to use GPU acceleration in Spark 3.0? If you want to learn more, please follow the industry information channel. Thank you for your support.
