
NVIDIA's CV-CUDA High-Performance Image Processing Acceleration Library Released in Alpha and Open-Sourced on GitHub


CTOnews.com reported on December 21 that NVIDIA has released the Alpha version of CV-CUDA (Computer Vision - Compute Unified Device Architecture), a high-performance image processing acceleration library, and officially open-sourced it to developers worldwide. It can be downloaded and tried out on GitHub.

CV-CUDA is an open-source project that uses the GPU to accelerate the construction of efficient pre- and post-processing steps in AI imaging and computer vision (CV) pipelines. It was jointly developed in its early stages by NVIDIA and the machine learning team at ByteDance.

With the rise of short-video apps, video conferencing platforms and VR/AR technology, video and images have gradually become the main components of global Internet traffic. Many of the video images we encounter every day have already been processed and enhanced by AI and computer vision (CV) algorithms. However, with the rapid growth of social media and video-sharing services, the video and image processing that underpins AI imaging algorithms has become a cost and a bottleneck that can no longer be ignored in the computing pipeline. A few common image processing examples help illustrate CV-CUDA's application scenarios.

(1) AI image background blurring

Figure 1. AI background blurring (CPU pre- and post-processing scheme)

Image background blurring is commonly used in video conferencing, beauty retouching and similar scenarios, where the AI algorithm is expected to blur everything outside the human subject, both to protect users' privacy and to beautify the image. The background-blurring pipeline can be divided into three stages: pre-processing, the DNN network, and post-processing. Pre-processing usually includes operations such as Resize, Padding and Image2Tensor; the DNN network can be a common segmentation network such as U-Net; post-processing usually includes operations such as Tensor2Mask, Crop, Resize and Denoise.
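For concreteness, the three stages map naturally onto ordinary library calls. The sketch below is only an illustration of the CPU pre-/post-processing flow described above, using OpenCV and PyTorch; the segmentation model, input size and blur parameters are placeholder assumptions, and none of this is CV-CUDA code.

```python
import cv2
import numpy as np
import torch

def preprocess(frame_bgr, size=512):
    # Resize + Padding to a square network input, then Image2Tensor (HWC uint8 -> NCHW float)
    h, w = frame_bgr.shape[:2]
    scale = size / max(h, w)
    resized = cv2.resize(frame_bgr, (int(w * scale), int(h * scale)))
    padded = np.zeros((size, size, 3), dtype=np.uint8)
    padded[:resized.shape[0], :resized.shape[1]] = resized
    tensor = torch.from_numpy(padded).permute(2, 0, 1).float().unsqueeze(0) / 255.0
    return tensor, scale

def postprocess(mask_tensor, frame_bgr, scale):
    # Tensor2Mask -> Crop away the padded region -> Resize back -> Denoise -> composite
    h, w = frame_bgr.shape[:2]
    mask = mask_tensor.squeeze().cpu().numpy()
    mask = mask[: int(h * scale), : int(w * scale)]
    mask = cv2.resize(mask, (w, h))
    mask = cv2.GaussianBlur(mask, (5, 5), 0)
    blurred = cv2.GaussianBlur(frame_bgr, (31, 31), 0)
    mask3 = mask[..., None]
    return (frame_bgr * mask3 + blurred * (1 - mask3)).astype(np.uint8)

# inp, scale = preprocess(frame)                     # CPU
# mask = torch.sigmoid(seg_model(inp.cuda())).cpu()  # DNN on GPU (e.g. a U-Net-style model)
# out = postprocess(mask, frame, scale)              # CPU
```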

In the traditional image processing pipeline, the pre- and post-processing stages are usually run on the CPU. As a result, roughly 90% of the total time of the background-blurring workload is spent on pre- and post-processing, which becomes the bottleneck of the whole algorithm pipeline. If pre- and post-processing could take full advantage of GPU acceleration, overall computing performance would improve greatly.

Figure 2. AI background blurring (GPU pre- and post-processing scheme)

When the pre- and post-processing stages are moved onto the GPU, the entire pipeline can be accelerated end to end. According to tests on a single GPU, porting the whole pipeline to the GPU improves throughput by more than 20 times compared with the traditional CPU-based approach, which translates into a substantial saving in compute cost.
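As a rough sketch of what moving pre- and post-processing onto the GPU means, the same Resize/Padding/compositing steps can be expressed as GPU tensor operations so the frame never leaves device memory. The code below uses plain PyTorch to illustrate the idea and is not CV-CUDA's API; the blur is a cheap stand-in.

```python
import torch
import torch.nn.functional as F

def preprocess_gpu(frame_u8, size=512):
    # frame_u8: HWC uint8 tensor already resident on the GPU
    h, w = frame_u8.shape[:2]
    x = frame_u8.permute(2, 0, 1).float().unsqueeze(0) / 255.0        # Image2Tensor
    scale = size / max(h, w)
    x = F.interpolate(x, scale_factor=scale, mode="bilinear",
                      align_corners=False)                            # Resize
    x = F.pad(x, (0, size - x.shape[3], 0, size - x.shape[2]))        # Padding
    return x, scale

def postprocess_gpu(mask, frame_u8, scale):
    # mask: 1x1xSxS tensor on the GPU; returns the composited frame, still on the GPU
    h, w = frame_u8.shape[:2]
    mask = mask[:, :, : int(h * scale), : int(w * scale)]             # Crop
    mask = F.interpolate(mask, size=(h, w), mode="bilinear",
                         align_corners=False)                         # Resize back
    frame = frame_u8.permute(2, 0, 1).float().unsqueeze(0)
    blurred = F.avg_pool2d(frame, 31, stride=1, padding=15)           # stand-in blur
    out = frame * mask + blurred * (1 - mask)
    return out.squeeze(0).permute(1, 2, 0).clamp(0, 255).byte()
```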

(2) AI image classification

Figure 3. AI image classification

Image classification is one of the most common AI imaging tasks and underlies almost all AI image algorithms; it is typically used for object recognition, image search and similar scenarios. The image classification pipeline can be divided into two parts: pre-processing and the DNN. During both training and inference, the four most common pre-processing operations are image decoding, Resize, Padding and Normalize. The DNN part is already GPU-accelerated, while pre-processing is usually handled by library functions running on the CPU. If pre-processing could also be ported to the GPU, it would free up CPU resources on the one hand and further improve GPU utilization on the other, so that the whole pipeline can be accelerated.
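The CPU-side pre-processing the article refers to is the familiar decode / Resize / Crop-or-Pad / Normalize chain. A typical CPU implementation with torchvision looks roughly like the following; only the DNN stage runs on the GPU, which is exactly the imbalance CV-CUDA targets.

```python
import torch
from PIL import Image
from torchvision import transforms

# Classic CPU pre-processing for classification: decode -> resize -> crop -> normalize
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

img = Image.open("example.jpg")           # image decoding on the CPU
batch = preprocess(img).unsqueeze(0)      # all of this work stays on the CPU
# logits = model(batch.cuda())            # only the DNN stage is GPU-accelerated
```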

For the pre- and post-processing stages, the mainstream options today include image processing libraries such as OpenCV, and the torchvision image processing library used when training models with the PyTorch framework.

As mentioned above, traditional image pre-processing is generally carried out on the CPU. On the one hand this consumes a lot of CPU resources and unbalances the load between the CPU and the GPU; on the other hand, because CPU-based image libraries do not support batch operations, pre-processing efficiency is low. To address these problems with the current mainstream image processing libraries, NVIDIA and the ByteDance machine learning team jointly developed CV-CUDA, a GPU-based image processing acceleration library with the following characteristics:

(1) Batch

Batch operations are supported, making full use of the GPU's high-concurrency, high-throughput parallelism to improve computing efficiency and throughput.
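To see why batching matters, compare a per-image loop with one batched call: a single kernel launch over a whole NCHW batch keeps the GPU busy, whereas many small per-image operations leave it underutilized. The comparison below uses generic PyTorch operations purely for illustration.

```python
import torch
import torch.nn.functional as F

imgs = torch.rand(16, 3, 1080, 1920, device="cuda")   # 16 decoded frames, NCHW

# Per-image processing: 16 small kernel launches, poor GPU utilization
resized_loop = [F.interpolate(img.unsqueeze(0), size=(224, 224),
                              mode="bilinear", align_corners=False)
                for img in imgs]

# Batched processing: one launch covers the whole batch
resized_batch = F.interpolate(imgs, size=(224, 224),
                              mode="bilinear", align_corners=False)
```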

(2) Variable Shape

Images of different sizes are supported within the same batch, which ensures flexibility of use. In addition, different parameters can be specified for each image in a batch: for example, when calling the RotateVarShape operator, a different rotation angle can be specified for each image.
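The sketch below shows roughly what a var-shape call could look like from Python. The class and function names (ImageBatchVarShape, as_image, rotate) and their signatures are assumptions based on the structures the article mentions and should be checked against the CV-CUDA documentation; the point is simply that each image in the batch keeps its own size and gets its own rotation angle.

```python
# Hypothetical sketch only: names and signatures below are assumptions,
# not verbatim CV-CUDA API.
import cvcuda
import torch

# A var-shape batch holds images of different sizes together
batch = cvcuda.ImageBatchVarShape(3)                     # assumed constructor
for h, w in [(480, 640), (720, 1280), (1080, 1920)]:
    frame = torch.zeros(h, w, 3, dtype=torch.uint8, device="cuda")
    batch.pushback(cvcuda.as_image(frame))               # assumed conversion helper

# One rotation angle (and shift) per image in the batch
angles = torch.tensor([15.0, 30.0, 45.0], device="cuda")
shifts = torch.zeros(3, 2, device="cuda")
# rotated = cvcuda.rotate(batch, angles, shifts, cvcuda.Interp.LINEAR)  # assumed call
```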

(3) C/C++/Python interfaces

When deploying machine learning algorithms, the training and inference pipelines need to be aligned. Generally speaking, Python is used for rapid iteration during training, while C++ is used for high-performance deployment at inference time. However, some image processing libraries only support Python, which makes deployment very inconvenient: if different image processing libraries are used for training and inference, the processing logic has to be re-implemented on the inference side, and the process becomes very tedious.

CV-CUDA provides C, C++ and Python interfaces, so it can serve both training and inference scenarios. Moving from training to inference avoids the tedious alignment work and improves deployment efficiency.

(4) Independent operator design

As a basic image processing library, CV-CUDA adopts an independent-operator design and does not require a pipeline to be defined in advance. Independent operators offer greater flexibility, make debugging easier, and allow CV-CUDA to interoperate with other image processing code or be integrated into the user's own higher-level image processing framework.
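In practice, an independent-operator design means each step is an ordinary function call that the user composes directly, so a custom pipeline is just straight-line code that can be stepped through in a debugger. The toy example below illustrates the composition style with generic tensor operations rather than CV-CUDA's own operator names.

```python
import torch
import torch.nn.functional as F

def my_pipeline(frames):                      # frames: NCHW uint8 batch on the GPU
    x = frames.float() / 255.0                # each line is an independent operator call,
    x = F.interpolate(x, size=(224, 224),     # so steps can be reordered, removed or
                      mode="bilinear",        # inspected without a pipeline definition
                      align_corners=False)
    x = (x - 0.5) / 0.5
    return x
```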

(5) Results aligned with OpenCV

Because some operators are implemented differently across image processing libraries, their results are difficult to align. For example, the common Resize operation is implemented differently in OpenCV on the CPU, OpenCV's GPU module and torchvision, and the results differ. Therefore, if the OpenCV CPU version is used during training while the GPU version or another image processing library is used during inference, the results will not match.

This was taken into account from the start of the design: since many users of current image processing libraries are accustomed to the CPU version of OpenCV, CV-CUDA's operators are aligned with OpenCV's CPU operators as closely as possible, both in function parameters and in processing results. Users migrating from OpenCV to CV-CUDA therefore only need minor changes, the image processing results match those of OpenCV, and the model does not need to be retrained.
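The alignment problem is easy to reproduce: resizing the same image with OpenCV's CPU bilinear resize and with another library's bilinear resize generally does not match bit-for-bit, which is precisely the train/inference mismatch described above. The check below is illustrative only.

```python
import cv2
import numpy as np
import torch
import torch.nn.functional as F

img = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)

# OpenCV CPU bilinear resize
a = cv2.resize(img, (224, 224), interpolation=cv2.INTER_LINEAR).astype(np.float32)

# PyTorch bilinear resize of the same image
t = torch.from_numpy(img).permute(2, 0, 1).float().unsqueeze(0)
b = F.interpolate(t, size=(224, 224), mode="bilinear", align_corners=False)
b = b.squeeze(0).permute(1, 2, 0).numpy()

# The two results generally differ slightly, which can be enough
# to shift a trained model's predictions.
print("max abs difference:", np.abs(a - b).max())
```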

(6) Ease of use

CV-CUDA provides structures such as Image and ImageBatchVarShape for convenient use. It also provides an Allocator class, allowing users to customize the GPU memory allocation strategy (for example, a memory-pool strategy to speed up allocation), which makes it easier for upper-level frameworks to integrate and manage resources. CV-CUDA currently provides data conversion interfaces for PyTorch, OpenCV and Pillow, making it easy to swap in its operators and mix different image libraries.
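The interop point is that a GPU buffer produced by another framework can be handed to CV-CUDA without copying. The snippet below assumes an as_tensor-style wrapper in the Python binding; treat the exact name and layout argument as assumptions to verify against the current API.

```python
# Sketch of zero-copy interop; the as_tensor name and "NHWC" layout string
# are assumptions to be checked against the CV-CUDA Python API.
import cvcuda
import torch

frames = torch.zeros(8, 720, 1280, 3, dtype=torch.uint8, device="cuda")  # NHWC batch
cv_tensor = cvcuda.as_tensor(frames, "NHWC")   # intended to wrap the same GPU memory, no copy
# ...run CV-CUDA operators on cv_tensor, then hand the memory back to PyTorch
```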

(7) Highly optimized performance for different GPU architectures

CV-CUDA supports GPU architectures such as Volta, Turing and Ampere. Its CUDA kernels are highly optimized for the characteristics of each architecture, enabling large-scale deployment in cloud service scenarios.

CTOnews.com has learned that the CV-CUDA Beta version is expected to be released in March 2023 and the official version v1.0 in June.

For more information on CV-CUDA: click this link.
