Setting up CUDA tools properly

CUDA is a computing platform for graphical processing units (GPUs) developed by NVIDIA and widely used to accelerate machine learning. Popular frameworks such as TensorFlow or PyTorch use it under the hood, without requiring any CUDA-specific coding from the user. However, its dependencies, particularly the compiler nvcc, still have to be set up properly to benefit from the acceleration. In this short note, I share an interesting use case that occurred when prototyping with the Kaggle and NVIDIA Docker images.

Compatibility of CUDA tools and targeted libraries

It turns out that one of the Kaggle images was released with incompatible CUDA dependencies: the compilation tools were not aligned with PyTorch, as revealed by an attempt to compile detectron2, Facebook's object detection library.

(base) maciej.skorski@shared-notebooks:~$ docker images
REPOSITORY                        TAG                        IMAGE ID       CREATED        SIZE
gcr.io/kaggle-gpu-images/python   latest                     87983e20c290   4 weeks ago    48.1GB
nvidia/cuda                       11.6.2-devel-ubuntu20.04   e1687ea9fbf2   7 weeks ago    5.75GB
gcr.io/kaggle-gpu-images/python   <none>                     2b12fe42f372   2 months ago   50.2GB

(base) maciej.skorski@shared-notebooks:~$ docker run -d \
  -it \
  --name kaggle-test \
  --runtime=nvidia \
  --mount type=bind,source=/home/maciej.skorski,target=/home \
  2b12fe42f372

(base) maciej.skorski@shared-notebooks:~$ docker exec -it kaggle-test python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'
...
      RuntimeError:
      The detected CUDA version (12.1) mismatches the version that was used to compile
      PyTorch (11.8). Please make sure to use the same CUDA versions.
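
As a sanity check, the toolkit version that PyTorch was compiled against can be read inside the container from the torch.version.cuda attribute (here: 11.8, matching the error message):

docker exec -it kaggle-test python -c "import torch; print(torch.version.cuda)"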

In order to compile detectron2, it was necessary to align the CUDA toolkit version with the one PyTorch was built against. Rather than installing the toolkit manually, which is known to be an error-prone task, a working solution was to switch the Kaggle image: the gap was bridged in a subsequent release.

(base) maciej.skorski@shared-notebooks:~$ docker run 87983e20c290 nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
(base) maciej.skorski@shared-notebooks:~$ docker run 2b12fe42f372 nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0

And indeed, the Facebook library installed smoothly under the new image 👍

(base) maciej.skorski@shared-notebooks:~$ docker run -d \
   -it \
   --name kaggle-test \
   --runtime=nvidia \
   --mount type=bind,source=/home/maciej.skorski,target=/home \
   87983e20c290
bf60d0e3f3bdb42c5c08b24598bb3502b96ba2c461963d11b31c1fda85f9c26b
(base) maciej.skorski@shared-notebooks:~$ docker exec -it kaggle-test python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'
Collecting git+https://github.com/facebookresearch/detectron2.git
...
Successfully built detectron2 fvcore antlr4-python3-runtime pycocotools

Compatibility of CUDA tools and GPU drivers

The compiler version should not be newer than the CUDA version supported by the driver, as reported in the header of nvidia-smi:

(base) maciej.skorski@shared-notebooks:~$ nvidia-smi
Thu Aug 10 14:56:44 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   69C    P0    30W /  70W |  12262MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      8635      C   ...detectron_venv/bin/python    12260MiB |
+-----------------------------------------------------------------------------+
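
Incidentally, this compatibility can also be checked programmatically: the CUDA runtime API exposes cudaDriverGetVersion, which reports the highest CUDA version supported by the installed driver, and cudaRuntimeGetVersion, which reports the version the binary was built against. A minimal sketch (versions are encoded as 1000*major + 10*minor):

// version_check.cu

#include <stdio.h>

int main() {
  int driverVersion = 0, runtimeVersion = 0;
  cudaDriverGetVersion(&driverVersion);    // highest CUDA version supported by the driver
  cudaRuntimeGetVersion(&runtimeVersion);  // CUDA version of the linked runtime
  printf("Driver supports up to CUDA %d.%d\n",
         driverVersion / 1000, (driverVersion % 1000) / 10);
  printf("Runtime built against CUDA %d.%d\n",
         runtimeVersion / 1000, (runtimeVersion % 1000) / 10);
  return 0;
}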

Consider this simple CUDA script, which queries the GPU device properties:

// query_GPU.cu
// Queries and prints basic properties of all visible CUDA devices.

#include <stdio.h>

int main() {
  int nDevices;

  cudaGetDeviceCount(&nDevices);
  for (int i = 0; i < nDevices; i++) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, i);
    printf("Device Number: %d\n", i);
    printf("  Name: %s\n", prop.name);
    printf("  Integrated: %d\n", prop.integrated);
    printf("  Compute capability: %d.%d\n", prop.major, prop.minor);
    // memoryClockRate is in kHz and memoryBusWidth in bits;
    // the factor 2 accounts for the double data rate of graphics memory
    printf("  Peak Memory Bandwidth (GB/s): %f\n\n",
           2.0 * prop.memoryClockRate * (prop.memoryBusWidth / 8) / 1.0e6);
    printf("  Total global mem: %zu\n", prop.totalGlobalMem);
    printf("  Multiprocessor count: %d\n", prop.multiProcessorCount);
  }
  return 0;
}

This code compiles and reports the GPU properties only under an image whose major toolkit version matches the driver (select an appropriately tagged image from the nvidia/cuda repository on Docker Hub):

(base) maciej.skorski@shared-notebooks:~$ docker run -d \
  -it \
  --name nvidia-cuda \
  --runtime=nvidia \
  --mount type=bind,source=$(pwd),target=/home \
  --privileged \
  nvidia/cuda:11.6.2-devel-ubuntu20.04

docker exec -it nvidia-cuda sh -c "nvcc /home/query_GPU.cu -o /home/query_GPU && /home/query_GPU"
Device Number: 0
  Name: Tesla T4
  Integrated: 0
  Compute capability: 7.5
  Peak Memory Bandwidth (GB/s): 320.064000

  Total global mem: 15634661376
  Multiprocessor count: 40

However, the container doesn't even start when the image requires a newer CUDA version than the driver supports:

(base) maciej.skorski@shared-notebooks:~$ docker run -d \
>   -it \
>   --name nvidia-cuda \
>   --runtime=nvidia \
>   --mount type=bind,source=$(pwd),target=/home \
>   --privileged \
>   nvidia/cuda:12.2.0-devel-ubuntu20.04
d14d07b8b04bc7e6e27ce8312452850946d98d82611cb24c3e662ceb27d708c5
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #1: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: requirement error: unsatisfied condition: cuda>=12.2, please update your driver to a newer version, or use an earlier cuda container: unknown.
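
This failure is expected: the driver supports CUDA only up to 11.6 (see the nvidia-smi header above), while the image requires 12.2. To avoid such trial and error, the installed driver can be queried in scripts before choosing an image tag:

nvidia-smi --query-gpu=driver_version --format=csv,noheader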

Fixing reproducibility of scientific repos

As the last example, consider the recent cuZK project, which implements state-of-the-art cryptographic algorithms (multi-scalar multiplication) on GPUs. The original code was missing dependencies and compilation instructions, so I shared a working fork.

To work with the code, let's use the NVIDIA Docker image with a matching CUDA version; here I selected the tag 11.6.2-devel-ubuntu20.04. Check out the code and start a container that mounts the working directory containing it, like below:

docker run -d \
   -it \
   --name nvidia-cuda \
   --runtime=nvidia \
   --mount type=bind,source=$(pwd),target=/home \
   --privileged \
   nvidia/cuda:11.6.2-devel-ubuntu20.04

To build the code, we need a few more dependencies within the container: git, and the GMP library for multi-precision arithmetic, which the project links against:

apt-get update
apt-get install -y git libgmp3-dev

After adjusting the header paths in the Makefile, the CUDA code can be compiled and run.
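
The exact change depends on the environment; as a purely illustrative sketch (the variable names are hypothetical, not necessarily those used in cuZK), the top of the Makefile should point at the installed toolkit and the GMP library:

# illustrative Makefile header; adapt paths to the actual project
CUDA_PATH ?= /usr/local/cuda
NVCC      := $(CUDA_PATH)/bin/nvcc
CFLAGS    := -I$(CUDA_PATH)/include
LDFLAGS   := -lgmp

With that in place, the build goes through: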

root@7816e1643c2a:/home/cuZK/test# make
...
root@7816e1643c2a:/home/cuZK/test# ls
BLS377         MSMtestbn.cu   Makefile      core          msmtesta  testb       testbn.cu
MSMtestbls.cu  MSMtestmnt.cu  libgmp.a     msmtestb  testbls.cu  testmnt.cu
root@7816e1643c2a:/home/cuZK/test# ./msmtestb
Please enter the MSM scales (e.g. 20 represents 2^20) 
