Parallel Computing - CUDA 环境搭建 / 基于开源 Kernel Driver

2025-03-02 • 12 分钟阅读 •

提要

本文内容所期望的目标是介绍 CUDA 环境的搭建过程，并结合 Nvidia 所开源的 KMD 软件，打通 CUDA APP 到硬件 GPU 的调用链路；

集成开源的 KMD 是一个可选的过程，其好处可以在于方便研究 CUDA 工作过程中 KMD 在其中的作用，起到学习与调试的目的；

在打通 APP 与 GPU 硬件的链路过程中，其中所需完成的节点如下图 1、2 所示，

实际操作

安装 Display Driver

官方驱动版本查询 - https://www.nvidia.com/en-us/drivers/

在我当前的 Ubuntu 及显卡硬件环境中，给我推荐的版本为 565，

这里直接下载后运行安装脚本即可，内核驱动部分我是选的MIT/GPL，为了稍后我自己的替换能够兼容；

替换 KMD「可选」

克隆开源项目 - https://github.com/NVIDIA/open-gpu-kernel-modules

切换至与 UMD 版本对应的分支，开始编译及安装，

make modules
make modules_install

安装完成后，对应的一些 KO 模块会被存放至/lib/modules/Your Version/kernel/drivers/video/路径下；

Display Driver 版本初步验证

通过 nvidia-smi 命令查看版本及显卡运行状态，

Sat Mar  1 19:29:33 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.77                 Driver Version: 565.77         CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 2060 ...    Off |   00000000:01:00.0  On |                  N/A |
| 28%   34C    P8             21W /  175W |     287MiB /   8192MiB |      3%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      3187      G   /usr/lib/xorg/Xorg                            156MiB |
|    0   N/A  N/A      3413      G   /usr/bin/gnome-shell                           27MiB |
|    0   N/A  N/A      4225      G   ...seed-version=20250226-180124.932000         69MiB |
|    0   N/A  N/A      5172      G   /proc/self/exe                                 25MiB |
|    0   N/A  N/A     10291    C+G   /usr/bin/nautilus                               4MiB |
+-----------------------------------------------------------------------------------------+

CUDA 环境安装

官方安装包下载 - https://developer.nvidia.com/cuda-downloads

选择对应的架构及发行版后，最后的方案方式我选的deb(local);

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-ubuntu2404.pinsudo
mv cuda-ubuntu2404.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/12.8.0/local_installers/cuda-repo-ubuntu2404-12-8-local_12.8.0-570.86.10-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2404-12-8-local_12.8.0-570.86.10-1_amd64.deb
sudo cp /var/cuda-repo-ubuntu2404-12-8-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-8

CUDA 版本查看

安装包完成后，即可通过 nvcc 查看已安装版本（注意环境变量是否需添加/usr/local/cuda-12/bin），

#nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Wed_Jan_15_19:20:09_PST_2025
Cuda compilation tools, release 12.8, V12.8.61
Build cuda_12.8.r12.8/compiler.35404655_0

CUDA Sample 验证

Nvidia 准备了一系列的 Sample 仅开发者学习、体验，项目地址 - https://github.com/NVIDIA/cuda-samples.git

其中有两个基本的环境测试 Sample，分别是deviceQuery及bandwidthTest，

基于我的设备环境，其运行结果如下，

# deviceQuery

Samples/1_Utilities/deviceQuery/deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA GeForce RTX 2060 SUPER"
  CUDA Driver Version / Runtime Version          12.7 / 12.8
  CUDA Capability Major/Minor version number:    7.5
  Total amount of global memory:                 7781 MBytes (8158707712 bytes)
  (034) Multiprocessors, (064) CUDA Cores/MP:    2176 CUDA Cores
  GPU Max Clock rate:                            1650 MHz (1.65 GHz)
  Memory Clock rate:                             7001 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 4194304 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        65536 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1024
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 3 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.7, CUDA Runtime Version = 12.8, NumDevs = 1
Result = PASS
-----------------------------------
# bandwidthTest

[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: NVIDIA GeForce RTX 2060 SUPER
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                        13.0

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                        13.1

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                        377.7

Result = PASS