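Parallel Computing - CUDA Environment Setup / Based on the Open-Source Kernel Driver
Summary
This article walks through setting up a CUDA environment and, using the KMD (kernel-mode driver) that Nvidia has open-sourced, connects the full call path from a CUDA app down to the GPU hardware.
Integrating the open-source KMD is optional. Its benefit is that it makes it easier to study the role the KMD plays while CUDA is running, which helps with both learning and debugging.
The nodes that must be completed to connect the app to the GPU hardware are shown in Figures 1 and 2 below.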

Hands-On Steps
Installing the Display Driver
Official driver version lookup - https://www.nvidia.com/en-us/drivers/

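For my current Ubuntu and graphics card setup, the recommended version is 565.
Download it and run the installer script directly. For the kernel module type I chose MIT/GPL, so that the installation stays compatible with the replacement I build myself later.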
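To confirm afterwards which kernel module flavor actually got installed, one option is to look at the license field that modinfo reports. The sketch below assumes the open modules report "Dual MIT/GPL" and the proprietary ones report "NVIDIA"; the function name classify_nvidia_license is my own, not part of any Nvidia tool.

```shell
#!/usr/bin/env bash
# Map a kernel-module license string to a driver flavor.
# Assumption: open modules report "Dual MIT/GPL", proprietary ones "NVIDIA".
classify_nvidia_license() {
    case "$1" in
        "Dual MIT/GPL") echo "open" ;;
        "NVIDIA")       echo "proprietary" ;;
        *)              echo "unknown" ;;
    esac
}

# Only query the system when modinfo can find an installed nvidia module.
if command -v modinfo >/dev/null 2>&1 && modinfo nvidia >/dev/null 2>&1; then
    classify_nvidia_license "$(modinfo -F license nvidia)"
fi
```

If this prints "proprietary", the MIT/GPL option was probably not selected during installation.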
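Replacing the KMD (Optional)
Clone the open-source project - https://github.com/NVIDIA/open-gpu-kernel-modules
Switch to the branch that matches the UMD (user-mode driver) version, then build and install: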
make modules -j$(nproc)
sudo make modules_install -j$(nproc)
After installation, the resulting KO modules are placed under /lib/modules/Your Version/kernel/drivers/video/.
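Since the replaced KMD has to match the UMD version, a quick consistency check can save some debugging time. The following is a minimal sketch, assuming that libcuda.so.1 resolves to a file named libcuda.so.<version> on this system; the helper names umd_version_from_path and compare_driver_versions are hypothetical.

```shell
#!/usr/bin/env bash
# Extract the version suffix from a versioned library path,
# e.g. ".../libcuda.so.565.77" -> "565.77".
umd_version_from_path() {
    printf '%s\n' "${1##*.so.}"
}

# Report whether kernel-mode and user-mode driver versions agree.
compare_driver_versions() {
    if [ -n "$1" ] && [ "$1" = "$2" ]; then
        echo "match: $1"
    else
        echo "mismatch: KMD=$1 UMD=$2"
    fi
}

# Live check, skipped when the tools or the module are absent.
# Assumption: libcuda.so.1 is a symlink to libcuda.so.<version>.
if command -v modinfo >/dev/null 2>&1 && modinfo nvidia >/dev/null 2>&1; then
    lib="$(ldconfig -p 2>/dev/null | awk '/libcuda\.so\.1 /{print $NF; exit}')"
    if [ -n "$lib" ]; then
        compare_driver_versions "$(modinfo -F version nvidia)" \
            "$(umd_version_from_path "$(readlink -f "$lib")")"
    fi
fi
```

A "mismatch" line here would suggest that the checked-out branch does not correspond to the installed user-mode driver.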
Preliminary Verification of the Display Driver Version
Check the version and the GPU's running state with the nvidia-smi command:
Sat Mar 1 19:29:33 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.77 Driver Version: 565.77 CUDA Version: 12.7 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 2060 ... Off | 00000000:01:00.0 On | N/A |
| 28% 34C P8 21W / 175W | 287MiB / 8192MiB | 3% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 3187 G /usr/lib/xorg/Xorg 156MiB |
| 0 N/A N/A 3413 G /usr/bin/gnome-shell 27MiB |
| 0 N/A N/A 4225 G ...seed-version=20250226-180124.932000 69MiB |
| 0 N/A N/A 5172 G /proc/self/exe 25MiB |
| 0 N/A N/A 10291 C+G /usr/bin/nautilus 4MiB |
+-----------------------------------------------------------------------------------------+
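For scripted checks, the three version numbers in the banner line can be pulled out with sed. This is only a sketch that assumes the banner keeps the "NVIDIA-SMI ... Driver Version: ... CUDA Version: ..." layout seen in the output above; parse_smi_banner is a name I made up.

```shell
#!/usr/bin/env bash
# Read nvidia-smi output on stdin and print
# "<nvidia-smi version> <driver version> <CUDA version>".
# Assumption: the banner layout matches the output shown above.
parse_smi_banner() {
    sed -n 's/.*NVIDIA-SMI \([0-9.]*\).*Driver Version: \([0-9.]*\).*CUDA Version: \([0-9.]*\).*/\1 \2 \3/p'
}

# Live call, skipped on machines without nvidia-smi.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi | parse_smi_banner
fi
```

Applied to the banner above, the function yields the nvidia-smi tool version, the driver version, and the highest CUDA version the driver supports, in that order.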
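Installing the CUDA Environment
Official installer download - https://developer.nvidia.com/cuda-downloads
After selecting the matching architecture and distribution, I chose deb (local) as the installer type: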
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-ubuntu2404.pin
sudo mv cuda-ubuntu2404.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/12.8.0/local_installers/cuda-repo-ubuntu2404-12-8-local_12.8.0-570.86.10-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2404-12-8-local_12.8.0-570.86.10-1_amd64.deb
sudo cp /var/cuda-repo-ubuntu2404-12-8-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-8
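Checking the CUDA Version
Once the packages are installed, nvcc reports the installed version (check whether /usr/local/cuda-12/bin needs to be added to the PATH environment variable):
# nvcc --version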
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Wed_Jan_15_19:20:09_PST_2025
Cuda compilation tools, release 12.8, V12.8.61
Build cuda_12.8.r12.8/compiler.35404655_0
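If nvcc is not found, the toolkit's bin directory is probably missing from PATH. Below is a minimal sketch for adding it without creating duplicate entries; prepend_path_once is a hypothetical helper, and the directory is the one mentioned above.

```shell
#!/usr/bin/env bash
# Print the PATH-style list $2 with directory $1 prepended,
# unless $1 is already one of its entries.
prepend_path_once() {
    case ":$2:" in
        *":$1:"*) printf '%s\n' "$2" ;;
        *)        printf '%s\n' "$1${2:+:$2}" ;;
    esac
}

# Apply it to the current shell session.
PATH="$(prepend_path_once /usr/local/cuda-12/bin "$PATH")"
export PATH
```

Putting these lines in ~/.bashrc makes the setting persistent, and re-sourcing the file does not grow PATH any further.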
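Verifying with the CUDA Samples
Nvidia provides a set of samples for developers to learn from and experiment with; project repository - https://github.com/NVIDIA/cuda-samples.git
Two of them are basic environment tests: deviceQuery and bandwidthTest.
On my machine they produce the following results: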
# deviceQuery
Samples/1_Utilities/deviceQuery/deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "NVIDIA GeForce RTX 2060 SUPER"
CUDA Driver Version / Runtime Version 12.7 / 12.8
CUDA Capability Major/Minor version number: 7.5
Total amount of global memory: 7781 MBytes (8158707712 bytes)
(034) Multiprocessors, (064) CUDA Cores/MP: 2176 CUDA Cores
GPU Max Clock rate: 1650 MHz (1.65 GHz)
Memory Clock rate: 7001 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 4194304 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total shared memory per multiprocessor: 65536 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 1024
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 3 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Managed Memory: Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.7, CUDA Runtime Version = 12.8, NumDevs = 1
Result = PASS
-----------------------------------
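One detail in the deviceQuery output is that the driver reports CUDA 12.7 while the runtime is 12.8. As far as I understand Nvidia's minor-version compatibility rules, a runtime can run on a driver from the same major release even if the driver's minor version is older, though features that need the newer driver may be unavailable. The sketch below classifies such a pair under that assumption; cuda_compat and its labels are my own naming.

```shell
#!/usr/bin/env bash
# Classify a driver/runtime CUDA version pair such as "12.7" and "12.8".
# Assumption: same-major pairs rely on minor-version compatibility.
cuda_compat() {
    local dmaj="${1%%.*}" dmin="${1#*.}" rmaj="${2%%.*}" rmin="${2#*.}"
    if [ "$dmaj" -gt "$rmaj" ] || { [ "$dmaj" -eq "$rmaj" ] && [ "$dmin" -ge "$rmin" ]; }; then
        echo "driver-covers-runtime"
    elif [ "$dmaj" -eq "$rmaj" ]; then
        echo "minor-version-compat"
    else
        echo "driver-too-old"
    fi
}

# The pair reported by deviceQuery above.
cuda_compat 12.7 12.8
```

For the 12.7 / 12.8 pair this prints "minor-version-compat", which is consistent with deviceQuery still passing.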
# bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...
Device 0: NVIDIA GeForce RTX 2060 SUPER
Quick Mode
Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 13.0
Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 13.1
Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 377.7
Result = PASS
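As a rough sanity check on the device-to-device figure, the theoretical peak can be estimated from the memory clock and bus width that deviceQuery reported. The sketch assumes two transfers per reported clock cycle, which is my assumption about how the 7001 MHz value should be read; the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Theoretical peak in GB/s from memory clock (MHz) and bus width (bits).
# Assumption: two transfers per reported clock cycle.
peak_bandwidth_gbs() {
    awk -v clk="$1" -v width="$2" 'BEGIN { printf "%.1f\n", clk * 2 * width / 8 / 1000 }'
}

# Measured bandwidth as a percentage of the peak.
efficiency_pct() {
    awk -v m="$1" -v p="$2" 'BEGIN { printf "%.1f\n", m / p * 100 }'
}

# Values taken from the deviceQuery and bandwidthTest output above.
peak="$(peak_bandwidth_gbs 7001 256)"
echo "peak: ${peak} GB/s"
echo "efficiency: $(efficiency_pct 377.7 "$peak")%"
```

Under that assumption the estimate comes to about 448.1 GB/s, so the measured 377.7 GB/s is roughly 84% of the peak.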