Choosing a deployment mode

Manual deployment

Pros

  • No OS version restrictions, as long as a driver can be found for the system; see [[#1.1.1 Check whether most Linux distributions are supported]]

Cons

  • Installation and ongoing maintenance are complex

gpu-operator

Pros

  • Drivers are installed automatically on newly added nodes
  • Easy driver upgrades: updating the Helm release upgrades every node
  • Simple to deploy and maintain

Cons

  • Requires specific OS versions; see [[#1.1.2 Check the supported operating systems]]
  • All nodes must run the same OS and kernel version

This article deploys with gpu-operator. It also works in an offline environment, but a container image registry and a package repository (yum/apt) for the target OS are required.

I. Component overview

Startup order

1. gpu-operator manages all of the resources below; if components stop being created automatically, the problem is usually with gpu-operator itself.
2. GFD and NFD both discover information about a node and attach it to the k8s Node object as labels. Most importantly, GPU nodes get the nvidia.com/gpu.present=true label, and only nodes carrying this label have the subsequent components installed.
3. Next, the Driver Installer and Container Toolkit Installer install the GPU driver and the container toolkit.
4. Then the device-plugin makes GPU resources visible to k8s so they can be scheduled and managed.
5. Finally, the exporter collects GPU metrics and exposes them in Prometheus Metrics format for GPU monitoring.

The NVIDIA GPU Operator consists of the following components:

  • NFD (Node Feature Discovery): labels nodes with attributes such as CPU ID, kernel version, operating system version, and whether the node is a GPU node. The label to watch is nvidia.com/gpu.present=true; if a node carries it, it is a GPU node.
  • GFD (GPU Feature Discovery): collects the GPU device attributes of a node (GPU driver version, GPU model, etc.) and exposes them as node labels. Deployed in the k8s cluster as a DaemonSet; its Pods only run on nodes labeled nvidia.com/gpu.present=true.
  • NVIDIA Driver Installer: installs the NVIDIA GPU driver on the node in a containerized way. Deployed as a DaemonSet; its Pods only run on nodes labeled nvidia.com/gpu.present=true.
  • NVIDIA Container Toolkit Installer: makes GPU devices usable inside containers, mainly by automatically modifying the container runtime configuration (e.g. /etc/containerd/config.toml). Deployed as a DaemonSet; its Pods only run on nodes labeled nvidia.com/gpu.present=true.
  • NVIDIA Device Plugin: exposes GPU devices as Kubernetes extended resources so that workloads can request them. Deployed as a DaemonSet; its Pods only run on nodes labeled nvidia.com/gpu.present=true.
  • DCGM Exporter: periodically collects the state of the node's GPUs (current temperature, total and used memory, utilization, etc.) and exposes it as metrics, to be used together with Prometheus and Grafana. Deployed as a DaemonSet; its Pods only run on nodes labeled nvidia.com/gpu.present=true.
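
To confirm that NFD/GFD labeled a node as expected, the node labels can be inspected directly. A minimal sanity check; the exact set of nvidia.com/* labels (such as gpu.product or gpu.count) depends on the GFD version, so treat the grep as illustrative, and <node-name> is a placeholder:

# Show which nodes carry the GPU-presence label
kubectl get node -L nvidia.com/gpu.present
# Dump the nvidia.com/* labels on one node
kubectl get node <node-name> -o jsonpath='{.metadata.labels}' | tr ',' '\n' | grep nvidia.com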

II. Environment preparation

1 Driver and operating system

The H20 GPU driver is used as the example here.

Look up the corresponding driver information.

1.1 Confirm the supported operating systems

1.1.1 Check whether most Linux distributions are supported

If no driver shows up when searching with "Linux 64-bit" as the operating system, the driver has OS restrictions and not every Linux distribution can use it.

image.png

If it cannot be found, select "Linux 64-bit RHEL 9" instead to look up the driver version.

RHEL, RHCOS, and Ubuntu are the three systems most likely to be supported by the majority of drivers.

image.png

1.1.2 Check the supported operating systems

Check the supported operating systems here (the available nvcr.io/nvidia/driver image tags).
Search for the driver version found above.

Do not use an operating system that does not show up in the search.

For example, 565.57.01-ubuntu22.04 can only be used on Ubuntu 22.04.

image.png

1.2 System optimization

apt update
apt upgrade
apt install linux-generic-hwe-22.04
apt-get install libx11-dev libxext-dev linux-headers-generic libvulkan1
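
The containerized driver build needs kernel headers that match the running kernel; if a new kernel was installed above, reboot first. A quick check on an Ubuntu 22.04 node (the driver container can also fetch headers from the APT repository configured later):

# Show the running kernel
uname -r
# Verify that matching headers are present
dpkg -l | grep linux-headers-$(uname -r)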

2 Image preparation

Push these images to your private registry as needed (a re-tag/push sketch follows the image list below).
The gpu-operator chart version used here is 24.6.2; if you change the version, the corresponding image tags will change as well.

The most important one is the nvcr.io/nvidia/driver:565.57.01-ubuntu22.04 image; see [[#1.1.2 Check the supported operating systems]].

crictl pull registry.k8s.io/nfd/node-feature-discovery:v0.16.3
crictl pull nvcr.io/nvidia/gpu-operator:v24.9.0
crictl pull nvcr.io/nvidia/cuda:12.6.1-base-ubi8
crictl pull nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.6.10
crictl pull nvcr.io/nvidia/k8s/container-toolkit:v1.16.2-ubuntu20.04
crictl pull nvcr.io/nvidia/k8s-device-plugin:v0.16.2-ubi8
crictl pull nvcr.io/nvidia/cloud-native/k8s-cc-manager:v0.1.1
crictl pull nvcr.io/nvidia/k8s/dcgm-exporter:3.3.7-3.5.0-ubuntu22.04
crictl pull nvcr.io/nvidia/cloud-native/k8s-mig-manager:v0.8.0-ubuntu20.04
crictl pull nvcr.io/nvidia/kubevirt-gpu-device-plugin:v1.2.9
crictl pull nvcr.io/nvidia/driver:565.57.01-ubuntu22.04
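
crictl can only pull images, not push them, so re-tagging and uploading to the private registry needs another tool. A minimal sketch using skopeo; www.you-domain.com is the same placeholder registry used later in this article, and the flat repository layout is an assumption, so adjust it to your registry:

# Copy one image from NGC to the private registry; repeat for each image in the list above
skopeo copy \
  docker://nvcr.io/nvidia/driver:565.57.01-ubuntu22.04 \
  docker://www.you-domain.com/driver:565.57.01-ubuntu22.04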

III. Deployment

1 Get the chart package

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
&& helm repo update
helm pull nvidia/gpu-operator --version 24.6.2
tar -zxvf gpu-operator-v24.6.2.tgz
cd gpu-operator/

2 Modify the configuration

2.1 NFD configuration changes

The main change is the image: the default registry is not reachable from behind the GFW, so it has to be changed manually (a sed one-liner for a private mirror is sketched after the snippet).
vim charts/node-feature-discovery/values.yaml
image:
  repository: registry.k8s.io/nfd/node-feature-discovery
...
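
If the NFD image should also come from the private registry, a one-line substitution is enough. This assumes the image has been mirrored to www.you-domain.com/node-feature-discovery (a placeholder, as elsewhere in this article):

# Point the NFD sub-chart at the private mirror instead of registry.k8s.io
sed -i 's#registry.k8s.io/nfd/node-feature-discovery#www.you-domain.com/node-feature-discovery#' charts/node-feature-discovery/values.yaml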

2.2 gpu-operator

cat values.yaml
operator:
  repository: nvcr.io/nvidia
  image: gpu-operator
  version: "v24.9.0" # change the version and image here

......

driver:
  ......
  repoConfig: # set this when using a private package repository; even if you do not, switching to the Aliyun mirror is recommended, since the default upstream repositories are very slow to download from
    configMapName: "ubuntu-apt" # must match the ConfigMap name below

......

hostPaths:
  rootFS: "/"
  driverInstallDir: "/run/"

......

toolkit: # adjust the following parameters when containerd is the container runtime
  ......
  env:
  - name: CONTAINERD_CONFIG
    value: "/etc/containerd/config.toml"
  - name: CONTAINERD_SOCKET
    value: "/run/containerd/containerd.sock"
  - name: CONTAINERD_RUNTIME_CLASS
    value: "nvidia"
  - name: CONTAINERD_SET_AS_DEFAULT
    value: "true"
APT repository configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: ubuntu-apt
  namespace: gpu-operator
data:
  source.list: |
    deb https://mirrors.aliyun.com/ubuntu/ jammy main restricted universe multiverse
    deb-src https://mirrors.aliyun.com/ubuntu/ jammy main restricted universe multiverse

    deb https://mirrors.aliyun.com/ubuntu/ jammy-security main restricted universe multiverse
    deb-src https://mirrors.aliyun.com/ubuntu/ jammy-security main restricted universe multiverse

    deb https://mirrors.aliyun.com/ubuntu/ jammy-updates main restricted universe multiverse
    deb-src https://mirrors.aliyun.com/ubuntu/ jammy-updates main restricted universe multiverse

    deb https://mirrors.aliyun.com/ubuntu/ jammy-backports main restricted universe multiverse
    deb-src https://mirrors.aliyun.com/ubuntu/ jammy-backports main restricted universe multiverse

Save the manifest above as ubuntu-apt.yaml and apply it:
kubectl apply -f ubuntu-apt.yaml
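
The ConfigMap targets the gpu-operator namespace, which is otherwise only created in 2.2.1 below. If the apply fails because the namespace is missing, create it first and re-apply:

# Create the namespace if it does not exist yet, then confirm the ConfigMap landed in it
kubectl get ns gpu-operator || kubectl create ns gpu-operator
kubectl get configmap ubuntu-apt -n gpu-operator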
If you use a private registry, the image addresses also need to be changed. Run the more specific substitutions first, otherwise the generic nvcr.io/nvidia pattern matches them and leaves the extra path segments behind:
sed -i 's/nvcr.io\/nvidia\/cloud-native/www.you-domain.com/g' values.yaml
sed -i 's/nvcr.io\/nvidia\/k8s/www.you-domain.com/g' values.yaml
sed -i 's/nvcr.io\/nvidia/www.you-domain.com/g' values.yaml
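
A quick way to confirm the substitution covered everything is to check that no nvcr.io references remain and to review the rewritten fields:

# Should print nothing if every image now points at the private registry
grep -n 'nvcr.io' values.yaml
# Review the rewritten repository fields
grep -n 'repository:' values.yaml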

2.2.1 Test that the configuration is valid

kubectl create ns gpu-operator
helm upgrade --install -n gpu-operator gpu-operator . --dry-run

3 Install

If the nvidia-driver-daemonset is not created automatically, see [[#1 Wrong RuntimeClass API version]].
helm upgrade --install -n gpu-operator gpu-operator . --debug --wait
kubectl get pod -n gpu-operator
Seeing the following output means everything is working:
kubectl logs -f -n gpu-operator     nvidia-driver-daemonset-58zcf
Starting NVIDIA persistence daemon...
Starting NVIDIA fabric manager daemon...
Mounting NVIDIA driver rootfs...
Done, now waiting for signal
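
Once the container-toolkit DaemonSet has finished, the cluster should have a RuntimeClass for the nvidia runtime and containerd on the GPU node should know about it. A spot check, assuming the toolkit env shown in values.yaml above:

# RuntimeClass created for the nvidia runtime
kubectl get runtimeclass nvidia
# On a GPU node: the toolkit should have added an nvidia runtime section
grep -n 'nvidia' /etc/containerd/config.toml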

IV. Verification and testing

kubectl get pod -n gpu-operator
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-d8wlq 1/1 Running 0 7m57s
gpu-operator-85ddb88b46-l8nkq 1/1 Running 0 8m23s
gpu-operator-node-feature-discovery-gc-79dcb7ffcd-dhplf 1/1 Running 0 8m23s
gpu-operator-node-feature-discovery-master-bc65c956f-4xhfv 1/1 Running 0 8m23s
gpu-operator-node-feature-discovery-worker-7s5xm 1/1 Running 0 8m23s
gpu-operator-node-feature-discovery-worker-chk9s 1/1 Running 0 8m23s
gpu-operator-node-feature-discovery-worker-kch4v 1/1 Running 0 8m23s
gpu-operator-node-feature-discovery-worker-zsh28 1/1 Running 0 8m23s
nvidia-container-toolkit-daemonset-tmxdd 1/1 Running 0 81s
nvidia-cuda-validator-j5tjs 0/1 Completed 0 49s
nvidia-dcgm-exporter-srb2v 1/1 Running 0 81s
nvidia-device-plugin-daemonset-z7t78 1/1 Running 0 7m57s
nvidia-driver-daemonset-965lp 1/1 Running 0 8m4s
nvidia-mig-manager-l5njq 1/1 Running 0 21s
nvidia-operator-validator-x7bnq 1/1 Running 0 7m57s

1 Check GPU resources

kubectl exec -it -n gpu-operator     nvidia-driver-daemonset-965lp sh
# nvidia-smi
Tue Nov 5 07:23:52 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.01 Driver Version: 565.57.01 CUDA Version: 12.7 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA H20 On | 00000000:11:00.0 Off | 0 |
| N/A 26C P0 72W / 500W | 1MiB / 97871MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA H20 On | 00000000:12:00.0 Off | 0 |
| N/A 28C P0 71W / 500W | 1MiB / 97871MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA H20 On | 00000000:3F:00.0 Off | 0 |
| N/A 27C P0 72W / 500W | 1MiB / 97871MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA H20 On | 00000000:42:00.0 Off | 0 |
| N/A 26C P0 73W / 500W | 1MiB / 97871MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 4 NVIDIA H20 On | 00000000:90:00.0 Off | 0 |
| N/A 26C P0 72W / 500W | 1MiB / 97871MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 5 NVIDIA H20 On | 00000000:91:00.0 Off | 0 |
| N/A 28C P0 72W / 500W | 1MiB / 97871MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 6 NVIDIA H20 On | 00000000:BB:00.0 Off | 0 |
| N/A 28C P0 73W / 500W | 1MiB / 97871MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 7 NVIDIA H20 On | 00000000:BE:00.0 Off | 0 |
| N/A 26C P0 72W / 500W | 1MiB / 97871MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
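
Besides running nvidia-smi inside the driver Pod, it is worth confirming that Kubernetes itself advertises the GPUs as an extended resource (<gpu-node-name> is a placeholder; the node above should report nvidia.com/gpu: 8 under Capacity and Allocatable):

kubectl describe node <gpu-node-name> | grep nvidia.com/gpu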

2 Test

Pod creation may fail; see [[#3 /dev/nvidiactl not found]] for the fix.
cat demo.yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  containers:
  - name: cuda-vectoradd
    image: www.you-domain.com/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04
    resources:
      limits:
        nvidia.com/gpu: 1
kubectl apply -f demo.yaml

# Seeing the following log output means it works correctly
kubectl logs cuda-vectoradd
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

V. Scaling out nodes

After a node is added, the driver is installed on it automatically.
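
To follow the rollout on the new node, list the operator Pods scheduled onto it (<new-node-name> is a placeholder); new driver, toolkit, and device-plugin Pods should appear:

kubectl get pod -n gpu-operator -o wide | grep <new-node-name>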

1 Check the GPU information on the new node

kubectl exec -it -n gpu-operator     nvidia-driver-daemonset-cvs9k sh
# nvidia-smi
Tue Nov 5 07:43:07 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.01 Driver Version: 565.57.01 CUDA Version: 12.7 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |

Troubleshooting

1 Wrong RuntimeClass API version

ERROR   controller.clusterpolicy-controller     Reconciler error        {"name": "cluster-policy", "namespace": "", "error": "no matches for kind \"RuntimeClass\" in version \"node.k8s.io/v1beta1\""}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227

Change the gpu-operator version:

operator:
  repository: nvcr.io/nvidia
  image: gpu-operator
  version: "v24.9.0" # do not use the default version; this one currently deploys without problems
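
The error means the operator is still requesting RuntimeClass through node.k8s.io/v1beta1, which newer Kubernetes releases no longer serve (only v1 remains). Before picking an operator version, you can check what your cluster actually offers:

# On current clusters this should only list node.k8s.io/v1
kubectl api-versions | grep node.k8s.io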

2 Kernel issues

Upgrade the kernel, otherwise the following error appears; I upgraded to 6.8.0-40-generic (upgrade commands are sketched after the error output).

Some index files failed to download. They have been ignored, or old ones used instead.
Resolving Linux kernel version...
Could not resolve Linux kernel version
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
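
A sketch of the kernel upgrade on Ubuntu 22.04; it simply re-runs the HWE install from section 1.2 and reboots, and the exact kernel you end up with (6.8.0-40-generic in my case) depends on the mirror:

# Install the HWE kernel and matching headers, then reboot into it
apt update
apt install --install-recommends linux-generic-hwe-22.04
reboot
# After the reboot, confirm the new kernel version
uname -r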

3 /dev/nvidiactl not found

Note

I am not sure whether this is a bug: every device node has to be symlinked manually, and it is unclear whether it will be fixed later. Do not change the driverInstallDir path to /dev; I tried that and the driver container would not start. I will update this section if there is progress.

ln -s /run/nvidia/driver/dev/nvidia* /dev/
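
After creating the symlinks, confirm the device nodes resolve under /dev and then re-create the test Pod:

ls -l /dev/nvidia*
kubectl delete -f demo.yaml && kubectl apply -f demo.yaml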

4 Other issues