Choosing a Deployment Mode

Manual deployment

Pros
- No OS version restrictions, as long as a matching driver can be found; see [[#1.1.1 Checking whether most Linux distributions are supported]]

Cons

gpu-operator

Pros
- Drivers are installed automatically on newly added nodes
- Driver upgrades are easy: update the Helm release and every node gets upgraded
- Simple to deploy and maintain

Cons
- Requires specific OS versions; see [[#1.1.2 Checking supported operating systems]]
- All nodes must run the same OS and kernel version

This article deploys with gpu-operator. It also works in an air-gapped environment, provided you have a container image registry and a package repository for the target OS (this guide configures an Ubuntu apt mirror).
I. Component Overview
Startup order

1. gpu-operator manages all of the resources below. If components stop being created automatically, gpu-operator itself is usually the problem (see the log-check sketch after this list).
2. GFD and NFD both discover information about a node and attach it to the Kubernetes Node object as labels. In particular, the nvidia.com/gpu.present=true label marks the node as a GPU node, and only nodes carrying this label get the remaining components installed.
3. Next, the Driver Installer and the Container Toolkit Installer install the GPU driver and the container toolkit.
4. Then the device-plugin exposes GPU resources to Kubernetes so they can be scheduled and managed.
5. Finally, the exporter collects GPU metrics and exposes them in Prometheus metrics format, for GPU monitoring.
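If those later components never show up, the operator log is the quickest place to look. A minimal sketch, assuming the chart is installed into the gpu-operator namespace with release name gpu-operator as in the deployment section below:

```bash
# Recent operator log lines; reconcile errors for the ClusterPolicy show up here.
kubectl -n gpu-operator logs deploy/gpu-operator --tail=50

# The ClusterPolicy status should eventually report ready once all components are up.
kubectl get clusterpolicy
```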
The NVIDIA GPU Operator consists of the following components:
**NFD (Node Feature Discovery)**: labels nodes with attributes such as CPU ID, kernel version, OS version, and whether the node is a GPU node. The label to watch for is nvidia.com/gpu.present=true; if a node carries it, the node is a GPU node.
**GFD (GPU Feature Discovery)**: collects the GPU device attributes of a node (GPU driver version, GPU model, and so on) and exposes them as node labels. It is deployed as a DaemonSet in the cluster; its pods only run on nodes that carry the nvidia.com/gpu.present=true label.
**NVIDIA Driver Installer**: installs the NVIDIA GPU driver on the node in a containerized way. It is deployed as a DaemonSet in the cluster; its pods only run on nodes that carry the nvidia.com/gpu.present=true label.
**NVIDIA Container Toolkit Installer**: makes GPU devices usable inside containers, mainly by automatically adjusting the container runtime configuration (for example /etc/containerd/config.toml). It is deployed as a DaemonSet in the cluster; its pods only run on nodes that carry the nvidia.com/gpu.present=true label.
**NVIDIA Device Plugin**: exposes GPU devices to users as a Kubernetes extended resource. It is deployed as a DaemonSet in the cluster; its pods only run on nodes that carry the nvidia.com/gpu.present=true label.
**DCGM Exporter**: periodically collects the state of the node's GPU devices (current temperature, total memory, used memory, utilization, and so on) and exposes it as metrics, to be used together with Prometheus and Grafana. It is deployed as a DaemonSet in the cluster; its pods only run on nodes that carry the nvidia.com/gpu.present=true label.
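To see which nodes currently carry that label (and the other nvidia.com/* labels added by NFD and GFD), a small check; <gpu-node-name> is a placeholder:

```bash
# Nodes that the remaining GPU components will be scheduled onto.
kubectl get nodes -l nvidia.com/gpu.present=true

# All nvidia.com labels on a specific node.
kubectl describe node <gpu-node-name> | grep nvidia.com
```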
II. Environment Preparation

1 Driver and OS

This guide uses the H20 GPU driver as an example.
Look up the matching driver information on the NVIDIA driver download page.
1.1 Confirming supported operating systems

1.1.1 Checking whether most Linux distributions are supported

If the driver search does not offer the generic "Linux 64-bit" operating system, the driver has OS restrictions and will not work on every Linux distribution.
In that case, select "Linux 64-bit RHEL 9" instead to check the available driver versions.
RHEL, RHCOS, and Ubuntu are the three systems most likely to be supported by the majority of drivers.
1.1.2 Checking supported operating systems

Check which operating systems are supported here.
Search for the driver version found above.
Do not use an operating system that does not show up in the results.
For example, 565.57.01-ubuntu22.04 can only be used on Ubuntu 22.04.
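One way to see which OS variants exist for a given driver version is to list the tags of the driver image on NGC. A sketch, assuming skopeo is installed and nvcr.io is reachable:

```bash
# List driver image tags and keep only the ones for this driver version.
skopeo list-tags docker://nvcr.io/nvidia/driver | grep 565.57.01
```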
1.2 System preparation
```bash
apt update
apt upgrade
apt install linux-generic-hwe-22.04
apt-get install libx11-dev libxext-dev linux-headers-generic libvulkan1
```
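The new kernel only takes effect after a reboot, and the driver container resolves the kernel version of the running host (see the kernel issue in the troubleshooting section), so confirm what is actually running; a small check, assuming Ubuntu 22.04:

```bash
# Run after rebooting into the new kernel.
uname -r                              # the kernel you are actually running
dpkg -l "linux-headers-$(uname -r)"   # matching headers should be installed (status "ii")
```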
2 Image preparation

Push these images to your private registry as needed.
The gpu-operator chart version used here is 24.6.2; if you change the version, the corresponding image tags change as well (the operator image itself is pinned to v24.9.0, see [[#1 Wrong RuntimeClass API version]]).
```bash
crictl pull registry.k8s.io/nfd/node-feature-discovery:v0.16.3
crictl pull nvcr.io/nvidia/gpu-operator:v24.9.0
crictl pull nvcr.io/nvidia/cuda:12.6.1-base-ubi8
crictl pull nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.6.10
crictl pull nvcr.io/nvidia/k8s/container-toolkit:v1.16.2-ubuntu20.04
crictl pull nvcr.io/nvidia/k8s-device-plugin:v0.16.2-ubi8
crictl pull nvcr.io/nvidia/cloud-native/k8s-cc-manager:v0.1.1
crictl pull nvcr.io/nvidia/k8s/dcgm-exporter:3.3.7-3.5.0-ubuntu22.04
crictl pull nvcr.io/nvidia/cloud-native/k8s-mig-manager:v0.8.0-ubuntu20.04
crictl pull nvcr.io/nvidia/kubevirt-gpu-device-plugin:v1.2.9
crictl pull nvcr.io/nvidia/driver:565.57.01-ubuntu22.04
```
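crictl can only pull images, so to populate the private registry mentioned above, mirror them from a machine that has docker (or skopeo) and access to both registries. A sketch for one image, using the placeholder registry www.you-domain.com from the examples below:

```bash
# Mirror one image into the private registry; repeat for each image in the list above.
docker pull nvcr.io/nvidia/driver:565.57.01-ubuntu22.04
docker tag  nvcr.io/nvidia/driver:565.57.01-ubuntu22.04 www.you-domain.com/driver:565.57.01-ubuntu22.04
docker push www.you-domain.com/driver:565.57.01-ubuntu22.04
```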
III. Deployment

1 Fetching the chart
```bash
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
  && helm repo update
helm pull nvidia/gpu-operator --version 24.6.2
tar -zxvf gpu-operator-v24.6.2.tgz
cd gpu-operator/
```
2 Modifying the configuration

2.1 NFD configuration

The main change is the image: registry.k8s.io is blocked behind the firewall, so edit it manually to point at a reachable mirror or your private registry.
```
vim charts/node-feature-discovery/values.yaml

image:
  repository: registry.k8s.io/nfd/node-feature-discovery
  ...
```
2.2 gpu-operator
```
cat values.yaml

operator:
  repository: nvcr.io/nvidia
  image: gpu-operator
  version: "v24.9.0"
......
driver:
  ......
  repoConfig:
    configMapName: "ubuntu-apt"
......
hostPaths:
  rootFS: "/"
  driverInstallDir: "/run/"
......
toolkit:
  ......
  env:
    - name: CONTAINERD_CONFIG
      value: "/etc/containerd/config.toml"
    - name: CONTAINERD_SOCKET
      value: "/run/containerd/containerd.sock"
    - name: CONTAINERD_RUNTIME_CLASS
      value: "nvidia"
    - name: CONTAINERD_SET_AS_DEFAULT
      value: "true"
```
apt source configuration. The ConfigMap below lives in the gpu-operator namespace, so create that namespace first (see 2.2.1):
```yaml
# ubuntu-apt.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: ubuntu-apt
  namespace: gpu-operator
data:
  source.list: |
    deb https://mirrors.aliyun.com/ubuntu/ jammy main restricted universe multiverse
    deb-src https://mirrors.aliyun.com/ubuntu/ jammy main restricted universe multiverse
    deb https://mirrors.aliyun.com/ubuntu/ jammy-security main restricted universe multiverse
    deb-src https://mirrors.aliyun.com/ubuntu/ jammy-security main restricted universe multiverse
    deb https://mirrors.aliyun.com/ubuntu/ jammy-updates main restricted universe multiverse
    deb-src https://mirrors.aliyun.com/ubuntu/ jammy-updates main restricted universe multiverse
    deb https://mirrors.aliyun.com/ubuntu/ jammy-backports main restricted universe multiverse
    deb-src https://mirrors.aliyun.com/ubuntu/ jammy-backports main restricted universe multiverse
```

```bash
kubectl apply -f ubuntu-apt.yaml
```
If you use a private registry, rewrite the image addresses. Note the order: replace the more specific prefixes first, otherwise the first substitution already rewrites nvcr.io/nvidia and the remaining patterns never match.
```bash
sed -i 's/nvcr.io\/nvidia\/cloud-native/www.you-domain.com/g' values.yaml
sed -i 's/nvcr.io\/nvidia\/k8s/www.you-domain.com/g' values.yaml
sed -i 's/nvcr.io\/nvidia/www.you-domain.com/g' values.yaml
```
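A quick sanity check that every image reference was rewritten:

```bash
# Should print nothing once all images point at the private registry.
grep -n "nvcr.io" values.yaml
```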
2.2.1 Verifying the configuration
```bash
kubectl create ns gpu-operator
helm upgrade --install -n gpu-operator gpu-operator . --dry-run
```
3 Installation

If the nvidia-driver-daemonset is not created automatically, see [[#1 Wrong RuntimeClass API version]].
```bash
helm upgrade --install -n gpu-operator gpu-operator . --debug --wait
kubectl get pod -n gpu-operator
```
The following log output means the driver installation went fine:
```
kubectl logs -f -n gpu-operator nvidia-driver-daemonset-58zcf
Starting NVIDIA persistence daemon...
Starting NVIDIA fabric manager daemon...
Mounting NVIDIA driver rootfs...
Done, now waiting for signal
```
IV. Verification and Testing
```
kubectl get pod -n gpu-operator
NAME                                                          READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-d8wlq                                   1/1     Running     0          7m57s
gpu-operator-85ddb88b46-l8nkq                                 1/1     Running     0          8m23s
gpu-operator-node-feature-discovery-gc-79dcb7ffcd-dhplf       1/1     Running     0          8m23s
gpu-operator-node-feature-discovery-master-bc65c956f-4xhfv    1/1     Running     0          8m23s
gpu-operator-node-feature-discovery-worker-7s5xm              1/1     Running     0          8m23s
gpu-operator-node-feature-discovery-worker-chk9s              1/1     Running     0          8m23s
gpu-operator-node-feature-discovery-worker-kch4v              1/1     Running     0          8m23s
gpu-operator-node-feature-discovery-worker-zsh28              1/1     Running     0          8m23s
nvidia-container-toolkit-daemonset-tmxdd                      1/1     Running     0          81s
nvidia-cuda-validator-j5tjs                                   0/1     Completed   0          49s
nvidia-dcgm-exporter-srb2v                                    1/1     Running     0          81s
nvidia-device-plugin-daemonset-z7t78                          1/1     Running     0          7m57s
nvidia-driver-daemonset-965lp                                 1/1     Running     0          8m4s
nvidia-mig-manager-l5njq                                      1/1     Running     0          21s
nvidia-operator-validator-x7bnq                               1/1     Running     0          7m57s
```
1 Viewing GPU resources
```
kubectl exec -it -n gpu-operator nvidia-driver-daemonset-965lp sh
nvidia-smi
Tue Nov  5 07:23:52 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.01              Driver Version: 565.57.01      CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H20                     On  |   00000000:11:00.0 Off |                    0 |
| N/A   26C    P0             72W /  500W |       1MiB /  97871MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H20                     On  |   00000000:12:00.0 Off |                    0 |
| N/A   28C    P0             71W /  500W |       1MiB /  97871MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA H20                     On  |   00000000:3F:00.0 Off |                    0 |
| N/A   27C    P0             72W /  500W |       1MiB /  97871MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA H20                     On  |   00000000:42:00.0 Off |                    0 |
| N/A   26C    P0             73W /  500W |       1MiB /  97871MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA H20                     On  |   00000000:90:00.0 Off |                    0 |
| N/A   26C    P0             72W /  500W |       1MiB /  97871MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA H20                     On  |   00000000:91:00.0 Off |                    0 |
| N/A   28C    P0             72W /  500W |       1MiB /  97871MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA H20                     On  |   00000000:BB:00.0 Off |                    0 |
| N/A   28C    P0             73W /  500W |       1MiB /  97871MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA H20                     On  |   00000000:BE:00.0 Off |                    0 |
| N/A   26C    P0             72W /  500W |       1MiB /  97871MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
```
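nvidia-smi shows the driver's view of the GPUs; to confirm Kubernetes itself sees them as schedulable resources, check what the device plugin advertises on the node. A sketch; <gpu-node-name> is a placeholder:

```bash
# Number of nvidia.com/gpu resources allocatable on the node (8 in this example).
kubectl get node <gpu-node-name> -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
```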
2 Testing

The pod may fail to be created; see [[#3 /dev/nvidiactl not found]] for the fix.
```
cat demo.yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  containers:
  - name: cuda-vectoradd
    image: www.you-domain.com/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04
    resources:
      limits:
        nvidia.com/gpu: 1

kubectl apply -f demo.yaml
kubectl logs cuda-vectoradd
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
```
V. Adding Nodes

After a new node joins the cluster, the driver is installed on it automatically.

1 Viewing GPU information on the new node
```
kubectl exec -it -n gpu-operator nvidia-driver-daemonset-cvs9k sh
nvidia-smi
Tue Nov  5 07:43:07 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.01              Driver Version: 565.57.01      CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
```
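To confirm the operator rolled its DaemonSets out to the new node, filter the pods by node name; a sketch with <new-node-name> as a placeholder:

```bash
# All gpu-operator pods scheduled on the newly added node.
kubectl get pods -n gpu-operator -o wide --field-selector spec.nodeName=<new-node-name>
```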
Troubleshooting

1 Wrong RuntimeClass API version
```
ERROR controller.clusterpolicy-controller Reconciler error {"name": "cluster-policy", "namespace": "", "error": "no matches for kind \"RuntimeClass\" in version \"node.k8s.io/v1beta1\""}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227
```
Change the gpu-operator version. The older operator image still creates the RuntimeClass through the removed node.k8s.io/v1beta1 API, so bump the operator image to a newer release:
```yaml
operator:
  repository: nvcr.io/nvidia
  image: gpu-operator
  version: "v24.9.0"
```
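To confirm the mismatch on your cluster, and later that the new operator created its RuntimeClass, a quick check (RuntimeClass node.k8s.io/v1beta1 was removed in Kubernetes 1.25):

```bash
# API versions the cluster serves for RuntimeClass; recent clusters only serve node.k8s.io/v1.
kubectl api-versions | grep node.k8s.io

# After a successful install the "nvidia" RuntimeClass should exist.
kubectl get runtimeclass nvidia
```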
2 Kernel issues

Upgrade the kernel, otherwise the driver installer fails with the error below; I upgraded to 6.8.0-40-generic.
```
Some index files failed to download. They have been ignored, or old ones used instead.
Resolving Linux kernel version...
Could not resolve Linux kernel version
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
```
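A minimal sketch of the kernel upgrade on Ubuntu 22.04, assuming the linux-generic-hwe-22.04 meta-package from the preparation step provides the 6.8 kernel:

```bash
# Install the HWE kernel, reboot into it, then confirm the version.
apt update && apt install linux-generic-hwe-22.04
reboot
# After the reboot:
uname -r   # e.g. 6.8.0-40-generic
```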
3 /dev/nvidiactl not found

Note

I am not sure whether this is a bug, but all the device nodes need to be symlinked manually, and it is unclear whether this will be fixed in a later release. Do not change the path in driverInstallDir to /dev; I tried that and the driver would not start. I will update this section if there is any progress.
```bash
ln -s /run/nvidia/driver/dev/nvidia* /dev/
```
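After creating the symlinks, confirm the device nodes are visible on the host and re-run the vectoradd test from section IV:

```bash
# The NVIDIA device nodes should now be reachable under /dev.
ls -l /dev/nvidia*

# Re-create the test pod.
kubectl delete -f demo.yaml
kubectl apply -f demo.yaml
```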
4 Other issues