This post records the pitfalls I hit while using the NVIDIA k8s-device-plugin. If you read the official docs it looks trivial, just a few installation steps and you are done, but I really ran into a pile of problems and spent a lot of time solving them.
The first pitfall: after installing, no GPU resources showed up on the cluster. I switched between different plugin versions and even tried building the plugin image myself with Docker, but nothing helped. The plugin's error message looked like the following, saying the NVML library could not be loaded.
# Error message
Loading NVML
Failed to initialize NVML: could not load NVML library.
If this is a GPU node, did you set the docker default runtime to `nvidia`?
You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
After discussing it with a guru friend, he suspected CRI-O was the cause, and it really was (see here for details). I rebuilt the cluster without CRI-O and the GPU resources showed up. So if you cannot find any GPU resources after installing the plugin, this may be the reason. Below is a short record of the installation process.
Step 1: Check the environment
1. Make sure the cluster is not running on CRI-O (see the quick check below).
2. Install the NVIDIA driver and CUDA on the GPU node.
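A minimal sanity check for both items, assuming you already have kubectl access and the driver is installed on the GPU node:
# the CONTAINER-RUNTIME column shows which runtime each node uses;
# for this setup it should not be cri-o
$ kubectl get nodes -o wide
# on the GPU node itself, confirm the NVIDIA driver is working
$ nvidia-smi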
Step 2: Install NVIDIA Docker
With NVIDIA Docker installed, the container itself does not need CUDA installed; having the NVIDIA Container Toolkit is enough to use CUDA (this is the setup shown in NVIDIA's architecture diagram). Just follow NVIDIA's official installation guide. If you can start a container and it shows the nvidia-smi output correctly, the installation succeeded.
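To verify, a quick test run; the nvidia/cuda:11.0-base image here is just an example, any CUDA base image will do:
# run a throwaway CUDA container and ask it for the GPU status
$ docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi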
Step 3: Verify the default Docker runtime
Make sure Docker's default runtime is nvidia. The installation in Step 2 should already have modified "/etc/docker/daemon.json"; the changed part looks like the following. To be safe I recommend adding default-runtime explicitly, or at least adding it when things go wrong.
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
If you changed the file, restart Docker (sudo systemctl restart docker) and then check the runtime:
$ docker info | grep Runtime
Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux nvidia runc
Default Runtime: nvidia
Step 4: Install k8s-device-plugin
Install it directly with kubectl; other versions can be found on GitHub if needed. The plugin only needs to run on the GPU nodes, and for more advanced settings you can download the manifest, edit it, and install it from the local copy.
$ kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta6/nvidia-device-plugin.yml
After the plugin Pod is running, kubectl describe node on the GPU node should list the nvidia.com/gpu resource:
Non-terminated Pods:          (9 in total)
  Namespace    Name                                  CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------    ----                                  ------------  ----------  ---------------  -------------  ---
  default      gpu-pod                               0 (0%)        0 (0%)      0 (0%)           0 (0%)         28h
  kube-system  coredns-64897985d-cs6bf               100m (1%)     0 (0%)      70Mi (0%)        170Mi (1%)     30h
  kube-system  coredns-64897985d-h9c7l               100m (1%)     0 (0%)      70Mi (0%)        170Mi (1%)     30h
  kube-system  etcd-k8s01.lnw.ai                     100m (1%)     0 (0%)      100Mi (0%)       0 (0%)         30h
  kube-system  kube-apiserver-k8s01.lnw.ai           250m (3%)     0 (0%)      0 (0%)           0 (0%)         30h
  kube-system  kube-controller-manager-k8s01.lnw.ai  200m (2%)     0 (0%)      0 (0%)           0 (0%)         30h
  kube-system  kube-proxy-jsrjb                      0 (0%)        0 (0%)      0 (0%)           0 (0%)         30h
  kube-system  kube-scheduler-k8s01.lnw.ai           100m (1%)     0 (0%)      0 (0%)           0 (0%)         30h
  kube-system  nvidia-device-plugin-daemonset-x7tvs  0 (0%)        0 (0%)      0 (0%)           0 (0%)         28h
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests    Limits
  --------           --------    ------
  cpu                850m (10%)  0 (0%)
  memory             240Mi (1%)  340Mi (2%)
  ephemeral-storage  0 (0%)      0 (0%)
  hugepages-1Gi      0 (0%)      0 (0%)
  hugepages-2Mi      0 (0%)      0 (0%)
  nvidia.com/gpu     0           0
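The 0 under Allocated resources just means no Pod is requesting the GPU yet; the GPU count itself appears in the node's Capacity and Allocatable sections, which you can grep for directly (the node name is a placeholder):
# confirm the nvidia.com/gpu resource is registered on the node
$ kubectl describe node <gpu-node-name> | grep nvidia.com/gpu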
Step 5: Test a Pod
I tested with the example from this article. When I ran it I hit the second pitfall: the Pod failed to schedule with an insufficient-resources error, as below.
0/2 nodes are available: 1 Insufficient nvidia.com/gpu, 1 node(s) had taint {node.kubernetes.io/unreachable: }, that the pod didn't tolerate.
Then I checked the plugin log and found the error message: my GPU (NVIDIA GeForce GTX 770) is too old to support health checking, so it was marked unhealthy. At that point I thought it was hopeless...
> kubectl logs -f -n kube-system nvidia-device-plugin-daemonset-9qtjm
2022/01/11 03:49:28 Loading NVML
2022/01/11 03:49:28 Starting FS watcher.
2022/01/11 03:49:28 Starting OS watcher.
2022/01/11 03:49:28 Retreiving plugins.
2022/01/11 03:49:28 Starting GRPC server for 'nvidia.com/gpu'
2022/01/11 03:49:28 Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
2022/01/11 03:49:28 Registered device plugin for 'nvidia.com/gpu' with Kubelet
2022/01/11 03:49:28 Warning: GPU-664254a5-59c1-b89c-6292-2a866af8310e is too old to support healthchecking: nvml: Not Supported. Marking it unhealthy.
2022/01/11 03:49:28 'nvidia.com/gpu' device marked unhealthy: GPU-664254a5-59c1-b89c-6292-2a866af8310e
I asked my guru friend for help again and he found a solution right away (I got lazy XD); see this article for the details. In short, disable the health check. That means redeploying the plugin: delete it, download the manifest, edit it to add the parameter that disables the health check, and create it again.
# Delete the plugin
$ kubectl delete -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta6/nvidia-device-plugin.yml
Download the plugin manifest from the step above, add the env section shown below, and then create the plugin again.
containers:
- image: nvcr.io/nvidia/k8s-device-plugin:v0.10.0
  name: nvidia-device-plugin-ctr
  env:
  - name: DP_DISABLE_HEALTHCHECKS
    value: "xids"
  args: ["--fail-on-init-error=false"]
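For reference, the redeploy flow looks roughly like this; a sketch assuming the same manifest URL as above and nvidia-device-plugin.yml as the local file name:
# download the manifest, add the env section above, then create it again
$ wget https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta6/nvidia-device-plugin.yml
$ vim nvidia-device-plugin.yml
$ kubectl create -f nvidia-device-plugin.yml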
Once the plugin comes back up, describe the DaemonSet Pod to confirm the new args and environment variable took effect:
$ kubectl -n kube-system describe pods nvidia-device-plugin-daemonset-mkd9k
...
...
Containers:
nvidia-device-plugin-ctr:
Container ID: docker://dbb89d7dc5439e7255e06e4c139a7db302ba8f401103c3c32514921b3df50ad6
Image: nvcr.io/nvidia/k8s-device-plugin:v0.10.0
Image ID: docker-pullable://nvcr.io/nvidia/k8s-device-plugin@sha256:5b967b3e92900797a74e0f7cd71005747fa3154503986676f372b9d82fe1d898
Port: <none>
Host Port: <none>
Args:
--fail-on-init-error=false
State: Running
Started: Wed, 12 Jan 2022 09:35:41 +0000
Last State: Terminated
Reason: Error
Exit Code: 2
Started: Wed, 12 Jan 2022 09:34:28 +0000
Finished: Wed, 12 Jan 2022 09:35:19 +0000
Ready: True
Restart Count: 3
Environment:
DP_DISABLE_HEALTHCHECKS: xids
Mounts:
/var/lib/kubelet/device-plugins from device-plugin (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-6lqn2 (ro)
...
...
The spec of the test Pod I used looks like this:
spec:
  restartPolicy: OnFailure
  automountServiceAccountToken: false
  containers:
  - name: cuda-vector-add
    # https://github.com/kubernetes/kubernetes/blob/v1.7.11/test/images/nvidia-cuda/Dockerfile
    image: "k8s.gcr.io/cuda-vector-add:v0.1"
    env:
    - name: DP_DISABLE_HEALTHCHECKS
      value: "xids"
    resources:
      limits:
        nvidia.com/gpu: 1 # requesting 1 GPU
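Creating and checking the test Pod is the usual flow; a sketch assuming the spec above is saved as cuda-vector-add.yml:
$ kubectl create -f cuda-vector-add.yml
$ kubectl get pods cuda-vector-add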
Checking the Pod status, everything looks fine now and the GPU resource was allocated.
$ kubectl describe pods cuda-vector-add
...
Containers:
cuda-vector-add:
Container ID: docker://0b7bd12f8ebfc5a03e178765e62c09f1dccbc3f6cd9263ff371420c4cc321a88
Image: k8s.gcr.io/cuda-vector-add:v0.1
Image ID: docker-pullable://k8s.gcr.io/cuda-vector-add@sha256:0705cd690bc0abf54c0f0489d82bb846796586e9d087e9a93b5794576a456aea
Port: <none>
Host Port: <none>
State: Terminated
Reason: Completed
Exit Code: 0
Started: Wed, 12 Jan 2022 10:45:57 +0000
Finished: Wed, 12 Jan 2022 10:45:57 +0000
Ready: False
Restart Count: 0
Limits:
nvidia.com/gpu: 1
Requests:
nvidia.com/gpu: 1
Environment:
DP_DISABLE_HEALTHCHECKS: xids
Mounts: <none>
.....
.....
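As a final check you can look at the container log; the cuda-vector-add sample should end with a Test PASSED line when the GPU computation works:
$ kubectl logs cuda-vector-add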
Related resources: