Thursday, January 13, 2022

k8s | Using the NVIDIA k8s-device-plugin

This post records the pitfalls I ran into with the NVIDIA k8s-device-plugin. The official docs make it look simple, just a few installation steps and you are done, but I hit a pile of problems and spent a lot of time resolving them.

The first pitfall: after installation, no GPU resources showed up on the node. I tried different plugin versions and even built the plugin image myself with Docker, but nothing helped. The plugin's error message, shown below, indicated that the NVML library could not be loaded.

# Error message
Loading NVML
Failed to initialize NVML: could not load NVML library.
If this is a GPU node, did you set the docker default runtime to `nvidia`?
You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start

After discussing it with a guru friend, he suspected CRI-O was the problem, and it turned out he was right (see here for details). I rebuilt the cluster without CRI-O and the GPU resources were found. So if GPU resources are missing after installation, this may be the reason. The installation process is briefly documented below.

Step 1: Check the environment

1. Make sure the cluster is not using CRI-O as its container runtime.

2. Install the NVIDIA driver and CUDA on the GPU node (quick checks for both points are sketched below).
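
A minimal way to check both points, assuming kubectl access and a shell on the GPU node:

# The CONTAINER-RUNTIME column should not show cri-o
$ kubectl get nodes -o wide

# On the GPU node, the driver is working if nvidia-smi lists the GPU
$ nvidia-smi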


Step 2: Install NVIDIA Docker

With NVIDIA Docker installed, containers no longer need a full CUDA installation; having the toolkit is enough to use CUDA. Just follow NVIDIA's official installation guide. If a container can start and show the nvidia-smi output correctly, the installation succeeded (a quick check is sketched below).
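
A quick sanity check, assuming the NVIDIA Container Toolkit is installed (the image tag here is only an example):

# Start a throwaway CUDA base container and run nvidia-smi inside it;
# seeing the usual GPU table means the nvidia runtime is wired up
$ docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi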



Step 3: Verify the default Docker runtime

Here we confirm that Docker's default runtime is nvidia. The installation in Step 2 should have modified "/etc/docker/daemon.json"; the relevant part is shown below. To be safe, it is recommended to add the default-runtime entry explicitly, or at least add it when things go wrong.

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

If you changed the file, restart Docker and then check the runtime:
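
Restarting Docker, assuming a systemd-based host:

$ sudo systemctl restart docker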

$ docker info | grep Runtime
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux nvidia runc
 Default Runtime: nvidia

Step 4: Install the k8s-device-plugin


Install it directly with kubectl; other versions can be found on GitHub. The plugin only needs to run on GPU nodes, so for more advanced settings, download the manifest, edit it, and then install it.

$ kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta6/nvidia-device-plugin.yml
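
To confirm the plugin DaemonSet Pod started (the Pod name suffix will differ):

$ kubectl get pods -n kube-system | grep nvidia-device-plugin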

To verify it worked, check the node status with kubectl describe nodes | less. You can see the plugin deployed on the node, and nvidia.com/gpu appears under Allocated resources. With CRI-O (the first pitfall), nvidia.com/gpu never shows up. A quicker one-line check is sketched after the output below.
Non-terminated Pods:          (9 in total)
  Namespace                   Name                                    CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                    ------------  ----------  ---------------  -------------  ---
  default                     gpu-pod                                 0 (0%)        0 (0%)      0 (0%)           0 (0%)         28h
  kube-system                 coredns-64897985d-cs6bf                 100m (1%)     0 (0%)      70Mi (0%)        170Mi (1%)     30h
  kube-system                 coredns-64897985d-h9c7l                 100m (1%)     0 (0%)      70Mi (0%)        170Mi (1%)     30h
  kube-system                 etcd-k8s01.lnw.ai                       100m (1%)     0 (0%)      100Mi (0%)       0 (0%)         30h
  kube-system                 kube-apiserver-k8s01.lnw.ai             250m (3%)     0 (0%)      0 (0%)           0 (0%)         30h
  kube-system                 kube-controller-manager-k8s01.lnw.ai    200m (2%)     0 (0%)      0 (0%)           0 (0%)         30h
  kube-system                 kube-proxy-jsrjb                        0 (0%)        0 (0%)      0 (0%)           0 (0%)         30h
  kube-system                 kube-scheduler-k8s01.lnw.ai             100m (1%)     0 (0%)      0 (0%)           0 (0%)         30h
  kube-system                 nvidia-device-plugin-daemonset-x7tvs    0 (0%)        0 (0%)      0 (0%)           0 (0%)         28h
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests    Limits
  --------           --------    ------
  cpu                850m (10%)  0 (0%)
  memory             240Mi (1%)  340Mi (2%)
  ephemeral-storage  0 (0%)      0 (0%)
  hugepages-1Gi      0 (0%)      0 (0%)
  hugepages-2Mi      0 (0%)      0 (0%)
  nvidia.com/gpu     0           0
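
A quicker one-line check of each node's allocatable GPUs, assuming the standard nvidia.com/gpu resource name (the dots must be escaped in the column path):

$ kubectl get nodes -o custom-columns="NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"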


Step 5: Test a Pod


I used the example from that article to test. When I ran it, I hit the second pitfall: checking the Pod showed an insufficient-resources scheduling error, as below.

0/2 nodes are available: 1 Insufficient nvidia.com/gpu, 1 node(s) had taint {node.kubernetes.io/unreachable: }, that the pod didn't tolerate.


Then I checked the plugin log and found the error message: my GPU (NVIDIA GeForce GTX 770) is too old to support health checking, so it was marked unhealthy. At that point I thought it was hopeless...

> kubectl logs -f -n kube-system nvidia-device-plugin-daemonset-9qtjm
2022/01/11 03:49:28 Loading NVML
2022/01/11 03:49:28 Starting FS watcher.
2022/01/11 03:49:28 Starting OS watcher.
2022/01/11 03:49:28 Retreiving plugins.
2022/01/11 03:49:28 Starting GRPC server for 'nvidia.com/gpu'
2022/01/11 03:49:28 Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
2022/01/11 03:49:28 Registered device plugin for 'nvidia.com/gpu' with Kubelet
2022/01/11 03:49:28 Warning: GPU-664254a5-59c1-b89c-6292-2a866af8310e is too old to support healthchecking: nvml: Not Supported. Marking it unhealthy.
2022/01/11 03:49:28 'nvidia.com/gpu' device marked unhealthy: GPU-664254a5-59c1-b89c-6292-2a866af8310e


I turned to my guru friend again and he immediately found a solution (I was being lazy, haha); see the referenced article. In short, you disable the health check. This requires redeploying the plugin: delete it, download the manifest, and edit it to add the parameter that disables health checking.

# Delete the plugin
$ kubectl delete -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta6/nvidia-device-plugin.yml

Download the plugin manifest from the step above, add the env section shown here, and then re-create the plugin (the re-create command is sketched after the snippet).
      containers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.10.0
        name: nvidia-device-plugin-ctr
        env:
        - name: DP_DISABLE_HEALTHCHECKS
          value: "xids"
        args: ["--fail-on-init-error=false"]

Check the plugin Pod to confirm the xids environment variable was added:
$ kubectl -n kube-system describe pods nvidia-device-plugin-daemonset-mkd9k
...
...
Containers:
  nvidia-device-plugin-ctr:
    Container ID:  docker://dbb89d7dc5439e7255e06e4c139a7db302ba8f401103c3c32514921b3df50ad6
    Image:         nvcr.io/nvidia/k8s-device-plugin:v0.10.0
    Image ID:      docker-pullable://nvcr.io/nvidia/k8s-device-plugin@sha256:5b967b3e92900797a74e0f7cd71005747fa3154503986676f372b9d82fe1d898
    Port:          <none>
    Host Port:     <none>
    Args:
      --fail-on-init-error=false
    State:          Running
      Started:      Wed, 12 Jan 2022 09:35:41 +0000
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Wed, 12 Jan 2022 09:34:28 +0000
      Finished:     Wed, 12 Jan 2022 09:35:19 +0000
    Ready:          True
    Restart Count:  3
    Environment:
      DP_DISABLE_HEALTHCHECKS:  xids
    Mounts:
      /var/lib/kubelet/device-plugins from device-plugin (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-6lqn2 (ro)
...
...

The test Pod also needs this env section added, as below:
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  automountServiceAccountToken: false
  containers:
    - name: cuda-vector-add
      # https://github.com/kubernetes/kubernetes/blob/v1.7.11/test/images/nvidia-cuda/Dockerfile
      image: "k8s.gcr.io/cuda-vector-add:v0.1"
      env:
      - name: DP_DISABLE_HEALTHCHECKS
        value: "xids"
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
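
Assuming the test manifest is saved as cuda-vector-add.yaml (the filename is arbitrary), create the Pod:

$ kubectl create -f cuda-vector-add.yaml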

Check the Pod status: it now completes normally and the GPU resource was allocated.
$ kubectl describe pods cuda-vector-add 
...
Containers:
  cuda-vector-add:
    Container ID:   docker://0b7bd12f8ebfc5a03e178765e62c09f1dccbc3f6cd9263ff371420c4cc321a88
    Image:          k8s.gcr.io/cuda-vector-add:v0.1
    Image ID:       docker-pullable://k8s.gcr.io/cuda-vector-add@sha256:0705cd690bc0abf54c0f0489d82bb846796586e9d087e9a93b5794576a456aea
    Port:           <none>
    Host Port:      <none>
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 12 Jan 2022 10:45:57 +0000
      Finished:     Wed, 12 Jan 2022 10:45:57 +0000
    Ready:          False
    Restart Count:  0
    Limits:
      nvidia.com/gpu:  1
    Requests:
      nvidia.com/gpu:  1
    Environment:
      DP_DISABLE_HEALTHCHECKS:  xids
    Mounts:                     <none>
.....
.....
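
To double-check that the CUDA sample actually ran, look at the Pod log; the vectorAdd sample in this image normally ends with a "Test PASSED" line:

$ kubectl logs cuda-vector-add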


Related resources:

Enabling NVIDIA Docker logging
Disabling the GPU health check


