Pods Of On-Prem AIC Installation Failing Due To GPU Error

Pods of on-prem AIC installation are failing due to GPU error. How to resolve this?

Pods are getting stuck in CrashLoopBackOff due to the GPU error "nvml error: driver not loaded: unknown". To resolve the error, follow the steps below:
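To confirm which pods are affected and which node hosts them, a query like the following sketch can help (the namespace name "aicenter" is an assumption; substitute the namespace of your installation):

```shell
#!/bin/sh
# Sketch: list pods stuck in CrashLoopBackOff and the nodes hosting them.
# The namespace "aicenter" is an assumption; adjust to your AIC namespace.
if command -v kubectl >/dev/null 2>&1; then
  # "-o wide" adds the NODE column, which tells you where to SSH next.
  crashing=$(kubectl get pods -n aicenter -o wide | grep CrashLoopBackOff || true)
else
  crashing="kubectl not available on this machine"
fi
printf '%s\n' "$crashing"
```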

  1. SSH into the GPU-enabled node where the pod is failing.
  2. Run the nvidia-smi command.
  3. If the command throws an error such as "NVIDIA-SMI has failed because it could not communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.", there is an issue with the NVIDIA driver: either the driver itself is broken or it has been upgraded. In both cases, reinstall the NVIDIA driver, specifically version 450.51.06 (for AIC v21.4 or earlier).
  4. If the command succeeds and shows the driver details, check the NVIDIA container runtime instead. Run /usr/bin/nvidia-container-runtime and see whether a version is returned. If an error is thrown, reinstall the runtime.
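The checks above can be sketched as a small triage script to run on the GPU node itself (the `--version` flag on the runtime binary is an assumption; the original steps only say a version should be returned):

```shell
#!/bin/sh
# Sketch of steps 2-4 as a triage script, run on the GPU node.
# The path /usr/bin/nvidia-container-runtime comes from the steps above.

check_driver() {
  # Steps 2-3: nvidia-smi failing indicates a driver problem -> reinstall driver.
  if nvidia-smi >/dev/null 2>&1; then
    echo "driver-ok"
  else
    echo "driver-broken: reinstall the NVIDIA driver"
  fi
}

check_runtime() {
  # Step 4: the container runtime should report a version when queried.
  if /usr/bin/nvidia-container-runtime --version >/dev/null 2>&1; then
    echo "runtime-ok"
  else
    echo "runtime-broken: reinstall the NVIDIA container runtime"
  fi
}

check_driver
check_runtime
```

Only reinstall the component whose check reports it as broken.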

For AIC v21.10 or greater, follow the steps in the link below to reinstall the driver and container toolkit:

AI Center - Installing A GPU Driver
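On an apt-based node, the toolkit reinstall typically resembles the dry-run sketch below (Ubuntu and the package name `nvidia-container-toolkit` are assumptions; the linked guide is authoritative for your AIC version):

```shell
#!/bin/sh
# Hypothetical reinstall sketch for an apt-based node (Ubuntu assumed).
# DRY_RUN defaults to "echo", so the commands are printed rather than executed;
# set DRY_RUN= to run them for real after verifying against the linked guide.
DRY_RUN=${DRY_RUN:-echo}
$DRY_RUN sudo apt-get update
$DRY_RUN sudo apt-get install -y nvidia-container-toolkit
$DRY_RUN sudo systemctl restart docker
```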