Pods Of On-Prem AIC Installation Failing Due To GPU Error

system · May 3, 2023, 1:00pm

Pods of an on-prem AIC installation are failing due to a GPU error. How can this be resolved?

Issue Description:
Pods get stuck in CrashLoopBackOff with the GPU error "nvml error: driver not loaded: unknown".

Resolution:

SSH into the GPU-enabled node where the pod is failing.
Run the nvidia-smi command.
If the command returns an error such as "NVIDIA-SMI has failed because it could not communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.", the problem is with the NVIDIA driver: it is either broken or has been upgraded. In both cases, reinstall the NVIDIA driver. For AIC v21.4 or earlier, install NVIDIA driver version 450.51.06.
If the command returns no error and shows the driver details, check the NVIDIA container runtime. Run the /usr/bin/nvidia-container-runtime command and check whether a version is returned. If it returns an error, reinstall the runtime.
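
The checks above can be scripted on the node. The following is a minimal sketch, not an official tool: the helper names run and next_step are hypothetical, and passing --version to the container runtime is an assumption about how it reports its version. It treats a missing binary or a non-zero exit code as a failure and does not parse the error text.

```python
import subprocess


def run(cmd):
    """Run cmd and return (ok, output).

    ok is False when the binary is missing, the call times out,
    or the command exits with a non-zero status.
    """
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=60)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return False, str(exc)
    return proc.returncode == 0, (proc.stdout + proc.stderr).strip()


def next_step(driver_ok, runtime_ok):
    """Map the two check results to the action described in the steps above."""
    if not driver_ok:
        return "reinstall the NVIDIA driver"
    if not runtime_ok:
        return "reinstall the NVIDIA container runtime"
    return "driver and runtime both respond"


if __name__ == "__main__":
    # Step 1: does nvidia-smi reach the driver?
    driver_ok, driver_out = run(["nvidia-smi"])
    print(driver_out)

    # Step 2: only check the runtime once the driver responds.
    runtime_ok = False
    if driver_ok:
        # Assumption: --version makes the runtime print its version.
        runtime_ok, runtime_out = run(
            ["/usr/bin/nvidia-container-runtime", "--version"]
        )
        print(runtime_out)

    print("Suggested next step:", next_step(driver_ok, runtime_ok))
```

Run it with Python 3.7 or later on the affected node. The last line of output names the component to reinstall, if any.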

For AIC v21.10 or later: Follow the steps in the link below to reinstall the driver and container toolkit.
