Pipeline Failing With Error: All CUDA-Capable Devices Are Busy Or Unavailable

Resolution for a pipeline that fails with the error "all CUDA-capable devices are busy or unavailable".

Root Cause: This error means that either something is misconfigured in AI Center or the GPU is not working correctly.

Diagnosing

  1. Run nvidia-smi
    1. The output should show that no processes are running on the GPU.
    2. If processes are running, they could be the cause of the issue. Make sure no other skills are currently using the GPU.
    3. If the command returns an error, check the NVIDIA documentation for how to fix the issue.
      • HINT: Driver updates can sometimes cause nvidia-smi to return an error.
  2. If no processes are running, run the following command:
    1. nvidia-smi -q
    2. This command returns the licensing status, among other details. Make sure the GPU is licensed.
  3. Check /var/log/messages
    1. grep 'nvidia' /var/log/messages
    2. If there are errors, check the NVIDIA documentation to see if it is a known issue.
      • For example, if the GPU is not getting a license, this error will be present: nvidia-gridd: Failed to acquire/renew license from license server
  4. If everything on the NVIDIA side looks good, capture a support bundle and raise the issue with UiPath.
    1. 21.4 and earlier: v2021.4 Support Docs
    2. 21.10 and later: Using Support Bundle Tool
    3. Include the output of the nvidia-smi and nvidia-smi -q commands.
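
The checks above can be gathered in one pass before opening a ticket. The following is a minimal sketch, not a UiPath or NVIDIA tool: the function names, the OUT_DIR and LOG_FILE variables, and the output file names are all hypothetical. It only runs the commands already listed in the steps (nvidia-smi, nvidia-smi -q, and a grep of /var/log/messages) and saves their output to files.

```shell
#!/bin/sh
# Sketch: gather the GPU diagnostics described in the steps above.
# OUT_DIR, LOG_FILE, and all output file names are hypothetical choices.

OUT_DIR="${OUT_DIR:-./gpu-diagnostics}"
LOG_FILE="${LOG_FILE:-/var/log/messages}"

# Read log lines on stdin; print only nvidia-gridd license failures.
find_license_errors() {
    grep 'nvidia-gridd' | grep 'Failed to acquire/renew license' || true
}

# Save nvidia-smi output and NVIDIA-related log lines under OUT_DIR.
collect_gpu_diagnostics() {
    mkdir -p "$OUT_DIR" || return 1

    if command -v nvidia-smi >/dev/null 2>&1; then
        # Default view, including the process list (ideally empty).
        nvidia-smi > "$OUT_DIR/nvidia-smi.txt" 2>&1 ||
            echo "nvidia-smi exited with an error" >> "$OUT_DIR/nvidia-smi.txt"
        # Full query, which includes the licensing status.
        nvidia-smi -q > "$OUT_DIR/nvidia-smi-q.txt" 2>&1 ||
            echo "nvidia-smi -q exited with an error" >> "$OUT_DIR/nvidia-smi-q.txt"
    else
        echo "nvidia-smi not found on PATH" > "$OUT_DIR/nvidia-smi.txt"
    fi

    if [ -r "$LOG_FILE" ]; then
        # Keep every NVIDIA-related log line, then isolate license failures.
        grep 'nvidia' "$LOG_FILE" > "$OUT_DIR/nvidia-messages.txt" || true
        find_license_errors < "$OUT_DIR/nvidia-messages.txt" \
            > "$OUT_DIR/license-errors.txt"
    else
        echo "cannot read $LOG_FILE (root access may be needed)" \
            > "$OUT_DIR/nvidia-messages.txt"
    fi

    echo "Diagnostics written to $OUT_DIR"
}
```

To use the sketch, source the file in a shell on the GPU machine and run collect_gpu_diagnostics. A non-empty license-errors.txt suggests the licensing problem described in step 3, and the two nvidia-smi files can be attached to the support ticket alongside the support bundle.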