ML Skills deployment with GPU

When deploying multiple ML Skills, it seems that only one of the skills can have GPU activated.
Is this expected behavior, or is the installation missing some components (for example: docker pull nvidia/k8s-device-plugin:1.9)?
This is an on-prem online installation behind a proxy. Error message: Unschedulable 0/1 nodes are available: 1 Insufficient nvidia.com/gpu.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro M5000        Off  | 00000000:03:00.0 Off |                  Off |
| 39%   44C    P8    22W / 150W |      3MiB /  8126MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Yes, that’s right: GPUs are not sharable among several pods. When you deploy the first Skill or pipeline with GPU enabled, the GPU is bound (“mounted”) to that pod, so we are sure the GPU will always be available for that Skill/pod.

More documentation on this: Schedule GPUs | Kubernetes (it is not possible to request a fraction of a GPU).
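As a sketch of what this looks like in practice (the pod name, container name, and image below are placeholders, not the actual ML Skill manifest): a GPU-enabled pod requests the device as a whole unit under its resource limits, and with a single physical GPU only one such pod can be scheduled at a time.

```yaml
# Hypothetical GPU-enabled skill pod; names and image are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: ml-skill-gpu
spec:
  containers:
    - name: ml-skill
      image: example.com/ml-skill:latest   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1   # whole GPUs only; fractional requests are not allowed
```

Since the node above exposes a single Quadro M5000, a second GPU-enabled skill stays Pending with “0/1 nodes are available: 1 Insufficient nvidia.com/gpu” until the first pod releases the device. You can check what the node advertises with `kubectl describe node` and look for `nvidia.com/gpu` under Capacity/Allocatable.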
