Basic Debugging For Issues During ML Skill Deployment

system · December 29, 2022, 1:48pm

How to troubleshoot AI Center ML Skill issues (especially ones that happen during Skill deployment or during prediction)?

ML Skills can fail at two main stages.

The first is when a Skill is created or modified and then deployed so that it becomes available for use.
The second is when a Skill is actually used, which is often described as "the Skill is being called for prediction".

Follow this guide to collect information for troubleshooting when facing an error in either of these situations.

Issues During Deployment

Check ML Logs from the AI Center application. There should be [Error] / [Warn] logs. [Warn] logs may have more details compared to [Error] logs so make sure to check both.

If nothing useful is found in the ML Logs, check the ai-deployer logs on the Linux machine. To get the exact ai-deployer pod name, run the following command in the terminal:

kubectl -n aifabric get pods

One of the pods that are listed as a result of this command will look like "ai-deployer-deployment-xxxxxxxxxx-xxxxx". This is the ai-deployer pod name. To get the logs from this pod, run the command below:

kubectl -n aifabric logs -f <ai-deployer-pod-name>

The logs displayed by this command may reveal useful information about the issue. It may help to deploy the ML Skill again after this command is run, because the -f flag streams logs in real time.
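The two commands above can be chained so that the pod name does not have to be copied by hand. The following is only a sketch: the helper name find_deployer_pod is hypothetical, and it assumes the pod name starts with "ai-deployer-deployment-" as described above.

```shell
# Reads `kubectl -n aifabric get pods` output on stdin and prints the
# name of the first pod that starts with "ai-deployer-deployment-".
# (Helper name is hypothetical; the prefix is taken from the text above.)
find_deployer_pod() {
    awk '$1 ~ /^ai-deployer-deployment-/ { print $1; exit }'
}

# Usage on the Linux machine where kubectl is configured:
#   pod=$(kubectl -n aifabric get pods | find_deployer_pod)
#   kubectl -n aifabric logs -f "$pod"
```

If no matching pod is listed, the helper prints nothing, so it is worth checking that the variable is non-empty before streaming logs.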

If no issues exist within the ai-deployer pod, it may be that a pod was deployed properly but there were issues after pod creation. The status of the pod that was deployed should be checked in this case. In order to find the right pod, run the following command to search for all pods:

kubectl get pods -A

Find the pod name that corresponds to the id of the skill, a hashed value that looks like "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx-x-xxxxxxxxxx-xxxxx". The hash to the left of the name (under the NAMESPACE column) that looks like "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" is the tenant namespace, which corresponds to the tenant id. The AGE column displays how long it has been since the pod was created, which is an important hint when identifying which pod corresponds to which skill. There is also a more systematic approach to identifying the pod (using the Network HAR trace in the browser to find the tenant id and skill id), which is explained in the Issues During Prediction section.
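When the skill id is already known, the lookup can be narrowed with a small filter instead of scanning the full list by eye. This is a minimal sketch, assuming the pod name begins with the skill id and that the output has the usual NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE columns; the helper name find_skill_pods is hypothetical.

```shell
# Reads `kubectl get pods -A` output on stdin and prints
# "NAMESPACE NAME AGE" for every pod whose name starts with the skill id.
# AGE is read as the last field, so extra words in RESTARTS do not matter.
find_skill_pods() {
    skill_id=$1
    awk -v id="$skill_id" 'NR > 1 && index($2, id) == 1 { print $1, $2, $NF }'
}

# Usage:
#   kubectl get pods -A | find_skill_pods <skill-id>
```

The first column of each result is the tenant namespace, which is needed for the commands that follow.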

If the STATUS of the pod is anything other than Running, use the command below to see the error:

kubectl -n <tenant-namespace> describe pod <pod-name>
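To avoid typing the namespace and pod name by hand, the describe commands can be generated from the pod list. The sketch below only prints the commands and does not run them; the helper name is hypothetical and the column order (NAMESPACE, NAME, READY, STATUS) is an assumption about the `kubectl get pods -A` output.

```shell
# Reads `kubectl get pods -A` output on stdin and prints one
# `kubectl describe pod` command for each pod whose STATUS column
# (4th field) is not "Running". The header line is skipped.
describe_cmds_for_unhealthy_pods() {
    awk 'NR > 1 && $4 != "Running" {
        printf "kubectl -n %s describe pod %s\n", $1, $2
    }'
}

# Usage:
#   kubectl get pods -A | describe_cmds_for_unhealthy_pods
```

Pods that belong to finished jobs may show a status such as Completed and will also be listed, so the output should be read as a list of candidates rather than confirmed failures.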

The steps below are for checking for Docker images (built images or pulled base images) when necessary:

Run the command below to look for the ai-deployer pod name:

kubectl -n aifabric get pods

Go inside the deployer pod:

kubectl -n aifabric exec -it <ai-deployer-pod-name> -- sh

Look for the image to be validated:

docker images

Start a container from the Docker image and open a shell in it:

docker run -it <image-id> /bin/bash

Check for packages installed in the image:

pip freeze
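When only one or two packages are of interest, the pip freeze output can be filtered rather than read in full. This is a sketch under assumptions: the helper name package_version is hypothetical, awk is assumed to be available inside the image, and only entries in the "name==version" form are matched.

```shell
# Reads `pip freeze` output on stdin and prints the pinned version of
# the given package (case-insensitive match on the name before "==").
# Returns a non-zero exit status when the package is not listed.
package_version() {
    pkg=$1
    awk -F'==' -v want="$pkg" '
        tolower($1) == tolower(want) { print $2; found = 1 }
        END { exit !found }
    '
}

# Usage inside the container:
#   pip freeze | package_version <package-name>
```

A missing package or an unexpected version here can be compared against the requirements file that was uploaded with the ML package.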

Issues During Prediction

Much of this has already been covered in the section above. Monitoring real-time logs from the pod is useful when trying to inspect exactly what is going on inside the pod while the ML Skill is being called for prediction (for example, from a Studio activity).

The most efficient way to identify the pod corresponding to a Skill is to:

Go to the AI Center app
Open the Network tab in Dev Tools (F12) in the browser
Navigate to the ML Skills page
Click the mlskills call
Go to the Preview tab
Find the right ML Skill in the list and look for the tenantId and id. The tenantId is the namespace, and the id identifies the pod (the pod name begins with it).

Once this information is obtained, the logs from the pod can be streamed live by running the following command on the Linux machine:

kubectl -n <tenantId> logs -f <pod-name>
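The lookup and the log streaming can be wrapped in one helper that takes the tenantId and id found in the browser. This is a sketch, not a definitive implementation: the function name stream_skill_logs and the KUBECTL override variable are hypothetical, and it assumes the pod name begins with the skill id.

```shell
# Streams live logs for the pod of a given ML Skill.
#   $1 = tenantId (used as the namespace), $2 = skill id (pod name prefix)
# KUBECTL is a hypothetical override for the kubectl binary; it defaults
# to plain `kubectl`.
stream_skill_logs() {
    tenant_id=$1
    skill_id=$2
    # Take the first pod in the tenant namespace whose name starts with the id.
    pod=$(${KUBECTL:-kubectl} -n "$tenant_id" get pods --no-headers |
        awk -v id="$skill_id" 'index($1, id) == 1 { print $1; exit }')
    if [ -z "$pod" ]; then
        echo "No pod found for skill $skill_id in namespace $tenant_id" >&2
        return 1
    fi
    ${KUBECTL:-kubectl} -n "$tenant_id" logs -f "$pod"
}

# Usage:
#   stream_skill_logs <tenantId> <id>
```

With the stream running, calling the Skill again (for example, from the Studio activity) should make the corresponding log lines appear as the request is processed.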

Read more in ML Skills issues - Basic Debugging.
