How to troubleshoot AI Center ML Skill issues (especially ones that happen during Skill deployment or during prediction)?
ML Skills can fail in two main areas.
- The first one is when a Skill is created / modified so that it is deployed to be available for use.
- The second one is when a Skill is actually used, which is often termed as "the Skill is being called for prediction".
Follow this guide to collect information for troubleshooting when facing an error in either of these situations.
Issues During Deployment
- Check the ML Logs in the AI Center application. There should be [Error] / [Warn] logs. [Warn] logs may contain more detail than [Error] logs, so make sure to check both.
- If nothing useful is found in the ML Logs, check ai-deployer logs from the Linux machine. In order to get the exact ai-deployer pod name, run the following command in the terminal:
- kubectl -n aifabric get pods
One of the pods that are listed as a result of this command will look like "ai-deployer-deployment-xxxxxxxxxx-xxxxx". This is the ai-deployer pod name. To get the logs from this pod, run the command below:
- kubectl -n aifabric logs -f <ai-deployer-pod-name>
The logs displayed by this command may reveal useful information about the issue. Because -f streams logs in real time, it may help to deploy the ML Skill again while this command is running.
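The two commands above can be combined so that the pod name does not have to be copied by hand. The following is a minimal sketch, assuming the namespace is aifabric and the pod name starts with "ai-deployer-deployment-" as described above. The function name find_deployer_pod is hypothetical and is not part of AI Center or kubectl.

```shell
# Hypothetical helper: reads `kubectl get pods` output on stdin and prints
# the first pod whose name starts with "ai-deployer-deployment-".
find_deployer_pod() {
  awk '$1 ~ /^ai-deployer-deployment-/ { print $1; exit }'
}

# Usage on the Linux machine (requires kubectl access to the cluster):
#   pod=$(kubectl -n aifabric get pods --no-headers | find_deployer_pod)
#   kubectl -n aifabric logs -f "$pod"
```

If more than one ai-deployer pod is listed, this sketch only picks the first one, so the full pod list should still be checked.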
- If no issues exist within the ai-deployer pod, it may be that a pod was deployed properly but there were issues after pod creation. The status of the pod that was deployed should be checked in this case. In order to find the right pod, run the following command to search for all pods:
- kubectl get pods -A
Find the pod name that corresponds to the id of the skill, a hashed value that looks like "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx-x-xxxxxxxxxx-xxxxx". The hash to the left of the name (under the NAMESPACE column) that looks like "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" is the tenant namespace, which corresponds to the tenant id. The AGE column shows how long it has been since the pod was created, which is an important hint when identifying which pod corresponds to which skill. There is also a more systematic way to identify the pod (using the Network HAR trace in the browser to find the tenant id and skill id), which is explained in the "Issues During Prediction" section below.
If the STATUS of the pod is anything other than Running, use the command below to see the error:
- kubectl -n <tenant-namespace> describe pod <pod-name>
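When many pods are listed, the status check can be scripted. The following is a minimal sketch, assuming the default column layout of "kubectl get pods -A" (NAMESPACE, NAME, READY, STATUS, ...). The function name list_unhealthy_pods is hypothetical.

```shell
# Hypothetical helper: reads `kubectl get pods -A` output on stdin and prints
# "namespace pod status" for every pod whose STATUS is not Running.
# NR > 1 skips the header line.
list_unhealthy_pods() {
  awk 'NR > 1 && $4 != "Running" { print $1, $2, $4 }'
}

# Usage on the Linux machine (requires kubectl access to the cluster):
#   kubectl get pods -A | list_unhealthy_pods
# Then describe any pod that is reported, using its namespace and name.
```

Note that this also reports pods in the Completed state, which is not necessarily an error, so the result should be read together with the AGE column and the skill id.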
- The steps below are for checking for Docker images (built images or pulled base images) when necessary:
Run the command below to look for the ai-deployer pod name:
- kubectl -n aifabric get pods
Go inside the deployer pod:
- kubectl -n aifabric exec -it <ai-deployer-pod-name> -- sh
Look for the image to be validated:
- docker images
Start a container from the Docker image and open a shell in it:
- docker run -it <image-name> /bin/bash
Check for packages installed in the image:
- pip freeze
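To check for one specific package instead of reading the whole list, the "pip freeze" output can be filtered. The following is a minimal sketch, assuming the package is listed in the usual name==version form. The function name package_version is hypothetical.

```shell
# Hypothetical helper: reads `pip freeze` output on stdin and prints the
# pinned version of the package given as $1. The name comparison is
# case-insensitive. Returns a non-zero exit code if the package is not listed.
package_version() {
  awk -F'==' -v pkg="$1" '
    tolower($1) == tolower(pkg) { print $2; found = 1 }
    END { exit !found }
  '
}

# Usage inside the container started from the image:
#   pip freeze | package_version numpy
```

Packages installed from a local path or URL appear in a different form in "pip freeze" and are not matched by this sketch.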
Issues During Prediction
Most of the steps for this case are already covered in the section above. Monitoring real-time logs from the pod is useful for inspecting exactly what happens inside the pod when the ML Skill is called for prediction (for example, from a Studio activity).
The most efficient way to identify the pod corresponding to a Skill is to:
- Go to the AI Center app
- Go to the Network tab in Dev Tools (F12) in the browser
- Navigate to the ML Skills page
- Click the mlskills call
- Go to the Preview tab
- Find the right ML Skill in the list and look for the tenantId and id. TenantId is the namespace and id is the pod name.
Once this information is obtained, the logs from the pod can be streamed live by running the following command on the Linux machine:
- kubectl -n <tenantId> logs -f <pod-name>
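The lookup from tenantId and id to the full pod name can also be scripted. The following is a minimal sketch, assuming the pod name starts with the skill id and that the pod runs in the namespace named after the tenantId. The function name find_skill_pod and the variables TENANT_ID and SKILL_ID are hypothetical.

```shell
# Hypothetical helper: reads `kubectl get pods -A` output on stdin and prints
# the names of pods in namespace $1 (tenantId) whose name starts with $2
# (skill id).
find_skill_pod() {
  awk -v ns="$1" -v id="$2" '$1 == ns && index($2, id) == 1 { print $2 }'
}

# Usage on the Linux machine (requires kubectl access to the cluster):
#   pod=$(kubectl get pods -A | find_skill_pod "$TENANT_ID" "$SKILL_ID" | head -n 1)
#   kubectl -n "$TENANT_ID" logs -f "$pod"
```

If the helper prints more than one pod, the AGE column of "kubectl get pods -A" can help decide which one to follow.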
Read more in ML Skills issues - Basic Debugging.