How to debug issues using Argo CD?
Issue Description: How to debug issues using Argo CD?
Background: ArgoCD is a continuous deployment tool used in Kubernetes clusters. As it pertains to UiPath, it is used for two purposes:
- Used to deploy our software into a running cluster
- Used as a visualization tool that can help with debugging.
ArgoCD tries to be very intuitive in its layout. While the following steps may seem very detailed, it would only take an experienced user a few minutes to check everything that is outlined. Overall, ArgoCD can be a powerful way of quickly finding the cause of a problem.
Debugging with Argo CD:
- Get the credentials: Managing The Cluster In ArgoCD
- sudo su -
export KUBECONFIG="/etc/rancher/rke2/rke2.yaml" \
&& export PATH="$PATH:/usr/local/bin:/var/lib/rancher/rke2/bin"
kubectl get secrets/argocd-admin-password -n argocd \
  -o "jsonpath={.data['password']}" | base64 -d; echo
- Go to the alm endpoint (the alm. subdomain of the cluster FQDN).
- The username is admin and the password is the password returned by the credentials command.
- In Argo, check for applications in a degraded state. In the below screenshot we have filtered by degraded and we can see that fabric-installer is degraded.
- If there are degraded applications, click on the degraded application. If there are no degraded applications, try searching for 'Progressing' and then click on the progressing application. If there are no degraded or progressing applications, try 'Missing'. If that returns nothing, uncheck all search options, take a screenshot, and open a ticket with UiPath. (This should only happen in a failed install scenario, so also check our KB articles related to installation.)
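- The same check can be made from the command line if that is more convenient. A minimal sketch using the kubectl session from the credentials step; it assumes Argo CD runs in the 'argocd' namespace, as in the command above:
# List every Argo CD application with its sync and health status; look for Degraded, Progressing or Missing.
kubectl get applications.argoproj.io -n argocd \
  -o custom-columns=NAME:.metadata.name,SYNC:.status.sync.status,HEALTH:.status.health.status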
- If items are missing, try syncing the app (just click the Sync button at the top right of the app).
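- As an optional alternative to the UI Sync button, the same sync can be triggered from a shell, assuming the argocd CLI happens to be installed and the alm endpoint is reachable (the --grpc-web flag may or may not be needed depending on the ingress/TLS setup):
# Log in with the admin password from the credentials step, then sync the application by name.
argocd login <alm-endpoint> --username admin --password '<admin-password>' --grpc-web
argocd app sync <application-name>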
- Within the degraded app, search again for 'Degraded' items (or 'Progressing')
- In the below screenshot, we can see that within 'fabric-installer' there are degraded applications. (If there had been no degraded applications, we would search for progressing items.)
- In the above screenshot, note that there are a few degraded applications. Instead of degraded applications, there might be degraded pods. For applications, choose one degraded item and click on the pop-out icon.
- Following the chain of degraded applications, you will eventually find either a pod that is degraded, or an application resource that is degraded with a pod that is progressing. At this point, look at the logs for the degraded (or progressing) item. Here is an example:
- In the above example, observe that 'rook-ceph-object-store' is degraded, that 'rook-ceph' is also degraded, that the deployment 'rook-ceph-rgw-rook-ceph-a' is progressing and so is the corresponding ReplicaSet (the items in the black box), and finally that the pod 'rook-ceph-rg-rook-ceph-a-6f4..' is also degraded.
- It is important to note that in these scenarios the root item (rook-ceph-object-store) is degraded. This root item represents the overall application. Typically, the root item is not as important to examine.
- Additionally, when the root application is degraded, there is usually an item within the application that is progressing or degraded as well. (If that is not the case, then the application is probably about to become healthy.)
- Each of the items outlined above represents a resource type. Each resource has additional information we can check. But before we do that, some things to know:
- If the resource is a parent of an item (indicated by the tracing lines) and it is progressing but its child is degraded (or progressing), look at the child.
- If the child items are healthy then look at the parent item.
- Pods will always have logs. Some resources have child pods, and examining the parent that manages the pods allows you to aggregate the logs from all the pods it manages.
- A pod is a collection of containers, so when looking at the logs, check the individual container logs for each container in the pod.
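- For reference, the same per-container logs can also be pulled with kubectl. A minimal sketch with placeholder names (substitute the pod, namespace and container from the cluster being debugged):
# List the containers defined in the pod.
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].name}'
# Fetch the logs of one specific container.
kubectl logs <pod-name> -n <namespace> -c <container-name> --tail=200
# Fetch the logs of every container in the pod at once.
kubectl logs <pod-name> -n <namespace> --all-containers=true --tail=200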
- The general paradigm for checking things is as follows:
- Follow the degraded items until we get to the lowest level of child items.
- Check if the item has logs; if it does, examine them.
- Check if the item has events; if so, examine them.
- Check the status in the live manifest.
- If the issue is not clear, check the parent item that is closest to the root application.
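- The same paradigm can be followed with kubectl if the Argo CD UI is unavailable. A rough equivalent, again with placeholder names:
# Logs (only pods, and the parents that manage them, have logs).
kubectl logs <pod-name> -n <namespace> --all-containers=true
# Events for the resource; they are listed at the bottom of the describe output.
kubectl describe <kind> <name> -n <namespace>
# Live manifest, including the .status section at the bottom.
kubectl get <kind> <name> -n <namespace> -o yaml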
- In the above example, start with the logs of the pod 'rook-ceph-rg-rook-ceph-a-6f4..', because it is the lowest degraded child item from the root.
- Check if the resource has logs. (Since it is a pod it will have logs). If the item does not have logs, then move on to checking the events.
- Click the three-dotted icon and select 'Details'
- Alternatively, click the three-dotted icon for the resource of type 'deploy' (Deployment) named 'rook-ceph-rgw-rook-ceph-a' or the resource of type 'rs' (ReplicaSet) named 'rook-ceph-rg-rook-ceph-a-6f4..'. Both of these items help manage the 'rook-ceph-rg-rook-ceph-a-6f4..' pod and can be used to view its logs.
- When clicking the three-dotted icon, there will be an option for the logs.
- Go to the 'Logs' tab.
- Check the logs, starting with the items marked 'Containers'.
- In the above example, we navigated to the logs and selected the RGW container logs.
- As a side note, in this example there are no errors, because what is actually happening is that the container is waiting for a condition that will never be met.
- Sometimes the logs might show a message that says something like 'PodInitializing'. This means the container has not started yet and there are no logs to see. In such cases, check the 'INIT CONTAINERS' logs. When there are multiple init containers, start with the lowest one and work up until you find one with logs being generated. Init containers start in sequential order, and each one cannot start until the one above it has completed. The same generally applies to the containers listed under 'CONTAINERS'. Here is an example:
- In the above example, the items under 'CONTAINERS' showed 'PodInitializing'. Go to the 'INIT CONTAINERS' section. Check the container 'ISTIO-INIT' and verify whether it shows PodInitializing. Next, navigate to the 'DOWNLOAD-PLUGINS' container. Here, observe the logs; from these it can be seen that the container is stalled trying to perform some operation.
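- The init container logs can also be read with kubectl. A sketch with placeholder names; note that the upper-case names shown in the UI (for example 'ISTIO-INIT') correspond to lower-case container names in the pod spec:
# List the init containers in the order in which they run.
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.initContainers[*].name}'
# Read the logs of one init container at a time, starting with the lowest one that has output.
kubectl logs <pod-name> -n <namespace> -c <init-container-name>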
- Usually in these logs an exception is found that helps explain the issue. Sometimes that might not be the case. The logs would still be relevant, but it might be necessary to contact UiPath Support to better understand the issue.
- If everything is in the state 'PodInitializing', then check the events.
- After checking the logs of the resource (if it has them), we would next check the events.
- Again, click on the three-dotted icon and then select 'Details'.
- Go to the 'Events' tab.
- Here is an example:
- In the above example, observe that the startup probe failed. Since logs were being generated, this is what we expect to see. As mentioned above, the container is waiting for a condition that is never met, which in turn means that its startup probe will always fail.
- In other scenarios, there can be very important information in these events that will explain why a container is not starting, or that some other condition in the cluster is not met.
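- The same events can also be read with kubectl when the UI is not convenient. A sketch with placeholder names:
# Events that belong to a specific pod, oldest first.
kubectl get events -n <namespace> --field-selector involvedObject.name=<pod-name> --sort-by=.lastTimestamp
# Or describe the resource; its events are listed at the end of the output.
kubectl describe pod <pod-name> -n <namespace>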
- After checking the logs or events (if either is present), next check the live manifest.
- Again, click on the three-dotted icon and then select 'Details'.
- Go to the 'Summary'.
- Under the Summary tab, look at the 'LIVE MANIFEST'
- Scroll to the very bottom and check for the section titled 'Status'
- In this example, we see the container is exiting with a status code of 137. In this case, this is because the health probe is failing. But if this were a scenario where there was no health probe failure in the events, this could indicate that anti-virus is killing the pod, or that there is resource contention in the cluster.
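- The same 'Status' section is also available through kubectl. A sketch with placeholder names; for the 137 example above, the exit code is recorded under containerStatuses (137 generally means the process was killed, for example by a failed probe or the OOM killer):
# Full live manifest, including the status section at the bottom.
kubectl get pod <pod-name> -n <namespace> -o yaml
# Just the last termination exit code(s) of the containers, if they have restarted.
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[*].lastState.terminated.exitCode}'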
- Finally, if the issue is not identified, check the parent item closest to the root application that is also in a degraded state.
- Going back to our example, we looked at the logs for the pod 'rook-ceph-rg-rook-ceph-a-6f4..' but the issue is still not understood. As such, go to the parent item that is degraded. In this case that is the item of type 'cephobjectstore' called rook-ceph. (Note: do not examine the root item, even though it is the parent of the rook-ceph item and degraded as well; since it is an application, its manifest won't be as helpful.) Here is a screenshot of the item we want to look at:
- In this case, do the same thing: check to see if it has logs and events, and finally check the manifest. There may be no new information in the logs or events, but under the manifest the following may be seen:
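- For this example, a rough kubectl equivalent would be to read the live manifest of the cephobjectstore directly. The resource name comes from the example above; the rook-ceph namespace is an assumption based on the standard rook-ceph deployment:
# The status block at the bottom of the manifest describes why the object store is degraded.
kubectl get cephobjectstore rook-ceph -n rook-ceph -o yaml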
- Following the steps so far, a clear issue has not emerged. This will sometimes happen. In this case, reach out to UiPath Support, explain what has been examined, and also share a support bundle. However, in most scenarios, following these steps will at least result in finding a specific error or status message. This will either help to explain the issue or allow us to check the UiPath KB article database to see if there is a support article regarding the problem.
- Additionally, for the issue above, the key to the problem is actually in the pod logs. The pod 'rook-ceph-rg-rook-ceph-a-6f4..' is trying to communicate with another service we deploy that was purposefully disabled for this example. For someone familiar with the rook-ceph component, the next step would have been to check the live manifest of the resource of type 'cephcluster' named 'rook-ceph'. In the live manifest, you can see that the issue is that one of the services was scaled down.
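- A sketch of that last check with kubectl, again assuming the standard rook-ceph namespace (the cephcluster name comes from the example above):
# The live manifest shows the cluster status and any components that were scaled down.
kubectl get cephcluster rook-ceph -n rook-ceph -o yaml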
- If the steps in this article do not end up being helpful, open a ticket with UiPath. Include the following:
- Any findings from using this article.
- Support bundle: Using Support Bundle Tool.