Resolution when a Pod exits with no error message.
Issue Description:
A Pod exits with no error message and is in the CrashLoopBackOff or Error state.
Root Cause:
This can have several causes: another process may be killing the application, or it could be a bug in the application itself.
Resolution:
- First, describe the pod and check whether it is failing its health check. A failing health check can trigger a container restart. Even if the logs contain no errors, examine them to understand why the health check was failing.
- kubectl -n <namespace> describe pod <pod-name>
- Look for any messages with the Reason 'Unhealthy'. (A command for filtering these events is sketched at the end of this step.)
- Here is an example:
- kubectl -n uipath describe pod orchestrator-7d87bb847-hgqkr
- As can be seen, the pod did not pass its health check and was restarted. Usually this is paired with an error in the logs, but that is not guaranteed.
- Note: there may be other reasons why a pod would be considered unhealthy. But if the restarting behavior corresponds to an Unhealthy event, then it is the Unhealthy event that needs to be examined.
- In such a scenario, search for other KBs or open a ticket with UiPath and include the support bundle: Using Support Bundle Tool
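- As a quick way to isolate these events, the event list can be filtered by reason (a sketch assuming the pod runs in the uipath namespace, as in the example above; adjust the namespace and pod name as needed):
- kubectl -n uipath get events --field-selector reason=Unhealthy
- To narrow the list to a single pod, a field selector on the involved object can be added: kubectl -n uipath get events --field-selector reason=Unhealthy,involvedObject.name=<pod-name>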
- Check the exit state of the pod. The exit state should give some indication of why the pod exited.
- kubectl -n <namespace> get pod <pod-name> -o json | jq -r '.status.containerStatuses'
- If the above command returns 'null', most likely an init container is crashing; in that case, use the following command:
- kubectl -n <namespace> get pod <pod-name> -o json | jq -r '.status.initContainerStatuses'
- The above command will output the status of the containers associated with the pod. For each container, check the exit code.
- For example, below is a screenshot of a sample output. In the output there are two containers listed. The first is istio-proxy and the second is Orchestrator. Both have an exit code of 137.
- Pay attention to both the exit code and the container name.
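- To summarize the containers and their exit codes without reading the full JSON, a jq one-liner along these lines can be used (a sketch; the placeholders are to be replaced, and the fields assume the container has terminated at least once):
- kubectl -n <namespace> get pod <pod-name> -o json | jq -r '.status.containerStatuses[] | "\(.name): exit code \(.lastState.terminated.exitCode // .state.terminated.exitCode // "n/a")"'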
- Understanding the exit code can help identify the issue. Linux exit codes are listed in Exit Codes
- For some exit codes, the exit code will be 128 + <signal number>. For example, an exit code of 137 is 128 + 9 and indicates that another process sent the container's process a SIGKILL (9), which is an immediate kill.
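- Such an exit code can be decoded in a shell by subtracting 128 and looking up the signal name (shown here for 137):
- echo $((137 - 128))   # prints 9
- kill -l 9   # prints KILL, i.e. SIGKILL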
- If the exit code is less than 128, recheck the pod logs but specify the container that failed.
- For example, in the below screenshot, notice that the container blkdevmapper failed with exit code 1. This container is an init container, so the only way to see the logs is to include the '-c' option in the fetch log command. The fetch log command is: kubectl -n <namespace> logs <pod-name> -c <container-name>
- For example: kubectl -n rook-ceph logs rook-ceph-osd-0-7997db6c55-xmtm8 -c blkdevmapper
- If logs are returned, it means the wrong command was previously being used to fetch them. If no logs are returned, or there is no error in them, the main process in the container is not logging correctly. In this case, contact UiPath. When opening a support ticket, make sure to include a support bundle: Using Support Bundle Tool.
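- Tip: if the container has already restarted, the logs from the crashed instance can usually still be retrieved by adding the --previous flag, for example:
- kubectl -n rook-ceph logs rook-ceph-osd-0-7997db6c55-xmtm8 -c blkdevmapper --previous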
- If the exit code is greater than 128, check whether a kill signal was sent. (Note: Kubernetes itself sends kill signals in some scenarios, so it is important that the first step was completed.)
- The following command adds an auditd rule that records kill signals with the tag 'audit_kill'.
- auditctl -a exit,always -F arch=$(uname -m) -S kill -k audit_kill
- After adding the rule, verify that it exists:
- auditctl -l
- There should be an entry that looks something like the following (it may vary depending on architecture): -a always,exit -F arch=b64 -S kill -F key=audit_kill
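- Note that a rule added with auditctl does not survive a reboot. If the investigation may span a reboot, the rule can also be persisted to a rules file (a sketch assuming a 64-bit system and that auditd loads rules via augenrules, the default on most distributions):
- echo '-a exit,always -F arch=b64 -S kill -k audit_kill' > /etc/audit/rules.d/audit_kill.rules
- augenrules --load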
- Look up the start command of the failing container. This step can be skipped, but skipping it may make parsing through the audit logs difficult. (Alternatively, if the process is not being killed immediately, look up its PID and use that instead).
- For the failing pod, find the startup command. One of the following queries can be used.
- When the init container is failing: kubectl -n <namespace> get pod <pod-name> -o json | jq -r '.spec.initContainers[0].command'
- When a runtime container is failing: kubectl -n <namespace> get pod <pod-name> -o json | jq -r '.spec.containers[0].command'
- The index in containers or initContainers needs to be adjusted to match the position of the failing container. (A query that lists every container with its command is sketched after the notes below.)
- For example, in the earlier output there were two failing containers. The first was istio-proxy and the second was Orchestrator. In that case, the index for each would be:
- istio-proxy: 0
- orchestrator: 1
- If the query returns null, see the note below.
- Note: The startup command might not be present in the spec. For example, Orchestrator pods do not have the startup command present. This can make it difficult to know what to search for in the audit logs. In such a case, timestamps have to be used, or reach out to UiPath support to verify.
- For components that have a Windows MSI version, the startup command is probably 'dotnet'
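- To avoid guessing at the index, every container in the pod can be listed together with its command in one query (a sketch; placeholders are to be replaced, and '<not set in spec>' is printed when the command is absent from the spec, as in the note above):
- kubectl -n <namespace> get pod <pod-name> -o json | jq -r '.spec.initContainers[]?, .spec.containers[] | "\(.name): \(.command // ["<not set in spec>"] | join(" "))"'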
- Retrigger the issue by waiting for the next crash or by deleting the pod, as shown below.
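- For example, to delete the pod so that its controller recreates it (assuming it is managed by a Deployment or similar controller):
- kubectl -n <namespace> delete pod <pod-name>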
- Query the audit logs for any kill signals, using the startup command (or PID) found in the previous step.
- ausearch -k audit_kill | grep -A 1 -B 1 <startup-command>
- If the startup command is not known just use: ausearch -k audit_kill
- There can be a lot of logs to sort through, but they are timestamped. The command used to check the container status can be used to get the timestamp for searching.
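- For example, the time the container last terminated can be read from the pod status and then used to bound the audit search (a sketch; ausearch accepts a start time via its -ts option, e.g. 'recent' for the last ten minutes or an explicit date and time):
- kubectl -n <namespace> get pod <pod-name> -o json | jq -r '.status.containerStatuses[].lastState.terminated.finishedAt'
- ausearch -k audit_kill -ts recent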
- Note: On some systems, the audit logs might rotate while this operation is being performed. If no entry for the kill signal is found, verify that the logs were not rotated recently.
- systemctl status auditd
- The last log rotation should be noted there in the log snippet at the bottom of the output.
- This can be compared to the timestamp of when the container stopped.
- If another process killed the container, the audit log search will return a three-line output:
- type=PROCTITLE msg=audit(1673744139.174:1056135): proctitle="-sh"
- type=OBJ_PID msg=audit(1673744139.174:1056135): opid=833725 oauid=-1 ouid=0 oses=-1 ocomm="rke2"
- type=SYSCALL msg=audit(1673744139.174:1056135): arch=c000003e syscall=62 success=yes exit=0 a0=cb8bd a1=9 a2=0 a3=7fa6f26ba9a0 items=0 ppid=312941 pid=570538 auid=1000 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts1 ses=8 comm="sh" exe="/usr/bin/bash" key="audit_kill"
- In the above output, the OBJ_PID line is the process that was killed. The 'ocomm' field is the name of the process (this will usually match the startup command) and the 'opid' field is its PID.
- The SYSCALL line represents the process that killed the container. The 'exe' field is the executable involved and the 'pid' field is the PID.
- In the example, it can be verified that the rke2 process was killed by the executable /usr/bin/bash (which was the issue being simulated with kill -9).
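- If the sending process is still running, the 'pid' from the SYSCALL line can be inspected to see what it is (a sketch using the pid from the example above):
- ps -p 570538 -o pid,ppid,user,cmd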
- If none of the above helps identify the issue, generate a support bundle and open a ticket with UiPath.
- See: Using Support Bundle Tool
- In the ticket opened with UiPath, include the support bundle, the steps taken so far, and any details discovered while going through this KB article.