What are the possible reasons why Pod is not starting, and how to troubleshoot?
Issue Description: How to troubleshoot a container or pod that is not starting.
Root Cause / Background: A container can fail to start for multiple reasons: it may fail to initialize, or it may start and then immediately fail. The following steps will help diagnose the issue.
Diagnosing
- There are a few ways to determine whether a pod is failing to start.
- The first is to use argocd. See: How To Debug Issues Using Argo CD?
- Specifically, looking for degraded apps will typically lead us to the issue. That KB gives an example of tracing the failing component to the rook-ceph-rgw pod.
- The second way is to use kubectl commands.
- Log into one of the master nodes
- Enable kubectl: Enabling kubectl
- Run the following command:

kubectl get pods -A | grep -v Running | grep -v Complete

- The above command returns a list of pods that are not in the Running state. Some of these pods may simply still be in the process of starting up.
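To see what the filter above does, here is a sketch run against simulated output (the sample rows below are illustrative, not from a real cluster); the same pipeline applies unchanged to real `kubectl get pods -A` output:

```shell
# Simulated `kubectl get pods -A` output (illustrative sample data only).
sample='NAMESPACE     NAME                READY   STATUS             RESTARTS   AGE
rook-ceph     rook-ceph-rgw-a-0   0/1     CrashLoopBackOff   13         1h
kube-system   coredns-abc         1/1     Running            0          2d
default       backup-job-xyz      0/1     Completed          0          3h'

# Same filter as above: drop healthy (Running) and finished (Completed) pods,
# leaving only the pods that need attention.
printf '%s\n' "$sample" | grep -v Running | grep -v Complete
```

Note that a pod in CrashLoopBackOff shows that string in the STATUS column, so the grep keeps it even though its underlying phase is Running.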
- Next, determine what states the pods are in.
- For viewing in argocd, the state is displayed in the UI as part of the icon for the pod.
- Here is an example of what this looks like in the KB example for argocd.
- In the example above, the broken heart indicates the pod is unhealthy.
- On the lower right-hand side, observe information about the state:
- The pod was created an hour ago.
- Currently it is in the Running state.
- However, the 0/1 means that it is not considered healthy or fully started.
- Additionally, the number '13' is the number of restarts.
- Because Kubernetes tries to be self-healing, we will see the state cycle: Running -> Error -> CrashLoopBackOff -> Running.
- If using the kubectl commands, the output will show the state of each pod.
- Depending on the state, there are actions we can take. The following is a list of states we might see when a container is not starting:
- Pending - This state means the pod could not be scheduled (for example, due to insufficient node resources or an unbound volume).
- Argo: Check the events.
- Kubectl: kubectl -n <namespace> describe pod <pod-name>
- ContainerCreating - A container stuck in this state usually points to an issue with pulling an image or with containerd. The events should explain the issue.
- Argo: Check the events.
- Kubectl: kubectl -n <namespace> describe pod <pod-name>
- CrashLoopBackOff
- Argo:
- Check the pod logs.
- If the logs do not end with an exception, check the events.
- If the events do not explain why the pod is crashing (typically, when the events do explain the issue it is due to a failed health check), then check the container status. This is under Summary -> Live Manifest. Note the exit code. See: Pod Exits With No Error Message.
- If the exit code is greater than 127, see: Pod Exits With No Error Message.
- Kubectl:
- kubectl -n <namespace> logs <pod-name>
- If the logs do not end with an exception, check the events: kubectl -n <namespace> describe pod <pod-name>
- If the events do not explain why the pod is crashing (typically, when the events do explain the issue it is due to a failed health check), then check the container status in the describe output and note the exit code.
- If the exit code is greater than 127, see: Pod Exits With No Error Message
- ImagePullBackOff and ErrImagePull
- Argo: Check the events.
- Kubectl: kubectl -n <namespace> describe pod <pod-name>
- PodInitializing
- Argo: Check the events.
- Kubectl: kubectl -n <namespace> describe pod <pod-name>
- Evicted
- Argo: Check the events.
- Kubectl: kubectl -n <namespace> describe pod <pod-name>
- In most of these cases, also check alerts. Evicted pods indicate a problem with the node (such as resource pressure).
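The exit-code threshold mentioned above follows the common POSIX convention that a process killed by a signal exits with 128 plus the signal number. A minimal sketch of decoding such a code (137 here is an illustrative value, commonly seen when the OOM killer sends SIGKILL):

```shell
# Decode a container exit code greater than 128 into its terminating signal.
# exit_code=137 is an illustrative example: 128 + 9 (SIGKILL), often the OOM killer.
exit_code=137
if [ "$exit_code" -gt 128 ]; then
  sig=$((exit_code - 128))
  # Prints the signal number and its name as reported by the shell.
  echo "container terminated by signal $sig ($(kill -l "$sig"))"
fi
```

Exit codes of 128 or below are the container process's own exit status and usually point at an application-level failure instead.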
Note:
Volume attachment failed Event:
Warning FailedAttachVolume 11s (x6 over 30s) attachdetach-controller AttachVolume.Attach failed for volume "pvc-6459e119-f581-48a0-8f85-4f674b2ae9fb" : rpc error: code = DeadlineExceeded desc = volume pvc-6459e119-f581-48a0-8f85-4f674b2ae9fb failed to attach to node XXXXXXX
See: How To Fix Looping PVC?
ImagePullBackOff or ErrImagePull: See ImagePullBack Error In Airgapped Installation for more details.
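Many of the checks above come down to reading events. On a real cluster, `kubectl get events -n <namespace> --sort-by=.lastTimestamp` lists them in time order; a sketch of narrowing that output to warnings such as the FailedAttachVolume event in the Note above (the sample rows are illustrative only):

```shell
# Simulated `kubectl get events` output (illustrative sample data only).
events='LAST SEEN   TYPE      REASON               MESSAGE
11s         Warning   FailedAttachVolume   AttachVolume.Attach failed for volume "pvc-..."
2m          Normal    Pulled               Successfully pulled image'

# Piping real output through the same grep surfaces only Warning events.
printf '%s\n' "$events" | grep Warning
```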