How to deal with failure to remove orphaned Pod errors in kubelet logs
Issue Description
How do you deal with orphaned pod errors in kubelet logs? Here is an example of what these errors may look like:
E1020 15:14:47.797900 2962 kubelet_volumes.go:245] "There were many similar errors. Turn up verbosity to see them." err="orphaned pod \"62618883-4e64-4019-a307-790926e4d539\" found, but error occurred when trying to remove the volumes dir: not a directory" numErrs=1
E1020 15:14:49.790202 2962 kubelet_volumes.go:245] "There were many similar errors. Turn up verbosity to see them." err="orphaned pod \"62618883-4e64-4019-a307-790926e4d539\" found, but error occurred when trying to remove the volumes dir: not a directory" numErrs=1
These errors are typically harmless, but they may be a symptom of another problem, and they can also cause a spike in kubelet CPU usage.
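To gauge how widespread the problem is, you can count the matching log lines and list the affected pod UIDs. A minimal sketch, assuming the RKE2 kubelet log path used in the scripts later in this article:

# Count the orphaned pod errors and list the affected pod UIDs
# (log path assumes an RKE2 node, as in the scripts below)
sudo grep -c 'orphaned pod' /var/lib/rancher/rke2/agent/logs/kubelet.log
sudo grep -oE 'orphaned pod \\"[0-9a-f-]+\\"' /var/lib/rancher/rke2/agent/logs/kubelet.log | sort | uniq -c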
Root Cause
This is caused by the following Kubernetes bug: https://github.com/kubernetes/kubernetes/pull/116134. When a pod is removed, kubelet normally unmounts its volumes as part of cleanup. If a pod is not terminated gracefully (for example, after an unexpected reboot, a forced reboot, or a system crash), or if there is a problem with the storage volume, this error can occur.
If this error occurs, we may want to do some investigation before resolving the errors.
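To see what the kubelet is failing to clean up, you can inspect the leftover pod directory directly. A minimal sketch, using the pod UID from the example log above and assuming the default /var/lib/kubelet directory:

# Inspect what is left behind for one orphaned pod
# (UID taken from the example log above; substitute your own)
podId=62618883-4e64-4019-a307-790926e4d539
sudo ls -la /var/lib/kubelet/pods/$podId/
sudo ls -la /var/lib/kubelet/pods/$podId/volumes/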
Diagnosing/Resolving
- If we are seeing this error, we first want to collect some information to give us a hint of why it is occurring. Please run the following short script:

script vol_data_check.txt
# Extract the orphaned pod UIDs from the kubelet log
podIds=$(sudo cat /var/lib/rancher/rke2/agent/logs/kubelet.log | grep -o -E 'orphaned pod \\"((\w|-)+)\\' | cut -d" " -f3 | grep -oE '(\w|-)+' | uniq)
# Dump the CSI volume metadata for each orphaned pod that still has a volumes dir
for podId in $podIds; do
  path="/var/lib/kubelet/pods/$podId/volumes"
  if [ -d "$path" ]; then
    sudo cat "$path"/kubernetes.io~csi/pvc-*/vol_data.json
  fi
done
# Record the PVCs in the cluster for cross-referencing
sudo /var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml get pvc -A
exit
- After the script finishes, there should be a file called vol_data_check.txt in the current directory.
- This file contains the output of the above commands and is helpful in debugging why the error was seen in the first place; see the optional summary sketch below.
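- Optionally, to summarize the volume metadata rather than reading the raw vol_data.json dumps, something like the following can help. This is a minimal sketch: it assumes jq is installed, and that driverName and volumeHandle are present in the files (they are fields the kubelet CSI plugin normally records; verify against the contents of vol_data_check.txt):

# Print the CSI driver and volume handle for every vol_data.json on the node
# (assumes jq is installed; verify the field names against your own files)
sudo find /var/lib/kubelet/pods -name vol_data.json | while read -r f; do
  echo "== $f"
  sudo jq -r '.driverName, .volumeHandle' "$f"
done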
- After the above command is executed, please run the following to resolve the error:

script vol_data_removal.log
# Extract the orphaned pod UIDs from the kubelet log
podIds=$(sudo cat /var/lib/rancher/rke2/agent/logs/kubelet.log | grep -o -E 'orphaned pod \\"((\w|-)+)\\' | cut -d" " -f3 | grep -oE '(\w|-)+' | uniq)
# Remove the leftover volumes directory for each orphaned pod
for podId in $podIds; do
  path="/var/lib/kubelet/pods/$podId/volumes"
  if [ -d "$path" ]; then
    echo "Removing $path"
    sudo rm -rf "$path"
  fi
done
exit
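- After the cleanup, it is worth confirming that the kubelet has stopped logging the error. A minimal check, assuming the same RKE2 log path (in the example log above the error repeated every two seconds, so a short watch is enough):

# Watch for new orphaned pod errors after the cleanup; press Ctrl+C to stop
sudo tail -f /var/lib/rancher/rke2/agent/logs/kubelet.log | grep 'orphaned pod'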
- Finally, if opening a ticket with UiPath, include the following:
- Support bundle: https://docs.uipath.com/automation-suite/automation-suite/2023.4/installation-guide/using-support-bundle-tool
- The files: vol_data_removal.log and vol_data_check.txt