How to deal with failure to remove orphaned Pod errors in kubelet logs

Issue Description

How to deal with Orphaned Pod errors in kubelet logs? Here is an example of what these errors may look like:

E1020 15:14:47.797900    2962 kubelet_volumes.go:245] "There were many similar errors. Turn up verbosity to see them." err="orphaned pod \"62618883-4e64-4019-a307-790926e4d539\" found, but error occurred when trying to remove the volumes dir: not a directory" numErrs=1
E1020 15:14:49.790202    2962 kubelet_volumes.go:245] "There were many similar errors. Turn up verbosity to see them." err="orphaned pod \"62618883-4e64-4019-a307-790926e4d539\" found, but error occurred when trying to remove the volumes dir: not a directory" numErrs=1


These errors are typically harmless but may be a symptom of another problem. (They can also cause a spike in kubelet CPU usage.)
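
A quick way to gauge how often this is happening (this assumes the RKE2 kubelet log location used by the scripts later in this article) is to count the matching lines:

sudo grep -c "orphaned pod" /var/lib/rancher/rke2/agent/logs/kubelet.log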

Root Cause

This is caused by the following Kubernetes bug: https://github.com/kubernetes/kubernetes/pull/116134

When a pod is removed, kubelet normally unmounts and cleans up its volumes. If a pod is not terminated gracefully (for example, after an unexpected reboot, forced reboot, or system crash), or there is a problem with the storage volume, this error can occur.

If this error occurs, it is worth doing some investigation before clearing the errors.
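
To confirm that a pod UID reported in the log is truly orphaned, one option is to check whether any pod in the cluster still carries that UID while its directory remains on the node. This is a minimal sketch that reuses the kubelet log and kubectl paths from the scripts below; adjust them if your installation differs.

# Grab one orphaned pod UID from the kubelet log
podId=$(sudo grep -m1 -oE 'orphaned pod \\"[0-9a-f-]+' /var/lib/rancher/rke2/agent/logs/kubelet.log | grep -oE '[0-9a-f-]+$')
echo "Checking pod UID: $podId"
# If this prints nothing, no pod with that UID exists in the cluster any more...
sudo /var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml get pods -A -o custom-columns=UID:.metadata.uid,NAMESPACE:.metadata.namespace,NAME:.metadata.name | grep "$podId"
# ...while its directory is still present on the node:
sudo ls -ld /var/lib/kubelet/pods/$podId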

Diagnosing/Resolving

  1. If we are seeing this error, we first want to collect some information to give us a hint of why it is occurring. Please run the following short script.
    1. script vol_data_check.txt
      podIds=$(sudo cat /var/lib/rancher/rke2/agent/logs/kubelet.log | grep -o  -E 'orphaned pod \\"((\w|-)+)\\' | cut -d" " -f3 | grep -oE '(\w|-)+' | uniq)
      for podId in $podIds; do
        path="/var/lib/kubelet/pods/$podId/volumes"
        if [ -d "$path" ] ; then
          sudo cat /var/lib/kubelet/pods/$podId/volumes/kubernetes.io~csi/pvc-*/vol_data.json
        fi
      done
      sudo /var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml get pvc -A
      exit
    2. After exit is run and the script session ends, there should be a file called vol_data_check.txt in the current directory.
    3. It will contain the output of the above commands and will be helpful in debugging why the error was seen in the first place.
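    4. Optionally, to see which persistent volumes the leftover directories refer to, you can list the directory names for each orphaned pod and compare them against the cluster's persistent volumes. This is a sketch that reuses the same paths as the script above; adjust them if your installation differs.
      podIds=$(sudo cat /var/lib/rancher/rke2/agent/logs/kubelet.log | grep -o  -E 'orphaned pod \\"((\w|-)+)\\' | cut -d" " -f3 | grep -oE '(\w|-)+' | uniq)
      for podId in $podIds; do
        # Directory names (typically pvc-<uid>) left behind under the orphaned pod's CSI volume dir
        sudo ls /var/lib/kubelet/pods/$podId/volumes/kubernetes.io~csi/ 2>/dev/null
      done
      # Persistent volumes the cluster still knows about
      sudo /var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml get pv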
  2. After the information above has been collected, please run the following to resolve the error:
    1. script vol_data_removal.log
      podIds=$(sudo cat /var/lib/rancher/rke2/agent/logs/kubelet.log | grep -o  -E 'orphaned pod \\"((\w|-)+)\\' | cut -d" " -f3 | grep -oE '(\w|-)+' | uniq)
      for podId in $podIds; do
        path="/var/lib/kubelet/pods/$podId/volumes"
        if [ -d "$path" ] ; then
          echo "Removing $path"
          sudo rm -rf "$path"
        fi
      done
      exit
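    2. After the cleanup, confirm that no new orphaned pod errors appear in the kubelet log (the path below matches the one used in the scripts above):
      sudo tail -n 50 /var/lib/rancher/rke2/agent/logs/kubelet.log | grep "orphaned pod"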
      
  3. Finally, if opening a ticket with UiPath, include the following:
    1. Support bundle: https://docs.uipath.com/automation-suite/automation-suite/2023.4/installation-guide/using-support-bundle-tool
    2. The files: vol_data_removal.log and vol_data_check.txt