How to troubleshoot a Longhorn volume stuck in expansion?
There is a bug in Longhorn that can leave a volume stuck in expansion. After resizing a PVC, the volume gets stuck in the expanding state, but with a requested size lower than the current one:
Expansion Error: The expected size of engine should not be smaller than the current size. You can cancel the expansion to avoid volume crash.
If you try to cancel the expansion, it throws this error:
- Unable to cancel expansion for volume VOLUME_NAME: volume expansion is not started
The only resolution at this point is to delete the PVC and rebuild it.
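Before deleting anything, it is worth confirming the volume really is stuck by comparing the requested size against the engine's current size. A minimal sketch: the kubectl queries are assumptions (default longhorn-system namespace, hypothetical volume name pvc-1234, field paths as of recent Longhorn versions) and are shown commented out; the comparison helper runs stand-alone with example byte values.

```shell
#!/usr/bin/env bash
# Sketch: confirm a stuck Longhorn expansion by comparing sizes.
# The kubectl queries below are assumptions (default longhorn-system
# namespace, hypothetical volume name pvc-1234); verify the field
# paths against your Longhorn version before relying on them.
#
# requested=$(kubectl -n longhorn-system get volumes.longhorn.io pvc-1234 -o jsonpath='{.spec.size}')
# current=$(kubectl -n longhorn-system get engines.longhorn.io -l longhornvolume=pvc-1234 -o jsonpath='{.items[0].status.currentSize}')
set -euo pipefail

is_stuck_expansion() {
  local requested="$1" current="$2"
  # Longhorn rejects an expansion whose target is smaller than the
  # engine's current size -- that mismatch is the stuck condition.
  if [ "$requested" -lt "$current" ]; then
    echo "stuck: requested ${requested} < current ${current}"
  else
    echo "ok"
  fi
}

is_stuck_expansion 10737418240 21474836480   # 10Gi requested, 20Gi current
```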
If this is a rook-ceph OSD and there are other healthy OSDs to rebuild from, any of them can be used to sync the data into the re-created OSD. Procedure to rebuild a PV-backed OSD:
- Disable self-heal for the fabric-installer and rook applications
kubectl -n argocd patch application fabric-installer --type=json -p '[{"op":"replace","path":"/spec/syncPolicy/automated/selfHeal","value":false}]'
kubectl -n argocd patch application rook-ceph-operator --type=json -p '[{"op":"replace","path":"/spec/syncPolicy/automated/selfHeal","value":false}]'
kubectl -n argocd patch application rook-ceph-object-store --type=json -p '[{"op":"replace","path":"/spec/syncPolicy/automated/selfHeal","value":false}]'
- Scale down operator
kubectl -n rook-ceph scale --replicas=0 deploy/rook-ceph-operator
- Get PVC name corresponding to the crashing/corrupted OSD
kubectl -n rook-ceph get deploy --show-labels
e.g.
kubectl -n rook-ceph get deploy rook-ceph-osd-0 --show-labels
# Look for the label value that starts with set1; this is the name of the PVC that needs to be deleted
- Scale down the crashing/corrupted OSD
kubectl -n rook-ceph scale --replicas=0 deploy OSD_DEPLOYMENT
e.g.
kubectl -n rook-ceph scale --replicas=0 deploy rook-ceph-osd-0
- Delete the crashing/corrupted OSD PVC
kubectl -n rook-ceph delete pvc PVC_NAME
- Delete the crashing OSD deployment
kubectl -n rook-ceph delete deploy OSD_DEPLOYMENT
e.g.
kubectl -n rook-ceph delete deploy rook-ceph-osd-0
- Scale the rook operator back up
kubectl -n rook-ceph scale --replicas=1 deploy/rook-ceph-operator
- Wait until the new replacement OSD is created.
kubectl -n rook-ceph get pods -w
- Remove the old OSD from the ceph cluster
# Mark the OSD out
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd out osd.OSD_ID
# Purge the OSD
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd purge osd.OSD_ID --force --yes-i-really-mean-it
e.g.
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd out osd.0
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd purge osd.0 --force --yes-i-really-mean-it
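The rebuild steps above can be bundled into one helper that only prints the command sequence, so it can be reviewed before being run against the cluster. This is a sketch: OSD_ID and the PVC name are inputs you must verify against the --show-labels output, and the PVC name in the example call is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: print the rebuild command sequence for a PVC-backed OSD.
# Review the output before executing any of it; the PVC name used in
# the example call at the bottom is hypothetical.
set -euo pipefail

osd_rebuild_cmds() {
  local osd_id="$1" pvc_name="$2"
  cat <<EOF
kubectl -n rook-ceph scale --replicas=0 deploy/rook-ceph-operator
kubectl -n rook-ceph scale --replicas=0 deploy rook-ceph-osd-${osd_id}
kubectl -n rook-ceph delete pvc ${pvc_name}
kubectl -n rook-ceph delete deploy rook-ceph-osd-${osd_id}
kubectl -n rook-ceph scale --replicas=1 deploy/rook-ceph-operator
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd out osd.${osd_id}
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd purge osd.${osd_id} --force --yes-i-really-mean-it
EOF
}

osd_rebuild_cmds 0 set1-data-0-example   # hypothetical PVC name
```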
The procedure is almost the same for a raw-device-backed OSD, except that instead of deleting a PVC you clean up the raw device. To find the raw device used by the corrupted OSD:
- Get OSD ID
kubectl -n rook-ceph get deploy --show-labels
e.g.
kubectl -n rook-ceph get deploy rook-ceph-osd-0 --show-labels
# Look for the `ceph-osd-id` label (in this example it is 0)
- Find the OSD uuid
find /var/lib/rook/ -name "whoami" | xargs -I{} bash -c '[[ "$(cat {})" -eq OSD_ID ]] && ls -l "$(dirname {})"/block'
e.g.
find /var/lib/rook/ -name "whoami" | xargs -I{} bash -c '[[ "$(cat {})" -eq 0 ]] && ls -l "$(dirname {})"/block'
- Find the block device name (in this example it is sde)
lsblk | grep 'OSD_UUID' -B1
e.g.
lsblk | grep '7bb96604' -B1
- Clean up the device
sgdisk --zap-all DEVICE_PATH
dd if=/dev/zero of=DEVICE_PATH bs=1M count=100 oflag=direct,dsync
blkdiscard DEVICE_PATH
ls /dev/mapper/ceph-* | grep 'OSD_UUID' | xargs -I% -- dmsetup remove %
rm -rf /dev/ceph-OSD_UUID-*
e.g.
sgdisk --zap-all /dev/sde
dd if=/dev/zero of=/dev/sde bs=1M count=100 oflag=direct,dsync
blkdiscard /dev/sde
ls /dev/mapper/ceph-* | grep '7bb96604' | xargs -I% -- dmsetup remove %
rm -rf /dev/ceph-7bb96604-*
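For the raw-device variant, the cleanup commands can likewise be generated from the device path and OSD uuid. A sketch that only prints the (destructive) commands so they can be reviewed first; the device and uuid in the example call are the example values from above.

```shell
#!/usr/bin/env bash
# Sketch: print the raw-device cleanup commands for a corrupted OSD.
# These commands are destructive -- confirm the device path and uuid
# with lsblk and the whoami lookup before running any of them.
set -euo pipefail

device_cleanup_cmds() {
  local device="$1" uuid="$2"
  cat <<EOF
sgdisk --zap-all ${device}
dd if=/dev/zero of=${device} bs=1M count=100 oflag=direct,dsync
blkdiscard ${device}
ls /dev/mapper/ceph-* | grep '${uuid}' | xargs -I% -- dmsetup remove %
rm -rf /dev/ceph-${uuid}-*
EOF
}

device_cleanup_cmds /dev/sde 7bb96604   # example values from this doc
```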