Longhorn Volume Stuck Expanding to a Smaller Size Than the Actual Volume

How do you troubleshoot a stuck Longhorn volume?

There is a bug in Longhorn that can cause volume expansion to get stuck.

After resizing a PVC, the volume gets stuck in the expanding state, but at a size smaller than the current one:

Expansion Error: The expected size of engine should not be smaller than the current size . You can cancel the expansion to avoid volume crash.

If you try to cancel the expansion, it throws this error:

  • Unable to cancel expansion for volume : volume expansion is not started

The only resolution at this point is to delete the PVC and rebuild.
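Before deleting anything, it helps to confirm the mismatch. This is a minimal sketch of the check using hypothetical sizes; on a live cluster the two values would come from the Longhorn volume object's `spec.size` and `status.actualSize` fields (namespace and field paths are assumptions, verify against your Longhorn version):

```shell
# Hypothetical sizes (in bytes) read from the stuck Longhorn volume object,
# e.g. via: kubectl -n longhorn-system get volumes.longhorn.io VOLUME_NAME \
#             -o jsonpath='{.spec.size} {.status.actualSize}'
spec_size=10737418240     # 10 GiB -- the smaller size the volume is stuck expanding to
actual_size=21474836480   # 20 GiB -- the actual size of the volume

# A spec size below the actual size is exactly the condition the
# "expected size of engine should not be smaller" error describes.
if [ "$spec_size" -lt "$actual_size" ]; then
  echo "expansion target smaller than actual size"
fi
```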


If this is a rook-ceph OSD and there are other OSDs to rebuild from, do the following:

Resolution: In such a case, any of the other healthy OSDs can be used to sync the data into the re-created OSD. Procedure to rebuild a PV-backed OSD:

  1. Disable self-heal for the fabric-installer and rook applications

kubectl -n argocd patch application fabric-installer --type=json -p '[{"op":"replace","path":"/spec/syncPolicy/automated/selfHeal","value":false}]'

kubectl -n argocd patch application rook-ceph-operator --type=json -p '[{"op":"replace","path":"/spec/syncPolicy/automated/selfHeal","value":false}]'

kubectl -n argocd patch application rook-ceph-object-store --type=json -p '[{"op":"replace","path":"/spec/syncPolicy/automated/selfHeal","value":false}]'

  2. Scale down the operator

kubectl -n rook-ceph scale --replicas=0 deploy/rook-ceph-operator

  3. Get the PVC name corresponding to the crashing/corrupted OSD

kubectl -n rook-ceph get deploy --show-labels

e.g.

kubectl -n rook-ceph get deploy rook-ceph-osd-0 --show-labels

# Look for the label value that starts with set1; this is the name of the PVC that needs to be deleted
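The label lookup above can be scripted. A sketch against a hypothetical label string (the real one comes from the `--show-labels` output; the `ceph.rook.io/pvc` label key and the sample PVC name are assumptions to verify on your cluster):

```shell
# Hypothetical label string as printed by `kubectl get deploy ... --show-labels`.
labels='app=rook-ceph-osd,ceph-osd-id=0,ceph.rook.io/pvc=set1-data-0-x7k2q'

# Split labels on commas and extract the value of the ceph.rook.io/pvc key.
pvc=$(echo "$labels" | tr ',' '\n' | sed -n 's|^ceph.rook.io/pvc=||p')
echo "$pvc"
```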

  4. Scale down the crashing/corrupted OSD

kubectl -n rook-ceph scale --replicas=0 deploy OSD_DEPLOYMENT_NAME

e.g.

kubectl -n rook-ceph scale --replicas=0 deploy rook-ceph-osd-0

  5. Delete the crashing/corrupted OSD PVC

kubectl -n rook-ceph delete pvc PVC_NAME

  6. Delete the crashing OSD deployment

kubectl -n rook-ceph delete deploy OSD_DEPLOYMENT_NAME

e.g.

kubectl -n rook-ceph delete deploy rook-ceph-osd-0

  7. Scale up the rook operator

kubectl -n rook-ceph scale --replicas=1 deploy/rook-ceph-operator

  8. Wait until the new replacement OSD is created

kubectl -n rook-ceph get pods -w

  9. Remove the old OSD from the ceph cluster

# Mark OSD out

kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd out osd.OSD_ID

# Purge OSD

kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd purge osd.OSD_ID --force --yes-i-really-mean-it
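After the purge, `ceph osd tree` (run via the rook-ceph-tools pod, as above) should no longer list the old OSD. A sketch of that check against hypothetical tree output:

```shell
# Hypothetical `ceph osd tree` output after purging osd.0; on a live cluster:
#   kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd tree
tree='-1  0.50  root default
 1  hdd  0.25  osd.1  up
 2  hdd  0.25  osd.2  up'

# The purged OSD must be gone from the tree before the rebuild is considered done.
if echo "$tree" | grep -qw 'osd.0'; then
  echo "osd.0 still present -- purge did not complete"
else
  echo "osd.0 removed"
fi
```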

The procedure is almost the same for a raw-device-backed OSD, except that instead of deleting a PVC, you clean up the raw device. To find the raw device used by the corrupted OSD:

  1. Get OSD ID

kubectl -n rook-ceph get deploy --show-labels

e.g.

kubectl -n rook-ceph get deploy rook-ceph-osd-0 --show-labels

# Look for `ceph-osd-id` (in this case it is 0)

  2. Find the OSD uuid

find /var/lib/rook/ -name "whoami" | xargs -I{} bash -c '[[ "$(cat {})" -eq OSD_ID ]] && ls -l "$(dirname {})"/block'

e.g.

find /var/lib/rook/ -name "whoami" | xargs -I{} bash -c '[[ "$(cat {})" -eq 0 ]] && ls -l "$(dirname {})"/block'
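The `block` symlink found above points at an LVM device whose mapper name embeds the OSD uuid; that uuid (or a prefix of it) is what the later grep and dmsetup steps match on. A sketch extracting it from a hypothetical symlink target (all names below are made up):

```shell
# Hypothetical tail of the `ls -l .../block` output. LVM doubles the
# dashes inside volume-group names under /dev/mapper.
block_link='block -> /dev/mapper/ceph--7bb96604--e2df--4a34--9b2f--0a1b2c3d4e5f-osd--block--deadbeef'

# Pull out the first uuid segment after "ceph--"; this prefix is enough
# to identify the device in the cleanup steps.
uuid=$(echo "$block_link" | sed -n 's/.*ceph--\([0-9a-f]*\)--.*/\1/p')
echo "$uuid"
```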

  3. Find the block device name. In the example below the device name is sde

lsblk | grep 'OSD_UUID' -B1

e.g.

lsblk | grep '7bb96604' -B1
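The `-B1` in the grep above works because in `lsblk` output the ceph LVM entry is listed directly under its backing disk, so the line immediately before the match names the raw device. A sketch against hypothetical lsblk output:

```shell
# Hypothetical two lines of `lsblk` output; the ceph LVM entry sits under its disk.
lsblk_out='sde                                              8:64  0 100G 0 disk
└─ceph--7bb96604--e2df--4a34--9b2f--0a1b2c3d4e5f 253:2  0 100G 0 lvm'

# grep -B1 prints the matching LVM line plus the line before it;
# the first field of that preceding line is the raw device name.
device=$(echo "$lsblk_out" | grep '7bb96604' -B1 | head -n1 | awk '{print $1}')
echo "$device"
```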

  4. Clean up the device

sgdisk --zap-all DEVICE_PATH

dd if=/dev/zero of="DEVICE_PATH" bs=1M count=100 oflag=direct,dsync

blkdiscard DEVICE_PATH

ls /dev/mapper/ceph-* | grep 'OSD_UUID' | xargs -I% -- dmsetup remove %

rm -rf /dev/ceph-OSD_UUID-*

e.g.

sgdisk --zap-all /dev/sde

dd if=/dev/zero of=/dev/sde bs=1M count=100 oflag=direct,dsync

blkdiscard /dev/sde

ls /dev/mapper/ceph-* | grep '7bb96604' | xargs -I% -- dmsetup remove %

rm -rf /dev/ceph-7bb96604-*