How do I troubleshoot failing cluster backups?
Troubleshooting Partially Failed Backups in Velero
When dealing with partially failed backups in Velero, follow these steps to identify and resolve the issues:
- Inspect Velero Pod Logs
- Retrieve and analyze the Velero pod logs for potential issues causing the backup failure:
- kubectl -n velero logs deploy/velero
- Check for Backup Object Stuck
- Navigate to the installer directory and check the status of backups:
- cd /opt/UiPathAutomationSuite/installer
- bin/velero backup get
- Look for backups in the “PartiallyFailed” or “Failed” state:
NAME                               STATUS            ERRORS   WARNINGS   CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
schedule-schedule-20231122042959   InProgress        0        0          2023-11-22 04:29:59 +0000 UTC   1d        objectstore
schedule-schedule-20231122010659   PartiallyFailed   17       0          2023-11-22 01:06:59 +0000 UTC   1d        objectstore
schedule-schedule-20231121132814   Failed            0        0          2023-11-21 13:28:14 +0000 UTC   1d        objectstore
schedule-schedule-20231121124514   Completed         0        0          2023-11-21 12:45:14 +0000 UTC   1d        objectstore
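The check above boils down to scanning the STATUS column for problem states. As a minimal sketch, filtering can be scripted with awk; the sample output below is hard-coded for illustration, and on a live cluster you would pipe `bin/velero backup get` into the same awk filter instead:

```shell
#!/bin/sh
# Sketch: filter `velero backup get` output down to problem backups.
# The heredoc stands in for live output; on a cluster, replace
# `velero_output` with `bin/velero backup get`.
velero_output() {
cat <<'EOF'
NAME                              STATUS            ERRORS  WARNINGS  CREATED                        EXPIRES  STORAGE LOCATION
schedule-schedule-20231122010659  PartiallyFailed   17      0         2023-11-22 01:06:59 +0000 UTC  1d       objectstore
schedule-schedule-20231121132814  Failed            0       0         2023-11-21 13:28:14 +0000 UTC  1d       objectstore
schedule-schedule-20231121124514  Completed         0       0         2023-11-21 12:45:14 +0000 UTC  1d       objectstore
EOF
}

# Print NAME and STATUS for rows whose STATUS column is a failed state.
failed_backups() {
  velero_output | awk '$2 == "Failed" || $2 == "PartiallyFailed" {print $1, $2}'
}

failed_backups
```

Each backup name this prints is a candidate for `bin/velero backup describe <name>` in the next step.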
- Describe PartiallyFailed or Failed Backup
- Describe the backup to get more details:
- bin/velero backup describe schedule-schedule-20231121145558
In certain scenarios, the "Total items to be backed up" and "Items backed up" counts may match even though the backup failed. This is because these counts do not include Longhorn (Automation Suite version <= 23.4.x) and object store backups.
A backup can partially fail for several reasons:
- Longhorn NFS Server Access Issue
- Check the Longhorn manager logs for errors related to NFS server access:
- kubectl -n longhorn-system logs -l longhorn.io/managed-by=longhorn-manager | grep -i "Failed to create backup"
- Resolve the NFS access issue based on the error details reported.
- Unattached Volumes Issue
- Review the Velero pod logs for unattached volumes by checking for messages related to CSI driver reconciliation:
time="2023-11-23T05:30:56Z" level=info msg="Waiting for CSI driver to reconcile volumesnapshot uipath/velero-pushgateway-prometheus-pushgateway-sl2v2. Retrying in 5s" controller=backup logSource="pkg/controller/backup_controller.go:958"
time="2023-11-23T05:31:01Z" level=info msg="Waiting for CSI driver to reconcile volumesnapshot uipath/velero-pushgateway-prometheus-pushgateway-sl2v2. Retrying in 5s" controller=backup logSource="pkg/controller/backup_controller.go:958"
time="2023-11-23T05:31:06Z" level=debug msg="enqueueing resources ..." logSource="pkg/util/kube/periodical_enqueue_source.go:71" resource="*v1.BackupStorageLocationList"
time="2023-11-23T05:31:06Z" level=debug msg="skip enqueue object velero/objectstore due to the predicate." logSource="pkg/util/kube/periodical_enqueue_source.go:92" resource="*v1.BackupStorageLocationList"
time="2023-11-23T05:31:06Z" level=info msg="Waiting for CSI driver to reconcile volumesnapshot uipath/velero-pushgateway-prometheus-pushgateway-sl2v2. Retrying in 5s" controller=backup logSource="pkg/controller/backup_controller.go:958"
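To pull the stuck VolumeSnapshot name out of log lines like the above, a sed one-liner is enough. This sketch runs against a hard-coded sample line copied from the logs; in practice you would pipe the Velero pod logs through the same sed command:

```shell
#!/bin/sh
# Sketch: extract the namespace/name of the VolumeSnapshot that the CSI
# driver never reconciled. The sample line is copied from the logs above;
# live usage would pipe `kubectl -n velero logs <velero-pod>` into the sed.
log_line='time="2023-11-23T05:30:56Z" level=info msg="Waiting for CSI driver to reconcile volumesnapshot uipath/velero-pushgateway-prometheus-pushgateway-sl2v2. Retrying in 5s" controller=backup'

stuck_snapshot() {
  printf '%s\n' "$log_line" |
    sed -n 's/.*reconcile volumesnapshot \([^.]*\)\..*/\1/p'
}

stuck_snapshot
```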
- Identify the snapshot and PVC causing the issue:
- kubectl get volumesnapshot -n uipath | grep velero-pushgateway-prometheus-pushgateway-sl2v2
velero-pushgateway-prometheus-pushgateway-sl2v2   false   pushgateway-prometheus-pushgateway   1Gi   longhorn   snapcontent-3c648994-2d31-4b72-9b92-6296e44d0fc4   34m   34m
- Clean up any inactive pods associated with the problematic PVC:
- kubectl delete pod <pod-name> -n uipath
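The cleanup step can be sketched as scanning the pod list for pods that are no longer Running. The heredoc below stands in for `kubectl -n uipath get pods` output, and the pod names are invented for illustration:

```shell
#!/bin/sh
# Sketch: list inactive pods (candidates for deletion). The heredoc fakes
# the output of `kubectl -n uipath get pods`; the pod names are made up.
pods_output() {
cat <<'EOF'
NAME                        READY   STATUS      RESTARTS   AGE
pushgateway-abc12           1/1     Running     0          2d
pushgateway-old-xyz89       0/1     Error       0          5d
backup-job-qwe45            0/1     Completed   0          1d
EOF
}

# Skip the header row; print names of pods whose STATUS is neither
# Running nor Completed.
inactive_pods() {
  pods_output | awk 'NR > 1 && $3 != "Running" && $3 != "Completed" {print $1}'
}

inactive_pods
# Each reported pod can then be removed with:
#   kubectl -n uipath delete pod <pod-name>
```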
- Too Many Snapshots
- Identify and address issues caused by an excessive number of snapshots, if applicable, to prevent overload and backup failures:
- Disable the scheduled backup from the Longhorn UI.
- Delete all Failed and PartiallyFailed Velero backups.
- Run the Longhorn cleanup script (longhorn_snapshot_cleanup.sh).
- Snapshot Backup Timeout Issue
- If the backup failed due to a timeout, increase the CSI snapshot timeout (the default is 1 hour):
- ./configureUiPathAS.sh snapshot backup create --wait --csi-snapshot-timeout 3h
Check Backup Hook Pod Logs
-
If the object store backup failed with the error
-
Failed to copy: failed to open source object: NoSuchKey
-
-
This indicates that either another process is modifying the file/folder or the file is corrupted. Manually back up the remaining files by following these steps:
-
podname=$( kubectl get pods --no-headers -o custom-columns=":metadata.name" -n uipath-infra | grep backup-hook-service)
### Edit config map and exclude folder
kubectl -n uipath-infra edit cm backup-hook-script-cm
### search for line
rclone sync \"ceph:${bucket}\"
### Add argument
rclone sync "ceph:${bucket}" ... --exclude //**
### Restart backup-hook pods
kubectl -n uipath-infra delete pod $podname --force
-
-
Note: This issue is related to rclone. The root cause and a permanent fix are still under investigation, but the steps above can unblock users in the meantime.
Backup Failure in 23.10.0 Cluster Post Upgrade (from versions <=22.10 with Apps Enabled)
Issue:
- After upgrading from versions <= 22.10.x to 23.10.x with apps enabled, backups fail if MongoDB is not deleted as a post-upgrade step.
Symptoms:
- When checking the Velero backup logs, errors similar to the following may be encountered:
cd /opt/UiPathAutomationSuite/installer
bin/velero backup logs asbackup-manual-1701858330 | grep "error"

time="2023-12-06T10:25:34Z" level=debug msg="received EOF, stopping recv loop" backup=velero/asbackup-manual-1701858330 cmd=/plugins/local-volume-provider err="rpc error: code = Unimplemented desc = unknown service plugin.GRPCStdio" logSource="pkg/plugin/clientmgmt/process/logrus_adapter.go:75" pluginName=stdio
time="2023-12-06T10:26:13Z" level=info msg="1 errors encountered backup up item" backup=velero/asbackup-manual-1701858330 logSource="pkg/backup/backup.go:421" name=mongodb-replica-set-0
time="2023-12-06T10:26:13Z" level=error msg="Error backing up item" backup=velero/asbackup-manual-1701858330 error="error executing custom action (groupResource=persistentvolumeclaims, namespace=mongodb, name=data-volume-mongodb-replica-set-0): rpc error: code = Unknown desc = failed to get volumesnapshotclass for storageclass longhorn-backup-single-replica: failed to get volumesnapshotclass for provisioner driver.longhorn.io, ensure that the desired volumesnapshot class has the velero.io/csi-volumesnapshot-class label" logSource="pkg/backup/backup.go:425" name=mongodb-replica-set-0
time="2023-12-06T10:28:13Z" level=info msg="1 errors encountered backup up item" backup=velero/asbackup-manual-1701858330 logSource="pkg/backup/backup.go:421" name=logs-volume-mongodb-replica-set-0
time="2023-12-06T10:28:13Z" level=error msg="Error backing up item" backup=velero/asbackup-manual-1701858330 error="error executing custom action (groupResource=persistentvolumeclaims, namespace=mongodb, name=logs-volume-mongodb-replica-set-0): rpc error: code = Unknown desc = failed to get volumesnapshotclass for storageclass longhorn-backup-single-replica: failed to get volumesnapshotclass for provisioner driver.longhorn.io, ensure that the desired volumesnapshot class has the velero.io/csi-volumesnapshot-class label" error.file="/go/src/github.com/vmware-tanzu/velero/pkg/backup/item_backupper.go:325" error.function="github.com/vmware-tanzu/velero/pkg/backup.(*itemBackupper).executeActions" logSource="pkg/backup/backup.go:425" name=logs-volume-mongodb-replica-set-0
Resolution:
- Uninstall MongoDB by following the post-upgrade steps in the documentation, then take a new backup; it should complete successfully.
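Note that the error text in the logs above also names what Velero expects: a VolumeSnapshotClass for the driver.longhorn.io provisioner carrying the velero.io/csi-volumesnapshot-class label. Uninstalling MongoDB remains the supported resolution, but as an illustrative, untested aside, the label could be applied as below (the class name "longhorn" is an assumption; list the classes in your cluster first):

```
# Hypothetical aside: label the Longhorn VolumeSnapshotClass so Velero can
# find it. The class name "longhorn" is an assumption; check yours first:
kubectl get volumesnapshotclass
kubectl label volumesnapshotclass longhorn velero.io/csi-volumesnapshot-class=true
```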