How do I troubleshoot failing cluster backups?
Troubleshooting Partially Failed Backups in Velero
When dealing with partially failed backups in Velero, follow these steps to identify and resolve the issues:
- Inspect Velero Pod Logs
- Retrieve and analyze the Velero pod logs for potential issues causing the backup failure:
- kubectl -n velero logs deploy/velero
- Check for Backup Object Stuck
- Navigate to the installer directory and check the status of backups:
- cd /opt/UiPathAutomationSuite/installer
- bin/velero backup get
- Look for backups in the “PartiallyFailed” or “Failed” state:
NAME                               STATUS            ERRORS   WARNINGS   CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
schedule-schedule-20231122042959   InProgress        0        0          2023-11-22 04:29:59 +0000 UTC   1d        objectstore
schedule-schedule-20231122010659   PartiallyFailed   17       0          2023-11-22 01:06:59 +0000 UTC   1d        objectstore
schedule-schedule-20231121132814   Failed            0        0          2023-11-21 13:28:14 +0000 UTC   1d        objectstore
schedule-schedule-20231121124514   Completed         0        0          2023-11-21 12:45:14 +0000 UTC   1d        objectstore
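The check above boils down to scanning the STATUS column for problem states. As a minimal sketch, filtering can be scripted with awk; the sample output below is hard-coded for illustration, and on a live cluster you would pipe `bin/velero backup get` into the same awk filter instead:

```shell
#!/bin/sh
# Sketch: filter `velero backup get` output down to problem backups.
# The heredoc stands in for live output; on a cluster, replace
# `velero_output` with `bin/velero backup get`.
velero_output() {
cat <<'EOF'
NAME                              STATUS            ERRORS  WARNINGS  CREATED                        EXPIRES  STORAGE LOCATION
schedule-schedule-20231122010659  PartiallyFailed   17      0         2023-11-22 01:06:59 +0000 UTC  1d       objectstore
schedule-schedule-20231121132814  Failed            0       0         2023-11-21 13:28:14 +0000 UTC  1d       objectstore
schedule-schedule-20231121124514  Completed         0       0         2023-11-21 12:45:14 +0000 UTC  1d       objectstore
EOF
}

# Print NAME and STATUS for rows whose STATUS column is a failed state.
failed_backups() {
  velero_output | awk '$2 == "Failed" || $2 == "PartiallyFailed" {print $1, $2}'
}

failed_backups
```

Each backup name this prints is a candidate for `bin/velero backup describe <name>` in the next step.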
- Describe PartiallyFailed or Failed Backup
- Describe the backup to get more details:
- bin/velero backup describe schedule-schedule-20231121145558
In certain scenarios, the "Total items to be backed up" and "Items backed up" counts may match even though the backup failed. This is because these counts do not include Longhorn (Automation Suite version <= 23.4.x) and object store backups.
A backup can partially fail for several reasons:
- Longhorn NFS Server Access Issue
- Check the Longhorn manager logs for errors related to NFS server access:
- kubectl -n longhorn-system logs -l longhorn.io/managed-by=longhorn-manager | grep -i "Failed to create backup"
- Resolve the NFS access issue based on the error details reported.
- Unattached Volumes Issue
- Review the Velero pod logs for unattached volumes by checking for messages related to CSI driver reconciliation:
time="2023-11-23T05:30:56Z" level=info msg="Waiting for CSI driver to reconcile volumesnapshot uipath/velero-pushgateway-prometheus-pushgateway-sl2v2. Retrying in 5s" controller=backup logSource="pkg/controller/backup_controller.go:958"
time="2023-11-23T05:31:01Z" level=info msg="Waiting for CSI driver to reconcile volumesnapshot uipath/velero-pushgateway-prometheus-pushgateway-sl2v2. Retrying in 5s" controller=backup logSource="pkg/controller/backup_controller.go:958"
time="2023-11-23T05:31:06Z" level=debug msg="enqueueing resources ..." logSource="pkg/util/kube/periodical_enqueue_source.go:71" resource="*v1.BackupStorageLocationList"
time="2023-11-23T05:31:06Z" level=debug msg="skip enqueue object velero/objectstore due to the predicate." logSource="pkg/util/kube/periodical_enqueue_source.go:92" resource="*v1.BackupStorageLocationList"
time="2023-11-23T05:31:06Z" level=info msg="Waiting for CSI driver to reconcile volumesnapshot uipath/velero-pushgateway-prometheus-pushgateway-sl2v2. Retrying in 5s" controller=backup logSource="pkg/controller/backup_controller.go:958"
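To pull the stuck VolumeSnapshot name out of log lines like the above, a sed one-liner is enough. This sketch runs against a hard-coded sample line copied from the logs; in practice you would pipe the Velero pod logs through the same sed command:

```shell
#!/bin/sh
# Sketch: extract the namespace/name of the VolumeSnapshot that the CSI
# driver never reconciled. The sample line is copied from the logs above;
# live usage would pipe `kubectl -n velero logs <velero-pod>` into the sed.
log_line='time="2023-11-23T05:30:56Z" level=info msg="Waiting for CSI driver to reconcile volumesnapshot uipath/velero-pushgateway-prometheus-pushgateway-sl2v2. Retrying in 5s" controller=backup'

stuck_snapshot() {
  printf '%s\n' "$log_line" |
    sed -n 's/.*reconcile volumesnapshot \([^.]*\)\..*/\1/p'
}

stuck_snapshot
```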
- Identify the snapshot and PVC causing the issue:
- kubectl get volumesnapshot -n uipath | grep velero-pushgateway-prometheus-pushgateway-sl2v2
velero-pushgateway-prometheus-pushgateway-sl2v2   false   pushgateway-prometheus-pushgateway   1Gi   longhorn   snapcontent-3c648994-2d31-4b72-9b92-6296e44d0fc4   34m   34m
- Clean up any inactive pods associated with the problematic PVC:
- kubectl delete pod <pod-name> -n uipath
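The cleanup step can be sketched as scanning the pod list for pods that are no longer Running. The heredoc below stands in for `kubectl -n uipath get pods` output, and the pod names are invented for illustration:

```shell
#!/bin/sh
# Sketch: list inactive pods (candidates for deletion). The heredoc fakes
# the output of `kubectl -n uipath get pods`; the pod names are made up.
pods_output() {
cat <<'EOF'
NAME                        READY   STATUS      RESTARTS   AGE
pushgateway-abc12           1/1     Running     0          2d
pushgateway-old-xyz89       0/1     Error       0          5d
backup-job-qwe45            0/1     Completed   0          1d
EOF
}

# Skip the header row; print names of pods whose STATUS is neither
# Running nor Completed.
inactive_pods() {
  pods_output | awk 'NR > 1 && $3 != "Running" && $3 != "Completed" {print $1}'
}

inactive_pods
# Each reported pod can then be removed with:
#   kubectl -n uipath delete pod <pod-name>
```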
- Too Many Snapshots
- Identify and address issues caused by an excessive number of snapshots, if applicable, to prevent overload and backup failures:
- Disable the scheduled backup from the Longhorn UI.
- Delete all Failed and PartiallyFailed Velero backups.
- Run the Longhorn cleanup script (longhorn_snapshot_cleanup.sh).
- Snapshot Backup Timeout Issue
- If the backup failed due to a timeout, increase the CSI snapshot timeout (the default is 1 hour):
- ./configureUiPathAS.sh snapshot backup create --wait --csi-snapshot-timeout 3h
Check Backup Hook Pod Logs
-
If the object store backup failed with the error
-
Failed to copy: failed to open source object: NoSuchKey
-
-
This indicates that either another process is modifying the file/folder or the file is corrupted. Manually back up the remaining files by following these steps:
-
podname=$( kubectl get pods --no-headers -o custom-columns=":metadata.name" -n uipath-infra | grep backup-hook-service)
### Edit config map and exclude folder
kubectl -n uipath-infra edit cm backup-hook-script-cm
### search for line
rclone sync \"ceph:${bucket}\"
### Add argument
rclone sync "ceph:${bucket}" ... --exclude //**
### Restart backup-hook pods
kubectl -n uipath-infra delete pod $podname --force
-
-
Note: This issue is related to rclone. The root cause and a permanent fix are still under investigation, but the steps above can unblock users in the meantime.
Backup Failure in 23.10.0 Cluster Post Upgrade (from versions <=22.10 with Apps Enabled)
Issue:
- After upgrading from versions <= 22.10.x to 23.10.x with apps enabled, backups fail if MongoDB is not deleted as a post-upgrade step.
Symptoms:
- When checking the Velero backup logs, errors similar to the following may be encountered:
cd /opt/UiPathAutomationSuite/installer
bin/velero backup logs asbackup-manual-1701858330 | grep "error"

time="2023-12-06T10:25:34Z" level=debug msg="received EOF, stopping recv loop" backup=velero/asbackup-manual-1701858330 cmd=/plugins/local-volume-provider err="rpc error: code = Unimplemented desc = unknown service plugin.GRPCStdio" logSource="pkg/plugin/clientmgmt/process/logrus_adapter.go:75" pluginName=stdio
time="2023-12-06T10:26:13Z" level=info msg="1 errors encountered backup up item" backup=velero/asbackup-manual-1701858330 logSource="pkg/backup/backup.go:421" name=mongodb-replica-set-0
time="2023-12-06T10:26:13Z" level=error msg="Error backing up item" backup=velero/asbackup-manual-1701858330 error="error executing custom action (groupResource=persistentvolumeclaims, namespace=mongodb, name=data-volume-mongodb-replica-set-0): rpc error: code = Unknown desc = failed to get volumesnapshotclass for storageclass longhorn-backup-single-replica: failed to get volumesnapshotclass for provisioner driver.longhorn.io, ensure that the desired volumesnapshot class has the velero.io/csi-volumesnapshot-class label" logSource="pkg/backup/backup.go:425" name=mongodb-replica-set-0
time="2023-12-06T10:28:13Z" level=info msg="1 errors encountered backup up item" backup=velero/asbackup-manual-1701858330 logSource="pkg/backup/backup.go:421" name=logs-volume-mongodb-replica-set-0
time="2023-12-06T10:28:13Z" level=error msg="Error backing up item" backup=velero/asbackup-manual-1701858330 error="error executing custom action (groupResource=persistentvolumeclaims, namespace=mongodb, name=logs-volume-mongodb-replica-set-0): rpc error: code = Unknown desc = failed to get volumesnapshotclass for storageclass longhorn-backup-single-replica: failed to get volumesnapshotclass for provisioner driver.longhorn.io, ensure that the desired volumesnapshot class has the velero.io/csi-volumesnapshot-class label" error.file="/go/src/github.com/vmware-tanzu/velero/pkg/backup/item_backupper.go:325" error.function="github.com/vmware-tanzu/velero/pkg/backup.(*itemBackupper).executeActions" logSource="pkg/backup/backup.go:425" name=logs-volume-mongodb-replica-set-0
Resolution:
- Uninstall MongoDB by following the post-upgrade steps in the documentation, then take a new backup; it should complete successfully.
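Note that the error text in the logs above also names what Velero expects: a VolumeSnapshotClass for the driver.longhorn.io provisioner carrying the velero.io/csi-volumesnapshot-class label. Uninstalling MongoDB remains the supported resolution, but as an illustrative, untested aside, the label could be applied as below (the class name "longhorn" is an assumption; list the classes in your cluster first):

```
# Hypothetical aside: label the Longhorn VolumeSnapshotClass so Velero can
# find it. The class name "longhorn" is an assumption; check yours first:
kubectl get volumesnapshotclass
kubectl label volumesnapshotclass longhorn velero.io/csi-volumesnapshot-class=true
```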