Alert configurations (email and webhook receivers) get lost on reboot of machine or re-sync of alertmanager-config secret.
Issue
Alert configurations (email and webhook receivers) get lost on reboot of machine or re-sync of alertmanager-config secret.
Affected Versions
AS 22.10.0 to 22.10.11, and AS 23.4.0 to 23.4.2
Root cause
When any machine reboots, if an Argo CD sync is triggered or the alertmanager-config secret is re-synced, any custom configurations (made via the Rancher server) are overwritten by the default configurations.
Solution
For AS 23.4.0 to 23.4.2
Configure Argo CD so that it does not overwrite custom configurations on reboot. To do so, select the "Respect Ignore Differences" option for the fabric-installer application in Argo CD.
For 22.10.x
Automation Suite versions 22.10.0 to 22.10.11 do not support Argo CD version 2.3.x or later, which is where the RespectIgnoreDifferences option is available.
To mitigate the issue on the current 22.10 versions, set up a cronjob that regularly checks for changes to the Alertmanager configuration. Any new configuration is backed up to a temporary secret. If for any reason the custom configuration is removed, the cronjob restores it from the temporary secret.
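The detection logic in the job below boils down to hashing the secret's alertmanager.yaml and comparing it against the pinned SHA-256 of the default configuration. A minimal sketch of that comparison (the sample config content is a made-up illustration, not the actual default):

```shell
# Pinned SHA-256 of the default Alertmanager configuration (from the cronjob below).
DEFAULT_SHA="c94aad3ffbfe5f19165ac4800267ec0ae599fc430c0395439429ddf7368dcd7f"

# Write a sample custom config (hypothetical content, for illustration only).
cat > /tmp/alertmanager.yaml <<'EOF'
route:
  receiver: custom-email
receivers:
  - name: custom-email
EOF

# Hash it the same way the cronjob does.
config_sha=$(sha256sum /tmp/alertmanager.yaml | awk '{print $1}')

# A hash that differs from the default means a custom config is present
# and should be backed up to the temporary secret.
if [ "$config_sha" != "$DEFAULT_SHA" ]; then
  echo "custom config detected"
fi
```

The cronjob applies the same comparison in reverse to decide when the live secret has been reset and the backup must be restored.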
Follow these steps to run the cronjob:
- Copy the YAML below to any location on a machine that is up and running.
```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: backup-restore-alert-config
  namespace: cattle-monitoring-system
spec:
  schedule: "0 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: backup-restore-alert-config
              image: uipath/sf-k8-utils-rhel:image_version
              env:
                - name: ALERTMANAGER_CONFIG
                  value: "c94aad3ffbfe5f19165ac4800267ec0ae599fc430c0395439429ddf7368dcd7f"
              securityContext:
                privileged: false
                allowPrivilegeEscalation: false
                readOnlyRootFilesystem: false
                runAsUser: 9999
                runAsGroup: 9999
                runAsNonRoot: true
                capabilities:
                  drop: ["NET_RAW"]
              command: ["/bin/bash", "-ec"]
              args:
                - |
                  config=$(kubectl get secret alertmanager-config -n cattle-monitoring-system -o=jsonpath='{.data.alertmanager\.yaml}')
                  echo "$config" | base64 -d > /tmp/temp_config.yaml
                  echo "ALERTMANAGER_CONFIG: $ALERTMANAGER_CONFIG"
                  config_sha=$(sha256sum /tmp/temp_config.yaml | awk '{print $1}')
                  echo "config_sha: $config_sha"
                  rm -f /tmp/temp_config.yaml
                  if [ "$ALERTMANAGER_CONFIG" != "$config_sha" ]; then
                    if kubectl get secret alertmanager-config-tmp -n cattle-monitoring-system; then
                      echo "alertmanager-config-tmp already exists, taking backup from alertmanager-config secret..."
                      echo "$config" | base64 -d
                      kubectl patch secret alertmanager-config-tmp -n cattle-monitoring-system -p '{"data":{"alertmanager.yaml":"'"${config}"'"}}'
                    else
                      echo "Creating new alertmanager-config-tmp secret..."
                      kubectl get secret alertmanager-config -n cattle-monitoring-system -o yaml | sed -e 's/name: .*/name: alertmanager-config-tmp/' -e '/labels:/,+1d' | kubectl apply -f -
                    fi
                  elif [ "$ALERTMANAGER_CONFIG" == "$config_sha" ] && kubectl get secret alertmanager-config-tmp -n cattle-monitoring-system; then
                    backup_config=$(kubectl get secret alertmanager-config-tmp -n cattle-monitoring-system -o=jsonpath='{.data.alertmanager\.yaml}')
                    echo "$backup_config" | base64 -d > /tmp/temp_config.yaml
                    cat /tmp/temp_config.yaml
                    backup_config_sha=$(sha256sum /tmp/temp_config.yaml | awk '{print $1}')
                    rm -f /tmp/temp_config.yaml
                    echo "backup_config_sha: $backup_config_sha"
                    if [ "$ALERTMANAGER_CONFIG" != "$backup_config_sha" ]; then
                      echo "Configuration was reset in alertmanager-config secret. Restoring the backup config..."
                      kubectl patch secret alertmanager-config -n cattle-monitoring-system -p '{"data":{"alertmanager.yaml":"'"${backup_config}"'"}}'
                    fi
                  fi
          restartPolicy: Never
          serviceAccountName: rancher-monitoring-operator
```
- Replace image_version with the image version relevant to your environment. To identify it, open the /opt/UiPathAutomationSuite/Uipath_Installer/versions/docker-images.json file and search for the version of the uipath/sf-k8-utils-rhel image.
- Apply the YAML file using kubectl apply -f file_name.yaml.
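To locate the image version mentioned in the steps above, a grep over docker-images.json is usually enough. The excerpt below is a hypothetical stand-in for the real file, whose exact structure may differ in your environment:

```shell
# Hypothetical excerpt standing in for
# /opt/UiPathAutomationSuite/Uipath_Installer/versions/docker-images.json
cat > /tmp/docker-images.json <<'EOF'
{"images": ["uipath/sf-k8-utils-rhel:23.4.1", "uipath/other-image:1.2.3"]}
EOF

# Extract the full tag of the sf-k8-utils-rhel image.
grep -o 'sf-k8-utils-rhel:[^"]*' /tmp/docker-images.json
```

On the sample file above this prints sf-k8-utils-rhel:23.4.1; against the real file, point the grep at the actual path instead.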
Changing schedule
The default schedule runs the job once per hour. To change it, update the cron expression in schedule: "0 * * * *" and re-apply the YAML file.
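The schedule field uses standard cron syntax; a few common alternatives, for illustration:

```yaml
schedule: "0 * * * *"      # default: at minute 0 of every hour
schedule: "*/15 * * * *"   # every 15 minutes
schedule: "0 */6 * * *"    # every 6 hours
```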
Removing alert configurations
- To revert to the default alert configuration (i.e., remove all email or webhook receivers), delete the cronjob using the command below, and then re-sync the alertmanager-config secret from the Argo CD UI.
kubectl -n cattle-monitoring-system delete cronjob backup-restore-alert-config
Logging
- When a restore happens, corresponding messages are written to the logs of the pod created by this cronjob. The pod name starts with the backup-restore-alert-config prefix in the cattle-monitoring-system namespace.
Limitations
- Custom configuration can still be lost if a machine reboot or a re-sync of the alertmanager-config secret happens between an alert configuration update and the next job run. To minimise this possibility, adjust the cronjob's schedule based on how often the alert configuration is changed.
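One way to shrink that window is to trigger the backup job manually right after changing the alert configuration, instead of waiting for the next scheduled run. This relies on kubectl's standard support for creating a Job from a CronJob; the job name manual-alert-backup is arbitrary:

```shell
kubectl -n cattle-monitoring-system create job manual-alert-backup \
  --from=cronjob/backup-restore-alert-config
```

Delete the one-off job once it completes if you want to keep the namespace tidy.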