Persisting Alert Manager configurations during machine reboot

Issue

Alert configurations (email and webhook receivers) are lost when a machine reboots or when the alertmanager-config secret is re-synced.

Affected Versions

Automation Suite (AS) 22.10.0 to 22.10.11 and 23.4.0 to 23.4.2

Root cause

After any machine reboots, if an argocd sync is triggered or the alertmanager-config secret is re-synced, any custom configurations (made through the rancher server) are overwritten by the default configurations.

Solution

For AS 23.4.0 to 23.4.2

Configure argocd not to overwrite custom configurations on reboot. To do that, select the “Respect Ignore Differences” sync option for the fabric-installer application in argocd.
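To confirm that the option took effect, the sync options of the application can be read back from the cluster. The following is a minimal sketch, not an official procedure: it assumes that the argocd Application resources live in a namespace named argocd and that the option appears as RespectIgnoreDifferences=true in spec.syncPolicy.syncOptions. Both helper names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if the given sync-options string
# contains RespectIgnoreDifferences=true.
has_respect_ignore_differences() {
  case "$1" in
    *RespectIgnoreDifferences=true*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: reads the sync options of fabric-installer.
# ASSUMPTION: the Application resource is in the "argocd" namespace.
check_fabric_installer() {
  local opts
  opts=$(kubectl -n argocd get applications.argoproj.io fabric-installer \
    -o jsonpath='{.spec.syncPolicy.syncOptions}')
  if has_respect_ignore_differences "$opts"; then
    echo "enabled"
  else
    echo "not enabled"
  fi
}
```

Running check_fabric_installer on a machine with kubectl access prints whether the option is present.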

For AS 22.10.x

Automation Suite versions 22.10.0 to 22.10.11 do not support argocd version 2.3.x and above, which provides the RespectIgnoreDifferences option.

To mitigate the issue in the current 22.10 versions, a cronjob can be set up. The cronjob regularly checks the alert manager configuration for changes. Any new custom configuration is backed up to a temporary secret (alertmanager-config-tmp). If the custom configuration is removed for any reason, the cronjob restores it from the temporary secret.
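The backup-or-restore decision that the cronjob makes can be summarised as a small function over three hashes. This is only an illustrative sketch of the logic; decide_action is a hypothetical name and is not part of the cronjob itself.

```shell
#!/usr/bin/env bash
# Hypothetical helper mirroring the cronjob's decision logic.
# $1: hash of the default configuration
# $2: hash of the live alertmanager-config content
# $3: hash of the backup (alertmanager-config-tmp) content,
#     or an empty string if no backup secret exists
decide_action() {
  local default_sha="$1" live_sha="$2" backup_sha="$3"
  if [ "$live_sha" != "$default_sha" ]; then
    # Live config is custom: copy it to the backup secret.
    echo "backup"
  elif [ -n "$backup_sha" ] && [ "$backup_sha" != "$default_sha" ]; then
    # Live config is back to default but a custom backup exists.
    echo "restore"
  else
    # Default config and nothing custom to restore.
    echo "noop"
  fi
}
```

For example, decide_action with a live hash equal to the default hash and a different backup hash yields restore, which corresponds to the case where a reboot or re-sync reset the secret.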

Follow these steps to run the cronjob:

  1. Copy the YAML below to a file on a machine that is up and running. The file can be saved in any location.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: backup-restore-alert-config
  namespace: cattle-monitoring-system
spec:
  schedule: "0 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: backup-restore-alert-config
            image: uipath/sf-k8-utils-rhel:image_version
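            env:
            # Reference hash that the script compares against. By the logic
            # below, it represents the default (non-customized) alertmanager.yaml.
            - name: ALERTMANAGER_CONFIG
              value: "c94aad3ffbfe5f19165ac4800267ec0ae599fc430c0395439429ddf7368dcd7f"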
            securityContext:
              privileged: false
              allowPrivilegeEscalation: false
              readOnlyRootFilesystem: false
              runAsUser: 9999
              runAsGroup: 9999
              runAsNonRoot: true
              capabilities:
                drop: ["NET_RAW"]
            command: ["/bin/bash", "-ec"]
            args:
            - |
              config=$(kubectl get secret alertmanager-config -n cattle-monitoring-system -o=jsonpath='{.data.alertmanager\.yaml}')
              echo "$config" | base64 -d > /tmp/temp_config.yaml
              echo "ALERTMANAGER_CONFIG: $ALERTMANAGER_CONFIG"
              config_sha=$(sha256sum /tmp/temp_config.yaml | awk '{print $1}')
              echo "config_sha: $config_sha"
              rm -f /tmp/temp_config.yaml
              if [ "$ALERTMANAGER_CONFIG" != "$config_sha" ]; then
                if kubectl get secret alertmanager-config-tmp -n cattle-monitoring-system; then
                  echo "alertmanager-config-tmp already exists, taking backup from alertmanager-config secret..."
                  echo "$config" | base64 -d
                  kubectl patch secret alertmanager-config-tmp -n cattle-monitoring-system -p '{"data":{"alertmanager.yaml":"'"${config}"'"}}'
                else
                  echo "Creating new alertmanager-config-tmp secret..."
                  kubectl get secret alertmanager-config -n cattle-monitoring-system -o yaml | sed -e 's/name: .*/name: alertmanager-config-tmp/' -e '/labels:/,+1d' | kubectl apply -f -
                fi
              elif [ "$ALERTMANAGER_CONFIG" == "$config_sha" ] && kubectl get secret alertmanager-config-tmp -n cattle-monitoring-system; then
                backup_config=$(kubectl get secret alertmanager-config-tmp -n cattle-monitoring-system -o=jsonpath='{.data.alertmanager\.yaml}')
                echo "$backup_config" | base64 -d > /tmp/temp_config.yaml
                cat /tmp/temp_config.yaml
                backup_config_sha=$(sha256sum /tmp/temp_config.yaml | awk '{print $1}')
                rm -f /tmp/temp_config.yaml
                echo "backup_config_sha: $backup_config_sha"
                if [ "$ALERTMANAGER_CONFIG" != "$backup_config_sha" ]; then
                  echo "Configuration was reset in alertmanager-config secret. Restoring the backup config..."
                  kubectl patch secret alertmanager-config -n cattle-monitoring-system -p '{"data":{"alertmanager.yaml":"'"${backup_config}"'"}}'
                fi
              fi
          restartPolicy: Never
          serviceAccountName: rancher-monitoring-operator

  2. Replace image_version with the image version used in your environment. To find it, open the /opt/UiPathAutomationSuite/Uipath_Installer/versions/docker-images.json file and search for the version of the uipath/sf-k8-utils-rhel image.
  3. Apply the YAML file using kubectl apply -f file_name.yaml.
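The image lookup and substitution can be scripted. The sketch below is an assumption-heavy illustration: it assumes that docker-images.json contains the image reference as a literal string of the form uipath/sf-k8-utils-rhel:<tag>, which may not match the actual file layout in your environment. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the first "uipath/sf-k8-utils-rhel:<tag>"
# string found in the given file.
# ASSUMPTION: the image reference appears verbatim in that form.
find_image_ref() {
  grep -o 'uipath/sf-k8-utils-rhel:[^"[:space:],]*' "$1" | head -n 1
}

# Hypothetical helper: print the manifest ($1) with the image_version
# placeholder replaced by the full image reference ($2).
fill_image_version() {
  sed "s|uipath/sf-k8-utils-rhel:image_version|$2|" "$1"
}
```

A possible usage is to call find_image_ref on the docker-images.json file, pass the result to fill_image_version together with the saved manifest, redirect the output to a new file, and apply that file with kubectl apply -f. If find_image_ref prints nothing, the assumption about the file format does not hold and the version has to be looked up manually.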

Changing schedule

By default the job runs every hour, at minute 0. To change the interval, modify the cron expression in schedule: "0 * * * *" and re-apply the YAML file.
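As an alternative to editing and re-applying the file, the schedule of an already deployed cronjob can be patched in place. The following is a sketch; the two function names are hypothetical, and the 15-minute interval mentioned afterwards is only an example value.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the JSON patch for a new cron expression.
schedule_patch() {
  printf '{"spec":{"schedule":"%s"}}' "$1"
}

# Hypothetical helper: apply the new schedule to the deployed cronjob.
set_backup_schedule() {
  kubectl -n cattle-monitoring-system patch cronjob backup-restore-alert-config \
    -p "$(schedule_patch "$1")"
}
```

For instance, set_backup_schedule '*/15 * * * *' would make the job run every 15 minutes. Quote the cron expression so that the shell does not expand the asterisks.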

Removing alert configurations

  • To revert to the default alert configuration (i.e. remove all email and webhook receivers), delete the cronjob using the command below and then re-sync the alertmanager-config secret from the argocd UI.
kubectl -n cattle-monitoring-system delete cronjob backup-restore-alert-config

Logging

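  • On restore, the message "Configuration was reset in alertmanager-config secret. Restoring the backup config..." is written to the logs of the pod that this cronjob rolls out. The pod name starts with the backup-restore-alert-config prefix, and the pod runs in the cattle-monitoring-system namespace.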
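One way to look for restore events is to filter the logs of the cronjob pods for the restore message that the script prints. This is a sketch under the assumption that the pods of recent runs still exist in the cluster; both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: keep only the lines that report a restore.
# The matched text comes from the echo statement in the cronjob script.
restore_events() {
  grep -F "Restoring the backup config"
}

# Hypothetical helper: scan the logs of all pods created by the cronjob.
show_restore_logs() {
  local pod
  for pod in $(kubectl -n cattle-monitoring-system get pods -o name \
      | grep backup-restore-alert-config); do
    kubectl -n cattle-monitoring-system logs "$pod" | restore_events
  done
}
```

If show_restore_logs prints nothing, no restore was performed by the runs whose pods are still available.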

Limitations

  • Custom configuration can still be lost if a machine reboot or a re-sync of the alertmanager-config secret happens between an alert configuration update and the next job run. To minimise this possibility, adjust the cronjob’s schedule based on how often the alert configuration is changed.
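Another way to narrow this window, offered here as a suggestion rather than a documented step, is to trigger a one-off run of the job right after changing the alert configuration. The sketch below uses kubectl create job --from; the function names and the job naming scheme are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build a unique name for a manually triggered job.
manual_job_name() {
  echo "backup-restore-alert-config-manual-$(date +%s)"
}

# Hypothetical helper: start a one-off job from the cronjob template so
# that a fresh custom configuration is backed up immediately.
trigger_backup_now() {
  kubectl -n cattle-monitoring-system create job \
    --from=cronjob/backup-restore-alert-config "$(manual_job_name)"
}
```

Calling trigger_backup_now after saving new receivers runs the same script as the scheduled job, so the backup secret is updated without waiting for the next scheduled run.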