Redis probe failure and rebuilding Redis cluster

Redis probe failure: how do we rebuild the Redis cluster entirely?

Issue Description:

After certain node events (a non-graceful reboot or service restart), the Redis component may need to recover. The process should be automatic, but we can speed up or force the recovery with the steps outlined in this article.


Background:

The Redis component in UiPath is not deployed persistently. This means that anytime the Redis services are completely restarted, the Redis DB needs to be rebuilt. In normal operating conditions this means:

  1. For users with the HAA add-on, the Redis DB would typically be rebuilt only if all replicas for the cluster were terminated (we create three replicas, each on a different node). Three replicas being terminated typically means that the cluster was shut down, or that three nodes were drained and happened to host the three Redis replicas, etc.
  2. For users that do not have the HAA add-on, any time the Redis replica is terminated the DB will need to be rebuilt.
  3. If the Redis cluster becomes unhealthy, that too could trigger a rebuild.
In other words, whenever the Redis DB needs to be rebuilt, there should be a specific event that triggered it.
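
To check how the Redis replicas are spread across nodes before or after such an event, something like the following may help. It is a minimal sketch: the redis-system namespace and the redis-cluster-N pod naming are taken from the commands later in this article, while count_replica_nodes is a hypothetical helper name and not part of any UiPath tooling.

```shell
# Hypothetical helper: reads "<pod-name> <node-name>" lines on stdin and
# prints how many distinct nodes host pods named redis-cluster-<number>.
count_replica_nodes() {
  awk '$1 ~ /^redis-cluster-[0-9]+$/ { nodes[$2] = 1 }
       END { n = 0; for (k in nodes) n++; print n }'
}

# Possible usage (requires cluster access):
#   kubectl -n redis-system get pods --no-headers \
#     -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName | count_replica_nodes
```

With the HAA add-on, a result lower than 3 would suggest that replicas are missing or sharing a node.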

Additionally, the rebuild process is triggered by a job that runs every 5 minutes, and it should automatically recover from any issues. Running the commands below should only be required to speed up the process (or, theoretically, to get the system out of a race condition if one had developed; as of this writing we are not aware of any such conditions).
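
Before forcing a rebuild, it can be worth confirming that the recovery job is actually being scheduled. The sketch below assumes the CronJob is named redis-cluster-recovery-job in the redis-system namespace, as used by the commands in this article; the function names are hypothetical.

```shell
# Names come from the commands in this article; function names are hypothetical.
NAMESPACE="redis-system"
CRONJOB="redis-cluster-recovery-job"

recovery_schedule() {
  # Print the cron expression the recovery CronJob runs on.
  kubectl -n "$NAMESPACE" get cronjob "$CRONJOB" -o jsonpath='{.spec.schedule}'
}

recent_jobs() {
  # List jobs in the namespace, oldest first, so the latest run is at the bottom.
  kubectl -n "$NAMESPACE" get jobs --sort-by=.metadata.creationTimestamp
}
```

The schedule printed by recovery_schedule should correspond to the five-minute interval described above, and recent_jobs should show whether recent runs completed.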

Resolution:

  1. If we are not sure why the Redis cluster needs to be rebuilt, please run the support bundle tool.
    1. Tool can be found here.
    2. Run the command with the -D and -F parameters. This is so we can get historical logs.
      1. -F should be the current date.
      2. -D should be how many days back the issue started. In most cases it would be 1.
      3. e.g. support-bundle.sh -F 2024-01-01 -D 2
  2. The commands below disable argo sync (self-heal), delete the Redis database and the Redis cluster resource, re-enable argo sync, and then kick off a recovery job.
    1. kubectl -n argocd patch application fabric-installer --type=json -p '[{"op":"replace","path":"/spec/syncPolicy/automated/selfHeal","value":false}]'
      kubectl -n argocd patch application redis-cluster --type=json -p '[{"op":"replace","path":"/spec/syncPolicy/automated/selfHeal","value":false}]'
      kubectl -n argocd patch application redis-operator --type=json -p '[{"op":"replace","path":"/spec/syncPolicy/automated/selfHeal","value":false}]'
      kubectl delete redb -n redis-system redis-cluster-db --force --grace-period=0 &
      kubectl delete rec -n redis-system redis-cluster --force --grace-period=0 &
      kubectl patch redb -n redis-system redis-cluster-db --type=json -p '[{"op":"remove","path":"/metadata/finalizers","value":"finalizer.redisenterprisedatabases.app.redislabs.com"}]'
      kubectl patch rec redis-cluster -n redis-system --type=json -p '[{"op":"remove","path":"/metadata/finalizers","value":"redbfinalizer.redisenterpriseclusters.app.redislabs.com"}]'
      kubectl -n redis-system get pods | grep services-rigger | awk '{print $1}' | xargs kubectl -n redis-system delete pod --force
      kubectl -n redis-system get pods | grep -E "redis-cluster-[0-2]" | awk '{print $1}' | xargs kubectl -n redis-system delete pod --force
      kubectl -n argocd patch application fabric-installer --type=json -p '[{"op":"replace","path":"/spec/syncPolicy/automated/selfHeal","value":true}]'
      kubectl -n argocd patch application redis-cluster --type=json -p '[{"op":"replace","path":"/spec/syncPolicy/automated/selfHeal","value":true}]'
      kubectl -n argocd patch application redis-operator --type=json -p '[{"op":"replace","path":"/spec/syncPolicy/automated/selfHeal","value":true}]'
      kubectl -n redis-system create job --from=cronjob/redis-cluster-recovery-job cronjob-manual-run 
      
  3. After the commands above are executed, we can watch the Redis cluster recover (we can also do all of this from ArgoCD):
    1. kubectl -n redis-system get pods -w
  4. We want the job redis-cluster-db-job to complete successfully. Once we see it created (in the output of the above command), press Ctrl+C; we can then follow the logs for redis-cluster-db-job to see whether the Redis cluster has recovered.
    1. The job will create a pod that looks like: redis-cluster-db-job-xxxxx
    2. Once it is created we can follow its logs:
      1. kubectl -n redis-system logs -f jobs/redis-cluster-db-job
    3. The last message should look like:
      1. secret/redb-redis-cluster-db patched (no change)
        [INFO] [2024-06-10T14:24:01-0400]: Patched the shared Redis Database secret with new fields
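
The steps above repeat the same ArgoCD patch six times and end with a manual log check, so they lend themselves to a few small helpers. The following is only a sketch under stated assumptions: the resource names and the expected log message are taken from this article, while the function names and the DRY_RUN switch are hypothetical. DRY_RUN defaults to 1, so commands are printed for review rather than executed. The printed form drops shell quoting and is not meant to be copied and pasted.

```shell
# Resource names come from this article; helper names and DRY_RUN are hypothetical.
NAMESPACE="redis-system"
ARGO_APPS="fabric-installer redis-cluster redis-operator"
DB_JOB="redis-cluster-db-job"
EXPECTED="Patched the shared Redis Database secret with new fields"

run() {
  # With DRY_RUN=1 (the default) print the command for review; otherwise run it.
  if [ "${DRY_RUN:-1}" = "1" ]; then printf '%s\n' "$*"; else "$@"; fi
}

self_heal_patch() {
  # Build the JSON patch that sets selfHeal to the given value (true or false).
  printf '[{"op":"replace","path":"/spec/syncPolicy/automated/selfHeal","value":%s}]' "$1"
}

set_self_heal() {
  # Apply the same patch to all three ArgoCD applications.
  for app in $ARGO_APPS; do
    run kubectl -n argocd patch application "$app" --type=json -p "$(self_heal_patch "$1")"
  done
}

wait_for_db_job() {
  # Block until the job reports the Complete condition or the timeout expires.
  run kubectl -n "$NAMESPACE" wait --for=condition=complete "job/$DB_JOB" --timeout="${1:-10m}"
}

recovery_confirmed() {
  # Read job logs on stdin; succeed if the last non-empty line has the expected message.
  last_line=$(grep -v '^[[:space:]]*$' | tail -n 1)
  case "$last_line" in
    *"$EXPECTED"*) return 0 ;;
    *) return 1 ;;
  esac
}
```

A possible flow would be set_self_heal false, then the delete and patch commands from step 2, then set_self_heal true, and finally wait_for_db_job followed by piping the job logs into recovery_confirmed. Note that kubectl wait returns an error if the job does not exist yet, so wait_for_db_job is only useful once the job has been created.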