Pre-checks before performing the upgrade task

Required Pre-checks before performing major upgrades of Automation Suite

Scenario:

This article covers useful prerequisite checks that can be performed on an Automation Suite cluster prior to an upgrade, with the assumption that:


a) The cluster is being upgraded directly to v23.4.X from a prior version


b) The cluster currently uses Ceph in-cluster storage and will continue to do so


Substantive Checks:


Always take an ON-DEMAND BACKUP before engaging in any significant cluster activity, such as an upgrade.


Check backup status (backups are taken by Velero):


/path/to/installerdir/configureUiPathAS.sh snapshot list


Check Pods and Application health:


Run the following to check the health of all Pods and Applications; all should be in a healthy state.

If any critical pod, such as a longhorn, rook-ceph, or application-specific pod, is in a CrashLoopBackOff or Terminated state, address that issue before proceeding with the upgrade.


export KUBECONFIG="/etc/rancher/rke2/rke2.yaml" && export PATH="$PATH:/usr/local/bin:/var/lib/rancher/rke2/bin"

kubectl get pods -A -o wide

kubectl get application -A
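
A quick way to surface only the problem pods is to filter the `kubectl get pods -A` listing for anything not Running or Completed. The sketch below runs the filter against sample output for illustration; in practice you would pipe the live `kubectl get pods -A` output into it:

```shell
# Print namespace, name, and status of any pod that is not Running/Completed.
# The here-doc is sample output standing in for `kubectl get pods -A`.
filter_unhealthy() {
  # Columns: NAMESPACE NAME READY STATUS RESTARTS AGE (header skipped)
  awk 'NR > 1 && $4 != "Running" && $4 != "Completed" { print $1, $2, $4 }'
}

filter_unhealthy <<'EOF'
NAMESPACE         NAME                     READY   STATUS             RESTARTS   AGE
longhorn-system   longhorn-manager-abc12   1/1     Running            0          12d
rook-ceph         rook-ceph-osd-0-xyz      1/1     Running            0          12d
uipath            orchestrator-pod-def34   0/1     CrashLoopBackOff   7          3h
EOF
# → uipath orchestrator-pod-def34 CrashLoopBackOff
```

An empty result from the filter means no pod is in an unexpected state.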


Check Required Space for Ceph:


The following commands show how much space Ceph is currently using:

ceph_object_size=$(kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status --format json | jq -r '.pgmap.data_bytes')

echo "You need '$(numfmt --to=iec-i $ceph_object_size)' storage space"


The current Ceph raw disk should have extra buffer space to accommodate growth. Knowing the consumed space is also useful for estimating how much capacity to provision for a potential migration to an S3-compatible external objectstore.
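
To turn the consumed bytes into a provisioning estimate with headroom, a simple calculation can be sketched. The 30% buffer below is an illustrative assumption, not a UiPath-mandated figure, and the sample byte count stands in for the `data_bytes` value returned by `ceph status --format json`:

```shell
# Illustrative sizing helper: add a buffer (30% here, an assumption)
# on top of Ceph's reported data_bytes to estimate required capacity.
ceph_object_size=107374182400   # sample value; in practice read from ceph status JSON
buffered=$(( ceph_object_size + ceph_object_size * 30 / 100 ))
echo "Consumed: $(numfmt --to=iec-i "$ceph_object_size")"
echo "Provision at least: $(numfmt --to=iec-i "$buffered")"
# → Consumed: 100Gi
# → Provision at least: 130Gi
```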


Check Volume Robustness:


Prior to initiating the upgrade, verify the robustness of all volumes. If any volume is in a non-healthy state, it must be addressed before proceeding with the upgrade.

kubectl get volume.longhorn.io -n longhorn-system



All volumes should be attached and healthy. Unexpected behavior can be investigated in the Rancher console at https://monitoring.
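
For a scripted pass over the same check, the Longhorn volume resources expose a `status.robustness` field (healthy/degraded) that can be filtered with jq. The sketch below runs against a trimmed sample object standing in for `kubectl get volume.longhorn.io -n longhorn-system -o json`:

```shell
# List any Longhorn volume whose robustness is not "healthy".
# The sample JSON stands in for live kubectl output.
sample='{"items":[{"metadata":{"name":"pvc-aaa"},"status":{"robustness":"healthy"}},{"metadata":{"name":"pvc-bbb"},"status":{"robustness":"degraded"}}]}'
echo "$sample" | jq -r '.items[] | select(.status.robustness != "healthy") | "\(.metadata.name): \(.status.robustness)"'
# → pvc-bbb: degraded
```

An empty result means every volume reports healthy robustness.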


Check Longhorn Replicas:


kubectl get replicas.longhorn.io -A -o wide



All replicas should be in the Running state.


Check Volumes Attachments:


kubectl get volumeattachments -A


All volume attachments should show ATTACHED as true.


Check Ceph Health/PG Status:


kubectl -n rook-ceph exec -i deploy/rook-ceph-tools -- ceph status



The status should show HEALTH_OK, and all PGs should be active+clean.
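

For automation, the JSON form of `ceph status` can be parsed with jq instead of eyeballing the text output. The sketch below uses a trimmed sample document standing in for the real `ceph status --format json` output:

```shell
# Extract overall health and PG state counts from ceph status JSON.
# The sample document stands in for live `ceph status --format json` output.
sample_status='{"health":{"status":"HEALTH_OK"},"pgmap":{"pgs_by_state":[{"state_name":"active+clean","count":233}]}}'
echo "$sample_status" | jq -r '.health.status'
echo "$sample_status" | jq -r '.pgmap.pgs_by_state[] | "\(.state_name): \(.count)"'
# → HEALTH_OK
# → active+clean: 233
```

Any state other than active+clean in `pgs_by_state` warrants investigation before the upgrade.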



Check Ceph OSD Status:


kubectl -n rook-ceph exec -i deploy/rook-ceph-tools -- ceph osd df


All OSDs should show a status of "up".


Ceph Health Detail:


kubectl -n rook-ceph exec -i deploy/rook-ceph-tools -- ceph health detail


Expect the output to report HEALTH_OK.


Check Connectivity Between Nodes:


all_ingress_data=$(kubectl -n istio-system get pod -l app=istio-ingressgateway -o json)

all_ingress_ips=$(jq -r '.items[] | .status.podIP' <<< "${all_ingress_data}")

all_ingress_pod_name=$(jq -r '.items[] | .metadata.name' <<< "${all_ingress_data}")

while IFS= read -r pod_name
do
  echo "FROM: ${pod_name}"
  while IFS= read -r pod_ip
  do
    echo "To: ${pod_ip} HTTP_CODE: $(kubectl -n "istio-system" exec "${pod_name}" -- curl -m 10 -w "%{http_code}\n" --silent --output /dev/null "${pod_ip}:15021/healthz/ready")"
  done <<< "${all_ingress_ips}"
done <<< "${all_ingress_pod_name}"



Every check should return HTTP code 200, which indicates that all nodes have proper connectivity.