Required Pre-checks before performing major upgrades of Automation Suite
Scenario:
This article covers useful prerequisite checks to perform on an Automation Suite cluster prior to an upgrade, under the assumption that:
a) The cluster is being upgraded directly to v23.4.X from a prior version
b) The cluster currently uses Ceph in-cluster storage and will continue to do so
Substantive Checks:
Always take an ON-DEMAND BACKUP before engaging in any significant cluster activity, such as an upgrade.
Check Backup Status (if backups are taken by Velero):
/path/to/installerdir/configureUiPathAS.sh snapshot list
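If Velero is the snapshot backend, the underlying backup objects can also be inspected directly. A sketch, assuming Velero runs in the "velero" namespace (the default) and that its Backup custom resources report their state in the standard .status.phase field:

```shell
# List any Velero backup whose phase is not "Completed" (e.g. Failed,
# PartiallyFailed, InProgress), so problem backups stand out.
kubectl -n velero get backups.velero.io -o json \
  | jq -r '.items[] | select(.status.phase != "Completed") | "\(.metadata.name): \(.status.phase)"'
```

An empty result means every backup completed successfully.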
Check Pods and Application Health:
Run the following commands to check the health of all pods and applications, and ensure that all of them are healthy.
If any critical pod, such as a longhorn, rook-ceph, or application-specific pod, is in a CrashLoopBackOff or Terminated state, address that issue before proceeding with the upgrade.
export KUBECONFIG="/etc/rancher/rke2/rke2.yaml" && export PATH="$PATH:/usr/local/bin:/var/lib/rancher/rke2/bin"
kubectl get pods -A -o wide
kubectl get application -A
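On a large cluster the full listings are long, so it can help to surface only the problem items. A sketch: the pod check relies on grep over the STATUS column (which catches CrashLoopBackOff, since such pods still report phase Running), and the application check assumes the standard Argo CD Application .status.health.status field:

```shell
# Pods whose STATUS column is neither Running nor Completed:
kubectl get pods -A --no-headers | grep -Ev 'Running|Completed' || echo "All pods healthy"
# Applications that are not reported as Healthy:
kubectl get application -A -o json \
  | jq -r '.items[] | select(.status.health.status != "Healthy") | "\(.metadata.namespace)/\(.metadata.name): \(.status.health.status)"'
```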
Check Required Space for Ceph:
The following command shows how much space Ceph is currently utilizing:
ceph_object_size=$(kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status --format json | jq -r '.pgmap.data_bytes')
echo "You need '$(numfmt --to=iec-i $ceph_object_size)' storage space"
The current Ceph raw disk should have extra buffer space to accommodate growth. Knowing the consumed space is also useful for estimating how much space to provision for a potential migration to an S3-compatible external objectstore.
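To judge how much buffer remains, the used and available raw capacity can be pulled from the same JSON output. A sketch, assuming the Ceph release in use exposes bytes_used and bytes_avail under the pgmap section of ceph status (as recent releases do):

```shell
# Compare used vs. available raw capacity in human-readable units.
ceph_json=$(kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status --format json)
used=$(jq -r '.pgmap.bytes_used' <<< "$ceph_json")
avail=$(jq -r '.pgmap.bytes_avail' <<< "$ceph_json")
echo "Used:      $(numfmt --to=iec-i "$used")"
echo "Available: $(numfmt --to=iec-i "$avail")"
```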
Check Robustness of Volumes:
Prior to initiating the upgrade, verify the robustness of all volumes. If any volume is in a non-healthy state, it must be addressed before proceeding with the upgrade.
kubectl get volume.longhorn.io -n longhorn-system
All volumes should be attached and healthy. Unexpected behavior can be investigated in the Rancher console at https://monitoring.
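A sketch for filtering the listing down to only problem volumes, assuming the Longhorn volume CR reports its condition in the .status.state and .status.robustness fields:

```shell
# List only volumes that are not both attached and healthy.
kubectl get volumes.longhorn.io -n longhorn-system -o json \
  | jq -r '.items[] | select(.status.robustness != "healthy" or .status.state != "attached") | "\(.metadata.name): state=\(.status.state) robustness=\(.status.robustness)"'
```

An empty result means all volumes are attached and healthy.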
Check Longhorn Replicas:
kubectl get replicas.longhorn.io -A -o wide
All replicas should be in the Running state.
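A sketch for flagging any replica that is not running, assuming the Longhorn replica CR reports its state in .status.currentState:

```shell
# List only replicas whose current state is not "running".
kubectl get replicas.longhorn.io -A -o json \
  | jq -r '.items[] | select(.status.currentState != "running") | "\(.metadata.namespace)/\(.metadata.name): \(.status.currentState)"'
```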
Check Volumes Attachments:
kubectl get volumeattachments -A
The ATTACHED state should show true for all volume attachments.
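The same check can be reduced to just the failures by filtering on the standard VolumeAttachment .status.attached field:

```shell
# List any VolumeAttachment that is not reported as attached.
kubectl get volumeattachments -o json \
  | jq -r '.items[] | select(.status.attached != true) | .metadata.name'
```

An empty result means every volume is attached.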
Check Ceph Health/PG Status:
kubectl -n rook-ceph exec -i deploy/rook-ceph-tools -- ceph status
The status should show HEALTH_OK, and the PG status should show active+clean.
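For a more script-friendly check, the health flag and the PG state counts can be extracted from the JSON form of the same command (a sketch, assuming the JSON layout of recent Ceph releases, where PG states appear under pgmap.pgs_by_state):

```shell
# Print the overall health flag followed by each PG state and its count.
kubectl -n rook-ceph exec -i deploy/rook-ceph-tools -- ceph status --format json \
  | jq -r '.health.status, (.pgmap.pgs_by_state[] | "\(.state_name): \(.count)")'
```

The first line should read HEALTH_OK and every PG state line should be active+clean.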
Check Ceph OSD Status:
kubectl -n rook-ceph exec -i deploy/rook-ceph-tools -- ceph osd df
All OSDs should show a status of "up".
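To flag only the OSDs that are down, the JSON form of ceph osd tree can be filtered (a sketch, assuming its osd-type nodes carry a "status" field of "up" or "down", as in recent Ceph releases):

```shell
# List any OSD whose status is not "up".
kubectl -n rook-ceph exec -i deploy/rook-ceph-tools -- ceph osd tree --format json \
  | jq -r '.nodes[] | select(.type == "osd" and .status != "up") | "\(.name): \(.status)"'
```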
Ceph Health Detail:
kubectl -n rook-ceph exec -i deploy/rook-ceph-tools -- ceph health detail
The output should report HEALTH_OK.
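In JSON form, any active health warnings appear as keys under the .checks object, so listing those keys gives a quick summary of what needs attention (a sketch; an empty list corresponds to HEALTH_OK):

```shell
# Print the names of any active health checks (e.g. OSD_DOWN, PG_DEGRADED).
kubectl -n rook-ceph exec -i deploy/rook-ceph-tools -- ceph health detail --format json \
  | jq -r '.checks | keys[]'
```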
To check connectivity between nodes:
all_ingress_data=$(kubectl -n istio-system get pod -l app=istio-ingressgateway -o json)
all_ingress_ips=$(jq -r '.items[] | .status.podIP' <<< "${all_ingress_data}")
all_ingress_pod_name=$(jq -r '.items[] | .metadata.name' <<< "${all_ingress_data}")
while read -r pod_name
do
echo "FROM: ${pod_name}"
while read -r pod_ip
do
echo "To: ${pod_ip} HTTP_CODE: $(kubectl -n "istio-system" exec "${pod_name}" -- curl -m 10 -w "%{http_code}\n" --silent --output /dev/null "${pod_ip}:15021/healthz/ready")"
done <<< "${all_ingress_ips}"
done <<< "${all_ingress_pod_name}"
Each check should return HTTP code 200, which indicates that all nodes have proper connectivity.