How to debug high load and server crashes in Automation Suite
Issue: Server crashes, or multiple applications are degraded, due to high server load.
Analysis:
- Run the command 'uptime' to check the load averages. If the load is consistently above the number of CPUs, the server is under high load.
- Run the command 'iostat' and check %iowait. If it is consistently above 10, there is an issue with I/O performance (a quick check sketch follows this list).
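A minimal check sketch (iostat requires the sysstat package; the interval and count values are illustrative):
uptime          # compare the 1/5/15-minute load averages against the CPU count
nproc           # number of CPUs on the server
iostat -x 5 3   # %iowait consistently above 10 points to an I/O bottleneck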
Troubleshooting Steps:
- Follow the steps in the article How To Check Basic Server Configuration Settings For Automation Suite? to ensure that the server configuration is correct.
- Run kubectl top pods -A (below is a snippet of what the output might look like). 1000m is equal to one CPU.
NAMESPACE                   NAME                                  CPU(cores)   MEMORY(bytes)
argocd                      argocd-application-controller-0       53m          435Mi
argocd                      argocd-redis-64448b47dc-fwhf7         2m           39Mi
argocd                      argocd-repo-server-6fb6c464c4-7t9tk   7m           77Mi
argocd                      argocd-server-6d9846f589-dgtxv        2m           39Mi
cattle-fleet-local-system   fleet-agent-55d445d85f-kl9mt          4m           111Mi
- As an additional check, you can see how much CPU and memory the pods running on a particular node are using. For NODE_NAME, make sure to use a node name returned by kubectl get nodes.
NODE_NAME="NODE NAME HERE"

# Retrieve the list of pods scheduled on the node and save it to a variable
PODS=$(kubectl get pods --all-namespaces -o=jsonpath='{range .items[?(@.spec.nodeName=="'"${NODE_NAME}"'")]}{.metadata.namespace} {.metadata.name}{"\n"}{end}')

# Process each pod in the list and gather resource usage
echo "$PODS" | while read ns pod; do
  # Check if the pod still exists
  if kubectl get pod $pod --namespace=$ns &> /dev/null; then
    kubectl top pod $pod --namespace=$ns
  fi
done | awk 'NR>1 {cpu_sum += $2; mem_sum += $3} END {print cpu_sum "m", mem_sum "Mi"}'
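The final awk stage prints a single totals line for the node, for example (illustrative values): 745m 5321Mi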
- Run the command 'top' to see which services are consuming high CPU/memory, and check whether there are any zombie processes (a normal output of the top command is provided for reference; a quick zombie check sketch follows this list).
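A minimal sketch for spotting zombie processes from the command line (top reports a zombie count in its Tasks summary line, and ps can list processes in state Z):
top -b -n 1 | head -5                        # the "Tasks:" line includes the zombie count
ps -eo pid,ppid,stat,cmd | awk '$3 ~ /^Z/'   # list zombie (defunct) processes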
Scenario 1: There are zombie processes
Resolution:
- Cordon the affected node (see the command sketch after this list)
- Go to the Longhorn UI, disable scheduling for the node, and enable eviction
- Once all the replicas have been moved to another node, hard reboot the server
- Once the node is back, enable scheduling and disable eviction
- Ensure that swap is disabled and that the configuration from the first point is in place. If swap is enabled, disable it.
- Repeat this one node at a time for all the affected nodes
- For an agent node, just do a hard reboot and disable swap
- This should bring CPU and memory utilization back to normal; if not, create a support ticket.
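A minimal command sketch for the cordon and swap steps, assuming a placeholder node name server0; the Longhorn scheduling and eviction changes are made from the Longhorn UI:
kubectl cordon server0                            # stop new pods from being scheduled on the node
# ...disable scheduling and enable eviction for the node in the Longhorn UI,
#    wait for the replicas to move, then hard reboot the server...
sudo swapoff -a                                   # turn swap off immediately
sudo sed -i '/\sswap\s/ s/^/#/' /etc/fstab        # one common way to keep swap off across reboots
kubectl uncordon server0                          # allow scheduling again once the node is healthy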
Scenario 2: There are no zombie processes, and kube-apiserver, Longhorn, or etcd is consuming high CPU/memory
Observation: This is mostly seen in versions lower than 22.10.x
Troubleshooting:
- Ensure that the server configuration is correct by following the steps in the first point
- Check whether backup is enabled. If it is, follow the steps below, ensure that there are no stale backups, and validate that the backup configuration is correct.
Run the command below to count the backups:
kubectl get backups.longhorn.io -n longhorn-system | wc -l
If the count is in the hundreds, there are stale/corrupted backups; clear them all with a cleanup script (a sketch follows this list).
- If backup is disabled, check the Longhorn support bundle. If there are issues with the volumes, create a support ticket with both the support bundle and the Longhorn support bundle; the support engineer should involve the Rancher/Longhorn support team.
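A minimal cleanup sketch (not the original script, which is not reproduced here); it assumes the stale entries can simply be deleted as Backup custom resources in the longhorn-system namespace, so review the list before deleting anything:
# List the Backup custom resources and review them
kubectl get backups.longhorn.io -n longhorn-system
# Delete a stale/corrupted entry by name (repeat per backup, or loop over the names)
kubectl delete backups.longhorn.io -n longhorn-system <backup-name>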
Scenario 3: Swap is enabled, which could be consuming more IOPS and making etcd slow
How to find it? Check the rancher logs for kernel messages and see whether there is a pattern like the following (a grep sketch follows the trace):
Jun 7 16:29:37 stluipsvrp03 kernel: Call Trace:
Jun 7 16:29:37 stluipsvrp03 kernel: __schedule+0x2d1/0x870
Jun 7 16:29:37 stluipsvrp03 kernel: ? common_interrupt+0xa/0xf
Jun 7 16:29:37 stluipsvrp03 kernel: schedule+0x55/0xf0
Jun 7 16:29:37 stluipsvrp03 kernel: io_schedule+0x12/0x40
Jun 7 16:29:37 stluipsvrp03 kernel: migration_entry_wait_on_locked+0x1ea/0x290
Jun 7 16:29:37 stluipsvrp03 kernel: ? filemap_fdatawait_keep_errors+0x50/0x50
Jun 7 16:29:37 stluipsvrp03 kernel: do_swap_page+0x5b0/0x710
Jun 7 16:29:37 stluipsvrp03 kernel: ? pmd_devmap_trans_unstable+0x2e/0x40
Jun 7 16:29:37 stluipsvrp03 kernel: ? handle_pte_fault+0x5d/0x880
Jun 7 16:29:37 stluipsvrp03 kernel: __handle_mm_fault+0x453/0x6c0
Jun 7 16:29:37 stluipsvrp03 kernel: handle_mm_fault+0xca/0x2a0
Jun 7 16:29:37 stluipsvrp03 kernel: __do_page_fault+0x1f0/0x450
Jun 7 16:29:37 stluipsvrp03 kernel: ? check_preempt_wakeup+0x113/0x270
Jun 7 16:29:37 stluipsvrp03 kernel: do_page_fault+0x37/0x130
Jun 7 16:29:37 stluipsvrp03 kernel: page_fault+0x1e/0x30
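A minimal sketch for searching the kernel log for this pattern and confirming whether swap is in use (the log path is typical for RHEL-family systems and may differ on other distributions):
grep -E 'Call Trace|do_swap_page' /var/log/messages   # search the kernel log for swap-related traces
journalctl -k | grep -E 'Call Trace|do_swap_page'     # or query the kernel ring buffer via the journal
swapon --show                                         # confirm whether swap is currently enabled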
Resolution: Follow the steps in Scenario 1.
Scenario 4: RHEL or the underlying OS has an ongoing issue in the specific version
Resolution: Look for recent issues in RHEL for the given version that could cause the CPU spike; for example, a published issue can be found in the attachments (a version-check sketch is shown below).
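A minimal sketch for collecting the OS and kernel version to compare against published RHEL advisories (standard commands on RHEL-family systems):
cat /etc/os-release   # OS name and version
uname -r              # running kernel version
rpm -q kernel         # installed kernel packages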
If none of the above works, create a support ticket with the logs below: