How to debug high load and server crashes in Automation Suite
Issue: Server crashes, or multiple applications are degraded, due to high server load.
Analysis:
- Run the command 'uptime' to check the load averages. If the load is consistently above the number of CPUs, the server is under high load.
- Run the command 'iostat' and check %iowait. If it is consistently above 10, there is an issue with I/O performance (a quick check sketch follows this list).
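A minimal check sketch (iostat requires the sysstat package; the interval and count values are illustrative):
uptime          # compare the 1/5/15-minute load averages against the CPU count
nproc           # number of CPUs on the server
iostat -x 5 3   # %iowait consistently above 10 points to an I/O bottleneck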
Troubleshooting Steps:
- Follow the steps in the article How To Check Basic Server Configuration Settings For Automation Suite? to ensure that the server configuration is correct.
- Run kubectl top pods -A (below is a snippet of what the output might look like). 1000m is equal to one CPU.
NAMESPACE                   NAME                                  CPU(cores)   MEMORY(bytes)
argocd                      argocd-application-controller-0       53m          435Mi
argocd                      argocd-redis-64448b47dc-fwhf7         2m           39Mi
argocd                      argocd-repo-server-6fb6c464c4-7t9tk   7m           77Mi
argocd                      argocd-server-6d9846f589-dgtxv        2m           39Mi
cattle-fleet-local-system   fleet-agent-55d445d85f-kl9mt          4m           111Mi
- As an additional check, you can see how much CPU and memory the pods running on a particular node are using. For NODE_NAME, make sure to use a node name returned by kubectl get nodes.
NODE_NAME="NODE NAME HERE"

# Retrieve the list of pods scheduled on the node and save it to a variable
PODS=$(kubectl get pods --all-namespaces -o=jsonpath='{range .items[?(@.spec.nodeName=="'"${NODE_NAME}"'")]}{.metadata.namespace} {.metadata.name}{"\n"}{end}')

# Process each pod in the list and gather resource usage
echo "$PODS" | while read ns pod; do
  # Check if the pod still exists
  if kubectl get pod $pod --namespace=$ns &> /dev/null; then
    kubectl top pod $pod --namespace=$ns
  fi
done | awk 'NR>1 {cpu_sum += $2; mem_sum += $3} END {print cpu_sum "m", mem_sum "Mi"}'
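The final awk stage prints a single totals line for the node, for example (illustrative values): 745m 5321Mi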
- Run the command 'top' to see which services are consuming high CPU/memory, and check whether there are any zombie processes (a normal output of the top command is provided for reference; a quick zombie check sketch follows this list).
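A minimal sketch for spotting zombie processes from the command line (top reports a zombie count in its Tasks summary line, and ps can list processes in state Z):
top -b -n 1 | head -5                        # the "Tasks:" line includes the zombie count
ps -eo pid,ppid,stat,cmd | awk '$3 ~ /^Z/'   # list zombie (defunct) processes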
Scenario 1: There are zombie processes
Resolution:
- Cordon the affected node (see the command sketch after this list)
- Go to the Longhorn UI, disable scheduling for the node, and enable eviction
- Once all the replicas have been moved to another node, hard reboot the server
- Once the node is back, enable scheduling and disable eviction
- Ensure that swap is disabled and that the configuration from the first point is in place. If swap is enabled, disable it.
- Repeat this one node at a time for all the affected nodes
- For an agent node, just do a hard reboot and disable swap
- This should bring CPU and memory utilization back to normal; if not, create a support ticket.
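A minimal command sketch for the cordon and swap steps, assuming a placeholder node name server0; the Longhorn scheduling and eviction changes are made from the Longhorn UI:
kubectl cordon server0                            # stop new pods from being scheduled on the node
# ...disable scheduling and enable eviction for the node in the Longhorn UI,
#    wait for the replicas to move, then hard reboot the server...
sudo swapoff -a                                   # turn swap off immediately
sudo sed -i '/\sswap\s/ s/^/#/' /etc/fstab        # one common way to keep swap off across reboots
kubectl uncordon server0                          # allow scheduling again once the node is healthy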
Scenario 2: There are no zombie processes, and kube-apiserver, Longhorn, or etcd is consuming high CPU/memory
Observation: This is mostly seen in versions lower than 22.10.x
Troubleshooting:
- Ensure that the server configuration is correct by following the steps in the first point
- Check whether backup is enabled. If it is, follow the steps below, ensure that there are no stale backups, and validate that the backup configuration is correct.
Run the command below to count the backups:
kubectl get backups.longhorn.io -n longhorn-system | wc -l
If the count is in the hundreds, there are stale/corrupted backups; clear them all with a cleanup script (a sketch follows this list).
- If backup is disabled, check the Longhorn support bundle. If there are issues with the volumes, create a support ticket with both the support bundle and the Longhorn support bundle; the support engineer should involve the Rancher/Longhorn support team.
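A minimal cleanup sketch (not the original script, which is not reproduced here); it assumes the stale entries can simply be deleted as Backup custom resources in the longhorn-system namespace, so review the list before deleting anything:
# List the Backup custom resources and review them
kubectl get backups.longhorn.io -n longhorn-system
# Delete a stale/corrupted entry by name (repeat per backup, or loop over the names)
kubectl delete backups.longhorn.io -n longhorn-system <backup-name>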
Scenario 3: Swap is enabled, which could be consuming more IOPS and making etcd slow
How to find it? Check the rancher logs for kernel messages and see whether there is a pattern like the following (a grep sketch follows the trace):
Jun 7 16:29:37 stluipsvrp03 kernel: Call Trace:
Jun 7 16:29:37 stluipsvrp03 kernel: __schedule+0x2d1/0x870
Jun 7 16:29:37 stluipsvrp03 kernel: ? common_interrupt+0xa/0xf
Jun 7 16:29:37 stluipsvrp03 kernel: schedule+0x55/0xf0
Jun 7 16:29:37 stluipsvrp03 kernel: io_schedule+0x12/0x40
Jun 7 16:29:37 stluipsvrp03 kernel: migration_entry_wait_on_locked+0x1ea/0x290
Jun 7 16:29:37 stluipsvrp03 kernel: ? filemap_fdatawait_keep_errors+0x50/0x50
Jun 7 16:29:37 stluipsvrp03 kernel: do_swap_page+0x5b0/0x710
Jun 7 16:29:37 stluipsvrp03 kernel: ? pmd_devmap_trans_unstable+0x2e/0x40
Jun 7 16:29:37 stluipsvrp03 kernel: ? handle_pte_fault+0x5d/0x880
Jun 7 16:29:37 stluipsvrp03 kernel: __handle_mm_fault+0x453/0x6c0
Jun 7 16:29:37 stluipsvrp03 kernel: handle_mm_fault+0xca/0x2a0
Jun 7 16:29:37 stluipsvrp03 kernel: __do_page_fault+0x1f0/0x450
Jun 7 16:29:37 stluipsvrp03 kernel: ? check_preempt_wakeup+0x113/0x270
Jun 7 16:29:37 stluipsvrp03 kernel: do_page_fault+0x37/0x130
Jun 7 16:29:37 stluipsvrp03 kernel: page_fault+0x1e/0x30
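A minimal sketch for searching the kernel log for this pattern and confirming whether swap is in use (the log path is typical for RHEL-family systems and may differ on other distributions):
grep -E 'Call Trace|do_swap_page' /var/log/messages   # search the kernel log for swap-related traces
journalctl -k | grep -E 'Call Trace|do_swap_page'     # or query the kernel ring buffer via the journal
swapon --show                                         # confirm whether swap is currently enabled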
Resolution: Follow the steps in Scenario 1.
Scenario 4: RHEL or the underlying OS has an ongoing issue in the specific version
Resolution: Look for recent issues in RHEL for the given version that could cause the CPU spike; for example, a published issue can be found in the attachments (a version-check sketch is shown below).
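A minimal sketch for collecting the OS and kernel version to compare against published RHEL advisories (standard commands on RHEL-family systems):
cat /etc/os-release   # OS name and version
uname -r              # running kernel version
rpm -q kernel         # installed kernel packages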
If none of the above works, create a support ticket with the logs below: