How to debug high load and server crashes in Automation Suite
Issue: Server crashes or multiple apps are degraded due to high server load.
Analysis:
- Run the command 'uptime' to check the load average. If it is consistently above 50, there is high load on the server.
- Run the command 'iostat'; if %iowait is consistently more than 10, there is an issue with the I/O performance.
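The load check above can be sketched as a small shell helper. Note that check_load is an illustrative name, not an Automation Suite tool; it parses the 1-minute load average out of an uptime-style line and compares it against the threshold of 50 mentioned above.

```shell
# Hedged sketch of the load check above. check_load is an illustrative
# helper; it extracts the 1-minute load average from an uptime-style line
# and flags it against the threshold of 50 used in this article.
check_load() {
  local one_min
  # uptime output ends with: "load average: 0.52, 0.58, 0.59"
  one_min=$(echo "$1" | awk -F'load average: ' '{print $2}' | cut -d',' -f1)
  awk -v l="$one_min" 'BEGIN { print (l + 0 > 50) ? "HIGH" : "OK" }'
}

# On a live node: check_load "$(uptime)"
```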
Troubleshooting Steps:
- Follow the steps in the article "How To Check Basic Server Configuration Settings For Automation Suite?" to ensure that the server configuration is good.
- Run the command 'top' to see which services are consuming high CPU/memory, and observe whether there are any zombie processes. Compare against a normal output of the top command for reference.
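As a companion to the 'top' check above, zombie processes can also be counted from ps output. The helper zombie_count is a hypothetical name for illustration, not an Automation Suite command.

```shell
# Hedged sketch: count zombie processes from `ps aux` output. The eighth
# column of `ps aux` is the process state; a state starting with Z marks
# a zombie (defunct) process. zombie_count is an illustrative name.
zombie_count() {
  awk 'NR > 1 && $8 ~ /^Z/ { n++ } END { print n + 0 }'
}

# On a live node: ps aux | zombie_count
```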
Scenario 1: There are zombie processes.
Resolution:
- Cordon the affected node.
- Go to Longhorn, disable scheduling for the node, and enable eviction.
- Once all the replicas have moved to another node, hard-reboot the server.
- Once it is back, enable scheduling and disable eviction.
- Ensure that swap is disabled and that the configurations from the first point are in place. If swap is enabled, disable it.
- Repeat this one node at a time for all the affected nodes.
- For an agent node, just do a hard reboot and disable swap.
- This should bring the CPU and memory utilization back to normal; if not, create a support ticket.
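The swap-disable step above can be sketched as follows. disable_swap_fstab is an illustrative helper (not an Automation Suite command) that comments out swap entries in a given fstab file so swap stays off across the hard reboot.

```shell
# Hedged sketch of the swap-disable step in Scenario 1. disable_swap_fstab
# is an illustrative helper: it comments out any uncommented fstab line
# that contains a swap entry, so swap stays disabled after the reboot.
disable_swap_fstab() {
  sed -i -E 's/^([^#].*[[:space:]]swap[[:space:]].*)$/# \1/' "$1"
}

# On the node itself you would run something like:
#   sudo swapoff -a                  # disable swap immediately
#   disable_swap_fstab /etc/fstab    # keep it off after the hard reboot
#   free -h                          # verify: the Swap line should show 0B
```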
Scenario 2: There are no zombie processes, but kube-apiserver, Longhorn, or etcd is consuming high CPU/memory.
Observation: This is mostly seen in versions lower than 22.10.x.
Troubleshooting:
- Ensure that the server configuration is good by following the steps in the first point.
- Check whether backup is enabled. If it is, follow the steps below to make sure there are no stale backups, and validate that the backup configuration is correct.
Run the command below to check the number of backups:

kubectl get backups.longhorn.io -n longhorn-system | wc -l

If the count is in the hundreds, there are stale/corrupted backups. Use the script below to clear them all:

for CRD in $(kubectl get backups.longhorn.io -n longhorn-system | grep -v NAME | cut -d " " -f 1 | xargs); do
  kubectl patch backups.longhorn.io -n longhorn-system $CRD --type merge -p '{"metadata":{"finalizers": [null]}}'
  kubectl delete backups.longhorn.io -n longhorn-system $CRD
done
- If backup is disabled, check the Longhorn support bundle. If there are issues with the volumes, create a support ticket and attach both the support bundle and the Longhorn support bundle. The support engineer should involve the Rancher/Longhorn support team.
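The backup count check above can be sketched as a small filter. stale_backup_check is an illustrative name, and the threshold of 100 reflects the "count in the hundreds" guidance above.

```shell
# Hedged sketch of the stale-backup check above. stale_backup_check is an
# illustrative helper that reads `kubectl get backups.longhorn.io` output
# on stdin and flags counts in the hundreds, per the guidance above.
stale_backup_check() {
  local count
  count=$(grep -cv '^NAME')   # drop the header row, count the rest
  if [ "$count" -ge 100 ]; then
    echo "STALE ($count backups) - run the cleanup loop"
  else
    echo "OK ($count backups)"
  fi
}

# Live usage:
#   kubectl get backups.longhorn.io -n longhorn-system | stale_backup_check
```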
Scenario 3: Swap is enabled, which could be consuming more IOPS and making etcd slow.
How to find it? Check the rancher logs for kernel messages with a pattern like:
Jun 7 16:29:37 stluipsvrp03 kernel: Call Trace:
Jun 7 16:29:37 stluipsvrp03 kernel: __schedule+0x2d1/0x870
Jun 7 16:29:37 stluipsvrp03 kernel: ? common_interrupt+0xa/0xf
Jun 7 16:29:37 stluipsvrp03 kernel: schedule+0x55/0xf0
Jun 7 16:29:37 stluipsvrp03 kernel: io_schedule+0x12/0x40
Jun 7 16:29:37 stluipsvrp03 kernel: migration_entry_wait_on_locked+0x1ea/0x290
Jun 7 16:29:37 stluipsvrp03 kernel: ? filemap_fdatawait_keep_errors+0x50/0x50
Jun 7 16:29:37 stluipsvrp03 kernel: do_swap_page+0x5b0/0x710
Jun 7 16:29:37 stluipsvrp03 kernel: ? pmd_devmap_trans_unstable+0x2e/0x40
Jun 7 16:29:37 stluipsvrp03 kernel: ? handle_pte_fault+0x5d/0x880
Jun 7 16:29:37 stluipsvrp03 kernel: __handle_mm_fault+0x453/0x6c0
Jun 7 16:29:37 stluipsvrp03 kernel: handle_mm_fault+0xca/0x2a0
Jun 7 16:29:37 stluipsvrp03 kernel: __do_page_fault+0x1f0/0x450
Jun 7 16:29:37 stluipsvrp03 kernel: ? check_preempt_wakeup+0x113/0x270
Jun 7 16:29:37 stluipsvrp03 kernel: do_page_fault+0x37/0x130
Jun 7 16:29:37 stluipsvrp03 kernel: page_fault+0x1e/0x30
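A quick way to scan for the trace above is to count do_swap_page hits in the kernel log. swap_trace_hits is an illustrative helper, and the log path may differ per distribution.

```shell
# Hedged sketch: count call-trace lines that went through the kernel swap
# path (do_swap_page), as in the trace above. swap_trace_hits is an
# illustrative name; the log location varies by distribution.
swap_trace_hits() {
  grep -c 'do_swap_page' "$1"
}

# Live usage (RHEL-style syslog path shown as an example):
#   swap_trace_hits /var/log/messages
```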
Resolution: Follow the steps in Scenario 1.
Scenario 4: RHEL or the underlying OS has an ongoing issue in the specific version.
Resolution: Look for recent issues in RHEL for the given version that could cause the CPU spike. For example, a published issue can be found in the attachments.
If none of the above works, create a support ticket with the below logs: