Prometheus Causing Disk and Memory Pressure in the Automation Suite Cluster, Leading to Node Crashes

Why does Prometheus, when enabled, cause high resource consumption and lead to cluster failure?

Issue Description: Prometheus monitoring pods are over-utilizing resources, causing the cluster to go into a failed state.

Diagnosing the Issue:
Review system uptime and logs.
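
For example, the following commands can be used to check node status and review recent Prometheus pod logs (the container name prometheus is an assumption based on the standard operator-managed pod layout):

  • uptime
  • kubectl get nodes
  • kubectl logs -n cattle-monitoring-system prometheus-rancher-monitoring-prometheus-0 -c prometheus --tail=200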

Check whether the /prometheus/wal directory is large (> 5 GB):

  • kubectl exec -it -n cattle-monitoring-system prometheus-rancher-monitoring-prometheus-0 -- du -sh /prometheus/wal
  • kubectl exec -it -n cattle-monitoring-system prometheus-rancher-monitoring-prometheus-1 -- du -sh /prometheus/wal
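
It can also help to confirm whether any node is reporting disk or memory pressure, and how much the monitoring pods are consuming (kubectl top assumes the metrics API is available):

  • kubectl describe nodes | grep -iE "MemoryPressure|DiskPressure"
  • kubectl top pods -n cattle-monitoring-system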



Root Cause: Write-ahead log (WAL) files accumulated in the /prometheus/wal directory cause disk and memory pressure when they are replayed at startup, leaving the cluster in an unusable state.
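
To see how much data has accumulated in the WAL, the segment files can also be listed directly, for example:

  • kubectl exec -it -n cattle-monitoring-system prometheus-rancher-monitoring-prometheus-0 -- ls -lh /prometheus/wal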



Resolution:

  1. If the directory is larger than 5 GB, delete the files from the /prometheus/wal directory:
  • kubectl exec -it -n cattle-monitoring-system prometheus-rancher-monitoring-prometheus-0 -- rm -rf /prometheus/wal/*
  • kubectl exec -it -n cattle-monitoring-system prometheus-rancher-monitoring-prometheus-1 -- rm -rf /prometheus/wal/*
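
  After deleting the files, re-run the du -sh check from the diagnosis section to confirm the directory is back near zero, for example:
  • kubectl exec -it -n cattle-monitoring-system prometheus-rancher-monitoring-prometheus-0 -- du -sh /prometheus/wal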

  2. If the above does not work, delete the PVCs so they can be rebuilt, using the remaining steps.
  3. Scale down the Prometheus StatefulSet:
  • kubectl scale sts -n cattle-monitoring-system prometheus-rancher-monitoring-prometheus --replicas=0
  4. Scale down the monitoring operator:
  • kubectl scale deploy -n cattle-monitoring-system rancher-monitoring-operator --replicas=0

  5. Run the command below until the pods are gone (i.e. it returns nothing):
  • kubectl get pods -n cattle-monitoring-system | grep -E "operator|prometheus-rancher"
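
  Alternatively, a small shell loop can poll until the pods are gone (a minimal sketch, assuming a bash shell with kubectl configured):
  • while kubectl get pods -n cattle-monitoring-system | grep -E "operator|prometheus-rancher"; do sleep 10; done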

  6. Delete the PVCs below:
  • kubectl delete pvc prometheus-rancher-monitoring-prometheus-db-prometheus-rancher-monitoring-prometheus-0 -n cattle-monitoring-system
  • kubectl delete pvc prometheus-rancher-monitoring-prometheus-db-prometheus-rancher-monitoring-prometheus-1 -n cattle-monitoring-system
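
  Optionally, confirm the PVCs are actually gone before scaling the operator back up:
  • kubectl get pvc -n cattle-monitoring-system | grep prometheus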

  7. Scale the operator back up:
  • kubectl scale deploy -n cattle-monitoring-system rancher-monitoring-operator --replicas=1
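
Once the operator is back up, it should recreate the Prometheus StatefulSet and new PVCs. The commands below can be used to verify that the pods return to a Running state and that the new PVCs are bound (timing may vary by environment):

  • kubectl get pods -n cattle-monitoring-system -w
  • kubectl get pvc -n cattle-monitoring-system | grep prometheus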