When Prometheus monitoring is enabled, why does it cause high resource consumption and leave the cluster in a failed state?
Issue Description: The Prometheus monitoring pods are over-utilizing resources, causing the cluster to go into a failed state.
Diagnosing the Issue:
Review system uptime and logs (example commands are shown after this list)
Check whether the Prometheus write-ahead log (WAL) directory, /prometheus/wal, is large (> 5 GB)
- kubectl exec -it -n cattle-monitoring-system prometheus-rancher-monitoring-prometheus-0 -- du -sh /prometheus/wal
- kubectl exec -it -n cattle-monitoring-system prometheus-rancher-monitoring-prometheus-1 -- du -sh /prometheus/wal
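For example, recent Prometheus logs and current pod resource usage can be reviewed with the commands below (a sketch; the prometheus container name and pod names are the rancher-monitoring defaults and may differ in your environment, and kubectl top requires metrics-server)
- kubectl logs -n cattle-monitoring-system prometheus-rancher-monitoring-prometheus-0 -c prometheus --tail=100
- kubectl top pods -n cattle-monitoring-system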
Root Cause: Accumulated files in the /prometheus/wal directory cause resource pressure when Prometheus replays the write-ahead log at startup, leaving the cluster in an unusable state.
Resolution:
- If the WAL directory is larger than 5 GB, delete the files in /prometheus/wal
- kubectl exec -it -n cattle-monitoring-system prometheus-rancher-monitoring-prometheus-0 -- rm -rf /prometheus/wal/*
- kubectl exec -it -n cattle-monitoring-system prometheus-rancher-monitoring-prometheus-1 -- rm -rf /prometheus/wal/*
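- If the pods are crash-looping, restarting them after clearing the WAL may be needed so they come back up cleanly (a suggested extra step, not strictly required if the pods recover on their own)
- kubectl delete pod -n cattle-monitoring-system prometheus-rancher-monitoring-prometheus-0
- kubectl delete pod -n cattle-monitoring-system prometheus-rancher-monitoring-prometheus-1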
- If the above does not work, delete the PVCs so they can be rebuilt
- Scale down the Prometheus StatefulSet
- kubectl scale sts -n cattle-monitoring-system prometheus-rancher-monitoring-prometheus --replicas=0
- Scale down the monitoring operator
- kubectl scale deploy -n cattle-monitoring-system rancher-monitoring-operator --replicas=0
- Run the command below until the pods are gone (i.e. it returns nothing); a kubectl wait alternative follows
- kubectl get pods -n cattle-monitoring-system | grep -E "operator|prometheus-rancher"
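- Alternatively, kubectl wait can block until the pods are deleted (assuming the default pod names above; adjust the timeout as needed)
- kubectl wait --for=delete pod/prometheus-rancher-monitoring-prometheus-0 -n cattle-monitoring-system --timeout=300s
- kubectl wait --for=delete pod/prometheus-rancher-monitoring-prometheus-1 -n cattle-monitoring-system --timeout=300s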
- Delete the PVCs below
- kubectl delete pvc prometheus-rancher-monitoring-prometheus-db-prometheus-rancher-monitoring-prometheus-0 -n cattle-monitoring-system
- kubectl delete pvc prometheus-rancher-monitoring-prometheus-db-prometheus-rancher-monitoring-prometheus-1 -n cattle-monitoring-system
- Scale the operator back up
- kubectl scale deploy -n cattle-monitoring-system rancher-monitoring-operator --replicas=1
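- Verify recovery (an optional check): once the operator is running again, it should recreate the Prometheus StatefulSet, and the new pods should bind fresh PVCs
- kubectl get pods -n cattle-monitoring-system -w
- kubectl get pvc -n cattle-monitoring-system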