Organizations that apply hardening policies to Kubernetes clusters often find that the changes break cluster functionality. This article describes the main issues seen during such hardening efforts, their root causes, and their resolutions. It is intended to help Kubernetes administrators and security teams manage these configurations while keeping the cluster stable.
Issue Description:
- World-Writable Permissions on Directories
- During hardening, certain directories such as /var/lib/rancher/rke2/agent/containerd/ were flagged for having world-writable permissions (777).
- Subdirectories under io.containerd.snapshotter.v1.overlayfs were identified with 777 permissions. However, these directories were inaccessible due to the parent directory's restrictive permissions (700).
- Incorrect modifications to these permissions caused container runtime corruption.
- Incorrect Sysctl Settings
- Sysctl parameters like net.ipv4.ip_forward and net.ipv4.conf.all.rp_filter were set to values incompatible with Kubernetes requirements during hardening efforts.
- CIS hardening recommendations, though beneficial for general servers, conflicted with Kubernetes networking requirements.
- DNS Failures and Kernel Settings
- DNS queries from within the cluster failed due to kernel-level settings blocking outbound communication.
- Misconfigured /etc/resolv.conf entries and kernel parameters were identified as the root causes.
- Permission-Related Container Failures
- Recursive permission changes under /var/lib/rancher/rke2/agent/containerd/ corrupted container file systems.
- Affected containers showed errors such as permission denials during critical runtime operations.
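The permission findings above can be checked without changing anything on the node. The following is a minimal read-only sketch, assuming a POSIX shell with find and ls available. The function names list_world_writable_dirs and show_ancestor_modes are illustrative and are not part of RKE2.

```shell
# Read-only audit: print directories under a root that are world-writable.
# list_world_writable_dirs is an illustrative helper name, not an RKE2 tool.
list_world_writable_dirs() {
    root="$1"
    # -xdev keeps find on a single filesystem;
    # -perm -0002 matches any mode that has the "other write" bit set.
    find "$root" -xdev -type d -perm -0002 2>/dev/null
}

# Print the mode of a directory and of each of its ancestors up to /.
# A restrictive ancestor (for example 700) shows that other users cannot
# reach a 777 directory below it. Expects an absolute path.
show_ancestor_modes() {
    path="$1"
    while [ -n "$path" ] && [ "$path" != "/" ] && [ "$path" != "." ]; do
        ls -ld -- "$path"
        path=$(dirname -- "$path")
    done
}

# Example calls on a node (run as root):
# list_world_writable_dirs /var/lib/rancher/rke2/agent/containerd
# show_ancestor_modes /var/lib/rancher/rke2/agent/containerd
```

Because both helpers only read metadata, they can be run before deciding whether a scanner finding needs any change at all.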
Resolution:
1. Addressing Directory Permissions
- Ensure parent directories of /var/lib/rancher/rke2/agent/containerd/ have appropriate restrictive permissions.
- Recommended Change: Set the following parent directories to 700 without applying changes recursively:
- /var/lib/rancher
- /var/lib/kubelet
- /var/lib/pods
- Avoid modifying permissions under io.containerd.snapshotter.v1.overlayfs to prevent corruption.
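The non-recursive change described in this step could be scripted roughly as follows. This is a sketch under the assumption that it is run as root on the node. The helper name restrict_dir_only is hypothetical, and the key point is that chmod is called without -R so nothing below the named directory is touched.

```shell
# Tighten only the named directories; children are left untouched (no -R).
# restrict_dir_only is a hypothetical helper name used for illustration.
restrict_dir_only() {
    for dir in "$@"; do
        if [ -d "$dir" ]; then
            chmod 700 -- "$dir"
        else
            # Report and continue rather than fail on a path that is absent.
            echo "skip (not a directory): $dir" >&2
        fi
    done
}

# Example call on a node, using the directories listed in this step:
# restrict_dir_only /var/lib/rancher /var/lib/kubelet /var/lib/pods
```

Skipping missing paths keeps the helper usable across nodes where not every directory exists.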
2. Correcting Sysctl Settings
- Update the following sysctl parameters on all nodes as per Kubernetes requirements:
- net.bridge.bridge-nf-call-iptables = 1
- net.ipv4.ip_forward = 1
- net.bridge.bridge-nf-call-ip6tables = 1
- net.ipv4.conf.all.rp_filter = 0
- Validate settings using the UiPath Automation Suite guide.
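- Avoid conflicts between global and interface-specific settings. For example, ensure consistency between net.ipv4.conf.all.rp_filter and net.ipv4.conf..rp_filter, because the kernel applies the higher of the global and per-interface rp_filter values when validating source addresses.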
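One way to persist and verify these four values is a sysctl drop-in file plus a small check that reads /proc/sys. The sketch below makes several assumptions: the file name 90-kubernetes.conf is an arbitrary choice, the helper names are hypothetical, and the net.bridge keys are typically only present once the br_netfilter kernel module is loaded.

```shell
# Write the four settings from this step to a sysctl drop-in file.
# The default path is an assumed name; adjust it to local conventions.
write_k8s_sysctl_conf() {
    conf="${1:-/etc/sysctl.d/90-kubernetes.conf}"
    cat > "$conf" <<'EOF'
net.bridge.bridge-nf-call-iptables = 1
net.ipv4.ip_forward = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.conf.all.rp_filter = 0
EOF
}

# Compare one live value with the expected one by reading /proc/sys.
# The optional third argument overrides the root directory (useful for tests).
check_sysctl() {
    key="$1"; expected="$2"; root="${3:-/proc/sys}"
    # Map the dotted key to its path, e.g. net.ipv4.ip_forward -> net/ipv4/ip_forward.
    file="$root/$(printf '%s' "$key" | tr . /)"
    if [ ! -r "$file" ]; then
        echo "MISSING $key"; return 1
    fi
    actual=$(cat "$file")
    if [ "$actual" = "$expected" ]; then
        echo "OK $key = $actual"
    else
        echo "MISMATCH $key = $actual (expected $expected)"; return 1
    fi
}

# Example on a node (run as root):
# write_k8s_sysctl_conf && sysctl --system
# check_sysctl net.ipv4.ip_forward 1
# check_sysctl net.ipv4.conf.all.rp_filter 0
```

A MISMATCH after sysctl --system usually means another file under /etc/sysctl.d or /etc/sysctl.conf, often one written by the hardening tool, is applied later and overrides the value.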
3. Resolving DNS Issues
- Restore correct entries in /etc/resolv.conf.
- Validate DNS functionality with:
- kubectl -n kube-system get pod -o wide | grep rke2-core | awk '{print $6}'
- nslookup
- nslookup
- Restart affected services:
- systemctl restart rke2-server || systemctl restart rke2-agent
- kubectl -n kube-system rollout restart deployment rke2-coredns-rke2-coredns
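The DNS checks in this step can be wrapped in two small helpers. This is a sketch only: the arguments to the nslookup commands are not shown in the steps above, so the name to resolve and the DNS server are left as parameters, and the helper names are hypothetical. The example name kubernetes.default.svc.cluster.local assumes the default cluster domain.

```shell
# Print the nameserver addresses configured in a resolv.conf-style file.
# Commented-out entries are ignored because their first field is "#".
list_nameservers() {
    awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}

# Query one name against one DNS server using nslookup.
# Both values are supplied by the caller; nothing is hard-coded here.
check_dns() {
    name="$1"; server="$2"
    if nslookup "$name" "$server" >/dev/null 2>&1; then
        echo "OK $name via $server"
    else
        echo "FAIL $name via $server"; return 1
    fi
}

# Example on a node, where COREDNS_IP is an IP printed by the kubectl command above:
# list_nameservers
# check_dns kubernetes.default.svc.cluster.local "$COREDNS_IP"
```

Checking the node's upstream nameservers and the in-cluster resolver separately helps distinguish a broken /etc/resolv.conf from a CoreDNS or kernel networking problem.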
4. Restoring Container Functionality
- If permission changes corrupted container file systems, run the following on the affected node:
- /opt/node-drain.sh
- rke2-killall.sh
- rm -rf /var/lib/rancher/rke2/agent/containerd
- systemctl restart rke2-server || systemctl restart rke2-agent
- These steps force RKE2 to rebuild the container file system.
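Because this sequence deletes the containerd data directory, it can help to wrap it so that each command runs only if the previous one succeeded, and so that it can be previewed first. The following is a hedged sketch rather than an official procedure. The DRY_RUN variable and the function names are assumptions introduced here, while the commands themselves are the ones listed above.

```shell
# DRY_RUN=1 (the default in this sketch) only prints what would be executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf 'DRY-RUN: %s\n' "$*"
    else
        "$@"
    fi
}

# Run the recovery steps in order and stop at the first failure, so the
# containerd directory is not removed if the drain or the kill step fails.
recover_containerd() {
    run /opt/node-drain.sh &&
    run rke2-killall.sh &&
    run rm -rf /var/lib/rancher/rke2/agent/containerd &&
    { run systemctl restart rke2-server || run systemctl restart rke2-agent; }
}

# Preview:  recover_containerd
# Execute:  DRY_RUN=0; recover_containerd
```

In preview mode only the rke2-server restart is printed, because the fallback to rke2-agent is reached only when the first restart actually fails.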
Preventative Measures:
- Test in Development Environments
- Always implement security changes in a development environment before production deployment to validate functionality.
- Back Up and Validate Configurations
- Back up settings before applying changes.
- Restart affected systems and confirm functionality after modifications.
- Collaborate with Security Teams
- Share planned sysctl and permission updates with the security team for review.
- Validate proposed configurations against Kubernetes requirements.
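As an illustration of backing up settings before a change, the sketch below records the mode and ownership of selected paths so they can be compared after hardening. It assumes a stat that supports -c (GNU coreutils or BusyBox), and the helper name snapshot_modes and the output paths are hypothetical.

```shell
# Record mode, owner, group and path for each existing argument, one per line.
# snapshot_modes is a hypothetical helper name; requires stat with -c support.
snapshot_modes() {
    out="$1"; shift
    : > "$out"                       # start with an empty snapshot file
    for p in "$@"; do
        if [ -e "$p" ]; then
            stat -c '%a %U %G %n' "$p" >> "$out"
        fi
    done
}

# Example on a node before applying a hardening profile (run as root):
# snapshot_modes /root/perms-before.txt /var/lib/rancher /var/lib/kubelet
# sysctl -a > /root/sysctl-before.txt
```

Taking the same snapshot after the change and comparing the two files with diff shows exactly which permissions and kernel parameters the hardening run altered.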