How to Diagnose Filesystem Corruption in Automation Suite

This article explains how to troubleshoot "Operation not permitted" or "Permission denied" errors in a container on Automation Suite Embedded.

Issue Description:

Processes inside a container on Automation Suite Embedded fail with "Operation not permitted" or "Permission denied" errors.

Example errors:

File "/opt/redislabs/sbin/generate_gossip_envoy_default_conf.py", line 10, in main
   generate_gossip_envoy_config()
 File "cnm/services/gossip_envoy.py", line 90, in generate_gossip_envoy_config
IOError: [Errno 13] Permission denied: '/etc/opt/redislabs/gossip_envoy.yaml'
Command '/bin/bash -c export PYTHONPATH=:/opt/redislabs/lib/cnm:/opt/redislabs/lib/cnm/python; /opt/redislabs/bin/python2.7 -O /opt/redislabs/sbin/generate_gossip_envoy_default_conf.py' exited with status: 1
chown: changing ownership of '/var/run/saslauthd': Operation not permitted
Command 'chown -R redislabs:redislabs /var/run/saslauthd' exited with status: 1
/opt/redislabs/sbin/supervisord_prestart_script.sh: line 42: type: getenforce: not found
chmod: changing permissions of '/var/run/redis': Operation not permitted
Command 'chmod 0775 /var/run/redis' exited with status: 1
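
To locate errors like the ones above in a long container log, a filter along these lines may help. This is a minimal sketch: the function name filter_permission_errors is hypothetical, and the pattern only covers the two messages shown in the example.

```shell
#!/bin/sh
# Hypothetical helper (not part of Automation Suite): read log text on
# stdin and keep only the lines that report the permission failures
# described in this article.
filter_permission_errors() {
  grep -E 'Operation not permitted|Permission denied'
}
```

For example, the output of kubectl logs for the affected pod could be piped into filter_permission_errors to see only the failing operations.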

Background:

These errors can occur in a container when the host machine interferes with container operations, for one of the following reasons:

  1. The container image file system permissions were changed from the host machine.

  2. fapolicyd is enabled but not configured.

  3. SELinux is enabled and blocking container operations.

Resolution:

  1. Check if fapolicyd is enabled.

    1. If it's enabled, make sure it's configured as per our docs: Automation Suite - Step 8: Configuring kernel and OS level settings

    2. To view the current rules, run: cat /etc/fapolicyd/rules.d/69-rke2.rules

    3. If it's not configured, run the command in our docs to configure it. If it is configured, capture a screenshot of the rules.
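
    The fapolicyd checks in step 1 can be sketched as a small script. This is an illustration only: the function names are hypothetical, the host is assumed to use systemd, and a non-empty rules file is treated as "configured" without validating its contents against the docs.

```shell
#!/bin/sh
# Hypothetical helper: report whether the RKE2 rules file for fapolicyd
# exists and is non-empty. The default path is the one quoted in step 1.
check_fapolicyd_rules() {
  rules_file="${1:-/etc/fapolicyd/rules.d/69-rke2.rules}"
  if [ -s "$rules_file" ]; then
    echo "configured"
  else
    echo "not configured"
  fi
}

# Hypothetical helper: print the fapolicyd service state.
# Assumes a systemd host; prints "unknown" when systemctl is absent.
fapolicyd_state() {
  if command -v systemctl >/dev/null 2>&1; then
    systemctl is-active fapolicyd 2>/dev/null || true
  else
    echo "unknown"
  fi
}
```

    If fapolicyd_state reports the service as active while check_fapolicyd_rules prints "not configured", the rules from the docs still need to be applied.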

  2. Check if SELinux is enabled.

    1. Run: sestatus

    2. If the current mode is enforcing, temporarily switch SELinux to permissive mode on all nodes and see if the issue persists.

      1. sudo setenforce 0

    3. If the issue no longer occurs once SELinux has stopped enforcing, SELinux was the cause.

    4. If the environment is air-gapped, the rke2-selinux package must be installed manually.

      1. On a non-air-gapped machine, download the RPM package with the following command: sudo dnf download rke2-selinux

      2. Once the package is downloaded, transfer it to the air-gapped machine.

      3. Then install it with rpm -ivh, passing the path to the downloaded RPM file.

      4. After this, SELinux can be re-enabled.
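
    To read the SELinux mode in step 2 from a script, the output of sestatus can be parsed as sketched below. The function name is hypothetical, and the sketch assumes the usual "Current mode:" line of sestatus; when that line is absent (as when SELinux is disabled), it prints "disabled".

```shell
#!/bin/sh
# Hypothetical helper: read `sestatus` output on stdin and print the
# current SELinux mode (for example "enforcing" or "permissive").
# Prints "disabled" when no "Current mode:" line is present.
selinux_mode_from_sestatus() {
  awk -F': *' '
    /^Current mode/ { print $2; found = 1 }
    END { if (!found) print "disabled" }
  '
}
```

    For example, sestatus | selinux_mode_from_sestatus prints the mode on each node. After the package is installed, rpm -q rke2-selinux should list it, and sudo setenforce 1 returns SELinux to enforcing mode.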

  3. If the issue persists with SELinux no longer enforcing, the host file permissions have most likely changed.

    1. It is difficult to check the permissions because any file in the containerd directory could be affected. In this scenario, rebuild the file system.

    2. Steps to rebuild the file system:

      1. Drain and Stop the Node: Begin by draining the node to move workloads to other nodes (if possible) and prevent data loss.

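      • kubectl drain <node-name> --ignore-daemonsets --delete-local-data
        (On newer kubectl versions, --delete-local-data has been replaced by --delete-emptydir-data.)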

      2. Run the Cleanup Script: Stop the RKE2 agent and run rke2-killall.sh to remove all components and reset the node state.

      • sudo systemctl stop rke2-agent
      • sudo /usr/local/bin/rke2-killall.sh

      3. Delete the Containerd Directory: Once the node is safely stopped, delete the containerd directory to clear any corrupted files.

      • sudo rm -rf /var/lib/rancher/rke2/agent/containerd

      4. Restart the RKE2 Agent: Start the RKE2 agent to rebuild the containerd directory and initialize a clean runtime environment.

      • sudo systemctl start rke2-agent
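
      The rebuild steps above can be collected into one script, sketched here under a few assumptions: the names run, rebuild_containerd, and DRY_RUN are hypothetical, kubectl is assumed to be usable from the same shell as the node commands, and the node is assumed to run rke2-agent. By default the script only prints the commands; set DRY_RUN=0 to execute them.

```shell
#!/bin/sh
# Print the command when DRY_RUN is 1 (the default), otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Hypothetical wrapper around the four rebuild steps from this article.
# Stops at the first command that fails.
rebuild_containerd() {
  node="${1:-}"
  if [ -z "$node" ]; then
    echo "usage: rebuild_containerd <node-name>" >&2
    return 1
  fi
  # 1. Drain the node so workloads move elsewhere.
  run kubectl drain "$node" --ignore-daemonsets --delete-local-data &&
  # 2. Stop the agent and run the cleanup script.
  run sudo systemctl stop rke2-agent &&
  run sudo /usr/local/bin/rke2-killall.sh &&
  # 3. Delete the containerd directory.
  run sudo rm -rf /var/lib/rancher/rke2/agent/containerd &&
  # 4. Start the agent again to rebuild the directory.
  run sudo systemctl start rke2-agent
}
```

      Running rebuild_containerd with a node name first in dry-run mode allows the command list to be reviewed before anything is deleted. A drained node stays cordoned until kubectl uncordon is run for it, so that may be needed once the agent is healthy again.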
  4. If none of the above works, raise a support ticket and include the RKE2 support bundle and any other artifacts gathered.