RKE2 Fails To Start

Resolution when RKE2 fails to start.

Issue Description:

RKE2 fails to start

Root Cause:

If RKE2 is failing to start, it is probably caused by one of the following:

  • An invalid configuration was applied to the service.
  • The machine configuration is invalid (this would only happen at install time and is unlikely).
  • Another process on the machine is causing the issue. Most often, a security application is killing the process.

Diagnosing:

  1. First, see the list of security applications known to cause issues with Kubernetes at the bottom of this article.
  2. The unit file for RKE2 is set to always restart, so if something is killing the process, it will immediately restart. This can make debugging difficult. As such, modify the RKE2 unit file so that it will not restart automatically.
    1. The following commands will make a backup and apply the change for a server node:
      • cp /usr/lib/systemd/system/rke2-server.service ~/
      • sed -i '/Restart=always/d' /usr/lib/systemd/system/rke2-server.service
      • systemctl daemon-reload
    2. To restore the file for a server node, run:
      • yes | cp ~/rke2-server.service /usr/lib/systemd/system/rke2-server.service
      • systemctl daemon-reload
    3. The following commands will make a backup and apply the change for an agent node:
      • cp /usr/lib/systemd/system/rke2-agent.service ~/
      • sed -i '/Restart=always/d' /usr/lib/systemd/system/rke2-agent.service
      • systemctl daemon-reload
    4. To restore the file for an agent node, run:
      • yes | cp ~/rke2-agent.service /usr/lib/systemd/system/rke2-agent.service
      • systemctl daemon-reload
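    5. As an alternative to editing the unit file in place, a systemd drop-in override can be used to achieve the same effect. The following is a sketch for a server node (the drop-in file name no-restart.conf is arbitrary; use rke2-agent.service.d for agent nodes):
      • mkdir -p /etc/systemd/system/rke2-server.service.d
      • printf '[Service]\nRestart=no\n' > /etc/systemd/system/rke2-server.service.d/no-restart.conf
      • systemctl daemon-reload
      • To restore the original behavior, delete the drop-in file and run systemctl daemon-reload again.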
  3. The change can be verified with the following command (use rke2-agent in place of rke2-server on agent nodes):
    • systemctl cat rke2-server
    • The restart setting should be removed (HINT: There will be no line saying 'Restart=always')
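    • For a quick check, the output can also be filtered for the setting; the following should print nothing once the line has been removed: systemctl cat rke2-server | grep 'Restart='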
  4. Restart the RKE2 service. This is necessary to make sure that, if it fails, the service exits with an exit code that can be inspected.
    • For servers: systemctl start rke2-server
    • For agents: systemctl start rke2-agent
  5. Next, check the status of the service
    1. For server nodes: systemctl status rke2-server
    2. For agent nodes: systemctl status rke2-agent
  6. The status will give an indication of what is happening to the service. Pay attention to the following lines:
    1. Active: This line denotes the status of the service along with a timestamp of when the service was started or stopped.
      1. activating - the RKE2 service is still starting up. In this case, wait for the startup to fail.
        • If it remains in the activating state for a very long time, we will want to look at the RKE2 logs as described below.
        • If the service does seem stuck activating, try running the following:
          • rke2-killall.sh
          • systemctl restart rke2-server
          • or systemctl restart rke2-agent
        • The rke2-killall.sh script will clear out any orphaned processes, which might get RKE2 out of an activating state.
      2. inactive - the RKE2 service was stopped using systemctl stop. However, RKE2 exits with a code of 1 when stopped with systemctl stop, so this state typically won't be seen unless that exit code behavior is changed.
        • The systemd logs can be checked to see if the service was stopped using systemctl.
      3. failed - RKE2 service exited with an exit status of 1 or some other exit code.
        • This can occur if someone stopped the RKE2 service with systemctl stop, or if it was stopped by systemd (see above). To check whether the service was stopped with systemctl or by systemd, run the following command:
          • journalctl | grep 'Stopped Rancher'
          • If the service has been stopped by systemd, the above command will return something like (there will probably be multiple lines): Jan 14 21:57:01 autosuite systemd[1]: Stopped Rancher Kubernetes Engine v2 (server).
          • If messages like the above are returned, check the timestamps and see if they correspond to the timestamp of when the service stopped.
        • It can also occur on unexpected stops. If the stop was unexpected, there will be no corresponding log message of systemd stopping the service as described above.
      4. active - means the RKE2 service is running.
        • In this case the issue could be intermittent, or there was a transient issue. Treat this like the service was stopped unexpectedly.
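      • For reference, the Active line of a service that failed to start looks something like the following (the timestamp and duration here are illustrative): Active: failed (Result: exit-code) since Mon 2024-01-15 10:23:45 UTC; 2min ago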
    2. Main PID:
      1. This will give information about the status of the current running process.
      2. Linux exit codes are listed here: Exit Codes
        • For some exits, the exit code will be 128 + the signal number. For example, an exit code of 137 is 128 + 9 and indicates another process sent the RKE2 process a SIGKILL (9), which is an immediate kill.
      3. We mostly care about the case where the service is in a failed state and what the exit code was. Examples of what the status line might look like:
        • Main PID: 564151 (code=exited, status=1/FAILURE)
        • Main PID: 564151 (code=killed, signal=KILL)
      4. For an exit code of one (i.e. 'code=exited, status=1/FAILURE')
        • Check to see if systemd stopped the service as described above.
        • If systemd stopped the service, that means that either someone stopped the service using systemctl or a process told systemd to stop the service. It could also mean the system switched to a different systemd target (rke2 is part of the multi-user.target). If this seems to be the case, the Linux admin needs to take a look at the system to see why this is occurring.
        • If systemd did not stop the service, we will need to check the journal and audit logs.
      5. For a status of (code=killed, signal=KILL)
        • This means that the RKE2 process received a SIGKILL, which forces the process to terminate. In most cases this is caused by security software. In this scenario we would need to check the audit logs for kill signals.
      6. For any other exit codes, the next step would be to check the RKE2 logs and possibly the audit logs.
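      7. As a quick way to translate a numeric exit status back to a signal name, shell arithmetic can be used (using the 137 example from above):
        • echo $((137 - 128))  (prints 9)
        • kill -l 9  (prints KILL)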
  7. Next, check the journal logs of the machine. To do this run the following command:
    1. For a server: journalctl -r -u rke2-server
    2. For an agent: journalctl -r -u rke2-agent
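    3. If the log volume is large, the output can be narrowed to a recent time window and filtered for error-level messages; for example (the 30 minute window is illustrative):
      • journalctl -u rke2-server --since "30 minutes ago" --no-pager | grep -iE 'error|fatal'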
  8. Analyzing the RKE2 logs in depth is beyond the scope of this article. However here are some simple tips:
    1. Look for error messages. Sometimes they explain the exact issue.
    2. Look for 'containerd: signal: killed' - This indicates that containerd exited, which will result in RKE2 shutting down.
      • In this scenario we would have to debug why containerd is stopping. The logs are located at: /var/lib/rancher/rke2/agent/containerd/containerd.log
      • If the logs end unexpectedly without error, that indicates that the containerd process is being terminated by another process. We would want to check the audit logs for kill signals (see below).
    3. Look for the string 'level=fatal' to find fatal errors.
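    4. For example, the following commands can be used to pull fatal errors out of the RKE2 journal and to inspect the tail of the containerd log (paths as given above):
      • journalctl -u rke2-server --no-pager | grep 'level=fatal'
      • tail -n 50 /var/lib/rancher/rke2/agent/containerd/containerd.log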
  9. If it is suspected that the RKE2 service (or containerd) is being terminated by another process, the auditd logs can be used to verify this.
    1. The following command will add an auditd rule so that kill signals are recorded in the audit logs with the key 'audit_kill'.
      • auditctl -a exit,always -F arch=$(uname -m) -S kill -k audit_kill
    2. After adding the rule, verify that it exists:
      • auditctl -l
      • There should be an entry that looks something like (it may vary depending on architecture): -a always,exit -F arch=b64 -S kill -F key=audit_kill
    3. Next, start the RKE2 service (if checking to see if containerd is being stopped, starting RKE2 will also start containerd.)
      • For server nodes: systemctl restart rke2-server
      • For agent nodes: systemctl restart rke2-agent
    4. Wait for RKE2 to exit.
    5. Query the audit logs for any kill signals sent to the RKE2 process:
      • ausearch -k audit_kill | grep -A 1 -B 1 rke2
        • For containerd, replace 'rke2' with 'containerd'.
      • Note: On some systems, the audit logs might rotate while this operation is being performed. If no entry for the kill signal is found, verify that the logs were not rotated after the RKE2 process was killed. To check this run:
        1. systemctl status auditd
        2. The last log rotation should be noted there in the log snippet at the bottom of the response.
        3. This can be compared to the timestamp of when RKE2 stopped.
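        4. The rotation times can also be checked directly by listing the audit log files with their modification timestamps: ls -lt /var/log/audit/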
    6. If another process killed RKE2, the audit log search will return a three-line output like the following:
      • type=PROCTITLE msg=audit(1673744139.174:1056135): proctitle="-sh"
      • type=OBJ_PID msg=audit(1673744139.174:1056135): opid=833725 oauid=-1 ouid=0 oses=-1 ocomm="rke2"
      • type=SYSCALL msg=audit(1673744139.174:1056135): arch=c000003e syscall=62 success=yes exit=0 a0=cb8bd a1=9 a2=0 a3=7fa6f26ba9a0 items=0 ppid=312941 pid=570538 auid=1000 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts1 ses=8 comm="sh" exe="/usr/bin/bash" key="audit_kill"
    7. In the above, the OBJ_PID line describes the process that was killed. The 'ocomm' field is the name of the process (rke2) and the 'opid' field is the PID of the process.
    8. The SYSCALL line represents the process that killed the RKE2 server. The 'exe' field is the executable involved and the 'pid' field is the PID.
    9. In the example we see that the rke2 service was killed by the executable /usr/bin/bash (which was us simulating the issue using kill -9).
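    10. Once the investigation is complete, the temporary audit rule can be removed; a command mirroring the rule added above should work, after which auditctl -l can be run again to confirm it is gone:
      • auditctl -d exit,always -F arch=$(uname -m) -S kill -k audit_kill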
  10. If none of the above helps identify the issue, generate an RKE2 bundle and open a ticket with UiPath.
    1. To generate an RKE2 bundle, follow the steps here: The Rancher v2.x Linux log collector script
    2. Also check our docs in case our troubleshooting guide has been updated with an installation-specific support bundle. If it has, use that instead of the SUSE tool (the RKE2 bundle will be incorporated into our tool in the future): Using Support Bundle Tool
    3. In the ticket opened with UiPath, please include the support bundle, the steps taken so far, and any details discovered going through this KB article.

Security Software Known to Cause Issues:

Note: None of these applications are known to be incompatible with Kubernetes. However, if this software is being used, please contact the vendor or your admin to understand how to correctly configure them in an on-premises hosted Kubernetes environment.

  1. Palo Alto Networks Cortex XDR Agent Daemon - listed as traps_pmd.service in systemd
  2. SentinelOne - listed as sentinelone.service in systemd.