Automation Suite - DNS Validation Failed From Container

[ERROR] Failed the cluster infrastructure setup DNS Validation Failed from container.

Issue Description: During the Automation Suite infrastructure installation, the following error is encountered: [ERROR] Failed the cluster infrastructure setup DNS Validation Failed from container.

Root Cause: The error means that the domains associated with the FQDN used for Automation Suite were resolvable from the host but not from the containers.


Most likely the DNS containers had an issue starting. (It is probably not a DNS or network issue, because the containers use the same /etc/resolv.conf as the host.)

Here is a more detailed description of how the DNS check works (feel free to skip ahead to the Diagnosing section):
  1. Resolve the FQDN from the Linux host.
  2. Resolve the FQDN from within a test container.
  3. For the container to resolve the hostname, the following must occur:
    1. Within the cluster, the DNS service must be running. It runs in the kube-system namespace.
      1. Specifically, these are the rke2-coredns containers.
    2. The test container has to be able to communicate with the rke2-coredns containers.
    3. The rke2-coredns containers have to be able to talk to the upstream DNS servers.
    4. Finally, the DNS configuration has to be valid.
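In practice, these two checks roughly boil down to the following pair of commands (the FQDN shown here is hypothetical; step 7 of the Diagnosing section walks through the container-side command in detail):

  # Host-side resolution (the installer uses getent on the host)
  getent ahosts automationsuite.example.com
  # Container-side resolution, run inside a cluster pod via kubectl
  kubectl -n istio-system exec ds/istio-ingressgateway -- getent ahosts automationsuite.example.com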
Points of failure could be:
  1. The DNS configuration is not valid (the resolv.conf might not have valid entries, no DNS is set up, /etc/hosts was used instead of actual DNS, etc.).
  2. The DNS containers are not running, have crashed, or are erroring out.
  3. The test container does not start (but the error message would be different).
  4. Container-to-container communication is not working.
  5. The DNS containers cannot communicate with the upstream DNS servers.

Diagnosing:

  1. First, check that DNS is configured correctly from the host machine. The installer uses getent to check the configuration, which also picks up entries made in /etc/hosts; however, updating /etc/hosts is not a valid DNS configuration.
    1. Run: sudo cat /etc/hosts
    2. This will return the contents of /etc/hosts. There should be no entries corresponding to the Automation Suite FQDN.
      1. Example: We run cat /etc/hosts and it returns:
        1. 127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
          ::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
          127.0.0.1 automationsuite.uipath.com
          127.0.0.1 alm.automationsuite.uipath.com
      2. This would be an invalid configuration. Automation Suite requires proper DNS. Any entries corresponding to the FQDN of Automation Suite would need to be removed, for example with the command shown below.
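      3. For example, the flagged lines could be removed (or commented out) with a command like the one below; the FQDN in the pattern is hypothetical, so adjust it to match your own entries:
        1. sudo sed -i.bak '/automationsuite\.uipath\.com/d' /etc/hosts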
    3. Run: nslookup <FQDN>
      1. Example: nslookup automationsuite.uipath.com
      2. dig can be used as well.
    4. This should return the IP address of the load balancer or, for a single-node installation, the node IP, as in the illustrative output below.
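      1. For illustration only (the server and addresses here are hypothetical), a successful lookup looks roughly like:
        1. Server:   10.0.0.2
           Address:  10.0.0.2#53
           Name:     automationsuite.uipath.com
           Address:  10.20.30.40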
    5. If this returns a message that the server could not find the domain, then DNS has not been set up correctly.
      1. Work with your domain admin to set up the DNS entries: https://docs.uipath.com/automation-suite/docs/multi-node-configuring-the-dns
      2. Also remove the entries in /etc/hosts.
    6. If the request works, try testing out the other nameservers defined in your /etc/resolv.conf file.
      1. To see the other servers run: sudo cat /etc/resolv.conf
      2. This will return a list of servers. We are interested in the entries that have the following format: nameserver <IP address>
      3. For example, the contents of the /etc/resolv.conf might look like:
        1. nameserver 168.63.129.16
          nameserver 1.1.1.1
          nameserver 8.8.8.8
      4. To test each server, query the FQDN against that specific nameserver by passing the server as a second argument:
        1. nslookup automationsuite.uipath.com 168.63.129.16
        2. nslookup automationsuite.uipath.com 1.1.1.1
        3. etc.
      5. If one of the nameservers does not work, let your Linux admin know and remove the entry. A loop for testing all of the nameservers at once is sketched below.
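      6. To test every nameserver in one pass, a short shell loop like the following can be used (the FQDN is hypothetical; replace it with your own):
        1. for ns in $(awk '/^nameserver/ {print $2}' /etc/resolv.conf); do
             echo "== testing $ns =="
             nslookup automationsuite.uipath.com "$ns"
           done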
  2. If DNS is configured correctly from the host, the next options are to:
    1. Contact UiPath support. Before doing this, generate an RKE2 support bundle: The Rancher v2.x Linux log collector script
    2. Or we can check the state of the cluster to find the specific point of failure. While these steps are more advanced, they can help pinpoint the failure, and the information should be useful to whoever is managing the cluster.
  3. For the following tests kubectl needs to be enabled.
    1. Refer to Enabling kubectl
    2. export KUBECONFIG="/etc/rancher/rke2/rke2.yaml" && export PATH="$PATH:/usr/local/bin:/var/lib/rancher/rke2/bin"
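    3. To confirm that kubectl is working, a quick check such as the following should list the cluster nodes:
      1. kubectl get nodes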
  4. Check the state of the DNS Pods.
    1. kubectl -n kube-system get pods -l k8s-app=kube-dns
      1. This command will list the DNS pods (a pod is essentially a container, or a small group of containers).
    2. Example output:
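      • The exact values vary by cluster, but the output should look roughly like this (pod name, restart count, and age are illustrative):
        NAME                                        READY   STATUS    RESTARTS   AGE
        rke2-coredns-rke2-coredns-7bb4f446c-bj66k   1/1     Running   0          3d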
    3. They should have the STATUS Running. If one of them is not in a Running state, most likely this is the cause of the failure.
    4. To determine why one is not in the running state, do the following:
      1. Choose a pod to check the status of. (The pod name is the first column in the output of the previous step and will look something like rke2-coredns-rke2-coredns-XXXXX, e.g. rke2-coredns-rke2-coredns-7bb4f446c-bj66k.)
      2. Do a describe to see what events are associated with the pod:
        • kubectl -n kube-system describe pod <pod name>
        • For example: kubectl -n kube-system describe pod rke2-coredns-rke2-coredns-7bb4f446c-bj66k
      3. At the end of the output will be a list of events. Look for warnings or errors. Often these can explain the issue.
        1. Example output:
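          • The Events section appears at the bottom of the describe output and looks roughly like this (contents are illustrative):
            Events:
              Type     Reason     Age   From     Message
              ----     ------     ----  ----     -------
              Warning  Unhealthy  2m    kubelet  Readiness probe failed: ...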
      4. If there is an event that mentions a failed health check, go to the next step.
      5. If there are errors or warnings but the issue is not clear, generate an RKE2 support bundle by referring to The Rancher v2.x Linux log collector script
  5. Check the DNS service for errors.
    1. If the pods are in a Running state or if they failed because of a health check, the next step would be to check the logs.
    2. kubectl -n kube-system logs -l k8s-app=kube-dns --since=1h
    3. If the pod is in a crashed state, it might be necessary to add "-p --tail=100" to the command, because the pod may be restarting. In that case we want to look at the logs of the previous instance: adding -p does this, and --tail=100 allows the output to return more than the default 10 lines.
      1. kubectl -n kube-system logs -l k8s-app=kube-dns -p --tail=100
    4. The logs may explain the issue. For example, it could be that a UDP connection to the nameserver is being blocked. (If UDP connections are blocked, that is most likely caused by security or firewall software.)
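      1. For reference, a blocked UDP connection to an upstream nameserver typically shows up in the CoreDNS logs as a line roughly like the following (the addresses and domain here are illustrative):
        1. [ERROR] plugin/errors: 2 automationsuite.uipath.com. A: read udp 10.42.0.5:45533->8.8.8.8:53: i/o timeout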
    5. Again, if the issue is not clear, generate an RKE2 support bundle and contact support. See: The Rancher v2.x Linux log collector script
  6. If the DNS containers seem healthy, check the test container's connectivity to the DNS containers.
    1. kubectl -n istio-system exec ds/istio-ingressgateway -- nc -zv -w 1 rke2-coredns-rke2-coredns.kube-system 53
    2. The above command execs into the istio-ingressgateway container and checks the connectivity to the internal DNS server.
    3. This should return something like the following:
      1. rke2-coredns-rke2-coredns.kube-system.svc.cluster.local [10.43.0.10] 53 (?) open
    4. If it returns something about a connection timing out, that means containers cannot communicate with each other. This is usually caused by firewall applications.
      1. The default firewall is firewalld and its state can be checked with the command:
        1. firewall-cmd --state
        2. It should return: not running
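        3. If it returns running instead, firewalld can be stopped and disabled (confirm with your Linux admin first; Automation Suite expects firewalld to be off):
          1. sudo systemctl disable --now firewalld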
    5. If the connection failed and firewalld was not running, talk with your Linux admins to see if they know what could be blocking the traffic. Additionally, reach out to UiPath Support and provide the RKE2 support bundle.
    6. If this is a multi-node instance and only one node is failing, try checking the overlay network: Check If Overlay Network Is Functioning Correctly.
  7. The final check is to manually perform the same validation check that the installer performs.
    1. Make sure istio-ingressgateway is up.
      1. Run: kubectl -n istio-system get pods -l app=istio-ingressgateway
      2. Everything should be in a Running state. If it is not, follow the equivalent steps described in the "Check the state of the DNS Pods" step above.
    2. Run the container DNS resolution check command the installer uses:
      1. kubectl -n istio-system exec ds/istio-ingressgateway -- getent ahosts <FQDN>
      2. Make sure to replace <FQDN> with the FQDN of the Automation Suite instance.
      3. This command tests the DNS resolution inside the containers.
      4. For example: kubectl -n istio-system exec ds/istio-ingressgateway -- getent ahosts autosuite.uipath.devtest
      5. Example output:
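        • The format of a successful lookup is roughly the following (the IP address is illustrative):
          10.20.30.40     STREAM autosuite.uipath.devtest
          10.20.30.40     DGRAM
          10.20.30.40     RAW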
      6. The test can be done on each Automation Suite URL (see Multi Node Configuring The DNS).
    3. The output of the above command should match the output of the equivalent command run on the host machine:
      1. getent ahosts <FQDN>
      2. For example: getent ahosts autosuite.uipath.devtest
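      3. A quick way to compare the host and container outputs side by side is a diff of the two commands (hypothetical FQDN shown):
        1. diff <(getent ahosts autosuite.uipath.devtest) <(kubectl -n istio-system exec ds/istio-ingressgateway -- getent ahosts autosuite.uipath.devtest)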
    4. If the above commands are successful but the output of the commands run on the host does not match the output of the commands run in the container, check the contents of /etc/hosts; there may be entries interfering with DNS.
    5. If the above commands have the same output, try running the installer again. It could be that the CoreDNS container took longer to start.
    6. If the command fails with "command terminated with exit code 2" it means nothing was returned.
      1. This scenario should be addressed by the above steps.
  8. If we have made it this far, contact support and include the RKE2 support bundle: The Rancher v2.x Linux log collector script