AI Center Linux Host IP Change

What steps are needed to get AI Center back up after the IP of the Linux host is changed?

Issue Description: After the IP (private IP for Azure and AWS) was changed on the Linux host, AI Center is no longer available, and kubectl commands return the message: "unable to connect to <old Linux IP>. Did you specify the correct port?"

Resolution: Perform the following steps:

  1. Create backup copies of the /etc/kubernetes and /var/lib/kubelet folders
  2. Go to /etc/kubernetes/pki and /etc/kubernetes/pki/etcd and delete all certificates and keys, EXCEPT those for the CA and SA
  3. Delete all .conf files located in /etc/kubernetes
  4. Run the following commands:
  • kubeadm init phase certs apiserver --apiserver-advertise-address <new host ip> --apiserver-cert-extra-sans <new host ip> (for AWS you may need to add --node-name ip-xx-xx-xx-xx.ec2.internal for the certificate to be valid)
  • kubeadm init phase certs apiserver-kubelet-client
  • kubeadm init phase certs apiserver-etcd-client
  • kubeadm init phase certs etcd-healthcheck-client
  • kubeadm init phase certs etcd-peer
  • kubeadm init phase certs etcd-server
  • kubeadm init phase certs front-proxy-ca
  • kubeadm init phase certs front-proxy-client
  • kubeadm init phase kubeconfig all --apiserver-advertise-address <new host ip> (for AWS you may need to add --node-name ip-xx-xx-xx-xx.ec2.internal)
  5. Locate the kubeadm.conf file in the /opt/replicated folder, edit it, and replace all instances of the old IP with the new host IP.
  6. Run the command kubeadm init --config=/opt/replicated/kubeadm.conf --ignore-preflight-errors=All
  7. After step 6 finishes, run kubectl get nodes -o wide and check that the new IP is shown as the node internal IP. If it is not:
  • add --node-ip <new ip> as a parameter in the kubelet service file
  • run systemctl daemon-reload and systemctl restart kubelet.

If the node is up and running with the new IP, but containers are stuck in the creating or init state with an error similar to the following, continue with the steps below:

Warning DNSConfigForming 25s kubelet, ngrpa-ap-prd-03 Search Line limits were exceeded, some search paths have been omitted, the applied search line is: default.svc.cluster.local svc.cluster.local cluster.local
Warning FailedCreatePodSandBox 24s kubelet, ngrpa-ap-prd-03 Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "e6d31a898f86c1aa402c6e75c3f33553ad5b7f9c9c67a06d7de8f3b7ddb17125" network for pod "kotsadm-976f7bf85-lkbnp": networkPlugin cni failed to set up pod "kotsadm-976f7bf85-lkbnp_default" network: unable to allocate IP address: Post dial tcp connect: connection refused, failed to clean up sandbox container "e6d31a898f86c1aa402c6e75c3f33553ad5b7f9c9c67a06d7de8f3b7ddb17125" network for pod "kotsadm-976f7bf85-lkbnp": networkPlugin cni failed to teardown pod "kotsadm-976f7bf85-lkbnp_default" network: Delete dial tcp connect: connection refused]
Normal SandboxChanged 11s (x2 over 24s) kubelet, ngrpa-ap-prd-03 Pod sandbox changed, it will be killed and re-created.

  1. Run the command sudo docker ps -a to list all Docker containers.
  2. If you see many exited containers, run the command sudo docker container prune to delete them; the accumulated exited containers are the reason IP addresses can no longer be assigned.
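A minimal sketch of the listing and pruning steps above (the --filter and -f flags are standard Docker CLI options; -f only skips the interactive confirmation prompt):

```shell
# List only the exited containers, so you can see how many accumulated
sudo docker ps -a --filter status=exited --format '{{.ID}} {{.Names}} {{.Status}}'

# Remove all stopped containers; -f skips the "Are you sure?" prompt
sudo docker container prune -f
```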

If the issue persists after pruning the containers, check whether the weave pod in the kube-system namespace is up and running. If the pod is not running, further debugging is needed to determine the reason.
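To check the weave pod, a sketch along these lines can be used. It assumes a standard Weave Net deployment, where the DaemonSet pods carry the label name=weave-net; adjust the label or pod name if your install differs.

```shell
# Show the CNI (Weave Net) pods in kube-system and the node they run on
kubectl get pods -n kube-system -l name=weave-net -o wide

# If a pod is not Running, look at recent events and container logs
kubectl describe pod -n kube-system -l name=weave-net
kubectl logs -n kube-system -l name=weave-net -c weave --tail=50
```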