How to work around packet drop issue when setting up multi-node Automation Suite

If you are setting up a multi-node Automation Suite on VMware vSphere nodes and, after finishing the Kubernetes setup (-k), the installer gets stuck in the fabric installation (-f), the first thing to check is port availability, as described in the following page:

If you are certain that all required ports are open, including 8472/UDP, the cause may be a bug in VMware's ESXi kernel that randomly drops some UDP packets, which in turn causes problems for the longhorn-csi-plugin pods. If you suspect this, try disabling TX checksum offload on both the node's network adapter and the cilium_host interface.
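
To see which nodes are affected, one option is to list the pods in the longhorn-system namespace together with their node and container waiting reason, then filter for CrashLoopBackOff. This is only a sketch: the helper name crashing_csi_nodes is made up for illustration, and it assumes the three-column NAME, NODE, REASON layout produced by the custom-columns expression shown in the usage comment.

```shell
# Hypothetical helper: reads "NAME NODE REASON" rows on stdin and prints the
# unique node names whose longhorn-csi-plugin pod is waiting in CrashLoopBackOff.
crashing_csi_nodes() {
  awk '$1 ~ /^longhorn-csi-plugin/ && $3 ~ /CrashLoopBackOff/ { print $2 }' | sort -u
}

# Usage against a live cluster (requires kubectl access):
# kubectl get pods -n longhorn-system --no-headers \
#   -o custom-columns='NAME:.metadata.name,NODE:.spec.nodeName,REASON:.status.containerStatuses[*].state.waiting.reason' \
#   | crashing_csi_nodes
```

Custom columns are used here instead of the default table because the RESTARTS column can contain spaces in newer kubectl versions, which makes position-based parsing of the wide output unreliable.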

Run the following commands on every node whose longhorn-csi-plugin pod is in the CrashLoopBackOff state:

$ sudo ethtool -K <eth_interface> tx-checksum-ip-generic off
$ sudo ethtool -K cilium_host tx-checksum-ip-generic off

Replace <eth_interface> with the actual interface name on the node, which is often ens192, ens256, or similar. An easy way to list the network interfaces present on the node is to run the ifconfig command (or ip link if ifconfig is not installed).
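
If it helps, the interface names can be extracted from ip output, and the current offload setting can be checked before and after the change with ethtool -k (lowercase k, which only reads the settings). The helper names iface_names and show_tx_checksum below are made up for this sketch, and ens192 is only an example interface name.

```shell
# Hypothetical helper: extracts interface names from `ip -o link show` output
# on stdin, dropping any "@peer" suffix (for example cilium_host@cilium_net).
iface_names() {
  awk -F': ' '{ sub(/@.*/, "", $2); print $2 }'
}

# Hypothetical helper: prints the tx-checksum-ip-generic line for one interface.
show_tx_checksum() {
  ethtool -k "$1" | grep 'tx-checksum-ip-generic'
}

# Usage on a node:
# ip -o link show | iface_names
# show_tx_checksum ens192        # ens192 is only an example name
# show_tx_checksum cilium_host
```

After the ethtool -K commands above have been applied, the reported value for both interfaces should read off.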

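Once you have run these commands on all the affected nodes, restart the longhorn-csi-plugin pods by deleting them. The pods are managed by a DaemonSet rather than a Deployment, so delete them by label (this assumes the default Longhorn label app=longhorn-csi-plugin):

$ kubectl delete pod -n longhorn-system -l app=longhorn-csi-plugin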

If successful, all the pods should come back up in the Running state.
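
One way to confirm this is to filter the default kubectl pod listing for any longhorn-csi-plugin pod whose STATUS column is not Running. This is a minimal sketch; the helper name not_running_csi_pods is hypothetical, and it relies on STATUS being the third column of the default `kubectl get pods --no-headers` output.

```shell
# Hypothetical helper: reads default `kubectl get pods --no-headers` rows on
# stdin and prints the names of longhorn-csi-plugin pods whose STATUS column
# (third field) is anything other than Running.
not_running_csi_pods() {
  awk '$1 ~ /^longhorn-csi-plugin/ && $3 != "Running" { print $1 }'
}

# Usage against a live cluster (requires kubectl access):
# kubectl get pods -n longhorn-system --no-headers | not_running_csi_pods
```

An empty result means every longhorn-csi-plugin pod reports Running. The STATUS column is used rather than .status.phase because a pod in CrashLoopBackOff can still report the Running phase.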
