Wednesday, April 8, 2015

vMotion Fails - Failed to connect to remote host. Network unreachable.

When migrating a VM using vMotion, the migration may stall at 14% and eventually fail with the following error:

Migration [xxxx:xxxx] failed to connect to remote host <x.x.x.x> from host <y.y.y.y>: Network unreachable.

Usually this is a pretty straightforward fix: correct whatever network issue is preventing communication between the vmkernel ports.  However, I recently encountered an issue where the network was configured properly, traffic was flowing, and vMotion still failed.

Everything with the multi-NIC vMotion config checked out (see the verification commands after the list):
  • Two separate VMkernel ports on the relevant vSwitch with IPs on the same subnet.
  • One vmnic active and one standby for each VMkernel port.
  • Active/standby adapters on the second VMkernel port were the inverse of the first.
  • vMotion enabled on the vMotion VMkernel ports.
  • 9000 MTU on each vSwitch and VMkernel port.
  • 9000 MTU on the relevant physical switch and switchports.
  • Relevant switchports tagged for the appropriate VLAN.
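
As a quick sanity check, the host-side items can all be verified from the ESXi shell. This is just a sketch; vSwitch1, vMotion-1, and vMotion-2 are placeholder names, so substitute your own:

# VMkernel interfaces: enabled state, MTU, and IPv4 addressing
esxcli network ip interface list
esxcli network ip interface ipv4 get

# vSwitch MTU and uplinks (vSwitch1 is a placeholder name)
esxcli network vswitch standard list -v vSwitch1

# Active/standby uplink order for each vMotion portgroup (placeholder names)
esxcli network vswitch standard portgroup policy failover get -p vMotion-1
esxcli network vswitch standard portgroup policy failover get -p vMotion-2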



(The configuration is pretty straightforward, as outlined in the VMware KB: kb.vmware.com/kb/2007467)


In addition, vmkping tests from the affected host succeeded.  Each vMotion VMkernel port on host A was able to ping each vMotion VMkernel port on host B with a packet size of 8972 bytes (the 9000-byte MTU minus 28 bytes of IP and ICMP headers) and the don't-fragment bit set:
[root@pe1:~] vmkping -I vmk2 -s 8972 -d 10.24.4.4
PING 10.24.4.4 (10.24.4.4): 8972 data bytes
8980 bytes from 10.24.4.4: icmp_seq=0 ttl=64 time=0.571 ms
8980 bytes from 10.24.4.4: icmp_seq=1 ttl=64 time=0.860 ms
8980 bytes from 10.24.4.4: icmp_seq=2 ttl=64 time=0.590 ms

--- 10.24.4.4 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.571/0.674/0.860 ms
[root@pe1:~] vmkping -I vmk2 -s 8972 -d 10.24.4.5
PING 10.24.4.5 (10.24.4.5): 8972 data bytes
8980 bytes from 10.24.4.5: icmp_seq=0 ttl=64 time=0.588 ms
8980 bytes from 10.24.4.5: icmp_seq=1 ttl=64 time=0.666 ms
8980 bytes from 10.24.4.5: icmp_seq=2 ttl=64 time=0.625 ms

--- 10.24.4.5 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.588/0.626/0.666 ms
[root@pe1:~] vmkping -I vmk3 -s 8972 -d 10.24.4.4
PING 10.24.4.4 (10.24.4.4): 8972 data bytes
8980 bytes from 10.24.4.4: icmp_seq=0 ttl=64 time=0.538 ms
8980 bytes from 10.24.4.4: icmp_seq=1 ttl=64 time=0.706 ms
8980 bytes from 10.24.4.4: icmp_seq=2 ttl=64 time=0.909 ms

--- 10.24.4.4 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.538/0.718/0.909 ms
[root@pe1:~] vmkping -I vmk3 -s 8972 -d 10.24.4.5
PING 10.24.4.5 (10.24.4.5): 8972 data bytes
8980 bytes from 10.24.4.5: icmp_seq=0 ttl=64 time=0.593 ms
8980 bytes from 10.24.4.5: icmp_seq=1 ttl=64 time=0.653 ms
8980 bytes from 10.24.4.5: icmp_seq=2 ttl=64 time=0.664 ms

--- 10.24.4.5 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.593/0.637/0.664 ms
[root@pe1:~]

So if vMotion was configured properly, why did it fail?  It turned out that a non-vMotion VMkernel port had also been mistakenly configured for vMotion: in this case, one of the iSCSI VMkernel ports.

So even though the vMotion VMkernel ports, the vSwitch, and the physical switch were all configured correctly, that one mistakenly checked box broke vMotion.
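
If you'd rather check for this from the shell than click through each port's properties, the vMotion tag on each VMkernel interface can be listed and the stray one removed (a sketch, assuming ESXi 5.1 or later; vmk1 stands in for the misconfigured iSCSI interface):

# Show the services tagged on an interface (repeat for each vmk)
esxcli network ip interface tag get -i vmk1

# Remove the stray vMotion tag from the iSCSI interface
esxcli network ip interface tag remove -i vmk1 -t VMotion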

This was actually clearly stated in the logs:
Migration [169353221:1428542576352168] failed to connect to remote host <10.24.4.2> from host <10.24.32.5>: Network unreachable.
vMotion migration [169353221:1428542576352168] failed to create a connection with remote host <10.24.4.2>: The ESX hosts failed to connect over the VMotion network
The vMotion migrations failed because the ESX hosts were not able to connect over the vMotion network. Check the vMotion network settings and physical network configuration. 
The vMotion failed because the destination host did not receive data from the source host on the vMotion network. Please check your vMotion network settings and physical network configuration and ensure they are correct.
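
(For reference, these messages came from the host logs. Assuming the default ESXi log locations, something like the following would surface them:)

grep -i vmotion /var/log/vmkernel.log | tail
grep -i "failed to connect" /var/log/hostd.log | tail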

However, the IP scheme of the vMotion subnet was very similar to that of the iSCSI subnet, so I initially overlooked it.

This solution is also documented in the following KB article: kb.vmware.com/kb/2042654

"The incorrect vmkernel interface may be selected for vMotion. The ESX/ESXi host uses only the selected interface for vMotion. If a vmkernel interface is in the incorrect IP subnet, or if the physical network is not configured correctly, the vMotion vmkernel interface may not be able to communicate with the destination host."
