VMware ESX4 guests losing network connectivity briefly

I recently came across a very odd issue where guests on one of our ESX4 hosts were periodically losing network connectivity very briefly: roughly 10 dropped ICMP packets every half hour to an hour.

I spent a long time debugging the network side, suspecting a NIC with the wrong VLAN config, but the problem kept happening.

So I ssh'd onto the host, started trawling through the log files, and came across the following in /var/log/vmkwarning:

Feb 17 13:44:19 vminfrabox vmkernel: 18:00:00:11.865 cpu4:4222)WARNING: NMP: nmp_DeviceAttemptFailover: Retry world failover device "naa.6090a028004f243d08ab44c26687e3dd" - issuing command 0x410002074040
Feb 17 13:44:19 vminfrabox vmkernel: 18:00:00:11.865 cpu4:4222)WARNING: NMP: nmp_DeviceAttemptFailover: Retry world failover device "naa.6090a028004f243d08ab44c26687e3dd" - failed to issue command due to Not found (APD), try again...
Feb 17 13:44:19 vminfrabox vmkernel: 18:00:00:11.865 cpu4:4222)WARNING: NMP: nmp_DeviceAttemptFailover: Logical device "naa.6090a028004f243d08ab44c26687e3dd": awaiting fast path state update...

This recurred every half hour: each time, entries like the ones above filled the log solidly for about 2 minutes.
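If you want to see which devices the warnings are complaining about without reading the whole file, something like the following rough Python 3 sketch can help. It is not a VMware tool; the function name `count_failover_devices` is made up for this post, and it assumes the warnings look like the lines quoted above.

```python
import re
from collections import Counter

# Pull the naa.* identifier out of an nmp_DeviceAttemptFailover warning.
# The quotes around the identifier are deliberately not matched, since they
# may be straight or curly depending on how the log was copied.
DEVICE_RE = re.compile(r"nmp_DeviceAttemptFailover:.*?(naa\.[0-9a-f]+)")


def count_failover_devices(lines):
    """Count failover warnings per device identifier (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = DEVICE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Copy /var/log/vmkwarning to a workstation with Python 3, open it, and pass the file object to `count_failover_devices`. A device that shows up in large numbers but that you no longer recognise is a good candidate for a stale LUN mapping.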

After some digging on Google, I found out that ESX4 has a bug: if the host still has a duff or stale connection to an iSCSI LUN (perhaps one that no longer exists, but you never rescanned to remove it), then when the host checks its paths every 30 minutes it finds the dead connection and goes through the motions of trying to find failover paths. The bug is that this causes a very brief loss of network connectivity for your guests.
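To confirm that your own warnings follow the same 30-minute rhythm, you can group them into bursts and measure the gaps between burst starts. This is a minimal Python 3 sketch under a few assumptions: the helper names are invented for this post, syslog timestamps carry no year so one is supplied (2010 here is only a default), and any pause longer than 5 minutes is treated as the start of a new burst.

```python
import re
from datetime import datetime, timedelta

# Syslog-style timestamp at the start of the line, e.g. "Feb 17 13:44:19".
STAMP_RE = re.compile(r"^([A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2})")


def burst_starts(lines, year=2010, quiet=timedelta(minutes=5)):
    """Return the first timestamp of each burst of failover warnings."""
    starts = []
    previous = None
    for line in lines:
        if "nmp_DeviceAttemptFailover" not in line:
            continue
        match = STAMP_RE.match(line)
        if not match:
            continue
        # The year is an assumption; the log itself does not record it.
        stamp = datetime.strptime(f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S")
        if previous is None or stamp - previous > quiet:
            starts.append(stamp)
        previous = stamp
    return starts


def gaps_in_minutes(starts):
    """Minutes between consecutive burst starts."""
    return [(b - a).total_seconds() / 60 for a, b in zip(starts, starts[1:])]
```

If the gaps come out close to 30 minutes, that fits the periodic path check described above rather than a random network fault.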

The fix for me was simply to rescan my storage adapter, which removed the old mapping to one of our removed LUNs, and the problem went away.


Comments

  1. On March 17, 2010 Eric Snell says:

    We just had this problem today and it affected all of our production VM hosts (over 30 guests) every 30 minutes. The solution above works; however, to make sure the problem is really gone, we have to do a clean reboot of the ESX servers that still see the dead path.

    VMware has a patch for this problem now, which we are going to apply after we get this initial problem fixed.
