I came across a very odd issue lately where guests on one of our ESX4 hosts were periodically losing network connectivity very briefly, dropping maybe 10 ICMP packets every half hour or hour.
I spent a long time debugging the network side, thinking that perhaps a NIC was misconfigured with the wrong VLAN, but the problem kept happening.
So I ssh’d onto the host, started to trawl through the log files, and came across the entries below in the /var/log/vmkwarning file:
Feb 17 13:44:19 vminfrabox vmkernel: 18:00:00:11.865 cpu4:4222)WARNING: NMP: nmp_DeviceAttemptFailover: Retry world failover device “naa.6090a028004f243d08ab44c26687e3dd” – issuing command 0x410002074040
Feb 17 13:44:19 vminfrabox vmkernel: 18:00:00:11.865 cpu4:4222)WARNING: NMP: nmp_DeviceAttemptFailover: Retry world failover device “naa.6090a028004f243d08ab44c26687e3dd” – failed to issue command due to Not found (APD), try again…
Feb 17 13:44:19 vminfrabox vmkernel: 18:00:00:11.865 cpu4:4222)WARNING: NMP: nmp_DeviceAttemptFailover: Logical device “naa.6090a028004f243d08ab44c26687e3dd”: awaiting fast path state update…
This was occurring every half hour, and each time the entries above filled the log solidly for about 2 minutes.
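If you want to confirm that sort of pattern without eyeballing the log, something like the rough Python sketch below can group the warnings into bursts per device. It is only an illustration, meant to be run on a copy of the log on a workstation rather than on the host itself. The function name `failover_bursts`, the 5-minute `gap_seconds` threshold and the default `year` are my own assumptions (syslog timestamps carry no year), and the pattern only matches lines shaped like the ones quoted above.

```python
import re
from collections import defaultdict
from datetime import datetime

# Matches the syslog timestamp and the naa.* device ID in NMP failover warnings.
# Assumption: lines look like the vmkwarning entries quoted in this post.
LINE_RE = re.compile(
    r'(?P<ts>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) .*'
    r'nmp_DeviceAttemptFailover: .*?(?P<dev>naa\.[0-9a-f]+)'
)


def failover_bursts(lines, gap_seconds=300, year=2010):
    """Group failover warnings per device into bursts.

    A new burst starts when a warning arrives more than gap_seconds after
    the previous warning for the same device. The year is an assumption,
    because syslog timestamps do not include one.
    Returns {device: [[first_seen, last_seen, line_count], ...]}.
    """
    bursts = defaultdict(list)
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # not a failover warning
        ts = datetime.strptime(f"{year} {match.group('ts')}", "%Y %b %d %H:%M:%S")
        device_bursts = bursts[match.group("dev")]
        if device_bursts and (ts - device_bursts[-1][1]).total_seconds() <= gap_seconds:
            # Still inside the current burst: extend it.
            device_bursts[-1][1] = ts
            device_bursts[-1][2] += 1
        else:
            # First warning for this device, or a long quiet gap: new burst.
            device_bursts.append([ts, ts, 1])
    return dict(bursts)
```

Calling `failover_bursts(open("vmkwarning"))` on a copy of the log should give, for each device, a list of bursts with their start time, end time and line count. If the start times for a device sit roughly 30 minutes apart, you are probably looking at the same thing I was.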
After some digging on Google, I found out that ESX4 has a bug involving duff or old connections to an iSCSI LUN, perhaps one that no longer exists but that you never rescanned to remove. When the host checks its paths every 30 minutes, it finds the duff connection and goes through the motions of trying to find failover paths. The bug is that this causes a very brief loss of network connectivity for your guests.
The fix for me was simply to rescan my adapter, which removed the old mapping to one of our removed LUNs, and the problem went away.