The Insecure Wire

a network engineers perspective

Troubleshooting VMware lost access to volume errors

Recently we have been seeing heaps of these types of errors fill up our VMware ESXi logs:

2015-07-02T02:00:11.282Z cpu10:36273)HBX: 2832: Waiting for timed out [HB state abcdef02 offset 3444736 gen 549 stampUS 115704005679 uuid 5592d754-21d7d8a7-0a7e-a0369f519998 jrnl drv 14.60] on vol 'example datastore'

And in VCenter task / events for host:

Lost access to volume 5592d754-21d7d8a7-0a7e-a0369f519998 (LUN 211) following connectivity issues. Recovery attempt in progress and outcome will be reported shortly.
Successfully restored access to volume 5592d754-21d7d8a7-0a7e-a0369f519998 (LUN 211) following connectivity issues.

When ESXi disconnects and reconnects storage consistently – there is a problem (obviously). Even in milliseconds this will impact performance of your workloads. I tried to follow the support steps in VMware KB 2136081, to no avail.

So to break down this issue logically, it can be one of three things:
1. The host (drivers, firmware, hardware etc)
2. The SAN – EMC VNX array
3. The switching fabric – we use Nexus switches with FCoE to the UCS fabric interconnects – storage and data share the same switching.

Hosts
I ruled out the UCS hosts by installing the newest storage drivers for ESXi 6.x. I also setup a new LUN and made sure no VMs were on it – meaning no I/O and still the driver is reporting the dreaded time out error message.
Storage
Next was the SAN, since its managed with EMC Support, I was able to log an SR with support for the storage processor log analysis. This came back with the following:

[RSCN : Registered State Change Notification:]
SPB Interface:1 RSCN count:4
SPB Interface:0 RSCN count:4

fffffd : well known address of the Management server on the switch:

B 07/07/19 14:56:46 364afc0 FCOEDMQL 0 (FE4) DVM RSCN Recd - Id: 0xfffffd, Device State: 0x3
B 07/07/19 14:56:46 36bbfc0 FCOEDMQL 1 (FE5) DVM RSCN Recd - Id: 0xfffffd, Device State: 0x3
B 07/07/19 14:58:47 36bbfc0 FCOEDMQL 1 (FE5) DVM RSCN Recd - Id: 0xfffffd, Device State: 0x3
B 07/07/19 14:58:47 364afc0 FCOEDMQL 0 (FE4) DVM RSCN Recd - Id: 0xfffffd, Device State: 0x3

For further information:
RSCN and aborts due to DEVICE_RESET for reason KILLING_RSCN_AFFECTED_DEVICE
When a RSCN (Registered State Change Notification) is received, the array makes a comparison of the nodes that are "logged in" as compared to the list sent from the name server (the switch).
The RSCN is a notification delivered from the switch to the array telling the array that something happened on the fabric.
The array responds to this by deleting any connection affected by the RSCN notification. The host would need to log in again.

So from this I was able to deduce that the switching was sending ‘RSCN notification’ to both SAN SPB ports. This is causing the behavior in ESXi wherein access to LUNs is timing out and reconnecting and repeating. At this point we know there is something wrong with the switching fabric.

Switching Fabric
To find the issue on the fabric switching I went through standard troubleshooting techniques. Check physical layer first. Everything was connected and UP/UP.
Digging further through the logging on the Nexus switching I saw that from cold boot there were SFP+ transceivers reported as not supported. Even though the ports were UP and the unsupported command was entered in the switch config.
As above the RCSN logout errors where only happening on certain ports on the SAN side. I checking all four down links to the san from the nexus side – they all had third party SFPs. I decided to switch them all out for Cisco branded SR SFP+.

And with that the issue was resolved. Third party optics worked with this setup for years, I suspect 2/4 were faulty. Either way its a lesson for critical links it’s probably best to use the branded optics where possible.