
The great misunderstanding of MPIO

Dual HBAs, dual fabrics, redundant cache, RAID-ed disks, dual controllers or switched matrices, HA X-bars, multipath software installed, and all OS drivers, firmware, microcode etc. are up-to-date. In other words, you're all sorted and you can sleep well tonight.

And then Murphy strikes…

As I've described in my previous articles, it takes one single misbehaving device to really screw up a storage environment. Congestion and latency will, at some point in time, cause FC frames to go into the bit-bucket, thereby causing one or multiple IO errors. So what exactly is an IO error?

When an application wants to read or write data it does this (in the open-systems world) via a SCSI command. (I'll leave the device-specific commands for later.)
This command is then mapped at the FC4 layer into FC frames which travel via the FC network to the target.
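
To make that concrete, here is a minimal Python sketch of what a SCSI READ(10) command descriptor block carries before the FC4 layer wraps it into frames. The helper name is mine, but the byte layout (opcode, 32-bit LBA, 16-bit transfer length in blocks) is the standard one:

```python
import struct

def build_read10_cdb(lba: int, num_blocks: int) -> bytes:
    """Build a SCSI READ(10) CDB: opcode 0x28, a 32-bit logical
    block address and a 16-bit transfer length (in blocks)."""
    return struct.pack(">BBIBHB",
                       0x28,        # READ(10) opcode
                       0,           # flags (RDPROTECT/DPO/FUA) left at 0
                       lba,         # starting logical block address
                       0,           # group number
                       num_blocks,  # transfer length in blocks
                       0)           # control byte

# Read 16 blocks of 512 bytes (an 8KB IO) starting at LBA 2048.
cdb = build_read10_cdb(lba=2048, num_blocks=16)
assert len(cdb) == 10
```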

So let's take for example a database application that needs to read a piece of data. This is never done in chunks of a couple of bytes, like single rows, but always with a certain IO size, which depends on the configuration of the application. For argument's sake let's assume the database uses 8KB IO sizes. The read command issued against a LUN at the SCSI layer more or less outlines the LUN id, the offset and the block count from that offset. So for a single read request an 8KB read is done on the array.

Since a fibre channel frame holds only about 2KB of payload, this IO is split into 4 FC frames which are linked via so-called sequence IDs. (I'll spare you the entire handling of exchanges, sequences etc.) If one of these frames is dropped somewhere along the way we're missing 2K out of the total 8K. This means that, for example, frames 1, 2 and 4 have arrived back at the HBA, but before the HBA can forward the data to the SCSI layer it has to wait for frame 3 to arrive so it can re-assemble the full IO. If frame 3 was dropped for whatever reason, the HBA has to wait for a pre-determined time before it flags the IO as incomplete, marks the entire FC exchange as invalid and sends an abort message with a certain status code to the SCSI layer. This triggers the SCSI layer to retry the IO, which consumes the same resources on the host, the FC fabric and the storage array as the original request. You can imagine this can, and on many occasions will, cause performance issues or, if the retries keep failing, an application failure.
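
A toy Python model of this reassembly-and-retry behaviour (all names and the drop probability are invented for illustration; real HBA firmware obviously doesn't look like this) shows why one lost 2KB frame costs the whole 8KB exchange again:

```python
import random

FRAME_PAYLOAD = 2048          # ~2KB usable payload per FC frame
IO_SIZE       = 8192          # the database's 8KB read

def send_exchange(drop_probability: float) -> dict:
    """Split one 8KB IO into 2KB frames and 'receive' them back.
    Returns only the frames that actually arrived, keyed by sequence count."""
    frame_count = IO_SIZE // FRAME_PAYLOAD          # 4 frames
    return {seq: b"\x00" * FRAME_PAYLOAD
            for seq in range(frame_count)
            if random.random() >= drop_probability}

def read_with_retries(max_retries: int = 3) -> bytes:
    """Retry the whole exchange until all frames arrive. Every retry
    costs the same host, fabric and array resources as the original."""
    for attempt in range(1, max_retries + 1):
        frames = send_exchange(drop_probability=0.25)
        if len(frames) == IO_SIZE // FRAME_PAYLOAD:
            return b"".join(frames[seq] for seq in sorted(frames))
        # e.g. frames 0, 1 and 3 arrived but frame 2 did not: the HBA
        # times out, aborts the exchange and the SCSI layer retries.
        print(f"attempt {attempt}: only frames {sorted(frames)} arrived, retrying")
    raise IOError("IO incomplete after retries -> application error")

data = read_with_retries()
```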

Now when you look at the above traffic flow, all this time there has not been a single indication that the actual physical or logical path between the HBA and the storage port has disappeared. No HBA, storage or switch port has gone offline. The above was just the result of frames being dropped due to congestion, latency or any other reason. This will not trigger any MPIO software to logically remove a path, so it will just keep on sending IOs to the target over a path that may be somewhat erroneous. Again, it is NOT the purpose of MPIO to monitor and act upon IO errors.

If you are able to identify which path observes these errors you can disable that path from the MPIO software and fix the problem path at your earliest convenience. As I mentioned above, this kind of behaviour very often occurs during Murphy time, i.e. your least convenient time. This means you will get called during your beauty sleep at 3:00AM with a message that your entire ERP application is down and that 4 factories and 3 logistics distribution centres are picking their nose at $20,000 a minute.
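
Purely as an illustration (hypothetical path names and an assumed error threshold, not a real MPIO API), the kind of per-path bookkeeping you would act on looks like this:

```python
# Hypothetical per-path IO error counters collected from the host logs.
path_errors = {
    "hba0->array_ctl0": 0,
    "hba0->array_ctl1": 0,
    "hba1->array_ctl0": 137,   # the suspect path
    "hba1->array_ctl1": 2,
}

ERROR_THRESHOLD = 10  # assumed tolerance before we pull a path

for path, errors in path_errors.items():
    if errors > ERROR_THRESHOLD:
        # In practice: take this path offline in your MPIO software
        # (HDLM, PowerPath, DMP, ...) and repair it at your convenience.
        print(f"disable {path}: {errors} IO errors observed")
```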

So what happens when a real path problem is observed? Basically it means that a physical or logical issue occurred somewhere down the line. This can be a physical issue like a broken cable or SFP, but also a bit- or word-synchronisation issue between two ports in that path. This triggers the switch to send a so-called RSCN (Registered State Change Notification) to all ports in the same fabric and zone as the one that observed the problem. (This also depends on the RSCN state registration of those devices, but these are 99% of the time OK.) This RSCN contains all 24-bit fabric addresses which are affected. (There can be more than one of course when ISLs are involved.)
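
Those 24-bit fabric addresses decompose into a domain, area and port byte, so a small sketch of pulling one apart looks like this:

```python
def decode_fcid(fcid: int) -> tuple[int, int, int]:
    """Split a 24-bit FC address (D_ID) into its
    Domain / Area / Port (or AL_PA) bytes."""
    domain = (fcid >> 16) & 0xFF   # the switch the port lives on
    area   = (fcid >> 8) & 0xFF    # typically the port group / blade
    port   = fcid & 0xFF           # the individual port (or AL_PA)
    return domain, area, port

# Example: FC ID 0x010400 -> domain 1, area 4, port 0
print(decode_fcid(0x010400))
```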

As soon as this RSCN arrives at the initiator, the HBA will disassemble it and notify the upper layer of this change. This is done with different status codes than the IO errors I described above. Based upon the 24-bit fabric IDs, MPIO can then determine which path to that particular target and LUN was affected and take it offline. There can still be one or more IO errors, depending on how many IOs were in flight when the error occurred.
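
Conceptually (a toy model, not how any particular MPIO product is implemented) that lookup boils down to something like:

```python
# Toy path table: which 24-bit fabric address each logical path uses.
paths = {
    "path0": {"target_fcid": 0x010400, "state": "online"},
    "path1": {"target_fcid": 0x020500, "state": "online"},
}

def handle_rscn(affected_fcids: set[int]) -> None:
    """On an RSCN, take every path whose target address is in the
    affected list offline; the remaining paths keep carrying the IO."""
    for name, path in paths.items():
        if path["target_fcid"] in affected_fcids:
            path["state"] = "offline"
            print(f"{name} taken offline (FCID {path['target_fcid']:#08x})")

# The RSCN reports that 0x020500 is affected.
handle_rscn({0x020500})
```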

So what is the solution? As always, the best way is to prevent this kind of troublesome scenario. Make sure you keep an eye on error counters and immediately fix misbehaving devices. If for some reason a device starts to behave this way during your beauty sleep, you need to make sure beforehand that it will not further impact the rest of the environment. You can do this by disabling the port, either on the switch, the HBA or the storage array, depending on where the problem is observed. Use tools that are built into software like NX-OS or FOS to identify these troublesome links and disable them with features like port fencing. Although this might still have some impact, it is nothing compared to an ongoing issue which might take hours or even days to identify.
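
A minimal sketch of the port-fencing idea, assuming made-up counter names and limits; on real gear you would use the built-in FOS or NX-OS features rather than rolling your own:

```python
# Assumed per-port link error counters, e.g. scraped from the switch.
port_counters = {
    "port4/7":  {"crc_errors": 0,    "link_failures": 0},
    "port4/12": {"crc_errors": 5821, "link_failures": 14},  # misbehaving
}

FENCE_LIMITS = {"crc_errors": 100, "link_failures": 5}  # assumed limits

def check_and_fence() -> list[str]:
    """Return the ports that exceed any limit and should be disabled
    (fenced) before they drag the rest of the fabric down with them."""
    fenced = []
    for port, counters in port_counters.items():
        if any(counters[c] > limit for c, limit in FENCE_LIMITS.items()):
            fenced.append(port)
            print(f"fencing {port}: {counters}")
    return fenced

check_and_fence()
```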

As always, use the manuals to determine how to set this up. If you're inside the HDS network you can access a tool I wrote that makes generating the port fencing configuration very easy. Send me an email if you're interested.

I hope this explains the difference between IO errors and path problems with respect to MPIO, and removes some of the confusion about what MPIO is intended to do.

Kind regards,
Erwin

P.S. For those unaware, MPIO (Multi-Path IO) is software that maps multiple paths to a target and its LUNs to a single logical entity on the host, so it can use all those paths to address that target/LUN. Software like Hitachi Dynamic Link Manager, EMC PowerPath, HP SecurePath and Veritas DMP falls into this category.