The great misunderstanding of MPIO

Dual HBAs, dual fabrics, redundant cache, RAID-ed disks, dual controllers or switched matrices, HA X-bars, multipath software installed, and all OS drivers, firmware, microcode, etc. are up to date. In other words, you’re all sorted and you can sleep well tonight.

And then Murphy strikes…

As I’ve described in my previous articles, it takes just one misbehaving device to really screw up a storage environment. Congestion and latency will, at some point, cause FC frames to go into the bit-bucket, causing one or more IO errors. So what exactly is an IO error?

When an application wants to read or write data it does this (in the open-systems world) via a SCSI command. (I’ll leave the device-specific commands for later.)
This command is then mapped at the FC-4 layer into FC frames, which then travel via the FC network to the target.

So let’s take, for example, a database application that needs to read a piece of data. This is never done in chunks of a couple of bytes, like single rows; it is always done with a certain IO size, which depends on the configuration of the application. For argument’s sake, let’s assume the database uses 8 KB IO sizes. The read command issued against a LUN on the SCSI layer more or less specifies the LUN ID, the offset and the block count from that offset. So for a single read request an 8 KB read is done on the array.

Since a Fibre Channel frame holds only about 2 KB of payload, this IO is split into 4 FC frames which are linked via so-called sequence IDs. (I’ll spare you the entire handling of exchanges, sequences, etc.) So if one of these frames is dropped somewhere along the way, we’re missing 2 KB out of the total 8 KB. This means that, for example, frames 1, 2 and 4 have arrived back at the HBA, but before the HBA can forward the data to the SCSI layer it has to wait for frame 3 to arrive so it can reassemble the full IO.

If frame 3 was dropped for whatever reason, the HBA has to wait for a predetermined time before it flags the IO as incomplete, marks the entire FC exchange as invalid and sends an abort message with a certain status code to the SCSI layer. This triggers the SCSI layer to retry the IO, which consumes the same resources on the system, FC fabric and storage array as the original request. You can imagine this can, and on many occasions will, cause performance issues or, if it keeps recurring, an application failure.
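To make the 8 KB example a bit more tangible, here is a minimal Python sketch of the reassembly problem. It is a toy model, not how any real HBA firmware works: the `Frame` class, `split_io` and `reassemble` are names I made up for illustration, the payload is rounded to 2048 bytes as in the text, and the timeout is simply assumed to have expired by the time `reassemble` is called.

```python
from dataclasses import dataclass

FRAME_PAYLOAD = 2048  # bytes per frame, rounded to "2 KB" as in the text


@dataclass(frozen=True)
class Frame:
    """Hypothetical, heavily simplified FC frame: just enough to reorder it."""
    seq_id: int    # identifies the sequence this frame belongs to
    seq_cnt: int   # position of the frame within that sequence (0-based)
    payload: bytes


def split_io(data: bytes, seq_id: int) -> list[Frame]:
    """Split one IO into frames of at most FRAME_PAYLOAD bytes each."""
    return [
        Frame(seq_id, count, data[offset:offset + FRAME_PAYLOAD])
        for count, offset in enumerate(range(0, len(data), FRAME_PAYLOAD))
    ]


def reassemble(frames: list[Frame], expected_frames: int):
    """Return (data, []) if every frame arrived, else (None, missing counts).

    This models the moment the HBA's wait timer has expired: one missing
    frame is enough to invalidate the whole IO, so nothing is handed up.
    """
    by_count = {frame.seq_cnt: frame.payload for frame in frames}
    missing = [n for n in range(expected_frames) if n not in by_count]
    if missing:
        return None, missing
    return b"".join(by_count[n] for n in range(expected_frames)), []


if __name__ == "__main__":
    io_data = bytes(8192)                       # one 8 KB read
    frames = split_io(io_data, seq_id=1)        # becomes 4 frames
    received = [f for f in frames if f.seq_cnt != 2]  # "frame 3" is dropped
    data, missing = reassemble(received, expected_frames=len(frames))
    print("IO complete" if data is not None else f"IO failed, missing {missing}")
```

Note that the three frames that did arrive are of no use at all: the complete 8 KB has to be requested again, which is exactly why a small number of dropped frames hurts so much.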

Now when you look at the above traffic flow, all this time there has not been a single indication that the actual physical or logical path between the HBA and the storage port has disappeared. No HBA, storage or switch port has gone offline. The above was just the result of frames being dropped due to congestion, latency or any other reason. This will not trigger any MPIO software to logically remove a path, and thus it will just keep on sending IOs to the target over a path that may be somewhat erroneous. Again, it is NOT the purpose of MPIO to monitor and act upon IO errors.

If you are able to identify which path observes these errors, you can disable this path in the MPIO software and fix the problem path at your earliest convenience. As I mentioned above, this kind of behaviour very often occurs during Murphy time, i.e. during your least convenient time. This means that you will get called during your beauty sleep at 3:00 AM with a message that your entire ERP application is down and that 4 factories and 3 logistics distribution centres are picking their noses at $20,000 a minute.

So what happens when a real path problem is observed? Basically it means that a physical or logical issue occurred somewhere down the line. This can be a physical issue like a broken cable or SFP, but also a bit or word synchronisation issue between two ports in that path. This will trigger the switch to send a so-called RSCN (Registered State Change Notification) to all ports in the same fabric and zone as the one that observed the problem. (Now, this also depends on the RSCN state registration of those devices, but these are 99% of the time OK.) This RSCN contains all 24-bit fabric addresses which are affected. (There can be more than one, of course, when ISLs are involved.)
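For those who have never looked at one of these 24-bit fabric addresses: it consists of three bytes, the domain (the switch), the area and the port. A small Python sketch of how such an address breaks down is shown here; the function name `decode_fcid` and the example address are mine and purely illustrative.

```python
def decode_fcid(fcid: int) -> tuple[int, int, int]:
    """Split a 24-bit Fibre Channel address into (domain, area, port)."""
    if not 0 <= fcid <= 0xFFFFFF:
        raise ValueError("a fabric address must fit in 24 bits")
    domain = (fcid >> 16) & 0xFF  # top byte: the switch domain ID
    area = (fcid >> 8) & 0xFF     # middle byte: the area
    port = fcid & 0xFF            # bottom byte: the port
    return domain, area, port


if __name__ == "__main__":
    # 0x010200 is a made-up address: domain 1, area 2, port 0
    print(decode_fcid(0x010200))
```

Because the domain byte identifies the switch, a single event on an ISL can affect many addresses at once, which is why an RSCN may list more than one.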

As soon as this RSCN arrives at the initiator, the HBA will disassemble it and notify the upper layer of this change. This is done with different status codes than the IO errors I described above. Based upon the 24-bit fabric IDs, MPIO can then determine which path to that particular target and LUN was affected and take it offline. There can still be one or more IO errors, as this depends on how many IOs were in flight during the error.
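The difference between the two kinds of events can be sketched in a few lines of Python. This is a deliberately naive model and not the logic of any real multipath product: `ToyMultipath`, `on_io_error` and `on_rscn` are hypothetical names, and a real stack will first work out what actually changed and apply its own timers before it fails a path.

```python
from dataclasses import dataclass


@dataclass
class Path:
    """One path to a target, identified here only by the target's address."""
    target_fcid: int
    online: bool = True
    io_errors: int = 0


class ToyMultipath:
    """Hypothetical multipath layer that reacts to path events only."""

    def __init__(self, target_fcids):
        self.paths = {fcid: Path(fcid) for fcid in target_fcids}

    def on_io_error(self, target_fcid: int) -> None:
        # A failed IO is only counted. The SCSI layer retries it and the
        # path stays online, because nothing says the path itself is gone.
        self.paths[target_fcid].io_errors += 1

    def on_rscn(self, affected_fcids) -> None:
        # A state change that names one of our targets takes that path
        # offline (simplified: no re-query of the fabric, no timers).
        for fcid in affected_fcids:
            if fcid in self.paths:
                self.paths[fcid].online = False

    def usable_paths(self) -> list[int]:
        return sorted(f for f, p in self.paths.items() if p.online)


if __name__ == "__main__":
    mpio = ToyMultipath([0x010200, 0x020200])
    for _ in range(50):
        mpio.on_io_error(0x010200)      # dropped frames, retried IOs
    print(mpio.usable_paths())          # both paths are still in use
    mpio.on_rscn([0x010200])            # a real path event
    print(mpio.usable_paths())          # only the other path remains
```

Fifty IO errors change nothing in the path selection, while a single state change notification does. That is the whole point of this article in a nutshell.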

So what is the solution? As always, the best way is to prevent this kind of troublesome scenario. Make sure you keep an eye on error counters and immediately fix the devices that show errors. If for some reason a device starts to behave this way during your beauty sleep, you need to make sure beforehand that it will not further impact the rest of the environment. You can do this by disabling a port on the switch, HBA or storage array, depending on where the problem is observed. Use the tools that are built into switch software like NX-OS or FOS to identify these troublesome links and disable them with features like port fencing. Although it might still have some impact, this is nothing compared to an ongoing issue which might take hours or even days to identify.
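The idea behind keeping an eye on error counters boils down to comparing two samples and flagging the ports whose counters grow too fast. Below is a rough Python sketch of that decision only. The port names, the counter values and the threshold are invented for the example, `ports_to_fence` is a hypothetical helper, and it says nothing about how NX-OS or FOS implement or configure this; use the manuals for that.

```python
def ports_to_fence(previous: dict[str, int],
                   current: dict[str, int],
                   threshold: int) -> list[str]:
    """Return the ports whose error counter grew by more than `threshold`
    between two samples.

    A port missing from `previous` is treated as starting from zero, and a
    counter that went down (e.g. after a reset) is never flagged.
    """
    flagged = []
    for port, value in current.items():
        delta = value - previous.get(port, 0)
        if delta > threshold:
            flagged.append(port)
    return sorted(flagged)


if __name__ == "__main__":
    # Made-up CRC error counters sampled one interval apart
    sample_before = {"port1": 0, "port2": 10, "port3": 500}
    sample_after = {"port1": 2, "port2": 310, "port3": 0}
    print(ports_to_fence(sample_before, sample_after, threshold=100))
```

The threshold matters: set it too low and a single hiccup disables a healthy link, set it too high and the misbehaving device keeps hurting everybody else.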

As always use the manuals to determine how to set this up. If you’re inside the HDS network you can access a tool I wrote to very easily generate the portfencing configuration. Send me an email about this if you’re interested.

I hope this explains the difference between IO errors and path problems with respect to MPIO a bit, and removes the confusion about what MPIO is intended to do.

Kind regards,
Erwin

P.S. For those unaware, MPIO (Multipath IO) is software that maps multiple paths to a target and its LUNs to a single logical entity on a host, so the host can use all those paths to address that target/LUN. Software like Hitachi Dynamic Link Manager, EMC PowerPath, HP SecurePath and Veritas DMP falls into this category.

12 responses on “The great misunderstanding of MPIO”

Yury Yavorsky 12/11/2014 at 03:15

Hello Erwin! Could I ask you to get back on this once more? You talk about RSCN as the trigger for taking a path offline. If so, why does the path stay online until a real I/O or SCSI TUR occurs? This is also stated in the HDLM documentation – “Without path health checking, an error cannot be detected unless an I/O operation is performed, because the system only checks the status of a path when an I/O operation is performed. With path health checking, however, the system can check the status of all online paths at regular intervals regardless of whether I/Os operations are being performed.”

evlonden 12/11/2014 at 12:41

Hello Yury,

When things go wrong in a fabric there is no guarantee that the RSCN will reach all ports in the fabric. The TUR or the real IO request is then a fallback method for MPIO to take the path offline. If the RSCN has reached the HBA or storage port you can safely assume that path will be taken offline immediately.

Hope this helps,

Regards,
Erwin
1. Yury Yavorsky 12/11/2014 at 16:05
  
  One situation in which the RSCN will not reach the port is when the link from the HBA port to the switch is itself broken. Could you tell me about other situations? I suppose that in a normal environment RSCN should work fine for the fabric events that the particular port is registered for (SCR value).
  1. evlonden 13/11/2014 at 10:40
    
    Hi Yury,
    
    Indeed. There has been a defect in FOS which inadvertently modified the SCR value in the nameserver registration. This caused RSCNs not to be sent to the device in question. Obviously the same thing can happen when a device has a firmware/driver bug and does not properly handle the SCR registration or does not act properly upon reception of an RSCN. These are, however, corner cases.
    1. Yury Yavorsky 14/11/2014 at 18:10
      
      Thanx, Erwin
2. Roman Sozinov 16/11/2014 at 05:10
  
  Erwin,
  
  In our environment (RHEL 5.10 with native MPIO) we have a situation where the host gets an RSCN (when a storage port goes offline) but does not block the path to that storage port immediately.
  
  It waits for dev_loss_tmo seconds (or another SCSI timeout value, depending on the SCSI error handler) instead.
  
  Are you saying that something is definitely wrong in our configuration, and that once the RSCN has been received MPIO has to block the path to that storage port immediately (without waiting for any SCSI EH timeouts)?
  1. evlonden 17/11/2014 at 13:41
    
    Hello Roman,
    
    No, your environment is behaving as expected. There are a number of reasons why a path is not marked as failed immediately: an RSCN may be sent for a number of reasons, and a failed link is only one of them. There are two timers which control this. One you already mentioned and the other is fast_io_fail_tmo. Please see the documentation on when and how to use these.
    
    Hope this helps.
    
    Kind regards,
    Erwin
    1. Roman Sozinov 17/11/2014 at 22:59
      
      Thanks Erwin for reply.
      I understand that RSCN may be sent due to a number of reasons.
      But let’s take a look at the storage-port-offline case – shouldn’t MPIO block failed paths immediately after it gets an RSCN saying the storage port is offline?
    2. evlonden 19/11/2014 at 11:30
      
      🙂 There is no RSCN for a “Port-Offline”. The RSCN can contain a few options, one of which is “Changed Name Server object”. This is the RSCN sent if any of the ports the HBA is zoned to changes its state in any way; it doesn’t necessarily have to be “Offline”. It is then up to the recipient of the RSCN to determine what has changed. This might take a while. There is also an RSCN which is “REMOVED OBJECT” but I’m not 100% sure if that one is currently used. It depends on the RSCN event qualifier being used by FOS. That in turn depends on the SCN registration the recipient has sent to the fabric controller upon NS registration.
    3. Roman Sozinov 19/11/2014 at 19:14
      
      Thanks again for clarification 🙂
      Just trying to identify which level (fabric, HBA driver or MPIO) is responsible for the latency problem we have in case of a storage port failure.

Erwin van Londen 02/10/2012 at 03:56

Hi Seb,

Indeed TUR is one option, a read-sector-0 another. The problem is that if only 0.5% of frames are affected you still have a 99.5% chance of the TUR/RS-0 succeeding. Secondly, if one of these TUR/RS-0s does go to la-la land, there will always be a retry of that same command, which also has a 99.5% chance of succeeding. As you said, this is not a very reliable way to determine an end-to-end path status.

I’m working with some folks in T11 to get this addressed but it might take a while since some new frames need to be introduced. These protocol changes are always cumbersome, and even once they are ratified we still need the vendors to implement them. It might still take a few years.

Cheers,
Erwin

Sebastian Thaele (@Zyrober) 01/10/2012 at 11:46

Good article! It’s a good wake-up call for admins relying on an assumed “intelligence” in the multipather to save them not only from hard connectivity problems but also from “soft” problems like much longer latency or frame drops. Indeed there is an additional mechanism (often called “Heartbeat” or “Healthcheck”) in many multipathers. It sends TURs (SCSI Test Unit Ready) repeatedly and could, for example, set the path to dead if the TUR fails. But in the field, if you have a performance problem, often only a small number of frames are really dropped – although this “small number” would be a pain, too, as it will eventually lead to error recovery with all its nasty timeouts. If the TUR was not affected, the multipather just doesn’t recognize it. And even when it hits a TUR and the path is seen as “not reliable anymore” – just some successfully completed TURs later it will be used again although the performance problem is still there. So unfortunately TURs are not the solution to the problem.
In the mainframe world there are some new mechanisms to cope with the problem as Brocade’s Dr. Guendert explains here: http://community.brocade.com/community/brocadeblogs/mainframe/blog/2012/09/14/the-ibm-zenterprise-ec12-brocade-dcx-8510-sportscar-tire-extraploation

Let’s see what the open world guys come up with.
