One rotten apple spoils the bunch – 2

As mentioned in my previous post, it only takes a single device to cause some serious havoc in a storage environment. “Why”, you may ask, do we have all this redundant kit in our environment (dual fabrics, redundant controllers, dual HBAs, MPIO software and so on) whilst this “slow drain device” remains the absolute Achilles heel of the entire storage infrastructure?

Well, let's take a step back and look at how it came to this point. As with most hardware and software, things develop over time, so when we started doing network-based storage in the early-to-mid '90s we started out with a brand new protocol called Fibre Channel. (I'm sure you've heard of it.) The first iteration was based on arbitrated loop, which basically means you connect the TX port of an HBA to the RX port of a disk or tape device and vice versa, effectively forming a loop out of point-to-point links. When more HBAs and/or storage devices were inserted you would get a ring topology. This was OK when you had around 3 or 4 devices in a ring (126 were possible), however from a manageability perspective you can imagine this was a nightmare.

So a new device called an FC hub was invented. This at least provided a single connectivity platform so you could run all your cables to the same box. Internally, however, this was still a loop topology, since each hub port just forwarded frames to the next port, which in turn sent them to the attached device; if the frame was not addressed to that device, it sent the frame back to the hub, and so on until it reached the destination. Now, this wasn't really an effective way of doing things, so at first the hub got a bit more intelligent by becoming a so-called loop switch. This meant the hub port looked at the destination address and, if the frame wasn't destined for a device attached to that port, it just sent it on to the next port. This continued until the destination port was reached, which then opened up and sent the frame to the device.

As you can imagine, in larger loop topologies every single device in the loop had to be made aware whenever a device came online or went offline, and for this the LIP (Loop Initialization Protocol) was invented. This protocol made sure that each device got a sort of “update” about the appeared or disappeared device. Later on, the loop methodology was almost entirely abandoned in favour of switched fabrics, which are far more intelligent at shoving frames in the right direction.

Now remember that Fibre Channel was developed with one thing in mind, and that was to get the maximum possible speed out of very reliable networks. This also meant that no error correction is done at the protocol layer; every possible recovery option was left to the upper layer protocols like IP or SCSI.
The problem was that you always had a single point of failure, irrespective of which topology you chose. If you had a server in a loop and the HBA had a problem, the entire loop could potentially be mucked up. The same applied when an FC-AL hub or FC switch had a problem: all your connections to your disks would be lost, and at best you were lucky enough to be using journalled filesystems, which recovered relatively quickly. How many of you have waited 5 or more hours for a Windows chkdsk to finish, just to find out either that there was no problem at all or that the entire disk was corrupted and you had to restore from tape?

So to circumvent that, the storage folk more or less determined that you would need at least two of everything, physically separated, so no component could affect the availability of another. This is where MPIO comes in: when you have multiple paths to a device over separate channels, the operating system just sees each path as a different device, so potentially you end up with two disks (or tapes or whatever) which are physically the same volume. MPIO software fixed that by building in the logic to present just one volume to the OS. The other thing built into MPIO was link error detection. If a link dropped light or lost sync for whatever reason, the HBA would go into a non-active state and send a signal to the upper layer that it had lost the link; MPIO could then redirect all I/Os to the other paths and everything would live happily ever after. If that link came back again, MPIO would pick this up, provide the option to use that path again and we were on our way.
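To make that failover behaviour a bit more concrete, here is a minimal sketch in Python. It is purely illustrative and not any vendor's actual MPIO code; all class and method names are made up. It shows several physical paths being collapsed into one logical volume, with I/O redirected when the (hypothetical) HBA driver signals a link-state change.

```python
# Minimal sketch of MPIO-style failover driven by HBA link-state events.
# Not a real driver stack; names and structure are illustrative only.

class Path:
    def __init__(self, name):
        self.name = name
        self.online = True          # driven by HBA link-state notifications

    def link_event(self, link_up):
        """Called by the (hypothetical) HBA driver on loss/return of light or sync."""
        self.online = link_up


class MultipathDevice:
    """Presents a single logical volume on top of several physical paths."""

    def __init__(self, paths):
        self.paths = paths

    def submit_io(self, request):
        for path in self.paths:
            if path.online:
                return self._send(path, request)   # first healthy path wins
        raise IOError("all paths to the volume are offline")

    def _send(self, path, request):
        # Real MPIO would queue this to the HBA; here we just report the choice.
        return f"{request} sent via {path.name}"


paths = [Path("hba0 -> fabric A"), Path("hba1 -> fabric B")]
lun = MultipathDevice(paths)

print(lun.submit_io("READ block 0"))     # goes via fabric A
paths[0].link_event(link_up=False)       # HBA 0 loses light; driver notifies MPIO
print(lun.submit_io("READ block 0"))     # transparently redirected via fabric B
```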

This shows that MPIO relies on HBA state signals upon which it can act. The problem, however, is that a link might drop somewhere else in the fabric. In that case the HBA sees no problem: its own link is still up, in sync and shows no other issues. The only way for MPIO to observe such a problem is to detect an I/O failure and react to one or more of these failures by putting the logical path in an offline state. (The physical link from the HBA to the switch is still online.)
This imposes another problem: what if there is no I/O going over that path? Many storage networks are designed in an active/passive configuration, so only one logical path is sending and receiving I/Os. If there is a problem on the passive side of the path but it is further downstream in the fabric, the HBA will not notice it, so there will be no notification to the MPIO layer and MPIO will never put that path offline. In case of a real problem on the active side, MPIO tries to fail over, but it then runs into the same issue and both paths to the device fail, causing the same outage. Some MPIO products, like Hitachi's HDLM, have built-in logic to test for such conditions. In HDLM you configure so-called IEM (Intermittent Error Monitoring): HDLM polls the target device by sending a sector 0 read request every once in a while, and if that succeeds it waits for the next polling cycle. If errors have been observed more times than the configured threshold, it puts the path offline.
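The polling idea is easy to sketch. The snippet below is not HDLM code; it is an illustrative approximation of intermittent error monitoring, with a stubbed-out probe function and made-up interval, window and threshold values.

```python
# Rough sketch of intermittent-error-monitoring on a logical path.
# NOT Hitachi HDLM code; probe, interval, window and threshold are assumptions.

import time

def probe(path):
    """Issue a sector-0 read down this logical path; return True on success.
    Stubbed out here - a real implementation would go through the OS block layer."""
    return path.get("healthy", True)

def monitor(path, interval_s=60, window_s=600, threshold=3):
    """Poll the path; take it offline if too many failures occur within the window."""
    failures = []
    while path["state"] == "online":
        if not probe(path):
            now = time.monotonic()
            failures = [t for t in failures if now - t <= window_s] + [now]
            if len(failures) >= threshold:
                path["state"] = "offline"        # stop using this path for I/O
                print(f"{path['name']}: {threshold} errors in {window_s}s, path offlined")
                return
        time.sleep(interval_s)

# Example: a path that always fails its probe is offlined after 3 polling cycles.
monitor({"name": "path-0", "state": "online", "healthy": False}, interval_s=1)
```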

You might think we've covered everything now, and I wish that were true. MPIO only acts upon frames going AWOL, but as you've seen in my previous article the major problem often lies beyond the data frames: a vast majority of issues these days is due to problems in flow control. This in turn creates slow drain devices, which have the effect of depleting credits further downstream.

Only FC layer 2 has any notion of buffer credits, and this is never propagated to the upper level protocol stack. This is true for any HBA, firmware, driver, MPIO software and OS. If any problems occur downstream of the initiator or upstream of the target, all devices in that particular path will incur a performance impact and, at some point in time, an availability problem. MPIO will NOT help in this case, as I explained above.
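To illustrate why a slow drain device hurts everything sharing the path, here is a toy model of buffer-to-buffer credit accounting: the transmitter gives up a credit for every frame it sends and only gets it back when the receiver returns an R_RDY. A device that is slow to return R_RDYs starves the transmitter of credits. This is a deliberate simplification of the real FC-2 mechanics, meant only to show the principle.

```python
# Toy model of buffer-to-buffer credit flow control (not a real FC stack).
from collections import deque

class Link:
    """One direction of an FC link with buffer-to-buffer flow control."""

    def __init__(self, bb_credit):
        self.credits = bb_credit       # credits granted by the receiver at login
        self.rx_buffers = deque()      # frames waiting to be processed by the receiver

    def send_frame(self, frame):
        if self.credits == 0:
            return False               # no credit, no frame: the transmitter must wait
        self.credits -= 1
        self.rx_buffers.append(frame)
        return True

    def receiver_processes_one(self):
        """Receiver frees a buffer and returns an R_RDY, restoring one credit."""
        if self.rx_buffers:
            self.rx_buffers.popleft()
            self.credits += 1

link = Link(bb_credit=8)

# A slow drain device accepts frames but only rarely returns R_RDYs.
sent = 0
for frame in range(20):
    if link.send_frame(frame):
        sent += 1
    if frame % 10 == 9:                # receiver drains just one buffer per 10 frames offered
        link.receiver_processes_one()

print(f"frames offered: 20, frames actually sent: {sent}")
# Far fewer frames get through than were offered: the sender stalls waiting for R_RDYs,
# and in a real fabric that stall backs up into the switches shared by other devices.
```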

The only way to prevent this from happening is active monitoring and management of your entire fabric, and if any apparent link issues do surface, fix them immediately.

What do you look for in these cases? Basically, all errors that might affect an FC frame or FC traffic flow.
In Brocade FOS there is a command called “porterrshow”, which shows a table of error counters per port.

The error counter columns show whether any issues with frames and/or primitives have occurred at some point in time. (Use the “help porterrshow” command for an explanation of each column.) Run subsequent porterrshow commands to see if any of the counters are increasing. The other option is to create a new baseline with the “statsclear” command so all counters are reset to 0.
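If you want to track these counters over time, something along the lines of the following sketch can help. It compares two saved captures of porterrshow output and flags counters that increased. The parsing is an assumption (a whitespace-separated table with a single header row); the actual layout differs between FOS releases, so treat this as a starting point rather than a finished tool.

```python
# Hedged sketch: compare two saved captures of "porterrshow" output and flag
# counters that increased between them. The table layout assumed here may not
# match your FOS release exactly - adjust the parsing as needed.

def parse_porterrshow(text):
    """Return {port: {counter_name: value}} from a captured porterrshow dump."""
    lines = [l for l in text.splitlines() if l.strip()]
    header = lines[0].split()                  # assumed: one header row of counter names
    counters = {}
    for line in lines[1:]:
        fields = line.replace(":", " ").split()
        port, values = fields[0], fields[1:]
        counters[port] = {name: int(v) for name, v in zip(header, values) if v.isdigit()}
    return counters

def diff_counters(baseline, current):
    """Yield (port, counter, delta) for every counter that went up since the baseline."""
    for port, counts in current.items():
        for name, value in counts.items():
            delta = value - baseline.get(port, {}).get(name, 0)
            if delta > 0:
                yield port, name, delta

# Usage: save porterrshow output to a file before and after, then:
#   with open("baseline.txt") as f1, open("now.txt") as f2:
#       for port, name, delta in diff_counters(parse_porterrshow(f1.read()),
#                                              parse_porterrshow(f2.read())):
#           print(f"port {port}: {name} increased by {delta}")
```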

Cisco has a similar output, albeit in a non-table format, with the “show interface detailed-counters” command.

The next article outlines an option in Brocade FOS to detect a slow drain device with the bottleneckmon feature, and how to automatically disable a port if too many errors on one of the above counters have occurred in a certain time-frame. If you have a Brocade FOS admin manual, look at the port-fencing feature.

Kind regards,
Erwin



4 responses on “One rotten apple spoils the bunch – 2”

  1. Erwin van Londen

    No, I won’t be there. My HDS T11 rep needs to schedule this but I first need to get some things out of the way from a technical perspective.

    It should go into FC-FS but I don’t know which release yet.

  2. seb

    Are you attending one of the f2f meetings about this topic? In which “workstream” is this discussed?

  3. seb

    Good article! The ugly thing with most of the multipathers is that they are not really built for problems like these. They can cope with links going down->be repaired->go up in perfect shape again. But if frames get dropped in the SAN due to slow drain devices, the multipather will notice a problem, but with a later TUR (Test Unit Ready) the path looks good for it again and will be further used. It would be better if the MPIO would track response times over the different paths and would move load towards better performing paths (avoiding the ones impacted by the slow drain device).
    Cheers seb