So what is the problem with storage networking these days? Some of you might argue that it’s the best thing since sliced bread and the most stable way to shove data back and forth, and maybe it is. However, this is not always the case. The problem is that some gaps still exist which have never been addressed, and one of them is resiliency. A lot has been done to detect errors and to try to recover from them, but nobody ever thought about how to prevent errors from occurring. (Until now, that is.) Read on.
So what is the evolution of a standard like Fibre Channel? It is normally born out of a need that isn’t addressed by current technologies. The primary reason FC was created is that the parallel SCSI stack had a huge problem with distance. It did not scale beyond a couple of meters and was very sensitive to electrical noise, which could disturb the reliable transmission needed for a data-intensive channel protocol like SCSI. So somebody came up with the idea to serialise the data stream, and FC was born. A lot of very smart people got together and cooked up the nifty things we now take for granted, like a massive address space, zoning, a huge increase in speed and a lot of other goodies which could never have been achieved with a parallel interface.
The problem is that these goodies are all created in the dark dungeons of R&D labs. These guys don’t speak much (if at all) to end-user customers so the stuff coming out of these labs is very often extremely geeky.
If you follow a path from the creation of a new thing (whether technology or anything else) you see something like this:
- Market demand
- Post sales support
The problem is that very often there is no link between #5/#6 and #2. Very often for good reason but this also inflicts some serious challenges. Since I’m not smart enough to work in #2 I’m on the bottom of the food chain working in #6. 🙂 But I do see the issues that arise in this path so I cooked something up. Read on.
Going back to Fibre Channel, there is one huge gap, and that is fault tolerance and acting upon failures in an FC fabric. The protocol defines how to detect errors and how to try to recover from them, but it does not have anything which defines how to prevent errors from reoccurring. This means that if an error has been detected and frames get lost we just say “OK, let’s try it again and see if it succeeds now”. It doesn’t take a genius to see that if something is broken this will fail again.
On the practical side, the things that most often go wrong are physical: SFPs and cables. These result in errors like encoding/decoding failures, CRC errors, and signal and synchronization errors. If these occur, the entire frame including your data payload gets dropped and we ask the initiator of that frame to try and resend it. If, however, the initiator does not have this frame in its buffers anymore, we rely on the upper layer protocol to recover. Most of the time it succeeds; however, as previously mentioned, if things are really broken this will fail again. From an operating system perspective you will see this as SCSI check conditions and/or read/write failures. In a tape environment this will often result in failed backup/restore jobs.
Now, you’re gonna say “Hold on buddy, that’s why we have dual redundant fabrics, multiple entries to our LUNs, multipathing etc. etc.”, i.e. redundancy. True, BUT, what if it is just partially broken? A dodgy SFP or HBA might send out good signals, but there could also be a certain amount of not-so-good signals. This causes intermittent failures which show up as the errors mentioned above, and if they happen often enough you get exactly those problems. So, although you have every piece of the storage puzzle redundant, you might still run into problems which, if severe enough, might affect your entire storage infrastructure. (And it does happen, believe me.)
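To put a rough number on “partially broken”, here is a back-of-the-envelope sketch. It assumes frame errors are independent of each other and that a single dropped frame spoils the whole exchange; both are simplifications, and the error rates and frame counts are made up purely for illustration.

```python
def clean_exchange_probability(frame_error_rate: float, frames: int) -> float:
    """Chance that every frame of an exchange arrives intact,
    assuming independent frame errors (a simplification)."""
    return (1.0 - frame_error_rate) ** frames


def expected_attempts(frame_error_rate: float, frames: int) -> float:
    """Average number of tries until one exchange gets through cleanly
    (geometric distribution; infinite if it can never succeed)."""
    p_clean = clean_exchange_probability(frame_error_rate, frames)
    return float("inf") if p_clean == 0.0 else 1.0 / p_clean


if __name__ == "__main__":
    # Hypothetical numbers: a 1000-frame exchange over a link that
    # corrupts 1 frame in 1000 versus 1 frame in a million.
    for rate in (1e-3, 1e-6):
        p = clean_exchange_probability(rate, 1000)
        print(f"error rate {rate:g}: clean exchange {p:.1%}, "
              f"about {expected_attempts(rate, 1000):.2f} attempts")
```

Under these assumptions, a link that mangles one frame in a thousand lets well under half of such exchanges through on the first try, even though the link looks “up” the whole time.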
The underlying problem is that there is no communication between N-Ports and F-Ports, as well as a lack of end-to-end path error verification to check whether these errors occur in the fabric and, if so, how to mitigate or circumvent them. If an N-Port sends out a signal to an F-Port which gets corrupted underway, there is no way for the F-Port to notify the N-Port and say “Hey dude, you’re sending out crap, do something about it”. A similar issue exists in meshed fabrics. We have all grown up with FSPF (Fabric Shortest Path First) since 1998, which is an FC protocol extension to determine the shortest path from A to B in an FC fabric based on a least-cost routing algorithm. Nothing wrong with that, but what if this path is very error prone? Does the fabric have any means to make a decision and say “OK, I don’t trust this path, I’ll direct that traffic via another route”? No, there is nothing in the FC protocol which provides this option. The only way routes are redefined is if there are changes in the fabric, such as an N-Port coming online or going offline, registering or de-registering itself with the fabric name server, and RSCNs (Registered State Change Notifications) being sent out.
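To make the routing point concrete, here is a minimal sketch in Python. It is not FSPF itself and not the algorithm from my proposal: Dijkstra’s algorithm stands in for the least-cost calculation, and the four-switch fabric, the link costs, the error counts and the `error_penalty` knob are all hypothetical.

```python
import heapq

# Hypothetical four-switch fabric: (switch_a, switch_b) -> (link cost, errors seen).
LINKS = {
    ("A", "B"): (500, 0),
    ("B", "D"): (500, 0),
    ("A", "C"): (250, 900),   # cheapest hop, but a flaky link
    ("C", "D"): (250, 0),
}


def build_graph(links, error_penalty=0):
    """Build an adjacency map; each link's cost is inflated by
    error_penalty per observed error (0 means plain least-cost)."""
    graph = {}
    for (a, b), (cost, errors) in links.items():
        weight = cost + error_penalty * errors
        graph.setdefault(a, []).append((b, weight))
        graph.setdefault(b, []).append((a, weight))
    return graph


def shortest_path(graph, src, dst):
    """Dijkstra's algorithm; returns (total cost, list of switches)
    or None if dst cannot be reached."""
    queue = [(0, src, [src])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dst:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbour, weight in graph.get(node, []):
            if neighbour not in visited:
                heapq.heappush(queue, (cost + weight, neighbour, path + [neighbour]))
    return None


if __name__ == "__main__":
    # Plain least-cost routing happily picks the flaky A-C link.
    print(shortest_path(build_graph(LINKS), "A", "D"))
    # With errors folded into the cost, traffic moves to A-B-D.
    print(shortest_path(build_graph(LINKS, error_penalty=1), "A", "D"))
```

With `error_penalty=0` the path goes via the error-prone link because it is the cheapest; as soon as observed errors count towards the cost, the clean but more expensive route wins. How errors should really be weighted is exactly the kind of thing a standard would have to define.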
For this reason I submitted a proposal to the T11 committee via my teacher, the father of Fibre Channel, Horst Truestedt, to extend the FC-GS services with new ways to solve these problems. (The proposal can be downloaded here.)
The underlying thoughts are to have port-to-port communication, so that one side can notify the other side of the link that it is not stable, as well as an end-to-end error verification and notification algorithm, so that hosts, HBAs and fabrics can act upon errors seen in the path to their end devices. This allows active redirection of frames to keep them from passing via that route, as well as the option to extend management capabilities so that storage administrators can act upon these failures and replace or update hardware and/or software before the problem becomes imminent and affects the overall stability of the storage infrastructure. In the end this will result in far greater storage availability and application uptime, as well as prevent all the other nasty stuff like data corruption.
The proposal was positively received with an 8:0 vote, so now I’m waiting for a company to pull this further and actually start developing this extension.
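As a rough illustration of the port-to-port idea, the sketch below counts link errors in a sliding time window and says when the other side should be told the link is unstable. The class name, the threshold and the window length are all hypothetical; the actual mechanism and message format would be up to the standard.

```python
from collections import deque


class LinkErrorMonitor:
    """Counts link errors in a sliding time window and reports when
    the link should be treated as unstable. All names and thresholds
    here are hypothetical."""

    def __init__(self, max_errors=5, window_seconds=60.0):
        self.max_errors = max_errors
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def record_error(self, now):
        """Register one error (CRC, encoding, loss of sync, ...) seen at
        time `now` in seconds. Returns True if the peer should be notified."""
        self._timestamps.append(now)
        # Forget errors that have fallen out of the window.
        while self._timestamps and now - self._timestamps[0] > self.window_seconds:
            self._timestamps.popleft()
        return len(self._timestamps) > self.max_errors


if __name__ == "__main__":
    monitor = LinkErrorMonitor(max_errors=2, window_seconds=10.0)
    for t in (0.0, 1.0, 2.0):
        if monitor.record_error(t):
            print(f"t={t}: link unstable, notify the peer port")
```

An occasional error that ages out of the window never triggers a notification; a burst does. The same signal could feed the management side, so an administrator gets to swap the SFP before the backup job fails.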
Let me know what you think.