
One rotten apple spoils the bunch – 1

Last week I had another one. A rotten apple that spoiled the bunch or, in storage terms, a slow drain device causing havoc in a fabric.

This time it was a blade-center server with a dubious HBA connection to the blade-center switch, which caused link errors and thus corrupt frames, encoding errors and credit depletion. Because the blade was connected to a blade switch, the credit depletion also propagated back into the overall SAN fabric, and the entire fabric suffered significantly from this single problem device.

“Now how does this work?” you’ll say. Well, it has everything to do with the flow-control methodology used in FC fabrics. In contrast to the Ethernet and TCP/IP world, we storage guys expect a device to behave correctly, as gentlemen usually do. That being said, as with everything in life, there are always moments when nasty things happen, and in the case of the “rotten apple” a single storage device, be it an HBA, tape drive or storage array, may be doing nasty things.

Let’s take a look at how this normally works.

FC devices run on a buffer-to-buffer credit model. This means the device reserves a certain number of buffers on the FC port itself. This number of buffers is then communicated to the remote device as credits, so basically device A gives the remote device permission to use X credits. Each credit represents a buffer of around 2112 bytes (a full 2K data payload plus frame header and footer).
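
As a concrete (if simplistic) illustration, here is a minimal Python sketch of that model; the class and numbers are mine, not from any real FC stack:

```python
# Minimal sketch of the buffer-to-buffer credit model (illustrative names only).

BUFFER_BYTES = 2112  # one receive buffer holds one full-size frame payload

class Port:
    """A port reserves receive buffers and advertises that count as credits."""
    def __init__(self, rx_buffers: int):
        self.rx_buffers = rx_buffers   # buffers physically reserved on this port
        self.tx_credits = 0            # frames we may still send to the peer

    def advertised_credits(self) -> int:
        # "You may have this many frames outstanding toward me."
        return self.rx_buffers

hba, switch_port = Port(rx_buffers=8), Port(rx_buffers=8)

# After login each side stores the *peer's* advertisement as its send allowance:
hba.tx_credits = switch_port.advertised_credits()
switch_port.tx_credits = hba.advertised_credits()
print(hba.tx_credits, switch_port.tx_credits)  # 8 8
```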

The number of credits each device can handle is “negotiated” during fabric login (FLOGI). On the left is a snippet from a FLOGI frame where you can see the number of credits in hex.

So what happens after the FLOGI? As an example, take a connection that has negotiated 8 credits either way. If the HBA sends a frame (e.g. a SCSI read request) it knows it only has 7 credits left. As soon as the switch port receives the frame it has to decide where to send it. It does this based on routing tables, zoning configuration and some other rules, and if everything is correct it will route the frame to its next destination. Meanwhile it simultaneously sends back a so-called R_RDY primitive. This R_RDY tells the HBA that it can increase its credit counter by one again. So if the current credit counter was 5 it can now bump it back up to 6. (A “primitive” lives only between two directly connected ports and as such will never traverse a switch or router. A frame can, and will, be switched/routed over one or more links.)
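
In code, the accounting on the HBA side might look like this; again just a sketch under the 8-credit assumption above:

```python
# Per-frame credit accounting on the sending side (illustrative sketch).

credits = 8  # negotiated during FLOGI

def send_frame():
    """Transmitting a frame consumes one credit."""
    global credits
    if credits == 0:
        raise RuntimeError("out of credits: must wait for an R_RDY")
    credits -= 1

def receive_r_rdy():
    """An R_RDY from the directly attached port hands one credit back."""
    global credits
    credits += 1

send_frame()      # e.g. a SCSI read request goes out: 8 -> 7
receive_r_rdy()   # switch routed the frame and freed its buffer: 7 -> 8
print(credits)    # 8
```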

Below is a very simplistic overview of two ports on a FC link. On the left we have an HBA and on the right we have a switch port. The blue lines represent the data frames and the red lines the R_RDY primitives.

As I said, it’s pretty simplistic. In theory the HBA on the left could send up to 8 frames before it has to wait for an R_RDY to be returned.

So far all looks good, but what if the path from the switch back to the device is broken, due to a crack in the cable, unclean connectors, broken lasers etc.? The first problem we often see is that bits get flipped on the link, which in turn causes encoding errors. FC up to 8G uses an 8b/10b encoding/decoding mechanism: the 8 data bits are converted into a so-called 10-bit word or transmission character, and these 10 bits are what actually travel over the wire. The remote side uses the same algorithm to revert the 10 bits back into the original 8 data bits. This assures bit-level integrity and DC balance on the link. When a link has a problem as described above, however, chances are that one or more of these 10 bits flip from 0 to 1 or vice versa. The recipient detects the problem, but since it is unaware of which bit got corrupted it has to discard the entire transmission character. This means that if such a corruption is detected an entire primitive is discarded, or, if the corrupted piece was part of a data frame, that entire frame is dropped.
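
To see why one flipped bit costs a whole character, consider this toy sketch. It is emphatically not a real 8b/10b implementation (there is no running disparity, and the two table entries are invented); it only shows the decode-or-discard behaviour:

```python
# Toy illustration of decode-or-discard; NOT a real 8b/10b implementation.

ENCODE = {
    0x55: "1010100101",   # made-up 10-bit words, for illustration only
    0xAA: "0101011010",
}
DECODE = {v: k for k, v in ENCODE.items()}

def decode(word10: str):
    """Return the original byte, or None if the 10-bit word is invalid."""
    return DECODE.get(word10)

word = ENCODE[0x55]
print(decode(word))        # 85: clean link, byte recovered

# Flip one bit on the wire (cracked cable, dirty connector, weak laser ...):
corrupted = word[:3] + ("0" if word[3] == "1" else "1") + word[4:]
print(decode(corrupted))   # None: the receiver cannot tell which bit flipped,
                           # so the whole character (and the primitive or frame
                           # it belongs to) is discarded
```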

A primitive (including the R_RDY) consists of 4 words (4 × 10 bits). The first word is always a control character (K28.5) and it is followed by three data words (Dxx.x).

0011111010 1010100010 0101010101 0101010101 (-K28.5 +D21.4  D10.2  D10.2 )
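
A receiver will only accept those four words as an R_RDY when all of them arrive intact. The 10-bit strings in this little check are copied from the sequence above; the rest is invented for illustration:

```python
# Recognising the R_RDY ordered set from its four transmission words (sketch).

R_RDY = ("0011111010", "1010100010", "0101010101", "0101010101")

def is_r_rdy(words) -> bool:
    """A primitive counts only when all four words arrive intact."""
    return tuple(words) == R_RDY

print(is_r_rdy(R_RDY))                        # True
print(is_r_rdy(("0011111011",) + R_RDY[1:]))  # False: one flipped bit and
                                              # the R_RDY is lost
```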

I will not go further into this since it's beyond the scope of this article.

So if this R_RDY is discarded, the HBA does not know that the switch port has indeed freed up the buffer, so its credit counter stays one lower than it should be. The below depicts such a scenario:

As you can see, every time an R_RDY is lost the credit counter drops, and at some point it will reach 0, meaning the HBA is unable to send any frames at all. When this happens an error-recovery mechanism kicks in which basically resets the link, clearing the buffers on both sides of that link and starting from scratch. The upper layers of the FC protocol stack (SCSI-FCP, IPFC etc.) have to make sure that any outstanding frames are either re-transmitted or the entire IO is aborted, in which case the IO in its entirety needs to be re-executed. As you can see this causes a real problem on this link, since a lot of things are going on besides actually getting your data frames across. If you think this will not have much impact, be aware that the above sequence might run in less than a tenth of a second, so credit depletion can be reached in under a second. So how does this influence the rest of the fabric, since it all seems pretty confined to this particular link?
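
A toy simulation of that sequence, with invented loss rates, shows how quickly the leak adds up:

```python
# Every lost R_RDY leaks one credit until the transmitter stalls (illustrative).
import random

random.seed(1)
NEGOTIATED = 8
credits = NEGOTIATED

def link_reset():
    """LR: both sides clear their buffers; credits return to FLOGI values."""
    global credits
    credits = NEGOTIATED

for frames_sent in range(1, 10_000):
    credits -= 1                # a frame goes out
    if random.random() > 0.3:   # ~30% of the R_RDYs are corrupted on this link
        credits += 1            # this one made it back
    if credits == 0:
        print(f"credit depletion after {frames_sent} frames -> link reset")
        link_reset()
        break
```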

Let's broaden the scope a bit from an architectural perspective. Below you see a relatively simple, though architecturally often implemented, core-edge fabric.

Each HBA stands for one server (Green, Blue, Red and Orange), each mapped to a port on a storage array.
Now let's say server Red is a slow-drain device or has a problem on its direct link to the switch: it returns credits only intermittently due to the encoding errors explained above, or it is very slow in returning credits due to a driver/firmware timing issue. The HBA sends a read request for an IO of 64K of data. This means 32 data frames (FC normally uses a 2K frame size) will be sent back from the array to the Red server. Meanwhile the other three servers and the two storage arrays are also sending and receiving data. If the number of credits negotiated between the HBAs and the switch ports is 8, you can see that only the first 16K of that 64K request can be sent to the Red server; the remaining 48K is either still in transit from the array to the HBA or sitting in an outbound queue on the array. Since the edge switch (on the left) is unable to send frames to the Red server, the remaining data frames (from ALL servers) stack up on the incoming ISL port (bright red). This in turn causes the outbound ISL port on the core switch (the one on the right) to deplete its credits, which means that at some point no frames can traverse the ISL at all, bringing most traffic to a standstill.
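
A deliberately oversimplified model of that head-of-line blocking (names, frame counts and credit counts are all invented) shows how one stuck port wedges the shared ISL:

```python
# One slow-drain edge port exhausts the shared ISL's credits (toy model).
from collections import deque

isl_credits = 8       # core -> edge ISL credits
isl_queue = deque()   # frames waiting on the core switch's outbound ISL port

# Red has stopped returning credits; the other edge F_Ports are healthy:
edge_credits = {"Red": 0, "Green": 8, "Blue": 8, "Orange": 8}

def edge_delivers(dest: str) -> bool:
    """The edge switch needs a credit on the destination port to move the frame on."""
    if edge_credits[dest] > 0:
        edge_credits[dest] -= 1
        return True   # edge buffer freed -> R_RDY back over the ISL
    return False      # frame sits in an edge buffer; the ISL credit is NOT returned

def core_forwards(dest: str):
    """The core switch may only push a frame over the ISL if it has a credit."""
    global isl_credits
    if isl_credits == 0:
        isl_queue.append(dest)    # backs up: hits ALL servers, not just Red
        return
    isl_credits -= 1
    if edge_delivers(dest):
        isl_credits += 1

# 32 frames of Red's 64K read arrive first, then traffic for the healthy servers:
for _ in range(32):
    core_forwards("Red")
for dest in ("Green", "Blue", "Orange") * 4:
    core_forwards(dest)

print(isl_credits)     # 0  -> the ISL is wedged
print(len(isl_queue))  # 36 -> Green/Blue/Orange frames are stuck behind Red's
```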

You’ll probably ask, “So how do we recover from this?” Well, basically the port on the edge switch to the Red server will send an LR (Link Reset) after the agreed “hold time”. The hold time is a calculated period during which the switch will hold frames in its buffers; in most fabrics this is 500ms. So if the switch has had zero credits available during the entire hold period and has had at least one frame in its output buffer, it will send an LR to the HBA. This causes both the switch and HBA buffers to clear and the number of credits to return to the value negotiated during FLOGI.
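
In pseudo-Python, the hold-time decision boils down to something like this (timings and function names are illustrative):

```python
# Sketch of the hold-time logic; most fabrics use a ~500 ms hold time.
import time

HOLD_TIME = 0.5  # seconds a switch will sit on queued frames with zero credits

def monitor_port(get_credits, queue_depth: int) -> bool:
    """Return True when the port should send a Link Reset (LR)."""
    zero_since = None
    end = time.monotonic() + 2 * HOLD_TIME       # bound the demo loop
    while time.monotonic() < end:
        if get_credits() == 0 and queue_depth > 0:
            zero_since = zero_since or time.monotonic()
            if time.monotonic() - zero_since >= HOLD_TIME:
                return True                      # LR: both sides clear buffers,
                                                 # credits return to FLOGI values
        else:
            zero_since = None                    # a credit came back: restart timer
        time.sleep(0.01)
    return False

# Zero credits for the whole hold period + at least one queued frame -> LR:
print(monitor_port(lambda: 0, queue_depth=1))    # True
```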

If you don’t fix the underlying problem this process will go on forever and, as you’ve seen, will severely impact your entire storage environment.

“OK, so the problem is clear, how do I fix it?”

There are two ways to tackle the problem, the good and the bad way.

The good way is to monitor and manage your fabrics and links for such behavior. If you see any error counter increasing, verify all connections, cables, SFPs, patch panels and other hardware sitting between the two devices. Clean connectors, replace cables and make sure these hardware problems do not resurface. My advice: if you see any link behaving like this, DISABLE IT IMMEDIATELY !!!! No questions asked.
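
What that monitoring can look like in practice, sketched in Python: diff the error counters between polls and flag any increase. The counter names below are placeholders for whatever your switch vendor exposes via SNMP/CLI/API, not a real interface:

```python
# Hedged sketch: flag ports whose link error counters are still climbing.
# Counter names are placeholders, not any vendor's actual API.

WATCHED = ("enc_out_errors", "crc_errors", "link_resets")

def rising_counters(prev: dict, curr: dict) -> dict:
    """Return only the watched counters that increased between two polls."""
    return {k: curr[k] - prev[k] for k in WATCHED if curr[k] > prev[k]}

prev = {"enc_out_errors": 10, "crc_errors": 2, "link_resets": 0}
curr = {"enc_out_errors": 25, "crc_errors": 2, "link_resets": 1}

deltas = rising_counters(prev, curr)
if deltas:
    # Rising counters mean a physical problem between the two devices:
    # check cables, SFPs, patch panels, connectors -- or disable the port.
    print(f"link degrading: {deltas}")
```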

The bad way is to stick your head in the sand and hope for it to go away. I’ve seen many such issues cripple entire fabrics, and due to strictly enforced change control severe outages occurred and elongated recovery (very often multiple days) was needed to get things back to normal again. Make sure you implement emergency procedures which allow you to bypass these operational guidelines. It will save you a lot of problems.

Regards,
Erwin van Londen

Fibre Channel improvements.

So what is the problem with storage networking these days? Some of you might argue that it’s the best thing since sliced bread and the most stable way to shove data back and forth, and maybe it is, but this is not always the case. The problem is that some gaps still exist which have never been addressed, and one of them is resiliency. A lot has been done to detect errors and to try to recover from them, but nobody ever thought about how to prevent errors from occurring. (Until now, that is.) Read on.

So what is the evolution of a standard like Fibre Channel? It is normally born out of a need that isn’t addressed by current technologies. The primary reason FC was created is that the parallel SCSI stack had a huge problem with distance: it did not scale beyond a couple of meters and was very sensitive to electrical noise, which could disturb the reliable transmission needed for a data-intensive channel protocol like SCSI. So somebody came up with the idea to serialise the data stream, and FC was born. A lot of very smart people got together and cooked up the nifty things we now take for granted, like a massive address space, zoning, huge increases in speed and a lot of other goodies which could never have been achieved with a parallel interface.

The problem is that these goodies are all created in the dark dungeons of R&D labs. These guys don’t speak much (if at all) to end-user customers so the stuff coming out of these labs is very often extremely geeky.
If you follow a path from the creation of a new thing (whether technology or anything else) you see something like this:

  1. Market demand
  2. R&D
  3. Product
  4. Sales
  5. Customers
  6. Post sales support

The problem is that very often there is no link between #5/#6 and #2. Very often for good reason, but this also introduces some serious challenges. Since I’m not smart enough to work in #2, I’m at the bottom of the food chain working in #6. 🙂 But I do see the issues that arise along this path, so I cooked something up. Read on.

Going back to Fibre Channel, there is one huge gap, and that is fault tolerance and acting upon failures in an FC fabric. The protocol defines how to detect errors and how to try to recover from them, but it does not define anything to prevent errors from reoccurring. This means that if an error has been detected and frames get lost, we just say “OK, let’s try it again and see if it succeeds now”. It doesn’t take a genius to see that if something is broken this will fail again.

On the practical side there are a couple of things that most often go wrong, and that is the physical side of things like SFPs and cables. These result in errors like encoding/decoding failures, CRC errors, and signal and synchronization errors. If these occur, the entire frame, including your data payload, gets dropped, and we ask the initiator of that frame to try to resend it. If the initiator no longer has this frame in its buffers, we rely on the upper layer protocol to recover. Most of the time this succeeds; however, as previously mentioned, if things are really broken it will fail again. From an operating system perspective you will see this as SCSI check conditions and/or read/write failures. In a tape environment this will often result in failed backup/restore jobs.

Now you’re going to say, “Hold on buddy, that’s why we have dual redundant fabrics, multiple paths to our LUNs, multipathing etc. etc.”, i.e. redundancy. True, BUT what if it is just partially broken? A dodgy SFP or HBA might send out good signals, but there could also be a certain number of not-so-good signals. This results in intermittent failures and the errors mentioned above, and if they happen often enough you will get these problems. So although you have every piece of the storage puzzle redundant, you might still run into problems which, if severe enough, can affect your entire storage infrastructure. (And it does happen, believe me.)

The underlying problem is that there is no communication between N_Ports and F_Ports, and no end-to-end path error verification to check whether errors occur in the fabric and, if so, how to mitigate or circumvent them. If an N_Port sends out a signal to an F_Port which gets corrupted along the way, there is no way for the F_Port to notify the N_Port and say “Hey, dude, you’re sending out crap, do something about it”. A similar issue exists in meshed fabrics. We have all grown up since 1998 with FSPF (Fabric Shortest Path First), an FC protocol extension that determines the shortest path from A to B in an FC fabric based on a least-cost routing algorithm. Nothing wrong with that, but what if this path is very error-prone? Does the fabric have any means to make a decision and say “OK, I don’t trust this path, I’ll direct the traffic via another route”? No, there is nothing in the FC protocol which provides this option. The only time routes are redefined is when there are changes in the fabric, for example an N_Port coming online/offline and registering/de-registering itself with the fabric nameserver, upon which RSCNs (Registered State Change Notifications) are sent out.
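
To make the FSPF point concrete: least-cost routing is essentially Dijkstra over link costs, and the sketch below adds the kind of error-based cost penalty the FC protocol does not offer today. The topology and costs are invented:

```python
# Least-cost path selection (FSPF-style) plus a hypothetical error penalty.
import heapq

def least_cost_path(links, src, dst):
    """Dijkstra over (node, node) -> cost pairs."""
    graph = {}
    for (a, b), cost in links.items():
        graph.setdefault(a, []).append((b, cost))
        graph.setdefault(b, []).append((a, cost))
    heap, seen = [(0, src, [src])], set()
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == dst:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for nxt, c in graph.get(node, ()):
            if nxt not in seen:
                heapq.heappush(heap, (cost + c, nxt, path + [nxt]))

links = {("A", "B"): 1, ("B", "D"): 1, ("A", "C"): 2, ("C", "D"): 2}
print(least_cost_path(links, "A", "D"))  # (2, ['A', 'B', 'D']): least cost wins

# What an error-aware fabric *could* do: penalise the flaky A-B link ...
links[("A", "B")] += 100
print(least_cost_path(links, "A", "D"))  # (4, ['A', 'C', 'D']): traffic rerouted
```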

For this reason I submitted a proposal to the T11 committee via my teacher and father of Fibre Channel, Horst Truestedt, to extend the FC-GS services with new ways to solve these problems. (The proposal can be downloaded here.)

The underlying thought is to have port-to-port communication so one side can notify the other that its side of the link is not stable, as well as an end-to-end error verification and notification algorithm, so that hosts, HBAs and fabrics can act upon errors seen in the path to their end devices. This allows active redirection of traffic to prevent frames from passing via a bad route, as well as extended management capabilities so that storage administrators can act on these failures and replace/update hardware and/or software before the problem becomes imminent and affects the overall stability of the storage infrastructure. In the end this will result in far greater storage availability and application uptime, and prevent all the other nasty stuff like data corruption.
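
Purely as an illustration of that idea, and not of the actual FC-GS message formats in the proposal, such a notification might carry something like the following:

```python
# Illustrative only: field names and handling are made up, not the T11 proposal.
from dataclasses import dataclass

@dataclass
class LinkQualityNotification:
    reporting_port: str     # the F_Port seeing the errors
    peer_port: str          # the N_Port that is "sending out crap"
    errors_per_sec: float   # e.g. invalid transmission words per second

def handle(n: LinkQualityNotification):
    # 1. tell the transmitter its path is degraded, so multipathing can fail over
    print(f"notify {n.peer_port}: your link to {n.reporting_port} is degraded")
    # 2. give administrators something to act on *before* the link fails hard
    if n.errors_per_sec > 1.0:
        print(f"alert: check cable/SFP between {n.peer_port} and {n.reporting_port}")

handle(LinkQualityNotification("edge-sw1/port7", "RedServer-HBA0", 42.0))
```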

The proposal was positively received with an 8:0 voting ratio, so now I’m waiting for a company to pull this further and actually start developing this extension.

Let me know what you think.

Regards
Erwin