Why frames get dropped at your storage array

Arrays from all vendors have significant troubleshooting capabilities, but in most cases the data is scrambled in the most obscure way you can imagine. You need special tools to look at the log dumps that come out of these systems. The nice part is that they contain a wealth of information, including what is going on at the FC layer.

The challenge, however, is to properly analyse and interpret these logs. The FC standards do not describe how some phenomena should be handled as long as the end result is achieved, so vendors and engineering departments take many different paths. It all depends on how a given engineering team reads the standards.

One curious thing I saw while working on a customer case was that an array port was dropping a lot of frames, but the array didn’t tell me why. Now there are a few reasons why an array port (or any port, whether an HBA, a vHBA (NPIV), etc.) may drop a frame.

The first and most obvious reason is that the frame is not destined for that port. This is simply caused by an incorrect routing decision somewhere. It is, however, the least probable cause, as the switching algorithms of all switch and router vendors have been chiseled in concrete for almost the lifetime of Fibre Channel and welded in place in the ASICs. As the switching algorithms are at the core of the connectivity between initiators and targets, no one pokes under the hood to see if they can “optimise” something without being heavily scrutinized by a lot of extremely smart people. The consequences of things going wrong are massive.

That being said, there are a few tiny moments where this can occur, mainly when targets or initiators leave the fabric (for whatever reason) and the switching decision on a few frames has already been made. Those frames then arrive at the destination port and simply run into a shut door, causing them to take a left turn into no-man’s-land.

As I could safely exclude that cause from the equation, the most obvious remaining reason would be a link-related problem, where an encoding or CRC error has corrupted the frame, rendering it useless, so the array simply drops it. I went back and looked at the link-error statistics of that target port and they were all showing 0 (zero). That baffled me a bit at first, but when you take into account the point made above about vendors reading the standards differently, you can think of reasons why some counters remain at zero even though a frame might be corrupt.

As most of you know, Brocade switches forward frames in which they have detected a CRC error. The FC standard does not specify that an E-port or F-port is required to discard a frame when it has detected such an error. A frame may thus be happily switched to its end destination, and that N-port then decides how to react. Obviously the content is useless, as the port is unable to correct the data (FEC not taken into account), so the port discards the frame. In many circumstances it is, however, able to determine which FC exchange and sequence the frame belonged to and take appropriate action far quicker than if it had to wait for timers to expire.

So why does the array port not show this CRC error in the link-error-status-block (LESB)? Well, when a Brocade ASIC sees a frame with a CRC error it will forward that frame, but it will also change the EOF delimiter to indicate that it has invalidated the frame. A frame normally ends with an EOFn (End Of Frame Normal), and when a Brocade ASIC detects an error it changes that to EOFi (End Of Frame Invalid, listed in the FC framing standard as EOFni, Normal-Invalid).
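To make the forward-and-mark behaviour a bit more tangible, here is a minimal sketch in Python. It is a toy model, not Brocade firmware: the `Frame` class, the `switch_forward` function and the use of `zlib.crc32` over the payload alone are my own assumptions for illustration (a real FC frame computes its CRC over the header and the data field).

```python
import zlib
from dataclasses import dataclass, replace

EOF_NORMAL = "EOFn"   # End Of Frame Normal
EOF_INVALID = "EOFi"  # End Of Frame Invalid, set by the switch


@dataclass(frozen=True)
class Frame:
    """Hypothetical, heavily simplified frame: payload, CRC and EOF delimiter."""
    payload: bytes
    crc: int
    eof: str = EOF_NORMAL


def make_frame(payload: bytes) -> Frame:
    # Sender side: compute the CRC over the payload (simplification).
    return Frame(payload=payload, crc=zlib.crc32(payload))


def crc_ok(frame: Frame) -> bool:
    # Receiver side: recompute the CRC and compare with the one in the frame.
    return zlib.crc32(frame.payload) == frame.crc


def switch_forward(frame: Frame) -> Frame:
    """Model of a switch that never drops on CRC error but marks the frame."""
    if not crc_ok(frame):
        # The frame is still forwarded, only the EOF delimiter changes.
        return replace(frame, eof=EOF_INVALID)
    return frame


good = make_frame(b"SCSI data")
bad = replace(good, payload=b"SCSI dat\x00")  # simulate corruption in flight

print(switch_forward(good).eof)  # EOFn
print(switch_forward(bad).eof)   # EOFi
```

The point of the sketch is that the corrupted frame still leaves the switch with its payload untouched; only the delimiter tells the next receiver that somebody upstream already noticed the damage.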

Depending on the firmware revision and the FC chip manufacturer, the firmware may decide not to increase the CRC counter when a frame has already been invalidated. You might argue that this contradicts the perceived benefit for troubleshooting, but that isn’t necessarily true. When a frame with a CRC error arrives at a port already invalidated and the port does not count it, it is easier to exclude the switch->array link from having any issue; the cause of the corruption should then be sought further upstream, towards the initiator. Especially when other error counters like ITW (Invalid Transmission Words), sync or link counters are also 0, internal causes on the port itself (like the SERDES chip) can be excluded as well.
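The counting policy described above can be sketched as follows. This is purely illustrative: `PortCounters`, `receive` and the `count_invalidated` switch are hypothetical names of mine, and real array firmware will differ per vendor and revision.

```python
from dataclasses import dataclass


@dataclass
class PortCounters:
    """Hypothetical subset of the counters an N-port might keep."""
    crc_errors: int = 0
    discarded_frames: int = 0


def receive(counters: PortCounters, crc_valid: bool, eof: str,
            count_invalidated: bool = False) -> bool:
    """Return True if the frame is accepted, False if it is discarded.

    count_invalidated models the firmware choice: when False, a frame that
    arrives already marked EOFi does not bump the CRC counter.
    """
    if crc_valid and eof == "EOFn":
        return True
    counters.discarded_frames += 1
    if not crc_valid and (eof == "EOFn" or count_invalidated):
        counters.crc_errors += 1
    return False


array_port = PortCounters()
receive(array_port, crc_valid=False, eof="EOFi")  # corrupted upstream, marked by switch
print(array_port)  # PortCounters(crc_errors=0, discarded_frames=1)
```

With `count_invalidated` left at False you get exactly the picture from the customer case: frames are dropped while the CRC counter stays at zero.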

For this reason Brocade has added an additional column to its porterrshow output called “crc g_eof” (CRC error with a Good EOF). If you see this counter increase, the corruption has occurred on this link, and it is very likely that other counters will show issues as well.
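As a rough way to reason about the two counters, the decision can be written down like this. The function and its parameter names are hypothetical, it does not parse real porterrshow output, and it assumes the plain CRC counter also includes frames that were already invalidated upstream.

```python
def locate_corruption(crc: int, crc_g_eof: int) -> str:
    """Rough triage for one receiving port, based on two counter values."""
    if crc_g_eof > 0:
        # CRC error with a good EOF: nobody upstream flagged the frame,
        # so it was damaged on the link into this port.
        return "this link"
    if crc > 0:
        # CRC errors, but every one arrived already invalidated.
        return "upstream"
    return "no CRC errors seen"


print(locate_corruption(crc=12, crc_g_eof=12))  # this link
print(locate_corruption(crc=12, crc_g_eof=0))   # upstream
```

Treat the result as a pointer for where to look next, not as a verdict; the other link counters still need to be checked.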

Be aware that an FC port is only able to DETECT corruption and is not the cause of it, so if you see CRC errors on a port the corruption has already occurred before the frame got there. There is no use replacing the SFP or even the entire switch. (Yes, I’ve seen that as well…)

Regards,

Erwin
