CRC errors – EvL Consulting

In Fibre-Channel, and many other network protocols, the use of CRC (Cyclic Redudancy Check) is adopted to detect corruption of frames. Be aware of the word “frames”!! As I explained in previous posts there are two layer of link integrity, an 8b/10b encoding/decoding algorithm (on 10G and 16G FC it has been changed to 64/66) which ensures dc balance plus error detection on the FC1 layer plus CRC which provides an additional check on the FC2 layer. Primitive signals or sequences are not frames and thus are not guarded with a CRC check.

Crc has the benefit that it can calculate on a serial bitstream as opposed to some other methods which require a certain fixed size of data in order to provide an integrity check. (A PKI like infrastructure is something along these lines such as GPG which can cryptographically sign an email message based on the entire content before it is sent). Secondly the calculation and reverse checking is very simple which means it can be build in hardware (ASICs or FPGA’s) without the need for software spending CPU cycles on both ends of the link which would have a serious impact on performance. There is absolutely no integrity check or security mechanism build into crc so any content can easily be modified, the crc recomputed and forwarded without the receiving side knowing it. In my test environments I use such methods to change an FC frame in-flight in order to see the behaviour of the modified content on the HBA or array. This allows me to inject data to test on protocol errors and the subsequent actions. (If I wasn’t able to recompute the crc, the destination port would detect the incorrect crc and just discard the frame.)

Now, going back to Fibre Channel. The CRC is calculated from the Start Of Frame (SOF) until the last word of the payload and is then appended to the frame. The FC2 layer will then add an End Of Frame (EOF) with a status qualifier. (I get back to this later).

Below a screenshot of a FC trace where the host issued a write(10) command to lun 32. The CRC is determined to be correct and the frame is ended with a EOFt (terminate, this doesn’t mean the IO is terminated but this sequence of the FC exchange is completed. I won’t go into this any further)

fc_cmd

If for any reason one or more bits in the bitstream between the SOF and last bit of the payload is changed the receiving side will do a reverse crc check which obviously will fail.

Now, I mentioned that the calculation is done inline of the bitstream. These days all fibre-channel implementations from all vendors use cut-through switching which more or less means that as soon as the first word of the FC frame is received (the one that contains the DID Destination ID or FCID) it is immediately forwarded to the out-port of that switch having the route set up according to FSPF. The second word of the frame may not even have arrived in full yet. This ensures optimal performance with next to no latency from a switching algorithm perspective. If you then do some maths and calculate the length of a FC frame it means that the first word of the frame may have already arrived at its final destination before the last word of the frame has even left the source.

frame_corruption_1 In the above picture you see a representation of three switches (pardon my drawing skills). Switch number one on the left has received a frame from an HBA and is sending this out on port 1. The frame is 1KB in size and the link speed is 8Gb/s. This means that the length of the frame is almost 250 meters long. If one or more bits flip at the 512th byte on the link between 1 and 2 the beginning of the frame is already at it’s destination so nowhere in the entire FC path any form of correction can be done. (there is an exception called FEC but I’ll discuss this in a later post). What will happen is that the port at (2) will detect the crc error and it will replace the EOFn or EOFt with a EOFni or EOFti (the i means invalid.) All switches will forward the frame to its destination (unless the DID in the frame-header is the part which is corrupt). As soon as the EOFxi arrives at the destination it can immediately discard the entire frame, clear the buffer and start the recovery procedures. If the intermediate switch detects the crc error and it would have discarded the frame over there both the initiator and target would have no clue what’s going on and would rely on the default FC and SCSI timeout values before is would be able to act on these.

From a troubleshooting perspective if you now look on the error counters on the switch ports (on a Brocade platform) you will see that port (2) will have logged the crc error in two columns of the porterrshow output: the crc_err and crc_g_eof (CRC error with a good EOF). Since the intermediate switch still forwards the frame port (4) will also detect the same crc error however since port (2) changes the EOFx into an EOFxi (invalid) port only the crc_err column on this port is incremented and not the crc_g_eof column. This mechanism allows you to follow upstream paths and determine where these errors originate from.

Hope this brings some insight and gives you better info of how to interpret these kind of errors.

Regards,

Erwin