When to replace an SFP

Maybe the title should be “When NOT to replace an SFP”, as on many occasions I see replacement treated as the first option for fixing problems. In reality the SFP is one of the least failure-prone components in the environment, and replacing one more often leads to other problems than to actually resolving the original issue.

When a problem surfaces on a Brocade switch, one of the first commands used to see if anything is wrong is “porterrshow”. This provides an overview of all errors accumulated on all ports. Therein lies problem number one: the values have accumulated since the switch was rebooted, since the port came online or since the values were manually cleared. Secondly, all these values are stored in a so-called LESB (Link Error Status Block), which holds a 32-bit counter for each of the values (well, most of them). A counter wraps back to zero once it has reached its maximum value, so using this output without the context of a baseline is useless. You really need to know whether values are accumulating (or have accumulated) during or since the time you observed a problem on your storage arrays or hosts. If values ramped up last year they will still be shown in this overview, but they most likely have nothing to do with this particular problem.
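The baseline argument above can be sketched in a few lines of Python. This is a minimal illustration, not a parser for real switch output: the snapshot dictionaries and their values are made up, and the code assumes every counter is 32 bits wide and wraps at most once between the two readings.

```python
# Assumption: each counter is 32 bits wide and wraps to zero after its maximum.
COUNTER_BITS = 32
COUNTER_MODULUS = 1 << COUNTER_BITS


def counter_delta(baseline: int, current: int) -> int:
    """Errors accumulated between two readings, allowing for a single wrap."""
    return (current - baseline) % COUNTER_MODULUS


def growing_counters(before: dict, after: dict) -> dict:
    """Return only the counters that moved between the two snapshots."""
    deltas = {
        name: counter_delta(before.get(name, 0), value)
        for name, value in after.items()
    }
    return {name: delta for name, delta in deltas.items() if delta > 0}


# Illustrative snapshots of one port, taken before and after the incident.
before = {"crc_err": 4_294_967_290, "enc_out": 1200, "link_fail": 3}
after = {"crc_err": 4, "enc_out": 1200, "link_fail": 3}

print(growing_counters(before, after))  # {'crc_err': 10}
```

The large but static enc_out value drops out of the result, which is the point: only counters that move during the problem window are relevant. If the counters were cleared between the two readings, the wrap arithmetic no longer applies and the second reading is itself the delta.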

Setting a new baseline is as easy as entering the command “statsclear” on the CLI. This resets the values of every port on the entire (logical) switch back to 0, so you can then see whether port errors are increasing by running the “porterrshow” command again.

Now here is the crux. NONE of these values actually imply that there is a problem with either the port or the SFP (directly). The port itself only DETECTS a problem but is in most cases not the CAUSE of the issue. If the port detects a CRC error because it has recalculated the CRC32 over the received frame and the result does not equal the value the sending port calculated, it does not mean this receiving port has a problem. It merely detects and reports the error. The same is true for encoding errors. (Be aware that CRC calculations are done on an individual frame level between two end points, whereas encoding/decoding is done on a byte level between two adjacent FC ports.)
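To make the detect-versus-cause distinction concrete, here is a small sketch in Python. It uses zlib.crc32 from the standard library purely as a stand-in checksum and makes no claim about the exact bit-level framing used on a Fibre Channel link; the payload and the function names are illustrative.

```python
import zlib


def send(payload: bytes):
    """Sender side: compute the checksum and ship it with the payload."""
    return payload, zlib.crc32(payload)


def receive(payload: bytes, crc: int) -> bool:
    """Receiver side: recompute the checksum and compare with what was sent."""
    return zlib.crc32(payload) == crc


payload, crc = send(b"frame payload")

# A clean path: the receiver's calculation matches the sender's.
clean_ok = receive(payload, crc)

# A dirty connector or damaged strand flips one bit somewhere in flight.
corrupted = bytes([payload[0] ^ 0x01]) + payload[1:]
corrupted_ok = receive(corrupted, crc)

print(clean_ok, corrupted_ok)  # True False
```

The receiver in the second case is working perfectly. It is the only component that did its job correctly and noticed the damage, yet it is the one whose error counter goes up.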

In 99.999% of all cases I’ve seen, the issue is in the cabling infrastructure. Whether it is cables, connectors, patch-panels, long-distance links or other physical components, any of these may cause the receiver to see corrupted frames or primitives. Even though the connection between two FC ports is bi-directional, the flow of light is unidirectional. Each SFP has a TX side which “talks” to the RX side of its remote counterpart and vice versa. It may well be that you see issues on one port but none on the other, because only one strand of the optical cable is broken or one side of a connector is dirty.
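The unidirectional reasoning can be written down as a tiny helper. This is only a sketch of the logic described above: the LinkEnd type, its field names and the port labels are hypothetical, and the input is simply whether receive-side error counters are growing on each end.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkEnd:
    name: str
    rx_errors_growing: bool  # are receive-side errors increasing on this end?


def suspect_paths(a: LinkEnd, b: LinkEnd) -> list:
    """Name the light path(s) to inspect, based on which receiver sees errors."""
    suspects = []
    if a.rx_errors_growing:
        # Errors are detected at a's receiver, so the light travelling
        # from b's transmitter towards a is what arrives damaged.
        suspects.append(f"{b.name} TX -> {a.name} RX")
    if b.rx_errors_growing:
        suspects.append(f"{a.name} TX -> {b.name} RX")
    return suspects


switch_port = LinkEnd("switch port 4", rx_errors_growing=True)
hba = LinkEnd("host HBA", rx_errors_growing=False)

print(suspect_paths(switch_port, hba))  # ['host HBA TX -> switch port 4 RX']
```

Everything along the named path is a candidate: the strand itself, each connector and each patch-panel hop between the two ends.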

Replacing an SFP based on errors in the porterrshow output is like watching a bush-fire through binoculars and expecting the fire to be extinguished by replacing the binoculars.

I’ve written a fair few articles before on how to analyse this. See my previous posts in the troubleshooting section.

OK, I hear you say “When DO we need to replace these SFPs then?”

There are only two indicators in a switch which outline a problem with the SFP. 

  1. A “switchshow” output shows a “Laser_Flt” on a particular port or
  2. The TX value of the “sfpshow” output is outside the supported range of the SFP.

The first indicator is fairly clear. The ASIC port is basically unable to communicate with the SFP CMOS or the SFP itself has reported an issue to the ASIC.

The second indicator is basically a failure of a three-point calculation: the combination of mAmps * Voltage does not reach the required power in watts, so the laser cannot be driven at the dBm level it should reach according to the manufacturer specification. There is nothing you can do about this but replace the SFP.
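For readers less familiar with the dBm scale, the conversion between an optical power in microwatts and dBm is a plain logarithm relative to 1 mW. The sketch below shows only that arithmetic; the function names are mine and the sample values are illustrative, not taken from any particular SFP.

```python
import math


def microwatts_to_dbm(power_uw: float) -> float:
    """Convert optical power in microwatts to dBm (decibels relative to 1 mW)."""
    if power_uw <= 0:
        raise ValueError("power must be positive to express it in dBm")
    return 10 * math.log10(power_uw / 1000.0)


def dbm_to_microwatts(power_dbm: float) -> float:
    """Inverse conversion: dBm back to microwatts."""
    return 1000.0 * 10 ** (power_dbm / 10)


print(round(microwatts_to_dbm(1000.0), 2))  # 0.0   (1 mW is 0 dBm by definition)
print(round(microwatts_to_dbm(500.0), 2))   # -3.01 (half the power is about -3 dB)
```

Because the scale is logarithmic, a drop of 3 dB means the laser is emitting roughly half the power, which is why a TX value a few dB below specification is significant.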

In all other cases the focus should be on the infrastructure I mentioned. 
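The whole decision can be summarised as a short check. This is a sketch under stated assumptions: the function name and parameters are hypothetical, the port state is passed in as plain text, and the TX limits must come from the data sheet of the SFP in question. The numbers in the example are placeholders, not real specification values.

```python
def should_replace_sfp(port_state: str, tx_dbm: float,
                       tx_min_dbm: float, tx_max_dbm: float) -> bool:
    """Apply the two indicators: a laser fault, or TX power outside the range.

    tx_min_dbm and tx_max_dbm are assumed to be taken from the SFP data sheet.
    """
    laser_fault = "Laser_Flt" in port_state
    tx_out_of_range = not (tx_min_dbm <= tx_dbm <= tx_max_dbm)
    return laser_fault or tx_out_of_range


# Placeholder limits, for illustration only.
TX_MIN, TX_MAX = -8.0, 0.0

print(should_replace_sfp("Online", -2.5, TX_MIN, TX_MAX))      # False
print(should_replace_sfp("Laser_Flt", -2.5, TX_MIN, TX_MAX))   # True
print(should_replace_sfp("Online", -11.0, TX_MIN, TX_MAX))     # True
```

Note what is deliberately absent from the inputs: no CRC counts, no encoding errors, nothing from porterrshow. Those point at the path, not at the SFP.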

Hope this helps in analysing and diagnosing a problematic link. If you have any questions let me know.

Regards,

Erwin
