As you can read in my previous articles (here, here and here), a physical issue on any of your FC links is detrimental to your entire FC infrastructure. Not only does it corrupt frames and primitives, it also causes traffic flow issues which may even propagate to other fabrics that are separated by a so-called air-gap. (See here)
Removing bad links from an active fabric is of the utmost importance. Having these fixed asap will also prevent performance-related issues, where congestion can escalate into increased latency and slow-drain behaviour.
Looking at the porterrors (see here) will quickly show you if a link is showing signs of physical problems. The enc_out, enc_in, crc_g_eof and pcs columns are the primary indicators of this. Having a set of MAPS rules which keep an eye on these counters, with fencing or decommissioning actions linked to them, is highly recommended.
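The idea behind such a rule can be sketched in a few lines of Python. This is not MAPS syntax and not how FOS implements it. The counter names are the porterrors columns mentioned above, while the snapshot dictionaries, the threshold values and the function name are assumptions made purely for illustration.

```python
# Illustrative sketch only: the limits and the data layout are assumptions,
# not MAPS defaults.
PHYSICAL_COUNTERS = ("enc_out", "enc_in", "crc_g_eof", "pcs")


def breached_counters(previous, current, interval_seconds, per_minute_limits):
    """Return {counter: rate_per_minute} for counters whose rate exceeds its limit."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    breaches = {}
    for name in PHYSICAL_COUNTERS:
        delta = current.get(name, 0) - previous.get(name, 0)
        if delta < 0:
            # Counter was cleared between the two snapshots: count from zero.
            delta = current.get(name, 0)
        rate = delta * 60.0 / interval_seconds
        # Counters without a configured limit never breach.
        if rate > per_minute_limits.get(name, float("inf")):
            breaches[name] = rate
    return breaches


# Two hypothetical snapshots of the same port, taken 120 seconds apart.
before = {"enc_out": 100, "enc_in": 0, "crc_g_eof": 2, "pcs": 0}
after = {"enc_out": 700, "enc_in": 0, "crc_g_eof": 3, "pcs": 0}
limits = {"enc_out": 100, "enc_in": 10, "crc_g_eof": 5, "pcs": 5}  # hypothetical

print(breached_counters(before, after, 120, limits))
```

In this made-up example only enc_out exceeds its limit. The single extra crc_g_eof in two minutes stays below the threshold and would not trigger any action.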
When a port is fenced or decommissioned, the PSM (Port State Machine) deactivates the port and puts it in an offline state. Depending on the rule this may be disruptive to active frames (fenced) or “less-disruptive”, where outstanding frames are still sent but new frames are no longer accepted and the port is put in the offline state once those frames have left the switch (decommissioned). Both actions do however have a drawback: the current state of the hardware no longer represents the state it was in when the error condition occurred. This makes troubleshooting the hardware itself very difficult.
One example is the SFP TX and RX values. When a port is fenced the SFP will be shut down. Therefore it will no longer send a signal to its remote counterpart. Even if the remote side is still sending a signal, the readings no longer represent the link as it was when the errors occurred.
The SFP values on an enabled port would show up as:
Sydney_ILAB_X6_43_TEST:FID43:admin> sfpshow 4/29 -f
Current: 6.534 mAmps
Voltage: 3383.2 mVolts
RX Power: -2.3 dBm (593.2uW)
TX Power: -1.9 dBm (651.0 uW)
If the port is then disabled on the remote side you would see something like this:
Sydney_ILAB_X6_43_TEST:FID43:admin> sfpshow 4/29 -f
Current: 6.538 mAmps
Voltage: 3365.7 mVolts
RX Power: -24.4 dBm (3.6 uW)
TX Power: -1.9 dBm (648.6 uW)
The TX power on the switchport is still well within the supported range, but the RX power is not. This does not tell you whether the remote SFP is broken, the remote port is shut down, there is a cable issue where the reflection is too high and not enough light is passed through, or another problem caused the link to go down.
The same applies when the switchport is in an inactive state:
Current: 0.066 mAmps
Voltage: 3390.5 mVolts
RX Power: -2.3 dBm (587.6uW)
TX Power: -inf dBm (0.0 uW)
A “switchshow” output would show you that the port is in a disabled state and the reason why that was done. It still doesn’t give you absolute numbers that allow you to determine whether this SFP is broken or not.
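To make the power readings above easier to compare, here is a small Python sketch that converts between dBm and microwatts and pulls the RX and TX values out of output lines formatted like the excerpts shown. The regular expression and the function names are my own assumptions based only on those excerpts, so other FOS versions may format the lines differently.

```python
import math
import re


def dbm_to_uw(dbm):
    """Convert optical power from dBm to microwatts (0 dBm equals 1 mW)."""
    return 1000.0 * 10 ** (dbm / 10.0)


def uw_to_dbm(uw):
    """Convert microwatts to dBm; no light at all is reported as -inf."""
    if uw <= 0:
        return float("-inf")
    return 10.0 * math.log10(uw / 1000.0)


# Matches lines such as "RX Power: -2.3 dBm (593.2uW)" or "TX Power: -inf dBm (0.0 uW)".
POWER_LINE = re.compile(r"^(RX|TX) Power:\s*(-inf|-?\d+(?:\.\d+)?)\s*dBm")


def parse_power(lines):
    """Return {'RX': dBm, 'TX': dBm} for the power lines found in the output."""
    readings = {}
    for line in lines:
        match = POWER_LINE.match(line.strip())
        if match:
            readings[match.group(1)] = float(match.group(2))
    return readings


inactive_port = [
    "Current: 0.066 mAmps",
    "Voltage: 3390.5 mVolts",
    "RX Power: -2.3 dBm (587.6uW)",
    "TX Power: -inf dBm (0.0 uW)",
]
for direction, dbm in parse_power(inactive_port).items():
    print(direction, dbm, "dBm =", round(dbm_to_uw(dbm), 1), "uW")
```

Converting the rounded dBm figure back to microwatts gives a value slightly different from the one printed by the switch, presumably because the dBm value is rounded to one decimal. The drop from -2.3 dBm to -24.4 dBm in the earlier example means the received light fell from roughly 590 uW to under 4 uW.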
F-ports and E-ports
FOS supports both fencing and decommissioning. Fencing is supported on F- and E-ports. Decommissioning is supported on E-ports, but on F-ports only when BNA is used. (See the respective manuals on how to set that up.)
There is one more consequence if the fencing action is applied to an E-port and disables ISLs. Not only does this have the drawbacks pointed out above, it also results in fabric changes that trigger FSPF routing algorithms to kick in, zone- and nameserver updates to be distributed and port CAM tables to be updated on all affected switches. Although this should not have too much of an impact, it can sometimes be better to leave the link itself in an active state and, instead of decommissioning the port, set it into an “impaired” state.
Impaired instead of shut
This is recommended where more than one ISL is connected between two switches. Long distance links are particularly well suited for this configuration. (For more info on long distance see here).
The thought behind impairment is basically to leave the physical side intact but not route any frames over that link. Frames will therefore not be exposed to corruption and will not cause issues on initiator or target devices. The only thing that may happen is that the remaining ISLs get overloaded if not enough bandwidth is available to cope with the workload. If an impairment approach is chosen it is wise to have an N+1 (or N+2) approach, where the N ISLs are sized so that the maximum workload traversing these switches uses around 80% of their capacity. If one ISL is then impaired by a MAPS rule, the workload on the remaining ISLs will not result in congestion, unless other factors come into play.
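The N+1 sizing can be worked through with some simple arithmetic. The sketch below is only an illustration: the function names, the 100 Gbps peak workload and the 32 Gbps ISL speed are hypothetical, and it treats the nominal link speed as usable bandwidth, which is a simplification.

```python
import math


def isl_count(peak_gbps, isl_gbps, target_utilisation=0.8, spares=1):
    """Return (n, total): n ISLs carry the peak at the target utilisation, plus spares."""
    if peak_gbps <= 0 or isl_gbps <= 0:
        raise ValueError("peak_gbps and isl_gbps must be positive")
    if not 0 < target_utilisation <= 1:
        raise ValueError("target_utilisation must be in (0, 1]")
    if spares < 0:
        raise ValueError("spares cannot be negative")
    n = math.ceil(peak_gbps / (isl_gbps * target_utilisation))
    return n, n + spares


def utilisation(peak_gbps, isl_gbps, active_isls):
    """Fraction of the available ISL bandwidth used at peak workload."""
    return peak_gbps / (isl_gbps * active_isls)


# Hypothetical example: 100 Gbps peak workload over 32 Gbps ISLs.
n, total = isl_count(100, 32)
print("N =", n, "total ISLs =", total)
print("all ISLs active:", utilisation(100, 32, total))
print("one ISL impaired:", utilisation(100, 32, total - 1))
```

With these numbers N is 4 and the total is 5 ISLs. When one ISL is impaired the remaining four still run below the 80% target at peak, which is exactly the headroom the N+1 approach is meant to provide.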
A MAPS rule triggering an E-port impairment will only actually impair that port if it is NOT the last ISL between the two switches.
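The "not the last ISL" condition can be expressed as a simple check. This is a sketch of the logic only, not of how FOS implements it; the function name, the data layout and port 4/30 are assumptions made for the example.

```python
def can_impair(port, isls_to_neighbour):
    """Return True if `port` may be impaired without isolating the neighbour.

    isls_to_neighbour maps every ISL port towards the same neighbouring
    switch (including `port` itself) to True when that ISL is currently
    usable for traffic, i.e. online and not already impaired.
    """
    if port not in isls_to_neighbour:
        raise KeyError(f"{port} is not an ISL to this neighbour")
    other_usable = [
        p for p, usable in isls_to_neighbour.items() if p != port and usable
    ]
    return len(other_usable) > 0


# Port 4/29 shows errors; 4/30 is a second, healthy ISL to the same switch.
print(can_impair("4/29", {"4/29": True, "4/30": True}))   # another ISL remains
print(can_impair("4/29", {"4/29": True, "4/30": False}))  # 4/29 is the last one
```

The second call returns False because 4/30 is already out of service, so 4/29 is the last usable ISL and impairing it would cut the two switches off from each other.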
Cisco has similar port fencing techniques but the approach may be a bit different. Have a look at the port-monitor and port-guard functionality.
It is up to the SAN architect/designer to decide how to approach these actions. Irrespective of what is chosen, some form of preventative measure or automated reactive action should be implemented to ensure physical issues do not cause problems that propagate into a fabric. I’ve seen issues on a single port bring down large parts of SANs, even propagating onto fabrics with physical air-gaps.
The entire rule-set should be reviewed and adjusted to the capabilities of the applications and end-devices. Shutting down a port because a rule triggers at 2 CRC errors per minute might not be the best decision. Operating systems and applications should be able to tolerate some errors and recover by resubmitting IOs.
Intermittent errors are the most disruptive ones, as in many cases these do not trigger MPIO actions (see here). Implementing these port-fencing techniques will result in MPIO doing its job and moving workload onto the remaining paths.
I hope the above provides some more info on the reasoning of port-fencing, decommissioning and impairment. Don’t hesitate to get in touch -> contact