Many customer cases I handle are related to poor connectivity. A connectivity problem can be caused by unclean connectors, broken cables or SFP's. (See one of my earlier blog posts).
Although the switches are capable or identifying physical issues and subsequently notifying administrators, it's hardly ever being followed up. Very often an acute issue is lingering for days before an administrator starts investigating and in many cases this is only because of a server admin start complaining of SCSI errors or IO time-outs or very poor performance.
So how do we prevent this from happening? Well, for starters make sure that your environment is clean. With this I mean you should make sure that all connectors are not exposed to dust or other types of contamination. Secondly try to handle cables with care. I've seen many cases where cables were under so much tension that Jimmy Hendrix would be able to compose one of his finest works on it. Although modern fibre cables are fairly rugged and are able to handle a fair amount of tension try not to test this. At a last bullet point I would suggest to keep an eye out on light emitting power ratios. As you most likely know lasers do not have an infinite lifetime and their transmission power will decrease over time. At some point in time the receiving end of a link is most likely no longer able to distinguish between on or off in a reliable manner and as such the 8b10b (or 64b/66b) encoding/decoding algorithm will start to detect bit flips and as such it will discard a transmission word. The upper and lower power requirements are published in the data-sheets so as soon as one of these values reach their lower values replace them.
Now you might argue that if you have 10000 ports in your fabric you might have other things to worry about than checking SFP power values every day. The stress put on storage admins is not decreasing the last time I looked so this will most likely not be the case for the years to come.
Fortunately you don't have to. Both Brocade and Cisco provide option to monitor each individual component. For many years Brocade has one of the best embedded management tools there is namely Fabric Watch (FW). FW is not an active management tool per-se however the underlying goal is to have a sort of self-healing and protecting framework to monitor, alert and take action on events that might have implications on overall fabric behaviour.
A single dodgy link can have significant implications on overall fabric behaviour which can, and will, impact many hosts depending on topology and traffic pattern. FW allows you to set thresholds on many items in a switch from SFP power values, link errors, temperature readings etc etc. Each of these items can be configured with certain characteristics like above,below,in-between or change values. On each of these a time frame can be configured.
Now lets take an example on a link that has some intermittent errors. Your applications tolerate a certain error ratio per time-frame that they can recover from so in case on or two IO errors per hour are seen by the OS or application it will re-send the read or write command and all is good. If however, this starts to increase you might end up with the application going down or even data corruption. If you have configured FW to send a notification in case the amount of errors increase beyond the application tolerance, you will be able to take some action and investigate were the problem might be.
Now there is another issue and that is that you're most likely not sitting behind a console 24x7 or monitoring emails during your holidays. So even if you do get notified there is a good chance you will not notice it. (I know I won't when I'm playing golf :-))
These call for some more drastic measures and this is also covered by FW. If a certain threshold increases beyond a warning level and reaches a critical level FW allows you to take some action right away. This is a feature Brocade call port-fencing. Basically what it means is that this threshold is met it will just disable the port to prevent it from propagating the problems further up in the fabric. This is REALLY an area you SHOULD investigate. It can save you from having many issues showing up all over the fabric.
The title of this blog post is unfortunately the status as it now stands with most of the installed base of fabrics and the reason seems to be that administrators have a problem with software deciding on disruptive actions like disabling ports. My argument is that this port is already in a degraded state plus it also causes other links in the entire fabric having problems. If you don't know what your looking for and have this large 10000 port fabric it will take you a significant amount of time before you know what's going on. In this time many, many more hosts and applications can and will suffer from significant performance and other problems which might create some significant overtime for many people.