One rotten apple spoils the bunch – 3

In the previous two blog posts we looked at why a Fibre Channel fabric may still run into problems even with all redundancy options in place and MPIO checking for link failures.
The challenge is to identify a problematic port and to act on the indications that something is wrong on a link.

So how do we do this in Brocade environments? Brocade has a number of features built into its FOS firmware that allow you to identify certain characteristics of your switches. One of them (Fabric Watch) I briefly touched upon previously. Two other commands which utilize Fabric Watch are bottleneckmon and portfencing. Let's start with bottleneckmon.

Bottleneckmon was introduced in the FOS code stream to identify two different kinds of bottlenecks: latency and congestion.

Latency is caused by a very high load to a device that cannot cope with the offered load, even though that load does not exceed the capabilities of the link. As an example, let's say a link has a synchronized speed of 4G while the load on that link reaches no higher than 20MB/s, and already the switch is unable to send more frames due to credit shortages. A situation like this will almost certainly cause the sort of credit issues we've talked about before.

Congestion is when a link is overloaded with frames beyond the capabilities of the physical link. This often occurs on ISLs and target ports when too many initiators are mapped onto those links, which is often referred to as an oversubscribed fan-in ratio.
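
To put a number on that fan-in ratio: if, for example, ten hosts each logged in at 4G are zoned to a single 8G storage port, the theoretical offered load is 40G against 8G of available bandwidth, i.e. a 5:1 oversubscription. The link does not need to get anywhere near that worst case before it congests; a handful of busy initiators hitting it at the same time is enough.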

A congestion bottleneck is easily identified by comparing the offered load to the capability of the link. Extending the connection with additional links (ISLs, trunk ports, HBAs) and spreading the load over them, or localizing/confining the load on the same switch or ASIC, will in most cases help. Latency, however, is a very different ballgame. You might argue that Brocade also has a port counter called tim_txcrd_zero, and that when a port runs out of transmit credits very frequently you have a latency device, but that's not entirely true. It may also mean that the link is simply very well utilized and is using all its credits. In that case you should also see a fair link utilization in terms of throughput, but be aware that this depends on frame size as well.
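
As a rough sketch of how you could look at this from the FOS CLI (counter names and output formats can differ slightly between FOS releases, and the port number below is just an example):

    portperfshow        (actual throughput per port, to compare against the negotiated link speed)
    portstatsshow 15    (all counters for a given port, including the tim_txcrd_zero counter)
    porterrshow         (one-line error summary per port: enc out, crc err, C3 discards, etc.)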

So how do we define a link as a high-latency bottleneck? The bottleneckmon configuration utility provides a vast number of parameters you can use, but I would advise starting with the default settings by simply enabling bottleneck monitoring with the "bottleneckmon --enable" command. Also make sure you configure alerting with the same command, otherwise the monitoring will be passive and you'll have to check each switch manually.
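
A minimal sketch of what that looks like on the CLI (double-check the exact options against the admin guide of your FOS release):

    bottleneckmon --enable -alert    (enable switch-wide monitoring with alerting, using the default thresholds)
    bottleneckmon --status           (verify that monitoring and alerting are active and show the configured values)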

If a high-latency device is caused by physical issues like encoding/decoding errors, the bottleneckmon feature will notify you; however, when this happens in the middle of the night you will most likely not be able to act on the alert in a timely fashion. As I mentioned earlier, it is important to isolate such a badly behaving device as soon as possible to prevent it from having an adverse effect on the rest of the fabric. The portfencing utility will help with that. You can configure thresholds per port type and error class, and when such a threshold is reached the firmware will disable the port and alert you about it.
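
The exact portfencing syntax, port-type keywords and available error "areas" vary per FOS release, so treat the lines below purely as an illustration and check the Fabric Watch section of the admin guide before applying anything:

    portfencing --show                         (display the current fencing configuration per port type and area)
    portfencing --enable fop-port -area CRC    (example only: fence F-ports once the Fabric Watch CRC threshold is exceeded)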

I know many administrators are very reluctant to have a switch take this kind of action on its own, and for a long time I agreed with that. However, having seen the massive devastation and havoc a single device can cause, I would STRONGLY advise turning this feature on. It will save you long hours of troubleshooting and elongated conference calls whilst your storage network is bringing your applications to a halt. I've seen it many times: even after pointing to a problem port, the decision to disable that port is very often subject to change management politics. If you have such guidelines in your policies, NOW is the time to revise them and let the intelligence of the switches prevent these problems from occurring.

For a comprehensive overview, the available options and configuration examples, I suggest you first take a look at the FOS Administrator's Guide for the latest FOS release. Brocade has also published some white papers with more background information.

Regards,
Erwin

 


2 responses on "One rotten apple spoils the bunch – 3"

  1. Erwin van Londen

    Hi Seb,

    Thanks for your comments.

    Indeed, you need to be careful about which classes and entities you enable portfencing on, and as such the instructions from support people such as yourself and me should be followed. The thing I wanted to emphasize is that administrators and operational management should strive to use this feature as a measure to prevent problems, and not wait for such problems to happen, after which significant downtime and possibly recovery efforts have to be undertaken to get the infrastructure back on track again.

  2. seb

    When it comes to portfencing I think it's worth mentioning that you can configure it to trigger on C3 discards due to timeout on the TX side. If you do that only for F-ports, you should be able to catch hard slow-drain device ports. Don't do this for E-ports, because due to backpressure there could be a trail of dropped frames through the whole fabric and you would fence a lot of "affected-but-not-the-culprit" ports. Plus: if you also enable portfencing against other "areas" like PE (protocol errors), set the correct fillword first. If you use fillword 0 (IDLE/IDLE) for an 8G connection, the link might come up (because link initialization uses IDLE and is okay then), but the switch then interprets the incoming ARBff fillwords as protocol errors (because it expects IDLEs) and fences the port. You see that as "er_bad_os" in portstatsshow.
    Cheers, seb