Tag Archives: congestion

Brocade Fabric Vision – Version 1

As you may have read in my previous posts I’m not really a fan of marketing driven terminology whereby existing technology is re-branded over and over again in order to obfuscate the underlying technology and make things more complex that they really are. The FC Gen-X nonsense is one of them. With Brocade Fabric Vision it took me a while however I see where Brocade is going with this and more where it is coming from.

Continue reading

The Emergency Health Threat (EHT) on Brocade FC fabrics

Since a couple of FOS versions ago Brocade wanted to fix the problem Fibre-Channel has by definition called credit back-pressure. The word “problem” overstated since in 99.99999% of all fabrics you’ll never see this anyway.  As you know Fibre-Channel is a deterministic architected type of network which requires devices to behave properly. The chaos theory of Ethernet networks where flooding and broadcasts are more along the line of shoot the fly with a cannon doesn’t apply to Fibre-Channel.

So what is the issue then? By design two link-end points tell each other how many buffers they have available to store a FC frame during link initialization in the FLOGI (Fabric Login) phase. These is bi-directional so one side of the link can have 40 credits whilst the other only has three. This way the sending side know that it can send X amount of frames before it has to stop.

Below you see a snippet of a FC FLOGI trace plus the subsequent Accept.

The HBA tells the switch it has 3 buffers available so the switch is allowed to send a maximum of 3 frames before it need to wait for an R_RDY

 The switch then returns with an accept in which it tells the HBA that it has 8 buffers available.

When the sending side transmits a frame it subtracts one credit from this amount. The receiving side on the other side of the link forwards the frame to its destination and sends a, so called, R_RDY (Receiver Ready) primitive back to the transmitting side which causes the number of outstanding credits to be increased by one again. So far pretty simple but imagine the following scenario.

I have three servers on the left and a storage array on the right with a switch in the middle. You see that the link between the storage array and the switch carries all traffic from all three servers. In a normal situation this is no problem as long as all devices behave as they should. Send a frame as long as a credit is available and wait for an R_RDY to return to replenish that credit. Now what happens if the blue server start behaving badly? ie. for some reason either it is starting to become very slow on returning these R_RDY primitives or the R_RDY is corrupted due to a physical link problem. The second scenario is pretty easy to figure out since the switch logs these kind of problems in the porterrshow plus this particular link will log a lot of LR entries in the fabriclog. (Have a look)

To get back to the problem you’ll see that at some point in time when the number of credits are all used from port #6 to port #3 the frames that are coming in on the switch ingress port #7 can no longer be forwarded to that port 6 and subsequent port #3. So these buffers on port 7 will fill up pretty rapidly which also means there is no room anymore for frames arriving from the array destined to be forwarded via port #5 to the green server or port #4 to the red server.  So even when these two server have no problem at all they might get really impacted by the blue server. One way to overcome this is to shut down port #6 and you will see that traffic from the red and green server start flowing again. When there is a driver or firmware issue you will have some more troubles finding out what’s causing all this.

Now you may argue and say “listen dude, these frame do not sit in there indefinitely and at some point they will be discarded which allows for these buffer to be reused again so traffic starts flowing again.” Yes, you are right. There always has been a bit of a ballpark figure of a, so called, “hold time” which more or less meant that a frame may sit in an ingress buffer for X amount of time before the ASIC may discard that frame and free up the buffer and consequently return a R_RDY to the transmitting side. The issue is however that the transmitting port on the other side of the link might as well send a frame with the same troubled destination which also means that that frame will sit in the ingress buffer for that X amount of time. Brocade has used a formula to determine this “hold time” and on a default configured fabric this turns out to be 500ms. So in the previous example it may well be that for sending only two frames you lose an entire second. That’s an eternity in Fibre-Channel and storage so every precaution needs to be taken to prevent this from happening. Your performance across the fabric with sink through the toilet.

Now lets take this previous example one step further.

A fairly regular small to medium size SAN in a core-edge design looks a bit like this.

You have your servers on the left, the colored boxes are edge switches and the central ones are the cores. Nothing really fancy fancy but you already can see that from a trafficflow perspective  it becomes pretty weird already since everything can go everywhere based on the source and destination routing paths that are set up during fabric build and FSPF routing calculations. Especially on the ISL’s between the two cores you see all traffic from all servers, disk arrays and tape drives. You can imagine that is things start to stack up somewhere you will end up in the same scenario are I described before but with this difference that all traffic on those core ISL’s can be affected. So even one device which sits behind the red switch and has a target on the purple switch may cause a problem to servers sitting behind the yellow switch going to a target behind the light blue switch. Similar the other way around since high latency devices or software issues might also be in the target devices.

So how do you overcome such a situation? Well, since we can’t predict failures to the exact minute we can try to circumvent one device clogging up traffic through the entire fabric. The best and only answer is something I described in my earlier blog posts (check the Rotten Apple series) to shut down the misbehaving port(s). My take is that if a port doesn’t work properly anyway plus it has a significant adverse effect on the rest of the entire fabric it is really a nobrainer to shut it down. Brocade FOS has pretty nifty tools like fabricwatch to get this sorted. It seems that from any logical, technical and reasonable standpoint this is the best option. Until politics get involved and from that point on all hell breaks loose. Every non-related, non-technical and non-reasonable argument is thrown back at you to prevent the automation of these great tools to shut down a misbehaving device. It’s like you know you’re being robbed and the police is ready to intervene and catch these criminals but some politician (or business owner in your case) doesn’t allow that so the robbers get away with the loot.

So the second best answer is that we need to drop frames on the edge switches sooner than on the cores. With the introduction of FOS 6.3.1b Brocade added a parameter which allowed you to adjust this edge hold time from around 100ms to the default 500ms. So this would in essence be a fairly average candidate to prevent these kind of credit back-pressure problems. The reasoning is that if the edge switch does not have any credits to forward a frame to the core-switch in a shorter period than the core-switch needs to forward it on its egress ports, there will be no overload on the core switch and thus no timing issues on that side. This allows the cores to keep moving traffic since those ingress ports will not be  subjected to credit starvation since the frames are already dropped in an earlier stage and the outstanding buffer will be replenished sooner. In the end the result is that non-related traffic therefore will not be affected. There is however one major “gotcha”. This is a global switch setting on condor 2 platforms (which needs to be set on the default switch) and is applied PER ASIC AS SOON AS AN F-Port IS FOUND ON THAT ASIC. This means that if you share F-Ports and E-ports on a single ASIC the EHT is also applied on those E-ports so all returning traffic sitting in that Edge switch ingress port is immediately affected. So it is NOT a good idea to mix E-ports and F-ports on the same ASIC. What also is not a good idea is to have this EHT set too low (or tinkered with at all) on ASICs where target ports are connected since the argument is that one single target port might have a significant amount of fan-in ratio of devices and thus is more susceptible to somewhat elongated response times anyway which might affect timeing issues .

Although the EHT in general seems a great idea it can have some ramifications because what will happen on a very busy fabric where the credit_zero counters are already ramping up with significant speed. Reducing the time on which a frame can be sitting in a buffer might already be a problem. If frames are dropped due to timeouts whilst being allowed to sit around for 500ms will certainly be dropped with a much higher rate if the hold time is reduced.

Now here is where Brocade is currently shooting itself in the foot.

In subsequent versions, especially in the 7.0.1 and 7.1.0 releases they adjusted the default values from 500 to 220MS. ………. (keep thinking………)

Also there is a difference between Condor3 platforms (16G) and Condor2 platforms (8G) on the virtual circuits that live on the ISL’s plus the fact this is now configurable per logical switch (Condor3 platforms that is). Great fun if you have a mixed 8G and 16G blades in your DCX8510

Done thinking????

In general what we see in support is that many edge switches are embedded switches in blade-chassis and these fairly often fall under the responsibility of the server administrators. Now, I don’t want to discredit these fine folks, but in general they do not often take a look at the FC switches in this chassis. The result is that these switches are very often running old code. Now what happens if the storage admin decides to upgrade the core-switches to FOS 7.1.0x.??? The EHT will drop right away to 220ms since that’s the new default on that FOS version. The edge switches still run old code which have a default of 500ms hold-time so in a blink of an eye you now have a reversed EHT fabric where the edge switches (including the ones which might have bad behaving devices attached) are doing their job of trying to send a frame during 500ms but these frames will most likely be delayed much longer on the core-switches due to fan-in ratio and credit back-pressure and thus will be discarded at a much higher rate here than on the edge. Given the fact the EHT does not discriminate between frame on source and destination, all these dropped frames are likely to be from many, many initiators and targets across the fabric. This will results in the upper-layer protocols (like SCSI) having to re-drive their IO’s in addition to the normal workload and thus the parabolic eclipse of misery rises exponentially.

So what is my advice.

  1. First, keep all codelevels as up-to-date as possible.
  2. Monitor for bad-behaving devices, use FabricWatch AND SHUT THESE DOWN with the portfencing feature!!!!!!!!!!!!!! Preventing problems is always better that having to react on it.
  3. Make sure that if you use the EHT you configure it correctly. There are currently too many flaws in this feature that I wouldn’t recommend changing it to another value than the one we have used for decades and that is 500ms. If you fabric is up for it than keep it consistent across all switches in the fabric.
  4. Use bottleneck monitoring and alerting to identify high latency devices. Check these devices for up-to-date firmware and drivers and check if any physical issue might be the cause of the problem. (especially check the enc_out column of the porterrshow output.)
  5. Also use bottleneck monitoring to recover lost credits. Use the “bottleneckmon –cfgcredittools -intport -recover onLrOnly” command to have FOS fix credit issues on back-end links.
From a design perspective we often see schema’s which resemble the picture I’ve drawn above. A core-edge design with servers on one side and target on the other. Personally i think this is the worst design you could think of since it is susceptible to all sorts of nasty thing of which physical issues are the most obvious and easy to fix. Latency and congestion are a very different ballgame so keeping initiators and target as close as possible to each other has always had my preference. Try to localize per port-group, then  ASIC, then blade, then switch and last per hop count. This will give you the best performing and most resilient fabric.

!!!!!!!!!!!   UPDATE April 3rd 2013  !!!!!!!!!!!!

I’ve got some feedback from Brocade based on some questions we had:

In case of existing Fabric where EHT is 500ms and you insert a brand new 8510 in the core, the 8510 is set to 220. What is Brocade’s recommendation?

  • Brocade’s recommendation is to change the EHT for the new 8510 in the core to 500ms
  • Once set to 500ms on the 8510, any firmware upgrades to the DCX or 8510 will retain the EHT settings of 500ms

Can you set EHT on condor3 by port? Where can we check this setting in Supportsave logs?

  • You can’t set EHT by port … it is still one setting for the switch, but that setting will only take effect on individual ports based on the ASIC and port type:
Condor-2
  • ASIC has only F-ports. The lower EHT value will be programmed onto all F-ports
  • ASIC has both E-ports and F-ports.  All ports will be programmed with 500ms
  • ASIC has only E-ports.  All ports will be programmed with 500ms

Condor-3

  • ASIC has only F-ports.  The lower EHT value will be programmed onto all F-ports
  • ASIC has both E-ports and F-ports.  The lower EHT value will be programmed onto all F-ports, and 500ms will be programmed onto all E-ports
  • ASIC has only E-ports. All ports will be programmed with 500ms
  • You can see the current value set by using the following CLI command:

configshow | grep “edge”
switch.edgeHoldTime:500

  • This should also be part of the system group, configShow command in the Supportsave

What is the real default for FOS 7.x  220 or 500?

  • Brand new system, newly installed with 7.X firmware:   Default will be 220
What is Brocade’s Best practice? 220 on edge and 500 in core?
  • Yes, from our testing, we have shown this to be the optimal setting.   

Additional info:

  • Existing DCX upgraded from 6.4.2 or 6.4.3 to 7.X:               Setting will remain at 500ms with one exception :

Upgrading from 6.4.2 or 6.4.3, the 6.x default of 500ms will be retained, with one exception: If the configure command had never been run before, and the very first time it is run is after upgraded to 7.X then the EHT will use the 7.x default of  220ms  (I don’t think this should be a concern because every end user would have run the configure command. I think this would only apply on a new unit shipped with 6.x, never installed/configured, then upgraded to 7.x, then configured. In this case it would appear as a “new” 7.x install with 7.x default).

=================================================================

As mentioned above there will be some separate documentation around this topic. My preference would be to have the ability to segregate core from edge-switches plus manually be able to differentiate between E- and F-ports irrespective of FOS version and/or chip type but I also do acknowledge this might not be a real option on older equipment.

Hope this helps a bit if you get confused by the different options and settings regarding this Edge-Hold-Time.

Regards,
Erwin

PS. The “hold-time” formula I mentioned above is  ((RA_TOV – ED_TOV)/(max_hops + 1))/2 which by default translates to (10000ms-2000ms)/(7+1)/2=500ms

and PPS: pardon my drawing skills. 🙂