Sigh….. yet another post to debunk some FCoE marketing and explain in some more detail why FCoE is not good for Business Continuity (or business in general).
In one of my previous posts I highlighted the inherent problem that Ethernet has regarding flow control. Ethernet uses a PAUSE mechanism (or Priority PAUSE on 10GE) which basically means that if the receiving side is short on buffers it notifies the remote side with a PAUSE frame to stop sending frames. When it is able to receive frames again it sends an UNPAUSE to notify the remote side to start sending again. This also means that during the “PAUSED” state, every frame entering the port is sent into lala-land and hence discarded. As with Fibre-Channel there is no recovery mechanism at this level, so every recovery attempt relies on upper-level protocols. On IP networks this is no problem, since TCP/IP has such a huge amount of bulletproof recovery mechanisms built in that you will see no, or negligible, issues. On a serial channel protocol like Fibre-Channel, which transports a half-duplex protocol such as SCSI, you might get away with losing one or two frames, but if you lose more than that you will see a lot of issues which result in performance problems, IO errors and most likely data corruption.
The problem is that the number of frames being discarded grows in proportion to the distance. The flow-control mechanism Fibre-Channel uses (Buffer-to-Buffer credits) behaves differently: there are never more frames on the wire than the remote port has granted credits for.
So let's take an example for synchronous replication over a medium distance (5KM). On average a frame is around 800 to 1000 bytes. (Don't assume that a frame is always the full size it can be (2148 bytes). The SCSI command, XFR_RDY and status frames are all very small, and the size of the data frames depends on the application; for a simple database query you'll see a payload of around 512 bytes. Only high-throughput applications like back-up, video and asynchronous journal-based replication will see full frame sizes.) The refraction index of an optical cable is around 1.47, which means the maximum speed of light in that cable is around 203000 KM/s. At a 10GE transmit rate it therefore takes around 8.735E-07 seconds for a 900-byte frame (including encapsulation overhead) to be put on the wire. This translates to a physical frame length of ~178 metres, so on a 5KM wire there are potentially around 28 frames in flight at any given point in time. If the receiving side then sends a PAUSE, ALL of these 28 frames will be discarded (if no UNPAUSE is sent beforehand).

From a storage perspective this means that ALL IO's which have one or more frames in this window will need to be re-tried by the initiator. Both Ethernet and FC are indiscriminate about which frames they discard; when there is no buffer space left, a frame is dropped regardless of its source and destination address. For a synchronous replication setup this also means the two (or more) storage arrays need to retain all status information of every IO, both from a host perspective and from a source-volume and destination-volume perspective. This increases load on all levels of the array and IO paths, including processor workload, cache maintenance, timers etc. etc… All of this has a negative influence on host performance and reliability.
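The back-of-the-envelope numbers above are easy to reproduce. A minimal sketch, assuming a 10GE line rate and a refraction index of 1.47 as stated in the text; the ~900-byte payload corresponds to roughly 1092 bytes on the wire once encapsulation overhead is included:

```python
# Sketch: how many frames are physically on a fibre link at any instant,
# using the assumptions from the text (10GE line rate, refraction ~1.47).
C_VACUUM_KM_S = 299_792  # speed of light in vacuum, km/s

def frames_in_flight(distance_km, frame_on_wire_bytes,
                     rate_bps=10e9, refraction_index=1.47):
    """Rough count of frames on the wire at any given point in time."""
    c_fibre = C_VACUUM_KM_S / refraction_index           # ~203,000 km/s
    serialization_s = frame_on_wire_bytes * 8 / rate_bps # time to emit one frame
    frame_length_km = serialization_s * c_fibre          # physical length of a frame
    return distance_km / frame_length_km

# 5 km link, ~900-byte payload (~1092 bytes on the wire): ~28 frames
print(round(frames_in_flight(5, 1092)))    # ~28
# 40 km link, full 2148-byte FC frames: ~114 frames, matching the
# corrected figure in the addendum at the end of this post
print(round(frames_in_flight(40, 2148)))   # ~114
```

Every one of those frames is a candidate for /dev/null the moment a PAUSE arrives.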
A 5KM distance is not that far. The majority of synchronous replication environments sit between 6 and 10KM. Obviously the number of frames on the wire depends on frame size, speed and distance, but the quality of the optical cable also has an influence.
Now, you might think that an asynchronous replication methodology negates the previous arguments, so that using FCoE for replication is a viable option, but that is incorrect. Asynchronous replication is used in long-distance replication scenarios and relies on large cache sizes, journals on disk, or a combination of these. So even though I might use large frame sizes, the distance and the PAUSE/UNPAUSE mechanism still put a strain on the link. For example, if I use a long-distance link of 40KM with a maximum FC frame size of 2148 bytes, I still have ~94 frames on the wire. If any congestion occurs on the remote side for whatever reason and the PFC mechanism decides to halt the reception of frames by sending a PAUSE, it sends those 94 frames (or however many arrive before it has free buffers and the UNPAUSE is sent) into /dev/null.
FCoE is NOT a viable option for data replication in its current form. The only way it might circumvent the issues mentioned above is to pre-emptively send PAUSE frames based on the monitored frame flow and the remaining available buffers. Even this is no guarantee that frames will not be discarded, and it also reduces the effective capacity of the link. A receiving port has no method to predict the number of frames underway further upstream, and especially on medium- and long-distance links this would require a significant investment in equipment and resources.
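To see why such a pre-emptive scheme wastes capacity, consider what the receiver would have to assume. A minimal sketch (the function and thresholds are hypothetical, not from any standard): by the time a PAUSE reaches the sender, roughly a round trip's worth of frames can still arrive, and since the receiver cannot know their sizes it must budget for worst-case minimum-size frames.

```python
import math

def should_send_pause(free_buffers, link_km, min_frame_bytes=64,
                      rate_bps=10e9, c_fibre_km_s=203_000):
    """Hypothetical watermark check: pause pre-emptively if the frames that
    could still arrive (one round trip's worth, worst-case minimum-size
    frames) would overrun the remaining buffer space."""
    frame_km = (min_frame_bytes * 8 / rate_bps) * c_fibre_km_s
    # frames already on the wire + frames the sender keeps emitting while
    # the PAUSE frame travels back to it
    worst_case_arrivals = math.ceil(2 * link_km / frame_km)
    return free_buffers <= worst_case_arrivals

# On a 40 km link the worst case runs into thousands of frames, so buffers
# must be reserved (and will mostly sit idle) just to guarantee losslessness.
print(should_send_pause(free_buffers=4000, link_km=40))
```

This is exactly the trade-off described above: the receiver either risks discards or permanently sets aside resources it will almost never use.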
So before you start asking your vendor to support FCoE for replication purposes, be aware that the inherent limitations of FCoE basically prevent this. If you still do require FCoE, for reasons I still cannot comprehend, make sure you leave the replication part (synchronous or asynchronous) to FC or FCIP.
Erwin van Londen
I received some comments from some very respected engineers from Cisco (who else. :-)). Landon noted that my calculation on the 40KM distance was wrong, and indeed it was. The comment comes from Landon Curt Noll, a cryptologist and number-crunching mathematician, so I very well stand corrected. I did the calculation again and indeed I must have made a formula mistake in my spreadsheet. I now come to ~114 frames. (I bloody well hope this is correct. :-))
Secondly, Landon questioned my theory on PAUSE and UNPAUSE on DCB PFC enabled links. PFC is an extension defined in 802.1Qbb. The PAUSE and PAUSE-0 frames are sent to a special multicast address responsible for MAC layer functions. Now, technically there is no UNPAUSE packet. The “UNPAUSE” is a normal PAUSE packet with a zero time value for one or more lanes. DCB has an optional primitive called PFCLinkDelayAllowance which provides the opportunity to adjust buffer space on the receiving side based on link characteristics such as delay. On longer distances, methods like high watermarks (HWMs) may circumvent the problem, but that also means resources are likely under-utilised. The problem still exists, though, that the sizes of the frames fluctuate, which means the receiving side still does not know the size of the packets on the wire and thus the number of packets that are potentially underway. The only way to circumvent this is to cater for the worst case, which means that valuable resources on the receiving side might never be used.
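That there is no separate UNPAUSE packet is easy to see from the frame layout itself. A sketch of an 802.1Qbb PFC frame, assuming the standard MAC Control format (destination multicast 01-80-C2-00-00-01, EtherType 0x8808, opcode 0x0101, a priority-enable vector and eight per-priority timers in units of 512 bit times); the source MAC here is a placeholder and padding to minimum frame size is omitted:

```python
import struct

PFC_MCAST_DA = bytes.fromhex("0180c2000001")   # MAC-control multicast address
MAC_CONTROL_ETHERTYPE = 0x8808
PFC_OPCODE = 0x0101                            # Priority-based Flow Control

def build_pfc_frame(src_mac, timers):
    """Build an (unpadded) 802.1Qbb PFC frame. timers maps priority (0-7)
    to pause quanta in units of 512 bit times; a priority listed with
    quanta 0 is the 'UNPAUSE' for that lane, priorities not listed are
    left untouched."""
    # Priority-enable vector: bit p set if priority p's timer field is valid
    enable_vector = sum(1 << p for p in timers)
    quanta = [timers.get(p, 0) for p in range(8)]
    payload = struct.pack("!HH8H", PFC_OPCODE, enable_vector, *quanta)
    return PFC_MCAST_DA + src_mac + struct.pack("!H", MAC_CONTROL_ETHERTYPE) + payload

# Pause priority 3 (a typical FCoE priority) for the maximum time...
pause = build_pfc_frame(bytes(6), {3: 0xFFFF})
# ...and the "UNPAUSE" is the very same frame with a zero timer value:
unpause = build_pfc_frame(bytes(6), {3: 0})
```

Nothing in the frame distinguishes a PAUSE from an “UNPAUSE” except the timer values, which is exactly Landon's point.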
My view on FCoE has not changed. I’m fine with it from a technology perspective but keep it in a confined space like the UCS with zero-administration requirements from an operational standpoint.