Buffer Credits for Newbies

Pretty often I get into some sort of mode whereby I forget that there are many youngsters getting their feet wet in the storage arena, and that some things are hard to grasp. Traffic flow is one of them, so here we go.

As I described in my “Rotten Apple” series, the buffer-credit methodology is one where one side of a link gives the other side authorization to send a certain number of frames. The receiving side has reserved a guaranteed number of buffers to store the incoming frames. As soon as the receiving side has processed a frame, or has sent it further down the track, it sends a so-called “R_RDY” primitive. This special bit pattern is recognized by the other side, which then increases its credits again by one. I always like to bring in an analogy, and in this case I use a parking area whereby the entrance is the sending side and the exit is the receiving side. The parking lot has, let’s say, 50 spaces to store 50 cars. It doesn’t matter how big the cars are, as long as they are not larger than a predefined size. In our case Fibre Channel frames have a maximum size of 2148 bytes (a 2048-byte payload plus frame header, footer, CRC and maybe an extended header), so let’s compare this to a Cadillac Escalade (for non-US citizens, look here). In most cases, though, the payload is way smaller than this, so a parking space can also be occupied by a Suzuki Alto (for US citizens, look here). Even though the Alto is much smaller than the physical parking space, it still occupies an entire space.

As mentioned, the parking lot has 50 spaces (buffers) and it can receive cars during its operating hours. If the entrance has counted 50 cars, it cannot let the next car pass because there is no space. If a car leaves during the operating period, the exit tells the entrance (via an electronic message, i.e. R_RDY), so the entrance knows it can let another car pass beyond the original 50 it already had. This can go on and on during the operating period.
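To make this a bit more tangible for the programmers among you, here is a minimal sketch (my own illustration with made-up names, not any vendor’s implementation) of how the sender’s credit counter behaves:

```python
# Minimal sketch of buffer-to-buffer credit flow control.
# All names are illustrative; this is not any vendor's implementation.

class Sender:
    def __init__(self, bb_credit):
        # Credit granted by the receiver at login (the "50 parking spaces").
        self.credits = bb_credit

    def can_send(self):
        # A frame may only be transmitted while at least one credit remains.
        return self.credits > 0

    def send_frame(self):
        assert self.can_send(), "no credits left: sender must wait"
        self.credits -= 1  # one parking space is now occupied

    def on_r_rdy(self):
        # Receiver freed a buffer and returned an R_RDY primitive:
        # one more frame may be sent ("a car has left the car park").
        self.credits += 1


sender = Sender(bb_credit=50)
for _ in range(50):
    sender.send_frame()
print(sender.can_send())   # False: all 50 buffers are in use
sender.on_r_rdy()          # the exit reports that one car has left
print(sender.can_send())   # True: one credit available again
```

The key point: the sender never has to ask whether the receiver has room; the credit counter is the parking lot’s free-space sign.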

There are three ways this can go wrong:

  1. The exit is unable to tell the entrance that a car has left, or the message (R_RDY) gets lost (an encoding error or corruption of the R_RDY; see the sketch after this list).
  2. The number of cars stacking up at the entrance is far greater than the number of cars able to leave (a throughput/performance problem).
  3. The cars cannot exit the car park quickly enough because the exit ramp cannot offload them onto the road fast enough, which also causes cars to stack up at the entrance (latency).
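To see how the first failure mode plays out, here is a small sketch (illustrative only, building on the counter above) in which a tiny fraction of R_RDYs gets corrupted; every lost R_RDY permanently removes one credit until the sender stalls completely:

```python
import random

# Sketch of failure mode 1: a corrupted R_RDY is simply not recognized
# by the sender, so its credit counter leaks downward over time.
# Names and loss rate are illustrative only.

random.seed(42)
credits = 50            # credits granted at login
R_RDY_LOSS_RATE = 0.01  # 1% of R_RDYs corrupted (deliberately exaggerated)

for frame in range(100_000):
    if credits == 0:
        print(f"sender stalled after {frame} frames: all credits leaked away")
        break
    credits -= 1                         # frame sent, one buffer occupied
    if random.random() > R_RDY_LOSS_RATE:
        credits += 1                     # R_RDY arrived intact
    # else: R_RDY corrupted -> this credit is lost for good (no recovery here)
```

A 1% loss rate is far higher than anything you would see on a healthy link, but even very rare corruption will eventually drain the credits if there is no recovery mechanism, which is exactly what the next paragraph addresses.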

With option one, Fibre Channel has two ways to recover from this problem. One is a credit-recovery mechanism, which can also be applied to the car park: it basically lets the entrance and exit tell each other how many cars they have let pass. This creates a new baseline, so both sides know what’s going on and how many spots are available again. The other is a link reset, which restores the credit count to the value agreed at login.

With option two there is really only one method of solving this, and that is to increase the capacity of both the entrance and the exit in such a way that a more efficient flow can accommodate the number of cars going in and out.

With option three there is not really anything you can do from a car-park (or switch) perspective. The problem is external, and thus the issue needs to be addressed over there: either provide additional exits to different sections of the off-loading road, create different exits to other roads, or increase the capacity of the roads.

With the car-park analogy you will also see that if people need to wait in front of an entrance, they will pretty quickly give up and leave. With Fibre Channel this is the same: if a frame cannot be sent to a destination quickly enough, it will time out and the switch will discard the frame, thereby requiring the upper-level protocol (in most cases SCSI) to re-submit the IO request. (A bit like having the car-park manager ask the people to come back and re-join the queue.)
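Going back to option one for a moment, here is a rough sketch of that re-baselining idea (illustrative names, loosely modelled on FC’s credit-recovery mechanism, not a literal implementation): both sides keep running counts, periodically compare them, and restore any credits that went missing.

```python
# Sketch of credit recovery by re-baselining (illustrative, loosely
# modelled on Fibre Channel's credit-recovery mechanism).

class Link:
    def __init__(self, bb_credit):
        self.credits = bb_credit
        self.frames_sent = 0     # cars the entrance let in
        self.r_rdys_seen = 0     # "car left" messages actually received

    def send_frame(self):
        self.credits -= 1
        self.frames_sent += 1

    def r_rdy(self, lost=False):
        if not lost:
            self.credits += 1
            self.r_rdys_seen += 1
        # a lost R_RDY updates neither counter on the sender side

    def recover(self, receiver_freed_buffers):
        # The receiver reports how many buffers it actually freed
        # ("how many cars left through the exit"). Any difference with
        # the R_RDYs we saw is the number of credits that leaked away.
        leaked = receiver_freed_buffers - self.r_rdys_seen
        self.credits += leaked
        self.r_rdys_seen = receiver_freed_buffers


link = Link(bb_credit=50)
link.send_frame(); link.r_rdy(lost=True)   # R_RDY corrupted in transit
link.send_frame(); link.r_rdy()            # this one arrives fine
print(link.credits)                        # 49: one credit has leaked
link.recover(receiver_freed_buffers=2)     # re-baseline: receiver freed 2
print(link.credits)                        # 50: the leaked credit is restored
```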

In a well-designed FC fabric where initiators and targets behave and no errors are observed, it is very unlikely you will run into these kinds of situations.

“And how is this different with Ethernet, then?” you may ask. With Ethernet you do not have a buffer system as described above, and as such you have to view this through another analogy. It’s like moving water between two basins at different heights, where the upper basin opens a valve and lets water flow to the lower basin. Once the lower basin is full, it sends a messenger back to notify the upper basin operator to shut the valve (in Ethernet terms this is called a PAUSE). All water still in transit between the upper and lower basin will be discarded via an overflow at the lower basin and thus becomes useless. The same happens with Ethernet: all packets in transit and arriving at the destination port will be sent into la-la land. On busy networks this happens an awful lot, and when you transfer IP packets it is not something to be worried about; TCP/IP provides a bullet-proof method of recovering and has its own buffering mechanisms as well. If, however, you transport SCSI (like with FCoE), you are in for a big roller-coaster ride.
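For completeness, here is a minimal sketch of the basin behaviour (made-up numbers, not a model of any real NIC or switch): the receiver issues a PAUSE at a high-water mark, and whatever is still “in the pipe” beyond the remaining headroom overflows and is dropped.

```python
# Sketch of Ethernet PAUSE behaviour (illustrative numbers only):
# frames already in flight when the buffer fills are simply dropped.

BUFFER_SIZE = 100       # the lower basin's capacity (frames)
HIGH_WATER_MARK = 80    # send a PAUSE when the buffer reaches this level
IN_FLIGHT = 30          # frames still "in the pipe" when the PAUSE is sent

buffer_level = 0
paused = False
dropped = 0

for frame in range(200):
    if buffer_level == HIGH_WATER_MARK and not paused:
        paused = True               # messenger runs upstream: "shut the valve"
        # the PAUSE takes time to arrive, so IN_FLIGHT frames keep coming
        for _ in range(IN_FLIGHT):
            if buffer_level < BUFFER_SIZE:
                buffer_level += 1   # headroom above the HWM absorbs some
            else:
                dropped += 1        # overflow: the frame is discarded
        break
    buffer_level += 1

print(f"buffered={buffer_level}, dropped={dropped}")  # buffered=100, dropped=10
```

Note the headroom above the high-water mark: if it is sized for the worst-case amount of in-flight data, nothing needs to be dropped at all, which is exactly the point J Metz raises in the comments below.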

If you want to know more about buffer credits and Fibre Channel frame flow, check out my other articles on this topic as well as the articles on FCoE.

Hope this helps.

Constructive feedback is welcome in the comment section or in the survey >>here<<

Regards,

Erwin van Londen


5 responses on “Buffer Credits for Newbies”

  1. J Metz

    You raise some excellent points, as usual. Obviously, I can’t speak to all vendor implementations on the planet 🙂 because ultimately what it boils down to is whether or not vendors err on the side of caution. (Full disclosure to your readers: I am a PM for storage for Cisco’s Nexus switch product line.)

    The FCoE frame size is fixed as per the standard, so with respect to running a mix of larger and smaller frames on the same lossless Class of Service (COS), this would require the additional placement of lossless frames (RoCE?, iSCSI frames?) into that same COS. I know, though, that when FCoE is configured for lossless COS no other traffic type is allowed on that same COS.

    You raise an excellent point about other types of lossless frame sizes, however. I’ve always been highly suspicious about the efficacy of putting iSCSI into a lossless environment, but your point about different sizes is particularly apt if people start playing around with 9k sizes.

    Your point about needing to resend a PAUSE frame to prevent resumption is something that has always puzzled me; it’s actually built into the standard spec. I always thought it would be more appropriate to have an UNPAUSE frame sent back instead, but I’m sure that was discussed before I became involved in the standards body. Next time I go I’ll ask, and maybe I’ll have a logical reason to share. (Whether we agree it’s the correct thing to do or not is a different story 🙂)

    Realistically, however, the key factor here is one of distance, as often the HWM is hard-coded at the maximum allowable threshold for potential disruption; it’s not a dynamic setting. In other words, we take the worst-case scenario and use that conservative limit as the starting point.

    Having said that, of course, the behavior you describe – while certainly possible – doesn’t appear to be observed in practice. Do you have test results (or production environments) that have experienced this? I have to confess that over the past 6 years of doing FCoE (from pre-beta days) I’ve never actually seen this occur, personally. I would be very interested to see or hear anything about this in real environments.

    Best,
    J

    1. Erwin van Londen Post author

      Hi J,

      From what I know the MAXIMUM frame size is fixed. That does mean that if the payload is significantly less than the maximum net payload of 2048 bytes for FC (which is the case for around 98% of all frames) you could run into the issue I described.

      W.r.t. the UNPAUSE (or PAUSE_0), usage indeed depends on the implementation. If the “timed pause” is used, whereby the CEV (class enable vector) depicts the COS and the corresponding Pause_Time field regulates the pause period, you could create some granularity; however, this still needs to be tightly controlled. The problem with receiver-controlled flow management is always that you never know what’s coming. On short distances this is obviously far less likely to occur, and that may be the reason you haven’t seen it. It may become even more of a problem in environments where congestion comes into play (on a single COS), since that is when PFC starts kicking in.
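      To put some numbers on that pause period, here is a quick illustrative sketch of mine (the 512-bit-time quantum is per the spec; the rest is plain arithmetic). It shows why the same Pause_Time value means very different wall-clock pauses at different link speeds:

      ```python
      # Sketch: what a timed pause actually buys you (illustrative).
      # The Pause_Time field is expressed in quanta of 512 bit times, so
      # the real-world pause duration depends on the link speed.

      def pause_duration_us(pause_quanta: int, link_gbps: float) -> float:
          """Duration of a timed pause in microseconds."""
          bit_time_ns = 1.0 / link_gbps   # one bit time in nanoseconds
          return pause_quanta * 512 * bit_time_ns / 1000.0

      # Maximum Pause_Time value (0xFFFF) on common link speeds:
      for gbps in (1, 10, 40):
          print(f"{gbps:>2} Gbit/s: max pause = "
                f"{pause_duration_us(0xFFFF, gbps):8.1f} us")
      # A Pause_Time of 0 is the PAUSE_0 / "unpause" case: resume immediately.
      ```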

      I don’t have an environment at my disposal where I can create such scenarios, so I’m not able to provide you with an example.

      Thanks for posting your comments. Always great to discuss technicalities with people who know stuff.

      Regards,
      E

  2. J Metz

    Hey Erwin,

    Great analogy. One quick addendum regarding your water basin analogy for Ethernet. Your water metaphor is appropriate, because there is a “high water mark” in the ingress buffers on ASICs capable of lossless traffic. This is calculated based upon the maximum distance over which the PAUSE frame can be sent back to the source, so that all the water still in the pipe can be satisfactorily stored in memory without discarding frames.
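    To give a rough feel for the numbers (a back-of-the-envelope sketch, not the actual ASIC calculation): after a PAUSE is sent, data keeps arriving for roughly a full round trip, so the buffer needs at least that much headroom above the high water mark.

    ```python
    # Sketch: why the high water mark depends on distance (illustrative).
    # After the receiver sends a PAUSE, data keeps arriving for a full
    # round trip, so the buffer needs at least that much headroom.

    PROPAGATION_M_PER_US = 200.0  # light in fibre: roughly 200 m per microsecond

    def headroom_bytes(distance_m: float, link_gbps: float) -> float:
        """Bytes still arriving after a PAUSE is sent over a given distance."""
        rtt_us = 2 * distance_m / PROPAGATION_M_PER_US
        return rtt_us * link_gbps * 1000 / 8  # Gbit/s -> bytes per microsecond

    for km in (0.1, 1, 10, 50):
        kib = headroom_bytes(km * 1000, 10) / 1024
        print(f"{km:>5} km at 10 Gbit/s: ~{kib:8.1f} KiB headroom needed")
    ```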

    This is why distance becomes somewhat of an issue, as you know. At the risk of piggy-backing on your well-written article, your readers may find some additional information useful as well: http://blogs.cisco.com/datacenter/storage-distance-by-protocol-part-iv-fibre-channel-over-ethernet-fcoe/

    Again, really well done.

    J

    1. Erwin van Londen Post author

      Hi J, thanks for the compliment. As for piggy-backing, I think I’m going to do the same. 🙂 In your article you describe the receiving ASIC maintaining a High Water Mark (HWM) after which it will send a PAUSE. The problem remains that the receiving side has no clue about what’s coming with respect to frame size. It can be a full FCoE frame of 2180 bytes, but it can also be much smaller, which, evidently, decreases the effective utilization of the link significantly.

      This is also true for pretty spiky workloads. There is a chance that a PAUSE is sent but the TX side did not transfer anything right after the burst. This basically means that you have a TX side sitting idle, plus an empty link, for too long, whilst the RX side is still working on getting below the HWM before it will send an “UNPAUSE”. It then still requires the UNPAUSE to arrive at the TX side, plus the transfer delay of the long-distance link, before effective use of the link is re-established. As mentioned, if this is a spiky workload you might very well run into performance issues; I dare say this is inevitable.

      The second issue this will incur is that both initiators and targets are designed and built with fairly tight timings in which they need to process IOs, whether this is on the array with regard to onload/offload to cache and disk, or at the server side in the OS stack. If HWMs are reached in the middle of an FC exchange, and FC sequences are on hold for too long because of the above-mentioned problem, you will see significant delays, plus secondary timers like E_D_TOV or hold time being reached. This then has the ramification that entire IOs will need to be re-transmitted.
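      To give a feel for the orders of magnitude involved, here is a rough sketch of mine; the timer values are assumptions for illustration only, so check your platform’s actual settings:

      ```python
      # Sketch (illustrative numbers): how repeated timed pauses eat into
      # the protocol timers mentioned above. Timer values are assumptions
      # for illustration; real platforms will differ.

      MAX_PAUSE_QUANTA = 0xFFFF  # largest Pause_Time value
      LINK_GBPS = 10
      HOLD_TIME_MS = 500.0       # assumed switch frame hold time (illustrative)
      E_D_TOV_MS = 2000.0        # common E_D_TOV default of 2 seconds

      pause_ms = MAX_PAUSE_QUANTA * 512 / (LINK_GBPS * 1e9) * 1e3  # ~3.36 ms

      print(f"one max-length pause at {LINK_GBPS} Gbit/s: {pause_ms:.2f} ms")
      print(f"back-to-back pauses to hit hold time: {HOLD_TIME_MS / pause_ms:.0f}")
      print(f"back-to-back pauses to hit E_D_TOV:   {E_D_TOV_MS / pause_ms:.0f}")
      ```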

      I guess there is no golden answer to every problem, and each solution should be carefully weighed for pros and cons.

      Cheers buddy.
      Hope to see you soon Down Under. 🙂