Images tagged "electricity-pylon"


  1. Yury Yavorsky

    The first step is done. They will eliminate the current education/certification program. No ILT, no labs, no serious certification… just WBT with simple tests. Someone might say that’s my Brocade instructor point of view, but I think it’s Broadcom’s lack of expertise in end-user products. They simply don’t know how to sell such complicated products, teach people to manage them, or support them.

    • I’m afraid you may have hit the nail on the head. I certainly hope not, but we’ll just have to wait and see. Broadcom simply has a history of engineering OEM relationships and has never seen a customer environment before.

  2. nobody

    I like your style of writing and summary of FOS from 8.0.0 to 8.0.1!

    I’d enjoy reading an update of FOS 8.0.1 to 8.1, now that it is out.

  3. Lucas

    What about this latest news… ARRIS to Acquire Brocade’s Ruckus Wireless and ICX Business
    I prepared for Brocade certs… but now it’s wasted money…

    • Hi Lucas,

      Indeed. You can only hope that Arris takes over any qualification and certification you may already have obtained, but I do question the current market value, as Arris certifications seem to be non-existent. This would indeed mean that any certification you may hold on the Ruckus/ICX side of Brocade’s business is void and yes, most likely has been a waste of time and money.

  4. Moon

    I lost access to the Config section I paid for yesterday, right after clicking a link to a topic in the Troubleshooting section and completing a free registration.

    It is ugly!!
    (trying to post from a browser window that has been open for a long time)

  5. Good review/look at it, assuming the IP business stays whole. Great point on the certs :-/

  6. Paddy

    Great info Erwin. Cheers, Paddy

  7. anonymous

    “I wouldn’t be surprised if an announcement is made by Broadcom and Qlogic for a similar merger. With the acquisition of Brocade they not only leveled the playing field but leapfrogged Avago here by several miles.”

    This seems a rather strange statement since Avago purchased Broadcom last year so they are one and the same company.

    • Indeed, but the problem is they are still different entities. Engineering efforts are not always shared. We’ll see how it pans out. If Broadcom also decides to buy Qlogic because of the Brocade technology (Brocade sold the HBA business to Qlogic) there will be no more competition left in that area. The good thing is that FC HBAs can then finally be seen as a mass commodity and prices can drop sharply, as they did with Ethernet NICs before. The HBA ASICs can then simply be OEM’ed to whoever wants to bring them to market, including all the current networking and storage vendors. An exciting time to see how this goes in the near future.

  8. Pingback: tim_txcrd_z What is it? What does it represent? | Storage & Beyond

  10. equathza

    Thanks Erwin, these stale entries indeed prevented me from performing an upgrade.

  11. marksman

    Can old fashioned zoning and TDZ/PZ coexist in the same fabric?

    Thanks.

  12. jaymike

    Good stuff, Erwin.

    Can you elaborate on this section a bit?

    “…simply hooking these [4G/8G devices] up to a new 16G or 32G switch will most certainly provide you headaches with performance (or rather the lack thereof) and wide-spread fabric congestion and latency issues. At some stage, sooner rather than later, it will cause significant problems.”

    Is the implication that if you mix these in with newer 16G/32G N_Port devices that you’ll run into trouble? Or is this more a statement about possible issues with cable type/quality? Just looking for a little clarification.

    I used to support fabrics from back in the 1G days until just after the 16G stuff came out, but have been out of the game for awhile. I still work on fibre target devices though, so I like to try to keep up. Thanks.

    • Hi Jay,

      The problem is twofold. First, the change in physical cabling requirements. 16G and 32G speeds simply require OM3 or, better, OM4 cabling. No questions asked. If you want to drive a Ferrari at 250mph you need a good road. The same goes for these high-speed links.

      Secondly, credit back-pressure will almost certainly occur when 4G devices talk to each other across ISLs or switch back-end links that also need to carry frames from brand-new kit. You will run into scenarios like the one I described here http://44.230.142.87/2015/04/cross-fabric-collateral-damage/

      The basic rule of thumb I always use is to have a maximum of 1 speed generation difference between any pieces of equipment. Remember that speed increases are exponential in the FC world, and so will be the number of performance problems if you mix old and new equipment.

      Hope this helps.

      Cheers,
      Erwin
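The one-generation rule of thumb above can be sketched as a quick check. This is only an illustration: the function names and the list of speed generations are assumptions made for the example, not part of any Brocade tool.

```python
# Fibre Channel link speeds in Gbit/s, one entry per speed generation
# (assumed list for illustration).
FC_GENERATIONS = [1, 2, 4, 8, 16, 32]


def generation_gap(speeds_gbit):
    """Return how many speed generations separate the slowest and fastest device."""
    indexes = [FC_GENERATIONS.index(speed) for speed in speeds_gbit]
    return max(indexes) - min(indexes)


def within_rule_of_thumb(speeds_gbit, max_gap=1):
    """True when no two devices are more than max_gap generations apart."""
    return generation_gap(speeds_gbit) <= max_gap


for fabric in ([8, 16], [4, 8, 16], [4, 32]):
    print(fabric, "gap:", generation_gap(fabric), "ok:", within_rule_of_thumb(fabric))
```

A mix of 8G and 16G devices passes the check, while 4G devices next to 16G or 32G kit do not.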

  13. Pingback: Brocade FOS version 8 and 32G hardware | Storage & Beyond

  14. At large scale, I would hope that outside orchestration would be solving the complexity problem.

  15. actually I have several customers that are pretty excited about it.

  16. pvwoude

    Erwin,

    Does the entire fabric have to be running FOS 7.4.x before you can use TDZ? I have some old Brocade 4020s in IBM blade chassis that are stuck at 6.2.2.

    Peter

  17. Tuco Salamanca

    Very, very useful article. I have never found a Brocade document where the initialization of a port was explained better. Thanks a lot.

  18. F1 has been lost at sea 4 years. It’d be a better sport if they raced Hyundai Getz around the Woolies car park with big spoilers 🙂

  19. band20

    Excerpt from the Brocade SAN Fabric Resiliency best practice: “Once set by the configure command, the EHT value will be maintained across firmware upgrades, power cycles and HA Fail-Over operations. This is true for all versions of FOS.”
    So the example you described above, “Now what happens if the storage admin decides to upgrade the core-switches to FOS 7.1.0x.??? The EHT will drop right away to 220ms since that’s the new default on that FOS version.”, will never occur according to Brocade.

    But I see a different challenge, which I have faced. People who installed new hardware in their infrastructure were not aware of EHT and used the default 220ms (partly on the core switches and partly on the edge), and they still have a lot of old switches with EHT set to 500ms at the edge. As far as I understand we cannot change EHT on the fly, only by disabling the switch. So what should these guys do now?!

  20. band20

    Hi Erwin. And one more question: do you know if GoldenEye2 supports Edge Hold Time?

  21. Felix Benadis

    Good info. Erwin. Thanks !
    But the issue does not stop at this (APM) level. If no timebase is configured for some MAPS rule(s), the upgrade fails as well.
    Example error during upgrade from 7.3.x to 7.4.x:
    =======================================
    The following item(s) need to be addressed before downloading the specified firmware:
    Invalid MAPS rule(s) in FID 128:
    fw_cust_SWITCHDID_CHG_AH_0
    fw_cust_SWITCHEPORT_DOWN_AH_0
    fw_cust_SWITCHFAB_CFG_AH_0
    fw_cust_SWITCHFAB_SEG_AH_0
    fw_cust_SWITCHFLOGI_AH_10
    fw_cust_SWITCHZONE_CHG_AH_10
    There’s no timebase for the above MAPS rule(s). Please use “mapsrule --config -timebase ” command to correct the same.
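With many rules flagged, it can help to pull the rule names out of the pre-upgrade message before fixing them one by one. A minimal sketch, assuming the message layout is exactly as pasted above; the function name is hypothetical and the script only lists the names, it does not run any switch command.

```python
def invalid_maps_rules(precheck_output):
    """Collect the rule names listed between the 'Invalid MAPS rule(s)' header
    and the closing 'no timebase' hint (layout assumed from the sample above)."""
    rules = []
    collecting = False
    for raw in precheck_output.splitlines():
        line = raw.strip()
        if line.startswith("Invalid MAPS rule(s) in FID"):
            collecting = True
            continue
        if "no timebase for the above MAPS rule(s)" in line:
            collecting = False
            continue
        if collecting and line:
            rules.append(line)
    return rules


SAMPLE = """
The following item(s) need to be addressed before downloading the specified firmware:
Invalid MAPS rule(s) in FID 128:
fw_cust_SWITCHDID_CHG_AH_0
fw_cust_SWITCHEPORT_DOWN_AH_0
fw_cust_SWITCHFAB_CFG_AH_0
fw_cust_SWITCHFAB_SEG_AH_0
fw_cust_SWITCHFLOGI_AH_10
fw_cust_SWITCHZONE_CHG_AH_10
There's no timebase for the above MAPS rule(s).
"""

for name in invalid_maps_rules(SAMPLE):
    print(name)
```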

  22. Sebastien Herrmann

    I do agree with Erwin on that. It is better to focus on correct routing configuration than to think that Traffic Isolation will really protect our bandwidth better.
    We have been using SAN routing since 2008 to interconnect our datacenters through MAN links, but we wanted to segregate the tape and disk traffic onto different paths.
    We started using Zone Traffic Isolation for EVA replication, and we pushed Brocade to accept Fabric Level Traffic Isolation for tape traffic, as we had too many sources and targets (LAN-free agents to tape devices) to do that zone by zone (we even forced them to write it down in the Admin guide).
    Now that we are moving to the 7840 with a mix of FCIP and IP extension, we will drop this very complex feature, which was better for the sake of management than for real protection of our prod environment.

  23. Stephen Bracken

    I am currently using peer zoning in some of my development environment fabrics and it is a really great addition and one that should have been implemented a long time ago.
    The one major problem with peer zoning in its current format, or at least in Brocade’s implementation of it, is that it will not allow the use of aliases, which, I’m sure you’ll agree, increases the administrative burden it was meant to ease and reduces user friendliness.
    I can only hope this gets resolved in the very near future before I will consider the migration of zones from traditional to peer in my production environment.

    • Hi Stephen,

      I agree that the use of aliases would have been a nice addition. You have to remember, though, that peer zoning simply piggy-backs on TDZ, which is the underlying idea. You don’t need any administration for that on the switches, as it is the arrays that will add and remove zone members as needed. Peer zoning is in my view a simple intermediate step until all the array vendors start supporting TDZ.

    • Hello Stephen,

      A short followup on your comment above. As you may have seen Brocade added the alias naming convention to the peer zoning functionality in FOS 8.1.0. You just have to wait a short time now till the OEM’s have qualified this version.

      Hope this helps.

      Cheers
      Erwin

  24. Alexey Farenyuk

    Hi Erwin. Some of our switches show value of “0” – switch.edgeHoldTime:0. what does it mean? they are models 5100 FOS 7.1.1c and 5470 FOS 6.4.2b4.

    • That simply means they use the default of whatever FOS code is running at that time. There have been some changes in that value over time. It’s best to set it manually. Be aware that on pre-Condor3 ASICs the value is set per ASIC and not per port.

  25. svwsvwsvw

    How many frames exactly in the buffer that the switch can hold?

    • I must say you have a peculiar name.

      My name is Erwin, I think I’m a nice guy and would like to speak on a first-name basis, so if you would provide your name these kinds of conversations progress much better.

      First of all, your question is incomplete in two respects. You did not mention which switch, and secondly it is not the switch that has the buffers but each ASIC in a switch or blade. This then determines the number of buffers in a switch, but these are not shared in a single large pool; they are allocated more per port group. As your question is unclear I’m unable to answer it.

      Hope this helps.

      Regards
      Erwin

    • svwsvwsvw

      Hi Erwin

      Thank you for taking the time to reply to my question. You may refer to me as Shafiee. Sorry for the silly question; I’m an amateur in SAN with only 2+ years of experience. Previously I was a Unix guy 🙂

      In our current environment we have HP Brocade B-Series:

      0. We have DCX, DCX 8510-8 & 5300.
      1. Currently running on FOS v7.1.1a.
      2. We have 4 Virtual fabric in total & are separated:
      2.1 VF Trunking & ICL
      2.2 VF External Storage – XP having external storage from EVA
      2.3 VF Production – Server & Storage
      2.4 VF 128 – Default
      3.Fabric Watch enabled.

      So each front end & back-end ASIC has buffer credits, if so how much? Just curious to know.

      Recently we had a performance issue, so I came across your post “One rotten apple spoils the bunch” Pt 1 – 4 and have been reading it for a couple of days. FYI, the issue is still ongoing at the moment.

      From my observation:
      1. The disc c3 value kept increasing on the affected port after running portstatsclear and then porterrshow on the port in question.
      2. From BNA I can see there’s a perf bottleneck congestion alert.
      3. When running errdump -r on the switch, I came across these two important messages.

      2015/11/24-13:39:28, [C2-1014], 646086, SLOT 7 | CHASSIS, WARNING, XXXXCORE001, Link Reset on Port S8,P-1(57) vc_no=15 crd(s)lost=3 auto trigger.
      2015/11/24-13:39:28, [C2-1012], 646085, SLOT 7 | CHASSIS, WARNING, XXXXCORE001, S8,P-1(57): Link Timeout on internal port ftx=336197441 tov=2000 (>1000) vc_no=15 crd(s)lost=3 complete_loss:1.

      BR

      /Shafiee Van Weiringen
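To see how often credits are being lost and on which virtual channel, the two errdump messages above can be picked apart with a few lines of script. This is a sketch only: the field order is inferred from the two sample lines in the comment, and the function name is made up for the example.

```python
import re

# Matches the "vc_no=15 crd(s)lost=3" fragment seen in both sample messages.
CREDIT_RE = re.compile(r"vc_no=(\d+)\s+crd\(s\)lost=(\d+)")


def parse_credit_loss(line):
    """Return a dict for an errdump line reporting lost credits, else None.

    Assumed field order (from the samples): timestamp, [message ID],
    sequence number, slot/chassis, severity, switch name, message text.
    """
    parts = line.strip().split(", ", 6)
    if len(parts) < 7:
        return None
    timestamp, msg_id, _seq, _slot, _severity, switch, message = parts
    match = CREDIT_RE.search(message)
    if not match:
        return None
    return {
        "timestamp": timestamp,
        "msg_id": msg_id.strip("[]"),
        "switch": switch,
        "vc": int(match.group(1)),
        "credits_lost": int(match.group(2)),
    }


SAMPLE_LINES = [
    "2015/11/24-13:39:28, [C2-1014], 646086, SLOT 7 | CHASSIS, WARNING, XXXXCORE001, "
    "Link Reset on Port S8,P-1(57) vc_no=15 crd(s)lost=3 auto trigger.",
    "2015/11/24-13:39:28, [C2-1012], 646085, SLOT 7 | CHASSIS, WARNING, XXXXCORE001, "
    "S8,P-1(57): Link Timeout on internal port ftx=336197441 tov=2000 (>1000) "
    "vc_no=15 crd(s)lost=3 complete_loss:1.",
]

for entry in SAMPLE_LINES:
    print(parse_credit_loss(entry))
```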

  26. Bacil

    Hello,

    This may be slightly off topic. In another case the extended distance trunks (involving DWDM) were formed successfully. However, we have noticed that traffic uses almost only one link of the trunk instead of being distributed dynamically (and in order) between the links (2 extended ISL links). We are sure, though, that it will redirect in case the first one fails, but why is it not distributed in the first place… Please advise… Thanks, Bacil.

    • Hi Bacil,

      As you are aware, traffic is dispersed over ISLs when exchange-based routing is enabled. When an ISL consists of a trunk, FSPF sees it as one link, so from an official Fibre Channel standards point of view there is not much that can be done and it is up to the implementation of the vendor.

      As for your question: as of FOS 6.2.(something) Brocade changed the switching algorithm in the ASIC driver so that a threshold mechanism was introduced for load balancing. The challenge has always been to keep frames in order for each FC exchange. If you just have a single link (non-trunked ISL) there is not really a problem, as FSPF will take care of that because the entire exchange is mapped onto a single ISL. If you have two or more physical links in a single ISL you need to prevent out-of-order frames as much as you can. The easiest way to do this is to keep frames on the same physical link, as it is then impossible to get frames out of order. At 1, 2 and 4G speeds each individual frame was long enough for the receiving side to keep things in order, as frames would not overlap if the cable length difference was short enough. At 8G and now 16G this may not be 100% the case, and preventative measures were taken by the Brocade FOS engineering team.

      The driver now maps all frames onto the same physical link until a certain utilisation, in the form of bandwidth or latency, has been reached and then starts using the other links in that trunk on a per-exchange basis. This way the chance of getting out-of-order frames in the same exchange is next to nothing.

      I got this question a lot in the past, and it is just a cosmetic thing when you see one link utilised. In reality there is no difference in the total amount of data being pushed through the entire trunk. Utilisation should be looked at for the whole trunk and not the individual members.

      Hope this explains it a bit.

      Regards,
      Erwin

  27. jaymike6

    Great post. I supported fabrics for Sun/Oracle for 12+ years and remember learning about FDMI so long ago and being excited about the possibilities it could bring to administration and troubleshooting. Alas, it never seemed to really catch on and I’m pretty sure Q still does not include it in their Solaris driver and Emulex’s implementation (for Solaris) just gives some of the basic static info.

    • Indeed, so much more administrative and troubleshooting information could be used from just having this feature enabled. It is indeed a shame that Oracle does not instruct Emulex and Qlogic to include the options in their drivers. The code is all on a FC3 level which both vendors already have on numerous platforms so I’m quite curious why Solaris is being kept out the zone…

  28. Bacil

    Hi, we want to use the MAPS features in the new FOS. However, a few of our switches still run FOS below 7.2 (actually v6.4.XxX), and due to hardware limitations on those switches (48Ks) we can’t load a newer FOS for MAPS, and we still have to live with them for some more time. My question is: can we have, in the same fabric, MAPS on most switches while two or three switches still run Fabric Watch? Okay, Network Advisor will complain about them during conversion, but is there any other drawback apart from that? Since MAPS and FW are configured on a per-switch basis, at most we will be unable to manage them with MAPS in Network Advisor. Apart from that, what other issues do you see in this scenario…
    I also found this –
    “Fabric capability is based on the least capable switch participating in the fabric. If a fabric has products participating that are operating with an older version of Brocade FOS, the limits of the fabric must not exceed the maximum limits of that older version of Brocade FOS.”
    But then again, I thought it will not be managed by MAPS, that’s it. We will manage it via CLI and Fabric Watch alerts anyway we will receive via email alerting set on those individual switches.
    So, what are your thoughts on “mixed” non-supported environment like this ?

    Thanks,
    Bacil.

    • Hello Bacil,

      If I were you I would shift my focus to replacing these 48Ks with newer equipment, as they are simply no longer supported as of January, which leaves you no option to open any support cases with any of the OEM vendors or Brocade.

      As for the use of FW and MAPS, you basically cannot use any fabric-wide feature of MAPS or FW. You can obviously still use the features and functions that monitor and action an individual port or switch. There is no restriction on using these in a mixed environment.

      Once again, please get rid of these 48Ks. Especially when they are connected in a mixed fabric, you basically render your entire environment unsupported. You certainly don’t want that. Believe me, I have experience with that: a customer even had 24K and 12K directors installed not too long ago. His environment went down hard without us being able to do anything about it. This became a very costly exercise, as all equipment had to be replaced in addition to numerous down-time occurrences.

      Hope this answers your question.
      Erwin

  29. Bacil

    Hi, we want to use the MAPS features in the new FOS. However, a few of our switches still run FOS below 7.2, and due to hardware limitations on those switches we can’t load a newer FOS for MAPS, and we still have to live with them for some more time. My question is: can we have, in the same fabric, MAPS on most switches while two or three switches still run Fabric Watch? Okay, Network Advisor will complain about them during conversion, but is there any other drawback apart from that? Since MAPS and FW are configured on a per-switch basis, at most we will be unable to manage them with MAPS in Network Advisor. Apart from that, what other issues do you see in this scenario…
    I also found this –
    “Fabric capability is based on the least capable switch participating in the fabric. If a fabric has products participating that are operating with an older version of Brocade FOS, the limits of the fabric must not exceed the maximum limits of that older version of Brocade FOS.”
    But then again, I thought it will not be managed by MAPS, that’s it. We will manage it via CLI and Fabric Watch alerts anyway we will receive via email alerting set on those individual switches.

    Thanks,
    Bacil.

  30. Spencer

    Good post, but with one erroneous entry. The “firmwarecleaninstall” will not let you update from “stoneage” FOS to 7.4. From the command reference guide:

    “Use this command to initiate a clean reinstall of the firmware in cases where the loaded firmware does

    not function correctly, the normal firmware download fails, or to recover from a rolling reboot situation.”

    It is only used in case your 7.4 firmware is all messed up; not to update lower level FOS versions.

    • Hello Spencer,

      You are correct. I thought I was clear in my description but apparently there is room for misinterpretation. Obviously when a feature is introduced in a certain revision I assumed it was clear that this is not available in preceding versions.

      So to be clear a switch has to run at least FOS 7.4 before being able to use this option. Future releases will then be able to utilise this in case things have gone horribly wrong for whatever reason.

      I still have an RFE out to be able to boot a switch over tftp right out of the bootprom and recover from there. Haven’t heard about it for a long time. Will send them a note around this..

      Thanks.
      Erwin

  31. Yury Yavorsky

    Suppose that there is need for checking HBA queue depth

    • Hi Yury,

      That may well be the case. We hadn’t been given much info on this system when I wrote this article. max_execution_throttle was not set, so I don’t know what the default is on this system.

  32. Yury Yavorsky

    Enhanced Group Management license is needed for some BNA management tasks. It allows you to perform operations on a group of devices. For example, recently I was able to upgrade firmware on all switches in the fabric at once except one with the lack of the EGM license. I suppose that massive config upload and download requires the EGM as well.

    • Hi Yury,

      I’m sure you’re right and there may be other tasks/features that require this license; however, I haven’t been able to track down a detailed list. Anyway, it is included on all 16G platforms as well as on all director-class switches from FOS 7.x and above. Most of the time the Enterprise bundle is purchased anyway, which also includes this, so in the end most of us will not have any issues with that.

      Thanks
      Erwin

  33. Yury Yavorsky

    Hi Erwin! Two questions:

    1) Why don’t u multiply 346392029 by 4 as well? This counter is for 4-byte words.

    2) Why u multiply 55*4294967295 by 4? I think u shouldn’t as these values are for frames
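The arithmetic being questioned can be made explicit with a small sketch. It rests on assumptions not confirmed in this thread: that 55 is the number of times a 32-bit counter wrapped, that 346392029 is the remaining count, and that both are in 4-byte words. The function name and the `wrap_size` parameter are hypothetical; the article multiplies by 4294967295, which differs from 2^32 by one count per wrap.

```python
WORD_BYTES = 4  # assumed: the counter counts 4-byte words


def words_to_bytes(wraps, remainder_words, wrap_size=2**32):
    """Convert a wrapped word counter into bytes.

    wraps           -- how many times the 32-bit counter rolled over (assumed meaning)
    remainder_words -- words counted since the last rollover
    wrap_size       -- counts per rollover; 2**32 for a full 32-bit counter
    """
    total_words = wraps * wrap_size + remainder_words
    return total_words * WORD_BYTES


# Under these assumptions both terms are multiplied by 4, because both count words.
print(words_to_bytes(55, 346392029))
```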

  34. AlexeiStepanov

    Thanks Erwin, a good read, as always. However, this is not the only problem. See DEFECT000491910. We suspect that we’ve hit this one recently.

    • Hello Alexei,

      This may indeed be a problem. As always, maintaining code-levels can prevent these issues.

      We’ve qualified 7.3.1a for a while (as of this writing) and this particular defect is closed with code-change on that level.

      Thanks for your comments. Highly appreciated as always.

      Regards,
      Erwin.

  35. Igal

    hello Erwin,
    I have downloaded some supportsaves (from two Brocade switches), in order to troubleshoot a LUN disconnect.
    Do you know if there is a tool that makes sense of the whole list of logs in it?
    Basically I am looking for the log showing any port status change within the last week.
    Which log should I look in?

    • Hello Igal,

      I wish there was. The tools that are there are a collection of scripts and tools but these are all Brocade or OEM restricted. The log you’re after is displayed on the screen if you type in “fabriclog” in the CLI. This provides you with a list of all port-state-machine changes and is basically a circular log which means it has a finite number of entries (that depends on type of the switch and the FOS code revision) but in general contains enough entries to go back a fair bit in time. That is unless you have a very active fabric where lots of hosts and link issues are ongoing.

      Hope this helps.

      Regards,
      Erwin

      • Igal

        yes, that helped, thanks!
        Unfortunately, the log was full –
        Number of entries: 1024
        Max number of entries: 1024
        so I am not sure I see the latest entries (the last is from Dec 2014):
        Switch 0; Wed Dec 10 08:42:19 2014 GMT
        08:42:19.745618 SCN Port Offline;g=0x17a D2,P0 D2,P0 21 NA
        08:42:19.745636 *Removing all nodes from port D2,P0 D2,P0 21 NA
        08:42:19.813833 SCN Port Offline;g=0x17c D2,P0 D2,P0 20 NA

        etc…
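Once the fabriclog output is saved to a text file, the port state changes can be filtered out with a short script. A minimal sketch: the column layout (timestamp, event text, two slot/port fields, port number, one trailing field) is inferred only from the sample lines above, and the function name is invented for the example.

```python
import re

# "Switch 0; Wed Dec 10 08:42:19 2014 GMT" style date header.
HEADER_RE = re.compile(r"^Switch\s+\d+;\s+(.+)$")
# Timestamp, event text, then four trailing columns; the third is the port number.
EVENT_RE = re.compile(
    r"^(\d{2}:\d{2}:\d{2}\.\d+)\s+(.*?)\s+(\S+)\s+(\S+)\s+(\d+)\s+(\S+)$"
)


def port_state_changes(fabriclog_text):
    """Return the 'SCN Port ...' events with the date header they appear under."""
    events = []
    current_date = None
    for raw in fabriclog_text.splitlines():
        line = raw.strip()
        header = HEADER_RE.match(line)
        if header:
            current_date = header.group(1)
            continue
        event = EVENT_RE.match(line)
        if event and event.group(2).startswith("SCN Port"):
            events.append({
                "date": current_date,
                "time": event.group(1),
                "event": event.group(2),
                "port": int(event.group(5)),
            })
    return events


SAMPLE = """
Switch 0; Wed Dec 10 08:42:19 2014 GMT
08:42:19.745618 SCN Port Offline;g=0x17a D2,P0 D2,P0 21 NA
08:42:19.745636 *Removing all nodes from port D2,P0 D2,P0 21 NA
08:42:19.813833 SCN Port Offline;g=0x17c D2,P0 D2,P0 20 NA
"""

for change in port_state_changes(SAMPLE):
    print(change)
```

Because the log is circular with a fixed number of entries, the date header shows how far back the retained entries actually reach.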

  36. Raef Mansour

    7.4 looks like a much improved version.
    Not yet tested…I will add comments when done.
    thanks for the post Erwin.

  37. Tony Chubb

    I’m seeing mass errors across my Brocade encryption switches. Currently all ports are set to fillword 0 (idle/idle). The problem we are seeing is that when a cfgenable is run, some of the ESX hosts see their storage connections become unstable. To recover, the port is disabled and the device recovers on its other connections to the SAN. Currently the storage is all HDS (VSP and HUS); the major issue appears to be with the HUS, as connections to the VSP stay alive.

    We are planning to change the HDS devices to portcfgfillword 2 and the ESX hosts to 3. Is this a red herring or are we following the right path? Any advice appreciated.

    Thanks

    • Hi Tony,

      In order to rule out incorrect fillwords having an effect you should set them according to manufacturers guidelines. You didn’t really specify WHICH errors you see as a bad-ordered-set is not necessarily a hard error but it can have a negative effect which then results in these hard errors.

      Let me know how it goes.

      Regards,
      Erwin

      • Tony Chubb

        hey Erwin the only error we are seeing is “er_bad_os 730762736 Invalid ordered set” the stat was cleared on Thursday last week and is happening on 20 ports on the switch and seem to be about 1 billion per port/day. We are working to fix them and should have them done by the end of the week.

        • evlonden

          Yes, that is the counter you would expect to ramp up when an invalid fill word is set. If you set the correct fill word as I described, these will stop accumulating.

  38. AlexeiStepanov

    Thank you for such a good summary of all the development in this area in the last 3-4 years!
    The only thing that is missing is … just a little bit of spell-checking 🙂

  39. saiyed

    NPIV with Brocade 6240 and FlexFabric not working

    Guys, I have to migrate from a B5300 to a 6240 switch, and when a working NPIV port was moved to the 6240 I am not getting any WWNs logged in behind the FlexFabric.

    1. Port itself is online and logged in

    2. NPIV is on, NPIV PP Limit =126, NPIV FLOGI Logout ..

    3. FOS 7.3.1 on new switch; fill word is not required with this version of FOS and is deprecated.

    4.portFlags: 0x24b03 PRESENT ACTIVE F_PORT G_PORT U_PORT LOGICAL_ONLINE LOGIN NOELP LED ACCEPT FLOGI ( Notice missing NPIV)

    5. NPIV is ON on the port.

    I think the Flex Fabric is configured to manual log in redistribution.
    Please help.

    • evlonden

      Hello Saiyed,

      Can you clear the portlog and bounce the port once (portdisable/portenable) and show the portlogdump for that port. This might provide some clues. Be aware this might not provide all details for that login process but it would give me something to look at.

      Secondly, be aware that if a base device logs out the NPIV address will also be removed. You can change this behaviour with the portcfgflogilogout --enable -all command. I think that is your best bet for now.

      If that doesn’t work I would really need to look at the portlogdump or, if even that doesn’t provide a clue, a FC trace. Your support-provider may be able to help you with that.

      Regards,
      Erwin

      • saiyed

        I have reconnected the NPIV node and removed it a couple of times; right now it’s not on this fabric. The log is from during this activity.

        So you are saying enable the setting below.

        NPIV PP Limit: 126

        NPIV FLOGI Logout: OFF ( Turn this on)

        • Hello Saiyed,

          Did you check with HP whether anything needs to be done here? From a Brocade perspective it seems to be OK and there is not much you can do. If an HBA (or in your case the HP FlexFabric switch) logs into the fabric it needs to set word 1, bit 31 of the common service parameters in the FLOGI request to 1. That triggers the switch port to allow FDISC ELS requests in order to address multiple ALPA’s behind the port-ID.

          This may sound a bit technical and unfortunately there is no way to see this in a switch log and you would need a FC analyser to see if this bit is turned on during the FLOGI request from the HP FlexFabric. The picture below shows the bit that needs to be 1 if NPIV is used. By default it is off.
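If a FC analyser trace is available, the check described above comes down to testing a single bit. A minimal sketch, assuming the common service parameters are available as raw bytes, that words are numbered from 0 and are big-endian, and that bit 31 is the most significant bit of the word. The function name is hypothetical.

```python
def npiv_bit_set(common_service_params: bytes) -> bool:
    """Return True when bit 31 of word 1 is set.

    Assumptions: words are 4 bytes, numbered from 0, big-endian,
    and bit 31 is the most significant bit of the word.
    """
    if len(common_service_params) < 8:
        raise ValueError("need at least two 4-byte words")
    word1 = int.from_bytes(common_service_params[4:8], "big")
    return bool((word1 >> 31) & 1)


# Hand-built example payloads, not captured from a real FLOGI.
with_npiv = bytes(4) + bytes([0x80, 0x00, 0x00, 0x00]) + bytes(8)
without_npiv = bytes(16)

print(npiv_bit_set(with_npiv), npiv_bit_set(without_npiv))
```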

  40. AlexeiStepanov

    however, it should be turned off by default. but if it’s there, you can always start using it!

  41. Stefan Beutler

    very good document, solved my issue with Storage Provision, where only SMI-S agent of BNA used for software to switch communication and JVM restarted all times, increasing heap size to 6GB stabilized it

  42. saiyed

    hi all, i have FOS 7.3.0c with brocade 6520 and i have tried all the mentioned settings & i am still having java issue, i have tried, JAVA 7.60, 7.71, 7.67 & 8.25 with errors such as “bad request details” i have opened a case with Brocade.
    More specific error below.
    Missing Application-Name manifest attribute for: http://192.xx.x3.xx/wt-app.jar
    log4j:WARN No appenders could be found for logger (com.brocade.wtcommon.access.AccessController).
    log4j:WARN Please initialize the log4j system properly.

    • Hello Saiyed,

      According to the release notes the webtools app on FOS 7.3.0 should be compatible with JRE 1.7.0 U60. Have you tried to clear the JRE cache? I’ve seen these sorts of incompatibilities pretty often, and in many cases being on the correct JRE level with the correct settings and a clear cache resolves the issue.

      I’m curious what Brocade finds out. If you get any feedback from them please let me know.

      • saiyed

        Update: issue is not resolved.

        1. re-enabled the switch after disabling it

        2. reran configure, please see the the output and advise if any thing should change.

        Fabric parameters (yes, y, no, n): [no] no
        Virtual Channel parameters (yes, y, no, n): [no] no
        F-Port login parameters (yes, y, no, n): [no] no
        D-Port Parameters (yes, y, no, n): [no] no
        RDP Polling Cycle(hours)[0 = Disable Polling]: (0..24) [0] 0
        Zoning Operation parameters (yes, y, no, n): [no] no
        RSCN Transmission Mode (yes, y, no, n): [no] no
        Arbitrated Loop parameters (yes, y, no, n): [no] no
        System services (yes, y, no, n): [no] yes

        Disable RLS probing (on, off): [on] on
        Portlog events enable (yes, y, no, n): [no] no
        ssl attributes (yes, y, no, n): [no] no
        rpcd attributes (yes, y, no, n): [no] no
        cfgload attributes (yes, y, no, n): [no] no
        webtools attributes (yes, y, no, n): [no] yes

        Basic User Enabled (yes, y, no, n): [no] no
        Perform License Checking and Warning (yes, y, no, n): [yes] yes
        Allow Fabric Event Collection (yes, y, no, n): [yes] yes
        Login Session Timeout (in secs): (60..432000) [7200]

        +++++++++++++++++++++++++++++++++++++++

        Found the following:

        Looks like the HTTP/HTTPS daemon in the switch has encountered errors and is not pumping the data properly.

        As a workaround try to restart the switch http/https daemons this can be done by executing the following commands from CLI

        Procedure:
        execute “configure” from CLI
        select “http attributes”
        disable HTTP/HTTPS
        exit out of configure.
        execute “configure” and enable the HTTP/HTTPS services.

        PS: This just restarts the HTTP/S daemons and does not affect the switch functions in anyway.

        ++++++++
        Note: I did not find the same setting on FOS 7.3.0c and the 6520;
        maybe it has changed or is buried in some other config.
        ++++++++++++++++++++++++++++++++++++++
        4. I changed the switchname to remove _ and keep it at simpler name just in case it was affecting java.
        +++++++++++++++++++++++++++++++++++++

      • saiyed

        I finally have success, without a lot of changes.
        1. Uninstalled all Java from my desktop
        2. Reboot
        3. Connected to the switch without any Java; it prompted for Java 6.25 (strange). I allowed it to download, then relaunched the connection, and the browser complained about out-of-date Java and asked for 1.7+ (it should be noted that when connecting with IE it downloaded 6.25, but Firefox redirected me to update Java to 8.31, the current version).
        I ignored that and updated from my previous download to 7.60 (supported by Brocade for this switch).
        4. Reconnected to the switch. This time it complained about being out of date, but I said proceed. Meanwhile browser add-on pop-ups kept appearing about enabling some sort of SE add-ons (SSV Helper & SSV2 plugin helper); I allowed them.
        Then, after the 1.7.60 install completed, the webtools loaded.
        I’ve been fighting this for two days. I really wish Brocade and Java would fix this type of issue; it’s giving them a bad name.
        Also, I am not happy that Brocade does not allow access to their KB support site, even though I have a MyBrocade login and we own several Brocade products, because we buy them from a third party. Not a good policy in my opinion.

        • Hello Saiyed,

          Good to see it’s finally working. As you can imagine the issue is mainly with Java and Oracle’s release and security policies, which seem to keep changing every month. The Java engine and plugins change at such a rapid pace that it’s almost impossible to keep up-to-date with them. As you mentioned, the Brocade app requests a certain level of the JRE, but as soon as the browser requests that version you first run into a browser check which flags that version as being insecure or incompatible (for whatever reason) and either blocks it entirely or redirects you to Oracle’s Java webpage which, more or less, gives you no other option than to download and install the latest version.

          I always liked the Webtools app but not for management purposes. I learned the CLI pretty quickly and found that it’s not only much quicker but also more reliable, and you’re not dependent on third-party versions and the related conflicts you may run into.

          As for Brocade’s portal I cannot comment on that as I may have different privileges than you and I’m not in a position to make changes to their policies.

          Regards
          Erwin

  43. Duncan

    To be honest, I had no intent to downplay anything. I do strongly believe though that things do not happen overnight. In your article you more or less allude to this, maybe without realizing:

    “Companies like E-bay, Spotify are happily harnessing the power and flexibility of Docker. Also well-know search engines like Yandex and Baidu are using Docker for a multitude of functions.”

    The reason companies like e-bay etc are successful in deploying these types of technologies is because they have a completely different application stack than most enterprises. You want to ask yourself next why it is that virtualization has been so successful, and in my opinion that is mainly because it offered a couple of things:

    1) higher utilization of resources
    2) mobility
    3) availability

    And it did this without any disruption to the workload, no changes to the application or even OS required besides the installation of some drivers. That is the big difference with what you are describing and what companies like e-bay are doing.

    Now don’t get me wrong, I do believe the world is changing. Containers, but more importantly distributed application architectures, are the way of the future. But knowing how many enterprises operate and how little control they (unfortunately) have over a vast majority of their application stack, I also realize that it is not realistic to expect this to change drastically within, let’s say, 5-6 years, at least not for the portion they do not own internally. Again, yes the world of IT is changing… but hasn’t it always been?

    • evlonden

      Hi Duncan,

      Thanks for your response. I agree the success of VMware is partly explained by your arguments, but it skyrocketed mainly due to the lack of competition in that niche market for the entire 2000-2010 decade. That market has not really changed from a technological perspective, but a multitude of vendors with additional hypervisor technologies are now fishing in the same pond. Currently, the maturity of the VMware hypervisor, its contributing technologies and its eco-system make it an obvious candidate on each short-list.

      The companies mentioned in Docker’s “use-cases” have the advantage of being able to target very specific applications, which makes adoption easier. We see the same thing here in Australia, where OpenStack is very well represented in the education sector. The major threat for VMware, however, is the fact that vendors like Oracle and SAP are in a very comfortable position when it comes to shifting their products into containerized formats and no longer need to take OS interoperability into account. That means much shorter dev-cycles and therefore much improved code resiliency. As soon as that happens, the majority of main-stream business applications will be available to current customers who run them on hardware or VMs, and that fact, combined with an even less resource-hungry virtualisation stack like Docker and AWS Lambda, will certainly eat big-time into VMware’s sales numbers. As I said, when push comes to shove, in the end it’s about the $$, and if a CIO can save by squeezing even more juice out of his/her hardware they most certainly will.

      As for time-lines, it’s always hard to predict. I don’t have insight into roadmaps from SAP, Oracle and other major companies in this space. I would not be surprised though if they have been exploring these kinds of technologies for quite a while now and all of a sudden whack products on the market which fit in this space. Customers who have been exploring the same technology will most certainly adopt very quickly.

      Again, thanks for your feedback.

      Cheers,
      Erwin

      • Duncan

        Personally I am not sure the overhead difference is that big, we are primarily talking disk space here for now and with advanced techniques like VMfork and smart linked cloning technology even that will diminish. For many enterprise companies there is no benefit in having 2 or 3 different operational strategies, it will simply not be cost effective and will be too complex. On top of that there is the whole security aspect, companies like Google run their containers in a VM for a reason.

        Anyway, this is an interesting discussion… but with your background I do wonder how you feel this will impact the world of storage. Especially with these new highly distributed application architectures, the need for expensive shared storage systems starts to disappear for a large portion of the stack. How will Docker, but more so the distributed architecture, eat your lunch?

        • evlonden

          🙂 Replying whilst eating lunch… haha..

          I can’t comment on what Google is running or not. Under NDA here.

          It’s fortunate that I work for a company whose bare-bones roots lie in large-scale compute infrastructures (mainframe-compatible kit) with a huge background in storage, but the main benefit is that we (as HDS) are not solely tied to compute infrastructures and related materials. As a subsidiary of Hitachi we have the benefit of being tied in with a massive number of diverse, technology-related sister companies, which has opened up a huge playground of opportunities: city smart-grids, health-care, logistics, manufacturing technologies and much more, all in a single house and with the Hitachi logo stamped on it. The onflow from these market sectors will ensure that a compute platform such as Hitachi UCP will be cemented to support these.

          My diet may shift to a reduced lunch but breakfast and dinner will be much more …. interesting. 🙂

          You may want to check out https://community.hds.com/community/innovation-center

          Some sneak previews are there or will be published soon.

          Regards,
          Erwin

  44. Massimo Re Ferre'

    I agree with your analysis.

    With all respect due, there is nothing visionary here, it’s pretty much common sense.

    Innovate, monetize, lather rinse and repeat. Or die.

    What’s new?

    • evlonden

      Hello Massimo,

      Thanks for agreeing. I do however tend to disagree with your second statement. Innovation comes from vision. You see opportunities in new markets or have the ability to create something much better than something that is already out there. Henry Ford did that with cars, Steve Jobs with phones, Bill Gates with DOS (the latter based on new markets, not an improvement in quality of products. :-))

      Again thank you for your comment.

      Cheers,
      Erwin

      • Massimo Re Ferre'

        I was referring to the vision required to underline this pattern, which is common sense.

        Obviously the vendor will need to have a vision to keep innovating.

  45. Kelvin

    It looks like there is no way to predict the compression ratio and average frame size before compression is enabled, am I correct? In that case, we should do a best-effort estimation first, then run the configuration for a while, collect the actual data (compression ratio and average frame size) and do the fine-tuning, right?

    • evlonden

      Hello Kelvin,

      Yes, that is correct. I’ve been involved in a fair few cases where we had to start off with the assumption of no compression at all and work our way towards a more realistic number. Especially in FCIP cases where WAN links are involved this may be the best way to go. It will also allow you to reduce your leased bandwidth to that number and save some money.

  46. Pomer

    Hi Erwin,

    Where did you get the official standards for the chart above regarding the fill modes? Do you have any links?

  47. Roman Sozinov

    Erwin, you say “based on the QOS zone mapping and if contention occurs the priorities of the frames being sent over the ISL is set to a 60%/30%/10% ratio. ” – is the 60/30/10 ratio something that is hardcoded and not customizable? When we asked Brocade about how QOS works internally they never mentioned such a ratio.

    I thought this ratio is always valid and QOSH zone traffic has priority on ISLs just because VCs 10-14 have 5 dedicated buffer credits each (whereas QOSL zones have only two of them and QOSM can borrow credits). Now I’m a bit confused.

    • evlonden

      Hello Roman,

      This is not new and has been in FOS for a long time. I’m curious why Brocade is so secretive about this, as it is mentioned in their training material. Be aware this algorithm only kicks in when contention occurs and is fixed in the ASIC code. Although you have virtual channels, the link itself remains a serial line and frames can only be sent one after the other. If multiple frames arrive at the switch from different originators and are mapped to traverse a particular ISL, the order in which they are sent depends on arrival. However, if the ingress buffers of each of these originators fill up, the frames are sent onto this ISL in a 60/30/10 ratio. So you will get 6 frames if the destination is in a QoSH zone, then 3 frames for a QoSM zone and 1 frame for a QoSL zone. This prevents a higher QoS level from locking out any lower one, so frames with lower QoS values will still traverse the link. Without it, a very busy system mapped to a QoSH zone could potentially prevent all other traffic in QoSM and QoSL from being moved over an ISL. With FOS applying this ratio there is still a relatively fair queuing mechanism, which shouldn’t cause issues on hosts. Obviously, when you run into these situations it would be wise to start re-designing your infrastructure and increase the capacity on the ISLs to accommodate more traffic.
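
      The 6/3/1 behaviour under contention can be modelled as a simple weighted round-robin. The sketch below is only an illustration of the ratio; the real arbitration sits in the ASIC, and the function name, queue layout and frame labels are assumptions made for this example.

```python
from collections import deque

# Frames sent per scheduling round under contention, following the
# 60/30/10 ratio described above. Illustrative model only; the real
# arbitration is done in the ASIC.
WEIGHTS = {"QoSH": 6, "QoSM": 3, "QoSL": 1}


def drain_isl(queues, weights=WEIGHTS):
    """Return the order in which queued frames would leave a congested ISL."""
    # Only levels that have a weight take part in the scheduling.
    pending = {level: deque(queues.get(level, [])) for level in weights}
    sent = []
    while any(pending.values()):
        for level, quota in weights.items():
            queue = pending[level]
            # Send up to `quota` frames from this level, then move on.
            for _ in range(min(quota, len(queue))):
                sent.append(queue.popleft())
    return sent


if __name__ == "__main__":
    queues = {
        "QoSH": [f"H{i}" for i in range(12)],
        "QoSM": [f"M{i}" for i in range(6)],
        "QoSL": [f"L{i}" for i in range(2)],
    }
    print(drain_isl(queues)[:10])
```

      With all three queues backed up, each round of ten frames carries six high, three medium and one low priority frame, so the low priority traffic is slowed down but never starved.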

      Hope this explains the situation.

      Kind regards,
      Erwin

  48. Adrian Alvarado

    Hi Erwin,

    Have you tried, or do you know how, to make Brocade Web Tools work on Mac OS X?

    Regards

  49. Yury Yavorsky

    Hello Erwin! Could I ask you to come back to this once more? You talk about RSCN as the trigger for taking a path offline. If so, why does the path stay online until a real I/O or SCSI TUR occurs? This is also stated in the HDLM documentation – “Without path health checking, an error cannot be detected unless an I/O
    operation is performed, because the system only checks the status of a path
    when an I/O operation is performed. With path health checking, however, the
    system can check the status of all online paths at regular intervals regardless
    of whether I/Os operations are being performed.”

    • evlonden

      Hello Yury,

      When things go wrong in a fabric there is no guarantee that the RSCN will reach all ports in the fabric. The TUR or the real IO request is then a fallback method for MPIO to take the path offline. If the RSCN has reached the HBA or storage port you can safely assume that path will be taken offline immediately.

      Hope this helps,

      Regards,
      Erwin

      • Yury Yavorsky

        One situation where the RSCN will not reach the port is when the link itself from the HBA port to the switch is broken. Could you tell me about other situations? I suppose that in a normal environment RSCN should work fine for the fabric events that the particular port is registered for (SCR value).

        • evlonden

          Hi Yury,

          Indeed. There has been a defect in FOS which inadvertently modified the SCR value in the nameserver registration. This caused RSCNs not to be sent to the device in question. Obviously the same thing can happen when a device has a firmware/driver bug and does not properly handle the SCR registration or does not act properly upon reception of an RSCN. These are however corner cases.

      • Roman Sozinov

        Erwin,

        In our environment (RHEL 5.10 with native MPIO) we have a situation where the host gets an RSCN (when a storage port goes offline), but does not block the path to that storage port immediately.

        It waits for dev_loss_tmo seconds (or another SCSI timeout value based on the SCSI error handler) instead.

        Are you saying that something is definitely wrong in our configuration, and that once the RSCN has been received MPIO has to block the path to that storage port immediately (without waiting for any SCSI EH timeouts)?

        • evlonden

          Hello Roman,

          No, your environment is behaving as expected. There are a number of reasons why a path is not marked as failed immediately, because an RSCN may be sent for various reasons and a failed link is only one of them. There are two timers that control this. One you already mentioned and the other is fast_io_fail_tmo. Please see the documentation on when and how to use these.
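
          To see what both timers are currently set to, a small script can read them from sysfs. This is a sketch that assumes a Linux host exposing the FC transport class under /sys/class/fc_remote_ports; the function name is made up for this example, and the values are kept as strings because fast_io_fail_tmo may read "off".

```python
from pathlib import Path

# The two timers discussed above, as exposed per remote port by the
# Linux FC transport class (assumed location, see lead-in).
TIMERS = ("dev_loss_tmo", "fast_io_fail_tmo")


def read_rport_timers(root="/sys/class/fc_remote_ports"):
    """Map each remote port name to its timer settings (as strings)."""
    result = {}
    base = Path(root)
    if not base.is_dir():
        return result  # no FC transport class on this host
    for rport in sorted(base.glob("rport-*")):
        values = {}
        for timer in TIMERS:
            try:
                values[timer] = (rport / timer).read_text().strip()
            except OSError:
                values[timer] = None  # attribute missing or unreadable
        result[rport.name] = values
    return result


if __name__ == "__main__":
    for name, timers in read_rport_timers().items():
        print(name, timers)
```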

          Hope this helps.

          Kind regards,
          Erwin

          • Roman Sozinov

            Thanks Erwin for the reply.
            I understand that an RSCN may be sent for a number of reasons.
            But let’s look at the storage port offline case: shouldn’t MPIO block failed paths immediately after it gets an RSCN saying the storage port is offline?

          • evlonden

            🙂 There is no RSCN for a “Port-Offline”. The RSCN can contain a few options, one of which is “Changed Name Server object”. This is the RSCN sent if any of the ports the HBA is zoned to changes its state in any way. It doesn’t necessarily have to be “Offline”. It is then up to the recipient of the RSCN to determine what has changed. This might take a while. There is also an RSCN which is “REMOVED OBJECT” but I’m not 100% sure if that one is currently used. It depends on the RSCN event qualifier being used by FOS. That in turn depends on the SCN registration the recipient has sent to the fabric controller upon NS registration.

          • Roman Sozinov

            Thanks again for the clarification 🙂
            Just trying to identify which level (fabric, HBA driver or MPIO) is responsible for the latency problem we have in case of a storage port failure.

  50. Hello Siddarth,

    Thank you for your comment.

  51. Thanks Gavin for the response. As I mentioned above I’m not 100% sure how the calculation is done. I think it’s a bit over-simplified to say you get 16000 IOPS out of a subsystem and divide that number by the number of spindles in the system. I also mentioned in my post that utilizing every surface of a spindle does not improve performance per se. It will most likely decrease if not handled correctly, due to the increased delay caused by the head-switch time, which incurs an additional settling time.

    If you can elaborate a bit more that would be much appreciated. As far as I can determine now from your comments, the proclaimed number of 400 IOPS per spindle is only substantiated for an overall 40-spindle subsystem.

  52. alpharob

    “Obviously there are some additional smarts which is handled by the drive
    firmware and buffer-space which can adjust and re-order the read- and
    write commands to optimize some of this.”

    Some or all of this? TCQ and analogous NCQ changes all of this. Tracks will switch and get that nearby read, never mind it came on to the queue much later.. as you know. Maybe a guess of the firmware that Richie and crew came up with is to move the hotter blocks to faster tracks. I seriously doubt they improved on TCQ/NCQ algorithms. But who knows.

  53. Yury Yavorsky

    Hello Erwin! Thanks a lot for this post. Just one question regarding CS_CTL values in Default mode. As I see it, the ranges 1-8, 9-16 and 17-24 do not correspond to the number of VCs. How will the switch interpret this? As some kind of priority? Would it then allow a bigger percentage of High Priority VC usage for frames with value 24 than for those with value 17?

  54. Yury Yavorsky

    Hello Erwin! Thanks a lot for this post. Just one question regarding CS_CTL values in Default mode. As I see it, the ranges 1-8, 9-16 and 17-24 do not correspond to the number of VCs. How will the switch interpret this?

    • Each of these values is mapped to the LOW, MED and HIGH VC groups in a round-robin fashion as far as I’ve been able to determine. Be aware that these values are defined in the RFCs I mentioned. I wouldn’t be surprised if the QoS functionality of the CS_CTL field is extended to FCIP configurations, where these DSCP values can be used to propagate the QoS priorities into the IP WAN networks.

      Cheers

      • Yury Yavorsky

        I extended my question with the assumption of some kind of priority. CS_CTL would then win not only because it works per LUN, but also because it would offer a much higher level of granularity in the distribution of traffic.

        • evlonden

          As opposed to the IP layer capabilities where there are more options to choose from obviously. On the Brocade side we’re restricted to 3 QoS levels hence why they grouped these values into 3 sections. There is no further prioritization within these groups.
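
          As a small illustration of the grouping: every value inside a range lands in the same QoS level, so 17 and 24 are treated identically. The sketch assumes that 1-8 is the low range and 9-16 the medium range (the thread only implies that 17-24 is high), and the function name is made up for this example.

```python
def cs_ctl_priority(value):
    """Return the QoS group for a CS_CTL value in default mode.

    Assumed mapping: 1-8 LOW, 9-16 MEDIUM, 17-24 HIGH. Values outside
    these ranges return None. There is no ordering within a group.
    """
    if 1 <= value <= 8:
        return "LOW"
    if 9 <= value <= 16:
        return "MEDIUM"
    if 17 <= value <= 24:
        return "HIGH"
    return None


if __name__ == "__main__":
    for value in (1, 8, 9, 17, 24, 25):
        print(value, cs_ctl_priority(value))
```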

  55. Michael

    Hello Erwin,

    thanks for the guide.
    One question regarding the system.cpuload setting. Is changing it to 10 still true for FOS v7.x or other fixed versions referenced in TSB 068 and 111?
    What I got from Brocade is that in any version containing the fix modifying the cpuload setting is not needed. Do you have more information about that?

    Best regards
    Michael

    • evlonden

      Hi Michael,

      Indeed there has been improved logic in later code releases, however I’ve still seen issues hitting the i2c bus (the same bus that is used when polling the SFPs). By setting these values you can still prevent a process from hogging that bus. No guarantees that Brocade will hold on to this setting and its values. The problem is most apparent when two or more management applications are polling the switches for info. I’ve seen one environment where 5 management applications plus two scripts were constantly polling the switches. This customer had run into an issue and wanted to be notified on multiple levels via various methods. Obviously the way they configured and used it was well beyond what a switch CPU could handle and they ran into even more issues. By killing off all but one management tool and reducing the polling frequency they went back to a more normal state. That management application was then configured via SNMP and syslog forwarding to notify these other apps.

      Hope this helps.
      Regards,
      Erwin

  56. Yury Yavorsky

    Maybe a little off topic, but could you provide information on CS_CTL support on the HDS side? Is it supported? If so, for which sequence in the exchange: inbound or outbound?

  57. jaymike

    This is great material. The only thing I might do differently is matching up the switchnames to the domain IDs. I usually advise to give the switch a name that helps correlate it to the domain. That way they (and us support people) don’t have to remember

    SW01 = 225 (e1)
    SW02 = 193 (c1)
    SW03 = 226 (e2)

    I might go with something like SW-225, SW-193 and SW-226. Granted, you can see this mapping in ‘fabricshow’ (and it could complicate domain ID changes), but I just like knowing at a glance which domain I’m dealing with and it might help the customer make sure they’re sending in data from the right switch.

    Thanks for all of these posts.

    • Hello Mike,

      Thanks for your comments. Obviously you can use switchnames to your liking. The reason I use the hex values of the domain IDs is that I can more easily map each individual FCID to the respective switches/domains. In my case, if I have an HBA connected to port 10 on switch 01, that FCID will become E10A00, so that shows me at a glance what I’m looking at. This is also very handy for scripting purposes if you extract information out of a supportshow and use scripting tools that can convert hex to dec values and back. I have always used, or tried to use, machine-based information that is not susceptible to human interpretation. This prevents confusion if someone needs to take over. Surely there are more options and it’s up to the person who administers the environment what he/she wants to use.
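
      As a scripting illustration of that mapping, the sketch below splits a 24-bit FCID into its three bytes. It assumes the default addressing where the middle byte follows the port index; the function and key names are made up for this example.

```python
def decode_fcid(fcid):
    """Split a 24-bit FCID hex string into domain, area and AL_PA bytes."""
    value = int(fcid, 16)
    if not 0 <= value <= 0xFFFFFF:
        raise ValueError(f"not a 24-bit FCID: {fcid!r}")
    return {
        "domain": (value >> 16) & 0xFF,  # switch domain ID
        "area": (value >> 8) & 0xFF,     # port index in default addressing (assumption)
        "alpa": value & 0xFF,            # loop / NPIV byte
    }


if __name__ == "__main__":
    parts = decode_fcid("E10A00")
    # Domain 0xE1 is 225 decimal, area 0x0A is port 10.
    print(parts)
```

      Decoding E10A00 gives domain 225 and port 10, which ties back to both naming schemes discussed here: the hex value e1 and the decimal domain ID 225.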

      Regards,
      Erwin

  58. vikranth

    Hi Erwin van Londen,
    Well, I have an issue logging in to Brocade FOS via IE. Java 7u67 blocks it and reports a failed certificate.
    I tried everything discussed in the conversation above, including editing the file in Notepad++, but it doesn’t help. I also cannot downgrade Java due to environment standards, which only allow JRE 7u67.

    Is there any other solution you know of that can help me with this issue?

    Thanks

    • Hello Vikranth,

      There have been some changes in both FOS and JRE. When you look at the release notes of FOS 7.3 in the “Webtools” section on page 47 it’ll show you additional options you can try. I’ll add this to this post as well.

      Still 1.7.0u67 is not on the supported JRE list even for FOS 7.3. The latest supported seems to be 1.7.0u60 but that should be close enough.

  59. “: Brocade Network Advisor – Java memory size. – http://t.co/uuQF1Elfmj” < Great hint!

  60. @landonnoll People still use Fibre Channel?

  61. gcharriere

    Hi Erwin,

    Another solution could be TI zones… Maybe in your next post?

    Regards,
    Gael

    • Hi Gael,

      From a technical flow control perspective this would have been an option, however I also wanted to segregate the functional part so that from an administrative point of view there was much more flexibility and less room for error. The concept of TI zones (which in essence is just a routing adjustment) may even become dangerous if it is not set up and administered correctly.

      Food for thought indeed. 🙂

      Regards,
      Erwin

  62. AlexeiStepanov

    Hi Erwin, thanks for the excellent reading, even better than always.

    Just as a side note about the D-Port functionality. Very good stuff, but still needs some … “tuning”. 1) I’ve seen D-Port reporting “PASSED” while some very bad counters increasing at one of the ends of the wire. 2) Occasional disabling of the D-Port might lead to the reset of the entire blade.

    • In theory that shouldn’t be happening, however I’m not surprised since the “ClearLink” technology that Brocade developed is still in its infancy. Qlogic, after taking on the Brocade HBA business, will also further develop the ClearLink technology, so I think within one or two years they will have an excellent understanding of how it all pans out. There haven’t been many cases around yet, so both companies rely on their internal QA departments to figure out stuff. As soon as more field experience is gained you’ll see an improvement in code and use-cases. As for the blade failure, in some cases this is expected. If multiple ports on a back-end link observe errors FOS may fault a blade. Remember that front-end port errors might be propagated to the back-end, especially when link resets and/or CRC errors occur. Basically the name of the game is that you have to make sure the physical side of the infrastructure is in 100% tip-top shape before taking it into production.

      Thanks for your response.
      Cheers, Erwin

  63. jpm

    Hi Erwin,
    I had this annoying problem running an older FOS (6.4.2b) and a new Java (1.7.0_55-b13) on my Windows 8.1 x64 laptop.
    The solution is as described (changing the line jdk.certpath.disabledAlgorithms=MD2, RSA keySize < 1024 to jdk.certpath.disabledAlgorithms=MD2, RSA keySize < 256) in the java.security file, however when running a 64bit system this file exists in TWO locations:
    c:\program files\java\jre7\lib\security
    c:\program files (x86)\java\jre7\lib\security
    Be sure to make the corrections in both files; that worked for me 🙂
    /John
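
    For anyone who wants to script this, here is a minimal sketch that applies the change described above to both files. It assumes the two paths listed in this comment; the helper names are made up for this example, a .bak copy is written first, and the script needs to run with rights to write under Program Files. Keep in mind that lowering the key size limit relaxes a Java security check.

```python
import re
from pathlib import Path

# The two locations mentioned above for a 64-bit Windows system.
CANDIDATES = [
    Path(r"c:\program files\java\jre7\lib\security\java.security"),
    Path(r"c:\program files (x86)\java\jre7\lib\security\java.security"),
]

# Matches the active (uncommented) property line ending in "RSA keySize < 1024".
PATTERN = re.compile(
    r"^(jdk\.certpath\.disabledAlgorithms=.*RSA keySize < )1024\b",
    re.MULTILINE,
)


def relax_key_size(text, new_size=256):
    """Return the file content with the RSA key size limit lowered."""
    return PATTERN.sub(lambda m: f"{m.group(1)}{new_size}", text)


def patch_file(path, new_size=256):
    """Patch one java.security file in place; return True if it changed."""
    original = path.read_text()
    patched = relax_key_size(original, new_size)
    if patched == original:
        return False
    path.with_name(path.name + ".bak").write_text(original)  # keep a backup
    path.write_text(patched)
    return True


if __name__ == "__main__":
    for candidate in CANDIDATES:
        if candidate.is_file():
            print(candidate, "patched" if patch_file(candidate) else "unchanged")
```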

    • Hi John,

      Good tip. Thanks. I think it depends on which JRE you have installed on your system. I haven’t tested this but it could be that the 64-bit version can run in some sort of 32-bit compatibility mode which then uses some other config files and libraries.

      Cheers,
      Erwin

  64. While I am not certain why you were seeing a rule to alert on anything more than 0% Memory Usage, the default in BNA is to alert above a threshold of 75% memory utilization. I believe you will agree this is a reasonable default threshold value.

    • Hello Andre,

      I don’t know this either. I assume the administrator has been playing around with MAPS and inadvertently enabled an incorrect policy or created one of his own with the incorrect settings. This is however not what I wanted to emphasize with the article. What I wanted to do is highlight the fact that even when such a large number of events are being logged, for obvious reasons, no-one seems inclined to adjust the rule-set. That, for me, is taking care of business and actively managing such an environment. Somebody can make a mistake by creating and enabling such a weird rule but as soon as the event-log starts to fill up he, or she, should be correcting this.

      Thanks for your response.

      Regards,
      Erwin

  65. gcharriere

    Hi Erwin,
    I also discovered some unexpected behavior with the conversion tool from FW to MAPS. I would advise cloning one of the default policies and then making the modifications in this customized policy:
    mapspolicy --clone dflt_moderate_policy -name my_policy

    You can see below the default log threshold for CPU (80%) and Memory(75%) with the default moderate policy:
    mapspolicy --show my_policy
    defCHASSISMEMORY_USAGE_75 RASLOG,SNMP,EMAIL CHASSIS(MEMORY_USAGE/NONE>=75)
    defCHASSISCPU_80 RASLOG,SNMP,EMAIL CHASSIS(CPU/NONE>=80)
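
    If you want to check such thresholds from a script, the rule lines above can be picked apart with a regular expression. This is a sketch based only on the two lines shown here; the field names (group, monitor, timebase) are my own labels for the parts of the rule, not official MAPS terminology.

```python
import re

# Layout taken from the two example lines above:
#   <rule name> <actions> <group>(<monitor>/<timebase><op><threshold>)
RULE = re.compile(
    r"^(?P<name>\S+)\s+(?P<actions>\S+)\s+"
    r"(?P<group>\w+)\((?P<monitor>\w+)/(?P<timebase>\w+)"
    r"(?P<op>>=|<=|==|!=|>|<)(?P<threshold>[\d.]+)\)$"
)


def parse_rule(line):
    """Parse one rule line into a dict, or return None if it does not match."""
    match = RULE.match(line.strip())
    if match is None:
        return None
    rule = match.groupdict()
    rule["actions"] = rule["actions"].split(",")
    rule["threshold"] = float(rule["threshold"])
    return rule


if __name__ == "__main__":
    sample = "defCHASSISMEMORY_USAGE_75 RASLOG,SNMP,EMAIL CHASSIS(MEMORY_USAGE/NONE>=75)"
    print(parse_rule(sample))
```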

    If you decide to build your customized policy from one of the default ones, please pay attention to the fencing thresholds. Fencing is activated for almost all error counters in the default policies. This is a Brocade best practice that I would advise as well. However, a lot of customers are afraid of such behavior, so it is at least worth being aware of.

    Regards,
    Gael

    • Hello Gael,

      Yes, port-fencing is one of the best features that were included in FabricWatch a while ago. I even made a short video on it. The thing that staggers me is that very few customers are utilizing this excellent feature. As you’ve seen in my “Rotten apples” series only one single broken link can have a devastating effect on an entire fabric. Preventing this from having further ramifications is a massive gain in reliability. I still don’t understand why admins don’t use this feature. Last week I had an example where an entire fabric went haywire because of an ISL-port having synchronization issues. This caused the entire fabric to almost come to a stand-still because of the constant re-configuring and normal traffic was stopped. A simple MAPS or FabricWatch rule could have prevented this.

      Thanks for your feedback.

      Regards,
      Erwin

  66. jchiodo

    Dropping the value from ‘ <1024 ' to ' < 256 ', by itself, did not work for us here on a 48K at FOS 6.4.1b with Java 1.7.0_51 on the host. But doing that AND changing the java control panel security tab from HIGH to Medium did the trick nicely!

    • Hello J,

      Thanks for providing the info. Did you upgrade FOS afterwards on the 48K? I think changing the security from High to Medium relieved the restriction on having no certificate or a self-signed certificate. I’m not 100% sure, but that very often seems to be the case.

      Once again thanks for the tip.

      regards,
      Erwin

  67. gcharriere

    Dear Erwin, could you please clarify what happens when frames tagged by Cisco equipment on one side are received by a Brocade device on the other side? To be more precise, I am referring to the interconnection between UCS (FEX) and a Brocade device. By default, the FEX will tag traffic in VSAN 1. It means that somehow this tag should be included in the extension header. In this case, what happens when the Brocade switch receives it? Does it simply ignore the tag and accept the traffic?

    I was under the impression that VSANs work the same way as VLANs. With VLANs, when a port is in access mode or untagged mode, the frame is sent without an 802.1q header at all. However, with VSANs, I did some tests that made me think that it behaves differently. When we connect a FEX in VSAN 1 to an MDS switch, there is communication only if VSAN 1 is configured on the other side as well. It means that the VSAN ID is included in the header. However, when we connect it to a Brocade switch that is not able to “speak” VSAN, there is still communication. Any idea how?

    Thanks in advance.

    • Hello Gael,

      I think the communication is limited to the ELP and EFP parameters. During initialization of the ISL the ELP (Exchange Link Parameters) checks what’s on the other side. If it’s a switch it will pop into E-port mode and it resumes with EFP (Exchange Fabric Parameters) negotiation. It might well be that the ISL itself is fully operational but the fabrics will segment. I think a Cisco FEX sits in NPV mode by default and thus is unaware of any VSAN configuration.

      Secondly, don’t get VLAN and VSAN mixed up. These are totally different and have nothing in common. Fibre-Channel N-ports do not “tag” frames per se. There is only one field they might use (CS_CTL), which is an obsolete byte in the frame header that was used in Class 1 and Class 4. I’ve personally never seen this byte being used. An N-port NEVER sets an extended or routing header.

      Hope this helps.

      Regards,
      Erwin

      • gcharriere

        Thanks Erwin. Cisco FEX sits in NPV mode by default, but most of the time, f-port-trunking feature is used as well. And in this case the traffic from the FEX goes into the switch with the VSAN tag. It is still not clear for me how a Brocade switch will process a frame with a VSAN tag from a Cisco equipment.

        According to the white paper from Cisco (http://www.cisco.com/c/dam/en/us/products/collateral/servers-unified-computing/ucs-b-series-blade-servers/whitepaper_C07-730016.pdf), the tag is simply ignored by the Brocade device, but there are no details about the way it works.

        Regards,
        Gael

        • Hi Gael,

          There is no VSAN identifier between the FEX and the Brocade switch. The vHBA, VIC and FEX determine to which VSAN they belong, but this is not propagated further. You can compare it to port groups in Brocade terminology, where the F-ports on the Access Gateway (NPIV switch) can be grouped to use one or more N-ports going to the rest of the fabric. These port groups are also not propagated to the rest of the fabric. The terminology differences across these two platforms don’t make it easier, but I hope I explained it a bit here.

          Cheers
          Erwin

  68. I use firefox and some nights at 12:00 exactly or 12:30 exactly, firefox stops working and does not load a page. The internet connection is perfectly fine though because it says excellent. I am connected to a home router and the signal strength is always excellent. I do not know why this happens at exactly 12:00 or 12:30 on my desktop time but it’s a pain. I’ve tried “ping”ing and everything looks fine..

  69. javier

    Hello Erwin, I have a Brocade 6510 connected to a PC running SUSE Linux. When I connect to the switch and enter my login and password, the webtool allows me neither to use the tool bar nor to progress to another window.
    SUSE Linux has a java.security file, but I can’t find the “jdk.certpath.disabledAlgorithms=MD2, DSA, RSA keySize < 1024” line in it,
    so I can’t apply the change.
    Thanks,

    • Hello Javier,

      There are a couple of options. I don’t have a SUSE instance at my disposal right now but my Fedora system runs both the OpenJDK as well as the Oracle Java JRE.
      [1354][erwin@monster:~]$ locate java.security
      /usr/java/jre1.7.0_45/lib/security/java.security
      /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.60-2.4.5.1.fc20.x86_64/jre/lib/security/java.security

      When looking at the Oracle “java.security” it shows on line 404:
      jdk.certpath.disabledAlgorithms=MD2, RSA keySize < 1024

      The same line is seen on the OpenJDK version. It may be that SUSE has packaged the RPM differently so the locations might not be the same.

      You did not mention the FOS version running on the switch. Maybe it’s not supported for that particular JRE version. I know Brocade have put some extra restrictions in later release notes stating that the 1.7.0_45 JRE is not supported, and I think they don’t support the OpenJDK at all.

      Hope this helps a bit.

      Cheers,
      Erwin
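When it is unclear whether a given java.security file contains the property at all, a small script can save some scrolling. This is only a sketch: the function name is made up for illustration, and it assumes the property sits on a single line without backslash continuations.

```python
from typing import List


def disabled_algorithms(java_security_text: str) -> List[str]:
    """Return the entries of jdk.certpath.disabledAlgorithms, or [] if absent.

    Assumption: the property is written on one line (no '\\' continuation).
    """
    for line in java_security_text.splitlines():  # handles both LF and CRLF
        stripped = line.strip()
        if stripped.startswith("#") or "=" not in stripped:
            continue  # skip comments and lines that are not key=value
        key, _, value = stripped.partition("=")
        if key.strip() == "jdk.certpath.disabledAlgorithms":
            return [item.strip() for item in value.split(",")]
    return []


sample = "# security properties\r\njdk.certpath.disabledAlgorithms=MD2, RSA keySize < 1024\r\n"
print(disabled_algorithms(sample))
```

To use it on a real system, read the file found with `locate java.security` and pass its contents to the function; an empty list means the property is not set in that file.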

      • javier

        Thanks Erwin for your help. We think the problem is the Java version used by the Mozilla browser. It’s version 1.6.0 on SUSE Linux (IBM Developer Kit for Linux). The switch runs OS version 7.2.0a.

        However, we can connect to the switch with the same version 1.6.0 on a Windows PC.
        To sum up, the problem is in the Linux Java version but we don’t know why.
        Regards
        Javier

        • Hello Javier,

          It’s a bit hard to troubleshoot why it doesn’t work. There can be many reasons. I’ve found that when you’re able to get a JRE console running, it might give you an idea of either where it stops or what causes the issue. To be honest, I’ve often used a trial and error method to dig through some of these issues. I also must confess that I haven’t used the Brocade webtools for a long time. The CLI has always worked better and faster for me.

          Cheers,
          Erwin

  70. AlexeiStepanov

    Hi Erwin, thanks, that’s a good reading!

    Clock sync is really essential when troubleshooting a complex issue.

    I once had to help people find a problem in an environment where each component was supported by a different support partner: AIX host, HP disk array, EMC multipathing and CISCO SAN – what a nightmare environment to build and maintain! And BTW everything was hooked to NTP, except the disk array SVP, but its time difference (57 minutes or something like that) was already carefully recorded. So the email thread, when I got it, looked like this:

    IBM: ERRPT shows FC errors and resets at 7PM. What do you see at this time?
    EMC: yes we see this in the PowerPath logs.
    HP: well we see some SCSI reserve conflicts, but we can’t tell why they happen…
    CISCO: we don’t see anything at all!!!
    Others: ??????

    It was the first troubleshooting exercise of my life involving CISCO, and I was surprised to discover that they log everything in GMT, so the SAN guys were in fact looking at a much later point in time… Also, at the end of the day, CISCO wasn’t found guilty 😉
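The timezone mix-up above can be avoided by converting every timestamp to UTC before comparing logs. A minimal sketch, assuming each device's UTC offset and any known clock skew are supplied by hand; the function name and the sample values are hypothetical:

```python
from datetime import datetime, timedelta, timezone


def to_utc(timestamp: str, utc_offset_hours: float, clock_skew_minutes: float = 0.0) -> datetime:
    """Convert a local 'YYYY-MM-DD HH:MM' log timestamp to UTC.

    utc_offset_hours: the device's offset from UTC (0 for a device logging in GMT).
    clock_skew_minutes: how far the device clock runs ahead of real time.
    """
    local = datetime.strptime(timestamp, "%Y-%m-%d %H:%M")
    tz = timezone(timedelta(hours=utc_offset_hours))
    corrected = local.replace(tzinfo=tz) - timedelta(minutes=clock_skew_minutes)
    return corrected.astimezone(timezone.utc)


# A host logging local time at UTC+10 and a switch logging in GMT
host_event = to_utc("2014-03-01 19:00", utc_offset_hours=10)
switch_event = to_utc("2014-03-01 09:00", utc_offset_hours=0)
print(host_event == switch_event)
```

A device that is not on NTP, like the SVP with its recorded difference, can be handled through `clock_skew_minutes`.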

  71. rlewin

    I am having the exact same problem that brjaiswal has. I’ve also tried the same things with the same results. In other words, I am not able to run the switch explorer after the install of java 7 update 51.

  72. AlexeiStepanov

    so let’s pretend we are interested in excavators 🙂

  73. jaymike

    Good tip to share. I’ve seen this burn customers before too who were trying to merge fabrics. Merges failing due to zoning “conflict” even though one side was already cleared with cfgclear. If that defzone is set to NoAccess, that hidden dummy cfg can prevent the fabrics from merging.

  74. kasperottesen

    Can this be changed non-disruptively if you have zoning on and have an active zone configuration?

  75. Martin2341

    Hi Erwin,

    Thank you for putting this together. I believe it is very helpful information.

    • Hi Martin,

      Thanks for the images. I’ll add them ASAP.

      • Eric Chen

        Above information is wrong.
        correct info is:
        it is either (Green “E” sticker+forward+arrow down) or (Orange sticker+reverse+arrow up).

        We got wrong part from EMC and just tested it.

        • Hi Eric,

          Which partnumber did you order and how was it working? Which sticker was on it and how did the chassisshow output display it?

          As I mentioned the discrepancy was more in the way FOS did display the airflow in the chassisshow output. The stickers (I and E) were good.

          • Eric Chen

            This is the one that works for us; air flows from the power cord side to the fiber cable/SFP side.

            switchname-xx:admin> historyshow

            FAN Unit 1 Inserted at Tue Nov 4 21:52:55 2014

            Factory Part Number: Not Available

            Factory Serial Number: Not Available

            switchname-xx:admin> chassisshow

            FAN Unit: 1

            Fan Direction: Reverse

            Time Awake: 3 days

            FAN Unit: 2

            Fan Direction: Reverse

            Time Awake: 3 days

            POWER SUPPLY Unit: 1

            Power Source: AC

            Time Awake: 3 days

            POWER SUPPLY Unit: 2

            Power Source: AC

            Time Awake: 3 days

            CHASSIS/WWN Unit: 1

            Header Version: 2

            Power Usage (Watts): -67

            Factory Part Num: 40-1000569-13

            Factory Serial Num: BRW25XXXXXX

            Manufacture: Day: 13 Month: 2 Year: 2014

            Update: Day: 12 Month: 10 Year: 2015

            Time Alive: 335 days

            Time Awake: 3 days

            ID: EMC0000CA

            Part Num: CONTRX0000651

            Serial Num: BRCBRXXXXXX

  76. keivan_mh

    THANKSSSSS – I don’t know how to thank you, I couldn’t find a solution almost anywhere on the net, this worked for me straight away.

    Just a hint that Windows users need to modify that “java.security” using Notepad++ or similar app. Normal Notepad will NOT work.

    • Thanks Keivan, glad it worked for you. Yes, Notepad++, UltraEdit or vi would be the best options to modify any sort of text document. These know about the differences in CR/LF handling between POSIX and non-POSIX systems.

  77. saiyed

    I have tried all the tricks listed here and I still cannot get both versions of Java to operate on my laptop.
    I need Java 6 and 7. I have updated/downgraded to 7.25 without any luck. The newer switches (Brocade 7800) work with Java 7,
    but the older ones with FOS 6.* (5300, 300) don’t.

    • Yes, I know it’s a really annoying issue, especially if you have different FOS code versions spread across your fabric. To be honest, I’ve never been a huge fan of the webtools feature and Brocade could have done better without it. The CLI is much more powerful. FOS itself would have become more stable as well but hey, I’m no paying customer. 🙂

  78. brjaiswal

    Hi Erwin,

    I have changed RSA keySize < 1024 to 256 but I am still not able to access the Brocade 5300 web tools. I have v7.2.0b on the switch and Java 7 update 51 on my laptop. I also tried commenting out “jdk.certpath.disabledAlgorithms=MD2, DSA, RSA keySize < 1024” but that did not work either. Please suggest a solution.

    Thanks

    • Hello,

      When I look at the release notes this FOS version only supports JRE 1.7.0_25.

      WebTools Compatibility
      FOS v7.2 is qualified and supported only with Oracle JRE 1.7.0 update 25.

      If you downgrade your JRE it is likely to work.

      Regards Erwin

  79. It helped me.

    The cache-clearing thing really works around the bug.

    Thank you…

  80. AlexeiStepanov

    at the same time, our storage tools require no more than 1.7.40… what a mess!

    • Hello Alexey,

      You can install more than one JRE on a system. It is then up to the Java applet to discover the correct version and use it accordingly. I’m not 100% sure whether Brocade, HDS, HP, IBM, Cisco etc. use this across their software portfolios.

  81. Sozinov

    What can you say about AoE (ATA over Ethernet) developed by Coraid?
    It looks like pure Ethernet without any FC layers, unlike FCoE (which is not Ethernet in reality 🙂).

  82. You raise some excellent points, as usual. Obviously, I can’t speak to all vendor implementations on the planet 🙂 because ultimately what it boils down to is whether or not vendors err on the side of caution. (Full disclosure to your readers: I am a PM for storage for Cisco’s Nexus switch product line.)

    The FCoE frame size is fixed as per the standard, so with respect to running a mix of larger and smaller frames on the same lossless Class of Service (COS), this would require the additional placement of lossless frames (RoCE?, iSCSI frames?) into that same COS. I know, though, that when FCoE is configured for lossless COS no other traffic type is allowed on that same COS.

    You raise an excellent point about other types of lossless frame sizes, however. I’ve always been highly suspicious about the efficacy of putting iSCSI into a lossless environment, but your point about different sizes is particularly apt if people start playing around with 9k sizes.

    Your point about needing to resend a PAUSE frame to prevent resumption is something that has always puzzled me – it’s actually built into the standard spec. I always thought it would be more appropriate to have an UNPAUSE frame sent back instead, but I’m sure that was discussed before I became involved in the standard body. Next time I go I’ll ask, and maybe I’ll have a logical reason to share. (Whether we agree it’s the correct thing to do or not is a different story 🙂)

    Realistically, however, the key factor here is one of distance, as often the HWM is hard-coded at the maximum allowable threshold for potential disruption; it’s not a dynamic setting. In other words, we take the worst-case scenario and use that conservative limit as the starting point.

    Having said that, of course, the behavior you describe – while certainly possible – doesn’t appear to be observed in practice. Do you have test results (or production environments) that have experienced this? I have to confess that over the past 6 years of doing FCoE (from pre-beta days) I’ve never actually seen this occur, personally. I would be very interested to see or hear anything about this in real environments.

    Best,
    J

    • Hi J,

      From what I know the MAXIMUM frame-size is fixed. That does mean that if the payload is significantly less than the max net payload of 2048 bytes for FC (which around 98% of all frames are) you could run into the issue I described.

      W.r.t. the UNPAUSE (or PAUSE_0), usage indeed depends on implementation. If the “timed pause” is used, whereby the CEV indicates the COS and one of the Pause_Time fields regulates the pause period, you could create some granularity; however, this still needs to be tightly controlled. The problem with receiver-controlled flow management is always that you never know what’s coming. On short distances this obviously is far less likely to occur and it may be the reason you haven’t seen it. It may become even more of a problem in environments where congestion comes into play (on a single COS) since that is when the PFC starts kicking in.

      I don’t have an environment at my disposal where I can create such scenarios, so I’m not able to provide you with an example.

      Thanks for posting your comments. Always great to discuss technicalities with people who know stuff.

      Regards,
      E

  83. Hey Erwin,

    Great analogy. One quick addendum regarding your water basin analogy for Ethernet. Your water metaphor is appropriate, because there is a “high water mark” in the ingress buffers on ASICs capable for lossless traffic. This is calculated based upon a maximum distance by which the PAUSE frame can be sent back to the source in order for all the water in the pipe to be satisfactorily stored in memory without discarding frames.

    This is why distance becomes somewhat of an issue, as you know. At the risk of piggy-backing on your well-written article, your readers may find some additional information useful as well: http://blogs.cisco.com/datacenter/storage-distance-by-protocol-part-iv-fibre-channel-over-ethernet-fcoe/

    Again, really well done.

    J
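To get a feel for why distance drives the high water mark, the headroom an ingress buffer needs can be estimated roughly as follows. This is a simplified sketch and not any vendor's actual calculation: it assumes about 5 µs of propagation delay per km of fibre, counts only the data in flight during one round trip plus one maximum-size frame that was already being sent when the PAUSE arrived, and ignores PAUSE processing time. The function name and defaults are illustrative.

```python
def headroom_bytes(distance_km: float, link_gbps: float,
                   max_frame_bytes: int = 2180, us_per_km: float = 5.0) -> float:
    """Rough estimate of the bytes that can still arrive after PAUSE is sent.

    Assumptions (not from any standard or data sheet):
      - ~5 microseconds of propagation delay per km of fibre
      - one max-size frame is already on its way out when PAUSE arrives
      - PAUSE generation/processing latency is ignored
    """
    round_trip_s = 2 * distance_km * us_per_km * 1e-6
    in_flight = round_trip_s * link_gbps * 1e9 / 8  # bits per second -> bytes
    return in_flight + max_frame_bytes


for km in (0.1, 1, 10):
    print(f"{km:>5} km at 10 Gbps: ~{headroom_bytes(km, 10):,.0f} bytes of headroom")
```

The estimate grows linearly with both distance and link speed, which is why a fixed, worst-case HWM limits how far a lossless link can stretch.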

    • Hi J, thanks for the compliment. As for piggy-backing, I think I’m going to do the same. 🙂 In your article you describe the receiving ASIC maintaining a High Water Mark (HWM) after which it will send a PAUSE. The problem still is that the receiving side has no clue about what’s coming w.r.t. the size of the frame. It can be a full FCoE frame of 2180 bytes but it can also be much smaller which, evidently, decreases the effective utilization of the link significantly.

      This is also true for pretty spiky workloads. You may run into the situation where a PAUSE is sent but the TX side did not transfer anything right after the burst. This basically means that you have a TX side sitting idle, plus an empty link, for too long whilst the RX side is still working on getting below the HWM before it will send an “UNPAUSE”. It then still requires the UNPAUSE to arrive at the TX side, plus the transfer delay of the long-distance link, before effective use of the link is re-established. As mentioned, with a spiky workload you might very well run into performance issues. I dare say this is imminent.

      The second issue this will incur is that both initiators and targets are designed and built with fairly tight timings in which they need to process IOs, whether on the array regarding onload/offload to cache and disk or on the server side in the OS stack. If HWMs are reached in the middle of an FC exchange and FC sequences are on hold for too long because of the above-mentioned problem, you will see significant delays, and secondary timers like E_D_TOV or HOLD time will be reached. This then has the ramification that entire IOs will need to be re-transmitted.

      I guess there is not a golden answer to every problem and that each solution should be carefully weighed for pros and cons.

      Cheers buddy.
      Hope to see you soon Down Under. 🙂

  84. Hello Biju,

    Some arrays do allow the initiator to log in even though the WWN is not registered. This can imply a security risk since it will also allow WWN spoofing and unauthorised access to LUNs with all nasty issues related.

    I don’t have an EVA at my disposal so I cannot analyse a trace to see what its behaviour is. There are pros and cons to both options. It depends on the choices the vendors make.

    Hope this explains it.

  85. bijucyborg

    I highly recommend Port Fencing for ISLs. I once (2009) encountered a strange phenomenon, although I guess it can be explained by several contributing factors. We had an IBM AIX host with some 40TB of storage assigned to it, and it was capable of some heavy IO. So we had 4 end-to-end connections: 4 HBAs zoned to 4 USP-V ports. One of the paths went down and the application timed out.

    The reason the path went down was that it encountered a flapping ISL en route and the resultant IO errors caused the path to go down. But because this was not a direct offline sequence, the RCA determined that the IBM SD driver sent messages to the HDLM driver which were not understood, and as a result the path was taken offline a bit late, resulting in an app crash.

    If I’m not mistaken, such flapping ports also cause a drain in buffer credits. It would be nice if you could write a piece on flapping ISLs, their effects and the cure.

  86. bijucyborg

    so how is the EVA different, customer claims that he used to
    1. Zone the host to the port so that the WWPNs are visible on the port.
    2. Mask the LUNs using the visible WWPN
    At least the claim is that the LUNs were visible right after, no reboots or flapping paths.
    As one of my colleagues pointed out, is it because EVA has the concept of a LUN0 visible to all hosts. Some sort of an access device which allows every PLOGI to be registered??

  87. I tend to agree with pretty much everything that has
    been composed in “Storage & Beyond | by Erwin van Londen”.
    Thanks a lot for all the actual details. Regards, Damon

  88. I must admit that I have trouble understanding your question. Can you be more specific?

  89. Hi Ron,

    Use the recommended values of 220 to 250 ms on the edges and 500 ms in the core. If you have Condor2 based switches in the core (DCX/DCX4S), try to separate E-ports from F-ports onto different ASICs. If your edges contain TARGET ports (arrays, tape devices etc.), do not set a low value here.

    The problem is that it’s a very grey area which touches on performance and quality of links, drivers, firmware and hardware, including the age of the equipment. You can have a perfectly well functioning HBA from 2002 which does a great job for the standards that were valid at that time, but these days it could be a potential bottleneck.

    Yes, the (0) is the default value, which will incur the values I described above.

    Hope this helps.

    Cheers,
    E

  90. i have some question which i have face that is !
    Where all my available storage in the enterprise?
    Why is it available?

  91. Ron

    Hi Erwin,

    I just checked: EHT on both core and edge is showing as 0 (default?? on this version). We are at 6.4.1 with an upgrade scheduled in two weeks. What would you recommend changing it to?

    Thanks for your help.

    Ron

  92. Be aware this box has been developed with some serious future proofing installed (as they did with the MDS95xx series). I wouldn’t be surprised if a fully populated box with 40GE and 32G FC ports (albeit at a somewhat lower port count) would also be non-blocking, and hence you would need the switching bandwidth they’ve built into it now.

  93. Anonymous

    Something I don’t get in their math:

    First, 384 ports @ 16G gives 6.144 Tbps duplex or (okay, okay) 12.288 Tbps “marketing” if you count each direction separately.

    Second, 256 Gbps per “fabric” with max 6 modules gives 1.536 Tbps total, which in my book of references means 4:1 chassis oversubscription when compared to the above.

    None of it comes close to the “24-Tbps FC Switching Bandwidth” advertised in the product sheet.

    What am I missing?

    Linar
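The arithmetic in this comment can be reproduced in a few lines. The figures are the ones quoted above (384 ports at 16 Gbps, 6 fabric modules at 256 Gbps each); the helper name is just for illustration.

```python
def aggregate_gbps(units: int, gbps_each: int) -> int:
    """Total bandwidth of a number of identical ports or modules."""
    return units * gbps_each


front_end = aggregate_gbps(384, 16)  # 384 ports at 16 Gbps
fabric = aggregate_gbps(6, 256)      # 6 fabric modules at 256 Gbps each
oversubscription = front_end / fabric

print(f"front end, one direction:   {front_end / 1000} Tbps")
print(f"front end, both directions: {2 * front_end / 1000} Tbps")
print(f"fabric total:               {fabric / 1000} Tbps")
print(f"oversubscription:           {oversubscription:.0f}:1")
```

Neither the single-direction nor the doubled front-end figure reaches the advertised 24 Tbps, which is exactly the gap the comment asks about.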

  94. It’s even more complex with different FOS versions, chipsets, if the “configure” command has been run or not. Quite some potential bombshells here.

    It seems Brocade is prepping a whitepaper on it so we’ll see. Meanwhile the entire frame-timeout scenario becomes even more difficult to troubleshoot.

  95. seb

    Wow, I was not aware of that. I also found no statement in the release notes. But our switches clearly show the default of 220ms for the EHT. Thanks for the hint, Erwin!

  96. Hello Raj,

    The features you mention are IP-based technologies. Only VLANs could be an option on the Ethernet layer.

    Then again, do you want to burn such a massive number of VLANs for initiator/target segregation? Network admins already think the current limitation is hampering their network growth. All in all it’s a bad idea.

    But that’s just my opinion. 🙂

    Regards,
    Erwin

  97. Hi Erwin,

    You can use VLANS similar to that of FC Zones.
    Use MACADDRESS of the Initiator for LUN masking on the target..
    You can implement end-to-end encryption something similar to SSL/TLS. Exchange a session key on discovering the LUN by the initiator and use it for encryption/decryption for the life of that session..
    What do you think?

  98. Yes, Brocade publishes a separate guide for FabricWatch. Be aware this is a licensed feature. You can download this from the Brocade website if you registered for a MyBrocade account.

  99. hi

    Is there a guide on where or how to configure this Fabric Watch feature? I have two B40 switches and I would like to take advantage of this feature.

    thanks a lot

  100. Thank you so much for sharing this storage 2013 and beyond. It is very important to us because we need to protect our things or files. And these kinds of tips are very informative. Keep on sharing!


  101. SNW is also quite well established in Europe – 2013 will be in Frankfurt – 29-30 October – mark the date if you’re in storage in Europe. http://www.snweurope.com

  102. Thanks Erwin. I read your post with interest. I would also add that without SNIA there would be no independent global storage education. This is particularly important in the regions that I deal with such as South Asia, China and India where there is always a huge demand for high quality education in storage technology.

  103. one can use vlan tagging to separate the traffic at level2. The switches normally won’t pass packets tagged for vlan A to vlan B

  104. Hi Seb,

    Indeed TUR is one option, a read-sector-0 another. The problem is that if only 0.5% of frames are affected you still have a 99.5% chance of the TUR/RS-0 succeeding. Secondly, if one of these TUR/RS-0 commands goes to la-la land there will always be a retry of that same command, which also has a 99.5% chance of succeeding. As you said, this is not a very reliable way to determine an end-to-end path status.

    I’m working with some folks in T11 to get this addressed but it might take a while since some new frames need to be introduced. These protocol changes are always cumbersome and even if they are rectified we still need the vendors to implement it. Still might take a few years.

    Cheers,
    Erwin
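The odds mentioned above are easy to put into numbers. The sketch below assumes that every probe is hit independently with the same drop rate, which is a simplification (a real TUR involves both a command and a response frame); the function name is made up for illustration.

```python
def probe_pass_probability(drop_rate: float, attempts: int = 2) -> float:
    """Probability that at least one of `attempts` independent probes gets through.

    Assumption: each probe fails independently with probability `drop_rate`.
    """
    return 1.0 - drop_rate ** attempts


drop_rate = 0.005  # 0.5% of frames affected
print(f"single TUR succeeds:        {probe_pass_probability(drop_rate, 1):.4%}")
print(f"TUR or its retry succeeds:  {probe_pass_probability(drop_rate, 2):.4%}")
print(f"path gets flagged as bad:   {drop_rate ** 2:.4%}")
```

Under these assumptions a health check with one retry declares the path healthy almost every time, even though regular IO on that path keeps hitting dropped frames.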

  105. Good article! It’s a good wake-up call for admins relying on an assumed “intelligence” in the multipather to save them not only from hard connectivity problems but also from “soft” problems like much longer latency or frame drops. Indeed there is an additional mechanism (often called “Heartbeat” or “Healthcheck”) in many multipathers. It sends TURs (SCSI Test Unit Ready) repeatedly and could for example set the path to dead if the TUR fails. But in the field, if you have a performance problem, often only a small number of frames are really dropped – although this “small number” would be a pain, too, as it will eventually lead to error recovery with all its nasty timeouts. If the TUR was not affected, the multipather just doesn’t recognize it. And even when it hits a TUR and the path is seen as “not reliable anymore” – just some successfully completed TURs later it will be used again although the performance problem is still there. So unfortunately TURs are not the solution of the problem.
    In the mainframe world there are some new mechanisms to cope with the problem as Brocade’s Dr. Guendert explains here: http://community.brocade.com/community/brocadeblogs/mainframe/blog/2012/09/14/the-ibm-zenterprise-ec12-brocade-dcx-8510-sportscar-tire-extraploation

    Let’s see what the open world guys come up with.

  106. Hi Seb,

    Thanks for your comments.

    Indeed you need to take care on which classes and entities you enable this portfencing and as such instructions from support people such as yourself and I should be followed. The thing I wanted to emphasize is that administrators and operational management should strive to use this feature as a measure to prevent problems and not wait for such problems to happen after which significant downtime and possible recovery efforts have to be undertaken to get the infrastructure back on track again.

  107. seb

    When it comes to portfencing I think it’s worth mentioning that you can configure it to trigger on C3 discards due to timeout on the TX side. If you do that only for F-ports, you should be able to catch hard slow drain device ports. Don’t do this for E-ports, because due to backpressure there could be a trail of dropped frames through the whole fabric and you would fence a lot of “affected-but-not-the-culprit” ports. Plus: if you enable portfencing also against other “areas” like PE (protocol errors), set the correct fillword first. If you use fillword 0 (IDLE/IDLE) for an 8G connection, the link might come up (because the link initialization uses IDLE and is okay then), but then the switch interprets the incoming ARBff fillwords as protocol errors (because it expects IDLEs) and fences the port. You see that as “er_bad_os” in portstatsshow.
    Cheers seb

  108. No, I won’t be there. My HDS T11 rep needs to schedule this but I first need to get some things out of the way from a technical perspective.

    It should go into FC-FS but I don’t know which release yet.

  109. seb

    Are you attending one of the f2f meetings about this topic? In which “workstream” is this discussed?

  110. I have something in the pipeline to address this. Will most likely be addressed at T11 in October or November.

  111. Hi Seb,

    Yes, there aren’t many blogs around which cover all this.
    There is more to come.

    Interesting to read your blog as well.

    Keep in touch

  112. seb

    Interesting article. Good to see more blogs about SAN troubleshooting!
    To the LR: In the situations where I traced it, the link reset usually happened after ED_TOV (2s) of running without a single buffer rather than the 500ms ASIC hold time. But maybe some devices start earlier with the recovery – both values are very long periods in the world of FC.
    Cheers seb

  113. seb

    Good article! The ugly thing with most of the multipathers is that they are not really built for problems like these. They can cope with links going down->be repaired->go up in perfect shape again. But if frames get dropped in the SAN due to slow drain devices, the multipather will notice a problem, but with a later TUR (Test Unit Ready) the path looks good for it again and will be further used. It would be better if the MPIO would track response times over the different paths and would move load towards better performing paths (avoiding the ones impacted by the slow drain device).
    Cheers seb

  114. Hi Roger,

    Thanks for your comments. You are right that the majority of Ficon and FC Extension products came to Brocade through various acquisitions. As for calling this a technological advantage, I’m not sure. Both Brocade and Cisco have obtained significant knowledge in both spaces, so when looking at the sum of RFE/RFP tick-boxes I wouldn’t know which one would show an advantage. Acquisitions done over 6 years ago do not provide an advantage in my view. Most technology has been refreshed, standards have changed and engineers working at those companies might have moved on to other avenues. If a company has no products, knowledge or R&D facilities in certain spaces (as was the case for Brocade back then with Ficon and extension), then the fastest and easiest way is an acquisition. (Buying McData was a good one; keeping the product set on the sales inventory list wouldn’t have been my option.)

    I do agree that Brocade most likely has an advantage with the variety of products which fit both enterprise and SMB markets but Cisco also has products which fit both spaces. I guess in the end it comes down to customer requirements and Brocade likely has more products which can fit specific scenarios.

    As I mentioned Brocade has been around since the dawn of the FC time and hence they have a longer track-record in the FC space. This is most likely the reason why customers still look to Brocade in the FC arena and to Cisco in the networking space.

    Again, thanks for your response.
    Erwin

  115. Hi Erwin,

    (i’m an IBM employee)

    Good post, thank you. As an ex-Inrange/CNT/McData employee I have a slightly different view on this. Looking from the ESCON/FICON view into the datacenter, I see some clear advantages for Brocade (because of the heritage coming from Inrange/CNT; think of the Inrange FC9000 director, CNT UMD director etc.). Here Cisco is lagging behind. In the FC arena I see most of my clients in the small and medium market jump on Brocade products because of the variety of their products.

    -Roger

  116. Hi David, thanks for your comment.

    As for AoE, it’s a very lean and mean protocol. Coraid is one of the vendors using it. The problem is that you need a flat layer 2 Ethernet network since nothing is routable, so on large-scale storage architectures it might be limited. Personally I never used it.

    You might want to check out the Tech Field day website (http://techfieldday.com/2012/coraid-presents-storage-field-day-1/) where Coraid presented. They also had a presentation on the protocol differences. Be aware that presentations and Q&A are always biased 🙂

    As with iSCSI, AoE also can have its place in the market but currently it looks like a very niche one to me.

  117. StorageFreak

    Just curious… I stumbled across your article due to hearing about ATA-over-Ethernet (AoE) that’s been built into the Linux kernel for a few years now.

    What would your take be on AoE? I assume same security concerns– I’m not really sure I see an advantage of AoE over iSCSI.

    Your take?

    David

  118. Anonymous

    Hi Erwin,

    The use of FCID forces the host CNA to connect directly to the FCF to acquire an FCID allocated by the FCF (the FCF maintains a centralized SNS database), at least in the current FC-BB-5 standard. Maybe FC-BB-6 has a way to assign an FCID without directly connecting the CNA to the FCF, but that could add a great deal of complexity; for example, FC-BB-6 introduces the distributed FDF.

    I may be wrong but that’s just my 2 cents.

    Thanks.

    Colin

  119. Kris

    To see the same output on a Cisco MDS/Nexus, use:

    # show interface fc1/1 transceiver detail

  120. Anonymous,

    I don’t think the addressing itself is the problem but more the complexity it introduces.

    Thanks for your comments.

    Kind regards,
    Erwin

  121. Anonymous

    I think you are right on. The problem with FCoE is that it is building a networking protocol (FC) over another networking protocol (Ethernet). FC uses 24-bit FCID for addressing but Ethernet uses 48-bit MAC for addressing, that is unnecessary and can only complicate design and cause more inter-op issues. It is absolutely possible to design a light-weight transport protocol to directly transport SCSI command/data over Ethernet.
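For readers wondering how a 24-bit FCID and a 48-bit MAC address end up coexisting: with fabric-provided MAC addresses (FPMA), the MAC is commonly formed by prefixing the FCID with a 24-bit FC-MAP value. The sketch below assumes the default FC-MAP of 0E:FC:00; the function name is made up for illustration and the FCID is an arbitrary example.

```python
def fpma(fcid: int, fc_map: int = 0x0EFC00) -> str:
    """Build a fabric-provided MAC address from a 24-bit FCID.

    Assumption: the fabric uses the default FC-MAP prefix 0E:FC:00.
    """
    if not 0 <= fcid <= 0xFFFFFF:
        raise ValueError("FCID must fit in 24 bits")
    if not 0 <= fc_map <= 0xFFFFFF:
        raise ValueError("FC-MAP must fit in 24 bits")
    mac = (fc_map << 24) | fcid  # upper 24 bits: FC-MAP, lower 24 bits: FCID
    return ":".join(f"{(mac >> shift) & 0xFF:02x}" for shift in range(40, -1, -8))


print(fpma(0x010203))
```

This shows the two address spaces being layered rather than merged, which is the extra complexity the comment objects to.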

  122. Damian

    Hi,
    thanks for the article, now I understand why VMware had problems with it. I agree with Chris that it’s the storage array’s fault, but I also think that there is a simple solution:
    The array should maintain a bit-flag “not_written_since_last_reassignment” per physical block, stating that the physical block has been unmapped before and not written to since then. As long as the flag is still nonzero (i.e. true), the array should return all-zero-blocks on any read-request. Using such flags there would be no need to physically zero unmapped blocks.
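
    The idea can be sketched as a toy model. This is not how any real array implements UNMAP; the class `ThinArray`, its method names and the tiny block size are all assumptions made for illustration. The point is only that UNMAP flips a flag and reads consult the flag, so no physical zeroing is needed.

```python
class ThinArray:
    """Toy array that keeps one 'unmapped, not rewritten' flag per block."""

    def __init__(self, num_blocks: int, block_size: int = 512):
        self.block_size = block_size
        self.blocks = [bytes(block_size)] * num_blocks
        # True means: unmapped (or never written) and not written since.
        self.not_written_since_last_reassignment = [True] * num_blocks

    def write(self, lba: int, data: bytes) -> None:
        if len(data) != self.block_size:
            raise ValueError("data must be exactly one block")
        self.blocks[lba] = data
        self.not_written_since_last_reassignment[lba] = False

    def unmap(self, lba: int) -> None:
        # No physical zeroing: stale data stays on "disk", only the flag changes.
        self.not_written_since_last_reassignment[lba] = True

    def read(self, lba: int) -> bytes:
        # While the flag is set, always answer with an all-zero block.
        if self.not_written_since_last_reassignment[lba]:
            return bytes(self.block_size)
        return self.blocks[lba]


array = ThinArray(num_blocks=4, block_size=8)
array.write(1, b"A" * 8)
array.unmap(1)
print(array.read(1))  # zeros are returned although the old data was never wiped
```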

  123. Thanks for a good write up on the SCSI Unmap issues.

  124. Anonymous,

    I’m sorry to hear about your situation. When I was in the same boat 2 years ago I was in contact with an outplacement organisation who helped me get things back on track w.r.t. resumes etc. Did your former employer provide you with this as well?

    Unfortunately I hear more and more stories of people’s lives getting wrecked because of this misused visa.

    I hope you’ll find something soon.

    Regards,
    Erwin

  125. Anonymous

    With you on this. In my case, the 457 was the ONLY way I could get into the country. Employers were not willing to wait for the RSMS, and I had just barely missed the age cutoff for the other migration options. (In a cruel twist of fate, because I came to Oz when I did, I also miss the revised points test announced last year, due to not having the right mix of ranges of years of employment in Australia and my home country.)

    So now I am redundant in a backwater town in which there are no other companies that require someone with my skills. Here I am looking for work over the Christmas season, where the odds aren’t as good as they would be even a month later. At this point, I will work anywhere willing to take up my sponsorship for a 457, but my ultimate goal is permanent residency.

  126. As a second comment Hu Yoshida and Michael Heffernan have also posted some comments on their blogs on http://www.hds.com.

    Check them out.

  127. Hi Chris,

    Indeed, 42MB is not a bad number. Because of it we don’t have much CPU overhead, and we can use a very good sequential algorithm to get this sorted, whereas much smaller segment sizes could be scattered all over a disk pool, which causes much more randomness and hence a significant performance impact. (Yes, this 42MB didn’t drop out of the sky for nothing: it pays off in many more places w.r.t. external storage, HDP, HDT, etc. We always told you we have smart folk on board. :-))

    Whether the T10 UNMAP is secure or insecure, or whether it should have this option, is up for debate. Be aware that pools can be shared between hypervisors and non-hypervisors as well as RAW apps like Oracle. Also, how does a hypervisor determine whether a certain LBA needs to be securely erased or not? If the OS does some memory swapping, the hypervisor is really unable to determine if this is due to the memory swap process or if an application is deleting files. As I mentioned in one of my first blog posts, the lack of data intelligence mapped onto the infrastructure is massive, to such an extent that every piece of storage equipment out there has absolutely no clue of what’s going on at the application level (some exceptions do apply). This is where the real integration should be. This means that the apps should determine whether an UNMAP operation should wipe the data or not, rather than having the hypervisors interfere with these kinds of operations.

    Anyway, let’s see what the future brings.

    Thanks again for your reply.

    Cheers
    Erwin

  128. Erwin

    This raises a number of questions.

    First, 42MB might not be such a bad number after all, as the number of block releases and remappings would be significantly lower than with, say, 16KB. Next, you would expect that VMware would *not* be releasing blocks back to the free pool on a regular basis unless some kind of threshold was met. For example, as VMFS marks a block as free, if it kept (say) 10% of blocks in a local queue and reused them, then there would be less need to push the release off to the array.
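
    As a rough back-of-the-envelope check of that first point, the sketch below counts how many pages an array would have to track for one volume at each page size. The 1 TiB volume size is a made-up example, and the units are assumed to be binary (42 MiB and 16 KiB); nothing here reflects how a particular array accounts for pages internally.

```python
KIB = 1024
MIB = 1024 ** 2
TIB = 1024 ** 4


def pages_needed(volume_bytes: int, page_bytes: int) -> int:
    """Number of pages needed to cover a volume, rounding up."""
    return -(-volume_bytes // page_bytes)  # ceiling division


volume = 1 * TIB  # hypothetical volume size
large_pages = pages_needed(volume, 42 * MIB)
small_pages = pages_needed(volume, 16 * KIB)

print(f"42 MiB pages: {large_pages}")
print(f"16 KiB pages: {small_pages}")
print(f"ratio: {small_pages / large_pages:.0f}x more pages to release and remap")
```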

    I think it also means that the implementation of UNMAP on the underlying array is flawed. T10, VMware and the storage vendors should have implemented both a secure and an unsecure UNMAP, one which wipes data and one which doesn’t, rather than assuming everything is so critical that it needs to be wiped at release. Where arrays are only connected to VMware, an unsecure UNMAP might be acceptable as new blocks would be zeroed out on usage.

    Perhaps we’ll see some of these functions coming along as they mature.

    Cheers
    Chris

  129. SSD will indeed solve the performance factor but now they have to work on capacity as well as reliability.

    I modified the font size. Looks like Google adjusts it to the smallest possible to be able to include the tables as they should look.

  130. Very interesting and timely…. leads to the question of SSDs..

    Only complaint is that the font is coming out really small. Not sure if it’s the browser or WordPress? Or maybe your eyes are better than mine…. #;-)

  131. I agree, the only thing you need to do with FC is create a zone. Voila, plug and play. 🙂

    Of course you can add a gazillion functions and features but the bare essence is domain-id and zones and you’re good to go.

    FCoE is designed to run on a flat L2 Ethernet layer which by definition is not routable. If you want to introduce FC routing on FCoE you have to do very weird tricks and I wonder how long a troubleshooting case would take in such an environment. 🙂 Veeeeeerrrryyyy lllooooonnnnngggg…..

    Thanks for your support.

  132. Nice to see someone giving a contrary view to all the FCoE bleating, must do the same….

    What I find most annoying are complaints that Fibre Channel is hard to set up. It’s simply not true.

    Amusingly, the most complex thing I face in FC is setting up FC routing… and guess what? FCoE cannot do that. #;-)

  133. Have a listen to this podcast:
    http://packetpushers.net/show-38-comparing-switch-fabrics-juniper-brocade-cisco/

    Really interesting stuff, but worrying about the implications of lack of interop….

  134. Hi Edwin
    Good article and solid points. There was an article published in the Register which is a nice complement to what you’ve written entitled “Brocade says FCOE may not happen”. I think the title says it all (-;

    http://www.theregister.co.uk/2011/02/10/brocade_says_fcoe_may_not_happen/

  135. JL

    Hello Erwin,
    I agree. (short form).
    Inventions do not always bring a tool to establish the “real” TCO.
    What I call “panacea”.
    Because someone does it, someone else has to do it, a bit like some people wearing Christian Dior: the need was to get dressed, not to spend millions.
    CPU Storage Sharing suffered the same. Sharing protocol because storage was expensive was an attempt to avoid cost… but when adding expert cost to make it run and maintain it (which no one does) plus down time due to human errors because too many protocols are shared… then sharing does not look that good anymore.
    It is like pooling things.
    Sometimes being “dynamic” is ok to do. Sometimes benefits are greater with static, because of management cost.
    Great article Erwin.