Tag Archives: maintenance

Appalling state of Storage Networks

In the IT world a panic is most often related to an operating system's kernel running out of resources or reaching some unknown state from which it cannot recover, so it stops. Depending on the configuration it might dump its memory contents to a file or a remote system, which allows developers to examine this dump and look for the reason why it happened.

A fibre-channel switch running Linux as its OS is no different, but the consequences can be far more severe.


5-minute initial troubleshooting on Brocade equipment

Very often I get involved in cases where a massive amount of host logs, array dumps, and FC and IP traces have been collected, which can easily add up to many gigabytes of data. This is then accompanied by a very brief problem description such as “I have a problem with my host, can you check?”.
I'm sure the intention behind providing us with all the data is good, but the problem is the lack of detail around the problem itself. We do require a detailed explanation of what the problem is, when it occurred and whether it is still ongoing.

There are also things you can do yourself before opening a support ticket. On many occasions you'll find that the feedback you get from us within 10 minutes results in either the problem being fixed or a simple workaround that reduces the impact of your problem. Further troubleshooting can then be done in a somewhat less stressful time frame.

This example provides some bullet points on what you can do on a Brocade platform. (Mainly because many of the problems I see are related to fabric issues and my job is primarily focused on storage networking.)

First of all, take a look at the overall health of the switch:

switchstatusshow
Provides an overview of the general components of the switch. These should all show up as HEALTHY and not (as shown here) as “MARGINAL”.

Sydney_ILAB_DCX-4S_LS128:FID128:admin> switchstatusshow
Switch Health Report Report time: 06/20/2013 06:19:17 AM
Switch Name: Sydney_ILAB_DCX-4S_LS128
IP address: 10.XXX.XXX.XXX
SwitchState: MARGINAL
Duration: 214:29

Power supplies monitor MARGINAL
Temperatures monitor HEALTHY
Fans monitor HEALTHY
WWN servers monitor HEALTHY
CP monitor HEALTHY
Blades monitor HEALTHY
Core Blades monitor HEALTHY
Flash monitor HEALTHY
Marginal ports monitor HEALTHY
Faulty ports monitor HEALTHY
Missing SFPs monitor HEALTHY
Error ports monitor HEALTHY

All ports are healthy

switchshow
Provides a general overview of logical switch status (no physical components) plus a list of ports and their status.

  • The switchState should always be online.
  • The switchDomain should have a unique ID in the fabric.
  • If zoning is configured it should be in the “ON” state.

As for the connected ports, these should all show “Online” when operational. If you see ports showing “No_Sync” while the port is not disabled, there is likely a cable or SFP/HBA problem.

If you have configured Fabric Watch to enable port fencing, you'll see indications like the one shown below on port index 75.

Obviously for any port to work it should be enabled.

Sydney_ILAB_DCX-4S_LS128:FID128:admin> switchshow
switchName: Sydney_ILAB_DCX-4S_LS128
switchType: 77.3
switchState: Online
switchMode: Native
switchRole: Principal
switchDomain: 143
switchId: fffc8f
switchWwn: 10:00:00:05:1e:52:af:00
zoning: ON (Brocade)
switchBeacon: OFF
FC Router: OFF
Fabric Name: FID 128
Allow XISL Use: OFF
LS Attributes: [FID: 128, Base Switch: No, Default Switch: Yes, Address Mode 0]

Index Slot Port Address Media Speed State Proto
============================================================
0 1 0 8f0000 id 4G Online FC E-Port 10:00:00:05:1e:36:02:bc “BR48000_1_IP146” (downstream)(Trunk master)
1 1 1 8f0100 id N8 Online FC F-Port 50:06:0e:80:06:cf:28:59
2 1 2 8f0200 id N8 Online FC F-Port 50:06:0e:80:06:cf:28:79
3 1 3 8f0300 id N8 Online FC F-Port 50:06:0e:80:06:cf:28:39
4 1 4 8f0400 id 4G No_Sync FC Disabled (Persistent)
5 1 5 8f0500 id N2 Online FC F-Port 50:06:0e:80:14:39:3c:15
6 1 6 8f0600 id 4G No_Sync FC Disabled (Persistent)
7 1 7 8f0700 id 4G No_Sync FC Disabled (Persistent)
8 1 8 8f0800 id N8 Online FC F-Port 50:06:0e:80:13:27:36:30
75 2 11 8f4b00 id N8 No_Sync FC Disabled (FOP Port State Change threshold exceeded)
76 2 12 8f4c00 id N4 No_Light FC Disabled (Persistent)

sfpshow slot/port
One of the most important pieces of a link irrespective of mode and distance is the SFP. On newer hardware and software it provides a lot of info on the overall health of the link.

With older FOS codes there could be a discrepancy between what was displayed in this output and what was actually plugged into the port. The reason was that the SFPs only get polled every now and then for status and update information. If a port was persistently disabled it didn't get updated at all, so you could plug in another SFP and sfpshow would still display the old info. With FOS 7.0.1 and up this has been corrected, and you can now also see the latest polling time per SFP.

The question we often get is: “What should these values be?”. The answer is “It depends”. As you can imagine a shortwave 4G SFP requires less current than a longwave 100 km SFP, so in essence the SFP specs should be consulted. As a rule of thumb you can say that signal quality depends on the TX power value minus the link-loss budget. The result should be within the RX power specifications of the receiving SFP.

Also check the Current and Voltage of the SFP. If an SFP is broken, the indication is often that it draws no power at all and you'll see these two drop to zero.
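
To make the rule of thumb above a bit more concrete, here is a minimal sketch (not a Brocade tool, just an illustration; the loss figure and the receiver window are made-up example values, so always consult your own SFP specs) that checks whether the expected receive power stays within the RX specifications of the receiving SFP:

# Hypothetical link-budget check illustrating the rule of thumb above.
# All numbers are examples only; consult the specs of your own SFPs.

def rx_power_ok(tx_power_dbm, link_loss_db, rx_min_dbm, rx_max_dbm):
    """Return the expected RX power and whether it falls within spec."""
    expected_rx = tx_power_dbm - link_loss_db
    return expected_rx, rx_min_dbm <= expected_rx <= rx_max_dbm

# Example: TX power of -3.3 dBm (as in the sfpshow output below), an assumed
# 2.5 dB loss over cables, connectors and patch panels, and an assumed
# receiver window of -12 dBm to 0 dBm.
expected, ok = rx_power_ok(tx_power_dbm=-3.3, link_loss_db=2.5,
                           rx_min_dbm=-12.0, rx_max_dbm=0.0)
print("Expected RX power: %.1f dBm -> %s"
      % (expected, "within spec" if ok else "OUT OF SPEC"))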

Sydney_ILAB_DCX-4S_LS128:FID128:admin> sfpshow 1/1
Identifier: 3 SFP
Connector: 7 LC
Transceiver: 540c404000000000 2,4,8_Gbps M5,M6 sw Short_dist
Encoding: 1 8B10B
Baud Rate: 85 (units 100 megabaud)
Length 9u: 0 (units km)
Length 9u: 0 (units 100 meters)
Length 50u (OM2): 5 (units 10 meters)
Length 50u (OM3): 0 (units 10 meters)
Length 62.5u:2 (units 10 meters)
Length Cu: 0 (units 1 meter)
Vendor Name: BROCADE
Vendor OUI: 00:05:1e
Vendor PN: 57-1000012-01
Vendor Rev: A
Wavelength: 850 (units nm)
Options: 003a Loss_of_Sig,Tx_Fault,Tx_Disable
BR Max: 0
BR Min: 0
Serial No: UAF110480000NYP
Date Code: 101125
DD Type: 0x68
Enh Options: 0xfa
Status/Ctrl: 0x80
Alarm flags[0,1] = 0x5, 0x0
Warn Flags[0,1] = 0x5, 0x0
                                              Alarm                  Warn
                                          low        high        low        high
Temperature:  25      Centigrade         -10         90          -5          85
Current:      6.322   mAmps              1.000       17.000      2.000       14.000
Voltage:      3290.2  mVolts             2900.0      3700.0      3000.0      3600.0
RX Power:     -3.2    dBm (476.2 uW)     10.0 uW     1258.9 uW   15.8 uW     1000.0 uW
TX Power:     -3.3    dBm (472.9 uW)     125.9 uW    631.0 uW    158.5 uW    562.3 uW

State transitions: 1
Last poll time: 06-20-2013 EST Thu 06:48:28

porterrshow
For link state counters this is the most useful command on the switch. There is a perception that this command provides a silver bullet to solve port and link issues, but that is not the case. Basically it provides a snapshot of the content of the LESB (Link Error Status Block) of a port at that particular point in time. It does not tell us when these counters accumulated or over which time frame. So in order to create a sensible picture of the status of the ports we need a baseline. This baseline is created by resetting all counters and starting from zero. To do this, issue the “statsclear” command on the CLI.

There are 7 columns you should pay attention to from a physical perspective.

enc_in – Encoding errors inside frames. These are errors that happen at the FC-1 layer, where 8 bits are encoded to 10 bits and back or, with 10G and 16G FC, 64 bits to 66 bits and back. Since these happen on bits that are part of a data frame, they are counted in this column.

crc_err – An enc_in error might lead to a CRC error; however, this column shows frames that have been marked as invalid because of a CRC error earlier in the data path. According to the FC specifications it is up to the implementation whether to discard the frame right away or mark it as invalid and send it to the destination anyway; there are pros and cons to both scenarios. So basically, if you see crc_err increase in this column it means the port has received a frame with an incorrect CRC, but the corruption occurred further upstream.

crc_g_eof – This column is the same as crc_err; however, the incoming frames are NOT marked as invalid. If you see these, most often the enc_in counter increases as well, but not necessarily. If the enc_in and/or enc_out columns increase as well, there is a physical link issue which could be resolved by cleaning connectors, replacing a cable or (in rare cases) replacing the SFP and/or HBA. If the enc_in and enc_out columns do NOT increase, there is an issue between the SERDES chip and the SFP which causes the CRC to mismatch the frame. This is a firmware issue which could be resolved by upgrading to the latest FOS code; there are a couple of defects listed to track these.

enc_out – Similar to enc_in, this is the same kind of encoding error; however, it occurred outside normal frame boundaries, i.e. no host IO frame was impacted. This may seem harmless, but be aware that a lot of primitive signals and sequences travel in between normal data frames, and these are paramount for fibre-channel operations. Especially important are the primitives which regulate credit flow (R_RDY and VC_RDY) and clock synchronization. If this column increases on any port you'll likely run into performance problems sooner or later, or you will see problems with link stability and sync errors (see below).

Link_Fail – This means a port has received a NOS (Not Operational) primitive from the remote side and it needs to change the port operational state to LF1 (Link Fail 1) after which the recovery sequence needs to commence. (See the FC-FS standards specification for that)

Loss_Sync – Loss of synchronization. The transmitter and receiver side of the link maintain a clock synchronization based on primitive signals which start with a certain bit pattern (K28.5). If the receiver is not able to sync its baud-rate to the rate where it can distinguish between these primitives it will lose sync and hence it cannot determine when a data frame starts.

Loss_Sig – Loss of Signal. This column shows a drop of light, i.e. no light (or insufficient RX power) is observed for over 100 ms, after which the port will go into a non-active state. This counter often increases when the link-loss budget is overdrawn. If, for instance, the TX side sends out light at -4 dBm and the receiver's lower sensitivity threshold is -12 dBm, and the quality of the cable deteriorates the signal to a value below that threshold, you will see the port bounce very often and this counter increase. Another culprit is often unclean connectors, patch panels and badly made fibre splices. These ports should be shut down immediately and the cabling plant checked. Replacing cables and/or bypassing patch panels is often a quick way to find out where the problem is.

The other columns are more related to protocol issues and/or performance problems, which could be the result of a physical problem but not a cause. In short, look at the 7 columns mentioned above and check that no port is increasing any of these values.

============================================
too_short/too_long – indicates a protocol error where SOF or EOF are observed too soon or too late. These two columns rarely increase.

bad_eof – Bad End-of-Frame. This column indicates an issue where the sender has observed an abnormality in a frame or its transceiver while the frame header and portions of the payload were already sent to the destination. The only way for the transceiver to notify the destination is to invalidate the frame: it truncates the frame and adds an EOFni or EOFa to the end. This signals the destination that the frame is corrupt and should be discarded.

F_Rjt and F_Bsy are often seen in FICON environments where control frames could not be processed in time or are rejected based on fabric configuration or fabric status.

c3timeout (tx/rx) – These counters indicate that a port was not able to forward a frame in time to its destination. They either show a problem downstream of this port (tx) or a problem on this port where it has received a frame meant to be forwarded to another port inside the same switch (rx). Frames are ALWAYS discarded at the RX side (since that's where the buffers hold the frame). The tx column is an aggregate of all rx ports that need to send frames via this port according to the routing tables created by FSPF.

pcs_err – Physical Coding Sublayer errors. These values represent encoding errors on 16G platforms and above. Since 16G speeds have changed to 64b/66b encoding/decoding, there is a separate control structure that takes care of this.

As a best practice it is wise to keep a trace of these port errors and create a new baseline every week. This allows you to quickly identify errors and solve them before they can become a problem with an elongated resolution time. Make sure you do this fabric-wide to maintain consistency across all switches in that fabric.
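
A simple way to keep such a weekly baseline is to save the porterrshow output to a file each week and diff the snapshots. The sketch below is not a Brocade tool; it is just an illustration in Python, and the column order and the k/m suffix handling are assumptions based on the sample output further down, so verify them against your own FOS version:

# Minimal sketch: diff two saved "porterrshow" snapshots to see which
# counters moved since the last baseline. Illustration only: the column
# order and suffix handling are assumptions based on the sample output
# shown in this post; verify against your own FOS version.

SUFFIX = {"k": 1000, "m": 1000000, "g": 1000000000}
COLS = ["frames_tx", "frames_rx", "enc_in", "crc_err", "crc_g_eof",
        "too_shrt", "too_long", "bad_eof", "enc_out", "disc_c3",
        "link_fail", "loss_sync", "loss_sig", "frjt", "fbsy",
        "c3timeout_tx", "c3timeout_rx", "pcs_err"]

def to_number(value):
    """Convert '100.1m' / '466.6k' / '0' style counters to plain integers."""
    if value[-1].lower() in SUFFIX:
        return int(float(value[:-1]) * SUFFIX[value[-1].lower()])
    return int(float(value))

def parse(path):
    """Return {port_index: {counter_name: value}} from a saved snapshot."""
    ports = {}
    with open(path) as fh:
        for line in fh:
            fields = line.split()
            if not fields or not fields[0].rstrip(":").isdigit():
                continue  # skip header lines and the CLI prompt
            index = int(fields[0].rstrip(":"))
            ports[index] = dict(zip(COLS, (to_number(v) for v in fields[1:])))
    return ports

def report_deltas(baseline_file, current_file):
    """Print every error counter that increased since the baseline."""
    old, new = parse(baseline_file), parse(current_file)
    for index in sorted(new):
        for name, value in new[index].items():
            delta = value - old.get(index, {}).get(name, 0)
            if delta > 0 and not name.startswith("frames"):
                print("port %d: %s +%d" % (index, name, delta))

# Example (hypothetical file names):
# report_deltas("porterrshow_lastweek.txt", "porterrshow_today.txt")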

Sydney_ILAB_DCX-4S_LS128:FID128:admin> porterrshow
frames enc crc crc too too bad enc disc link loss loss frjt fbsy c3timeout pcs
tx rx in err g_eof shrt long eof out c3 fail sync sig tx rx err
0: 100.1m 53.4m 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1: 466.6k 154.5k 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2: 476.9k 973.7k 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3: 474.2k 155.0k 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Make sure that all of these physical issues are solved first. No software can compensate for hardware problems and all OEM support organizations will give you this task anyway before commencing on the issue.

In one of my previous articles I wrote about problems, the cause and the resolution of physical errors. You can find it over here

Regards,
Erwin

One rotten apple spoils the bunch – 3

In the previous two blog posts we looked at some of the reasons why a fibre-channel fabric might still have problems even with all redundancy options available and MPIO checking for link failures, etc.
The challenge is to identify any problematic port and act upon indications that certain problems might be apparent on a link.

So how do we do this in Brocade environments? Brocade has some features built into its FOS firmware which allow you to identify certain characteristics of your switches. One of them (Fabric Watch) I briefly touched upon previously. Two other features which utilize Fabric Watch are bottleneckmon and portfencing. Let's start with bottleneckmon.

Bottleneckmon was introduced in the FOS code stream to be able to identify 2 different kinds of bottlenecks: latency and congestion.

Latency is caused by a very high load on a device where the device cannot cope with the offered load, even though that load does not exceed the capabilities of the link. As an example, let's say that a link has a synchronized speed of 4G but the load on that link reaches no higher than 20 MB/s, and already the switch is unable to send more frames due to credit shortages. A situation like this will most certainly cause the sort of credit issues we've talked about before.

Congestion is when a link is overloaded with frames beyond the capabilities of the physical link. This often occurs on ISL and target ports when too many initiators are mapped on those links. This is often referred to as an oversubscribed fan-in ratio.

A congestion bottleneck is easily identified by looking at the offered load compared to the capability of the link. Extending the connection with additional links (ISLs, trunk ports, HBAs) and spreading the load over other links, or localizing/confining the load to the same switch or ASIC, will most often help. Latency, however, is a very different ballgame. You might argue that Brocade also has a port counter called tim_txcrd_zero, and that when this increments fairly often (i.e. the port regularly has zero transmit credits) you have a latency device, but that's not entirely true. It may also mean that the link is very well utilized and is simply using all its credits. You should then also see a fair link utilization w.r.t. throughput, but be aware this also depends on frame size.
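
To illustrate the difference, the sketch below turns two readings of that tim_txcrd_zero counter (as shown by portstatsshow, taken a known number of seconds apart) into a rough percentage of time the port spent at zero transmit credit. The 2.5 microsecond tick used here is the polling interval commonly documented for this counter, but treat it as an assumption and check the FOS documentation for your release; a busy but healthy port shows a modest percentage at high throughput, whereas a latency bottleneck sits at zero credit for a large share of the time while moving very little data:

# Rough estimate of how much of the time a port sat at zero TX credits,
# based on two readings of the tim_txcrd_zero counter taken some seconds
# apart. Assumption: the counter increments once per 2.5 microsecond
# polling tick while no transmit credit is available (verify against the
# FOS documentation for your release).

TICK_SECONDS = 2.5e-6

def credit_zero_percentage(counter_t0, counter_t1, elapsed_seconds):
    """Percentage of the elapsed interval spent at zero transmit credit."""
    ticks = counter_t1 - counter_t0
    return 100.0 * (ticks * TICK_SECONDS) / elapsed_seconds

# Hypothetical example: 4,000,000 extra ticks over a 60 second interval
# means the port had no transmit credit for roughly 17% of the time.
print("%.1f%% of the time at zero TX credit"
      % credit_zero_percentage(1200000, 5200000, 60.0))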

So how do we define a link as a high latency bottleneck? The bottleneckmon configuration utility provides a vast amount of parameters you can use; however, I would advise using the default settings as a start by just enabling bottleneck monitoring with the “bottleneckmon --enable” command. Also make sure you configure the alerting with the same command, otherwise the monitoring will be passive and you'll have to check each switch manually.

If a high latency device is caused by physical issues like encoding/decoding errors, you will get notified by the bottleneckmon feature; however, when this happens in the middle of the night you most likely will not be able to act upon the alert in a timely fashion. As I mentioned earlier, it is important to isolate such a badly behaving device as soon as possible to prevent it from having an adverse effect on the rest of the fabric. The portfencing utility will help with that. You can configure certain thresholds on port types and errors, and if such a threshold is reached the firmware will disable the port and alert you of it.

I know many administrators are very reluctant to have a switch take these kinds of actions on its own, and for a long time I agreed with that; however, having seen the massive devastation and havoc a single device can cause, I would STRONGLY advise turning this feature on. It will save you long hours of troubleshooting and elongated conference calls whilst your storage network is causing your applications to come to a halt. I've seen it many times, and even after pointing to a problem port, very often the decision to disable such a port is subject to change management politics. I would strongly suggest that if you have such guidelines in your policies, NOW is the time to revise those policies and enable the intelligence of the switches to prevent these problems from occurring.

For a comprehensive overview, options and configuration examples, I suggest you first take a look at the FOS admin guide of the latest FOS release. Brocade has also published some white papers with more background information.

Regards,
Erwin

 

Brocade vs Cisco. The dance around DataCentre networking

When looking at the network market there is one clear leader and that is Cisco. Their products are ubiquitous from home computing to the enterprise. Of course there are others like Juniper, Nortel and Ericsson, but these companies only scratch the surface of what Cisco can provide. These companies rely on very specific differentiators and, given the fact they are still around, do a pretty good job at it.

A few years ago there was another network provider called Foundry, and they had some really impressive products; that's mainly why these are mostly found in the core of data centres which push a tremendous amount of data. The likes of ISPs or Internet Exchanges are a good fit. It is for this reason that Brocade acquired Foundry in July 2008. A second reason was that Cisco had entered the storage market with the MDS platform, which left Brocade with no counterweight in the networking space to provide customers with an alternative.

When you look at the storage market it is the other way around. Brocade has been in the Fibre Channel space since day one. They led the way with their 1600 switches and have outperformed and out-smarted every other FC equipment provider on the planet. Many companies that have been in the FC space have either gone broke or have been swallowed by others. Names like Gadzoox, McData, CNT, Creekpath, Inrange and others have all vanished, and their technologies either no longer exist or have been absorbed into the products of the vendors who acquired them.

With two distinctly different technologies (networking and storage), both Cisco and Brocade have attained a huge market share in their respective specialities. Since storage and networking are two very different beasts, this has served many companies very well and no collision between the two technologies happened. (That is, until FCoE came around; you can read my other blog posts for my opinion on FCoE.)

When Cisco, being bold, brave and sitting on a huge pile of cash, decided to also enter the storage market, Brocade felt its market share declining. It had to do something, and thus Foundry was on the target list.

After the acquisition Brocade embarked on a path to align the product lines with each other, and they succeeded with their own proprietary technology called VCS (I suggest you search for this on the web; many articles have been written). Basically what they've done with VCS is create an underlying technology which allows a flat layer 2 Ethernet network to operate on a flat fabric-based one, something they have had experience with since the beginning of time (storage networking, that is, for them).

Cisco wanted to have something different and came up with a technology-merging enabler called FCoE. Cisco uses this extensively across their product set and it is the primary internal communications protocol in their UCS platform. Although I don't have any indicators yet, it might well be that, because FCoE will be ubiquitous in all of Cisco's products, the MDS platform will be abolished pretty soon from a sales perspective and the Nexus platforms will provide the overall merged storage and networking solution for Cisco data centre products, which in the end makes good sense.

So what is my view on the Brocade vs. Cisco discussion? Well, basically, I like them both. As they have different viewpoints on storage and networking, there is not really a good vs. bad. I see Brocade as the cowboy company providing bleeding edge, up-to-the-latest-standards technologies like Ethernet fabrics and 16G fibre channel, whereas Cisco is a bit more conservative, which improves stability and maturity. What the pros and cons for customers are I cannot determine, since the requirements are mostly different.

From a support perspective on the technology side I think Cisco has a slight edge over Brocade, since many of the hardware and software problems have been resolved over a longer period of time and, by nature, Brocade's bleeding edge, “first-to-market” strategy may sometimes run into a bumpy ride. That being said, since Cisco is a very structured company they sometimes lack a bit of flexibility, and Brocade has an edge on that point.

If you ask me directly which vendor to choose when deciding on a product set or vendor for a new data centre, I have no preference. From a technology standpoint I would still separate fibre channel from Ethernet and wait until both FCoE and Ethernet fabrics have matured and are well past their “hype cycle”. We're talking data centres here and it is your data, not Cisco's and not Brocade's. Both FC and Ethernet are very mature and have a very long track record of operations, flexibility and stability. The excellent knowledge that is available on each of these specific technologies gives me more peace of mind than the outlook of having to deal with problems bringing the entire data centre to a standstill.

Erwin

Maintenance

Why do I keep wondering why companies don't maintain their infrastructure? It seems to be more the exception than the rule to come across software and firmware that is newer than 6 months old. True, I admit, it's a beefy task to keep this all lined up, but then again you know this in advance, so why isn't there a plan to get this sorted every 6 months or even more frequently?


In my day-to-day job I see a lot of information from many customers around the world. Sometimes during implementation phases there is a lot of focus on interoperability and software certification stacks. Does server A with HBA B work with switch Y on storage platform Z? This is very often a rubber-stamping exercise which goes out the door with the project team. The moment a project has been done and dusted, this very important piece is very often neglected for many reasons, most of them time constraints, risk aversion, change freezes, fear that something goes wrong, etc. However, the risk businesses take by not properly maintaining their software is like walking a tightrope over the Grand Canyon with wind gusts over 10 Beaufort. You might get a long way, but sooner or later you will fall.

Vendors do actually fix things, although some people might think otherwise. Remember that a storage array contains around 800,000 pieces of hardware and a couple of million lines of software which make this thing run and store data for you. If you compared that to a car and ran the same maintenance schedule, you would be requiring the car to run for 120 years non-stop without changing oil, filters, tyres, exhaust, etc. So would you trust your children in such a car after 2 years, or even after 6 months? I wouldn't, but businesses still seem to take the chance.

So is it fair for vendors to ask (or demand) that you maintain your storage environment? I think it is. Back in the days when I had my feet wet in the data centres (figuratively speaking, that is) I once managed a storage environment of around 250 servers, 18 FC switches and 12 arrays, a pretty beefy environment in those days. I had set myself a threshold that firmware and drivers should not exceed a 4-month lifetime. That meant that if newer code came out from a particular vendor, it would always be implemented before those 4 months were over.
I spent around two days every quarter generating the paperwork with change requests, vendor engineer appointments, etc., and two more days getting it all implemented. Voilà, done. The more experienced you become in this process, the better and faster it will be done.

Another problem is with storage service providers, or service providers in general. They also depend on their customers to get all the approvals stamped, signed and delivered, which is very often seen as a huge burden, so they just let it go and eat the risk of something going wrong. The problem is that during RFP/RFI processes the customers do not ask for this, the sales people are not interested since it might become a show-stopper for them, and as such nothing around ongoing infrastructure maintenance is documented or contractually written into delivery or sales documents.

As a storage or service provider I would turn this “obligation” to my advantage and say: “Dear customer, this is how we maintain our infrastructure environment, so that we, and you, are assured of immediate and accurate vendor support and have an up-to-date infrastructure that minimizes the risk of anything going downhill at some point in time.”

I've seen it happen where high severity issues with massive outages were logged with vendors, and those vendors came back and said “Oh yes Sir, we know of this problem and we fixed it a year ago, but you haven't implemented that level of firmware; yours is far outdated”.

How would that look if you're a storage/service provider to your customers, or a bank whose online banking system got some coverage in the newspapers?

Anyway, the moral is “Keep it working by maintaining your environment”.

BTW, at SNWUSA Spring 2011 I wrote a SNIA tutorial “Get Ready for Support” which outlines some of the steps you need to take before contacting your vendors’ support organisation. Just log on to www.snia.org/tutorials. It will be posted there after the event.

Cheers,
Erwin van Londen