Tag Archives: storage

Troubleshooting Linux Storage – My first book

When I asked a buddy of mine, who is a fairly prolific author, what his advice was when I told him I wanted to write a book, he said “Don’t”. During the process I sometimes wished I’d taken that advice, but as it progressed and neared completion I did feel a real sense of satisfaction. Now, I’m not claiming to be an author, and my style of writing is much the same as my blog articles: down to earth, without any fuss, simply trying to be as clear and concise as possible.

It took a bit longer than initially anticipated, as my previous employer had some issues with me writing such a book, and I also got diagnosed with a nasty disease I had to conquer, but here we are. I could finally press the “Publish” button.

So what is it about? As the title says: troubleshooting Linux storage.

In my career as a support engineer, I’ve seen many issues pop up in a variety of circumstances at customers’ sites, ranging from very small shops to very large multinationals. A common factor has been that on many occasions there was confusion about what was actually happening and where the problems originated. Now, I’m far from claiming I’m an expert at every layer of the Linux IO stack, but as I’ve been doing both storage and Linux for a fairly long time, I have a pretty good understanding of where to look when things go wrong, how to identify problems and how to resolve them.

In the book, I’ve tried to capture a lot of what I know, and I hope it will help system administrators diagnose problems, resolve them and, based on those experiences, prevent them from happening again.

Is it a complete bible of everything that can go wrong? I think there would not be enough trees in the world to provide the paper to print it on, nor would you be able to physically lift the book. Even a Kindle version would seriously stretch the storage capacity of the device. As always, you have to make decisions about what is useful to write and when to refer to other sources. Most of the things in the book are of a practical nature, centred on the art of troubleshooting. It does contain a fair number of links to other sources where needed.

As this is my first attempt at ever doing such a thing, I did not really want to go via one of the large publishing houses like O’Reilly or No Starch Press. Maybe that will change in the future. It also means that from a publishing perspective this has been a one-man job, and you could encounter some irregularities that I may not have caught. When I do, these will be corrected ASAP.

The book can be purchased via Amazon.


It is also now available in digital format via Leanpub

https://leanpub.com/troubleshootinglinuxstorage

I welcome any feedback, good or bad, and appreciate suggestions, so I can improve the book in future versions and help more Linux system administrators.

Kind regards

Erwin

The technical pathways of Brocade in cloud storage adoption

Brocade isn’t always very forthcoming about what they are working on. Obviously a fair chunk of development and engineering effort is spent on cloud integration and on enabling their software and hardware stack in this computing methodology. Acquisitions like Foundry, Vyatta and now Connectem show that Brocade has broadened its horizons. To keep up with the ever increasing demand for network features and functions it makes sense to review their current product lines, and when you read between the lines you may be able to spot some interesting things.

Cloud Storage


Fillwords IDLE vs ARBff (one last time)

I’ve written about fillwords a lot (see here, here, and here) but I didn’t show you much about the different symptoms an incorrect fillword setting may incur.

As you’ve seen, fillwords are a very nifty way of maintaining bit and word sync on a serial transmission link when no actual frames are being sent. Furthermore, they can be replaced with other primitive signals (like R_RDY, VC_RDY, etc.) to provide a very simple signalling method between two ports without interfering with actual frames. That means that fillwords are ALWAYS squeezed in between frames.



Why Fibre-Channel has to improve

Many of you have used and managed fibre-channel based storage networks over the years. It comes as no surprise that a network protocol primarily developed to handle extremely time-sensitive operations is built with extreme demands regarding hardware and software quality and with clear guidelines on how these communications should proceed. It is because of this that fibre-channel has become the dominant storage protocol in datacenters.

Closing the Fibre-Channel resiliency gap

Fibre-Channel is still the predominant transport protocol for storage related data transmission, and rightfully so. Over the past two decades or so it has proven to be very efficient and extremely reliable in moving channel based data transmissions between initiators and targets. That reliability is due to the fact that the underlying infrastructure is almost bulletproof: Fibre-Channel requires very high quality hardware as per the FC-PI standard, and a BER (Bit Error Rate) worse than 1 in 10^12 is not tolerated. What Fibre-Channel lacks, though, is a method of detection and notification to and from hosts when a path to a device falls below the required tolerance levels, which can cause frames to be dropped without the host being able to adjust its behaviour on that path. Fibre-Channel relies on upper level protocols (like SCSI) to re-submit the command, and that’s about it.

When FC was introduced to the market back in the late 90’s, many vendors already had multipath software which could correlate multiple paths to the same LUN into one, and in case one path failed it could switch over to another. Numerous iterations further down the road, nothing really exciting has been developed in that area. As per the nature of the class-of-service (3) chosen for the majority of today’s FC implementations, there is no error recovery done in an FC environment. As per my previous post you’ve seen that MPIO software is also NOT designed to act on and recover failed IOs. Only in certain circumstances will it fail a path in such a way that all new IOs are directed to one or more of the remaining paths to that LUN. The crux of the problem is that if any part of the infrastructure is less than what is required from a quality perspective, and there is nothing on the host level that actively reacts to these kinds of errors, you will end up with a very poorly performing storage infrastructure. Not only on the host that is active on that path; a fair chance exists that other hosts will have the same or similar problems. (See the “Rotten Apples” series in previous posts.)

So what is my proposal? Hosts should become more aware of the end-to-end error ratio of paths from their initiators to the targets and back. This results in a view in which hosts (or applications) can make a judgement call about which path to send an IO down so that the chance of an error is smallest. How would this work? It is all about creating an inventory of least-error paths, and for that we need a way of getting this info to the respective hosts. Basically there are two ways of doing this. Either we create a central database which receives updates from each individual switch in the fabric, which the host then needs to query and correlate with the FSPF (Fabric Shortest Path First) info in order to sort things out, or, and this would be my preferred way, we introduce a new ELS frame which can hold all counters currently specified in the LESB (Link Error Status Block) plus some more for future use or vendor specific info. I call this the Error Reporting with Integrated Notification frame. This frame is sent by the initiator to the target over all paths it has at its disposal. Each ingress port (RX port on each switch) that this frame traverses increments the counters with its own values. When the target receives the frame it flips the SID and DID and sends it back to the host. Given the fact that this frame is still part of the same FC exchange, it will also traverse the same path back, so an accurate total error count for that path can be created.
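To make the accumulation idea a bit more concrete, here is a tiny Python sketch of the logic. The frame layout, field names and addresses are purely illustrative and not part of any actual proposal or standard:

from dataclasses import dataclass, field

# The LESB-style counters each ingress port would add its own values to.
LESB_COUNTERS = ("link_failure", "loss_of_sync", "loss_of_signal",
                 "prim_seq_error", "invalid_tx_word", "invalid_crc")

@dataclass
class PathErrorEls:
    sid: str                      # source FC address (initiator)
    did: str                      # destination FC address (target)
    counters: dict = field(default_factory=lambda: {c: 0 for c in LESB_COUNTERS})

def ingress_port_update(frame, port_lesb):
    """Each RX port the frame traverses adds its own LESB values to the frame."""
    for name, value in port_lesb.items():
        frame.counters[name] = frame.counters.get(name, 0) + value

def target_reflect(frame):
    """The target swaps SID/DID and returns the frame on the same exchange/path."""
    return PathErrorEls(sid=frame.did, did=frame.sid, counters=dict(frame.counters))

# Example: one path traversing two switch ingress ports plus the target port.
frame = PathErrorEls(sid="0x010200", did="0x020300")
for hop in ({"invalid_crc": 0},
            {"invalid_crc": 12, "invalid_tx_word": 3},
            {"invalid_crc": 1}):
    ingress_port_update(frame, hop)
reply = target_reflect(frame)
print(reply.counters)   # the initiator now has a total error count for this path

The MPIO layer on the host could then simply rank its paths by these totals and prefer the least-error path.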

Both options would enable each host to analyse the overall error count on each path from HBA to target for each LUN it has. The problem with option 1 is that the size of the database grows exponentially with the number of ports and paths, and it may become so large that it cannot live inside the fabric anymore and thus needs to be maintained in an external management tool. That has the disadvantage that hosts then depend on out-of-band network restrictions and security implications, in addition to interoperability issues. It also has the problem that path errors can be bursty depending on load and/or fabric behaviour. The management application would need to poll each individual port on these switches, which causes an additional load on the processors of each switch even when it is not necessary. Furthermore, it is highly unreliable when the fabric sees a fair amount of changes, which by default cause re-routing to occur and thus render a calculation done by the host a minute ago totally useless.

Option two has the advantage that there is one uniform methodology which is distributed across each initiator, target and path. It therefore has no impact on any switch and/or external management application and does not rely on network (TCP/IP) security restrictions, Ethernet boundaries caused by VLANs, or any other external factors that could influence the operation.

The challenge, however, is that today there is no ASIC that supports this logic, and even if I could get the proposal accepted by T11 it’ll take a while before this is enabled in hardware. In the meantime the ELS frame could be sent to the processor of the switch, which in turn does the error count modification in the frame payload, the CRC recalculation and the other things required. Once more, the bottleneck of this method will be the capability of the CPU of that particular switch, especially when you have many high port-count blades installed. Until the ASICs are able to do this on the fly in hardware there will be less granularity from a timing perspective, since each ELS frame will need to be sent to the CPU.

To prevent the CPU from being flogged by all these update and pull requests in the transitional period, there is an option to either extend the PLOGI to check whether all ports in the path are able to support this frame in hardware, or use this new ELS with a special “inventory bit” to determine the capabilities. If any port in the path does not support the ELS frame it will flip that bit to 0. This allows the timing interval of each ELS frame to be in line with the capabilities of the end-to-end path. You can imagine that if all ports are able to do this in hardware you can achieve a much finer granularity on timing and hosts can respond much quicker to errors. If any port does not support the new ELS frame, the timing can be adjusted to fall between the E_D_TOV and R_A_TOV values (in general 2 and 10 seconds). The CPUs on the switches are quite capable of handling that. This is still much better than any external method of collecting fibre-channel port errors with an out-of-band query and policy mechanism. Another benefit is that there is a standard method of collecting and distributing the end-to-end path errors, so even in multi-vendor environments it is not tied to a single management platform.

 

So let’s look at an example.

This shows a very simplistic SAN infrastructure where one host has 4 paths over two initiators to 2 storage ports each. All ports seem to be in tip-top shape.

If one link (in this case between the switch and the storage controller) is observing errors (in any direction), the new ELS frame would increment that particular LESB counter value, and the result would enable the host to detect the increase in any of the counters on that particular path. Depending on the policies of the operating system or application, it could direct the MPIO software to mark that path as failed and remove it from the IO path list.

Similarly, if a link between an initiator and a switch shows errors, two paths will be marked as bad and the host has the option to fail both.

If you have a meshed fabric, the number of paths grows rapidly with each switch and ISL you add. The figure below shows the same structure, but because an ISL has been added between the switches the number of potential paths between the host and the LUN now grows to 8.

This means that if one of the links is bad, the number of potential bad paths also doubles.

In this case the paths from both initiators over this bad link are marked faulty and can be removed from the target list by the MPIO software.

The fun really starts with the unexpected. Let’s say you add an additional ISL and for some reason this one is bad. The additional ISL does not add new paths to the path list on the host, since it is transparent and hidden by the fabric. Frames will just traverse one or the other irrespective of which path is chosen by the host software. Obviously, since the ELS is just a normal frame, the error counters in the ELS might be skewed depending on which of the two ISLs it has been sent over. Depending on the architecture of the switch you have two options: either the ASIC accumulates all counters for both ports into one and adds these onto the already existing counters, or you use a divisional factor where the ASIC sums up all counters of the ISLs and divides them by the number of ISLs. The same can be done for trunks (Brocade) / port-channels (Cisco). Given the fact that most of the counters are currently 32-bit transmission word counters, the first option is likely to cause the counters to wrap very quickly. The second advantage of a divisional factor is that there will be consistent averaging across all paths in case you have a larger meshed fabric, and thus it will provide more accurate feedback to the host.
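To illustrate the difference between simply accumulating and using a divisional factor, here is a small Python sketch (again purely illustrative; no ASIC works like this):

def accumulate(member_counters):
    """Option 1: simply add all ISL member counters together."""
    total = {}
    for counters in member_counters:
        for name, value in counters.items():
            total[name] = total.get(name, 0) + value
    return total

def divisional_factor(member_counters):
    """Option 2: sum the member counters and divide by the number of ISLs."""
    total = accumulate(member_counters)
    n = len(member_counters)
    return {name: value // n for name, value in total.items()}

isl_members = [{"invalid_crc": 400, "invalid_tx_word": 20},   # the "bad" ISL
               {"invalid_crc": 2,   "invalid_tx_word": 0}]    # the healthy ISL
print(accumulate(isl_members))         # raw sum; 32-bit counters wrap quickly
print(divisional_factor(isl_members))  # averaged view, consistent across all paths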

I’m working out the details w.r.t. the internals of the ELS frame itself and which bits to use in which position.

This all should make Fibre-Channel in combination with intelligent host-based software an even more robust protocol for storage based data-transmissions.

Let me know what you think. Any comments, suggestions and remarks are highly appreciated.

Cheers

Erwin

Brocade FOS 7.1 and the cool features

After a very busy couple of weeks I’ve spent some time dissecting the release notes of Brocade FOS 7.1, and I must say there are some really nice features in there but also some that I REALLY think should be removed right away.

It may come as no surprise that I always look very critically at whatever comes to the table from Brocade, Cisco and others w.r.t. storage networking. Especially the troubleshooting side, and therefore the RAS capabilities of the hardware and software, has a special place in my heart, so if somebody screws up I’ll let them know via this platform. 🙂
So first of all some generics. FOS 7 is supported on the 8G and 16G platforms, which cover the Goldeneye2, Condor2 and Condor3 ASICs, plus the AP blades for encryption, SAN extension and FCoE (cough, cough)… Be aware that it doesn’t support the blades based on the older architecture such as the FR4-18i and FC10-6 (which I think was never bought by anyone). Most importantly, this is the first version to support the new 6520 switch, so if you ever think of buying one it will come shipped with this version installed.
Software
As for the software, Brocade really cranked up the RAS features. I especially like the broadening of the scope of D-ports (diagnostic ports) to include ICL ports as well as links between Brocade HBAs and switch ports. One thing they should be paying attention to, though, is that they should sell a lot more of these. :-) The characteristics of the test patterns, such as test duration, frame sizes and number of frames, can now also be specified. FEC (Forward Error Correction) has been extended to access gateways and long distance ports, which should increase stability w.r.t. frame flow. (It still doesn’t improve on signal levels, but that is a hardware problem which cannot be fixed by software.)
There are some security enhancements for authentication such as extended LDAP and TACACS+ support.
The 7800 can now be used with VF, albeit without XISL functionality.
Finally, the E_D_TOV FC timer value is propagated onto the FCIP complex. What this basically means is that previously, even though an FC frame had long timed out according to the FC specs (in general 2 seconds), it could still exist on the IP network inside an FCIP packet. The remote FC side would discard that frame anyway, thus wasting valuable resources. With FOS 7.1 the FCIP complex on the sending side will discard the frame once E_D_TOV has expired.
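Conceptually this is nothing more than time-stamping each encapsulated frame and discarding it once the timer expires. A tiny Python sketch of that idea (the queue model and values are just assumptions for illustration, not how the FCIP complex is actually built):

import time
from collections import deque

ED_TOV = 2.0   # seconds; the usual default Error_Detect_TimeOut value

class FcipSendQueue:
    """Toy model of an FCIP send queue that discards frames older than E_D_TOV."""
    def __init__(self):
        self.queue = deque()          # (timestamp, frame) tuples

    def enqueue(self, frame):
        self.queue.append((time.monotonic(), frame))

    def dequeue_for_transmit(self):
        """Return the next frame still worth sending, dropping expired ones."""
        while self.queue:
            stamped, frame = self.queue.popleft()
            if time.monotonic() - stamped <= ED_TOV:
                return frame
            # Frame has already timed out on the FC side; the remote end would
            # discard it anyway, so don't waste WAN bandwidth on it.
        return None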
One of the most underutilised features (besides Fabric Watch) is FDMI (Fabric Device Management Interface). This is a separate FC service (part of the new FC-GS-6 standard) which can hold a huge treasure box of info w.r.t. connected devices. As an example:
FDMI entry:
————————————————-
        switch:admin> fdmishow
        Local HBA database contains:
          10:00:8c:7c:ff:01:eb:00
          Ports: 1
            10:00:8c:7c:ff:01:eb:00
              Port attributes:
                FC4 Types: 0x0000010000000000000000000000000000000000000000000000000000000000
                Supported Speed: 0x0000003a
                Port Speed: 0x00000020
                Frame Size: 0x00000840
                Device Name: bfa
                Host Name: X3650050014
                Node Name: 20:00:8c:7c:ff:01:eb:00
                Port Name: 10:00:8c:7c:ff:01:eb:00
                Port Type: 0x0
                Port Symb Name: port2
                Class of Service: 0x08000000
                Fabric Name: 10:00:00:05:1e:e5:e8:00
                FC4 Active Type: 0x0000010000000000000000000000000000000000000000000000000000000000
                Port State: 0x00000005
                Discovered Ports: 0x00000002
                Port Identifier: 0x00030200
          HBA attributes:
            Node Name: 20:00:8c:7c:ff:01:eb:00
            Manufacturer: Brocade
            Serial Number: BUK0406G041
            Model: Brocade-1860-2p
            Model Description: Brocade-1860-2p
            Hardware Version: Rev-A
            Driver Version: 3.2.0.0705
            Option ROM Version: 3.2.0.0_alpha_bld02_20120831_0705
            Firmware Version: 3.2.0.0_alpha_bld02_20120831_0705
            OS Name and Version: Windows Server 2008 R2 Standard | N/A
            Max CT Payload Length: 0x00000840
            Symbolic Name: Brocade-1860-2p | 3.2.0.0705 | X3650050014 |
            Number of Ports: 2
            Fabric Name: 10:00:00:05:1e:e5:e8:00
            Bios Version: 3.2.0.0_alpha_bld02_20120831_0705
            Bios State: TRUE
            Vendor Identifier: BROCADE
            Vendor Info: 0x31000000
———————————————-
and as you can see this shows a lot more than the fairly basic nameserver entries:
——————————————-
N    8f9200;      3;21:00:00:1b:32:1f:c8:3d;20:00:00:1b:32:1f:c8:3d; na
    FC4s: FCP 
    NodeSymb: [41] “QLA2462 FW:v4.04.09 DVR:v8.02.01-k1-vmw38”
    Fabric Port Name: 20:92:00:05:1e:52:af:00 
    Permanent Port Name: 21:00:00:1b:32:1f:c8:3d
    Port Index: 146
    Share Area: No
    Device Shared in Other AD: No
    Redirect: No 
    Partial: No
    LSAN: No
——————————————
Obviously the end-device needs to support this and it has to be enabled. (PLEASE DO !!!!!!!!) It’s invaluable for troubleshooters like me….
One thing that has bitten me a few times was the SFP problem. There has long been an issue where, when a port was disabled and a new SFP was plugged in, the switch didn’t detect it until the port was enabled and it had polled for up-to-date information. In the meantime you could get old/cached info from the old SFP, including temperatures, dB values, current, voltage etc. This seems to be fixed now, so that’s one less thing to take into account.
Some CLI improvements have been made on various commands, with new parameters that let you filter and select certain errors etc.
The biggest idiocy introduced with this version is allowing the administrator to change the severity level of event codes. This means that if you have a filter in BNA (or whatever management software you use) to exclude INFO level messages, but certain ERROR or CRITICAL messages start to annoy you, you can change their severity to INFO so they no longer show up. That doesn’t make the problem any less critical, so instead of just fixing the issue we now just pretend it’s not there. From a troubleshooting perspective this is disastrous, since we look at a fair chunk of sup-saves each day, and if we can’t rely on consistency in a log file it’s useless to look in the first place. Another one of those is the difference in deskew values on trunks when FEC is enabled. Due to a coding problem these values can differ by up to 40, which would normally indicate a massive difference in cable length. Only by executing a D-port analysis can you determine whether that is really the case. My take is that they should fix the coding problem ASAP.
A similar thing that has pissed me off was the change in the sfpshow output. Since the invention of the wheel this has been the worst output in the Brocade logs, so many people have scripted their ass off to make it more readable.
Normally it looks like this:
=============
Slot  1/Port  0:
=============
Identifier:  3    SFP
Connector:   7    LC
Transceiver: 540c404000000000 2,4,8_Gbps M5,M6 sw Short_dist
Encoding:    1    8B10B
Baud Rate:   85   (units 100 megabaud)
Length 9u:   0    (units km)
Length 9u:   0    (units 100 meters)
Length 50u:  5    (units 10 meters)
Length 62.5u:2    (units 10 meters)
Length Cu:   0    (units 1 meter)
Vendor Name: BROCADE         
Vendor OUI:  00:05:1e
Vendor PN:   57-1000012-01   
Vendor Rev:  A   
Wavelength:  850  (units nm)
Options:     003a Loss_of_Sig,Tx_Fault,Tx_Disable
BR Max:      0   
BR Min:      0   
Serial No:   UAF11051000039A 
Date Code:   101212  
DD Type:     0x68
Enh Options: 0xfa
Status/Ctrl: 0xb0
Alarm flags[0,1] = 0x0, 0x0
Warn Flags[0,1] = 0x0, 0x0
                                          Alarm                  Warn
                                   low        high       low         high
Temperature: 31      Centigrade    -10         90         -5          85
Current:     6.616   mAmps          1.000      17.000     2.000       14.000 
Voltage:     3273.4  mVolts         2900.0      3700.0    3000.0       3600.0 
RX Power:    -2.8    dBm (530.6uW) 10.0   uW 1258.9 uW   15.8   uW  1000.0 uW
TX Power:    -3.3    dBm (465.9 uW)125.9  uW   631.0  uW  158.5  uW   562.3  uW
and that is for every port which basically makes you nuts.
So with some bash, awk and sed magic I scripted the output to look like this:
Port  Speed  Long   Short  Vendor     Serial            Wave    Temp  Current  Voltage  RX-Pwr  TX-Pwr
             wave   wave              number            length
1/0   8G     NA     50 m   BROCADE    UAF11051000039A   850     31    6.616    3273.4   -2.8    -3.3
1/1   8G     NA     50 m   BROCADE    UAF110510000387   850     32    7.760    3268.8   -3.6    -3.3
1/2   8G     NA     50 m   BROCADE    UAF1105100003A3   850     30    7.450    3270.7   -3.3    -3.3
etc....
From a troubleshooting perspective this is so much easier since you can spot issues right away.
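My script is plain bash/awk/sed, but purely as an illustration of the kind of scraping involved, here is a minimal Python sketch that condenses pre-7.1 sfpshow blocks into one line per port. It only knows about the fields shown in the output above; anything beyond that is an assumption:

import re, sys

# Condense pre-FOS-7.1 sfpshow output into one line per port.
port_re   = re.compile(r"^Slot\s+(\d+)/Port\s+(\d+):")
serial_re = re.compile(r"^Serial No:\s+(\S+)")
wave_re   = re.compile(r"^Wavelength:\s+(\d+)")
temp_re   = re.compile(r"^Temperature:\s+(-?\d+)")
rx_re     = re.compile(r"^RX Power:\s+(-?\d+\.\d+)")
tx_re     = re.compile(r"^TX Power:\s+(-?\d+\.\d+)")

def condense(lines):
    port, row = None, {}
    for line in lines:
        line = line.strip()
        if m := port_re.match(line):
            if port:
                yield port, row          # emit the previous port before starting a new one
            port, row = f"{m.group(1)}/{m.group(2)}", {}
        for key, rx in (("serial", serial_re), ("wave", wave_re),
                        ("temp", temp_re), ("rx", rx_re), ("tx", tx_re)):
            if m := rx.match(line):
                row[key] = m.group(1)
    if port:
        yield port, row

for port, row in condense(sys.stdin):
    print(f"{port:6} {row.get('serial','-'):18} {row.get('wave','-'):5} "
          f"{row.get('temp','-'):4} {row.get('rx','-'):7} {row.get('tx','-'):7}")

Feed it the raw sfpshow section of a supportsave on stdin and you get one line per port.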
Now with FOS 7.1.x the FOS engineers screwed up the sfpshow output, which inherently broke my script and necessitates a load more work/code/lines to get it back into shape. The same thing goes for the output showing the number of credits on virtual channels.
Pre-FOS 7.1 it looks like this:
C:—— blade port 64: E_port ——————————————
C:0xca682400: bbc_trc                 0004 0000 002a 0000 0000 0000 0001 0001 
With FOS 7.1 it looks like this:
bbc registers
=============
0xd0982800: bbc_trc                 20   0    0    0    0    0    0    0    
(Yes, hair pulling stuff, aaarrrcchhhh)
Some more good things. The fabriclog now contains the direction of link resets. Previously we could only see an LR had occurred but we didn’t see who initiated it. Now we can and have the option to figure out in which direction credit issues might have been happening. (phew..)
The CLI history is now also saved across reboots and firmware upgrades. It’s always been a PITA to figure out who had done what at a certain point in time; this should help find out.
One other very useful thing that has been added, and a major plus in this release, is the addition of the remote WWNN of a switch in the switchshow and islshow output, even when the ISL has segmented for whatever reason. This is massively helpful because previously you didn’t have a clue what was connected, so you needed to go through quite some hassle and check cabling, or start digging through the portlogdump with some debug flags enabled. Always a troublesome exercise.
The bonus points for this release go to the addition of the fabretrystats command. This gives us troubleshooters a great overview of statistics on fabric events and commands.
SW_ILS
————————————————————————————————————–
E/D_Port ELP  EFP  HA_EFP DIA  RDI  BF   FWD  EMT  ETP  RAID GAID ELP_TMR  GRE  ECP  ESC  EFMD ESA  DIAG_CMD 
————————————————————————————————————–
0        0    0    0      0    0    0    0    0    0    0    0    0        0    0    0    0    0    0        
69       0    0    0      0    0    0    0    0    0    0    0    0        0    0    0    0    0    0        
71       0    0    0      0    0    0    0    0    0    0    0    0        0    0    0    0    0    0        
79       0    0    0      0    0    0    0    0    0    0    0    0        0    0    0    0    0    0        
131      0    0    0      0    0    0    0    0    0    0    0    0        0    0    0    0    0    0        
140      0    0    0      0    0    0    0    0    0    0    0    0        0    0    0    0    0    0        
141      0    0    0      0    0    0    0    0    0    0    0    0        0    0    0    0    0    0        
148      0    0    0      0    0    0    0    0    0    0    0    0        0    0    0    0    0    0        
149      0    0    0      0    0    0    0    0    0    0    0    0        0    0    0    0    0    0        
168      0    0    0      0    0    0    0    0    0    0    0    0        0    0    0    0    0    0        
169      0    0    0      0    0    0    0    0    0    0    0    0        0    0    0    0    0    0        
174      0    0    0      0    0    0    0    0    0    0    0    0        0    0    0    0    0    0        
175      0    0    0      0    0    0    0    0    0    0    0    0        0    0    0    0    0    0        
This release also fixes a gazillion defects, so it’s highly advisable to get to this level sooner rather than later. Check with your vendor for the latest supported release.
So, all in all, good stuff, but some things should be reverted, NOW!!! And PLEASE BROCADE: don’t screw up more output in such a way that it breaks existing analysis scripts etc…
Cheers
Erwin

Does SNIA matter?

If you’re not into enterprise storage and live in France you might confuse the acronym with the national union of nurse anaesthetists (Syndicat National des Infirmiers-Anesthésistes), but if you relate it to storage you obviously end up at the Storage Networking Industry Association.

This organisation was founded on the core values of developing vendor-neutral storage technologies and open standards that enhance the overall usability of storage in general.

In addition, the SNIA organises events such as Storage Networking World, the Storage Developers Conference and various summits, and it also provides a lot of vendor-neutral education with its own certification path. There are world-wide chapters which each organise their local gigs and can provide help and support on general storage related issues.

The question is, though, to what extent SNIA is able to steer the storage industry to a point on the horizon that is beneficial to both customers and vendors. The biggest issue is that the entire SNIA organisation lives by the grace of its members, which are primarily vendors. Although you, as a customer or system integrator or anyone else interested, can become a member and make proposals, you have to bring a fairly large bag of coins to become a voting member and gain the ability to somewhat influence the pathways of the storage evolution.

The SNIA does not directly steer development of technologies which fall under the umbrella of the INCITS, IEEE, IETF and ISO standards bodies. Although many vendors are part of both organisations, you will find that the well-established standards such as Fibre Channel, SCSI, Ethernet and TCP/IP are developed in those respective bodies.

So should you care about SNIA?

YES!!! You certainly need to. The SNIA is a not-for-profit organisation which provides a very good overview of where storage technology is at every stage. It started off in 1997, shortly after storage went from DAS to SAN. Over the years it has provided the industry with numerous exciting technologies which enhanced storage networking in general. Some examples are SMI-S, CDMI, CSI, XAM etc. Some of these technologies evolved into products used by vendors and others have ceased to exist due to lack of vendor support or customer demand.

If you’re fairly new to the storage business, the SNIA is an excellent place to start getting acquainted with storage concepts, protocols and general storage technologies without any bias towards vendors. This allows you to remain clear-minded about the options and provides the ability to start off your career in this exciting, fast-paced business. I would advise having a look at the course and certification track and recommend getting certified. It gives you a good start with some credibility, and at least you know what the pundits in the industry talk about when they mention distributed filesystems, FC, block vs file etc.

I briefly mentioned the events they organise. If you want to know who’s who in the storage zoo, a great place to visit is SNW (Storage Networking World), an event organised twice a year in the US on both the east and west coast. All major vendors are around (at least they should be, in my view) and it gives you a great opportunity to check out what they have on their product list.
The next great event is SDC (Storage Developers Conference), which quite easily outsmarts most other geek events. This is where everyone comes together who knows storage down to the binary level. This is the event where individual file-system blocks are unravelled, HBA APIs are discussed and all the new and exciting cloud technologies are debunked. So if you’re into some real technical deep-dives this is the event to visit.

Although questions have been raised about whether SNIA is relevant at all, I think it is, and it should be supported by anyone with an interest in storage technologies.

I’m curious about your thoughts.

Regards,
E

Brocade vs Cisco. The dance around DataCentre networking

When looking at the network market there is one clear leader, and that is Cisco. Their products are ubiquitous from home computing to the enterprise. Of course there are others like Juniper, Nortel and Ericsson, but these companies only scratch the surface of what Cisco can provide. They rely on very specific differentiators and, given the fact they are still around, do a pretty good job at it.

A few years ago there was another network provider called Foundry, and they had some really impressive products; that’s mainly why these are only found in the core of data centres which push a tremendous amount of data. The likes of ISPs or Internet Exchanges are a good fit. It is for this reason that Brocade acquired Foundry in July 2008. A second reason was that Cisco had entered the storage market with the MDS platform, which left Brocade with no counterweight in the networking space to provide customers with an alternative.

When you look at the storage market it is the other way around. Brocade has been in the Fibre Channel space since day one. They led the way with their 1600 switches and have outperformed and out-smarted every other FC equipment provider on the planet. Many companies that have been in the FC space have either gone broke or been swallowed by others. Names like Gadzoox, McData, CNT, Creekpath, Inrange and others have all vanished, and their technologies either no longer exist or have been absorbed into the products of the vendors who acquired them.

With two distinctly different technologies (networking & storage), both Cisco and Brocade have attained a huge market share in their respective specialities. Since storage and networking are two very different beasts, this has served many companies very well and no collision between the two technologies happened. (That is, until FCoE came around; you can read my other blog posts for my opinion on FCoE.)

Since Cisco, being bold, brave and sitting on a huge pile of cash, decided to also enter the storage market, Brocade felt its market share declining. It had to do something, and thus Foundry was on the target list.

After the acquisition Brocade embarked on a path to get the product lines aligned with each other, and they succeeded with their own proprietary technology called VCS (I suggest you search for this on the web; many articles have been written). Basically what they’ve done with VCS is create an underlying technology which allows a flat layer-2 Ethernet network to operate on a flat fabric-based one, something they have had experience with since the beginning of time (storage networking, that is, for them).

Cisco wanted something different and came up with a technology that enables this merging, called FCoE. Cisco uses this extensively across their product set, and it is the primary internal communications protocol in their UCS platform. Although I don’t have any indicators yet, it might well be that, because FCoE will be ubiquitous in all of Cisco’s products, the MDS platform will be abolished pretty soon from a sales perspective and the Nexus platforms will provide the overall merged storage and networking solution for Cisco data centre products, which in the end makes good sense.

So what is my view on the Brocade vs. Cisco discussion? Well, basically, I do like them both. As they have different viewpoints on storage and networking there is not really a good vs. bad. I see Brocade as the cowboy company providing bleeding-edge, up-to-the-latest-standards technologies like Ethernet fabrics and 16G fibre channel, whereas Cisco is a bit more conservative, which improves stability and maturity. What the pros and cons for customers are I cannot determine, since the requirements are mostly different.

From a support perspective on the technology side, I think Cisco has a slight edge over Brocade, since many of the hardware and software problems have been resolved over a longer period of time and, by its nature, Brocade’s bleeding-edge, “first-to-market” strategy may sometimes result in a bumpy ride. That being said, since Cisco is a very structured company they sometimes lack a bit of flexibility, and Brocade has an edge on that point.

If you ask me directly which vendor to choose when deciding on a product set or vendor for a new data centre, I have no preference. From a technology standpoint I would still separate fibre-channel from Ethernet and wait until both FCoE and Ethernet fabrics have matured and are well past their “hype cycle”. We’re talking data centres here and it is your data, not Cisco’s and not Brocade’s. Both FC and Ethernet are very mature and have a very long track record of operations, flexibility and stability. The excellent knowledge that is available on each of these specific technologies gives me more peace of mind than the prospect of having to deal with problems that bring the entire data centre to a standstill.

Erwin

SCSI UNMAP and performance implications

When listening to Greg Knieriemen’s podcast on Nekkid Tech there was some debate on VMware’s decision to disable the SCSI UNMAP command in vSphere 5.something. Chris Evans (www.thestoragearchitect.com) had some questions about why this happened, so I’ll try to give a short description.

Be aware that, although I work for Hitachi, I have no insight into the internal algorithms of any vendor, but the T10 (INCITS) specifications are public and every vendor has to adhere to these specs, so here we go.

With the introduction of thin provisioning in the SBC-3 specs, a whole new can of options, features and functions came out of the T10 (SCSI) committee which enables applications and operating systems to do all sorts of nifty stuff on storage arrays. Basically it means you can give a host a 2TB volume whilst in the background you only have 1TB physically available. The assumption with thin provisioning (TP) is that a host or application won’t use that 2TB in one go anyway, so why pre-allocate it.

So what happens is that the storage array provides the host with a range of addressable LBAs (Logical Block Addresses) which the host is able to use to store data. In the back-end on the array these LBAs are only allocated upon actual use. The array has one or more so-called disk pools where it can physically store the data. The mapping between the “virtual addressable LBAs” which the host sees and the back-end physical storage is done via mapping tables. Depending on the implementation of the individual vendor, certain “chunks” out of these pools are reserved as soon as one LBA is allocated. This prevents performance bottlenecks from a housekeeping perspective, since the array doesn’t need to manage each single LBA mapping. Each vendor has different page/chunk/segment sizes and different algorithms to manage these, but the overall method of TP stays the same.

So let’s say the segment size on an array is 42MB (:-)) and an application writes to an LBA which falls into this chunk. The array updates the mapping tables, allocates cache slots and does all the other housekeeping that happens when a write IO comes in. From that moment the entire 42MB is allocated to that particular LUN which is presented to that host. Any subsequent write to any LBA which falls into this 42MB segment is just a regular IO from an array perspective; no additional overhead is needed or generated w.r.t. TP maintenance. As you can see this is a very effective way of maintaining an optimal capacity usage ratio, but as with everything there are some things you have to consider as well, like over-provisioning and its ramifications when things go wrong.
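A quick, purely illustrative Python sketch of that allocate-on-first-write behaviour (toy data structures, not any vendor’s implementation):

# Toy thin-provisioning mapping table: a virtual LUN whose 42MB segments are only
# bound to pool segments on the first write that lands inside them.
SEGMENT_SIZE = 42 * 1024 * 1024   # bytes per TP segment (vendor specific in reality)
BLOCK_SIZE   = 512                # bytes per LBA

class ThinLun:
    def __init__(self, virtual_size, pool):
        self.mapping = {}          # virtual segment index -> pool segment id
        self.pool = pool           # free pool segment ids
        self.virtual_size = virtual_size

    def write(self, lba, data):
        segment = (lba * BLOCK_SIZE) // SEGMENT_SIZE
        if segment not in self.mapping:
            # First write into this 42MB region: allocate a pool segment and
            # update the mapping table (the "housekeeping" overhead).
            self.mapping[segment] = self.pool.pop()
        # Any subsequent write into the same segment is just a regular IO.
        return self.mapping[segment]

pool = list(range(1000))                            # 1000 x 42MB of physical capacity
lun = ThinLun(virtual_size=2 * 1024**4, pool=pool)  # 2TB virtual LUN
lun.write(lba=0, data=b"x")                         # allocates segment 0
lun.write(lba=100, data=b"y")                       # same segment, no new allocation
print(len(lun.mapping), "segments allocated")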

Let’s assume that is all under control and move on.

Now what happens if data is no longer needed or is deleted? Let’s assume a user deletes a file which is 200MB in size (a video, for example). In theory this file occupied at least 5 TP segments of 42MB. But since many filesystems are very IO savvy they do not scrub the entire 42MB back to zero; they just delete the FS entry pointer and remove the inodes from the inode table. This means that only a couple of bytes have effectively been changed on the physical disk and in the array cache.
The array has no way of knowing that these couple of bytes, which have been returned to 0, represent an entire 200MB file, and as such those segments are still allocated in cache, on disk and in the TP mapping table. This also means that these TP segments can never be re-mapped to other LUNs for more effective use if needed. To overcome this there have been some solutions, like host-based scrubbing (putting all bits back to 0), defragmentation to re-align all used LBAs and scrub the rest, and some array-based solutions which check whether segments contain only zeroes and, if so, remove them from the mapping table and thereby make them available for re-use.

As you can imagine this is not a very effective way of using TP. You can be busy clearing things up on a fairly regular basis so there had to be another solution.

So the T10 friends came up with two new things, namely “write same” and “unmap”. WRITE SAME does exactly what it says: it issues a write command to a certain LBA and tells the array to also write this bit stream to a certain set of LBAs. The array then executes this, thereby offloading the host from keeping track of all the write commands so it can do more useful stuff than pushing bits back and forth between itself and the array. This can be very useful if you need to deploy a lot of VMs, which by definition have a very similar (if not exactly the same) pattern. The other way around it has a similar benefit: if you need to delete VMs (or just one), the hypervisor can instruct the array to clear all LBAs associated with that particular VM, and if the UNMAP command is used in conjunction with WRITE SAME you basically end up with the situation you want. The UNMAP command instructs the array that a certain LBA (or set of LBAs) is no longer in use by this host and can therefore be returned to the free pool.

As you can imagine, if you just use the UNMAP command this is very fast from a host perspective and the array can handle it very quickly, but here comes the catch. If the host instructs the array to UNMAP the association between the LBA and the LUN, it is basically only a pointer that is removed from the mapping table; the actual data still exists, either in cache or on disk. If that same segment is then re-allocated to another host, in theory this host can issue a read command to any given LBA in that segment and retrieve the data that was previously written by the other system. Not only can this confuse the operating system, it also implies a huge security risk.

In order to prevent this, the array has one or more background threads that clear out these segments before they are effectively returned to the pool for re-use. These tasks normally run at a pretty low priority so as not to interfere with normal host IO. (Remember that it is still the same CPU(s) that have to take care of this.) If the CPUs are fast and the background threads are smart enough, under normal circumstances you will hardly see any difference in performance.
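Pulling the last few paragraphs together in the same toy model as the earlier thin-provisioning sketch: UNMAP only removes the mapping pointer, the segment then waits for the low-priority scrub thread, and only after scrubbing does it return to the free pool (illustrative only):

# Extends the toy ThinLun sketch above (reuses SEGMENT_SIZE and BLOCK_SIZE).
class ThinLunWithUnmap(ThinLun):
    def __init__(self, virtual_size, pool):
        super().__init__(virtual_size, pool)
        self.pending_scrub = []    # segments waiting for the background thread

    def unmap(self, lba):
        segment = (lba * BLOCK_SIZE) // SEGMENT_SIZE
        pool_seg = self.mapping.pop(segment, None)   # fast: just remove the pointer
        if pool_seg is not None:
            self.pending_scrub.append(pool_seg)       # data is still on disk/cache!

    def background_scrub(self, budget=1):
        """Low-priority task: zero out a few segments per cycle, then free them."""
        for _ in range(min(budget, len(self.pending_scrub))):
            pool_seg = self.pending_scrub.pop()
            # ... write zeroes across the whole segment here ...
            self.pool.append(pool_seg)               # now safe to hand to another LUN

If hosts unmap segments faster than the scrub budget allows, the pending list grows and the array eventually has to delay its responses, which is exactly the kind of “performance problem” described below.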

As with all instruction-based processing, the work has to be done either way, be it on the array or on the host. So if there is a huge amount of demand, where hypervisors move a lot of VMs around between LUNs and/or arrays, there will be a lot of deallocation (UNMAP), clearance (WRITE SAME) and re-allocation of these segments going on. It depends on the scheduling algorithm at what point the array decides to reschedule the background and frontend processes such that there will be a delay in the status response to the host. On the host it looks like a performance issue, but in essence what you have done is overload the array with too many commands for work which normally (without thin provisioning) would have to be done by the host itself.

You can debate whether using a larger or smaller segment size would be beneficial, but that doesn’t really matter. If you use a smaller segment size the CPU has much more overhead in managing mapping tables, whereas with bigger segment sizes the array needs to scrub more space on deallocation.

So this is the reason why VMware disabled the UNMAP command in this patch: a lot of “performance problems” were seen across the world when this feature was enabled. Given the fact that it was VMware that disabled it, you can imagine that multiple arrays from multiple vendors were impacted in some sense; otherwise they would have been more specific about array vendors and types, which they haven’t been.