Tag Archives: performance

Speed mismatch is the death-trap for shared storage

I’ve been focusing a lot on the implications of physical issues in my posts over the last ~2 years. What I haven’t touched on are logical performance boundaries, which also cause extreme grief in many storage infrastructures and lead to performance problems, IO errors, data corruption and other nasty stuff you do not want to see in your storage network.


Continue reading

Performance expectations with ISL compression

So this week I had an interesting case. As you know, the Hitachi arrays have a replication functionality called HUR (Hitachi Universal Replicator), which is an advanced asynchronous replication solution offered for Mainframe and Open Systems environments. HUR does not use a primary-to-secondary push method; instead the target system issues reads to the primary array, which then sends the required data. This optimizes the traffic pattern by using batches of data. From a connectivity perspective you will mostly see full 2K FC frames, which means that on long-distance connections (30KM to 100KM) you can very effectively keep a fairly low number of buffer-credits on both sides whilst still maintaining optimum link utilization.
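To put some rough numbers on that buffer-credit claim: a BB credit is only freed once the R_RDY comes back, so the credits you need scale with the round-trip time divided by the time it takes to serialize one frame. The sketch below is my own back-of-the-envelope estimate (the ~5 µs/km propagation figure and the ~800 MB/s effective 8GFC data rate are my assumptions, not from the case itself); it mainly shows why full 2K frames need far fewer credits than small frames over the same distance.

```python
# Back-of-the-envelope BB_Credit estimate for a long-distance FC link.
# Assumptions (mine): light travels ~5 us per km in fibre, and 8GFC moves
# roughly 800 MB/s of payload, i.e. ~800 bytes per microsecond.

def bb_credits_needed(distance_km, frame_bytes, link_bytes_per_us=800, us_per_km=5.0):
    """Credits needed to keep the link streaming: round-trip time divided
    by the serialization time of one frame."""
    round_trip_us = 2 * distance_km * us_per_km
    frame_time_us = frame_bytes / link_bytes_per_us
    return round_trip_us / frame_time_us

for distance in (30, 100):
    full = bb_credits_needed(distance, 2048)   # full 2K data frames (HUR batches)
    small = bb_credits_needed(distance, 512)   # small, chatty frames
    print(f"{distance:>3} km: ~{full:4.0f} credits with 2K frames, "
          f"~{small:4.0f} with 512-byte frames")
```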

Continue reading

Closing the Fibre-Channel resiliency gap

Fibre-Channel is still the predominant transport protocol for storage related data transmission. And rightfully so. Over the past two decades or so it has proven to be very efficient and extremely reliable in moving channel based data transmissions between initiators and targets. The reliability is due to the fact that the underlying infrastructure is almost bulletproof. Fibre-Channel requires very high quality hardware as per the FC-PI standard, and a BER (Bit Error Rate) worse than 10^-12 is not tolerated. What Fibre-Channel lacks, though, is a method of detection and notification to and from hosts when a path to a device is below the required tolerance levels, which can cause frames to be dropped without the host being able to adjust its behaviour on that path. Fibre-Channel relies on upper level protocols (like SCSI) to re-submit the command and that’s about it.

When FC was introduced to the market back in the late 90’s, many vendors already had multipath software which could correlate multiple paths to the same LUN into one and, in case one path failed, switch over to the other. Numerous iterations further down the road nothing really exciting has been developed in that area. As per the nature of the chosen class-of-service (3) for the majority of today’s FC implementations there is no error recovery done in an FC environment. As per my previous post you’ve seen that MPIO software is also NOT designed to act on and recover failed IO’s. Only in certain circumstances will it fail a path so that all new IO’s are directed to one or more of the remaining paths to that LUN.

The crux of the problem is that if any part of the infrastructure is below the required quality level and there is nothing on the host level that actively reacts to these kinds of errors, you will end up with a very poorly performing storage infrastructure. Not only on the host that is active on that path; a fair chance exists that other hosts will have the same or similar problems. (See the “Rotten Apples” series in previous posts.)

So what is my proposal? Hosts should become more aware of the end-to-end error ratio of paths from their initiators to the targets and back. This results in a view where hosts (or applications) can make a judgement call on which path to send an IO so that the chance of an error is as slim as possible. So how does this need to work? It is all about creating an inventory of least-error paths. For this to be accomplished we need a way of getting this info to the respective hosts. Basically there are two ways of doing this. 1. Either create a central database which receives updates from each individual switch in the fabric, where the host needs to query that database and correlate it with the FSPF (Fabric Shortest Path First) info in order to be able to sort things out, or, 2. and this would be my preferred way, we introduce a new ELS frame which can hold all counters currently specified in the LESB (Link Error Status Block) plus some more for future use or vendor specific info. I call this the Error Reporting with Integrated Notification frame. This frame is sent by the initiator to the target over all paths it has at its disposal. Each ingress port (RX port on each switch) which this frame traverses increments the counters with its own values. When the target receives the frame it flips the SID and DID and sends it back to the host. Given the fact that this frame is still part of the same FC exchange it will also traverse the same path back, so an accurate total error count of that path can be created.
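To make the idea a bit more concrete, here is a small sketch of how such a frame could accumulate LESB-style counters hop by hop and arrive back at the initiator with a per-path total. This is purely illustrative; the field names and the class name are my own shorthand for the proposal, not an existing T11 definition.

```python
from dataclasses import dataclass, field

# LESB-style counters the proposed ELS would carry; names are shorthand.
LESB_FIELDS = ("link_failures", "loss_of_sync", "loss_of_signal",
               "prim_seq_errors", "invalid_tx_words", "invalid_crc")

@dataclass
class ErrorReportingELS:
    sid: str                    # source (initiator) port ID
    did: str                    # destination (target) port ID
    counters: dict = field(default_factory=lambda: {f: 0 for f in LESB_FIELDS})

    def ingress(self, port_lesb):
        # Each RX port the frame traverses adds its own LESB values.
        for name in LESB_FIELDS:
            self.counters[name] += port_lesb.get(name, 0)

    def turn_around(self):
        # The target flips SID/DID; the reply stays in the same exchange
        # and therefore follows the same route back to the initiator.
        self.sid, self.did = self.did, self.sid

# One path: initiator -> switch ingress -> target ingress, then back again.
els = ErrorReportingELS(sid="0x010100", did="0x020200")
for hop in ({"invalid_crc": 0}, {"invalid_tx_words": 12, "invalid_crc": 3}):
    els.ingress(hop)
els.turn_around()
for hop in ({"invalid_crc": 2}, {"invalid_crc": 0}):
    els.ingress(hop)
print(els.counters)   # per-path error totals as seen end-to-end
```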

Both options would enable each host to analyze the overall error count on each path from HBA to target for each LUN it has. The problem with option 1 is that the size of the database grows rapidly with the number of ports and paths, and it might become so huge that it can no longer live inside the fabric and thus needs to be maintained in an external management tool. This has the disadvantage that hosts then depend on out-of-band network restrictions and security implications, in addition to interoperability issues. It also has the problem that path errors can be bursty depending on load and/or fabric behaviour. This management application will need to poll these switches for each individual port, which will put an additional load on the processors of each switch even when it is not necessary. Furthermore it is highly unreliable when the fabric sees a fair amount of changes, which by default cause re-routing to occur and thus render a calculation done by the host one minute ago totally useless.

Option 2 has the advantage that there is one uniform methodology which is distributed across each initiator, target and path. It therefore has no impact on any switch and/or external management application and does not rely on network (TCP/IP) related security restrictions, Ethernet boundaries caused by VLANs, or any other external factors that could influence the operation.

The challenge, however, is that today there is no ASIC that supports this logic, and even if I could get the proposal accepted by T11 it’ll take a while before this is enabled in hardware. In the meantime the ELS frame could be sent to the processor of the switch, which in turn does the error count modification in the frame payload, CRC recalculation and the other things required. Once more the bottleneck of this method will be the capability of the CPU of that particular switch, especially when you have many high port-count blades installed. Until the ASICs are able to do this on the fly in hardware there will be less granularity from a timing perspective, since each ELS frame will need to be sent to the CPU. To prevent the CPU from being flooded by all these update and pull requests in the transitional period there is an option to either extend the PLOGI to check if all ports in the path are able to support this frame in hardware, or use this new ELS with a special “inventory bit” to determine the capabilities. If any port in the path does not support the ELS frame in hardware it will flick that bit to 0. This allows the timing interval of each ELS frame to be in line with the capabilities of the end-to-end path. You can imagine that if all ports are able to do this in hardware you can achieve a much finer granularity on timing and hosts can respond much quicker to errors. If any port does not support the new ELS frame the timing can be adjusted to fall in between the E_D_TOV and R_A_TOV values (in general 2 and 10 seconds). The CPUs on the switches are fairly capable of handling this. This is still much better than any external method of collecting Fibre-Channel port errors and having an out-of-band query and policy method. Another benefit is that there is a standard method of collecting and distributing the end-to-end path errors, so even in multi-vendor environments it is not tied to a single management platform.
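The interval selection itself could be as simple as the sketch below. The E_D_TOV and R_A_TOV defaults are the usual 2 and 10 seconds mentioned above; the sub-second “hardware” interval is purely my own placeholder.

```python
def els_probe_interval(all_ports_hw_capable, e_d_tov=2.0, r_a_tov=10.0,
                       hw_interval=0.5):
    """Pick how often to send the error-reporting ELS on a path.

    If every port on the path handled the frame in the ASIC (the
    'inventory bit' stayed at 1) we can probe aggressively; otherwise keep
    the interval between E_D_TOV and R_A_TOV so switch CPUs are not flooded.
    """
    if all_ports_hw_capable:
        return hw_interval                    # assumption: sub-second probing
    return (e_d_tov + r_a_tov) / 2.0          # e.g. 6 seconds with the defaults

print(els_probe_interval(True), els_probe_interval(False))
```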

 

So let’s look at an example.

This shows a very simplistic SAN infrastructure where one host has 4 paths, via two initiators, to 2 storage ports each. All ports seem to be in tip-top shape.

If one link (in this case between the switch and the storage controller) is observing errors (in any direction), the new ELS frame would increment that particular LESB counter value, and the result would enable the host to detect the increase in any of the counters on that particular path. Depending on the policies of the operating system or application it could direct the MPIO software to mark that path failed and remove it from the IO path list.

Similarly, if a link shows errors between an initiator and a switch, two paths are affected and the host has the option to mark both as failed.
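How a host acts on those accumulated totals is of course a policy decision. The fragment below is just one possible (and entirely invented) policy to illustrate the idea: compare each probe’s per-path totals with the previous probe and fail any path whose error delta exceeds the tolerance.

```python
# Illustrative MPIO-style policy: fail a path when its error count grows
# too fast between two ELS probes. Threshold and path names are invented.
ERROR_DELTA_THRESHOLD = 50     # errors per probe interval the host tolerates

previous = {"path_A1": 10, "path_A2": 12, "path_B1": 11, "path_B2": 13}
current  = {"path_A1": 14, "path_A2": 13, "path_B1": 460, "path_B2": 14}

active_paths = set(current)
for path, total in current.items():
    delta = total - previous.get(path, 0)
    if delta > ERROR_DELTA_THRESHOLD:
        active_paths.discard(path)            # stop queueing new IO on it
        print(f"{path}: +{delta} errors since last probe -> marked failed")

print("paths left for IO:", sorted(active_paths))
```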

If you have a meshed fabric the number of paths grows rapidly with each switch and ISL you add. The figure below shows the same structure, but because an ISL has been added between the switches the number of potential paths between the host and the LUN now grows to 8.

This means that if one of the links is bad the number of potentially bad paths also doubles.

In this case the paths from both initiators over this bad link are marked faulty and can be removed from the target list by the MPIO software.

The fun really starts with the unexpected. Let’s say you add an additional ISL and for some reason this one is bad. The additional ISL does not add new paths to the path-list on the host, since it is transparent and hidden by the fabric. Frames will just traverse one or the other irrespective of which path is chosen by the host software. Obviously, since the ELS is just a normal frame, the error counters in the ELS might be skewed based on which of the two ISLs it was sent over. Depending on the architecture of the switch you have two options: either the ASIC accumulates all counters for both ports into one and adds these onto the already existing counters, or you use a divisional factor where the ASIC sums up all counters of the ISLs and divides them by the number of ISLs. The same can be done for trunks (Brocade) / port-channels (Cisco). Given the fact that most of these counters are currently kept as 32-bit values, the first option is likely to cause the counters to wrap very quickly. The second advantage of a divisional factor is that there will be consistent averaging across all paths in case you have a larger meshed fabric, and thus it will provide more accurate feedback to the host.
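The divisional factor itself is trivial to express. Here is a small illustration of the idea; my own sketch, not how any ASIC actually implements it.

```python
def trunk_contribution(member_counters):
    """Average the per-member LESB counters of a trunk / port-channel so a
    path crossing it gets one consistent value regardless of which member
    the ELS frame happened to ride on."""
    members = len(member_counters)
    keys = member_counters[0].keys()
    return {k: sum(c[k] for c in member_counters) // members for k in keys}

# Two ISLs in a trunk, one clean and one noisy.
print(trunk_contribution([{"invalid_crc": 0, "invalid_tx_words": 4},
                          {"invalid_crc": 120, "invalid_tx_words": 900}]))
# -> {'invalid_crc': 60, 'invalid_tx_words': 452}
```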

I’m working out the details w.r.t. the internals of the ELS frame itself and which bits to use in which position.

This all should make Fibre-Channel in combination with intelligent host-based software an even more robust protocol for storage based data-transmissions.

Let me know what you think. Any comments, suggestions and remarks are highly appreciated.

Cheers

Erwin

Brocade Fabric Watch – The most underutilised feature

Many customer cases I handle are related to poor connectivity. A connectivity problem can be caused by unclean connectors, broken cables or SFP’s. (See one of my earlier blog posts).
Although the switches are capable of identifying physical issues and subsequently notifying administrators, this is hardly ever followed up. Very often an acute issue lingers for days before an administrator starts investigating, and in many cases that is only because a server admin starts complaining about SCSI errors, IO time-outs or very poor performance.
So how do we prevent this from happening? Well, for starters make sure that your environment is clean. By this I mean you should make sure that all connectors are not exposed to dust or other types of contamination. Secondly, try to handle cables with care. I’ve seen many cases where cables were under so much tension that Jimi Hendrix would be able to compose one of his finest works on them. Although modern fibre cables are fairly rugged and able to handle a fair amount of tension, try not to test this. As a last point I would suggest keeping an eye on optical transmit power levels. As you most likely know, lasers do not have an infinite lifetime and their transmission power will decrease over time. At some point the receiving end of a link is no longer able to distinguish between on and off in a reliable manner, and as a result the 8b/10b (or 64b/66b) encoding/decoding algorithm will start to detect bit flips and discard transmission words. The upper and lower power requirements are published in the data-sheets, so as soon as one of these values reaches its lower limit, replace the SFP.

Now you might argue that if you have 10000 ports in your fabric you have other things to worry about than checking SFP power values every day. The workload put on storage admins was not decreasing the last time I looked, so that is unlikely to change in the years to come.

Fortunately you don’t have to. Both Brocade and Cisco provide options to monitor each individual component. For many years Brocade has had one of the best embedded management tools there is, namely Fabric Watch (FW). FW is not an active management tool per se; the underlying goal is to have a sort of self-healing and protecting framework to monitor, alert and take action on events that might have implications for overall fabric behaviour.

A single dodgy link can have significant implications on overall fabric behaviour which can, and will, impact many hosts depending on topology and traffic pattern. FW allows you to set thresholds on many items in a switch, from SFP power values and link errors to temperature readings and much more. Each of these items can be configured with certain characteristics like above, below, in-between or change values, and on each of these a time frame can be configured.

Now let’s take an example of a link that has some intermittent errors. Your applications tolerate a certain error ratio per time-frame that they can recover from, so in case one or two IO errors per hour are seen by the OS or application it will re-send the read or write command and all is good. If, however, this starts to increase you might end up with the application going down or even data corruption. If you have configured FW to send a notification when the number of errors increases beyond the application tolerance, you will be able to take action and investigate where the problem might be.

Now there is another issue and that is that you’re most likely not sitting behind a console 24×7 or monitoring emails during your holidays. So even if you do get notified there is a good chance you will not notice it. (I know I won’t when I’m playing golf :-))
This calls for some more drastic measures, and this is also covered by FW. If a certain threshold increases beyond a warning level and reaches a critical level, FW allows you to take action right away. This is a feature Brocade calls port-fencing. Basically what it means is that when this threshold is met the switch will simply disable the port to prevent it from propagating the problems further up in the fabric. This is REALLY an area you SHOULD investigate. It can save you from having many issues showing up all over the fabric.
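Conceptually the escalation looks like the sketch below. To be clear: this is not Fabric Watch code or its CLI, just an illustration of the warning/critical/fence idea with made-up numbers; on a real switch you would configure this through FW itself.

```python
# Illustration of threshold escalation: count CRC errors per interval,
# warn at one level, fence (disable) the port at a higher one.
# All numbers are made-up examples, not Brocade defaults.
WARN_PER_MINUTE = 5
FENCE_PER_MINUTE = 40

def evaluate_port(port, crc_errors_last_minute, fencing_enabled=True):
    if crc_errors_last_minute >= FENCE_PER_MINUTE and fencing_enabled:
        print(f"port {port}: {crc_errors_last_minute} CRC errors/min "
              f"-> port fenced (disabled) to protect the fabric")
        return "fenced"
    if crc_errors_last_minute >= WARN_PER_MINUTE:
        print(f"port {port}: {crc_errors_last_minute} CRC errors/min -> notify admin")
        return "warning"
    return "ok"

for port, errors in {"2/14": 2, "2/15": 9, "3/0": 75}.items():
    evaluate_port(port, errors)
```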

The title of this blog post unfortunately reflects the status as it now stands with most of the installed base of fabrics, and the reason seems to be that administrators have a problem with software deciding on disruptive actions like disabling ports. My argument is that such a port is already in a degraded state and is also causing other links in the entire fabric to have problems. If you don’t know what you’re looking for and have this large 10000 port fabric, it will take you a significant amount of time before you know what’s going on. In this time many, many more hosts and applications can and will suffer from significant performance and other problems, which might create some significant overtime for many people.

Regards,
Erwin

SCSI UNMAP and performance implications

When listening to Greg Knieriemen’s podcast on Nekkid Tech there was some debate on VMware’s decision to disable the SCSI UNMAP command in vSphere 5.something. Chris Evans (www.thestoragearchitect.com) had some questions about why this happened, so I’ll try to give a short description.

Be aware that, although I work for Hitachi, I have no insight into the internal algorithms of any vendor, but the T10 (INCITS) specifications are public and every vendor has to adhere to these specs, so here we go.

With the introduction of thin provisioning in the SBC-3 specs a whole new can of options, features and functions came out of the T10 (SCSI) committee, which enabled applications and operating systems to do all sorts of nifty stuff on storage arrays. Basically it meant you could give a host a 2 TB volume whilst in the background you only had 1 TB physically available. The assumption with thin provisioning (TP) is that a host or application won’t use that 2 TB in one go anyway, so why pre-allocate it.

So what happens is that the storage array provides the host with a range of addressable LBA’s (Logical Block Addresses) which the host is able to use to store data. In the back-end on the array these LBA’s are then only allocated upon actual use. The array has one or more so-called disk pools where it can physically store the data. The mapping between the “virtual addressable LBA” which the host sees and the back-end physical storage is done via mapping tables. Depending on the vendor’s implementation, certain “chunks” out of these pools are reserved as soon as one LBA is allocated. This prevents performance bottlenecks from a housekeeping perspective since the array doesn’t need to manage each single LBA mapping. Each vendor has different page/chunk/segment sizes and different algorithms to manage these, but the overall method of TP stays the same.

So let’s say the segment size on an array is 42MB (:-)) and an application writes to an LBA which falls into this chunk. The array updates the mapping tables, allocates cache-slots and does all the other housekeeping that happens when a write IO comes in. As of that moment the entire 42 MB is allocated to that particular LUN which is presented to that host. Any subsequent write to any LBA which falls into this 42MB segment is just a regular IO from an array perspective; no additional overhead is needed or generated w.r.t. TP maintenance. As you can see this is a very effective way of maintaining an optimum capacity usage ratio, but as with everything there are some things you have to consider as well, like over-provisioning and its ramifications when things go wrong.
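In pseudo-Python the allocate-on-first-write behaviour looks roughly like this. The 42 MB segment size is just the example from above; everything else is my own simplification of whatever a real array does internally.

```python
SEGMENT_SIZE = 42 * 1024 * 1024          # the 42 MB example segment
SECTOR = 512

class ThinLun:
    def __init__(self, virtual_size, pool):
        self.virtual_size = virtual_size
        self.pool = pool                  # list of free physical segment IDs
        self.map = {}                     # virtual segment index -> physical segment

    def write(self, lba, data_len):
        # data_len is not needed for the segment bookkeeping in this sketch
        seg = (lba * SECTOR) // SEGMENT_SIZE
        if seg not in self.map:           # first touch: allocate a whole segment
            self.map[seg] = self.pool.pop()
        # subsequent writes into the same 42 MB segment are plain IOs,
        # no further thin-provisioning housekeeping is required

pool = list(range(1000))                  # 1000 free segments, ~42 GB of pool
lun = ThinLun(virtual_size=2 * 1024**4, pool=pool)   # host sees 2 TB
lun.write(lba=0, data_len=8192)
lun.write(lba=81920, data_len=8192)       # same segment, no new allocation
print(len(lun.map), "segment(s) allocated,", len(pool), "left in the pool")
```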

Let’s assume that is all under control and move on.

Now what happens if data is no longer needed or deleted? Let’s assume a user deletes a file which is 200MB big (a video for example). In theory this file occupied at least 5 TP segments of 42MB. But since many filesystems are very IO savvy they do not scrub the entire 200MB back to zero; they just delete the FS entry pointer and remove the inodes from the inode table. This means that only a couple of bytes have effectively been changed on the physical disk and in the array cache.
The array has no way of knowing that these couple of bytes represent an entire 200MB file, and as such the segments remain allocated in cache, on disk and in the TP mapping table. This also means that these TP segments can never be re-mapped to other LUN’s for more effective use if needed. There have been some solutions to overcome this, like host-based scrubbing (putting all bits back to 0), de-fragmentation to re-align all used LBA’s and scrub the rest, and some array-based solutions that check whether segments contain only zeroes and, if so, remove them from the mapping table and make them available for re-use.

As you can imagine this is not a very effective way of using TP. You can be busy clearing things up on a fairly regular basis so there had to be another solution.

So the T10 friends came up with two new things, namely “write same” and “unmap”. WRITE SAME does exactly what it says: it issues a write command to a certain LBA and tells the array to also write this bit stream to a certain set of LBA’s. The array then executes this, thereby offloading the host from keeping track of all the write commands so it can do more useful stuff than pushing bits back and forth between itself and the array. This can be very useful if you need to deploy a lot of VM’s which by definition have a very similar (if not exactly the same) pattern. The other way around it has a similar benefit: if you need to delete VM’s (or just one) the hypervisor can instruct the array to clear all LBA’s associated with that particular VM, and if the UNMAP command is used in conjunction with the WRITE SAME command you basically end up with the situation you want. The UNMAP command instructs the array that certain LBA’s are no longer in use by this host and therefore can be returned to the free pool.

As you can imagine, if you just use the UNMAP command this is very fast from a host perspective and the array can handle it very quickly, but here comes the catch. If the host instructs the array to UNMAP the association between the LBA and the LUN, it is basically only a pointer that is removed from the mapping table; the actual data still exists either in cache or on disk. If that same segment is then re-allocated to another host, in theory this particular host can issue a read command to any given LBA in that segment and retrieve the data that was previously written by the other system. Not only can this confuse the operating system, it also implies a huge security risk.

In order to prevent this the array has one or more background threads to clear out these segments before they are effectively returned to the pool for re-use. These tasks normally run at a pretty low priority so as not to interfere with normal host IO. (Remember that it is still the same CPU(s) that have to take care of this.) If the CPUs are fast and the background threads are smart enough, under normal circumstances you hardly see any difference in performance.
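Extending the toy model from above: UNMAP only drops the mapping, and a low-priority background job has to zero the segment before it may go back into the free pool, which is exactly the work that piles up when a hypervisor unmaps large amounts in one go. Again a hedged sketch of the concept, not any vendor’s actual implementation.

```python
import collections

class ThinPool:
    def __init__(self, free_segments):
        self.free = collections.deque(free_segments)   # immediately reusable
        self.pending_scrub = collections.deque()       # unmapped, not yet zeroed

    def unmap(self, physical_segment):
        # Fast path the host sees: just drop the mapping and queue the segment.
        self.pending_scrub.append(physical_segment)

    def background_scrub(self, budget):
        # Low-priority housekeeping: zero at most `budget` segments per cycle
        # so old data can never leak to the next LUN that gets the segment.
        done = 0
        while self.pending_scrub and done < budget:
            seg = self.pending_scrub.popleft()
            # ... zero the segment on disk / in cache here ...
            self.free.append(seg)
            done += 1
        return done

pool = ThinPool(free_segments=[])
for seg in range(500):                      # a VM migration unmaps 500 segments
    pool.unmap(seg)
print("scrubbed this cycle:", pool.background_scrub(budget=32))
print("still waiting to be reclaimed:", len(pool.pending_scrub))
```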

As with all instruction-based processing the work has to be done either way, be it by the array or the host. So if there is a huge amount of demand where hypervisors move around a lot of VM’s between LUN’s and/or arrays, there will be a lot of deallocation (UNMAP), clearance (WRITE SAME) and re-allocation of these segments going on. It depends on the scheduling algorithm at what point the array decides to reschedule the background and frontend processes such that there will be a delay in the status response to the host. On the host it looks like a performance issue, but in essence what you have done is overload the array with commands for work which normally (without thin provisioning) would have to be done by the host itself.

You can debate whether a larger or smaller segment size would be beneficial, but it doesn’t change the fundamentals: with a smaller segment size the CPU has much more overhead in managing mapping tables, whereas with bigger segment sizes the array needs to scrub more space on deallocation.

So this is the reason why VMware disabled the UNMAP command in this patch: a lot of “performance problems” were seen across the world when this feature was enabled. Given the fact that it was VMware that disabled it, you can imagine that multiple arrays from multiple vendors were impacted in some sense; otherwise they would have been more specific about the array vendors and types affected, which they haven’t been.

Why disk drives have become slower over the years

What is the first question vendors get (or at least used to get) when a customer (non-technical) calls???
I’ll spare you the guesswork: “What does a TB of disks cost at your place??”. Usually I’ll google around for the cheapest disk at a local PC store and say “Well Sir, that would be about 80 dollars”. I then hear somebody falling off their chair, trying to get up again, reaching for the phone and with a resonating voice asking “Why are your competitors so expensive then?”. “They most likely did not give a direct answer to your question,” I reply.
The thing is, an HDD should be evaluated on multiple factors, and when you spend 80 bucks on a 1TB disk you get capacity and that’s about it. Don’t expect performance or extended MTBF figures, let alone all the stuff that comes with enterprise arrays like large caches, redundancy in every sense and a lot more. This is what makes up the price per GB.


“Ok, so why have disk drives become so much slower in the past couple of years?”. Well, they haven’t. The RPM, seek time and latency have remained the same over the last couple of years. The problem is that the capacity has increased so much that the so called “access density” (the IOPS available per GB) has dropped dramatically: the disk has to service a massive amount of bytes with the same nominal IOPS capability.


I did some simple calculations which show the decrease in performance on larger disks. I didn’t assume any RAID or cache accelerators.


I first calculated a baseline based on a 100GB disk drive (I know, they don’t exist, but it’s just for the calculations) with 500GB of data that I need to read or write.


The assumption was a 100% random read profile. Although the host can theoretically read or write in increments of 512 bytes, this doesn’t mean the disk will handle a larger IO in one sequential stroke: an 8K host IO can be split up into the smallest supported sector size on disk, which is currently around 512 bytes. (Don’t worry, every disk and array will optimize this, but again this is just to show the nominal differences.)


So a 100GB disk drive translates to a little over 190 million sectors. In order to read 500 GB of data this would take a theoretical 21.7 minutes. The number of disks is calculated based on the capacity required for that 500GB. (Also remember that disks use a base-10 capacity value whereas operating systems, memory chips and other electronics use a base-2 value, so that’s 10^3 vs 2^10.)

Baseline (reading 500GB of data from 100GB drives):

GB     Sectors        RPM     Avg delay (ms)   Max IOPS per disk
100    190,734,863    10000   8                125

Disks required: 6
Aggregate IOPS: 750
Time required: 1302 seconds (21.7 minutes)

 If you now take this baseline and map this to some previous and current disk types and capacities you can see the differences.

GB     Sectors          RPM     Disks req.   Aggregate IOPS   Time (sec)   Time (min)   % of baseline time   Speed vs baseline
9      17,166,138       7200    57           4731             206          3.44         15.83                6.32x
18     34,332,275       7200    29           2407             406          6.77         31.19                3.21x
36     68,664,551       10000   15           1875             521          8.69         40.02                2.50x
72     137,329,102      10000   8            1000             977          16.29        75.04                1.33x
146    278,472,900      10000   4            500              1953         32.55        150                  0.67x
300    572,204,590      10000   2            250              3906         65.1         300                  0.33x
450    858,306,885      10000   2            250              3906         65.1         300                  0.33x
600    1,144,409,180    10000   1            125              7813         130.22       600.08               0.17x

You can see that, capacity-wise, to store the same 500GB on 146 GB disks you need fewer disks, but you also get fewer total IOPS. This then translates into slower performance. As an example, a 300GB drive at 10000RPM triples the time compared to the baseline to read this 500 gigabytes.
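For anyone who wants to play with the numbers, here is roughly the calculation behind the table. The per-disk IOPS figures and the effective transfer size per IO are my own assumptions chosen to get close to the figures above; the results follow the trend of the table rather than matching every rounded value exactly.

```python
# Rough re-creation of the access-density calculation. Assumptions (mine):
# ~125 IOPS for a 10k RPM drive, ~83 IOPS for 7.2k RPM, 500 GiB of data to
# read, and an effective ~512 KB moved per IO. Values differ slightly from
# the table due to rounding and base-2 vs base-10 conversions.
DATA_TO_READ = 500 * 2**30            # 500 GiB of data
IO_BYTES = 512_000                    # effective bytes moved per IO
IOPS_PER_DISK = {7200: 83, 10000: 125}

def read_time(disk_gb, rpm):
    disks = -(-DATA_TO_READ // (disk_gb * 10**9))     # ceiling division
    total_iops = disks * IOPS_PER_DISK[rpm]
    seconds = DATA_TO_READ / (total_iops * IO_BYTES)
    return disks, total_iops, seconds

_, _, baseline = read_time(100, 10000)                # the 100 GB baseline drive
for gb, rpm in [(9, 7200), (18, 7200), (36, 10000), (72, 10000),
                (146, 10000), (300, 10000), (450, 10000), (600, 10000)]:
    disks, iops, secs = read_time(gb, rpm)
    print(f"{gb:>4} GB x {disks:>2} disks: {iops:>5} IOPS, "
          f"{secs / 60:6.1f} min ({secs / baseline:5.2f}x baseline)")
```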


Now these are relatively simple calculations, however they do apply to all disks including the ones in your disk array.


I hope this also makes you start thinking about performance as well as capacity. I’m pretty sure your business finds it most annoying when your users need to get a cup of coffee after every database query. 🙂