Tag Archives: management

The technical pathways of Brocade in cloud storage adoption

Brocade isn't always very forthcoming about what they are working on. Obviously a fair chunk of their development and engineering effort goes into cloud integration and into enabling their software and hardware stack for this computing model. Acquisitions like Foundry, Vyatta and now Connectem show that Brocade's horizon has broadened. To keep up with the ever-increasing demand for network features and functions it makes sense to review their current product lines, and when you read between the lines you may be able to spot some interesting things.

Cloud Storage


2 – Creating the management landscape

The first thing to think about when creating a Brocade storage network has nothing to do with switch configuration at all. It's about the surrounding management landscape. There are numerous management software options available from a multitude of vendors, including Brocade with Network Advisor. If you have purchased BNA then congratulations: you have a solid toolkit in your hands to take care of nearly everything you need to manage, maintain and troubleshoot the SAN. Despite this, there are more things to take into account.


Why Fibre-Channel has to improve

Many of you have used and managed Fibre Channel based storage networks over the years. It comes as no surprise that a network protocol primarily developed to handle extremely time-sensitive operations is built with extreme demands on hardware and software quality and with clear guidelines on how these communications should proceed. It is because of this that Fibre Channel has become the dominant storage protocol in datacenters.

The first law of the Time Lords | Aussie Storage Blog

A buddy of mine posted this article and it reminded me of the presentation I did for the Melbourne VMUG back in April of this year.

The first law of the Time Lords | Aussie Storage Blog:

If you have ever worked in support (or had the need to check on events in general as an administrator) you know how important an accurate timestamp is. Incorrect clock settings are a nightmare when you want to correlate events that are logged at different times and dates.

When you look at the hyperscale of virtualised environments you will see that the vertical IO stack is almost ten layers deep. Let's have a look, from top to bottom, at where you can set the clock.

  1. The application
  2. The VM (virtual machine)
  3. The hypervisor
  4. The network switches
  5. The first-tier storage platform (NAS/iSCSI)
  6. A set of FC switches
  7. The second-tier storage platform
  8. Another set of FC switches
  9. The virtualised storage array
Which in the end might look a bit like this. (Pardon my drawing skills)
As you can imagine it's hard enough to figure out where an error has occurred, but when all of these layers have different time settings it's virtually impossible to dissect the initial cause.
So what do you set on each of these? That brings us to the question “What is time?”. A while ago I watched a video of a presentation by Jordan Sissel (who works full-time on an open-source project called LogStash). One of his slides outlines the differences in timestamp formats:
So besides the different time formats you encounter in the different layers of the infrastructure, imagine what it takes to first get all of these back into a human-readable format and then align them across the entire stack.
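
To make that a little more concrete, here is a minimal sketch (my own illustration, not something from the presentation) of what that normalisation step can look like; the sample timestamps and formats are assumptions:

```python
from datetime import datetime, timezone

# A few timestamp styles you typically meet across the stack (syslog,
# ISO 8601 with an offset, epoch seconds). The sample values are made up.
samples = [
    ("Jan 05 14:03:07", "%b %d %H:%M:%S"),                 # syslog: no year, no zone
    ("2013-01-05T14:03:07+1100", "%Y-%m-%dT%H:%M:%S%z"),   # ISO 8601 with UTC offset
    ("1357354987", None),                                  # epoch seconds
]

def to_utc(value, fmt):
    """Parse one timestamp string and return it as an aware UTC datetime."""
    if fmt is None:                       # epoch seconds are already UTC-based
        return datetime.fromtimestamp(int(value), tz=timezone.utc)
    parsed = datetime.strptime(value, fmt)
    if parsed.tzinfo is None:             # syslog carries no year or zone: assume UTC, year 2013
        parsed = parsed.replace(year=2013, tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)

for value, fmt in samples:
    print(f"{value:26} -> {to_utc(value, fmt).isoformat()}")
```
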
While we're not always in a position to modify the time/date format, we can at least make sure that the time setting itself is correct. To do that, use NTP and set the correct timezone on every device. That way the clocks in the different layers of the stack stay aligned and correct across the entire infrastructure.
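
And as a quick sanity check on the NTP side, here is a small sketch using the third-party ntplib package (my own assumption, not something any of the vendors ship) to see how far a host's clock has drifted from a reference server:

```python
# pip install ntplib
import ntplib

def clock_offset(server="pool.ntp.org"):
    """Return the local clock offset (in seconds) relative to an NTP server."""
    client = ntplib.NTPClient()
    response = client.request(server, version=3)
    return response.offset

if __name__ == "__main__":
    offset = clock_offset()
    print(f"Local clock is off by {offset:+.3f} seconds")
    if abs(offset) > 1.0:   # arbitrary tolerance; pick what your log correlation needs
        print("Fix NTP on this host before trying to correlate logs.")
```
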
You will help yourself and your support organisation a great deal.
Thanks,
Erwin van Londen

Storage in 2013 and beyond.

It comes as no surprise that a couple of technologies really struck a chord in 2012. Flash disk drives, specifically in flash arrays, have gone mainstream. Still hanging in there are converged networking and, of course, Big Data.

Big Data has become such a hype-word that many people have different opinions and descriptions of it. What it basically boils down to is that too many people have too much stuff hanging around which they never clean up or remove. This undeniably puts a huge burden on IT departments, many of which have only one answer: add more disks……..

So where do we go from here? There is no denying that exabyte-scale storage environments are becoming more common in companies and government agencies. The question is what is being done with all these “dead” bytes. Will they ever be used again? What is being done to safeguard this information?

Some studies show that the cost of managing this old data outgrows the benefit one could obtain from it. The problem is that there are many really useful and beneficial pieces of data in this enormous pile of bits, but none of them are classified and tagged as such. This makes the “delete all” option a no-go, yet the cost of actually determining what needs to be kept can run neck-and-neck with the cost of keeping it all. We can be fairly certain that neither option will hold up in the long run. Something has to be done to actually harvest the useful information and finally get rid of the old stuff.

The classification process needs to be driven by heuristic, mathematical determination. A mouthful, but what it actually means is that every piece of information needs to be tagged with a value. Let's call this value X. X is generated from business requirements related to the type of business we're actually in. While indexing the entire information base, certain words, values and other pieces of information appear more often than others. These indicators can cause a certain information type to obtain a higher value than others and therefore rank higher (i.e. the X value increases). Of course you can have multiple information streams, where one is by definition larger and causes its data to appear more frequently, in which case it ranks higher even though the actual business value is not that great, while you might have a very small project going on that could generate a fair chunk of your annual revenue. To identify those, the data needs to be tagged with a second value called Y. And last but not least we have age: since all data loses accuracy and therefore value over time, the data needs to be tagged with a third value called Z.
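
To illustrate the idea, here is a rough sketch of such an X/Y/Z tagging step. This is purely my own illustration; the key terms, stream weights and decay rate are invented, not an existing algorithm:

```python
import math
from collections import Counter
from datetime import date

# Hypothetical business-relevant terms and per-stream weights (the Y factor).
KEY_TERMS = {"contract", "revenue", "patent", "customer"}
STREAM_WEIGHT = {"finance": 1.0, "engineering": 0.8, "archive-dump": 0.2}

def score(text, stream, created, today=date.today(), half_life_days=365):
    """Return (X, Y, Z): term relevance, stream weight, age decay."""
    words = Counter(text.lower().split())
    x = sum(words[t] for t in KEY_TERMS) / max(sum(words.values()), 1)
    y = STREAM_WEIGHT.get(stream, 0.5)
    age_days = (today - created).days
    z = math.exp(-math.log(2) * age_days / half_life_days)   # value halves every year
    return x, y, z

x, y, z = score("Customer contract covering 2013 revenue targets",
                "finance", date(2012, 6, 1))
print(f"X={x:.2f} Y={y:.2f} Z={z:.2f} combined={x * y * z:.3f}")
```
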

Based upon these three values we can create three-dimensional value maps which can be projected onto different parts of the organization. This outlines and quantifies where the most valuable data resides and where the biggest savings can be obtained, which allows for a far more effective process of data elimination and therefore huge cost savings. The mathematical algorithms already exist, but they have not been applied in this way, so such a technology does not exist yet. Maybe something for someone to pick up. Good luck.
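
Continuing that sketch (again purely illustrative, with made-up numbers), projecting the tags onto parts of the organization is little more than an aggregation step:

```python
from collections import defaultdict

# Hypothetical per-document tags: (department, X, Y, Z)
tagged = [
    ("finance",     0.40, 1.0, 0.70),
    ("finance",     0.05, 1.0, 0.10),
    ("engineering", 0.30, 0.8, 0.90),
    ("archive",     0.01, 0.2, 0.02),
]

value_map = defaultdict(list)
for dept, x, y, z in tagged:
    value_map[dept].append(x * y * z)

# Departments with a low average combined value are the cheapest place to
# start deleting; high averages mark data worth keeping and harvesting.
for dept, values in sorted(value_map.items(), key=lambda kv: sum(kv[1]) / len(kv[1])):
    print(f"{dept:12} avg value {sum(values) / len(values):.3f} over {len(values)} items")
```
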

As for the logical side of the Big Data question, in 2013 we will see a bigger shift towards object-based storage. If you go back to one of my first articles you will see that I predicted this shift six years ago. Data objects need to become smarter and more intelligent by nature in order to increase value and manageability. Once they do, we can think of all sorts of smarts to utilize the information to the fullest extent.

As for the other, more tangible technologies, my take on them is as follows.

Flash

Flash technology will continue to evolve, and price erosion will at some point allow it to compete with normal disks, but that is still a year or two away. R&D costs will continue to weigh heavily on the price point of these drives and arrays, so as the uptake of flash continues it will level out. Reliability has largely been tackled by advances in redundancy and cell technology, so that argument can mostly be negated. My take on dedicated flash arrays is that they are too limited in their functionality and therefore overpriced. The only benefit they provide is performance, but that is easily countered by the existing array vendors adding dedicated flash controllers and optimised internal data paths to their equipment. The benefit is that these can use the same proven functions that have been available for years. One of the most useful and cost-effective is of course auto-tiering, which delivers optimum usage and the most bang for your buck.

Converged networking

Well, what can I say. If designed and implemented correctly it just works, but many companies are simply not ready from a knowledge standpoint to adopt it. There are just too many differences in processes, knowledge and many other points of conflict between the storage and networking folks. The arguments I aired in my previous post have still not been countered by anyone, so my standpoint has not changed. If reliability and uptime are among your priorities, don't start with converged networking. Of course there are some exceptions. If, for instance, you want to buy a Cisco UCS, that system runs converged networking internally from front to back, but there is not really much that is configurable, so the “oops” factor is significantly minimised.

Processor and overall system requirements

More and more focus will be placed upon power requirements and companies will be forcing vendors to the extreme to reduce the amount of watts their systems suck from the wall socket. Software developers are strongly encouraged (and that’s an understatement) to sift through their code and check if optimizations can be achieved in this area.

Legal

A short look at the tech news sites in 2012 and you've probably noticed an increase in court cases where people are held responsible for breaches in the confidentiality and availability of information infrastructures. This will become a real battle with outsourced cloud services in the very near future. Cloud providers like AWS, Rackspace and Microsoft disclaim all responsibility with respect to service/data availability and uptime in their terms of use and contracts, but just how far can they stretch this? At some point courts will hold these providers accountable, and then you will see a major shift in the requirements these providers put on their infrastructures. All of this will of course have significant ramifications on pricing, and cloud expectations will have to be adjusted.

Hope you all have a good 2013 and we’ll see if some of these will gain some uptake.

Regards,
Erwin

Brocade Fabric Watch – The most underutilised feature

Many of the customer cases I handle are related to poor connectivity. A connectivity problem can be caused by unclean connectors, broken cables or faulty SFPs (see one of my earlier blog posts).
Although the switches are capable of identifying physical issues and subsequently notifying administrators, this is hardly ever followed up. Very often an acute issue lingers for days before an administrator starts investigating, and in many cases only because a server admin starts complaining about SCSI errors, IO time-outs or very poor performance.
So how do we prevent this from happening? Well, for starters make sure that your environment is clean. By this I mean you should make sure that connectors are not exposed to dust or other types of contamination. Secondly, handle cables with care. I've seen many cases where cables were under so much tension that Jimi Hendrix would have been able to compose one of his finest works on them. Although modern fibre cables are fairly rugged and can handle a fair amount of tension, try not to test this. As a last bullet point I would suggest keeping an eye on optical transmit power levels. As you most likely know, lasers do not have an infinite lifetime and their transmission power decreases over time. At some point the receiving end of a link is no longer able to reliably distinguish between on and off, at which point the 8b/10b (or 64b/66b) encoding/decoding algorithm starts to detect bit flips and discards the affected transmission words. The upper and lower power requirements are published in the data sheets, so as soon as one of these values approaches its lower limit, replace the SFP.
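
As a minimal sketch of what such a check could look like, assuming the RX power readings have already been collected from the switch's SFP diagnostics into a simple structure (the port names and threshold values below are made up; use the figures from your SFP's data sheet):

```python
# Hypothetical RX power readings in dBm per port, e.g. collected from
# the switch's SFP diagnostics output by some other script.
rx_power_dbm = {
    "port0": -3.2,
    "port7": -13.8,   # suspiciously weak signal
    "port12": -5.9,
}

# Made-up thresholds for illustration only.
RX_LOW_WARN = -10.0
RX_LOW_ALARM = -14.0

for port, dbm in sorted(rx_power_dbm.items()):
    if dbm <= RX_LOW_ALARM:
        print(f"{port}: {dbm} dBm - below alarm threshold, replace the SFP/cable now")
    elif dbm <= RX_LOW_WARN:
        print(f"{port}: {dbm} dBm - degrading, schedule a replacement")
```
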

Now you might argue that if you have 10,000 ports in your fabric you have other things to worry about than checking SFP power values every day. The workload on storage admins was not decreasing the last time I looked, and that is unlikely to change in the years to come.

Fortunately you don't have to. Both Brocade and Cisco provide options to monitor each individual component. For many years Brocade has shipped one of the best embedded management tools there is, namely Fabric Watch (FW). FW is not an active management tool per se; the underlying goal is to provide a self-healing, self-protecting framework that monitors, alerts and takes action on events that might have implications for overall fabric behaviour.

A single dodgy link can have significant implications for overall fabric behaviour and can, and will, impact many hosts depending on topology and traffic pattern. FW allows you to set thresholds on many items in a switch, from SFP power values and link errors to temperature readings. Each of these items can be configured with characteristics like above, below, in-between or changed values, and each can be given a time frame.

Now let's take the example of a link with intermittent errors. Your applications tolerate a certain error ratio per time frame that they can recover from, so if one or two IO errors per hour are seen by the OS or application it will simply re-send the read or write command and all is well. If, however, the errors start to increase, you might end up with the application going down or even data corruption. If you have configured FW to send a notification when the number of errors increases beyond the application's tolerance, you will be able to take action and investigate where the problem might be.

Now there is another issue: you're most likely not sitting behind a console 24×7 or monitoring emails during your holidays. So even if you do get notified there is a good chance you will not notice it. (I know I won't when I'm playing golf. :-))
This calls for more drastic measures, and that is also covered by FW. If a certain threshold goes beyond a warning level and reaches a critical level, FW allows you to take action right away. This is a feature Brocade calls port fencing. Basically what it means is that when this threshold is met the switch simply disables the port to prevent it from propagating the problem further up into the fabric. This is REALLY an area you SHOULD investigate. It can save you from having issues showing up all over the fabric.
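
To make the warning/critical/fence logic concrete, here is a small, purely illustrative model of it. This is not Fabric Watch's actual implementation or CLI; the thresholds and time window are invented:

```python
from collections import deque
import time

class PortErrorMonitor:
    """Count link errors in a sliding window and escalate roughly like port fencing."""

    def __init__(self, warn=5, critical=20, window_s=60):
        self.warn, self.critical, self.window_s = warn, critical, window_s
        self.events = deque()
        self.fenced = False

    def record_error(self, now=None):
        if now is None:
            now = time.time()
        self.events.append(now)
        # Forget errors that fell outside the time window.
        while self.events and now - self.events[0] > self.window_s:
            self.events.popleft()
        count = len(self.events)
        if count >= self.critical and not self.fenced:
            self.fenced = True
            print(f"CRITICAL: {count} errors in {self.window_s}s - fence (disable) the port")
        elif count == self.warn:
            print(f"WARNING: {count} errors in {self.window_s}s - notify the administrator")

# Simulate a port going from a few stray errors to an error storm.
monitor = PortErrorMonitor()
t = 0.0
for _ in range(25):
    monitor.record_error(now=t)
    t += 0.5
```
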

The title of this blog post unfortunately reflects the status as it stands with most of the installed base of fabrics, and the reason seems to be that administrators have a problem with software deciding on disruptive actions like disabling ports. My argument is that such a port is already in a degraded state and is also causing problems on other links throughout the fabric. If you don't know what you're looking for and have this large 10,000-port fabric, it will take you a significant amount of time before you know what's going on. In that time many, many more hosts and applications can and will suffer significant performance and other problems, which might create some serious overtime for many people.

Regards,
Erwin

Save money managing storage effectively

How many tools do you use to manage your storage environment?

On average the storage admin uses 5 tools to manage a storage infrastructure.

1. Host tools (for getting host info like used capacity, volume configs, etc.)
2. HBA tools (some OSes don't have a clue about those)
3. Fabric tools (extremely important)
4. Array tools (even more important)
5. Generic tools (for getting some sort of consolidated overview; mainly Excel worksheets :-))


Sometimes storage management is performed like below:

As you can see, things can become quite complicated when storage infrastructures grow and you'll need a bigger whiteboard. By the time you have an enterprise storage infrastructure you'll probably need a bigger building and a lot more whiteboards. 🙂

So what is the best way?

One word:

Integration, Integration, Integration, Integration.

The database boys have known this for a long time: don't store the same information twice. This is called Database Normalization.
The same thing applies to storage management tools. Make sure that you use tools that have an integrated framework which leverages as many components as possible.

If you're using Hitachi kit it's pretty easy. Their entire Hitachi Storage Command Suite works together and shares a single configuration repository. The best thing is that they do this across their entire array product line, from SMS to USP-V, and even for arrays from two generations ago (so that includes the 9900 and 9500 series), so other modules can make use of it. The other benefit is that you only have to deploy a single host agent to obtain host info like volumes, filesystems and capacity usage and have that shared across all the different products. Be aware that there is no silver bullet for managing all storage from a single pane of glass if you have a heterogeneous environment. Every vendor has its own way of doing things, and although SNIA is making good progress with SMI-S, it still lacks many of the nifty features storage vendors have released lately.