
Why convergence still doesn’t work and how you put your business at risk

I browsed through some of the great TechField Day videos and came across the discussion "What is an Ethernet Fabric?", which covered Brocade's version of a flat layer-2 Ethernet network based on their proprietary "ether-fabric" protocol. At a certain point the discussion led to the usual "storage vs. network" debate, and it still seems there is a lot of mistrust between the two camps. (And rightfully so. :-))

For the video of the "EtherFabric" discussion, have a look >>here<<.


Convergence between storage and networking has been wishful thinking ever since parallel SCSI entered its third generation, in which the command set was separated from the physical interface and could be serialised over a "network" protocol called Fibre Channel.

The biggest problem is not the technical side of the convergence. Numerous options already exist for carrying one protocol over another: SCSI can be transmitted over Fibre Channel (FCP) and over TCP/IP (iSCSI), and even the less advanced ATA protocol can be transferred directly over Ethernet (AoE).

One thing that is always forgotten is the purpose for which these different networks were created. Ethernet was developed in the 1970s by Robert Metcalfe at Xerox (yes, the same company that also invented the GUI as we know it today) to let two computers "talk" to each other and exchange information. Along the way the DARPA-developed TCP/IP protocol was bolted on top of it to provide some reliability and a broader spectrum of services, including routing. Still, the intention has always been to have two computer systems exchange information over a serialised signal.

The storage side of the story is that these protocols were always developed to talk to peripheral devices, and these days the two dominant ones are SCSI and FICON (SBCCS over Fibre Channel). So let's take SCSI. The acronym already tells you its intent: Small Computer System Interface. It was designed for a parallel bus, 8 bits wide, had a 6-metre distance limitation and could shove data back and forth at 5 MB/s. By the nature of the interfaces it was a half-duplex protocol, so a fair chunk of time was spent on arbitration, selection, attention and other phases. At some point parallel SCSI simply ran into a brick wall with respect to speed, flexibility, performance and distance.

So the industry came up with the idea of serialising the SCSI data flow. To do this, the protocol standards had to be unlinked from the physical requirements SCSI had always had, which was achieved with SCSI-3. In itself that was nothing new, but from that moment on it was possible to bolt SCSI onto a serialised transport. The only protocols available at the time were Ethernet, Token Ring, FDDI and some other niche ones, and these were all considered inferior and unfit for transporting a channel protocol like SCSI. A reliable, high-speed interface was needed, and so Fibre Channel was born. Some folks at IBM were working on this new serial transport protocol, which had all the characteristics anyone would want in a datacentre: high speed (1 Gbit/s, remember Ethernet at the time was stuck at 10 Mbit/s and Token Ring at 16 Mbit/s), both optical and copper interfaces, long distance, reliability (i.e. no frame drops) and great flexibility towards other protocols. Fibre Channel was able to carry other protocols, both channel and network, including IP, HIPPI, IPI, SCSI and ATM. The FC-4 layer was made so flexible that almost any other protocol could easily be mapped onto it and gain the same functionality and characteristics that made FC the rock-solid solution for storage.

So instead of using FC for IP transport in the datacentre, some very influential vendors went the other way around and bolted FC on top of Ethernet, which resulted in the FCoE standard. We now have a three-decade-old protocol (SCSI) bolted on top of a two-decade-old protocol (FC), bolted on top of a four-decade-old protocol (Ethernet).
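To make that layering concrete, here is a rough sketch of what a single SCSI read ends up wrapped in on an FCoE link. It is purely illustrative: the header contents are mocked, the FCP command layer is simplified away, and only the FCoE Ethertype (0x8906) and the 24-byte FC header size are meant to reflect the real standards.

```python
# Illustrative sketch of FCoE encapsulation nesting (not wire-accurate):
# a SCSI command ends up inside an FC frame, inside an FCoE frame,
# inside an Ethernet frame. Header contents below are mock placeholders.

def scsi_read10(lba: int, blocks: int) -> bytes:
    """Build a simplified 10-byte SCSI READ(10) CDB (opcode 0x28)."""
    return (bytes([0x28, 0x00]) + lba.to_bytes(4, "big")
            + bytes([0x00]) + blocks.to_bytes(2, "big") + bytes([0x00]))

def fc_frame(payload: bytes) -> bytes:
    """Prepend a mock 24-byte FC header and append a 4-byte CRC placeholder."""
    fc_header = b"\x00" * 24   # R_CTL, D_ID, S_ID, TYPE, SEQ_ID/OX_ID etc. omitted
    return fc_header + payload + b"\x00" * 4

def fcoe_frame(fc: bytes) -> bytes:
    """Wrap the FC frame in a mock FCoE header (version/reserved + SOF ... EOF)."""
    return b"\x00" * 13 + b"\x2e" + fc + b"\x41" + b"\x00" * 3

def ethernet_frame(fcoe: bytes) -> bytes:
    """Wrap everything in an Ethernet II frame using the FCoE Ethertype 0x8906."""
    dst, src, ethertype = b"\x0e" * 6, b"\x02" * 6, b"\x89\x06"
    return dst + src + ethertype + fcoe

cdb = scsi_read10(lba=0x1000, blocks=8)
wire = ethernet_frame(fcoe_frame(fc_frame(cdb)))
print(f"A {len(cdb)}-byte SCSI CDB becomes a {len(wire)}-byte Ethernet frame "
      f"after three layers of encapsulation")
```

Three layers of headers and delimiters around a 10-byte command is exactly the kind of nesting a troubleshooter has to peel back when something goes wrong.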

This all increases the complexity of datacentre design, operations and troubleshooting when something goes wrong. You can argue that costs will be reduced because you only need a single CNA and switch port instead of a combination of HBAs and NICs, but think about what happens when you lose that single link: you lose both storage and network at the same time. Manageability is then reduced to zero, and you will need to be physically at the system to resuscitate it. (And don't argue that you should have a separate management interface and network, because that totally negates any financial saving.)

From a topology perspective, and in the famous "Visio" drawings, the design may seem simpler. However, once you start drawing the logical connections, in addition to the configuration options a converged platform makes possible, you will notice a significant increase in connectivity.

I'm a support engineer with one of the major storage vendors and I see, on a day-to-day basis, the enormous amount of information that comes out of a Fibre Channel fabric, whether it's related to configuration errors, design issues causing congestion and over-subscription, bugs, network errors on FCIP links or problems with the physical infrastructure. Look at this vertically, where applications, operating systems, volume managers, file systems, drivers and so on, all the way down to the individual array spindle, can influence the behaviour of an entire storage network, and you'll see why you do not want to compound that by introducing Ethernet networks into the same path as the storage traffic.
I'm also extremely surprised that during the RFI/RFP phase for a new converged infrastructure almost no emphasis is placed on troubleshooting capabilities and knowledge. Companies hardly ask themselves whether they have enough expertise to manage and troubleshoot this kind of infrastructure. Storage networks have been around for over 15 years now, and I still get a huge number of questions that touch on the most basic knowledge of these networks. Some call themselves SAN engineers even though they've dealt with this kind of equipment for less than six months, and the only thing that is "engineered" is the day-to-day provisioning of LUNs and zones. As soon as a zone commit doesn't work, for whatever reason, many of them are absolutely clueless and support cases are opened immediately.

Now extrapolate this to include Ethernet networks and converged infrastructures, with numerous teams who each manage their piece of the pie in a different manner, and you will, sooner or later, come to the conclusion that convergence may look great on paper, but an enormous amount of effort goes into a multitude of things spanning many technologies, groups, operational procedures and other areas I haven't even touched on. (Security is one of them: who determines which security policies are applied to which part of the infrastructure, and how will this work on shared and converged networks?)

Does this mean I'm against convergence? No, I think it's the way to go, just as virtualisation of storage and operating systems was. The problem is that convergence is still in its infancy, and many companies, often with a CAPEX-driven purchasing policy, are blind to the operational issues and risks. Many things need to be fleshed out before this becomes truly "production ready" and before the employees who keep your business data on a knife's edge are knowledgeable and confident enough to master it to the full extent.

My advice for now:

1. Keep networks and storage isolated. This spreads risk, isolates problems and improves recoverability in case of disaster.
2. Familiarise yourself with these new technologies. Obtain knowledge through training and provide your employees with a lab where they can experiment. Books and webinars have never been a good replacement for instructor-led training.
3. Grow towards an organisational model where operations are standardised and each team follows the same principles.
4. Do NOT expect your suppliers to adopt or know these operational procedures. Vendors have thousands of customers, and a hospital requires far different methods of operation than an oil company. You, and nobody else, are responsible for your infrastructure. Your supplier's support organisation deals with technical problems; it cannot fix your work methods.
5. Keep in touch with where the market is going. What looks set to become mainstream might be obsolete next week. Don't put all your eggs in one basket.


Once more, I'm geek enough to adopt new technologies, but some should be avoided at this stage, and FCoE is one of them.


Hope this helps a bit in making your decisions.

Comments are welcome.

Regards,
Erwin van Londen

Beyond the Hypervisor as we know it

And here we are again. I've been busy doing some internal stuff for my company, so the tweets and blogs were put on low maintenance.

Anyway, VMware launched its new version of vSphere, and the amount of attention and noise it received is overwhelming, both positive and negative. Many customers feel ripped off by the new licensing scheme, whereas from a technical perspective most admins seem to agree the enhancements are fabulous. Being a techie myself, I must say the new and updated features are extremely appealing, and I can see why many admins would like to upgrade right away. I assume that's only possible after the financial hurdles have been cleared.

So why this subject? "VMware is not going to disappear, and neither are MS or Xen," I hear you say. Probably not, but let's take a step back and look at why these hypervisors were initially developed. The goal was to run multiple applications on one server without library dependencies that could conflict with, disturb or corrupt another application. VMware was not the initiator of this concept; the birthplace of it all was IBM's mainframe platform. Even back in the 60s and 70s they had the same problem: two or more applications had to run on the same physical box, and due to conflicts in libraries and functions IBM found a way to isolate them and came up with the concept of virtual instances running on a common platform operating system, MVS, which later became OS/390 and is now z/OS.

When the open-systems world, spearheaded by Microsoft, took off in the 80s and 90s, it more or less recreated the same mess IBM had seen before. (IBM did actually learn something and pushed that into OS/2, but that OS never really took off.)
When Microsoft came up with so-called Dynamic Link Libraries, this was heaven for application developers: they could now dynamically load a DLL and use its functions. However, they did not take into account that only one DLL with a given function could be loaded at any one point. So when DLLs gained new functionality, and therefore new revision levels, they were sometimes not backward compatible and very nasty conflicts would surface. We were back to square one.

And along came VMware. They did for the Windows world what IBM had done many years before and created a hypervisor that lets you run multiple virtual machines, each isolated from the others, with no possibility of binary conflicts. And they still make good money off it.

However, the application developers have not been sitting still either. They too have seen that they can no longer use the development model they relied on for years. Every self-respecting developer now programs with massive scalability and distributed systems in mind, based on cloud principles. Basically this means applications are built almost exclusively on web technologies with JavaScript (via node.js), HTML5 or other high-level languages. These applications are then deployed onto distributed platforms like OpenStack, Hadoop and one or two others, which provide application containers where the application is isolated and has to abide by the functionality of the underlying platform. This is exactly what I wrote almost two years ago: the application itself should be virtualised instead of the operating system. (See here)

When you take this into account, you can imagine that the hypervisors as we know them now will at some point render themselves useless. The operating system itself is no longer important, and it doesn't matter what these cloud platforms run on; the only things that matter are scalability and reliability. Companies like VMware, Microsoft, HP and others are not stupid and see this coming. This is also why they are building massive data centres to accommodate the customers who adopt this technology and start hosting these applications.

Now here comes the problem with this concept: SLAs. Who is going to guarantee you availability when everything is out of your control? Outages at Amazon EC2, Microsoft's cloud email service BPOS, VMware's Cloud Foundry and Google's Gmail service show that even these extremely well-designed systems run into Murphy at some point, and the question is whether you want to depend on these providers for business continuity. Be aware that you have no say in how and where your application is hosted; that is entirely at the discretion of the hosting provider. Again, it's all about risk assessment versus cost versus flexibility and whatever other arguments you can think of, so I leave that up to you.

So where does this take you? Well, you should start thinking about your requirements: does my business need this cloud-based flexibility, or should I adopt a more hybrid model where some applications are built and managed by my own staff?

Either way, you will see more and more applications being developed for internal, external and hybrid cloud models. This brings us back to the subject line: the hypervisors as we know them today will cease to exist. It might take a while, but the software world is like a diesel train: it starts slowly, but once it's rolling it's almost impossible to stop, so be prepared.

Kind regards,
Erwin van Londen

Maintenance

I keep wondering why companies don't maintain their infrastructure. It seems to be the exception rather than the rule to come across software and firmware that is newer than six months old. True, I admit it's a beefy task to keep all of this lined up, but then again you know this in advance, so why isn't there a plan to get it sorted every six months, or even more frequently?


In my day-to-day job I see a lot of information from many customers around the world. During implementation phases there is often a lot of focus on interoperability and software certification stacks: does server A with HBA B work with switch Y on storage platform Z? This is very often a rubber-stamping exercise that goes out the door with the project team. The moment a project is done and dusted, this very important piece is neglected for many reasons: time constraints, risk aversion, change freezes, fear that something will go wrong, and so on. However, taking chances by not properly maintaining your software is like walking a tightrope over the Grand Canyon in wind gusts over 10 Beaufort: you might get a long way, but sooner or later you will fall.

Vendors do actually fix things, although some people might think otherwise. Remember that a storage array contains around 800,000 pieces of hardware and a couple of million lines of software that make the thing run and store data for you. If you compared that to a car and applied the same maintenance schedule, you would be asking the car to run for 120 years non-stop without changing the oil, filters, tyres, exhaust and so on. Would you trust your children in such a car after two years, or even after six months? I wouldn't, but businesses still seem to take that chance.

So is it fair for vendors to ask (or demand) that you maintain your storage environment? I think it is. Back in the days when I had my feet wet in the data centres (figuratively speaking, that is), I managed a storage environment of around 250 servers, 18 FC switches and 12 arrays, a pretty beefy environment in those days. I set myself a threshold that no firmware or driver would exceed a four-month lifetime. That meant that if newer code came out from a particular vendor, it was always implemented before those four months were over.
I spent around two days every quarter generating the paperwork (change requests, vendor engineer appointments and so on) and another two days getting it all implemented. Voilà, done. The more experienced you become with this process, the better and faster it goes.
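Such a cadence needs surprisingly little tooling. The sketch below shows the kind of check I mean; the inventory entries, device names and the 120-day threshold are assumptions made up for illustration, not output from any real vendor tool (which is where this data would normally come from).

```python
# Minimal sketch of a firmware-age check supporting a fixed maintenance cadence.
# Inventory entries and the 120-day threshold are illustrative assumptions only.
from datetime import date

MAX_AGE_DAYS = 120  # roughly the four-month lifetime mentioned above

# (device, installed firmware level, release date of that level)
inventory = [
    ("fc-switch-01", "6.4.1a",    date(2011, 1, 15)),
    ("array-03",     "70-04-31",  date(2010, 9, 30)),
    ("server-112",   "hba-2.1.0", date(2011, 5, 2)),
]

today = date.today()
for device, firmware, released in inventory:
    age = (today - released).days
    status = "OK" if age <= MAX_AGE_DAYS else "PLAN UPGRADE"
    print(f"{device:12} {firmware:10} {age:5d} days old -> {status}")
```

Run something like this at the start of every quarter and the change requests practically write themselves.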

Another problem lies with storage service providers, or service providers in general. They depend on their customers to get all the approvals stamped, signed and delivered, which is often seen as a huge burden, so they just let it go and eat the risk of something going wrong. The problem is that during RFP/RFI processes the customers do not ask for this, and the sales people are not interested since it might become a show-stopper for them, so nothing around ongoing infrastructure maintenance is documented or contractually written into delivery or sales documents.

As a storage or service provider I would turn this "obligation" to my advantage and say: "Dear customer, this is how we maintain our infrastructure, so that we, and you, are assured of immediate and accurate vendor support and have an up-to-date infrastructure that minimises the risk of anything going downhill at some point in time."

I've seen it happen: high-severity issues with massive outages get logged with vendors, and those vendors come back and say, "Oh yes sir, we know of this problem and fixed it a year ago, but you haven't implemented that level of firmware; yours is far outdated."

How would that look to your customers if you're a storage/service provider, or if you're a bank whose online banking system got some coverage in the newspapers?

Anyway, the moral is “Keep it working by maintaining your environment”.

BTW, for SNWUSA Spring 2011 I wrote a SNIA tutorial, "Get Ready for Support", which outlines some of the steps to take before contacting your vendor's support organisation. Just log on to www.snia.org/tutorials; it will be posted there after the event.

Cheers,
Erwin van Londen