I browsed through some of the great Tech Field Day videos and came across the discussion “What is an Ethernet Fabric?”, which covered Brocade’s version of a flat layer 2 Ethernet network based on their proprietary “ether-fabric protocol”. At a certain point the discussion turned to the usual “Storage vs. Network” debate, and it still seems there is a lot of mistrust between the two camps. (As rightfully there should be. :-))
For the video of the “EtherFabric” discussion you can have a look >>here<<
Convergence between storage and networking has been wishful thinking ever since parallel SCSI entered its third phase, in which the command set was separated from the physical infrastructure and became serialised over a “network” protocol called FibreChannel.
The biggest problem is not the technical side of the convergence. Numerous options already exist that allow one protocol to be carried inside another. The SCSI protocol can be transported over FibreChannel and over TCP/IP (iSCSI), and even the less advanced ATA protocol can be carried directly over Ethernet.
One thing that is always forgotten is the purpose for which these different networks were created. Ethernet was developed in the 1970s by Robert Metcalfe at Xerox (yes, the same company that also invented the GUI as we know it today) to let two computers “talk” to each other and exchange information. Along the way the DARPA-developed TCP/IP protocol was bolted on top to provide some reliability and to make a broader spectrum of services, including routing, possible. Still, the intention has always been to have two computer systems exchange information over a serialised signal.
The storage side of the story is that it has always been developed to talk to peripheral devices, and these days the dominant two protocols are SCSI and FICON (SBCCS over FibreChannel). So let’s take SCSI. The name alone tells you its intent: Small Computer System Interface. It was designed for a parallel bus, 8 bits wide, had a 6 meter distance limitation and could shove data back and forth at 5MB/s. By the nature of the interface it was a half-duplex protocol, and thus a fair chunk of time was spent on arbitration, select, attention and other phases.
At some point (parallel) SCSI just ran into a brick wall with respect to speed, flexibility, performance, distance etc. So the industry came up with the idea to serialise the SCSI dataflow. To do this, the protocol standards had to be decoupled from the physical requirements SCSI had always had. This was achieved with SCSI-3. In itself that was nothing new, but from that moment on it was possible to bolt SCSI onto a serialised protocol. The only protocols available at that time were Ethernet, token ring, FDDI and some other niche ones. These were all considered inferior and not fit for the purpose of transporting a channel protocol like SCSI. A reliable, high-speed interface was needed, and as such FibreChannel was born. Some folks at IBM were working on this new serial transport protocol, which had all the characteristics anyone would want in a datacentre: high speed (1Gbit/s; remember, Ethernet at that time was stuck at 10Mb/s and token ring at 16Mb/s), both optical and copper interfaces, long distance, reliable (i.e. no frame drop) and very flexible towards other protocols. This meant that FibreChannel was able to carry other protocols, both channel and network, including IP, HIPPI, IPI, SCSI, ATM etc.
The FC4 layer was made in such a flexible way that almost any other protocol could easily be mapped onto this layer and have the same functionality and characteristics that made FC the rock solid solution for storage.
So instead of using FC for IP transport in the datacentre, some very influential vendors went the other way around and started to bolt FC on top of Ethernet, which resulted in the FCoE standard. So we now have a three-decade-old protocol (SCSI) bolted on top of a two-decade-old protocol (FC) bolted on top of a four-decade-old protocol (Ethernet).
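To make the “bolted on top of” picture a bit more concrete, here is a rough sketch that adds up the bytes of one full-size FCoE frame, layer by layer. The field sizes are my assumptions, recalled from the FC and FCoE framing rules rather than taken from the discussion above, so check them against the standards before relying on them. The names `LAYERS`, `frame_size` and `wrapper_bytes` are made up for this illustration.

```python
# Rough byte budget of one full-size FCoE frame, outermost layer first.
# All sizes are assumptions recalled from the FC / FCoE framing rules.
LAYERS = [
    ("Ethernet header (dst MAC, src MAC, EtherType)", 14),
    ("802.1Q VLAN tag", 4),
    ("FCoE header (version, reserved bits, SOF)", 14),
    ("FC frame header", 24),
    ("FC payload carrying the SCSI data (maximum)", 2112),
    ("FC CRC", 4),
    ("FCoE trailer (EOF, reserved bits)", 4),
    ("Ethernet FCS", 4),
]

# Largest classic Ethernet frame with a VLAN tag (1500-byte MTU).
STANDARD_MAX_TAGGED_FRAME = 1522


def frame_size(layers):
    """Total number of bytes on the wire for one frame."""
    return sum(size for _, size in layers)


def wrapper_bytes(layers):
    """Bytes spent on wrapping, i.e. everything except the FC payload."""
    return sum(size for name, size in layers if "payload" not in name)


if __name__ == "__main__":
    for name, size in LAYERS:
        print(f"{size:5d}  {name}")
    print(f"{frame_size(LAYERS):5d}  total")
    print(f"{wrapper_bytes(LAYERS):5d}  of which wrapping")
    print("fits a standard tagged Ethernet frame:",
          frame_size(LAYERS) <= STANDARD_MAX_TAGGED_FRAME)
```

Under these assumed sizes a full FC frame does not fit in a standard Ethernet frame, which is one reason FCoE links are normally configured for larger (“baby jumbo”) frames.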
All in all this increases the complexity of datacentre design and operations, and the troubleshooting time in case something goes wrong. You can argue that costs will be reduced because you only need single CNAs, switchports etc. instead of a combination of HBAs and NICs, but think about what happens when you lose that single link: you lose both storage and network at the same time. It also means that manageability is reduced to zero and you will have to be physically at the system in order to resuscitate it. (Don’t start arguing that you should have a separate management interface and network, because that totally negates the argument of any financial saving.)
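The single-link argument can be put into numbers with a back-of-the-envelope sketch. It assumes that link failures are independent of each other, and the failure probability used in the example is made up purely for illustration; the function names are hypothetical as well.

```python
def p_lose_both_separate(p_storage_link, p_network_link):
    """Chance that storage AND network are down at the same time when
    they run over separate links (HBA and NIC).
    Assumes the two links fail independently of each other."""
    return p_storage_link * p_network_link


def p_lose_both_converged(p_converged_link):
    """With one converged link (single CNA) a link failure takes out
    storage and network together."""
    return p_converged_link


if __name__ == "__main__":
    p = 0.01  # made-up chance that a given link is down at some moment
    print("separate links :", p_lose_both_separate(p, p))
    print("converged link :", p_lose_both_converged(p))
```

With the same made-up figure for every link, losing everything at once is far more likely on the single converged link than on two separate ones, which is the point of the paragraph above.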
From a topology perspective, and in the famous “Visio” drawings, the design may look simpler; however, when you start drawing the logical connections, in addition to the configurable steps that are possible with a converged platform, you will notice a significant increase in connectivity.
I’m a support engineer with one of the major storage vendors and I see on a day-to-day basis the enormous amount of information that comes out of a FibreChannel fabric, whether it’s related to configuration errors, design issues causing congestion and over-subscription, bugs, network errors on FCIP links or problems with the physical infrastructure. Look at this vertically, where applications, operating systems, volume managers, file-systems, drivers etc., all the way down to the individual array spindle, can influence the behaviour of an entire storage network, and you’ll see why you do not want to duplicate that by introducing Ethernet networks in the same path as the storage traffic.
I’m also extremely surprised that during the RFE/RFP phase for a new converged infrastructure almost no emphasis is placed on troubleshooting capabilities and knowledge. Companies hardly ask themselves whether they have enough expertise to manage and troubleshoot this kind of infrastructure. Storage networks have been around for over 15 years now, and I still get a huge number of questions which touch on the most basic knowledge of these networks. Some call themselves SAN engineers, yet they’ve dealt with this kind of equipment for less than 6 months and the only thing that is “engineered” is the day-to-day operation of provisioning LUNs and zones. As soon as a zone commit doesn’t work for whatever reason, many of them are absolutely clueless and support cases are opened immediately. Now extrapolate this to include Ethernet networks and converged infrastructures, with numerous teams who each manage their piece of the pie in a different manner, and you will sooner or later come to the conclusion that convergence might seem great on paper, but an enormous amount of effort goes into a multitude of things spanning many technologies, groups, operational procedures and others I haven’t even touched on. (Security is one of them. Who determines which security policies will be applied to which part of the infrastructure? How will this work on shared and converged networks?)
Does this mean I’m against convergence? No, I think it’s the way to go, as was virtualization of storage and OSes. The problem is that convergence is still in its infancy, and many companies, who often have a CAPEX-driven purchase policy, are blind to the operational issues and risks. Many things need to be fleshed out before this becomes really “production ready” and before the employees who keep your business data on a knife’s edge are knowledgeable and confident that they master it to the full extent.
My advice for now:
1. Keep networks and storage isolated. This spreads risk, isolates problems and improves recoverability in case of disasters.
2. Familiarise yourself with these new technologies. Obtain knowledge through training and provide your employees with a lab where they can try things out. Books and webinars have never been a good replacement for one-on-one, instructor-led training.
3. Grow towards an organisational model where operations are standardised and each team follows the same principles.
4. Do NOT expect your suppliers to adopt or know these operational procedures. The vendors have thousands of customers, and a hospital requires far different methods of operation than an oil company. You are responsible for your infrastructure and nobody else. The support organisation of your supplier deals with technical problems; they cannot fix your work methods.
5. Keep in touch with where the market is going. What looks to become mainstream might be obsolete next week. Don’t put your eggs in one basket.
Once more, I’m geek enough to adopt new technologies but some should be avoided. FCoE is one of them at this stage.
Hope this helps a bit in making your decisions.
Comments are welcome.
Erwin van Londen