Ughh, upgrades, maintenance, risk of downtime, out of hours work and pulling all-nighters or even weekends when your wife and kids start wondering who that strange man is that sometimes shows his face through the back-door.
But what made you decide to do the upgrades on your OS, firmware, applications etc.?
When you work in support as I do, there are bucket-loads of cases opened by customers and partners who run into certain problems only to find out that the issue they hit was already fixed in release X-15 from 3 years ago.
From experience it seems that maintenance runs depend on two factors.
- Something has already gone seriously wrong and panic-mode sets in or
- Some security issue with certain software gets highlighted in the press, triggers a nervous breakdown at all levels of an organization, and emergency fixes/patches need to be applied asap.
The ones where upgrades are scheduled on a regular basis do seem to be spared both, and only very specific circumstances result in an unknown bug being hit, resulting in a controlled roll-back or a quick enough fix. The argument for running an X-1 release is nonsense, as you already know that the version you are installing has bugs in it that have already been fixed. It’s a bit like renewing your car tires with ones that have run 25000 kilometers instead of 35000.
So how do you make decisions on which version you use? That leads me to the documentation called “Release Notes”.
Now release notes are only as good as the detail they describe. New functions and features of software should be highlighted but not described in detail. That’s what we have manuals for, so actually referring to the proper documentation is very helpful. Support for new hardware should also be listed, including how to determine which hardware (and specific revision of hardware) you use or are going to use. Deprecation of support for old hardware (also including revisions) obviously should be outlined as well.
Next are dependencies. No piece of equipment, OS, firmware etc. in an IT infrastructure is isolated from others. The expectation is that the majority of things should simply work with each other, especially when you are utilising protocols and standards that are well established in the industry. Obviously there are many advances in our field of work, so doing interoperability tests for everything that is out there is impossible. Determining if browser X on operating system Y in a 32 bit architecture with SSL libraries 1.2 might work with java version ABC might not seem that difficult, but if you have to put every single combination on the test bench this becomes a very daunting task and is basically not doable. So for that matter only a subset of combinations is tested, with versions/releases that are commonly available at that moment in time.
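To put a number on how quickly that test bench fills up, here is a minimal Python sketch. The component names and versions are made up purely for illustration (they are not from any vendor's support matrix); the only point is that the number of configurations is the product of the number of options per component.

```python
from itertools import product
from math import prod

# Hypothetical component lists; names and versions are illustrative only.
matrix = {
    "browser": ["X 1.0", "X 2.0", "Z 5.1"],
    "os": ["Y 10", "Y 11"],
    "arch": ["32-bit", "64-bit"],
    "ssl_library": ["1.1", "1.2"],
    "java": ["A", "B", "C"],
}

def combination_count(matrix):
    """Number of distinct configurations: the product of the option counts."""
    return prod(len(options) for options in matrix.values())

def all_combinations(matrix):
    """Yield every configuration as a dict with one option per component."""
    keys = list(matrix)
    for values in product(*(matrix[k] for k in keys)):
        yield dict(zip(keys, values))

print(combination_count(matrix))  # 3 * 2 * 2 * 2 * 3 = 72
```

Five components with two or three options each already give 72 configurations. Add one more component with three options and you are at 216, which is why only a subset ever gets tested.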
That brings me to BUGS. No, not the bunny. Software defects are a given just as much as the sun comes up every morning and sets every evening. (Ok, Ok, not counting Arctic circles but you get the drift). These defects are found, tracked, traced, fixed and rolled out in an ongoing cycle of releases. That also means that if you are running software X-15 and all of a sudden you see a bug or security vulnerability that is applicable to your environment, you may not be able to upgrade to version X right away. In almost all cases you will be forced to upgrade sequentially through every intermediate version that is out there, and that then becomes a very labour-intensive task.
So how do you determine if your version is subject to one or more bugs or not? That is also described in the Release Notes of each version.
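As a rough illustration of what that sequential path looks like, the sketch below walks an ordered release list from the version you run to the version you want. The release names simply reuse the X-15 ... X notation from above, and the "no skipping allowed" rule is an assumption; real vendors publish their own supported upgrade paths, which may allow skipping some releases.

```python
# Hypothetical, ordered release train, oldest first: X-15, X-14, ..., X-1, X.
RELEASES = [f"X-{n}" for n in range(15, 0, -1)] + ["X"]

def upgrade_path(current, target, releases=RELEASES):
    """Return the releases to install, in order, to get from current to target,
    assuming every intermediate release has to be installed (no skipping)."""
    start = releases.index(current)  # raises ValueError for an unknown release
    end = releases.index(target)
    if end < start:
        raise ValueError("target is older than the current release")
    return releases[start + 1 : end + 1]

path = upgrade_path("X-15", "X")
print(len(path), path[0], path[-1])  # 15 X-14 X
```

Fifteen maintenance windows to get current, versus one hop per cycle if you had kept up.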
Or is it????
Unfortunately the description of bugs and software irregularities is appalling (if you’re politically correct you might say “Has room for improvement”). There are mainly two reasons for that.
- Software developers are hired for writing software, not documentation and
- Many of the issues touch on proprietary information which the majority of companies don’t allow to be disclosed.
You then get these utterly useless descriptions like the following. (Now this happens to be from a Brocade FOS release note, but I can line up hundreds of bug descriptions like this from all other vendors, pertaining to a vast amount of software, so don’t think you’re off the hook with any other…).
Defect ID: DEFECT000631314
Technical Severity: High
Product: Brocade Fabric OS
Technology Group: Management
Reported In Release: FOS8.0.1
Technology: Fibre Channel Services
Symptom: When a device comes online and logs in, it is unable to communicate to other devices on the same switch.
Condition: This occurs when device logs in with conflicting VVL (Vendor Version Level) version information.
Workaround: Download latest driver for the device.
Recovery: Download latest driver for the device.
Not very helpful, is it? The conditions should describe how to identify if you really hit this issue. You may have made a zoning error resulting in the same phenomenon, you may have configured LUN masking incorrectly in the HBA driver, or you may have made an error in some udev rule(s) preventing a LUN from being mounted or properly discovered.
The severity and probability are also subject to more than one clause. If this is related to a device of which you have only one, and that one is already scheduled to be removed from the environment, the severity is negligible and the probability is low indeed. If by chance you have 1000 of “these devices” in your environment and they are responsible for driving your multi-billion-$$-a-day business, not only is your probability significantly elevated but the severity may also go through the roof. The “Conditions” clause should in that case be very descriptive, so that you are able to determine that probability.
Then the “Reported in release” field. Nothing is more useless than this. It basically doesn’t tell you whether you are excluded or not if you’re running a version that is 2 releases older. The issue might have been lingering for a long time but only got flagged when it happened to show up on that particular version. That defect field should have a full list of all versions that are subject to this issue.
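To make the difference concrete, here is a small Python sketch of a defect record with and without such a list. The `affected_versions` field, the defect IDs and the version names are all hypothetical; this is not how any vendor's release notes are actually structured, just what the check would look like if the information were published.

```python
from dataclasses import dataclass, field

@dataclass
class Defect:
    defect_id: str
    reported_in: str
    # Hypothetical field: every release known to contain the defect.
    affected_versions: list = field(default_factory=list)

def is_affected(defect, running_version):
    """True or False when an affected list is published, None when it is not.

    With only 'reported in' to go on, the honest answer is 'unknown' (None)."""
    if not defect.affected_versions:
        return None
    return running_version in defect.affected_versions

vague = Defect("EXAMPLE-1", reported_in="X-3")
full = Defect("EXAMPLE-2", reported_in="X-3",
              affected_versions=["X-5", "X-4", "X-3"])

print(is_affected(vague, "X-5"))  # None
print(is_affected(full, "X-5"))   # True
print(is_affected(full, "X-2"))   # False
```

With only “Reported in release” the best you can answer is “unknown”. With the full list it becomes a simple lookup.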
It is not that the developers are unwilling: when you get the chance of having a one-on-one with them, they are most often happy to explain in detail what’s going on and how they resolved it.
So for this example how do you determine if you really hit this case?
Short answer: “You can’t”, as the conditions that are mentioned don’t tell you if “the device” in question is the same as the one you have, nor do they tell you with which driver version the problem shows or where to download the fixed version. They also do not tell you how to determine if you actually hit this problem or not.
The above is just an example. The exception is often open-source software, where bug reporting and handling are done on open platforms and you can actually see who is doing what, including all contextual information around it, with the associated Git/Subversion commits linked to it.
So what is the moral of this post? Even if you cannot determine with 100% certainty whether you are subject to one or more bugs, and the release notes are not clear, you are much better off implementing a rigorous maintenance upgrade schedule. In 25 years of data-center maintenance, system administration, technical sales as well as engineering-support I’ve seen a fair few cases where things went wrong with upgrades, but not nearly as many as cases where entire environments went down due to old software. Getting to a current level from the stone-age is much more painful than a quarterly review and maintenance run.
Unfortunately the DevOps mantra of CI/CD is not possible on hardware-based platforms, but given that the majority of vendors are prepping their equipment to enable this, it is very wise to already plan for such a mindset.
Get to it.