Nice title, especially if you’re in the lumberjack business. 🙂 All kidding aside, support engineers rely 100% on logs, and if we ask for them please don’t hesitate to provide them.
It has been voiced quite a few times that some people see this as some sort of “delay tactic” from support engineers, but nothing could be further from the truth. The main reason is that we don’t get any benefit from it: it doesn’t solve your problem, nor does it relieve us of providing you a timely solution to your problem.
“So what is the reason you keep asking for more logs?” you’d say. Well, before I can answer that you need to be aware that the most interesting part of a system log is the actual current status of a system. That information is a static snapshot and can only be obtained by either manually executing commands or providing system dumps (a bit of a short term for a collection of information from a certain machine, whether that is a server, switch, storage array etc.) on a regular basis at certain intervals. That provides the support engineer with a set of counters and values which may indicate whether any process or equipment status changes over time.
Remember that counters and values provide a point-in-time status, in contrast to logs, which provide a sequence of events that happened in the past, most often (hopefully) with a time-stamp. Counter values do not show when they were accumulated, and as such a new baseline is required in order to establish whether a certain piece of equipment or process is subject to issues.
As an example, when you execute “porterrshow” on a Brocade switch it gives you output like this:
            frames    enc    crc    crc    too    too    bad    enc   disc   link   loss   loss   frjt   fbsy     c3timeout    pcs
         tx     rx     in    err  g_eof   shrt   long    eof    out     c3   fail   sync    sig                   tx     rx    err
192:   5.8m 428.7k      0      0      0      0      0      0      0      0      0      0      1      0      0      0      0      0
207:   3.6g  40.1k      0      0      0      0      0      0      0      0      0  83.6k      0      0      0      0      0      0
Does this mean that port 207 is currently bad? I don’t know and I can’t tell. It may well be that the 83.6k sync losses were observed 6 months ago. If no baseline has been taken at a known time-stamp you cannot determine whether this is still an issue. The exception is when these errors coincide with logged events that indicate certain thresholds have been crossed. Only then is it safe to assume there is something wrong.
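To make the baseline idea a bit more concrete, here is a minimal Python sketch that compares two counter snapshots taken at known times and reports only what increased in between. Everything in it is an assumption for illustration: the snapshot dictionaries, the counter names `loss_sync` and `crc_err`, the helper functions, and the reading of the k/m/g suffixes as decimal thousands, millions and billions. Since the displayed values are abbreviated, small increases can hide in the rounding.

```python
from datetime import datetime

# Assumption: k/m/g suffixes are decimal multipliers (thousand/million/billion).
SUFFIXES = {"k": 1_000, "m": 1_000_000, "g": 1_000_000_000}


def parse_counter(text):
    """Convert an abbreviated counter such as '83.6k' to an approximate integer."""
    text = text.strip().lower()
    if text and text[-1] in SUFFIXES:
        return round(float(text[:-1]) * SUFFIXES[text[-1]])
    return int(text)


def counter_delta(baseline, current):
    """Return only the counters that increased between two snapshots."""
    deltas = {}
    for name, value in current.items():
        increase = parse_counter(value) - parse_counter(baseline.get(name, "0"))
        if increase > 0:
            deltas[name] = increase
    return deltas


# Hypothetical snapshots of port 207, taken a day apart.
baseline = {"taken": datetime(2024, 1, 1, 9, 0),
            "counters": {"loss_sync": "83.6k", "crc_err": "0"}}
current = {"taken": datetime(2024, 1, 2, 9, 0),
           "counters": {"loss_sync": "83.6k", "crc_err": "0"}}

delta = counter_delta(baseline["counters"], current["counters"])
elapsed = current["taken"] - baseline["taken"]
print(f"Counters that increased over {elapsed}: {delta or 'none'}")
```

In this made-up case nothing moved between the two snapshots, so the 83.6k would be old news rather than a current problem.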
A similar example is when a case is opened because of poor performance. The problem is that “poor” cannot be defined on its own. Has the performance ever been good, when did that change, why did it change, did certain actions contribute to the problem, etc.? If a CPU reports being 80% busy, is that good or bad? We in support cannot tell, as we don’t have a history of the system, and this busy rate might be perfectly normal compared to the baseline of the system. Looking at the number blindly might give you the feeling there is a performance issue, but actually you can’t tell.
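The same point can be put in a few lines of Python. This is only a sketch under stated assumptions: the two histories are invented numbers, `deviates_from_baseline` is a hypothetical helper, and "more than three standard deviations from the historical mean" is just one simple way to define abnormal, not a rule any vendor prescribes.

```python
from statistics import mean, stdev


def deviates_from_baseline(history, current, tolerance=3.0):
    """True if `current` lies more than `tolerance` standard deviations
    away from the mean of the historical samples."""
    spread = stdev(history)
    if spread == 0:
        return current != mean(history)
    return abs(current - mean(history)) > tolerance * spread


# Hypothetical daily CPU busy percentages for two different systems.
busy_history = [78, 81, 79, 83, 80, 77, 82]   # always runs at around 80%
idle_history = [21, 19, 24, 20, 22, 18, 23]   # normally idles at around 20%

print(deviates_from_baseline(busy_history, 80))  # False: 80% is business as usual
print(deviates_from_baseline(idle_history, 80))  # True: 80% is a real change here
```

The reading of 80% is identical in both calls; only the history tells you whether it is worth a support case.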
The second question we most often get is a request for an RCA (Root Cause Analysis). Depending on the actual problem this may sometimes require engineering resources if some weird internal code bug is playing up. When it comes to finding a fault in the operation and/or management of the equipment, the information needs to be there, so treat the equipment/system as a crime scene and collect all logs before you do anything else. Time is also of the essence here, as many logs, especially in embedded systems, are circular, meaning they hold a certain number of entries before they wrap around and start overwriting the oldest ones. If that happens the information is gone and no RCA can be provided. What we’ve seen fairly often is that administrators get very creative and cook up their own shell/perl/python/go/etc. scripts and execute commands against a device, whether that is a switch or some other piece of equipment. This not only generates an uncontrollable load on the limited CPU power of these systems, it also ensures that audit logs wrap very quickly, which makes it impossible to tell who did what, and when.
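How quickly a circular log loses its history is easy to show in Python with a `collections.deque` that has a fixed maximum length. This is a toy model, not how any particular device implements its audit log: the 5-entry buffer size, the log messages and the once-a-minute script are all assumptions made for illustration.

```python
from collections import deque

# Assumption: a tiny 5-entry buffer. Real devices hold far more entries,
# but a full buffer behaves the same way: the oldest entry is dropped.
audit_log = deque(maxlen=5)

# The entry an RCA would need.
audit_log.append("09:00 admin: zone configuration changed")

# A hypothetical home-grown monitoring script logging in every minute.
for minute in range(1, 6):
    audit_log.append(f"09:{minute:02d} script: login, ran status command")

evidence_present = any("zone configuration" in entry for entry in audit_log)
print(f"Entries kept: {len(audit_log)}, original change still logged: {evidence_present}")
```

After only five script logins the one entry that mattered has been pushed out, which is exactly what happens on a real device, just on a larger scale.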
Support engineers are not on a sales commission, so we do not benefit directly from the amount of money you pay for the equipment and/or services. As with most technical staff, it’s the overall company performance that counts for us, and we get measured by customer satisfaction. We have no reason whatsoever to use some sort of “delay tactic” towards you or anyone else. We don’t gain anything by it and it would only create a negative perception on your side. Obviously neither of us would see that as beneficial.
Be aware that certain circumstances may impose delays. I’ve handled cases where the severity was set at 1 only to be countered by the fact that the equipment had long passed the “End-of-Support” date. Other examples may be that equipment is out of normal warranty and no longer under maintenance from the vendor you purchased it from.
When support cases are raised they come with a certain expectation. You want to be contacted asap and know your case is being worked on. What we often see is that as soon as an engineer is appointed to help resolve it, massive amounts of information are dumped into the case, with logs from every switch in your SAN as well as host dumps, array dumps etc. In addition to that we also get case descriptions that read as if we are expected to know your entire infrastructure. Things like “We have an issue on our Oracle cluster and would like to get support on this” without any further explanation of which systems comprise this Oracle cluster, how it is connected network-wise, SAN-wise etc. It then takes a long time to figure out what is actually going on, get a feeling for what the problem is, and build up some sort of picture of your environment.
The better your description is, including time-lines, actions, side-effects etc., the quicker support can make an assessment of where to look and skip the unimportant information. Otherwise the old “needle in a haystack” is going to play up again.
One other thing would be timely feedback. Sometimes an action plan is provided with detailed steps on how to tackle a certain issue, after which all communication seems to die and no response is received any more to questions support engineers might have. In some cases, due to the lack of response, they are more or less inclined to believe the problem is resolved and close the case, after which a surprised question arrives asking why the case was closed. So in short, please be aware that support engineers would like to be kept in the loop, even if it only concerns organizational things like scheduling maintenance windows. At least they then know if, when and how to pay attention to the case. As the saying goes, “It takes two to tango”, and a support engagement is no different.
Experience and manuals
The majority of support engineers have a fair amount of run-time under their belt. Often you can see this visually by their grey beards or hair. It’s not always clear if that is due to age or simply to working in a support role. :-) The expertise that has been built over a few decades also brings a lot of “gut feeling”. So if they see something that is wrong but does not really match the description or problem area for which the case was opened, I would still pay attention to what they say and follow their guidelines, even though it might sometimes look somewhat strange.
Reading a manual should be second nature to every sysadmin. If you are on a plane to a conference of some sort, or on a train as part of your daily commute to work, simply grab a manual for one of the systems you are responsible for. Whether this is an administration manual or the release notes of a new version, it will always be helpful. Much more so than your daily Facebook and Twitter feeds. Also, the support organisations of the respective vendors have neither the bandwidth nor the resources to provide “How-To” action plans for your environment. This really is your responsibility.
Furthermore, no vendor has an infinite number of people in support. The time they can spend on a case extends to the point where the issue is resolved. Requests to review an architectural design or to provide customised code for operational management are not something they can help with. Each vendor (or at least most of them) has some sort of professional services department which can often be of assistance.
I hope I have made it a bit clearer why the support organisation often requests seemingly the same information numerous times, and also why the actual support starts with you as a sysadmin.
As Tom Cruise said in the movie “Jerry Maguire”: “Help me help you”, and nowhere is that more true than in support.