Preventing Information Overload in SAN Management

The standpoint that you must be notified of every event in a SAN environment, at all costs, may need to be reviewed. As many of you know, both Broadcom (Brocade) and Cisco switches let you customise various thresholds in their respective operating systems, FOS and NX-OS.

Brocade uses the MAPS (Monitoring and Alerting Policy Suite) framework, which follows a policy-based setup.

You have groups, categories and rules which are combined in a policy. You can have multiple policies configured but only one can be active at a time. The active policy runs some sort of tumbler which acts on events that are happening in a switch or fabric, and if one of them hits a rule, the action configured for that rule is triggered. There are a few of these actions to choose from.
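To make the relationship between rules, policies and the single active policy concrete, here is a minimal Python sketch. It is a toy model, not MAPS itself: the class names, the group name ISL_PORTS, the rule names, the action labels and the night-time threshold are all hypothetical and only serve to illustrate the idea.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Rule:
    name: str
    group: str                          # hypothetical group name, e.g. "ISL_PORTS"
    condition: Callable[[float], bool]  # returns True when a sample hits the rule
    actions: list[str]                  # hypothetical action labels


@dataclass
class Policy:
    name: str
    rules: list[Rule] = field(default_factory=list)


class Switch:
    """Holds many policies, but evaluates only the active one."""

    def __init__(self) -> None:
        self.policies: dict[str, Policy] = {}
        self.active: Optional[Policy] = None

    def add_policy(self, policy: Policy) -> None:
        self.policies[policy.name] = policy

    def enable(self, name: str) -> None:
        # Enabling a policy replaces whichever policy was active before.
        self.active = self.policies[name]

    def handle_sample(self, group: str, value: float) -> list[tuple[str, str]]:
        """Return (rule name, action) pairs for every rule this sample hits."""
        if self.active is None:
            return []
        return [
            (rule.name, action)
            for rule in self.active.rules
            if rule.group == group and rule.condition(value)
            for action in rule.actions
        ]


day = Policy("day_policy", [
    Rule("isl_util_warn", "ISL_PORTS", lambda pct: pct >= 60, ["RASLOG"]),
    Rule("isl_util_crit", "ISL_PORTS", lambda pct: pct >= 80, ["RASLOG", "EMAIL"]),
])
night = Policy("night_policy", [
    Rule("isl_util_crit", "ISL_PORTS", lambda pct: pct >= 95, ["RASLOG", "EMAIL"]),
])

switch = Switch()
switch.add_policy(day)
switch.add_policy(night)
switch.enable("day_policy")
print(switch.handle_sample("ISL_PORTS", 85.0))
```

With the day policy enabled, an 85% sample hits both rules. After switching to the night policy, the same sample hits nothing, because only the rules of the active policy are evaluated.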

The problem

The main issue is that different events may happen during different times of the day. These events may cause issues during normal business hours but are expected and fully acceptable during other time-frames. When the circumstances that trigger these rules occur, you may well see an enormous number of events showing up in the event logs, monitoring applications, ticketing systems and other places where you basically don’t want them.

The information overload problem will then mainly result in three things:

  1. You may be missing actual issues as they are buried somewhere in the rest of the haystack.
  2. Complacency kicks in, which will at some stage result in high-impact issues being either missed or ignored.
  3. The wrong conclusions are drawn because events are correlated incorrectly.

An example

When a traffic pattern on an ISL shows an average usage of 40% during the day you can configure MAPS to alert you with a warning event when this reaches 60% and a critical event when it reaches 80%. That is a very valid configuration and allows you to keep a close eye on traffic patterns that are out of normal boundaries.
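The two thresholds from this example can be written down as a small Python function. This is only a sketch of the logic, not MAPS syntax. The function name and severity labels are hypothetical, and I assume here that an event is raised when the threshold is reached (greater than or equal).

```python
def classify_utilisation(percent: float,
                         warning: float = 60.0,
                         critical: float = 80.0) -> str:
    """Map an ISL utilisation sample (in percent) to a severity level."""
    if not 0.0 <= percent <= 100.0:
        raise ValueError(f"utilisation out of range: {percent}")
    if percent >= critical:
        return "CRITICAL"
    if percent >= warning:
        return "WARNING"
    return "OK"


# The 40% daytime average stays quiet; 60% and 80% cross the thresholds.
for sample in (40.0, 60.0, 79.9, 80.0):
    print(sample, classify_utilisation(sample))
```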

The observations during other times of the day may be very different, for example when backup applications or data warehouse processes start kicking in. You can expect the warning and critical thresholds to be crossed at various points during these workloads. Seeing events pop up every minute will for sure annoy you in the morning when you get back and have to either acknowledge or delete each individual event. If such events are also set to open tickets in your service handling tools you will for sure become grumpy. These are the moments where complacency starts, as very often a “Yeah, Yeah… <Ctrl-A> > <Del>” keyboard sequence is seen because you simply don’t have the time to look at each and every one of them. At some stage you will miss critical events, which will negatively impact your response to these issues.

So what to do

There are a few things you can adjust.

  1. Create different policies, each containing its own set of rules and action items, which get activated during different times of the day.
  2. Modify rule-sets in place in existing policies and distribute these across the switches during different times of the day.
  3. Create so-called “quiet times” in rules or globally to prevent event generation based on the same rule within a certain time-frame.

You can make a business case for all three, but the complexity and maintenance associated with the first two make them somewhat of an administrative burden, and mistakes between the different policy configurations are easily made.

Based on the example I outlined before, the easiest thing to do is to add a “-qt” parameter to the respective rules. This ensures you will only get notified when the rule first triggers, and potentially again after the “quiet time” has expired if the same condition is still being triggered. The unit of the “-qt” value depends on the rule’s time-base parameter. If a rule contains a trigger that monitors a certain event or status once every hour, the “-qt” value is also expressed in hours.
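The effect of a quiet time can be sketched in a few lines of Python. This models only the behaviour described above (notify on the first trigger, then stay silent until the quiet time has expired). It does not claim to be how FOS implements it, and the class and method names are hypothetical.

```python
from typing import Optional


class QuietTimeRule:
    """Suppress repeat notifications of one rule for `quiet_time` units."""

    def __init__(self, quiet_time: float) -> None:
        self.quiet_time = quiet_time           # same unit as the timestamps used
        self.last_notified: Optional[float] = None

    def on_trigger(self, now: float) -> bool:
        """Call whenever the rule condition is met; True means 'notify'."""
        if self.last_notified is None or now - self.last_notified >= self.quiet_time:
            self.last_notified = now
            return True
        return False                           # still inside the quiet time


# The condition is met every minute for three hours; quiet time is 60 minutes.
rule = QuietTimeRule(quiet_time=60)
notified_at = [minute for minute in range(180) if rule.on_trigger(minute)]
print(notified_at)
```

In this run the rule is triggered 180 times but only three notifications leave the switch, at minutes 0, 60 and 120.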

What’s the drawback

If the “-qt” parameter is set to suppress events for, say, 8 hours, you will also be notified only once during the daytime, even when the actual event is ongoing. If for whatever reason you miss that first event, you may be too late to the party to identify the problem in time.

Another option is to use Rules-on-Rules (RoR). These do not monitor events or statuses of groups or objects but instead check how often and in which sequence other rules trigger. This function acts more as a filter and allows you to link different action items to it than the ones on the base rule. You can, for example, have the base rule simply log a RASLOG event but have the RoR rule do an SNMP trap, email or syslog notification. That way you can still keep track of all events without getting overloaded with continuous messages in your management systems.
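The Rule-on-Rule idea can be illustrated with a sliding-window counter in Python. This is a conceptual sketch under my own assumptions: the class name, the threshold and the window length are hypothetical, and the real RoR semantics are described in the MAPS manual.

```python
from collections import deque


class RuleOnRule:
    """Escalate only when a base rule fires more than `threshold` times
    within a sliding window of `window` time units."""

    def __init__(self, threshold: int, window: float) -> None:
        self.threshold = threshold
        self.window = window
        self.hits = deque()                    # timestamps of recent base-rule hits

    def on_base_trigger(self, now: float) -> bool:
        """Record one base-rule hit; True means 'run the RoR actions'."""
        self.hits.append(now)
        # Forget hits that have dropped out of the window.
        while now - self.hits[0] >= self.window:
            self.hits.popleft()
        return len(self.hits) > self.threshold


# The base rule (which would only write a log entry) fires every 10 minutes.
ror = RuleOnRule(threshold=3, window=60)
escalated_at = [t for t in (0, 10, 20, 30, 40) if ror.on_base_trigger(t)]
print(escalated_at)
```

Every base-rule hit is still recorded, but the noisier action (trap, email, syslog) only runs from the fourth hit within the hour onwards.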

Suggestion

If you have the time, grab your vendor by the ears and have them file an RFE for a rule parameter that makes a rule active only during certain times of the day, week or month, similar to how you would create firewall filters with the “-m time” parameter in a Linux iptables rule. This is currently not possible in MAPS, so for now such time-based filtering has to be done in the respective management platforms.
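Until such a parameter exists, the time-based filter has to live on the receiving side. As a minimal sketch of what that could look like in a management platform, the Python below forwards an event only when it falls inside a rule’s active window. The function names and the chosen windows are hypothetical.

```python
from datetime import datetime, time


def in_window(moment: time, start: time, end: time) -> bool:
    """True if `moment` lies in [start, end); the window may wrap past midnight."""
    if start <= end:
        return start <= moment < end
    return moment >= start or moment < end


def should_forward(event_time: datetime, active_start: time, active_end: time) -> bool:
    """Forward an event only if its rule is considered active at that time."""
    return in_window(event_time.time(), active_start, active_end)


business_hours = (time(8, 0), time(18, 0))
backup_window = (time(22, 0), time(6, 0))      # wraps past midnight

night_event = datetime(2024, 1, 15, 2, 30)
day_event = datetime(2024, 1, 15, 10, 15)

print(should_forward(night_event, *business_hours))  # False
print(should_forward(day_event, *business_hours))    # True
print(should_forward(night_event, *backup_window))   # True
```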

Have a look at the MAPS manual of FOS 9 to see what is possible and you’ll be surprised by the flexibility. Don’t get lost in the forest though, as creating very complex MAPS policies easily results in conflicting rules, priorities etc. My mantra of KISS (Keep It Stupid Simple) is still valid. Monitor what you need to know, filter out the noise and act upon issues asap so they don’t disrupt anything else in the environment.

Hope this helps. Comments are welcome.

Regards,

Erwin

About Erwin van Londen

Master Technical Analyst at Hitachi Data Systems