web farm alerting when over 30% of servers fail

I have a web farm currently made up of about 7 servers, with round-robin load balancing.  I don't want alerts if just 1 or 2 servers fail, but if 3 or more go offline I want to create an incident in Cherwell.  We also add more servers during high traffic periods, so next month there could be over a dozen servers in the farm.  Any suggestions for how I can do this?

  • One way to suppress alerts when only 1 or 2 servers fail, but still alert when 3 or more fail, is to set up an Advanced Event Definition.  This lets you evaluate a check, for example ping status, across a chosen set of servers; if the sum of all the servers' ping evaluations is less than a defined value, the event definition fails.  The event definition can then be tied to a notification that generates an incident in Cherwell.

    Here is a screenshot of an example of an Advanced Event Definition:

    In this scenario, each server in the web farm is added as a variable, and each 'Ping Status' evaluation is either 1 or 0, where 0 is a failed status and 1 is OK.  If the sum of these evaluations is less than three, the event definition fails.  The threshold depends on the farm size: with the seven servers in the question, 3 or more failures means a sum of 4 or less.  The Advanced Event Definition can be extended with additional servers as the farm grows or shrinks with traffic.

  • In reply to DevinB:

    Great answer.  Another option is to use groupcheck operations to count the number of pings; this can be blueprinted as well.
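
For anyone who wants to check the arithmetic behind the sum-of-ping-statuses approach described in the first answer, here is a minimal Python sketch.  It is not the monitoring product's syntax; the function name `should_alert` and the parameter `max_failed_percent` are made up for illustration.  It assumes each server reports 1 for OK and 0 for failed, as in the answer, and uses a percentage instead of a fixed count so the rule still holds when servers are added.

```python
def should_alert(ping_statuses, max_failed_percent=30):
    """Return True when more than max_failed_percent of servers have failed.

    ping_statuses: list of 1 (OK) or 0 (failed), one entry per server.
    Both names are hypothetical; this only illustrates the threshold logic.
    """
    total = len(ping_statuses)
    if total == 0:
        # Nothing to evaluate; treated here as "no alert" (an assumption).
        return False
    ok_sum = sum(ping_statuses)   # the "sum of evaluations" from the answer
    failed = total - ok_sum
    # Integer comparison avoids floating-point rounding at the boundary.
    return failed * 100 > total * max_failed_percent


# Seven servers: 2 down is about 29% (no alert), 3 down is about 43% (alert).
print(should_alert([1, 1, 1, 1, 1, 0, 0]))
print(should_alert([1, 1, 1, 1, 0, 0, 0]))
```

With a fixed "sum less than N" rule, N has to be revisited every time the farm size changes; the percentage form avoids that, at the cost of needing the total server count as an input.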