Presentation Pt4: The art of service dependencies

Here we are at the final part of this little series. Throughout, I’ve discussed my standard tactic for linking objects so that you can easily scale your solution and never get “stuck” making large, sweeping config modifications to implement small changes.

The diagram below is the complete overview of all those little sections joined together.

[Diagram: Complete]

The arrow direction denotes who defines what: the start of the arrow is the object defining the connection. In the first article I specified that this layout is a fairly rough guideline that won’t be suitable for everyone, and even if you do follow this guideline there are going to be exceptions to the rule.

It’s important that when these exceptions crop up, you:

1. Ensure that the solution is easily reusable if you need to make the same exception again.
2. Label it clearly. Nothing is more of a nightmare than trying to find that one unlabeled needle of an exception in the haystack of your configuration files.

The last things to cover are service dependencies and parenting. The rules for parenting are simple:

1. Parent everything.

I can’t stress how important this is. When a parent fails, Nagios can mark everything behind it as unreachable rather than down, so not only will parenting prevent a horrific storm of alerts if you lose a core switch somewhere, it will also ensure that your Nagios server doesn’t kill itself trying to ascertain the status of a couple of hundred devices while simultaneously trying to deliver hundreds of emails, SMS messages, event handlers, etc.

In most cases it’s really simple to do! To work out the parent of a device, simply work out which device is the next hop towards Nagios; your network team should be able to tell you this, no problem. What if there are two devices for redundant pathing? Nagios fully supports multiple parents for a host, so it will happily accept two devices as the parents, and as long as one of them is available Nagios will assume it still has a valid path.
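
As a rough sketch of redundant parenting, the host definition below lists both upstream switches in its parents directive. The host names, the address and the main-host-template template are made up for illustration; only the parents directive itself comes from Nagios.

```cfg
# Hypothetical access switch with two upstream paths towards Nagios.
# Host names, address and template name are assumptions for this example.
# Either core switch can carry traffic back to Nagios, so both are listed as parents.
define host {
  host_name sw-access-01
  use main-host-template
  address 192.0.2.10
  parents sw-core-01,sw-core-02
}
```

With both parents listed, Nagios should only treat sw-access-01 as unreachable when sw-core-01 and sw-core-02 are both down.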

There is one exception to the “simple to do”: when you have a fully redundant network segment, i.e. a ring or mesh of switches that will direct traffic in either direction around the loop, there is no clear path of lineage back to Nagios. For these scenarios you just have to bite the bullet and define one path as your primary path, otherwise Nagios will detect a configuration loop and fail to load.
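
To illustrate picking a primary path, here is a minimal sketch of a three-switch ring hanging off one core switch, parented in a single direction only. All host names, addresses and the main-host-template template are hypothetical.

```cfg
# Hypothetical ring: sw-core-01 - sw-ring-01 - sw-ring-02 - sw-ring-03 - sw-core-01
# The path via sw-ring-01 is chosen as primary; names and addresses are assumptions.
define host {
  host_name sw-ring-01
  use main-host-template
  address 192.0.2.21
  parents sw-core-01
}

define host {
  host_name sw-ring-02
  use main-host-template
  address 192.0.2.22
  parents sw-ring-01
}

define host {
  host_name sw-ring-03
  use main-host-template
  address 192.0.2.23
  parents sw-ring-02
}

# Do not also list sw-ring-02 as a parent of sw-ring-01: the two hosts would
# then be each other's ancestors, which is the loop Nagios refuses to load.
```

The trade-off is that the parent tree no longer mirrors the physical redundancy, but every host still has exactly one unambiguous line of parents back to Nagios.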

The last thing I want to touch on is what I call an “indirectly monitored service”. What I mean by that is this: say you want to monitor the CPU of a virtual machine. You don’t get the stats from the VM because it doesn’t have the true values; instead you get that information from the hypervisor.

Chances are you don’t want to attach 200 CPU checks to the hypervisor; you want each one associated with the host it relates to. Hence the service is attached to the required host, but the checks are done indirectly via another device. With that in mind, consider this scenario for VMware ESX monitoring:

[Diagram: Good]

This depicts the previously described scenario, but what happens if our vCenter server fails?

[Diagram: Bad]

Spam. Spam is what happens if our vCenter server fails, and nobody likes spam. The solution is rather simple: create a ping service for the vCenter server and then use service dependencies to ensure that your CPU and memory checks are dependent on the server being alive. You could even go one step further and make them dependent on the API being contactable, but let’s keep it simple for now.

An example config for a service dependency looks like this:

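# Ping check on the vCenter server; the checks below depend on it.
define service {
  host_name vSphereServer
  service_description Ping dependency
  use main-service-template
  check_command check_ping!100,80%!200,90%
  register 1
}

# CPU check attached to every VM in the hostgroup, collected indirectly via check_esx.
define service {
  service_description CPU Usage
  use main-service-template
  hostgroup_name srv-v-windows
  check_command check_esx!CPU
  contact_groups cg-main
  register 1
}

# Skip CPU checks (on warning, unknown, critical or pending) and notifications
# (on warning, unknown or critical) while the ping service is failing.
define servicedependency {
  dependent_hostgroup_name srv-v-windows
  dependent_service_description CPU Usage
  host_name vSphereServer
  service_description Ping dependency
  inherits_parent 1
  execution_failure_criteria w,u,c,p
  notification_failure_criteria w,u,c
  dependency_period 24x7
}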
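
For the “one step further” idea of depending on the API being contactable, a sketch along the same lines might look like the following. It assumes a check_tcp command that takes the port as its first argument (as in the sample command definitions shipped with Nagios) and that the API answers on port 443; the “API reachable” service name is made up for this example.

```cfg
# Hypothetical TCP check against the vCenter API port.
# Assumes a check_tcp command whose first argument is the port number.
define service {
  host_name vSphereServer
  service_description API reachable
  use main-service-template
  check_command check_tcp!443
}

# Hold back CPU checks and notifications while the API check is unknown or critical.
define servicedependency {
  dependent_hostgroup_name srv-v-windows
  dependent_service_description CPU Usage
  host_name vSphereServer
  service_description API reachable
  execution_failure_criteria u,c
  notification_failure_criteria u,c
}
```

This sits alongside the ping dependency rather than replacing it, so a dead server and a dead API are both caught before the CPU checks start failing.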

So there we are. I hope this series helps someone get a basic grasp of the architectural intricacies of Nagios config design. Thanks for reading!

Links

Presentation Pt1: User Permissions

Presentation Pt2: Users and Contacts

Presentation Pt3: Hosts and Services

Presentation Pt4: The art of service dependencies
