My Philosophy on Alerting - Google Docs

Writeup from a Google SRE on alerting/monitoring. Very well thought out.

Pages should be urgent, important, actionable, and real.

  • emphasis on reducing noise levels
  • emphasis on end-to-end, black-box, symptom-based alerting rather than cause-based alerting (I assume there is still enough monitoring/metrics in place to quickly diagnose the cause from the symptom)
  • a daily report can be a good channel for non-critical but time-sensitive alerts, particularly on causes (disk getting relatively full, unusually large numbers of slow queries, etc.)
  • “Every alert should be tracked through a workflow system.” not just dumped into an IRC channel or email list.
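
To make the split between pages and the daily report concrete, here is a rough Python sketch of the triage described in the points above. The `Alert` fields, the `channel` function, and the channel names are all made up for illustration; none of this comes from the original writeup or from any real tool.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    name: str
    symptom: bool     # user-visible symptom rather than an underlying cause
    urgent: bool      # needs attention now, not tomorrow
    actionable: bool  # a human can actually do something about it


def channel(alert: Alert) -> str:
    """Pick a delivery channel for an alert (hypothetical policy)."""
    if alert.symptom and alert.urgent and alert.actionable:
        return "page"          # wake somebody up
    if alert.actionable:
        return "daily_report"  # e.g. a disk slowly filling up
    return "dashboard"         # keep it visible, but notify nobody
```

Under this policy a cause-level alert such as a disk getting relatively full lands in the daily report, and only urgent, actionable symptoms page anyone.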

This is good for thinking about Hound. Overall, a lot of effort has gone into making all of Hound’s alerts “urgent, important, actionable, and real”, but some fall short. For example, quite a few current alerts aren’t really actionable (e.g., monitoring of various LITO services, Wardenclyffe -> PCP failures); we keep them because we’d rather know that something we depend on has failed before our users do.

Things to consider adding to Hound based on this:

  • dependency chain: link symptoms to causes so we can silence the symptom alerts when we know the cause
  • different alert targets, so we can set up alerts that only go to the people who can actually act on them, rather than dumping everything to ccnmtl-sysadmin and training people to ignore a lot of them (“somebody else’s problem”).
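
A minimal Python sketch of how these two ideas might fit together. This is not how Hound is implemented or configured: the alert names, the email addresses, and the `CAUSES`/`TARGETS` tables are hypothetical placeholders.

```python
# Hypothetical alert names and recipients; not Hound's real config.
CAUSES = {
    # symptom alert -> cause alerts that would explain it
    "site 5xx rate": ["db down", "app server down"],
    "slow page loads": ["db down"],
}
TARGETS = {
    # alert -> the people who can actually act on it
    "db down": ["dba@example.com"],
    "site 5xx rate": ["webteam@example.com"],
}
DEFAULT_TARGET = ["sysadmin@example.com"]


def notifications(failing: set[str]) -> dict[str, list[str]]:
    """Map each alert that should fire to its recipients.

    A symptom alert is silenced when any of its known causes is also failing.
    """
    out = {}
    for alert in sorted(failing):
        if any(cause in failing for cause in CAUSES.get(alert, [])):
            continue  # the cause is already known, so skip the symptom alert
        out[alert] = TARGETS.get(alert, DEFAULT_TARGET)
    return out
```

With "db down", "site 5xx rate", and "slow page loads" all failing at once, only "db down" produces a notification, and it goes only to the address listed for it. A symptom with no failing cause still fires, falling back to the default list when it has no specific target.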

ccnmtl devops