Writeup from a Google SRE on alerting/monitoring. Very well thought out.
Pages should be urgent, important, actionable, and real.
- emphasis on reducing noise levels
- emphasis on end-to-end, black-box, symptom-based alerting rather than cause-based alerting (I assume there is still enough monitoring/metrics in place to quickly diagnose the cause from the symptom)
- a daily report can be a good channel for non-critical but time-sensitive alerts, particularly on causes (disk getting relatively full, unusually large numbers of slow queries, etc.)
- “Every alert should be tracked through a workflow system.” In other words, not just dumped into an IRC channel or email list.
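One way to picture the split between pages and the daily report is a small triage function. This is only a rough sketch of how I read the writeup: the `Alert` fields, the `channel` function, and the channel names are all made up for illustration, and sending everything else to a ticket is my assumption, not something the writeup or Hound prescribes.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    """Hypothetical alert description; the field names are illustrative only."""
    name: str
    urgent: bool
    important: bool
    actionable: bool
    real: bool
    time_sensitive: bool = False


def channel(alert: Alert) -> str:
    """Pick a delivery channel for an alert.

    Only alerts that are urgent, important, actionable, and real page a human.
    Real, time-sensitive alerts that are not urgent go to the daily report.
    Everything else becomes a ticket (an assumption), so it is still tracked.
    """
    if alert.urgent and alert.important and alert.actionable and alert.real:
        return "page"
    if alert.real and alert.time_sensitive:
        return "daily_report"
    return "ticket"


site_down = Alert("site returning errors", True, True, True, True)
disk_filling = Alert("disk getting relatively full", False, True, True, True,
                     time_sensitive=True)
```

Here `channel(site_down)` gives `"page"` and `channel(disk_filling)` gives `"daily_report"`.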
This is useful for thinking about Hound. Overall, a lot of effort has gone into making all of Hound’s alerts “urgent, important, actionable, and real,” but some fall short. E.g., quite a few current alerts aren’t really actionable (monitoring of various LITO services, Wardenclyffe -> PCP failures); we have them because we’d rather know when something we depend on fails before our users do.
Things to consider adding to Hound based on this:
- dependency chain: link symptoms to causes so we can silence the symptom alerts when we know the cause
- different alert targets, so we can set up alerts that only go to the people who can actually act on them, rather than dumping everything to ccnmtl-sysadmin and training people to ignore a lot of them (“somebody else’s problem”).
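A rough sketch of how the two ideas could fit together, assuming each alert rule carries a list of known causes and a list of targets. None of this is Hound's actual configuration or code: `AlertRule`, `caused_by`, `targets`, `notifications`, and the example rule and team names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class AlertRule:
    """Hypothetical rule: who gets notified and which alerts are known causes."""
    name: str
    targets: List[str]
    caused_by: List[str] = field(default_factory=list)


def notifications(rules: Dict[str, AlertRule], firing: Set[str]) -> Dict[str, List[str]]:
    """Map each target to the firing alerts it should hear about.

    A firing alert is silenced when any of its known causes is also firing,
    so only the people who own the cause get notified.
    """
    out: Dict[str, List[str]] = {}
    for name in sorted(firing):
        rule = rules[name]
        if any(cause in firing for cause in rule.caused_by):
            continue  # symptom of a cause we already know about
        for target in rule.targets:
            out.setdefault(target, []).append(name)
    return out


# Illustrative rules only; names and targets are invented.
RULES = {
    "db_down": AlertRule("db_down", targets=["dba"]),
    "site_errors": AlertRule("site_errors", targets=["web-team"],
                             caused_by=["db_down"]),
}
```

With both alerts firing, `notifications(RULES, {"db_down", "site_errors"})` notifies only `dba` about `db_down`. With `site_errors` firing alone, `web-team` is notified, since no known cause explains it.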