Backblaze uses SMART 5, 187, 188, 197 and 198 for determining the failure or potential failure of a hard drive.
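A minimal sketch of how that check might look, assuming the raw values have already been read from the drive (e.g. parsed from smartctl output) into a dict, and assuming any raw value above zero is worth a look. The function name, the threshold, and the treatment of missing attributes as zero are illustrative choices, not Backblaze's actual code.

```python
# SMART attribute IDs named in the note above, with their common names.
WATCHED_SMART_IDS = {
    5: "Reallocated Sectors Count",
    187: "Reported Uncorrectable Errors",
    188: "Command Timeout",
    197: "Current Pending Sector Count",
    198: "Uncorrectable Sector Count",
}


def flagged_attributes(raw_values):
    """Return the watched attributes whose raw value is above zero.

    raw_values: dict mapping SMART attribute ID -> raw value (int).
    Attributes the drive does not report are treated as zero
    (an assumption of this sketch).
    """
    return {
        attr_id: raw_values[attr_id]
        for attr_id in sorted(WATCHED_SMART_IDS)
        if raw_values.get(attr_id, 0) > 0
    }


# Attribute 194 (temperature) is ignored because it is not in the watched set.
print(flagged_attributes({5: 0, 187: 2, 194: 35}))  # {187: 2}
```

An empty result means none of the five watched attributes has a non-zero raw value.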
Pulls down environment variables from etcd and runs a process with them.
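Roughly, the idea can be sketched as below. The etcd read is deliberately left out: `fetch_vars` is a hypothetical callable standing in for whatever client call returns the key/value pairs, and `run_with_env` is an illustrative name, not the tool's real interface. A real tool would likely exec the command in place; this sketch uses a child process so the result can be inspected.

```python
import os
import subprocess
import sys


def run_with_env(fetch_vars, command):
    """Run `command` with the variables from fetch_vars() layered over the current environment."""
    env = dict(os.environ)
    # Values pulled from the store override anything already set locally.
    env.update({str(k): str(v) for k, v in fetch_vars().items()})
    return subprocess.run(command, env=env, capture_output=True, text=True, check=True)


def fake_fetch():
    # Hypothetical stand-in for an etcd read; returns name -> value pairs.
    return {"GREETING": "hello"}


result = run_with_env(
    fake_fetch,
    [sys.executable, "-c", "import os; print(os.environ['GREETING'])"],
)
print(result.stdout.strip())  # hello
```

Passing the fetch step in as a callable keeps the process-launching part testable without a running etcd cluster.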
“only way to stop is to crash. only way to start is to recover”
Talk by Google’s SRE Czar.
Writeup from a Google SRE on alerting/monitoring. Very well thought out.
Pages should be urgent, important, actionable, and real.
This is good for thinking about Hound. Overall, a lot of effort has gone into making all of Hound’s alerts “urgent, important, actionable, and real”, but some fall short. E.g., quite a few current alerts aren’t really actionable (monitoring of various LITO services, Wardenclyffe -> PCP failures); we keep them because we’d rather know when something we depend on fails before our users do.
Things to consider adding to Hound based on this: