Channel: devops

Backblaze finds only a few particular SMART metrics useful for predicting and detecting hard drive failure

Backblaze uses SMART 5, 187, 188, 197 and 198 for determining the failure or potential failure of a hard drive.
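A minimal sketch of how a check on those five attributes might look. The "raw value above zero means investigate" threshold and the input shape (a dict of attribute ID to raw value) are assumptions of this sketch; collecting the values from a real drive is left out.

```python
# SMART attribute IDs named in the note above.
WATCHED_ATTRIBUTES = (5, 187, 188, 197, 198)


def flagged_attributes(raw_values):
    """Return the watched attribute IDs whose raw value is above zero.

    `raw_values` maps SMART attribute ID -> raw value (hypothetical input
    shape). Attributes missing from the mapping are treated as zero.
    The "> 0" threshold is an assumption of this sketch.
    """
    return [attr for attr in WATCHED_ATTRIBUTES if raw_values.get(attr, 0) > 0]


# Made-up sample values, for illustration only.
sample = {5: 0, 9: 12000, 187: 3, 197: 0, 198: 0}
print(flagged_attributes(sample))
```

Attributes outside the watched set (9 in the sample) are ignored, which matches the point of the article: only a few metrics are worth looking at.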

devops linux sysadmin

lokalebasen/go-env

pull down environment variables from etcd and run a process with them
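The general pattern, sketched in Python rather than Go: layer fetched key/value pairs over the current environment and start a child process with the result. This is not go-env's actual code; the etcd lookup is replaced by a plain dict, and `run_with_env`, `fetched`, and `APP_GREETING` are hypothetical names.

```python
import os
import subprocess
import sys


def run_with_env(command, extra_env):
    """Run `command` with `extra_env` layered over the current environment."""
    env = dict(os.environ)
    env.update(extra_env)
    return subprocess.run(
        command, env=env, capture_output=True, text=True, check=True
    )


# Stand-in for values that would be fetched from etcd (hypothetical data).
fetched = {"APP_GREETING": "hello"}

# The child process here is just a Python one-liner that echoes the variable.
result = run_with_env(
    [sys.executable, "-c", "import os; print(os.environ['APP_GREETING'])"],
    fetched,
)
print(result.stdout.strip())
```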

devops golang

Crash-only software: More than meets the eye [LWN.net]

“only way to stop is to crash. only way to start is to recover”

devops distributed systems

Keys to SRE

Talk by Google’s SRE Czar.

  • 50% dev/maintenance ratio
  • at least 5% of support tickets need to go directly to developers
  • SREs are free to leave any project at any time
  • in an outage: minimize impact + prevent recurrence

devops

My Philosophy on Alerting - Google Docs

Writeup from a Google SRE on alerting/monitoring. Very well thought out.

Pages should be urgent, important, actionable, and real.

  • emphasis on reducing noise levels
  • emphasis on end-to-end, black-box, symptom-based alerting rather than cause-based alerting (I assume there is still enough monitoring/metrics in place to quickly diagnose the cause from the symptom)
  • a daily report can be a good channel for non-critical but time-sensitive alerts, particularly on causes (disk getting relatively full, unusually large numbers of slow queries, etc.)
  • “Every alert should be tracked through a workflow system”, not just dumped into an IRC channel or email list.
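One way to picture the points above is as a small routing rule. This is only a sketch: the `Alert` fields, the channel names (`page`, `daily_report`, `ticket`), and the sample alerts are all hypothetical, not something the writeup prescribes.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    name: str
    urgent: bool
    important: bool
    actionable: bool
    real: bool
    time_sensitive: bool = False


def channel_for(alert):
    """Pick a delivery channel for an alert (channel names are assumptions)."""
    # Only alerts that meet all four criteria are allowed to page someone.
    if alert.urgent and alert.important and alert.actionable and alert.real:
        return "page"
    # Non-critical but time-sensitive alerts go into the daily report.
    if alert.time_sensitive:
        return "daily_report"
    # Everything else is still tracked, but only in the workflow system.
    return "ticket"


site_down = Alert("site_down", urgent=True, important=True, actionable=True, real=True)
disk_filling = Alert("disk_80_percent", urgent=False, important=True,
                     actionable=True, real=True, time_sensitive=True)
print(channel_for(site_down), channel_for(disk_filling))
```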

The writeup is a useful lens for thinking about Hound. Overall, a lot of effort has gone into making all of Hound’s alerts “urgent, important, actionable, and real”, but some fall short. E.g., quite a few current alerts aren’t really actionable (monitoring of various LITO services, Wardenclyffe -> PCP failures); we keep them because we’d rather know that something we depend on has failed before our users do.

Things to consider adding to Hound based on this:

  • dependency chain: link symptoms to causes so we can silence the symptom alerts when we know the cause
  • different alert targets: so we can set up alerts that only go to the people who can actually act on them, rather than dumping everything to ccnmtl-sysadmin and training people to ignore a lot of them (“somebody else’s problem”).
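A rough sketch of how the two ideas could fit together. None of this is Hound's actual code or configuration: the alert names, the `CAUSES` and `TARGETS` tables, and the default target are invented for illustration.

```python
# symptom -> alerts that are known causes of it (hypothetical names)
CAUSES = {
    "site_down": ["db_down", "lb_down"],
    "slow_pages": ["db_down"],
}

# alert -> who can actually act on it (hypothetical names)
TARGETS = {
    "db_down": ["dba"],
    "lb_down": ["netops"],
    "site_down": ["web-team"],
    "slow_pages": ["web-team"],
}
DEFAULT_TARGET = ["sysadmin-list"]


def notifications(firing):
    """Map each alert that should be sent to the people it should go to.

    A symptom alert is silenced when at least one of its known causes
    is also firing, since the cause alert already tells the story.
    """
    firing = set(firing)
    out = {}
    for alert in sorted(firing):
        if firing.intersection(CAUSES.get(alert, [])):
            continue  # a known cause is firing: silence the symptom
        out[alert] = TARGETS.get(alert, DEFAULT_TARGET)
    return out


print(notifications(["site_down", "slow_pages", "db_down"]))
print(notifications(["site_down"]))
```

With the cause firing, only the cause alert goes out, and only to the group that can fix it; with the symptom alone, the symptom alert still gets through.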

ccnmtl devops

Netflix for the rest of us

Docker container for overseas Netflix proxy

devops