Channel: devops

ops school curriculum

devops sysadmin

rcrowley/go-metrics

devops golang

bazel: build tool from google

devops

Backblaze finds only a few particular SMART metrics useful for predicting and detecting hard drive failure

Backblaze uses SMART 5, 187, 188, 197 and 198 for determining the failure or potential failure of a hard drive.
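A minimal Go sketch of checking a drive against those five attributes. The names `watchedAttrs` and `flaggedAttrs` are made up for illustration, and treating any nonzero raw value as a warning sign is an assumption of this sketch, not a threshold stated in the note above.

```go
package main

import "fmt"

// watchedAttrs lists the five SMART attribute IDs named in the note above.
var watchedAttrs = []int{5, 187, 188, 197, 198}

// flaggedAttrs returns the watched attribute IDs whose raw value is nonzero,
// in the order they appear in watchedAttrs. Attributes missing from the map
// count as zero. The "nonzero means trouble" rule is an assumption.
func flaggedAttrs(raw map[int]uint64) []int {
	var flagged []int
	for _, id := range watchedAttrs {
		if raw[id] > 0 {
			flagged = append(flagged, id)
		}
	}
	return flagged
}

func main() {
	// Hypothetical raw values, e.g. parsed from a SMART report.
	raw := map[int]uint64{5: 0, 9: 12000, 187: 3, 197: 8, 198: 0}
	fmt.Println(flaggedAttrs(raw)) // prints [187 197]
}
```

Attribute 9 is ignored here because it is not one of the five watched IDs.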

devops linux sysadmin

Guillaume’s Thoughts: Release Go code (and others) via Docker using Makefile

devops golang

Eight Docker Development Patterns

devops

Introducing Consul Template - HashiCorp

devops

lokalebasen/go-env

pull down environment variables from etcd and run a process with them
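A rough sketch of the general idea, not go-env's actual code: overlay fetched key/value pairs on the current environment and start a child process with the result. The etcd lookup is left out; the `fetched` map stands in for whatever would be read from etcd, and `mergeEnv` and `runWithEnv` are hypothetical names.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"sort"
	"strings"
)

// mergeEnv overlays fetched key/value pairs on a base environment given in
// "KEY=value" form. Fetched values win. The result is sorted by key so the
// output is deterministic.
func mergeEnv(base []string, fetched map[string]string) []string {
	merged := make(map[string]string, len(base)+len(fetched))
	for _, kv := range base {
		if k, v, ok := strings.Cut(kv, "="); ok {
			merged[k] = v
		}
	}
	for k, v := range fetched {
		merged[k] = v
	}
	keys := make([]string, 0, len(merged))
	for k := range merged {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	out := make([]string, 0, len(keys))
	for _, k := range keys {
		out = append(out, k+"="+merged[k])
	}
	return out
}

// runWithEnv runs the named command with the merged environment, wiring its
// standard streams to ours, and waits for it to exit.
func runWithEnv(fetched map[string]string, name string, args ...string) error {
	cmd := exec.Command(name, args...)
	cmd.Env = mergeEnv(os.Environ(), fetched)
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd.Run()
}

func main() {
	// Hypothetical values standing in for keys read from etcd.
	fetched := map[string]string{"MODE": "prod", "PORT": "8080"}
	fmt.Println(mergeEnv([]string{"PATH=/bin", "MODE=dev"}, fetched))
}
```

A wrapper built this way would call something like `runWithEnv(fetched, os.Args[1], os.Args[2:]...)` so the wrapped process never needs to know where its configuration came from.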

devops golang

Crash-only software: More than meets the eye [LWN.net]

“only way to stop is to crash. only way to start is to recover”

devops distributed systems

How Google’s Build System Works

devops

Keys to SRE

Talk by Google’s SRE Czar.

  • 50% dev/maintenance ratio
  • at least 5% of support tickets need to go directly to developers
  • SREs are free to leave any project at any time
  • in an outage: minimize impact + prevent recurrence

devops

My Philosophy on Alerting - Google Docs

Writeup from a Google SRE on alerting/monitoring. Very well thought out.

Pages should be urgent, important, actionable, and real.

  • emphasis on reducing noise levels
  • emphasis on end-to-end, black box, symptom-based alerting rather than the cause (I assume there is still enough monitoring/metrics in place to quickly diagnose the cause from the symptom)
  • a daily report can be a good channel for non-critical but time-sensitive alerts, particularly on causes (disk getting relatively full, unusually large numbers of slow queries, etc.)
  • “Every alert should be tracked through a workflow system,” not just dumped into an IRC channel or email list.
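One way to picture the page/report split from the points above, as a small Go sketch. The `Alert` fields mirror the four properties from the writeup; the fallback rule (real and actionable but not urgent goes to the daily report, everything else is discarded) is my assumption, not something the writeup prescribes.

```go
package main

import "fmt"

// Alert carries the four properties the writeup asks of a page.
type Alert struct {
	Name       string
	Urgent     bool
	Important  bool
	Actionable bool
	Real       bool
}

// Channel is where an alert ends up.
type Channel string

const (
	Page        Channel = "page"
	DailyReport Channel = "daily-report"
	Discard     Channel = "discard"
)

// route pages only when all four properties hold. Sending real, actionable,
// non-urgent alerts to the daily report and discarding the rest is an
// assumption of this sketch.
func route(a Alert) Channel {
	switch {
	case a.Urgent && a.Important && a.Actionable && a.Real:
		return Page
	case a.Actionable && a.Real:
		return DailyReport
	default:
		return Discard
	}
}

func main() {
	// Hypothetical alerts for illustration.
	alerts := []Alert{
		{Name: "site-down", Urgent: true, Important: true, Actionable: true, Real: true},
		{Name: "disk-80-percent", Actionable: true, Real: true},
		{Name: "flapping-check", Urgent: true},
	}
	for _, a := range alerts {
		fmt.Printf("%s -> %s\n", a.Name, route(a))
	}
}
```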

This is good for thinking about Hound. Overall, a lot of effort has gone into making all of Hound’s alerts “urgent, important, actionable, and real”, but some fall short. For example, quite a few current alerts aren’t really actionable (e.g., monitoring of various LITO services, Wardenclyffe -> PCP failures); we keep them because we’d rather know when something we depend on fails before our users do.

Things to consider adding to Hound based on this:

  • dependency chain: link symptoms to causes so we can silence the symptom alerts when we know the cause
  • different alert targets, so we can set up alerts that go only to the people who can actually act on them, rather than dumping everything to ccnmtl-sysadmin and training people to ignore a lot of them (“somebody else’s problem”).
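A sketch of how the two ideas above might fit together, in Go. None of this is Hound's code: the `Alert` type, its fields, and `toNotify` are hypothetical. Each alert names the alerts that would explain it and the people who can act on it; a firing symptom is silenced whenever one of its listed causes is also firing.

```go
package main

import "fmt"

// Alert is a hypothetical alert record; the field names are assumptions.
type Alert struct {
	Name    string
	Firing  bool
	Causes  []string // names of alerts that would explain this one
	Targets []string // people or lists who can actually act on it
}

// toNotify returns the firing alerts that are not explained by another
// firing alert, preserving input order. Only direct causes are checked, so
// a chain works as long as each link in it is firing. A cycle of causes
// would silence every alert in the cycle, which this sketch does not guard
// against.
func toNotify(alerts []Alert) []Alert {
	firing := make(map[string]bool, len(alerts))
	for _, a := range alerts {
		if a.Firing {
			firing[a.Name] = true
		}
	}
	var out []Alert
	for _, a := range alerts {
		if !a.Firing {
			continue
		}
		explained := false
		for _, c := range a.Causes {
			if firing[c] {
				explained = true
				break
			}
		}
		if !explained {
			out = append(out, a)
		}
	}
	return out
}

func main() {
	// Hypothetical alerts and targets for illustration.
	alerts := []Alert{
		{Name: "site-slow", Firing: true, Causes: []string{"db-down"}, Targets: []string{"web-team"}},
		{Name: "db-down", Firing: true, Targets: []string{"dba"}},
		{Name: "disk-full", Firing: false, Targets: []string{"sysadmin"}},
	}
	for _, a := range toNotify(alerts) {
		fmt.Println(a.Name, "->", a.Targets) // prints db-down -> [dba]
	}
}
```

Keeping targets on the alert itself means the cause alert reaches whoever can fix it while the silenced symptom reaches nobody.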

ccnmtl devops

Netflix for the rest of us

Docker container for overseas Netflix proxy

devops

Jenkins now with more Gopher

Doing fun things with Go apps in Jenkins

devops golang