Finch: devops

Channel: devops


ops school curriculum

devops sysadmin

By anders·2015-05-10 10:30:43 +0000 UTC

rcrowley/go-metrics

By anders·2015-05-10 10:29:26 +0000 UTC

bazel: build tool from google

By anders·2015-04-07 11:57:43 +0000 UTC

Backblaze finds only a few particular SMART metrics useful for predicting and detecting hard drive failure

Backblaze uses SMART 5, 187, 188, 197 and 198 for determining the failure or potential failure of a hard drive.
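A minimal sketch of how a check on those five attributes might look. It assumes the raw values have already been read from the drive (collection is not shown) and that any raw value above zero counts as a warning. That threshold and the name `flagged_attributes` are assumptions for illustration, not Backblaze's code.

```python
# SMART attribute IDs that Backblaze reports using, per the note above.
WATCHED_ATTRIBUTES = (5, 187, 188, 197, 198)


def flagged_attributes(raw_values):
    """Return the watched attribute IDs whose raw value is above zero.

    raw_values maps SMART attribute ID -> raw value. Attributes missing
    from the mapping are skipped, since not every drive reports all five.
    The "above zero" threshold is an assumption made for this sketch.
    """
    return [attr for attr in WATCHED_ATTRIBUTES if raw_values.get(attr, 0) > 0]


if __name__ == "__main__":
    sample = {5: 0, 9: 12034, 187: 2, 197: 0, 198: 0}  # made-up readings
    print(flagged_attributes(sample))
```

Attributes outside the watched set (9 in the sample) are ignored on purpose, which matches the point of the article: most SMART metrics are not useful for this.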

devops linux sysadmin

By anders·2014-12-13 18:32:26 +0000 UTC

Guillaume’s Thoughts: Release Go code (and others) via Docker using Makefile

By anders·2014-11-11 12:41:21 +0000 UTC

Eight Docker Development Patterns

By anders·2014-10-27 20:22:24 +0000 UTC

Introducing Consul Template - HashiCorp

By anders·2014-10-22 13:36:15 +0000 UTC

lokalebasen/go-env

pull down environment variables from etcd and run a process with them
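The idea can be sketched in a few lines of Python. This is not go-env's code: the etcd fetch is left out entirely, and `extra_env` simply stands in for whatever key/value pairs were pulled down. The function name `run_with_env` and the sample variable are hypothetical.

```python
import os
import subprocess
import sys


def run_with_env(extra_env, command):
    """Run command with extra_env layered over the current environment.

    extra_env stands in for key/value pairs already fetched from etcd;
    the fetch itself is deliberately left out of this sketch.
    """
    env = dict(os.environ)
    env.update({key: str(value) for key, value in extra_env.items()})
    return subprocess.run(command, env=env, capture_output=True, text=True)


if __name__ == "__main__":
    result = run_with_env(
        {"DATABASE_URL": "postgres://example/db"},  # made-up value
        [sys.executable, "-c", "import os; print(os.environ['DATABASE_URL'])"],
    )
    print(result.stdout.strip())
```

The child process only sees the variables; it needs no knowledge of where they came from, which is what makes this pattern easy to bolt onto existing apps.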

By anders·2014-10-17 19:55:41 +0000 UTC

Crash-only software: More than meets the eye [LWN.net]

“only way to stop is to crash. only way to start is to recover”
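One way to picture that quote, as a toy sketch rather than anything taken from the article: a counter with no shutdown method, whose constructor always runs the recovery path. The class name `CrashOnlyCounter` and the JSON state file are assumptions for illustration.

```python
import json
import os
import tempfile


class CrashOnlyCounter:
    """A counter with no shutdown path: every start is a recovery."""

    def __init__(self, path):
        self.path = path
        self.value = self._recover()

    def _recover(self):
        # Startup and crash recovery are the same code path.
        try:
            with open(self.path) as f:
                return json.load(f)["value"]
        except FileNotFoundError:
            return 0

    def increment(self):
        self.value += 1
        # Write to a temp file, then rename over the old state, so a
        # crash at any point leaves either the old or the new file intact.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as f:
            json.dump({"value": self.value}, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.path)
        return self.value
```

Because state is durable after every `increment`, killing the process is a safe way to stop it, and there is no separate "clean start" code to get out of sync with the recovery code.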

devops distributed systems

By anders·2014-10-16 20:47:25 +0000 UTC

How Google’s Build System Works

By anders·2014-10-14 14:04:58 +0000 UTC

Keys to SRE

Talk by Google’s SRE Czar.

50% dev/maintenance ratio
at least 5% of support tickets need to go directly to developers
SREs are free to leave any project at any time
in an outage: minimize impact + prevent recurrence

By anders·2014-10-14 12:57:10 +0000 UTC

My Philosophy on Alerting - Google Docs

Writeup from a Google SRE on alerting/monitoring. Very well thought out.

Pages should be urgent, important, actionable, and real.

emphasis on reducing noise levels
emphasis on end-to-end, black-box, symptom-based alerting rather than cause-based alerting (I assume there is still enough monitoring/metrics in place to quickly diagnose the cause from the symptom)
a daily report can be a good channel for non-critical but time-sensitive alerts, particularly on causes (disk getting relatively full, unusually large numbers of slow queries, etc.)
“Every alert should be tracked through a workflow system”, not just dumped into an IRC channel or email list.
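A rough sketch of how the four criteria could drive routing. The rules here are my own reading, not something the writeup specifies: all four criteria met means a page, real and actionable but short of that means the daily report, and anything else gets flagged for review. The `Alert` fields, `route`, and the channel names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    name: str
    urgent: bool
    important: bool
    actionable: bool
    real: bool


def route(alert):
    """Return "page", "daily_report", or "review" for an alert."""
    if alert.urgent and alert.important and alert.actionable and alert.real:
        return "page"
    if alert.real and alert.actionable:
        # Worth acting on, but not worth waking someone up for.
        return "daily_report"
    # Not real or not actionable: a candidate for rework or removal.
    return "review"


if __name__ == "__main__":
    print(route(Alert("site returning 5xx", True, True, True, True)))
    print(route(Alert("disk 80% full", False, True, True, True)))
```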

This is useful for thinking about Hound. Overall, a lot of effort has gone into making all of Hound’s alerts “urgent, important, actionable, and real”, but some fall short. For example, quite a few current alerts aren’t really actionable (monitoring of various LITO services, Wardenclyffe -> PCP failures); we keep them because we’d rather know that something we depend on has failed before our users do.

Things to consider adding to Hound based on this:

dependency chain: link symptoms to causes so we can silence the symptom alerts when we know the cause
different alert targets, so we can set up alerts that go only to the people who can actually act on them, rather than dumping everything on ccnmtl-sysadmin and training people to ignore most of it (“somebody else’s problem”).
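Both ideas fit in one small function. This is only a sketch of what they might look like, not Hound's code: the data structures, the alert names, and `plan_notifications` are all hypothetical. A symptom is silenced when any of its known causes is also firing, and alerts without an explicit target go to a fallback list.

```python
def plan_notifications(firing, causes, targets, fallback):
    """Map each recipient to the alerts they should receive.

    firing:   set of alert names currently firing
    causes:   symptom alert name -> list of cause alert names
    targets:  alert name -> list of recipients who can act on it
    fallback: recipient for alerts with no explicit target
    """
    plan = {}
    for alert in sorted(firing):
        if any(cause in firing for cause in causes.get(alert, ())):
            continue  # a known cause is firing; skip the symptom alert
        for recipient in targets.get(alert, [fallback]):
            plan.setdefault(recipient, []).append(alert)
    return plan


if __name__ == "__main__":
    print(plan_notifications(
        firing={"site-5xx", "db-down", "disk-full"},
        causes={"site-5xx": ["db-down"]},
        targets={"db-down": ["dba"], "site-5xx": ["web-team"]},
        fallback="sysadmin-list",
    ))
```

In the sample, the web team hears nothing about `site-5xx` because `db-down` already explains it, and only the unowned `disk-full` alert lands on the shared list.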

By anders·2014-10-14 09:00:56 +0000 UTC

Netflix for the rest of us

Docker container for overseas Netflix proxy

By anders·2014-10-12 19:55:41 +0000 UTC

Jenkins now with more Gopher

Doing fun things with Go apps in Jenkins

By anders·2014-10-12 18:54:32 +0000 UTC
