Channel: sre

This self-hosted SaaS passed its ISO 27001 audit, here’s the dashboard that did it

devops security sre sysadmin

From static rate limiting to adaptive traffic management in Airbnb’s key-value store

devops distributed systems sre

Engineering Resilience with CFR Monitoring

devops sre

How to Make Things Slower So They Go Faster

math sre

Not Causal Chains but Interactions and Adaptations

resilience sre

It’s a log eat log world! - by Obakeng Mosadi

devops sre sysadmin

Scaling Prometheus: From Single Node to Enterprise-Grade Observability

devops sre

Resilience: some key ingredients

resilience sre

Meta’s Hyperscale Infrastructure Overview and Insights

distributed systems sre sysadmin

How doctors hand off patients (how it applies to incidents)

I-PASS stands for:

  • Illness Severity
  • Patient Summary
  • Action List
  • Situation Awareness & Contingency Planning
  • Synthesis by Receiver

sre

Probabilistic Increment: A Randomized Algorithm to Mitigate Hot Rows

algorithms distributed systems sre

Good Retry, Bad Retry: An Incident Story | by Denis Isaev | Yandex | Aug, 2024 | Medium

distributed systems sre

The Rule of 5 Errors - by Ross Brodbeck

sre sysadmin

SLO formulas implementation in PromQL step by step

sre

Failsafe-go - Fault tolerance and resilience patterns for Go

golang sre

SRE Archetypes

sre

generic mitigations

sre

Metastable failures in the wild

sre

Time based vs Event based SLIs - Alex Ewerlöf

devops sre

Embrace Complexity; Tighten Your Feedback Loops

resilience sre

The Case of the Recursive Resolvers - Slack Engineering

networking sre sysadmin

Uptime guarantees: A pragmatic perspective – great explanation of SLOs and 9’s.

sre

Read Every Single Error | Pulumi Blog

sre

mercari/production-readiness-checklist: Production readiness checklist used for Mercari and Merpay microservices

Good production readiness documents.

sre