Lessons Learned From Twenty Years Of Site Reliability Engineering
tl;dr: (1) The riskiness of a mitigation should scale with the severity of the outage. (2) Recovery mechanisms should be fully tested before an emergency. (3) Canary all changes. (4) Have a "Big Red Button". (5) Unit tests alone are not enough; integration testing is also needed. (6) Maintain communication channels, and backup channels for when those fail. (7) Intentionally build degraded-performance modes. (8) Test for disaster resilience. (9) Automate your mitigations. (10) Reduce the time between rollouts to decrease the likelihood of a rollout going wrong. (11) A single global hardware version is a single point of failure.
featured in #463
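A minimal sketch of what lesson (4), the "Big Red Button", can look like in code: a kill switch that lets operators instantly route traffic back to a well-understood code path without a redeploy. The switch name `new_ranking_path` and the in-memory flag store are illustrative assumptions; a real deployment would read the switch from a config service so it takes effect fleet-wide.

```python
from dataclasses import dataclass, field


@dataclass
class KillSwitches:
    """Holds named switches; a 'pressed' switch forces the safe fallback path."""
    pressed: set[str] = field(default_factory=set)

    def press(self, name: str) -> None:
        self.pressed.add(name)

    def is_pressed(self, name: str) -> bool:
        return name in self.pressed


SWITCHES = KillSwitches()


def serve_request(payload: str) -> str:
    # New, riskier behaviour is guarded by the switch.
    if not SWITCHES.is_pressed("new_ranking_path"):
        return f"new-path({payload})"
    # Old, well-understood behaviour is the fallback.
    return f"stable-path({payload})"


if __name__ == "__main__":
    print(serve_request("query"))        # new-path(query)
    SWITCHES.press("new_ranking_path")   # operator presses the Big Red Button
    print(serve_request("query"))        # stable-path(query)
```

The key property is that pressing the button is a low-risk action: it only ever selects the older, proven behaviour, which is exactly the kind of mitigation whose riskiness stays well below the severity of the outage it addresses.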
Service Delivery Index: A Driver for Reliability
- Matthew McKeen, Ryan Katkov tl;dr: The article introduces the Service Delivery Index – Reliability (SDI-R), a metric designed to measure and drive service reliability at Slack. As the company grew, the need to move from a reactive to a proactive approach to reliability became evident. SDI-R, a composite metric of successful API calls, content delivery, and user workflows, provides a common understanding of reliability across the organization. It helps in spotting trends, identifying regressions, and setting customer expectations. The article details how SDI-R evolved, the tools and processes that support it, and the lessons learned.
featured in #440
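To make the idea of a composite reliability index concrete, here is a minimal sketch in the spirit of SDI-R: per-signal success rates (API calls, content delivery, user workflows) combined into a single number. The signal names, weights, and the simple weighted average are illustrative assumptions, not Slack's actual formula.

```python
from dataclasses import dataclass


@dataclass
class Signal:
    name: str
    successes: int
    total: int
    weight: float

    @property
    def success_rate(self) -> float:
        return self.successes / self.total if self.total else 1.0


def service_delivery_index(signals: list[Signal]) -> float:
    """Weighted average of per-signal success rates, expressed as a percentage."""
    total_weight = sum(s.weight for s in signals)
    return 100.0 * sum(s.success_rate * s.weight for s in signals) / total_weight


if __name__ == "__main__":
    signals = [
        Signal("api_calls", successes=998_700, total=1_000_000, weight=0.5),
        Signal("content_delivery", successes=499_100, total=500_000, weight=0.3),
        Signal("user_workflows", successes=99_500, total=100_000, weight=0.2),
    ]
    print(f"SDI-R: {service_delivery_index(signals):.3f}%")  # SDI-R: 99.781%
```

Rolling several user-facing signals into one index is what gives the metric its value as a common language: a regression in any component moves the same number everyone watches.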
Delivering Large-Scale Platform Reliability
- Alberto Covarrubias tl;dr: Alberto covers: (1) Having the right measurements in place. (2) Anticipating problems proactively through activities such as architectural reviews and dependency risk assessments. (3) Prioritizing correction by bringing attention to incident report resolution for the service and its dependencies.
featured in #314
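As one small illustration of point (1), "having the right measurements in place", here is a sketch of a standard availability SLI checked against an SLO target, with the remaining error budget for the period. The 99.9% target and the request counts are assumptions for illustration, not figures from the article.

```python
def availability_sli(successful: int, total: int) -> float:
    """Fraction of requests served successfully in the measurement window."""
    return successful / total if total else 1.0


def error_budget_remaining(sli: float, slo_target: float) -> float:
    """Fraction of the period's error budget still unspent (can go negative)."""
    budget = 1.0 - slo_target   # allowed failure fraction
    spent = 1.0 - sli           # observed failure fraction
    return (budget - spent) / budget if budget else 0.0


if __name__ == "__main__":
    sli = availability_sli(successful=9_995_000, total=10_000_000)   # 99.95%
    remaining = error_budget_remaining(sli, slo_target=0.999)
    print(f"SLI: {sli:.4%}, error budget remaining: {remaining:.0%}")  # 50%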