Why I Don’t Like Discussing Action Items During Incident Reviews
- Lorin Hochstein tl;dr: (1) Nobody fully understands how the system works. (2) The gaps in our understanding of how the system works contribute to incidents. (3) The way that work is done profoundly affects incidents, both positively and negatively, but that work is mostly invisible. (4) Incident reviews are an opportunity for many people to gain insight into how the system works. (5) The best way to get a better understanding of how the system behaves is to look at how the system actually behaved. And more. Featured in #556
Leveraging AI For Efficient Incident Response
tl;dr: From the team at Meta, “We’re leveraging AI to advance our investigation tools even further. We’ve streamlined our investigations through a combination of heuristic-based retrieval and large language model (LLM)-based ranking to provide AI-assisted root cause analysis. During backtesting, this system has achieved promising results: 42% accuracy in identifying root causes for investigations at their creation time related to our web monorepo.” Featured in #526
Incident Categories I’d Like To See
- Lorin Hochstein tl;dr: If you’re categorizing your incidents by cause, here are some options for causes that I’d love to see used. These are all taken directly from the field of cognitive systems engineering research: (1) Production pressure. (2) Goal conflicts. (3) Workarounds. (4) Automation surprises. Featured in #376
Without Prep, Even The Most Scalable And Reliable Developer Tools Can Be Hit With Outages
tl;dr: Get actionable tactics from the experts who built incident response frameworks for Snyk, PagerDuty, New Relic, Netflix, Chef, and Amazon at the DevGuild: Incident Response conference on Nov 15-17. Avoid costly outages: secure your free ticket. Featured in #365