Lorin Hochstein

The Canva Outage: Another Tale Of Saturation And Resilience tl;dr: “Today’s public incident writeup comes courtesy of Brendan Humphries, the CTO of Canva. Like so many other incidents that came before, this is another tale of saturation, where the failure mode involves overload. There’s a lot of great detail in Humphries’s write-up, and I recommend you read it directly in addition to this post.”

featured in #581


Why I Don’t Like Discussing Action Items During Incident Reviews tl;dr: (1) Nobody fully understands how the system works. (2) The gaps in our understanding of how the system works contribute to incidents. (3) The way that work is done profoundly affects incidents, both positively and negatively, but that work is mostly invisible. (4) Incident reviews are an opportunity for many people to gain insight into how the system works. (5) The best way to get a better understanding of how the system behaves is to look at how the system actually behaved. And more.

featured in #556


What If Everybody Did Everything Right? tl;dr: In the wake of an incident, we are inevitably led to answer two questions: “What did we do wrong here? What didn’t we do that we should have?” Lorin argues these questions create a specific lens to scrutinize the incident. “An alternative lens for making sense of an incident is to ask the question ‘how did this incident happen, assuming that everybody did everything right?’ Assume that everybody whose actions contributed to the incident made the best possible decision based on the information they had, and the constraints and incentives that were imposed upon them.” This prompts different questions: (1) What information did people know in the moment? (2) What were the constraints that people were operating under?

featured in #492


Incident Categories I’d Like To See tl;dr: If you’re categorizing your incidents by cause, here are some options for causes that I’d love to see used. These are all taken directly from the field of cognitive systems engineering research: (1) Production pressure. (2) Goal conflicts. (3) Workarounds. (4) Automation surprises.

featured in #376


Writing Docs Well: Why Should A Software Engineer Care? tl;dr: Lorin recently gave a lecture in a graduate software engineering course on the value of technical writing for software engineers. There are 3 goals when writing: (1) Building shared understanding. (2) A tool for your own thinking. (3) Influence in a larger org when you’re at the bottom of the hierarchy. Lorin also advises on how to improve technical writing. 

featured in #374


OOPS Writeups tl;dr: An Operational Surprise (OOPS) is something unexpected that happened in operations. It presents an opportunity to discover how the observed system behavior deviated from the mental model of how the system is supposed to behave. The template shared in this post is based on the template used at Netflix.

featured in #274


Root Cause Of Failure, Root Cause Of Success tl;dr: “‘Root cause of failure’ doesn’t make sense in the context of complex systems failure, because a collection of control processes keep the system up and running. A system failure is a failure of this overall set of processes.” Lorin draws an analogy to illustrate this and asks: if there’s no root cause of success, why should there be one for failure?

featured in #251