How We Built A Self-Healing System To Survive A Terrifying Concurrency Bug At Netflix
- Matthew Hawthorne tl;dr: “It was a Friday afternoon, and I heard a lot of commotion. I emerged from my cubicle to see my colleagues passionately discussing a problem: our CPU usage was slowly growing across our cluster.” featured in #567
5 Lessons Design Systems Teams Can Learn From Open-Source Maintainers
- Nick Moore tl;dr: Design system teams constantly balance evangelizing their standards and components with maintaining and evolving them. The same is true of open source maintainers. This post suggests 5 ways enterprise design system teams can learn from open source to better demonstrate their value, improve the end-user experience, and reclaim time spent on maintenance for evolving and improving the system as a whole. featured in #485
Standing On The Shoulders Of Giants: Colm On Constant Work
- Werner Vogels tl;dr: This is why many of our most reliable systems use very simple, very dumb, very reliable constant work patterns. Just like coffee urns. These patterns have three key features. (1) They don’t scale up or slow down with load or stress. (2) They don’t have modes, which means they do the same operations in all conditions. (3) If they have any variation, it’s to do less work in times of stress so they can perform better when you need them most. There’s that anti-fragility again. featured in #466
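As a rough illustration of the constant work idea, here is a minimal Python sketch, not code from the article. The names `ConstantWorkTable` and `MAX_ENTRIES` are hypothetical. The table always emits a full, fixed-size snapshot, so the work per publish cycle is the same whether one entry changed or all of them did.

```python
# Hypothetical sketch of a constant work pattern: every snapshot has the
# same size, regardless of how many entries are set or recently changed.
MAX_ENTRIES = 4  # assumed fixed capacity, chosen only for illustration


class ConstantWorkTable:
    def __init__(self):
        self.entries = {}

    def update(self, key, value):
        # Refuse growth past the fixed capacity instead of scaling up.
        if key not in self.entries and len(self.entries) >= MAX_ENTRIES:
            raise ValueError("table is full")
        self.entries[key] = value

    def snapshot(self):
        # Always return exactly MAX_ENTRIES rows, padding unused slots,
        # so a consumer does the same amount of work on every cycle.
        rows = sorted(self.entries.items())
        padding = [(None, None)] * (MAX_ENTRIES - len(rows))
        return rows + padding
```

Calling `snapshot()` on an empty table and on a full one returns the same number of rows, which is the "no modes" property the summary describes.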
Tumblr Shares Database Migration Strategy With 60+ Billion Rows
tl;dr: The article delves into Tumblr's database migration strategy. With a massive MySQL database spanning 21 terabytes and 60+ billion rows, Tumblr sought a migration approach that minimized user impact. Initially considering a brute force method, they later adopted the CQRS pattern, which separates database read and write operations. To combat latency issues, Tumblr introduced a database proxy in the local data center, which maintained persistent connections to the remote leader and allowed for connection pooling. This strategy ensured minimal user disruption during migration. featured in #447
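The read/write separation behind CQRS can be sketched in a few lines of Python. This is an illustration under assumptions, not Tumblr's implementation: `Store`, `PostCommands`, `PostQueries`, and `replicate` are hypothetical names, and in-memory dictionaries stand in for the leader and replica databases.

```python
# Hypothetical CQRS sketch: writes go to a leader store, reads go to a
# replica, and a separate step copies data from one to the other.
class Store:
    """In-memory stand-in for a database."""

    def __init__(self):
        self.rows = {}


class PostCommands:
    """Write side: every mutation is sent to the leader."""

    def __init__(self, leader):
        self.leader = leader

    def create_post(self, post_id, body):
        self.leader.rows[post_id] = body


class PostQueries:
    """Read side: lookups are served from the replica only."""

    def __init__(self, replica):
        self.replica = replica

    def get_post(self, post_id):
        return self.replica.rows.get(post_id)


def replicate(leader, replica):
    # Stand-in for database replication: copy the leader's rows over.
    replica.rows = dict(leader.rows)
```

Because the two sides share no code path, each can point at a different database location, which is what makes it possible to move reads and writes independently during a migration.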
System Design Interview Cheat Sheet
tl;dr: “The system design questions are subjective. This cheat sheet is a work in progress and is written based on my research on the topic.” Topics include databases, API design, capacity planning, high level design, design deep dives, and more. featured in #403
McDonald’s Event-Driven Architecture: The Data Journey And How It Works
- Vamshi Krishna Komuravalli, Damian Sullivan tl;dr: Here is a typical data flow of how events are reliably produced and consumed from the platform: (1) An event schema is first defined and registered in the schema registry. (2) Applications that need to produce events use the producer SDK to publish them. (3) When an application starts up, the event schema is cached in the producing application for high performance. The authors go on to discuss how data flows through the rest of the system. featured in #380
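The three steps above can be sketched roughly as follows. This is a toy Python model, not the McDonald's producer SDK: `SchemaRegistry`, `Producer`, and the representation of a schema as a set of required field names are all assumptions made for illustration.

```python
# Hypothetical sketch: register a schema, cache it in the producer at
# startup, then validate each event against the cached copy on publish.
class SchemaRegistry:
    def __init__(self):
        self._schemas = {}
        self.lookups = 0  # counts registry fetches, to show the cache effect

    def register(self, name, required_fields):
        self._schemas[name] = frozenset(required_fields)

    def fetch(self, name):
        self.lookups += 1
        return self._schemas[name]


class Producer:
    def __init__(self, registry, event_name):
        self.event_name = event_name
        # Step 3: fetch the schema once at startup and keep it locally.
        self._schema = registry.fetch(event_name)
        self.published = []  # stand-in for the real event stream

    def publish(self, event):
        # Step 2: validate against the cached schema, with no registry call.
        missing = self._schema - set(event)
        if missing:
            raise ValueError(f"missing fields: {sorted(missing)}")
        self.published.append(event)
```

However many events the producer publishes, the registry is consulted only once, which is the performance benefit the summary attributes to caching.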