That Time We Unplugged A Data Center To Test Our Disaster Readiness
- Krishelle Hardson-Hurley, Ross Delinger, Tong Pham tl;dr: "One way of communicating our preparedness to our customers is through a metric called Recovery Time Objective (RTO). RTO measures the amount of time we promise it will take to recover from a catastrophic event." Dropbox reduced its RTO by "more than an order of magnitude," as discussed in this post.
featured in #313
Building Resilient Services At Prime Video With Chaos Engineering
- Varun Jewalikar, Adrian Hornsby tl;dr: "A simple approach for fault injection in systems utilizing Amazon EC2 and ECS, and its integration with a load-testing suite to validate the countermeasures put in place to prevent dependency and resource exhaustion failures."
featured in #201
featured in #138