Building And Operating A Pretty Big Storage System Called S3
- Werner Vogels tl;dr: A repost of an article by Andy Warfield, VP of S3, reflecting on the vast complexity and operational scale of Amazon's storage system. Andy discusses the importance of recognizing and mitigating organizational scaling issues, much as one would optimize a system. He also describes how management’s approach of fostering team ownership of problems, rather than dispensing solutions, has led to more engaged teams and more successful engineering outcomes. featured in #436
Measuring Performance For iOS Apps At Uber Scale
tl;dr: This article discusses how Uber measures performance, focusing on app startup performance on iOS. Uber monitors critical metrics such as UI flow latency, memory usage, bandwidth, and UI jank. App launch time is highlighted as a crucial industry-standard metric that directly impacts the customer experience. featured in #431
Meta Developer Tools: Working At Scale
- Neil Mitchell tl;dr: “Every day, thousands of developers at Meta are working in repositories with millions of files. Those developers need tools that help them at every stage of the workflow while working at extreme scale. In this article we’ll go through a few of the tools in the development process. And, as an added bonus, those we talk about below are open source so you can try them yourself.” featured in #430
Upscaling LinkedIn's Profile Datastore While Reducing Costs
- Estella Pham, Guanlin Lu tl;dr: LinkedIn introduced Couchbase as a centralized storage tier cache to address scaling concerns. Challenges arose because the cache was not backed by primary storage. The blog post discusses the decision, the challenges faced, and the solutions employed to achieve a high cache hit rate, reduced latencies, and cost savings. featured in #428
Migrating Netflix To GraphQL Safely
tl;dr: “Doing this safely for 100s of millions of customers without disruption is exceptionally challenging, especially considering the many dimensions of change involved. This blog post will share broadly-applicable techniques (beyond GraphQL) we used to perform this migration. The three strategies we will discuss today are AB Testing, Replay Testing, and Sticky Canaries.” featured in #423
Migrating Critical Traffic At Scale With No Downtime — Part 2
tl;dr: “Replay traffic testing gives us the initial foundation of validation, but as our migration process unfolds, we are met with the need for a carefully controlled migration process. A process that doesn’t just minimize risk, but also facilitates a continuous evaluation of the rollout’s impact. This blog post will delve into the techniques leveraged at Netflix to introduce these changes to production.” featured in #422
featured in #416
Migrating Critical Traffic At Scale With No Downtime
tl;dr: From the team at Netflix: “when undertaking system migrations, one of the main challenges is establishing confidence and seamlessly transitioning the traffic to the upgraded architecture without adversely impacting the customer experience. This blog series will examine the tools, techniques, and strategies we utilized to achieve this goal.” featured in #415
Scaling Up The Prime Video Audio / Video Monitoring Service And Reducing Costs By 90%
- Marcin Kolny tl;dr: “To ensure that customers seamlessly receive content, Prime Video set up a tool to monitor every stream viewed by customers. This tool allows us to automatically identify perceptual quality issues and trigger a process to fix them.” Marcin discusses the service’s architecture. featured in #412
How LinkedIn Adopted A GraphQL Architecture For Product Development
- Arun Sethuramalingam tl;dr: “In this blog post, we will cover how the GraphQL layer is architected for use by our internal engineers to build member and customer facing applications. Specifically, we will dive into some of the architectural choices that are unique to LinkedIn and why we chose each one of them.” featured in #411