/Scale

Building And Operating A Pretty Big Storage System Called S3

- Werner Vogels tl;dr: A repost of an article by Andy Warfield, VP of S3, that reflects on the vast complexity and operational scale of Amazon's storage system. Andy discusses the importance of recognizing and mitigating organizational scaling issues, much as one would optimize a system. He also describes how management's approach of fostering team ownership of problem-solving, rather than dispensing solutions, has led to more engaged teams and more successful engineering outcomes.

featured in #436


Measuring Performance For iOS Apps At Uber Scale

tl;dr: This article discusses how Uber measures performance, focusing on app startup on iOS. Uber monitors critical metrics such as UI flow latency, memory usage, bandwidth, and UI jank. App launch time is highlighted as a crucial industry-standard metric that directly impacts the customer experience.

featured in #431


Meta Developer Tools: Working At Scale

- Neil Mitchell tl;dr: “Every day, thousands of developers at Meta are working in repositories with millions of files. Those developers need tools that help them at every stage of the workflow while working at extreme scale. In this article we’ll go through a few of the tools in the development process. And, as an added bonus, those we talk about below are open source so you can try them yourself.”

featured in #430


Upscaling LinkedIn's Profile Datastore While Reducing Costs

- Estella Pham, Guanlin Lu tl;dr: LinkedIn introduced Couchbase as a centralized storage tier cache to address scaling concerns. Challenges arose due to the cache not being backed by primary storage. The blog post discusses the decision, the challenges faced, and the solutions employed to achieve a high cache hit rate, reduced latencies, and cost savings.

featured in #428


Migrating Netflix To GraphQL Safely

tl;dr: “Doing this safely for 100s of millions of customers without disruption is exceptionally challenging, especially considering the many dimensions of change involved. This blog post will share broadly-applicable techniques (beyond GraphQL) we used to perform this migration. The three strategies we will discuss today are AB Testing, Replay Testing, and Sticky Canaries.”

featured in #423


Migrating Critical Traffic At Scale With No Downtime — Part 2

tl;dr: “Replay traffic testing gives us the initial foundation of validation, but as our migration process unfolds, we are met with the need for a carefully controlled migration process. A process that doesn’t just minimize risk, but also facilitates a continuous evaluation of the rollout’s impact. This blog post will delve into the techniques leveraged at Netflix to introduce these changes to production.”

featured in #422


The Simple Joys Of Scaling Up

- Jordan Tigani tl;dr: “After such a dramatic increase in hardware capability, we should ask ourselves, ‘Do the conditions that drove our scaling challenges in 2003 still exist?’ After all, we’ve made our systems far more complex and added a lot of overhead. Is it all still necessary? If you can do the job on a single machine, isn’t that going to be a better alternative?” This post digs into why scale-out became so dominant, looks at whether those rationales still hold, and then explores some advantages of scaling up instead.

featured in #416


Migrating Critical Traffic At Scale With No Downtime

tl;dr: From the team at Netflix: “when undertaking system migrations, one of the main challenges is establishing confidence and seamlessly transitioning the traffic to the upgraded architecture without adversely impacting the customer experience. This blog series will examine the tools, techniques, and strategies we utilized to achieve this goal.”

featured in #415


Scaling Up The Prime Video Audio / Video Monitoring Service And Reducing Costs By 90%

- Marcin Kolny tl;dr: “To ensure that customers seamlessly receive content, Prime Video set up a tool to monitor every stream viewed by customers. This tool allows us to automatically identify perceptual quality issues and trigger a process to fix them.” Marcin discusses how changing the service’s architecture let the team scale it up while reducing costs by 90%.

featured in #412


How LinkedIn Adopted A GraphQL Architecture For Product Development

- Arun Sethuramalingam tl;dr: “In this blog post, we will cover how the GraphQL layer is architected for use by our internal engineers to build member and customer facing applications. Specifically, we will dive into some of the architectural choices that are unique to LinkedIn and why we chose each one of them.”

featured in #411