/Scale

In Defense Of Simple Architectures

- Dan Luu tl;dr: Dan discusses the effectiveness of simple architectures in software development, using Wave, a $1.7B company, as an example. Wave's architecture is a Python monolith on top of Postgres, allowing engineers to focus on delivering value to users. The article emphasizes that simple architectures can be created more cheaply and easily than complex ones, even for high-traffic apps. Despite the trend towards complex, microservice-based architectures, Dan argues for the "unreasonable effectiveness" of monoliths, detailing Wave's choices, mistakes, and areas of unavoidable complexity. Simplicity in architecture can lead to success, allowing companies to allocate complexity where it benefits the business.

featured in #439


How We Built The Canva Apps SDK

- Martin Cronjé tl;dr: Martin outlines the development of the Canva Apps SDK as Canva moved from a plugin model to a more flexible app-building platform. The process involved building a secure sandboxed environment, creating a new build-and-deploy pipeline, and designing APIs with a focus on simplicity, safety, evolvability, and consistency. Iterative development, continuous feedback, and a balance between alignment and empowerment were key strategies in the SDK's creation.

featured in #437


Building And Operating A Pretty Big Storage System Called S3

- Werner Vogels tl;dr: A repost of an article by Andy Warfield, VP of S3, reflecting on the complexity and operational scale of Amazon's storage system. Andy discusses the importance of recognizing and mitigating organizational scaling issues, much as one would when optimizing a system. He also describes how management's approach of fostering team ownership of problem-solving, rather than dispensing solutions, has led to more engaged teams and more successful engineering outcomes.

featured in #436


Measuring Performance For iOS Apps At Uber Scale

tl;dr: This article discusses how Uber measures performance, focusing on app startup performance on iOS. Uber monitors critical metrics such as UI flow latency, memory usage, bandwidth, and UI jank. App launch time is highlighted as an industry-standard metric that directly impacts the customer experience.

featured in #431


Meta Developer Tools: Working At Scale

- Neil Mitchell tl;dr: “Every day, thousands of developers at Meta are working in repositories with millions of files. Those developers need tools that help them at every stage of the workflow while working at extreme scale. In this article we’ll go through a few of the tools in the development process. And, as an added bonus, those we talk about below are open source so you can try them yourself.”

featured in #430


Upscaling LinkedIn's Profile Datastore While Reducing Costs

- Estella Pham, Guanlin Lu tl;dr: LinkedIn introduced Couchbase as a centralized storage tier cache to address scaling concerns. Challenges arose because the cache was not backed by primary storage. The blog post discusses the decision, the challenges faced, and the solutions employed to achieve a high cache hit rate, reduced latencies, and cost savings.

featured in #428


Migrating Netflix To GraphQL Safely

tl;dr: “Doing this safely for 100s of millions of customers without disruption is exceptionally challenging, especially considering the many dimensions of change involved. This blog post will share broadly-applicable techniques (beyond GraphQL) we used to perform this migration. The three strategies we will discuss today are AB Testing, Replay Testing, and Sticky Canaries.”

featured in #423


Migrating Critical Traffic At Scale With No Downtime — Part 2

tl;dr: “Replay traffic testing gives us the initial foundation of validation, but as our migration process unfolds, we are met with the need for a carefully controlled migration process. A process that doesn’t just minimize risk, but also facilitates a continuous evaluation of the rollout’s impact. This blog post will delve into the techniques leveraged at Netflix to introduce these changes to production.”

featured in #422


The Simple Joys Of Scaling Up

- Jordan Tigani tl;dr: “After such a dramatic increase in hardware capability, we should ask ourselves, “Do the conditions that drove our scaling challenges in 2003 still exist?” After all, we’ve made our systems far more complex and added a lot of overhead. Is it all still necessary? If you can do the job on a single machine, isn’t that going to be a better alternative?” This post digs into why scale-out became so dominant, looks at whether those rationales still hold, and then explores some advantages of a scale-up architecture.

featured in #416


Migrating Critical Traffic At Scale With No Downtime

tl;dr: From the team at Netflix: “when undertaking system migrations, one of the main challenges is establishing confidence and seamlessly transitioning the traffic to the upgraded architecture without adversely impacting the customer experience. This blog series will examine the tools, techniques, and strategies we utilized to achieve this goal.”

featured in #415