Essential Reading For Engineering Leaders

- Dan Luu

Architecture
Scale

tl;dr: Dan discusses the effectiveness of simple architectures in software development, using Wave, a $1.7B company, as an example. Wave's architecture is a Python monolith on top of Postgres, allowing engineers to focus on delivering value to users. The article emphasizes that simple architectures can be created more cheaply and easily than complex ones, even for high-traffic apps. Despite the trend towards complex, microservice-based architectures, Dan argues for the "unreasonable effectiveness" of monoliths, detailing Wave's choices, mistakes, and areas of unavoidable complexity. Simplicity in architecture can lead to success, allowing companies to allocate complexity where it benefits the business.

featured in #439

How We Built The Canva Apps SDK

- Martin Cronjé

SDK
Scale

tl;dr: Martin’s article outlines the development of the Canva Apps SDK, transitioning from a plugin model to a more flexible app-building platform. The process involved building a secure sandboxed environment, creating a new build-and-deploy pipeline, and designing APIs with a focus on simplicity, safety, evolvability, and consistency. Iterative development, continuous feedback, and a balance between alignment and empowerment were key technical strategies in the SDK's creation.

featured in #437

Building And Operating A Pretty Big Storage System Called S3

- Werner Vogels

tl;dr: A repost of an article by Andy Warfield, VP of S3, that reflects on the vast complexity and operational scale of Amazon's storage software system. Andy discusses the significance of recognizing and mitigating organizational scaling issues, much as one would when optimizing a system. He also discusses how management’s approach of fostering team ownership of problem-solving, instead of dispensing solutions, has led to more engaged teams and more successful engineering outcomes.

featured in #436

Measuring Performance For iOS Apps At Uber Scale

Mobile
Scale

tl;dr: This article discusses how Uber measures performance, focusing on app startup performance on iOS. Uber monitors critical metrics such as UI flow latency, memory usage, bandwidth, and UI jank. App launch time is highlighted as a crucial industry-standard metric that directly impacts the customer experience.

featured in #431

Meta Developer Tools: Working At Scale

- Neil Mitchell

DevTools
Scale

tl;dr: “Every day, thousands of developers at Meta are working in repositories with millions of files. Those developers need tools that help them at every stage of the workflow while working at extreme scale. In this article we’ll go through a few of the tools in the development process. And, as an added bonus, those we talk about below are open source so you can try them yourself.”

featured in #430

Upscaling LinkedIn's Profile Datastore While Reducing Costs

- Estella Pham, Guanlin Lu

Scale

tl;dr: LinkedIn introduced Couchbase as a centralized storage tier cache to address scaling concerns. Challenges arose due to the cache not being backed by primary storage. The blog post discusses the decision, challenges faced, and solutions employed to achieve high cache hit rate, reduced latencies, and cost savings.

featured in #428

Migrating Netflix To GraphQL Safely

GraphQL
Scale

tl;dr: “Doing this safely for 100s of millions of customers without disruption is exceptionally challenging, especially considering the many dimensions of change involved. This blog post will share broadly-applicable techniques (beyond GraphQL) we used to perform this migration. The three strategies we will discuss today are AB Testing, Replay Testing, and Sticky Canaries.”

featured in #423

Migrating Critical Traffic At Scale With No Downtime — Part 2

Scale
Architecture

tl;dr: “Replay traffic testing gives us the initial foundation of validation, but as our migration process unfolds, we are met with the need for a carefully controlled migration process. A process that doesn’t just minimize risk, but also facilitates a continuous evaluation of the rollout’s impact. This blog post will delve into the techniques leveraged at Netflix to introduce these changes to production.”

featured in #422

The Simple Joys Of Scaling Up

- Jordan Tigani

Scale
Architecture

tl;dr: “After such a dramatic increase in hardware capability, we should ask ourselves, ‘Do the conditions that drove our scaling challenges in 2003 still exist?’ After all, we’ve made our systems far more complex and added a lot of overhead. Is it all still necessary? If you can do the job on a single machine, isn’t that going to be a better alternative?” This post digs into why scale-out became so dominant, looks at whether those rationales still hold, and then explores some advantages of a scale-up architecture.

featured in #416

Migrating Critical Traffic At Scale With No Downtime

Scale
Migration

tl;dr: From the team at Netflix: “when undertaking system migrations, one of the main challenges is establishing confidence and seamlessly transitioning the traffic to the upgraded architecture without adversely impacting the customer experience. This blog series will examine the tools, techniques, and strategies we utilized to achieve this goal.”

featured in #415
