/Scale

Real-Time Analytics For Mobile App Crashes Using Apache Pinot

tl;dr:  "At Uber, we have built a system called “Healthline” to help with our Mean Time To Detect (MTTD) and Mean Time To Resolve (MTTR) issues and to avoid potential outages and large-scale user impacts. Due to our ability to detect the issues in real time, this has become the go-to tool for release managers to observe the impact of canary release and decide whether to proceed further or to rollback. In this article we will be sharing details on how we are leveraging Apache Pinot™ to achieve this in real time at Uber scale."

featured in #463


Making Sure Your Auth System Can Scale

- James Hickey tl;dr: Balancing authentication security against performance is a perpetual challenge. This article digs into that trade-off between stringent security practices and system scalability, offering practical tips for keeping auth secure while meeting customer demand, along with strategies for keeping your systems efficient as they grow.
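One concrete face of that trade-off (a generic illustration, not necessarily an example from the article) is password hashing cost: a higher bcrypt work factor slows attackers down, but also caps how many logins each server can verify per second.

```python
# Illustrative sketch: time bcrypt at a few work factors to see the
# security vs. throughput trade-off on your own hardware.
import time
import bcrypt

password = b"correct horse battery staple"

for rounds in (10, 12, 14):
    start = time.perf_counter()
    bcrypt.hashpw(password, bcrypt.gensalt(rounds=rounds))
    elapsed = time.perf_counter() - start
    print(f"cost={rounds}: {elapsed * 1000:.0f} ms per hash, "
          f"roughly {1 / elapsed:.0f} verifications/sec per core")
```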

featured in #462


Switching Build Systems, Seamlessly

- Patrick Balestra tl;dr: Patrick chronicles Spotify's shift to Bazel. The move was driven by the need for a scalable build system for their growing codebase. The transition, which began in earnest in 2020, involved running two build systems side by side, adapting existing tools, and extensive testing. By 2023, the iOS Spotify app was fully built with Bazel, resulting in significant improvements in build times and developer experience.

featured in #461


Automating Dead Code Cleanup

tl;dr: "SCARF contains a subsystem that automatically identifies dead code through a combination of static, runtime, and application analysis. It leverages this analysis to submit change requests to remove this code from our systems. This automated dead code removal improves the quality of our systems and also unblocks unused data removal in SCARF when the dead code includes references to data assets that prevent automated data cleanup. "

featured in #460


Maxjourney: Pushing Discord's Limits With A Million+ Online Users In A Single Server

- Yuliy Pisetsky tl;dr: "With that growth, those servers started slowing down and creeping ever closer to their throughput limits. As that’s happened, we’ve continued to find many improvements to keep making them faster and pushing the limits out further. In this post we’ll talk about some of the ways we’ve scaled individual Discord servers from tens of thousands of concurrent users to approaching two million concurrent users in the past few years."

featured in #460


How DoorDash Standardized And Improved Microservices Caching

- Jason Fan, Lev Neiman tl;dr: DoorDash's expanding microservices architecture led to challenges in interservice traffic and caching. The article details how DoorDash addressed these challenges by developing a library that standardizes caching, enhancing performance without altering existing business logic. Key features include layered caches, runtime feature flag control, and observability with cache shadowing. The authors also provide guidance on when to use caching.
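Here is a minimal sketch of what a layered, flag-controlled read path can look like; the class and parameter names are assumptions for illustration, not DoorDash's actual library API.

```python
# Hypothetical layered cache: in-process dict -> Redis -> source of truth,
# with a runtime feature flag acting as a kill switch for caching.
import time
from typing import Callable, Optional

class LayeredCache:
    def __init__(self, redis_client, flag_enabled: Callable[[], bool], ttl: int = 60):
        self._local: dict[str, tuple[float, str]] = {}  # key -> (expiry, value)
        self._redis = redis_client
        self._flag_enabled = flag_enabled
        self._ttl = ttl

    def get(self, key: str, fallback: Callable[[], str]) -> str:
        if not self._flag_enabled():                    # runtime flag: bypass caching
            return fallback()

        entry = self._local.get(key)                    # layer 1: in-process cache
        if entry and entry[0] > time.time():
            return entry[1]

        value: Optional[bytes] = self._redis.get(key)   # layer 2: Redis
        if value is None:
            fresh = fallback()                          # layer 3: source of truth
            self._redis.set(key, fresh, ex=self._ttl)
            value = fresh.encode()

        decoded = value.decode() if isinstance(value, bytes) else str(value)
        self._local[key] = (time.time() + self._ttl, decoded)
        return decoded
```

Cache shadowing, reading from both the cache and the source of truth and comparing results, would slot into the same read path to measure staleness, which is the observability piece the article highlights.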

featured in #459


How GitHub Indexes Code For Blazing Fast Search & Retrieval

- Shivang Sarawagi tl;dr: “The search engine supports global queries across 200 million repos and indexes code changes in repositories within minutes. The code search index is by far the largest cluster that GitHub runs, comprising 5184 vCPUs, 40TB of RAM, and 1.25PB of backing storage, supporting a query load of 200 requests per second on average and indexing over 53 billion source files.”

featured in #458


Executing Cron Scripts Reliably At Scale

- Claire Adams tl;dr: Claire discusses the challenges of managing and executing cron scripts reliably within large-scale infrastructure. “The Job Queue is an asynchronous compute platform that runs about 9 billion “jobs” or pieces of work per day.” Claire provides insights into techniques such as distributed execution, retries, and monitoring to ensure the dependable execution of cron jobs at scale, highlighting the need for a systematic approach to handling failures.
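As a small illustration of the retry-and-monitor pattern (an assumption-heavy sketch, not Slack's Job Queue), a cron entry point can wrap its work in bounded retries with backoff and log a metric-friendly outcome either way:

```python
# Hypothetical cron wrapper: bounded retries with exponential backoff plus
# logging hooks so silent failures become visible to monitoring.
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("nightly_report")

def run_with_retries(job, max_attempts: int = 3, base_delay: float = 2.0) -> bool:
    for attempt in range(1, max_attempts + 1):
        try:
            job()
            log.info("job succeeded on attempt %d", attempt)    # success metric
            return True
        except Exception:
            log.exception("attempt %d/%d failed", attempt, max_attempts)
            time.sleep(base_delay * 2 ** (attempt - 1))          # exponential backoff
    log.error("job exhausted retries; alert on this signal")     # failure metric
    return False

def nightly_report():
    ...  # the actual script body

if __name__ == "__main__":
    raise SystemExit(0 if run_with_retries(nightly_report) else 1)
```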

featured in #457


From Big Data To Better Data: Ensuring Data Quality With Verity

- Michael McPhillips tl;dr: Michael emphasizes that "data quality is paramount for accurate insights," highlighting the challenge of ensuring data reliability. He introduces Lyft’s in-house data quality platform, Verity, whose flow rests on three main steps: (1) Data profiling: incoming data is scrutinized for its structure, schema, and content to surface potential anomalies and inconsistencies. (2) Customizable rules engine: data experts define quality rules tailored to their needs, from format validations to more intricate domain-specific checks. (3) Automated quality checks: once the rules are set, they are applied to incoming data streams, scanning each data point for discrepancies.
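A toy sketch of the rules-engine step (names and shapes are assumptions, not Verity's API): each rule is a predicate over a record, and every incoming record is checked against the registered rules.

```python
# Hypothetical rules engine: register named predicates, then report which
# rules each incoming record violates.
from typing import Callable

Record = dict
Rule = Callable[[Record], bool]

rules: dict[str, Rule] = {
    "fare_is_positive": lambda r: r.get("fare", 0) > 0,
    "city_is_present": lambda r: bool(r.get("city")),
}

def check(record: Record) -> list[str]:
    """Return the names of rules the record violates."""
    return [name for name, rule in rules.items() if not rule(record)]

print(check({"fare": -3.50, "city": "SF"}))   # ['fare_is_positive']
print(check({"fare": 12.00, "city": "SF"}))   # []
```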

featured in #457


How Pinterest Scaled To 11 Million Users With Only 6 Engineers

tl;dr: In 2012, Pinterest reached 11.7 million monthly users with just six engineers. The article chronicles Pinterest's journey from its launch in 2010 with a single engineer to its rapid growth. Key lessons include using proven technologies, keeping architecture simple, and avoiding over-complication. Pinterest faced challenges like data corruption due to clustering and had to pivot to more reliable technologies like MySQL and Memcached. By January 2012, they simplified their stack, removing less-proven technologies and focusing on manual database sharding for scalability.
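A minimal sketch of the manual-sharding idea (illustrative only, not Pinterest's code): fix the shard count up front and derive the shard from the object's ID, so any row can be located without a central lookup service.

```python
# Hypothetical manual sharding: a deterministic ID -> shard mapping.
SHARD_COUNT = 4096  # fixed up front so existing IDs never move

def shard_for(user_id: int) -> str:
    """Deterministically map a user ID to a named MySQL shard."""
    return f"users_shard_{user_id % SHARD_COUNT:04d}"

def lookup_dsn(shard: str) -> str:
    # In practice this would come from a config mapping shards to hosts.
    return f"mysql://db-host/{shard}"

uid = 9_815_327
print(shard_for(uid), "->", lookup_dsn(shard_for(uid)))
```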

featured in #454