Data Quality Score: The Next Chapter Of Data Quality At Airbnb
- Clark Wright tl;dr: "With 1.4 billion cumulative guest arrivals as of year-end 2022, Airbnb’s growth pushed us to an inflection point where diminishing data quality began to hinder our data practitioners. Weekly metric reports were difficult to land on time. Seemingly basic metrics like “Active Listings” relied on a web of upstream dependencies. Conducting meaningful data work required significant institutional knowledge to overcome hidden caveats in our data." Clark discusses the implementation of a Data Quality Score. featured in #471
How Uber Computes ETA At Half A Million Requests Per Second
tl;dr: "A single trip usually takes around 1000 ETA requests. Yet computing ETA is a difficult problem. Because the distance between the source and destination is not a straight line. Instead it consists of complex street networks and highways." Engineers split a route into smaller partitions to find the shortest path within each partition, factoring in variables such as traffic. featured in #470
Real-Time Analytics For Mobile App Crashes using Apache Pinot
tl;dr: "At Uber, we have built a system called “Healthline” to help with our Mean Time To Detect (MTTD) and Mean Time To Resolve (MTTR) issues and to avoid potential outages and large-scale user impacts. Due to our ability to detect the issues in real time, this has become the go-to tool for release managers to observe the impact of canary release and decide whether to proceed further or to rollback. In this article we will be sharing details on how we are leveraging Apache Pinot™ to achieve this in real time at Uber scale." featured in #463
Making Sure Your Auth System Can Scale
- James Hickey tl;dr: Balancing authentication security against performance is a perpetual challenge. This article examines the trade-off between stringent security practices and system scalability. It offers practical tips for keeping auth secure while meeting customer demand, along with strategies for keeping systems both secure and efficient. featured in #462
Switching Build Systems, Seamlessly
- Patrick Balestra tl;dr: Patrick chronicles Spotify's shift to Bazel. The move was driven by the need for a scalable build system for their growing codebase. The transition, which began in earnest in 2020, involved running two build systems side by side, adapting existing tools, and extensive testing. By 2023, the iOS Spotify app was fully built with Bazel, resulting in significant improvements in build times and developer experience. featured in #461
Maxjourney: Pushing Discord's Limits With A Million+ Online Users In A Single Server
- Yuliy Pisetsky tl;dr: "With that growth, those servers started slowing down and creeping ever closer to their throughput limits. As that’s happened, we’ve continued to find many improvements to keep making them faster and pushing the limits out further. In this post we’ll talk about some of the ways we’ve scaled individual Discord servers from tens of thousands of concurrent users to approaching two million concurrent users in the past few years." featured in #460
How DoorDash Standardized And Improved Microservices Caching
- Jason Fan, Lev Neiman tl;dr: DoorDash's expanding microservices architecture led to challenges in interservice traffic and caching. The article details how DoorDash addressed these challenges by developing a library to standardize caching, enhancing performance without altering existing business logic. Key features include layered caches, runtime feature flag control, and observability with cache shadowing. The authors also provide guidance on when to use caching. featured in #459
How GitHub Indexes Code For Blazing Fast Search & Retrieval
- Shivang Sarawagi tl;dr: “The search engine supports global queries across 200 million repos and indexes code changes in repositories within minutes. The code search index is by far the largest cluster that GitHub runs, comprising 5184 vCPUs, 40TB of RAM, and 1.25PB of backing storage, supporting a query load of 200 requests per second on average and indexing over 53 billion source files.” featured in #458
Executing Cron Scripts Reliably At Scale
- Claire Adams tl;dr: Claire discusses the challenges of managing and executing cron scripts in a reliable manner within large-scale infrastructure. “The Job Queue is an asynchronous compute platform that runs about 9 billion “jobs” or pieces of work per day.” Claire provides insights into techniques such as distributed execution, retries, and monitoring to ensure the dependable execution of cron jobs at scale, highlighting the need for a systematic approach to handle failures effectively. featured in #457