/Scale

Real-Time Analytics For Mobile App Crashes Using Apache Pinot

tl;dr:  "At Uber, we have built a system called “Healthline” to help with our Mean Time To Detect (MTTD) and Mean Time To Resolve (MTTR) issues and to avoid potential outages and large-scale user impacts. Due to our ability to detect the issues in real time, this has become the go-to tool for release managers to observe the impact of canary release and decide whether to proceed further or to rollback. In this article we will be sharing details on how we are leveraging Apache Pinot™ to achieve this in real time at Uber scale."

featured in #463


Making Sure Your Auth System Can Scale

- James Hickey tl;dr: Balancing authentication security against performance is a perpetual challenge. This article digs into that trade-off between stringent security practices and system scalability, offering practical tips for keeping auth secure while meeting customer demand, along with strategies for keeping your systems efficient as they grow.
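One concrete face of that trade-off (a generic illustration, not necessarily an example from the article) is password hashing cost: a higher bcrypt work factor slows attackers down, but also caps how many logins each server can verify per second.

```python
# Illustrative sketch: time bcrypt at a few work factors to see the
# security vs. throughput trade-off on your own hardware.
import time
import bcrypt

password = b"correct horse battery staple"

for rounds in (10, 12, 14):
    start = time.perf_counter()
    bcrypt.hashpw(password, bcrypt.gensalt(rounds=rounds))
    elapsed = time.perf_counter() - start
    print(f"cost={rounds}: {elapsed * 1000:.0f} ms per hash, "
          f"roughly {1 / elapsed:.0f} verifications/sec per core")
```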

featured in #462


Switching Build Systems, Seamlessly

- Patrick Balestra tl;dr: Patrick chronicles Spotify's shift to Bazel. The move was driven by the need for a scalable build system for their growing codebase. The transition, which began in earnest in 2020, involved running two build systems side by side, adapting existing tools, and extensive testing. By 2023, the iOS Spotify app was fully built with Bazel, resulting in significant improvements in build times and developer experience.

featured in #461


Automating Dead Code Cleanup

tl;dr: "SCARF contains a subsystem that automatically identifies dead code through a combination of static, runtime, and application analysis. It leverages this analysis to submit change requests to remove this code from our systems. This automated dead code removal improves the quality of our systems and also unblocks unused data removal in SCARF when the dead code includes references to data assets that prevent automated data cleanup. "

featured in #460


Maxjourney: Pushing Discord's Limits With A Million+ Online Users In A Single Server

- Yuliy Pisetsky tl;dr: "With that growth, those servers started slowing down and creeping ever closer to their throughput limits. As that’s happened, we’ve continued to find many improvements to keep making them faster and pushing the limits out further. In this post we’ll talk about some of the ways we’ve scaled individual Discord servers from tens of thousands of concurrent users to approaching two million concurrent users in the past few years."

featured in #460


How DoorDash Standardized And Improved Microservices Caching

- Jason Fan, Lev Neiman tl;dr: DoorDash's expanding microservices architecture led to challenges in interservice traffic and caching. The article details how DoorDash addressed these challenges by developing a library that standardizes caching, enhancing performance without altering existing business logic. Key features include layered caches, runtime feature flag control, and observability with cache shadowing. The authors also provide guidance on when to use caching.
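Here is a minimal sketch of what a layered, flag-controlled read path can look like; the class and parameter names are assumptions for illustration, not DoorDash's actual library API.

```python
# Hypothetical layered cache: in-process dict -> Redis -> source of truth,
# with a runtime feature flag acting as a kill switch for caching.
import time
from typing import Callable, Optional

class LayeredCache:
    def __init__(self, redis_client, flag_enabled: Callable[[], bool], ttl: int = 60):
        self._local: dict[str, tuple[float, str]] = {}  # key -> (expiry, value)
        self._redis = redis_client
        self._flag_enabled = flag_enabled
        self._ttl = ttl

    def get(self, key: str, fallback: Callable[[], str]) -> str:
        if not self._flag_enabled():                    # runtime flag: bypass caching
            return fallback()

        entry = self._local.get(key)                    # layer 1: in-process cache
        if entry and entry[0] > time.time():
            return entry[1]

        value: Optional[bytes] = self._redis.get(key)   # layer 2: Redis
        if value is None:
            fresh = fallback()                          # layer 3: source of truth
            self._redis.set(key, fresh, ex=self._ttl)
            value = fresh.encode()

        decoded = value.decode() if isinstance(value, bytes) else str(value)
        self._local[key] = (time.time() + self._ttl, decoded)
        return decoded
```

Cache shadowing, reading from both the cache and the source of truth and comparing results, would slot into the same read path to measure staleness, which is the observability piece the article highlights.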

featured in #459


How GitHub Indexes Code For Blazing Fast Search & Retrieval

- Shivang Sarawagi tl;dr: “The search engine supports global queries across 200 million repos and indexes code changes in repositories within minutes. The code search index is by far the largest cluster that GitHub runs, comprising 5184 vCPUs, 40TB of RAM, and 1.25PB of backing storage, supporting a query load of 200 requests per second on average and indexing over 53 billion source files.”

featured in #458


Executing Cron Scripts Reliably At Scale

- Claire Adams tl;dr: Claire discusses the challenges of managing and executing cron scripts reliably within large-scale infrastructure. “The Job Queue is an asynchronous compute platform that runs about 9 billion “jobs” or pieces of work per day.” Claire provides insights into techniques such as distributed execution, retries, and monitoring to ensure the dependable execution of cron jobs at scale, highlighting the need for a systematic approach to handling failures.
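As a small illustration of the retry-and-monitor pattern (an assumption-heavy sketch, not Slack's Job Queue), a cron entry point can wrap its work in bounded retries with backoff and log a metric-friendly outcome either way:

```python
# Hypothetical cron wrapper: bounded retries with exponential backoff plus
# logging hooks so silent failures become visible to monitoring.
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("nightly_report")

def run_with_retries(job, max_attempts: int = 3, base_delay: float = 2.0) -> bool:
    for attempt in range(1, max_attempts + 1):
        try:
            job()
            log.info("job succeeded on attempt %d", attempt)    # success metric
            return True
        except Exception:
            log.exception("attempt %d/%d failed", attempt, max_attempts)
            time.sleep(base_delay * 2 ** (attempt - 1))          # exponential backoff
    log.error("job exhausted retries; alert on this signal")     # failure metric
    return False

def nightly_report():
    ...  # the actual script body

if __name__ == "__main__":
    raise SystemExit(0 if run_with_retries(nightly_report) else 1)
```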

featured in #457


From Big Data To Better Data: Ensuring Data Quality With Verity

- Michael McPhillips tl;dr: Michael emphasizes that "data quality is paramount for accurate insights," highlighting the challenge of ensuring data reliability. He introduces Lyft’s in-house data quality platform, Verity, whose flow rests on three main steps: (1) Data profiling: incoming data is scrutinized for its structure, schema, and content to surface potential anomalies and inconsistencies. (2) Customizable rules engine: data experts define quality rules tailored to their needs, from format validations to more intricate domain-specific checks. (3) Automated quality checks: once the rules are set, they are applied to incoming data streams, scanning each data point for discrepancies.
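A toy sketch of the rules-engine step (names and shapes are assumptions, not Verity's API): each rule is a predicate over a record, and every incoming record is checked against the registered rules.

```python
# Hypothetical rules engine: register named predicates, then report which
# rules each incoming record violates.
from typing import Callable

Record = dict
Rule = Callable[[Record], bool]

rules: dict[str, Rule] = {
    "fare_is_positive": lambda r: r.get("fare", 0) > 0,
    "city_is_present": lambda r: bool(r.get("city")),
}

def check(record: Record) -> list[str]:
    """Return the names of rules the record violates."""
    return [name for name, rule in rules.items() if not rule(record)]

print(check({"fare": -3.50, "city": "SF"}))   # ['fare_is_positive']
print(check({"fare": 12.00, "city": "SF"}))   # []
```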

featured in #457


How Pinterest Scaled To 11 Million Users With Only 6 Engineers

tl;dr: In 2012, Pinterest reached 11.7 million monthly users with just six engineers. The article chronicles Pinterest's journey from its launch in 2010 with a single engineer to its rapid growth. Key lessons include using proven technologies, keeping architecture simple, and avoiding over-complication. Pinterest faced challenges like data corruption due to clustering and had to pivot to more reliable technologies like MySQL and Memcached. By January 2012, they simplified their stack, removing less-proven technologies and focusing on manual database sharding for scalability.
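A minimal sketch of the manual-sharding idea (illustrative only, not Pinterest's code): fix the shard count up front and derive the shard from the object's ID, so any row can be located without a central lookup service.

```python
# Hypothetical manual sharding: a deterministic ID -> shard mapping.
SHARD_COUNT = 4096  # fixed up front so existing IDs never move

def shard_for(user_id: int) -> str:
    """Deterministically map a user ID to a named MySQL shard."""
    return f"users_shard_{user_id % SHARD_COUNT:04d}"

def lookup_dsn(shard: str) -> str:
    # In practice this would come from a config mapping shards to hosts.
    return f"mysql://db-host/{shard}"

uid = 9_815_327
print(shard_for(uid), "->", lookup_dsn(shard_for(uid)))
```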

featured in #454