Data-Caching Techniques For 1.2 Billion Daily API Requests
- Guillermo Pérez tl;dr: The cache needs to achieve three things: (1) Low latency: it needs to be fast, and if a cache server has issues, you can’t retry. (2) Up and warm: it needs to hold the majority of the critical data; losing it would bring down the backend systems with too much load. (3) Consistency: it should never hold stale or incorrect data. “A lot of the techniques mentioned in this article are supported by our open source meta-memcache cache client.” featured in #486
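To make the three requirements concrete, here is a minimal cache-aside sketch in Python. It assumes a generic memcached-style client with get/set/delete and a hypothetical `db` object; the names, TTL, and error handling are illustrative, not meta-memcache's actual API.

```python
# Minimal cache-aside sketch. `cache` and `db` are hypothetical stand-ins for a
# memcached-style client and a backing store; this is not meta-memcache's API.

CACHE_TTL_S = 300  # keep hot keys resident so the cache stays up and warm

def read_user(cache, db, user_id: str):
    key = f"user:{user_id}"
    try:
        # (1) Low latency: a single fast lookup, no retries if the server is sick.
        value = cache.get(key)
        if value is not None:
            return value
    except Exception:
        pass  # treat a failing cache server as a miss rather than retrying

    # (2) Up and warm: misses fall through to the backend, then repopulate the
    # cache so the next reader is served without touching the database.
    value = db.fetch_user(user_id)
    cache.set(key, value, expire=CACHE_TTL_S)
    return value

def write_user(cache, db, user_id: str, new_value):
    # (3) Consistency: update the source of truth first, then invalidate the
    # cached copy so readers never see stale data.
    db.store_user(user_id, new_value)
    cache.delete(f"user:{user_id}")
```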
Meta's Serverless Platform Processes Trillions Of Function Calls A Day
- Leonardo Creed tl;dr: XFaaS is Meta’s serverless platform, processing trillions of function calls per day on more than 100,000 servers spread across tens of datacenter regions. It is Meta’s internal version of public Function-as-a-Service (FaaS) offerings such as AWS Lambda, Google Cloud Functions, and Azure Functions. Leonardo shares his high-level takeaways and lessons, followed by a more detailed walkthrough of the architecture behind XFaaS. featured in #485
Sensenmann: Code Deletion At Scale
- Phil Norman tl;dr: “What if we could clean up dead code automatically? That was exactly what people started thinking several years ago, during the Zürich Engineering Productivity team's annual hackathon. The Sensenmann project, named after the German word for the embodiment of Death, has been highly successful. It submits over 1000 deletion changelists per week, and has so far deleted nearly 5% of all C++ at Google. Its goal is simple (at least, in principle): automatically identify dead code, and send code review requests to delete it.” Phil discusses its logic. featured in #482
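The core idea (find code nothing alive depends on) is a reachability problem over a dependency graph. The Python sketch below only illustrates that principle, not Sensenmann's implementation; the edge list and entry points are made up.

```python
from collections import defaultdict

# Illustrative reachability sketch, not Sensenmann's implementation: treat the
# codebase as a call/dependency graph, mark everything reachable from live
# entry points, and propose whatever remains for deletion.

def find_dead_symbols(edges: list[tuple[str, str]], entry_points: set[str]) -> set[str]:
    """edges are (caller, callee) pairs; returns symbols unreachable from any entry point."""
    graph = defaultdict(set)
    symbols = set(entry_points)
    for caller, callee in edges:
        graph[caller].add(callee)
        symbols.update((caller, callee))

    reachable, stack = set(), list(entry_points)
    while stack:
        node = stack.pop()
        if node in reachable:
            continue
        reachable.add(node)
        stack.extend(graph[node])

    return symbols - reachable

# Example: main -> parse -> lex is live, while old_parser is never called.
edges = [("main", "parse"), ("parse", "lex"), ("old_parser", "lex")]
print(find_dead_symbols(edges, {"main"}))  # {'old_parser'}
```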
How Apple Built iCloud To Store Billions Of Databases
- Leonardo Creed tl;dr: Apple uses FoundationDB and Cassandra for iCloud and CloudKit, their cloud backend service, running one of the largest Cassandra deployments in the world. This deployment includes over 300,000 instances (nodes) managing hundreds of petabytes of data, possibly extending to exabytes. Each cluster can handle over two petabytes of data, and there are thousands of such clusters, making for a highly distributed and scalable storage system spread across multiple data centers. featured in #480
How Discord Serves 15-Million Users On One Server
- Alex Xu tl;dr: "Internally, each Discord community is called a “guild”. A dedicated Elixir “guild process” handles coordination and routing for each guild. This tracks all connected users to the guild. Every online user has a separate Elixir “session process”. When the guild process gets a new message, event, or update, it fans out this information to the relevant session processes. These session processes then push the update over WebSocket to the Discord clients." featured in #479
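The guild/session fan-out pattern can be sketched in a few lines of Python with asyncio; Discord's real system uses Elixir processes and WebSockets, so the class names and queues below are illustrative stand-ins.

```python
import asyncio

# Toy sketch of the guild/session fan-out described above. A queue stands in
# for each client's WebSocket connection.

class SessionProcess:
    """One per online user; pushes updates toward that user's client."""
    def __init__(self, user_id: str):
        self.user_id = user_id
        self.outbox: asyncio.Queue = asyncio.Queue()  # stand-in for a WebSocket

    async def push(self, event: dict):
        await self.outbox.put(event)

class GuildProcess:
    """One per community; tracks connected sessions and fans out events."""
    def __init__(self):
        self.sessions: dict[str, SessionProcess] = {}

    def connect(self, session: SessionProcess):
        self.sessions[session.user_id] = session

    async def broadcast(self, event: dict):
        # Fan the event out to every connected session concurrently.
        await asyncio.gather(*(s.push(event) for s in self.sessions.values()))

async def main():
    guild = GuildProcess()
    alice, bob = SessionProcess("alice"), SessionProcess("bob")
    guild.connect(alice)
    guild.connect(bob)
    await guild.broadcast({"type": "MESSAGE_CREATE", "content": "hi"})
    print(await alice.outbox.get(), await bob.outbox.get())

asyncio.run(main())
```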
How Uber Finds Nearby Drivers At 1 Million Requests Per Second
tl;dr: H3 is a hierarchical geospatial indexing system created at Uber that divides the Earth’s surface into a grid of hexagonal cells, giving each cell a unique 64-bit integer identifier. Uber finds nearby drivers by identifying the cells covering the rider's area and then listing the drivers in those cells, sorted by estimated time of arrival (ETA). H3 combines the benefits of a hierarchical indexing system and a hexagonal grid system. featured in #477
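A rough sketch of that lookup using Uber's open source h3 Python package (function names follow the v4 API: latlng_to_cell and grid_disk). The driver index, coordinates, and ETA values are invented for illustration.

```python
import h3  # pip install h3 (v4 API assumed)

RESOLUTION = 9  # roughly 0.1 km^2 hexagons

# Hypothetical fleet: (driver_id, lat, lng, eta_minutes).
drivers = [
    ("d1", 37.7750, -122.4195, 4),
    ("d2", 37.7752, -122.4190, 7),
    ("d3", 37.7760, -122.4180, 3),
]

# Index drivers by the H3 cell they are currently in.
drivers_by_cell: dict[str, list[tuple[str, int]]] = {}
for driver_id, lat, lng, eta in drivers:
    cell = h3.latlng_to_cell(lat, lng, RESOLUTION)
    drivers_by_cell.setdefault(cell, []).append((driver_id, eta))

def nearby_drivers(rider_lat: float, rider_lng: float, k: int = 1):
    rider_cell = h3.latlng_to_cell(rider_lat, rider_lng, RESOLUTION)
    # grid_disk returns the rider's own cell plus every cell within k rings.
    candidate_cells = h3.grid_disk(rider_cell, k)
    candidates = [d for cell in candidate_cells for d in drivers_by_cell.get(cell, [])]
    return sorted(candidates, key=lambda d: d[1])  # closest ETA first

print(nearby_drivers(37.7749, -122.4194))  # drivers in or near the rider's cell
```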
uVitals – An Anomaly Detection & Alerting System
- Venki Appiah, Komal Raulkar tl;dr: "Every day, millions of people rely on Uber to move from place to place and have food and groceries delivered. Uber depends on the reliability of its internal systems and the accuracy of data to power its platform. A glitch in its systems can result in a poor user experience and/or a loss in revenue. Major system issues that affect the reliability of our services are detected and mitigated quickly. However, there are several minor issues that take a longer time to detect and mitigate. Such minor issues can collectively result in poor user experiences and revenue loss over time. This is where uVitals comes in, as it surfaces these issues and anomalies when they begin to occur." featured in #476
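As a toy illustration of surfacing such a "minor issue", the sketch below flags a metric whose latest value drifts several standard deviations from its recent history. This is a generic z-score check, not uVitals' actual detection logic, and the metric and numbers are made up.

```python
import statistics

# Deliberately simple anomaly check, not uVitals itself: flag a metric whose
# latest value sits more than `threshold` standard deviations from its history.

def is_anomalous(history: list[float], latest: float, threshold: float = 3.0) -> bool:
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        return latest != mean
    return abs(latest - mean) / stdev > threshold

# e.g. daily count of failed payment callbacks for one city (illustrative data)
history = [120, 118, 125, 122, 119, 121, 117]
print(is_anomalous(history, 123))  # False: within normal variation
print(is_anomalous(history, 190))  # True: a "minor issue" worth surfacing
```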
How Meta Built The Infrastructure For Threads
tl;dr: The article gives examples of two existing components that played an important architectural role in building Threads: (1) ZippyDB, a distributed key/value datastore that provides scalability and flexibility across data centers. (2) Async, an asynchronous serverless function platform that processes trillions of function calls daily across over 100,000 servers. Async defers computing to off-peak hours and reduces the time from solution conception to production deployment by handling deployment complexities, i.e. queueing, scheduling, scaling, and disaster recovery. This allowed developers to focus on business logic. featured in #475
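A toy sketch of the off-peak deferral idea: run a submitted function immediately when outside the peak window, otherwise push it to the end of the window unless the caller's delay tolerance would be violated. The peak window and function names are assumptions, not Meta's Async implementation.

```python
from datetime import datetime, timedelta

# Toy "defer to off-peak" scheduler; the 09:00-20:59 peak window is an assumption.
PEAK_HOURS = range(9, 21)

def next_run_time(submitted_at: datetime, delay_tolerance: timedelta) -> datetime:
    """Run now if off-peak; otherwise defer to the end of the peak window,
    but never past the caller's delay tolerance."""
    deadline = submitted_at + delay_tolerance
    if submitted_at.hour not in PEAK_HOURS:
        return submitted_at
    off_peak = submitted_at.replace(hour=21, minute=0, second=0, microsecond=0)
    return min(off_peak, deadline)

now = datetime(2024, 1, 15, 14, 30)
print(next_run_time(now, timedelta(hours=12)))  # 2024-01-15 21:00:00 (deferred)
print(next_run_time(now, timedelta(hours=2)))   # 2024-01-15 16:30:00 (deadline wins)
```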
Data Quality Score: The Next Chapter Of Data Quality At Airbnb
- Clark Wright tl;dr: "With 1.4 billion cumulative guest arrivals as of year-end 2022, Airbnb’s growth pushed us to an inflection point where diminishing data quality began to hinder our data practitioners. Weekly metric reports were difficult to land on time. Seemingly basic metrics like “Active Listings” relied on a web of upstream dependencies. Conducting meaningful data work required significant institutional knowledge to overcome hidden caveats in our data." Clark discusses the implementation of a Data Quality Score. featured in #471
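One way to picture such a score is as a weighted combination of per-dimension checks, as in the sketch below. The dimensions, weights, and example dataset are assumptions for illustration, not Airbnb's actual rubric.

```python
# Illustrative scoring sketch, not Airbnb's rubric: combine weighted checks
# (freshness, completeness, documentation, ownership) into a 0-100 score.

WEIGHTS = {"freshness": 0.3, "completeness": 0.3, "documentation": 0.2, "ownership": 0.2}

def data_quality_score(checks: dict[str, float]) -> float:
    """checks maps each dimension to a 0.0-1.0 pass rate for one dataset."""
    return round(100 * sum(WEIGHTS[dim] * checks.get(dim, 0.0) for dim in WEIGHTS), 1)

# Hypothetical results for an "Active Listings" table.
active_listings = {"freshness": 0.6, "completeness": 0.95, "documentation": 0.5, "ownership": 1.0}
print(data_quality_score(active_listings))  # 76.5
```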
How Uber Computes ETA At Half A Million Requests Per Second
tl;dr: "A single trip usually takes around 1000 ETA requests.Yet computing ETA is a difficult problem. Because the distance between the source and destination is not a straight line. Instead it consists of complex street networks and highways." Engineers split a route into smaller partitions to find the shortest path amongst each partition, factoring in variables, such as traffic.featured in #470