/Scale

Data-Caching Techniques For 1.2 Billion Daily API Requests

- Guillermo Pérez tl;dr: The cache needs to achieve three things: (1) Low latency: It needs to be fast. If a cache server has issues, you can’t retry. (2) Up and warm: It needs to hold the majority of the critical data. If you lose it, it would surely bring down the backend systems with too much load. (3) Consistency: It should never hold stale or incorrect data. “A lot of the techniques mentioned in this article are supported by our open source meta-memcache cache client.”
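The three properties interact: to stay consistent you invalidate rather than serve stale entries, and warming happens naturally on misses. A minimal cache-aside sketch in Python illustrates the pattern (the class and its dict-backed "database" are illustrative, not meta-memcache's actual API):

```python
class CacheAside:
    """Minimal cache-aside store: reads fall back to the backing
    database on a miss, and writes invalidate the cached entry so
    the cache never serves stale data (property 3 above)."""

    def __init__(self, db):
        self.db = db          # backing store: a plain dict stands in for it
        self.cache = {}       # key -> value

    def get(self, key):
        if key in self.cache:             # fast path: served from memory
            return self.cache[key]
        value = self.db.get(key)          # miss: fetch from the backend...
        self.cache[key] = value           # ...and warm the cache (property 2)
        return value

    def set(self, key, value):
        self.db[key] = value              # write to the source of truth first
        self.cache.pop(key, None)         # then invalidate rather than
                                          # overwrite, to avoid serving stale data
```

Invalidating on write (instead of writing the new value into the cache) is the conservative choice: the next read re-fetches from the source of truth.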

featured in #486


Meta's Serverless Platform Processes Trillions Of Function Calls A Day

- Leonardo Creed tl;dr: Meta’s XFaaS is its serverless platform, processing trillions of function calls per day on more than 100,000 servers spread across tens of datacenter regions. XFaaS is Meta’s internal version of public Function-as-a-Service (FaaS) offerings such as AWS Lambda, Google Cloud Functions, and Azure Functions. Leonardo shares his high-level takeaways and lessons, followed by a more detailed walkthrough of the architecture behind XFaaS.

featured in #485


Sensenmann: Code Deletion At Scale

- Phil Norman tl;dr: “What if we could clean up dead code automatically? That was exactly what people started thinking several years ago, during the Zürich Engineering Productivity team's annual hackathon. The Sensenmann project, named after the German word for the embodiment of Death, has been highly successful. It submits over 1000 deletion changelists per week, and has so far deleted nearly 5% of all C++ at Google. Its goal is simple (at least, in principle): automatically identify dead code, and send code review requests to delete it.” Phil discusses its logic. 
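At its core, identifying dead code is a reachability problem: mark everything transitively referenced from known entry points, then sweep what's left. A simplified sketch (Sensenmann actually works over Google's build graph and handles many edge cases; the function below is only the core idea):

```python
from collections import deque

def dead_symbols(call_graph, entry_points):
    """Mark-and-sweep over a call graph: any symbol not reachable from
    an entry point (main functions, exported APIs, test targets) is a
    candidate for deletion. call_graph maps symbol -> list of callees."""
    live = set()
    queue = deque(entry_points)
    while queue:
        sym = queue.popleft()
        if sym in live:
            continue                      # already marked live
        live.add(sym)
        queue.extend(call_graph.get(sym, ()))
    # Everything in the graph that was never marked is dead.
    return sorted(set(call_graph) - live)
```

The hard part in practice isn't the traversal; it's choosing the entry points (reflection, RPC handlers, and test-only code all complicate "reachable").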

featured in #482


How Apple Built iCloud To Store Billions Of Databases

- Leonardo Creed tl;dr: Apple uses FoundationDB and Cassandra for iCloud and CloudKit, its cloud backend service, running one of the largest Cassandra deployments in the world: over 300,000 instances (nodes) managing hundreds of petabytes of data, possibly extending to exabytes. Each cluster can handle over two petabytes of data, and there are thousands of such clusters, making for a highly distributed, scalable storage system spread across multiple data centers.

featured in #480


How Discord Serves 15 Million Users On One Server

- Alex Xu tl;dr: "Internally, each Discord community is called a “guild”. A dedicated Elixir “guild process” handles coordination and routing for each guild, tracking all users connected to it. Every online user has a separate Elixir “session process”. When the guild process gets a new message, event, or update, it fans this information out to the relevant session processes, which then push the update over WebSocket to the Discord clients."
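The fan-out pattern itself is simple to sketch. Below, Python queues stand in for Discord's per-session Elixir processes (names and structure are illustrative, not Discord's actual code):

```python
from queue import Queue

class GuildProcess:
    """Toy model of Discord's per-guild coordinator: it tracks the
    session of every connected user and fans each incoming event out
    to all of them. In the real system, the guild and each session are
    separate Elixir processes; queues stand in for them here."""

    def __init__(self):
        self.sessions = {}                # user_id -> session inbox

    def connect(self, user_id):
        self.sessions[user_id] = Queue()  # one "session process" per user
        return self.sessions[user_id]

    def publish(self, event):
        # Fan-out: every online session gets a copy of the event, which
        # it would then push to its client over WebSocket.
        for inbox in self.sessions.values():
            inbox.put(event)
```

The scaling challenge in the article is exactly this loop: a large guild means one event becomes hundreds of thousands of deliveries.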

featured in #479


How Uber Finds Nearby Drivers At 1 Million Requests Per Second

tl;dr: H3 is a hexagonal hierarchical geospatial indexing system created at Uber. It divides Earth’s surface into cells on a flat grid and gives each cell a unique 64-bit integer identifier. Uber finds nearby drivers by identifying the cells covering the rider’s area and then listing the drivers in those cells, sorted by estimated time of arrival (ETA). H3 combines the benefits of a hierarchical indexing system with those of a hexagonal grid system.

featured in #477


uVitals – An Anomaly Detection & Alerting System

- Venki Appiah, Komal Raulkar tl;dr: "Every day, millions of people rely on Uber to move from place to place and have food and groceries delivered. Uber depends on the reliability of its internal systems and the accuracy of data to power its platform. A glitch in its systems can result in a poor user experience and/or a loss in revenue. Major system issues that affect the reliability of our services are detected and mitigated quickly. However, there are several minor issues that take a longer time to detect and mitigate. Such minor issues can collectively result in poor user experiences and revenue loss over time. This is where uVitals comes in, as it surfaces these issues and anomalies when they begin to occur."
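The simplest version of "surface issues when they begin to occur" is deviation detection against a trailing baseline. A minimal sketch (a z-score over a rolling window; uVitals' actual detectors are more sophisticated):

```python
from statistics import mean, stdev

def anomalies(series, window=7, threshold=3.0):
    """Flag points that deviate from the trailing window's mean by more
    than `threshold` standard deviations -- the kind of low-grade drift
    a system like uVitals surfaces before it becomes an outage."""
    flagged = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma and abs(series[i] - mu) / sigma > threshold:
            flagged.append(i)             # index of the anomalous point
    return flagged
```

The hard part at Uber's scale is less the statistics than routing thousands of such signals to the right owning team with low false-positive rates.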

featured in #476


How Meta Built The Infrastructure For Threads

tl;dr: The article gives examples of two existing components that played an important architectural role in building Threads: (1) ZippyDB, a distributed key/value datastore that provides scalability and flexibility across data centers. (2) Async, an asynchronous serverless function platform that processes trillions of function calls daily across over 100,000 servers. Async defers computing to off-peak hours, reducing time from solution conception to production deployment by handling deployment complexities, i.e. queueing, scheduling, scaling, and disaster recovery. This allows developers to focus on business logic.
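"Defers computing to off-peak hours" amounts to stamping delay-tolerant jobs with a run-time inside the next low-traffic window. A toy sketch (the 01:00–05:00 window and function names are hypothetical, not Async's actual scheduling policy):

```python
from datetime import datetime, timedelta

OFF_PEAK_START = 1   # hypothetical off-peak window: 01:00-05:00 local time
OFF_PEAK_END = 5

def next_off_peak_slot(now):
    """Defer a delay-tolerant job to the next off-peak window, the way
    Async shifts flexible work away from daily traffic peaks."""
    if OFF_PEAK_START <= now.hour < OFF_PEAK_END:
        return now                        # already off-peak: run immediately
    slot = now.replace(hour=OFF_PEAK_START, minute=0, second=0, microsecond=0)
    if now.hour >= OFF_PEAK_END:
        slot += timedelta(days=1)         # today's window already passed
    return slot
```

Shifting flexible load this way lets the same server fleet absorb both the product's peak traffic and its background work.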

featured in #475


Data Quality Score: The Next Chapter Of Data Quality At Airbnb

- Clark Wright tl;dr: "With 1.4 billion cumulative guest arrivals as of year-end 2022, Airbnb’s growth pushed us to an inflection point where diminishing data quality began to hinder our data practitioners. Weekly metric reports were difficult to land on time. Seemingly basic metrics like “Active Listings” relied on a web of upstream dependencies. Conducting meaningful data work required significant institutional knowledge to overcome hidden caveats in our data." Clark discusses the implementation of a Data Quality Score.
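A data quality score typically rolls per-dimension check results up into one number per dataset. A minimal sketch (the dimension names and weights below are illustrative, not Airbnb's exact rubric):

```python
def dq_score(pass_rates, weights):
    """Weighted roll-up of per-dimension pass rates (each 0.0-1.0)
    into a single 0-100 data quality score."""
    total = sum(weights.values())
    weighted = sum(pass_rates[dim] * w for dim, w in weights.items())
    return round(100 * weighted / total, 1)
```

A single score makes quality comparable across thousands of datasets, so practitioners can see at a glance which "Active Listings"-style dependencies need attention first.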

featured in #471


How Uber Computes ETA At Half A Million Requests Per Second

tl;dr: "A single trip usually takes around 1,000 ETA requests. Yet computing ETA is a difficult problem, because the route between the source and destination is not a straight line; it consists of complex street networks and highways." Engineers split a route into smaller partitions and find the shortest path within each partition, factoring in variables such as traffic.
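Within one partition, the shortest-path core is classic Dijkstra over a street graph whose edge weights are travel times scaled by live traffic. A sketch of that single-partition core (Uber's real system adds partitioning and precomputation on top; the graph shape here is illustrative):

```python
import heapq

def eta(graph, src, dst):
    """Dijkstra over a street graph. Each edge is (neighbor, base_secs,
    traffic_multiplier), so weights are live travel times in seconds."""
    dist = {src: 0}
    heap = [(0, src)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == dst:
            return d                      # first pop of dst is optimal
        if d > dist.get(node, float("inf")):
            continue                      # stale heap entry, skip
        for nbr, base, traffic in graph.get(node, ()):
            nd = d + base * traffic
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                heapq.heappush(heap, (nd, nbr))
    return None                           # destination unreachable
```

Partitioning the graph keeps each search small enough that half a million of these queries per second is feasible.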

featured in #470