/Scale

How Discord Reduced Websocket Traffic by 40%

- Austin Whyte tl;dr: “zstandard has gained enough traction to become a viable replacement for zlib. Zstandard offers higher compression ratios and shorter compression times and supports dictionaries: a way to preemptively exchange information about compressed content, further increasing compression ratios and reducing the overall bandwidth usage.”

featured in #552
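
The dictionary idea in the summary above can be sketched in a few lines. This is only an illustration under stated assumptions: it uses the preset-dictionary support in Python's standard `zlib` module as a stand-in for zstandard dictionaries, and the payload and dictionary bytes are hypothetical, not Discord's actual messages.

```python
import zlib
from typing import Optional

# Hypothetical payload: a small JSON-like message (not Discord's real format).
SAMPLE = b'{"op":0,"t":"MESSAGE_CREATE","d":{"channel_id":"1","content":"hi"}}'
# Hypothetical dictionary: byte strings expected to recur in future messages.
# Both sides must hold the same dictionary before any message is exchanged.
DICTIONARY = b'{"op":0,"t":"MESSAGE_CREATE","d":{"channel_id":"","content":""}}'


def compress(payload: bytes, zdict: Optional[bytes] = None) -> bytes:
    """Compress one payload, optionally priming the compressor with a dictionary."""
    comp = zlib.compressobj(zdict=zdict) if zdict else zlib.compressobj()
    return comp.compress(payload) + comp.flush()


def decompress(blob: bytes, zdict: Optional[bytes] = None) -> bytes:
    """Decompress one payload; the same dictionary must be supplied if one was used."""
    decomp = zlib.decompressobj(zdict=zdict) if zdict else zlib.decompressobj()
    return decomp.decompress(blob) + decomp.flush()


plain_size = len(compress(SAMPLE))
dict_size = len(compress(SAMPLE, DICTIONARY))
print(f"without dictionary: {plain_size} bytes, with dictionary: {dict_size} bytes")
```

Short messages benefit most from this, because on their own they contain too little repetition for a compressor to exploit.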


The Sneaky Costs Of Scaling Serverless

- Zach Leatherman tl;dr: “I decided to take the plunge and migrate my site elsewhere, mostly to see what it would really cost. I learned a few things along the way (and made a few mistakes) — hopefully writing them up can help you save some money on your hosting bill, too.” 

featured in #541


Serving A Billion Web Requests With Boring Code

- Bill Mill tl;dr: “I worked on this system for about two and a half years, from the very first commit through two open enrollment periods. The API system served about 5 million requests on a normal weekday, with < 10 millisecond average request latency and a 95th percentile latency of less than 100 milliseconds.”

featured in #528
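
The summary above quotes both an average and a 95th percentile latency. A minimal sketch of why the two are reported together, using hypothetical latency samples (not data from the article) and Python's standard `statistics` module:

```python
import statistics

# Hypothetical request latencies in milliseconds: most requests are fast,
# a small tail is much slower.
latencies_ms = [4] * 95 + [80] * 5

mean_ms = statistics.mean(latencies_ms)
# quantiles(n=100) returns the 99 percentile cut points; index 94 is the 95th.
p95_ms = statistics.quantiles(latencies_ms, n=100)[94]

print(f"mean: {mean_ms} ms, p95: {p95_ms} ms")
```

The mean stays low because fast requests dominate it, while the percentile exposes the slow tail that some users actually experience.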


Personalized Marketing at Scale: Uber’s Out-of-App Recommendation System

tl;dr: "Out-of-app (OOA) communication (such as email, push, and SMS) is an important growth lever at Uber. It allows marketers, product owners, and operation teams to connect with users on a plethora of topics, including user promotions, new and favorite restaurants, etc. Building a system to personalize these communications presents unique and exciting challenges. In this blog post, we walk through these challenges and our journey in tackling them."

featured in #527


How Meta Trains Large Language Models At Scale

tl;dr: “Our AI model training has involved training a massive number of models that required a comparatively smaller number of GPUs. This was the case for our recommendation models that would ingest vast amounts of information to make accurate recommendations that power most of our products. With the advent of generative AI, we’ve seen a shift towards fewer jobs, but incredibly large ones. Supporting GenAI at scale has meant rethinking how our software, hardware, and network infrastructure come together.”

featured in #525


How Zapier Automates Billions Of Tasks

- Neo Kim tl;dr: Neo takes a look at Zapier's architecture, highlighting its use of Nginx, Python Django, MySQL, Redis, AWS Lambda, RabbitMQ, and Celery for automating billions of tasks. It details Zapier's tech stack, asynchronous processing, scalability strategies, and how they handle task execution and history tracking, using technologies like GraphQL, Next.js, AWS S3, Kafka, and Elasticsearch for efficiency and scalability. 

featured in #493


1.5+ Million PDFs In 25 minutes

- Sarat Chandra, Karan Sharma tl;dr: “In this blog post, we describe our journey of building an architecture from scratch which now enables us to process, generate, digitally sign, and e-mail out 1.5+ million PDF contract notes in about 25 minutes, incurring only negligible costs. We self-host all elements of this architecture relying on raw EC2 instances for compute and S3 for ephemeral storage. In addition, the concepts used for orchestration of this particular workflow can now be used for orchestrating many different kinds of distributed jobs within our infrastructure.”

featured in #492


Scaling ChatGPT: Five Real-World Engineering Challenges

- Gergely Orosz, Evan Morikawa tl;dr: An interview with Evan Morikawa, who led the OpenAI Applied Engineering team as ChatGPT launched and scaled. Evan reveals five engineering challenges along with the lessons learned. The challenges are: (1) KV cache and GPU RAM. (2) Optimizing batch size. (3) Finding the right metrics to measure. (4) Finding GPUs wherever they are. (5) Inability to autoscale.

featured in #491


Ledger: Stripe’s System For Tracking And Validating Money Movement

- Ilya Ganelin tl;dr: “Ledger models internal data-producing systems with common patterns, and it relies on proactive alerting to surface issues and proposed solutions. Each day, Ledger sees five billion events and 99.99% of our dollar volume is fully ingested and verified within four days. Of that activity, 99.999% is monitored, categorized, and triaged through rich investigative tooling — while the remaining long-tail is reliably handled through manual analysis.” This post shares technical details on how Stripe built this money movement tracking system, and how teams interact with the data quality metrics that underlie its global payments network.

featured in #490


How Disney+ Hotstar Delivered 5 Billion Emojis in Real Time

- Neo Kim tl;dr: This post outlines how Disney+ Hotstar delivered billions of emojis in real time during the cricket World Cup in India to create a more engaging live experience. It describes how emojis were received from clients, processed, and delivered at scale.

featured in #489