featured in #584
How Discord Reduced Websocket Traffic by 40%
- Austin Whyte tl;dr: “zstandard has gained enough traction to become a viable replacement for zlib. Zstandard offers higher compression ratios and shorter compression times and supports dictionaries: a way to preemptively exchange information about compressed content, further increasing compression ratios and reducing the overall bandwidth usage.”
featured in #552
The Sneaky Costs Of Scaling Serverless
- Zach Leatherman tl;dr: “I decided to take the plunge and migrate my site elsewhere, mostly to see what it would really cost. I learned a few things along the way (and made a few mistakes) — hopefully writing them up can help you save some money on your hosting bill, too.”
featured in #541
Serving A Billion Web Requests With Boring Code
- Bill Mill tl;dr: “I worked on this system for about two and a half years, from the very first commit through two open enrollment periods. The API system served about 5 million requests on a normal weekday, with < 10 millisecond average request latency and a 95th percentile latency of less than 100 milliseconds.”
featured in #528
Personalized Marketing at Scale: Uber’s Out-of-App Recommendation System
tl;dr: "Out-of-app (OOA) communication (such as email, push, and SMS) is an important growth lever at Uber. It allows marketers, product owners, and operation teams to connect with users on a plethora of topics, including user promotions, new and favorite restaurants, etc. Building a system to personalize these communications presents unique and exciting challenges. In this blog post, we walk through these challenges and our journey in tackling them."
featured in #527
How Meta Trains Large Language Models At Scale
tl;dr: “Our AI model training has involved training a massive number of models that required a comparatively smaller number of GPUs. This was the case for our recommendation models that would ingest vast amounts of information to make accurate recommendations that power most of our products. With the advent of generative AI, we’ve seen a shift towards fewer jobs, but incredibly large ones. Supporting GenAI at scale has meant rethinking how our software, hardware, and network infrastructure come together.”
featured in #525
How Zapier Automates Billions Of Tasks
- Neo Kim tl;dr: Neo takes a look at Zapier's architecture, highlighting its use of Nginx, Python Django, MySQL, Redis, AWS Lambda, RabbitMQ, and Celery for automating billions of tasks. It details Zapier's tech stack, asynchronous processing, scalability strategies, and how they handle task execution and history tracking, using technologies like GraphQL, Next.js, AWS S3, Kafka, and Elasticsearch for efficiency and scalability.
featured in #493
1.5+ Million PDFs In 25 minutes
- Sarat Chandra, Karan Sharma tl;dr: “In this blog post, we describe our journey of building an architecture from scratch which now enables us to process, generate, digitally sign, and e-mail out 1.5+ million PDF contract notes in about 25 minutes, incurring only negligible costs. We self-host all elements of this architecture relying on raw EC2 instances for compute and S3 for ephemeral storage. In addition, the concepts used for orchestration of this particular workflow can now be used for orchestrating many different kinds of distributed jobs within our infrastructure.”
featured in #492
Scaling ChatGPT: Five Real-World Engineering Challenges
- Gergely Orosz, Evan Morikawa tl;dr: An interview with Evan Morikawa, who led the OpenAI Applied Engineering team as ChatGPT launched and scaled. Evan reveals the five engineering challenges along with lessons learned. The challenges are: (1) KV cache and GPU RAM, (2) optimizing batch size, (3) finding the right metrics to measure, (4) finding GPUs wherever they are, and (5) the inability to autoscale.
featured in #491
Ledger: Stripe’s System For Tracking And Validating Money Movement
- Ilya Ganelin tl;dr: “Ledger models internal data-producing systems with common patterns, and it relies on proactive alerting to surface issues and proposed solutions. Each day, Ledger sees five billion events and 99.99% of our dollar volume is fully ingested and verified within four days. Of that activity, 99.999% is monitored, categorized, and triaged through rich investigative tooling — while the remaining long-tail is reliably handled through manual analysis.” This post shares technical details on how Stripe built this money movement tracking system, and how teams interact with the data quality metrics that underlie our global payments network.
featured in #490