/Scale

How To Survive Your Project's First 100,000 Lines

- Evan Ovadia
tl;dr: The Vale compiler has hit its 100,000th line of code, and this article explains how the codebase was kept from collapsing. “Some of these software engineering techniques came from my time at Google, though ironically most came from my work on the Vale compiler and game development so some of these might be surprising to my engineer comrades out there.” Techniques range from determinism, to testing, to type-system techniques, to general architectural best practices.

featured in #411


How eBay Modernized The Most Important Page On Our Platform

tl;dr: “eBay’s View Item page lives at the center of our e-commerce platform. Our customers load this page over 250 million times each day, and stringent budgets on site speed and availability guarantee the quality of their experience. And yet, this page had its last intentional rewrite ten years ago.”

featured in #410


Tracing Notifications

- Suman Karumuri, George Luong
tl;dr: The engineering team at Slack embarked on a project to make notification issues easier to debug. “Debugging notification issues within our systems was difficult because each system had a different logging pipeline and data format, making it necessary to look at data with different formats and backends. This process required deep technical expertise and took several days to complete.”

featured in #407


Real-time Messaging

- Sameera Thangudu
tl;dr: From the engineering team at Slack: “we’ll describe the architecture that we use to send real-time messages at scale. We’ll take a closer look at the services that send the chat messages and various events to these online users in real time.”

featured in #406


How Lyft Uses Load Testing To Ensure Reliable Service During Peak Events

- Remco Van Bree
tl;dr: “We have come to realize that load testing in production is a powerful tool to prepare systems for unexpected bursty traffic and peak events. We’ll explore why Lyft needed a custom performance testing framework that worked in production, how we built a cross-functional solution, and how we’ve continued to improve this testing platform.”

featured in #404


Twitter's Recommendation Algorithm

tl;dr: Twitter's recommendation algorithm distills the roughly 500 million tweets posted daily down to a handful of top tweets that show up on your device, selected specifically for you. This blog post introduces how the algorithm works.

featured in #403


Pull The Andon Cord

- Taylor Pearson
tl;dr: The Andon Cord was a rope hung in Toyota factories that could instantly stop all work on the assembly line, and workers were encouraged to pull it whenever they saw an issue. Once it was pulled, a manager came down to look at the issue, but the worker who pulled the rope was the one who came up with the solution. This process had two benefits: (1) it made workers feel trusted and part of the company’s output, and (2) it dramatically increased quality, since workers had a lot of tacit knowledge that managers didn’t.

featured in #401


Automating Safe, Hands-Off Deployments

- Clare Liguori
tl;dr: “In this article, we walk through the steps a code change goes through in a pipeline at Amazon on its way to production. A typical continuous delivery pipeline has four major phases - source, build, test, and production. We’ll dive into the details of what happens in each of these pipeline phases for a typical AWS service, and provide you with an example of how a typical AWS service team might set up one of their pipelines.”

featured in #401
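As a rough illustration of the four phases named in the quote, here is a minimal sketch in which a change reaches production only after passing source, build, and test in order, and any failure stops it from going further. It assumes each phase can be reduced to a single pass/fail check; `Phase`, `promote`, `PIPELINE`, and the checks themselves are hypothetical and are not Amazon's actual tooling.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Phase:
    name: str
    # Hypothetical check: returns True if the change may advance past this phase.
    check: Callable[[str], bool]


def promote(change_id: str, phases: List[Phase]) -> List[str]:
    """Run a change through the phases in order, halting at the first failure.

    Returns the names of the phases the change passed.
    """
    passed: List[str] = []
    for phase in phases:
        if not phase.check(change_id):
            break  # a failed phase blocks every later phase, including production
        passed.append(phase.name)
    return passed


# Hypothetical stand-ins for the four phases; real phases run reviews, builds, tests, rollouts.
PIPELINE = [
    Phase("source", lambda change: True),
    Phase("build", lambda change: True),
    Phase("test", lambda change: "broken" not in change),  # stand-in for integration tests
    Phase("production", lambda change: True),
]

if __name__ == "__main__":
    print(promote("change-101", PIPELINE))         # ['source', 'build', 'test', 'production']
    print(promote("change-102-broken", PIPELINE))  # ['source', 'build']
```

The key property is ordering: production is simply the last phase, so nothing reaches it without first clearing every earlier gate.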


Database Sharding Explained

- Mahdi Yusuf
tl;dr: Mahdi explains why we shard data stores, when sharding is worth using, how it can be set up, and the options you have before resorting to it.

featured in #401
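To make "how it can be set up" concrete, here is a minimal sketch of one common approach, hash-based sharding, where a stable hash of each key picks which shard stores it. The in-memory dicts standing in for database instances, and the names `ShardRouter` and `ShardedStore`, are hypothetical illustrations, not taken from the article.

```python
import hashlib
from typing import Any, Dict, List, Optional


class ShardRouter:
    """Maps each key to one of a fixed list of shards using hash-based (modulo) sharding."""

    def __init__(self, shard_names: List[str]) -> None:
        if not shard_names:
            raise ValueError("at least one shard is required")
        self.shard_names = list(shard_names)

    def shard_for(self, key: str) -> str:
        # Use a stable hash: Python's built-in hash() is salted per process for str,
        # so it could route the same key to a different shard after a restart.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self.shard_names)
        return self.shard_names[index]


class ShardedStore:
    """Hypothetical in-memory stand-in for N database shards behind one router."""

    def __init__(self, shard_names: List[str]) -> None:
        self.router = ShardRouter(shard_names)
        # One dict per shard stands in for a separate database instance.
        self.shards: Dict[str, Dict[str, Any]] = {name: {} for name in shard_names}

    def put(self, key: str, value: Any) -> None:
        self.shards[self.router.shard_for(key)][key] = value

    def get(self, key: str) -> Optional[Any]:
        return self.shards[self.router.shard_for(key)].get(key)


if __name__ == "__main__":
    store = ShardedStore(["shard-0", "shard-1", "shard-2"])
    for user_id in range(10):
        store.put(f"user:{user_id}", {"id": user_id})
    print(store.get("user:7"))  # {'id': 7}
    print({name: len(rows) for name, rows in store.shards.items()})
```

One design trade-off worth knowing: with plain modulo routing, changing the number of shards remaps most keys, which is why schemes such as consistent hashing or directory-based lookup are often used when shards need to be added over time.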


How Discord Stores Trillions Of Messages

- Bo Ingram
tl;dr: “Our Cassandra cluster exhibited serious performance issues that required increasing amounts of effort to just maintain, not improve.” Bo discusses the troubles with Cassandra and the migration to ScyllaDB, a Cassandra-compatible database written in C++.

featured in #396