/Infrastructure

Faster Continuous Integration Builds At Canva

tl;dr: In April 2022, the average time for a PR to pass continuous integration and merge into our main branch was around 80 minutes. We’re now getting our build times below 30 minutes, and as low as 15 minutes. This post shares what we’ve done to improve CI build times in our main code repository, including: (1) finding the best opportunities, (2) experimenting, (3) delivering fast and incrementally, and (4) the importance of everyone’s contributions.

featured in #537


Google Zanzibar For The Rest Of Us

- Greg Sarjeant tl;dr: Google Zanzibar powers authorization for hundreds of Google’s apps, so you might think it’s a great model for your authorization service. But do Zanzibar’s promises of scale, high availability, and strong consistency mean that it’s the right solution for the rest of us? Zanzibar’s defining characteristic is actually centralization, which is a massive tradeoff that’s not practical for most. The Googles of the world can pull it off, but is there a Zanzibar for the rest of us?

featured in #497, #492, #490


(Almost) Every Infrastructure Decision I Endorse Or Regret After 4 Years Running Infrastructure At A Startup

tl;dr: “I’ve led infrastructure at a startup for the past 4 years that has had to scale quickly. From the beginning I made some core decisions that the company has had to stick to, for better or worse, these past four years. This post will list some of the major decisions made and if I endorse them for your startup, or if I regret them and advise you to pick something else.”

featured in #488


Switching Build Systems, Seamlessly

- Patrick Balestra tl;dr: Patrick chronicles Spotify's shift to Bazel. The move was driven by the need for a scalable build system for their growing codebase. The transition, which began in earnest in 2020, involved running two build systems side by side, adapting existing tools, and extensive testing. By 2023, the iOS Spotify app was fully built with Bazel, resulting in significant improvements in build times and developer experience.

featured in #461


Automating Dead Code Cleanup

tl;dr: "SCARF contains a subsystem that automatically identifies dead code through a combination of static, runtime, and application analysis. It leverages this analysis to submit change requests to remove this code from our systems. This automated dead code removal improves the quality of our systems and also unblocks unused data removal in SCARF when the dead code includes references to data assets that prevent automated data cleanup."

featured in #460


How GitHub Indexes Code For Blazing Fast Search & Retrieval

- Shivang Sarawagi tl;dr: “The search engine supports global queries across 200 million repos and indexes code changes in repositories within minutes. The code search index is by far the largest cluster that GitHub runs, comprising 5184 vCPUs, 40TB of RAM, and 1.25PB of backing storage, supporting a query load of 200 requests per second on average and indexing over 53 billion source files.”

featured in #458


Executing Cron Scripts Reliably At Scale

- Claire Adams tl;dr: Claire discusses the challenges of managing and executing cron scripts reliably within large-scale infrastructure. “The Job Queue is an asynchronous compute platform that runs about 9 billion ‘jobs’ or pieces of work per day.” Claire provides insights into techniques such as distributed execution, retries, and monitoring to ensure the dependable execution of cron jobs at scale, highlighting the need for a systematic approach to handle failures effectively.

featured in #457


How Clerk Rolls Infra For Auth & User Management

tl;dr: This article covers the complex infrastructure required to build and operate an authentication system. It contrasts self-hosted authentication, where developers manage the infrastructure, with a hosted authentication solution, where all auth-related responsibilities are delegated to a specialized service. The article details the components and integrations involved, including the use of cloud services and third-party platforms like SendGrid and Twilio. The hosted system ensures secure, scalable, and responsive authentication, with options for developers to bring their own infrastructure or configuration.

featured in #441