You Don’t Have To Sacrifice Streaming Data Performance To Cut Cloud Costs
tl;dr: Redpanda is faster and more efficient than Apache Kafka… but how much faster exactly? We ran 200+ hours of benchmarks to find out how both platforms perform for various workloads and hardware configurations. Here’s our breakdown on how Redpanda achieves 10x the performance while cutting cloud spend by over $500k.
featured in #407
Balancing Quality And Coverage With Our Data Validation Framework
- Alexey Sanko tl;dr: Dropbox had a data validation problem, and this post discusses how they implemented a new quality check system in their big data pipelines that achieves a “balance of simplicity and coverage - providing good quality data, without being needlessly difficult or expensive to maintain.”
featured in #397
SQL Should Be Your Default Choice For Data Engineering Pipelines
- Robin Linacre tl;dr: "SQL should be the first option considered for new data engineering work. It’s robust, fast, future-proof and testable. With a bit of care, it’s clear and readable. A new SQL engine - DuckDB - makes SQL competitive with other high performance dataframe libraries, making SQL a good candidate for data of all sizes."
featured in #387
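The testability claim above can be sketched with a toy pipeline. This example uses Python's built-in sqlite3 purely so it runs anywhere (DuckDB's Python API works similarly via `duckdb.sql(...)`); the table and column names are illustrative, not from the article.

```python
import sqlite3

# Illustrative SQL transform: deduplicate raw orders, then total revenue per day.
# Because the transform is a plain SQL string, it can be unit-tested against a
# small in-memory fixture, which is the article's point about testable SQL.
PIPELINE_SQL = """
SELECT order_date, SUM(amount) AS revenue
FROM (SELECT DISTINCT order_id, order_date, amount FROM raw_orders)
GROUP BY order_date
ORDER BY order_date
"""

def run_pipeline(rows):
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE raw_orders (order_id INT, order_date TEXT, amount REAL)")
    con.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", rows)
    return con.execute(PIPELINE_SQL).fetchall()

# A duplicated order_id is counted only once.
fixture = [(1, "2023-01-01", 10.0), (1, "2023-01-01", 10.0), (2, "2023-01-02", 5.0)]
print(run_pipeline(fixture))  # [('2023-01-01', 10.0), ('2023-01-02', 5.0)]
```

Swapping the sqlite3 connection for a DuckDB one would leave the SQL itself unchanged, which is part of why SQL pipelines age well.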
How DoorDash Secures Data Transfer Between Cloud And On-Premise Data Centers
- Roger Zeng tl;dr: "In this post, we will discuss how we established a secure, stable, and resilient private network connection between DoorDash microservices and our vendor’s on-premise data centers by leveraging the network facilities from our cloud provider, AWS."
featured in #384
McDonald’s Event-Driven Architecture: The Data Journey And How It Works
- Vamshi Krishna Komuravalli, Damian Sullivan tl;dr: Here is a typical data flow of how events are reliably produced and consumed from the platform: (1) Initially, an event schema is defined and registered in the schema registry. (2) Applications that need to produce events leverage a producer SDK to publish events. (3) When an application starts up, the event schema is cached in the producing application for high performance. The authors continue to discuss how data flows through the system.
featured in #380
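The three steps above can be sketched in miniature. The registry, SDK class, and event names below are hypothetical stand-ins for illustration, not McDonald's actual code.

```python
import json

# Step 1 (hypothetical): schemas are registered centrally, keyed by topic.
REGISTRY = {"order.created": {"required": ["order_id", "total"]}}

class Producer:
    """Toy producer SDK (step 2) that caches schemas at startup (step 3)."""

    def __init__(self, topics):
        # Fetch and cache schemas once, so the hot publish path
        # never needs a registry round trip.
        self._schemas = {t: REGISTRY[t] for t in topics}
        self.published = []

    def publish(self, topic, event):
        schema = self._schemas[topic]  # cache hit, no network call
        missing = [f for f in schema["required"] if f not in event]
        if missing:
            raise ValueError(f"event missing fields: {missing}")
        self.published.append((topic, json.dumps(event)))

p = Producer(["order.created"])
p.publish("order.created", {"order_id": 42, "total": 9.99})
print(len(p.published))  # 1
```

Validating against the cached schema before publishing is what keeps malformed events out of the stream without paying a per-message registry lookup.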
Reverse Engineering TikTok's VM Obfuscation (Part 1)
tl;dr: "The platform has implemented various methods to make it difficult for reverse-engineers to understand exactly what data is being collected and how it is being used. Analyzing the call stack of a request made on TikTok can begin to paint the picture for us."
featured in #379
The Difficult Life Of The Data Lead
- Mikkel Dengsøe tl;dr: "My take on what’s the most common root cause for the strain on data managers, is that it’s most often with stakeholders. They are not deliberately being difficult (I hope) and often have good intentions to push for their own business goals. But many stakeholders don’t know how to work with data people. In high-growth companies you often have stakeholders coming from all kinds of backgrounds." Mikkel elaborates in this post.
featured in #353
Stop Aggregating Away The Signal In Your Data
- Zan Armstrong tl;dr: "Aggregation is the standard best practice for analyzing time series data, but it can create problems by stripping away crucial context so that you’re not even aware of how much potential insight you’ve lost. In this article, I’ll start by discussing how aggregation can be problematic, before walking through three specific alternatives to aggregation with before / after examples."
featured in #339