Data Quality Score: The Next Chapter Of Data Quality At Airbnb
- Clark Wright tl;dr: "With 1.4 billion cumulative guest arrivals as of year-end 2022, Airbnb’s growth pushed us to an inflection point where diminishing data quality began to hinder our data practitioners. Weekly metric reports were difficult to land on time. Seemingly basic metrics like “Active Listings” relied on a web of upstream dependencies. Conducting meaningful data work required significant institutional knowledge to overcome hidden caveats in our data." Clark discusses the implementation of a Data Quality Score.
featured in #471
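The score rolls several quality dimensions up into a single number per dataset. A minimal sketch of that idea, assuming a weighted-dimension scheme; the dimension names echo those discussed in the post, but the weights and scoring below are invented for illustration and are not Airbnb's actual rubric:

```python
# Illustrative weighted data quality score. Dimension names echo the post;
# the weights and per-dimension scoring are hypothetical.
DIMENSION_WEIGHTS = {
    "accuracy": 0.35,
    "reliability": 0.35,  # e.g. landed on time, SLAs met
    "stewardship": 0.15,  # e.g. has an owner, documented
    "usability": 0.15,    # e.g. discoverable, well-described columns
}

def data_quality_score(dimension_scores: dict[str, float]) -> float:
    """Combine per-dimension scores (0-100) into a single 0-100 score."""
    return sum(
        DIMENSION_WEIGHTS[dim] * dimension_scores.get(dim, 0.0)
        for dim in DIMENSION_WEIGHTS
    )

print(data_quality_score(
    {"accuracy": 90, "reliability": 75, "stewardship": 100, "usability": 60}
))  # 81.75
```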
Clerk Webhooks: Data Sync with Convex
- Dev Agrawal tl;dr: “Composing an application out of multiple data sources can be challenging, and while Clerk’s Backend APIs work great for most use cases, Webhooks offer the next level of integration that enables you to take full advantage of your existing stack. This post will cover how to synchronize user data from Clerk into your own backend using Webhooks.”
featured in #470
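Clerk delivers its webhooks through Svix, so the handler's first job is signature verification before any data sync. A minimal sketch using Flask and the `svix` Python package (the post itself targets Convex); `CLERK_WEBHOOK_SECRET` and `upsert_user` are placeholders for your own stack:

```python
# Minimal Clerk webhook endpoint sketch, assuming Flask and the `svix`
# package. Clerk signs deliveries via Svix, so verification reads the
# svix-id / svix-timestamp / svix-signature headers.
import os
from flask import Flask, request, abort
from svix.webhooks import Webhook, WebhookVerificationError

app = Flask(__name__)

def upsert_user(data: dict) -> None:
    """Placeholder: write the Clerk user record into your own datastore."""
    print("syncing user", data.get("id"))

@app.post("/webhooks/clerk")
def clerk_webhook():
    wh = Webhook(os.environ["CLERK_WEBHOOK_SECRET"])
    try:
        # verify() checks the signature and returns the parsed JSON payload
        event = wh.verify(request.get_data(), dict(request.headers))
    except WebhookVerificationError:
        abort(400)

    if event["type"] == "user.created":
        upsert_user(event["data"])
    return "", 204
```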
The Ultimate Guide To Modernizing Your Data Import Solution
tl;dr: When should you invest in modernizing your tech stack to drive long-term success? Discover crucial business milestones that signal the need for a tech upgrade and learn how to evaluate alternative solutions.
featured in #465
From Big Data To Better Data: Ensuring Data Quality With Verity
- Michael McPhillips tl;dr: Michael emphasizes that "data quality is paramount for accurate insights" and introduces Lyft’s in-house data quality platform, Verity, whose flow starts with the following steps: (1) Data Profiling: incoming data is scrutinized for its structure, schema, and content, allowing Verity to identify potential anomalies and inconsistencies. (2) Customizable Rules Engine: lets data experts define data quality rules tailored to their unique needs, from data format validations to more intricate domain-specific checks. (3) Automated Quality Checks: once the rules are set, they are applied to incoming data streams, scanning each data point for discrepancies.
featured in #457
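Verity's actual interfaces aren't shown here, but the customizable-rules idea in step (2) can be sketched as named predicates applied to each incoming record; the rule set and record shape below are invented for illustration:

```python
# Illustrative rules-engine sketch in the spirit of step (2); the API is
# invented for this example and is not Verity's actual interface.
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Rule:
    name: str
    check: Callable[[dict[str, Any]], bool]  # True means the record passes

rules = [
    Rule("ride_id is present", lambda r: r.get("ride_id") is not None),
    Rule("fare is non-negative", lambda r: r.get("fare", 0) >= 0),
    # Domain-specific check: a ride can't end before it starts.
    Rule("ends after start", lambda r: r["ended_at"] >= r["started_at"]),
]

def run_checks(record: dict[str, Any]) -> list[str]:
    """Apply every rule to a record; return the names of failed rules."""
    return [rule.name for rule in rules if not rule.check(record)]

failures = run_checks(
    {"ride_id": "r-1", "fare": -3.5, "started_at": 10, "ended_at": 20}
)
print(failures)  # ['fare is non-negative']
```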
Best Practices For Collecting And Querying Data From Multiple Sources
- Zoe Steinkamp tl;dr: In a data-centric era, efficiently collecting and querying data from diverse sources is paramount. Zoe emphasizes best practices in data collection, such as optimizing ingestion pipelines and advanced querying. With data now streaming in from varied sources like IoT devices and cloud services, storing everything in a single database is no longer practical; strategies like effective data modeling and understanding your data sources are vital. Tools like InfluxDB, a time series database, and Pandas, a Python library, facilitate data management and analysis. Leveraging multiple data sources optimizes cost, efficiency, and user experience.
featured in #449
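A minimal sketch of the multi-source pattern, assuming InfluxDB 2.x with the `influxdb-client` package; the connection details, bucket, tag names, and CSV path are all placeholders:

```python
# Query a time series source into Pandas, then join it with a second source.
import pandas as pd
from influxdb_client import InfluxDBClient

client = InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org")

flux = '''
from(bucket: "sensors")
  |> range(start: -1h)
  |> filter(fn: (r) => r._measurement == "temperature")
'''
# query_data_frame() returns the Flux result as a Pandas DataFrame
readings = client.query_api().query_data_frame(flux)

# Second source: device metadata from a relational export (placeholder path)
devices = pd.read_csv("devices.csv")  # columns: device_id, location, ...

# Join the streams on a shared key and analyze in one frame
enriched = readings.merge(devices, left_on="device", right_on="device_id")
print(enriched.groupby("location")["_value"].mean())
```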
You Don’t Have To Sacrifice Streaming Data Performance To Cut Cloud Costs
tl;dr: Redpanda is faster and more efficient than Apache Kafka… but how much faster exactly? We ran 200+ hours of benchmarks to find out how both platforms perform for various workloads and hardware configurations. Here’s our breakdown on how Redpanda achieves 10x the performance while cutting cloud spend by over $500k.
featured in #407 and #405
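One reason like-for-like benchmarks are possible at all: Redpanda speaks the Kafka wire protocol, so the same client code drives either platform. A minimal producer sketch using `kafka-python`, assuming a broker (Kafka or Redpanda) listening on localhost:9092:

```python
# The same producer code works against Kafka or Redpanda unchanged.
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    acks="all",   # durability setting that benchmarks commonly vary
    linger_ms=5,  # small batching window; another common knob
)

for i in range(1_000):
    producer.send("events", key=str(i).encode(), value=b"payload")

producer.flush()  # block until all buffered records are acknowledged
producer.close()
```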
Balancing Quality And Coverage With Our Data Validation Framework
- Alexey Sanko tl;dr: Dropbox had a data validation problem, and this post discusses how they implemented a new quality check system in their big data pipelines that achieves a “balance of simplicity and coverage - providing good quality data, without being needlessly difficult or expensive to maintain.”
featured in #397
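Dropbox's framework itself isn't shown here, but the "simple checks, broad coverage" idea can be sketched as a handful of cheap assertions on a pipeline's output; the thresholds and column names below are invented for illustration:

```python
# Illustrative lightweight pipeline checks: cheap to run, broad in coverage.
import pandas as pd

def validate(df: pd.DataFrame, prev_row_count: int) -> list[str]:
    problems = []
    # Coverage check: did this run produce roughly as much data as the last?
    if len(df) < 0.5 * prev_row_count:
        problems.append(f"row count dropped: {len(df)} vs {prev_row_count}")
    # Quality check: key columns should be mostly non-null
    for col in ("user_id", "event_ts"):
        null_ratio = df[col].isna().mean()
        if null_ratio > 0.01:
            problems.append(f"{col} is {null_ratio:.1%} null")
    return problems

df = pd.DataFrame({"user_id": [1, 2, None], "event_ts": [10, 11, 12]})
print(validate(df, prev_row_count=4))  # ['user_id is 33.3% null']
```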
SQL Should Be Your Default Choice For Data Engineering Pipelines
- Robin Linacre tl;dr: "SQL should be the first option considered for new data engineering work. It’s robust, fast, future-proof and testable. With a bit of care, it’s clear and readable. A new SQL engine - DuckDB - makes SQL competitive with other high performance dataframe libraries, making SQL a good candidate for data of all sizes."
featured in #387
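A quick taste of why DuckDB changes the calculus: it runs SQL directly over Pandas DataFrames and files, with no load step. A minimal sketch; the parquet path in the final comment is a placeholder:

```python
# DuckDB queries in-scope DataFrames by variable name and scans files directly.
import duckdb
import pandas as pd

orders = pd.DataFrame({"customer": ["a", "a", "b"], "amount": [10.0, 5.0, 7.5]})

# Query the Pandas DataFrame `orders` directly in SQL
print(duckdb.sql(
    "SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer"
))

# Files work the same way, no import required (placeholder path):
# duckdb.sql("SELECT COUNT(*) FROM 'events/*.parquet'")
```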