Data Quality Score: The Next Chapter Of Data Quality At Airbnb
- Clark Wright tl;dr: "With 1.4 billion cumulative guest arrivals as of year-end 2022, Airbnb’s growth pushed us to an inflection point where diminishing data quality began to hinder our data practitioners. Weekly metric reports were difficult to land on time. Seemingly basic metrics like “Active Listings” relied on a web of upstream dependencies. Conducting meaningful data work required significant institutional knowledge to overcome hidden caveats in our data." Clark discusses the implementation of a Data Quality Score.
featured in #471
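The score rolls several quality dimensions up into a single number per dataset. A minimal sketch of that idea, assuming a weighted-dimension scheme; the dimension names echo those discussed in the post, but the weights and scoring below are invented for illustration and are not Airbnb's actual rubric:

```python
# Illustrative weighted data quality score. Dimension names echo the post;
# the weights and per-dimension scoring are hypothetical.
DIMENSION_WEIGHTS = {
    "accuracy": 0.35,
    "reliability": 0.35,  # e.g. landed on time, SLAs met
    "stewardship": 0.15,  # e.g. has an owner, documented
    "usability": 0.15,    # e.g. discoverable, well-described columns
}

def data_quality_score(dimension_scores: dict[str, float]) -> float:
    """Combine per-dimension scores (0-100) into a single 0-100 score."""
    return sum(
        DIMENSION_WEIGHTS[dim] * dimension_scores.get(dim, 0.0)
        for dim in DIMENSION_WEIGHTS
    )

print(data_quality_score(
    {"accuracy": 90, "reliability": 75, "stewardship": 100, "usability": 60}
))  # 81.75
```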
Clerk Webhooks: Data Sync with Convex
- Dev Agrawal tl;dr: “Composing an application out of multiple data sources can be challenging, and while Clerk’s Backend APIs work great for most use cases, Webhooks offer the next level of integration that enables you to take full advantage of your existing stack. This post will cover how to synchronize user data from Clerk into your own backend using Webhooks.”
featured in #470
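Clerk delivers its webhooks through Svix, so the handler's first job is signature verification before any data sync. A minimal sketch using Flask and the `svix` Python package (the post itself targets Convex); `CLERK_WEBHOOK_SECRET` and `upsert_user` are placeholders for your own stack:

```python
# Minimal Clerk webhook endpoint sketch, assuming Flask and the `svix`
# package. Clerk signs deliveries via Svix, so verification reads the
# svix-id / svix-timestamp / svix-signature headers.
import os
from flask import Flask, request, abort
from svix.webhooks import Webhook, WebhookVerificationError

app = Flask(__name__)

def upsert_user(data: dict) -> None:
    """Placeholder: write the Clerk user record into your own datastore."""
    print("syncing user", data.get("id"))

@app.post("/webhooks/clerk")
def clerk_webhook():
    wh = Webhook(os.environ["CLERK_WEBHOOK_SECRET"])
    try:
        # verify() checks the signature and returns the parsed JSON payload
        event = wh.verify(request.get_data(), dict(request.headers))
    except WebhookVerificationError:
        abort(400)

    if event["type"] == "user.created":
        upsert_user(event["data"])
    return "", 204
```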
The Ultimate Guide To Modernizing Your Data Import Solution
tl;dr: When should you invest in modernizing your tech stack to drive long-term success? Discover crucial business milestones that signal the need for a tech upgrade and learn how to evaluate alternative solutions.
featured in #465
From Big Data To Better Data: Ensuring Data Quality With Verity
- Michael McPhillips tl;dr: Michael emphasizes that "data quality is paramount for accurate insights" and introduces Lyft’s in-house data quality platform, Verity, whose flow starts with the following steps: (1) Data Profiling: incoming data is scrutinized for its structure, schema, and content, allowing Verity to identify potential anomalies and inconsistencies. (2) Customizable Rules Engine: lets data experts define data quality rules tailored to their unique needs, from data format validations to more intricate domain-specific checks. (3) Automated Quality Checks: once the rules are set, they are applied to incoming data streams, scanning each data point for discrepancies.
featured in #457
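Verity's actual interfaces aren't shown here, but the customizable-rules idea in step (2) can be sketched as named predicates applied to each incoming record; the rule set and record shape below are invented for illustration:

```python
# Illustrative rules-engine sketch in the spirit of step (2); the API is
# invented for this example and is not Verity's actual interface.
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Rule:
    name: str
    check: Callable[[dict[str, Any]], bool]  # True means the record passes

rules = [
    Rule("ride_id is present", lambda r: r.get("ride_id") is not None),
    Rule("fare is non-negative", lambda r: r.get("fare", 0) >= 0),
    # Domain-specific check: a ride can't end before it starts.
    Rule("ends after start", lambda r: r["ended_at"] >= r["started_at"]),
]

def run_checks(record: dict[str, Any]) -> list[str]:
    """Apply every rule to a record; return the names of failed rules."""
    return [rule.name for rule in rules if not rule.check(record)]

failures = run_checks(
    {"ride_id": "r-1", "fare": -3.5, "started_at": 10, "ended_at": 20}
)
print(failures)  # ['fare is non-negative']
```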
Best Practices For Collecting And Querying Data From Multiple Sources
- Zoe Steinkamp tl;dr: In a data-centric era, efficiently collecting and querying data from diverse sources is paramount. Zoe emphasizes best practices in data collection, such as optimizing ingestion pipelines and advanced querying. With data now streaming in from varied sources like IoT devices and cloud services, storing everything in a single database is no longer practical; strategies like effective data modeling and understanding your data sources are vital. Tools like InfluxDB, a time series database, and Pandas, a Python library, facilitate data management and analysis. Leveraging multiple data sources optimizes cost, efficiency, and user experience.
featured in #449
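A minimal sketch of the multi-source pattern, assuming InfluxDB 2.x with the `influxdb-client` package; the connection details, bucket, tag names, and CSV path are all placeholders:

```python
# Query a time series source into Pandas, then join it with a second source.
import pandas as pd
from influxdb_client import InfluxDBClient

client = InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org")

flux = '''
from(bucket: "sensors")
  |> range(start: -1h)
  |> filter(fn: (r) => r._measurement == "temperature")
'''
# query_data_frame() returns the Flux result as a Pandas DataFrame
readings = client.query_api().query_data_frame(flux)

# Second source: device metadata from a relational export (placeholder path)
devices = pd.read_csv("devices.csv")  # columns: device_id, location, ...

# Join the streams on a shared key and analyze in one frame
enriched = readings.merge(devices, left_on="device", right_on="device_id")
print(enriched.groupby("location")["_value"].mean())
```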
You Don’t Have To Sacrifice Streaming Data Performance To Cut Cloud Costs
tl;dr: Redpanda is faster and more efficient than Apache Kafka… but how much faster exactly? We ran 200+ hours of benchmarks to find out how both platforms perform for various workloads and hardware configurations. Here’s our breakdown on how Redpanda achieves 10x the performance while cutting cloud spend by over $500k.
featured in #407 and #405
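One reason like-for-like benchmarks are possible at all: Redpanda speaks the Kafka wire protocol, so the same client code drives either platform. A minimal producer sketch using `kafka-python`, assuming a broker (Kafka or Redpanda) listening on localhost:9092:

```python
# The same producer code works against Kafka or Redpanda unchanged.
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    acks="all",   # durability setting that benchmarks commonly vary
    linger_ms=5,  # small batching window; another common knob
)

for i in range(1_000):
    producer.send("events", key=str(i).encode(), value=b"payload")

producer.flush()  # block until all buffered records are acknowledged
producer.close()
```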
Balancing Quality And Coverage With Our Data Validation Framework
- Alexey Sanko tl;dr: Dropbox had a data validation problem, and this post discusses how they implemented a new quality check system in their big data pipelines that achieves a “balance of simplicity and coverage - providing good quality data, without being needlessly difficult or expensive to maintain.”
featured in #397
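Dropbox's framework itself isn't shown here, but the "simple checks, broad coverage" idea can be sketched as a handful of cheap assertions on a pipeline's output; the thresholds and column names below are invented for illustration:

```python
# Illustrative lightweight pipeline checks: cheap to run, broad in coverage.
import pandas as pd

def validate(df: pd.DataFrame, prev_row_count: int) -> list[str]:
    problems = []
    # Coverage check: did this run produce roughly as much data as the last?
    if len(df) < 0.5 * prev_row_count:
        problems.append(f"row count dropped: {len(df)} vs {prev_row_count}")
    # Quality check: key columns should be mostly non-null
    for col in ("user_id", "event_ts"):
        null_ratio = df[col].isna().mean()
        if null_ratio > 0.01:
            problems.append(f"{col} is {null_ratio:.1%} null")
    return problems

df = pd.DataFrame({"user_id": [1, 2, None], "event_ts": [10, 11, 12]})
print(validate(df, prev_row_count=4))  # ['user_id is 33.3% null']
```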
SQL Should Be Your Default Choice For Data Engineering Pipelines
- Robin Linacre tl;dr: "SQL should be the first option considered for new data engineering work. It’s robust, fast, future-proof and testable. With a bit of care, it’s clear and readable. A new SQL engine - DuckDB - makes SQL competitive with other high performance dataframe libraries, making SQL a good candidate for data of all sizes."
featured in #387
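A quick taste of why DuckDB changes the calculus: it runs SQL directly over Pandas DataFrames and files, with no load step. A minimal sketch; the parquet path in the final comment is a placeholder:

```python
# DuckDB queries in-scope DataFrames by variable name and scans files directly.
import duckdb
import pandas as pd

orders = pd.DataFrame({"customer": ["a", "a", "b"], "amount": [10.0, 5.0, 7.5]})

# Query the Pandas DataFrame `orders` directly in SQL
print(duckdb.sql(
    "SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer"
))

# Files work the same way, no import required (placeholder path):
# duckdb.sql("SELECT COUNT(*) FROM 'events/*.parquet'")
```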