/Yury Zemlyanskiy

You Make Your Evals, Then Your Evals Make You. tl;dr: The post introduces AugmentQA, a benchmark for evaluating code retrieval systems using real-world software development scenarios rather than synthetic problems. AugmentQA uses codebases, developer questions, and keyword-based evaluation outperforming open-source models that excel on synthetic benchmarks but struggle with realistic tasks.

featured in #603