Most LLM evals ask 'did it get the right answer?' Contrastive pair testing asks the harder question: 'can it tell two similar cases apart?'
2026-03-31 · 12 min read