Linear Digressions
Benchmarking AI Models
- Author: Various
- Narrator: Various
- Publisher: Podcast
- Duration: 0:29:55
Information:
Synopsis
How do you know if a new AI model is actually better than the last one? Answering that question turns out to be a lot messier than it sounds. This week we dig into the world of LLM benchmarks, the standardized tests used to compare models, exploring two canonical examples: MMLU, a roughly 14,000-question multiple-choice gauntlet spanning subjects such as medicine, law, and philosophy, and SWE-bench, which throws real GitHub bugs at models to see if they can fix them. Along the way: Goodhart's Law, data contamination, canary strings, and why acing a test isn't always the same as being smart.