<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>feedmeAI</title>
    <link>https://feedme.dev</link>
    <description>A weekly AI &amp; LLM news magazine — curated papers, tools, and industry updates for ML engineers.</description>
    <language>en-us</language>
    <item>
      <title>When Benchmarks Become the Bug</title>
      <link>https://feedme.dev/digest/2026-w16</link>
      <guid>https://feedme.dev/digest/2026-w16</guid>
      <pubDate>Sun, 19 Apr 2026 00:00:00 GMT</pubDate>
      <description>Berkeley researchers achieved near-perfect scores on every major AI agent benchmark without solving a single task. They found no breakthrough in reasoning; they exploited how the scores are computed. SWE-bench, WebArena, Terminal-Bench: all compromised through vulnerabilities ranging from trivial to sophisticated. The implications cut deeper than academic embarrassment: the entire agent development cycle relies on these benchmarks as ground truth, and they&apos;re fundamentally broken.

This revelation arrives as the industry faces another reality check. Claude Opus 4.7 launched with impressive gains but silently inflated token costs by 35-45% through tokenizer changes, sending production bills soaring overnight. Meanwhile, the open-weights world delivered what many thought impossible: GLM-5.1 reaching 94.6% of Claude&apos;s coding performance at $3/month versus Claude&apos;s $100+. The performance moat that justified closed-source pricing has effectively collapsed.

Behind these headlines, Notion offered rare transparency about what it really takes to ship production agents. After rebuilding Custom Agents four to five times since 2022, they&apos;ve learned that swimming upstream against model limitations kills velocity, but so does waiting for perfect capabilities. Their solution: build the infrastructure early, distribute tool ownership across teams, and accept that most prototypes will be deleted. It&apos;s the kind of operational wisdom that only comes from years of scar tissue, and it suggests the gap between demo and deployment remains wider than most teams appreciate.</description>
    </item>
  </channel>
</rss>