🍡 feedmeAI
← All topics
Benchmarks 8 items

Everything Benchmarks

💬 Reddit Apr 22

Qwen3.6-35B becomes competitive with cloud models when paired with the right agent

Pairing Qwen3.6-35B with the 'little-coder' agent scaffold achieves 78.7% on the Polyglot coding benchmark, landing in the public top 10 and competitive with leading cloud models. The same scaffold previously lifted a 9B Qwen model from 19.11% to 45.56%, suggesting a significant portion of the local-vs-cloud performance gap is attributable to scaffold/harness mismatch rather than model capability alone.

📑 arXiv Apr 22

Preference Leakage: A Contamination Problem in LLM-as-a-Judge (ICLR 2026)

Identifies 'preference leakage': when the same LLM generates synthetic training data and serves as the judge, it systematically inflates scores for outputs matching its own generation style, biasing leaderboard rankings even when models perform similarly. Demonstrated empirically across several evaluation pipelines. A concrete warning against self-referential LLM-as-a-judge setups.

🤗 Hugging Face Apr 22

SkillLearnBench: Benchmarking Continual Learning Methods for Agent Skill Generation on Real-World Tasks

SkillLearnBench is the first benchmark for continual skill learning in LLM agents, covering 20 verified tasks across 15 sub-domains with evaluation at three levels: skill quality, execution trajectory, and task outcome. Tested methods include one-shot learning, self/teacher feedback, and skill-creator approaches; all improve over the no-skill baseline but none achieves consistent gains across domains. Highlights that automatic skill acquisition for agents remains an unsolved problem despite recent progress.

💬 Reddit Apr 16
⭐ Editor's Pick

Opus 4.7 is 50% more expensive with context regression?!

User benchmarks show Claude Opus 4.7 scoring 59.2% vs Opus 4.6's 91.9% on the MRCR v2 8-needle 256K context benchmark — a sharp context retention regression. Compounding the issue, a tokenizer change reportedly causes Opus 4.7 to consume ~1.35x more tokens than Opus 4.6 and ~2x more than competing proprietary models, effectively raising costs ~50% for equivalent workloads. If the benchmark numbers hold, this is a meaningful quality-cost tradeoff moving in the wrong direction.

🔶 Anthropic Apr 16
⭐ Editor's Pick

Introducing Claude Opus 4.7

Anthropic's official Claude Opus 4.7 GA post confirms same pricing as 4.6, image resolution raised to 2,576px long edge (~3.75 MP, 3× prior), and a new xhigh effort tier. Coding benchmarks: +13% task resolution on internal 93-task harness, 70% on CursorBench (vs. 58%), 98.5% on XBOW visual-acuity (vs. 54.5%). First model shipped with real-time cyber safeguards derived from the restricted Mythos Preview testbed.

📝 Blog Apr 14

r/LocalLLaMA April 2026 community consensus: Qwen 3.5 most recommended family; Qwen3-Coder-Next sweeps local coding

April 2026 r/LocalLLaMA community consensus (143+ posts) names Qwen 3.5 as the most broadly recommended local model family, with Qwen3-Coder-Next as the near-unanimous pick for coding. MiniMax M2.5/M2.7 surface as the go-to for agentic/tool-heavy workloads; Gemma 4 gains traction for general local use; GLM-5/4.7 enters the best-overall conversation.