Framework evaluates agentic systems across LLM, Memory, Tools, and Environment dimensions using static analysis, dynamic monitoring, and judge-based evaluation to detect policy violations beyond task completion. Based on CloudOps production deployment where success metrics masked compliance failures. Addresses gap in current benchmarks that measure outcomes but not process adherence.
Visual overview of AI industry trends and metrics entering 2026. Provides data-driven perspective on current state of the field.
Benchmark comparing Claude and Gemini on the laden knight's tour problem, a weighted variant requiring optimal pathfinding with accumulating costs. Tests coding agents on combinatorial optimization task combining movement constraints with dynamic cost calculation.
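The benchmark's exact cost model isn't specified in the summary, but "optimal pathfinding with accumulating costs" over knight moves can be sketched with Dijkstra's algorithm. A minimal sketch, assuming a hypothetical per-square entry cost:

```python
import heapq

KNIGHT_MOVES = [(1, 2), (2, 1), (2, -1), (1, -2),
                (-1, -2), (-2, -1), (-2, 1), (-1, 2)]

def cheapest_knight_path(weights, start, goal):
    """Dijkstra over knight moves; cost accrues on each square entered.

    `weights` is an NxN grid of per-square costs (an assumed cost model,
    not necessarily the benchmark's). Returns the minimal accumulated
    cost from `start` to `goal`, or None if unreachable.
    """
    n = len(weights)
    dist = {start: 0}
    heap = [(0, start)]
    while heap:
        d, (r, c) = heapq.heappop(heap)
        if (r, c) == goal:
            return d
        if d > dist.get((r, c), float("inf")):
            continue
        for dr, dc in KNIGHT_MOVES:
            nr, nc = r + dr, c + dc
            if 0 <= nr < n and 0 <= nc < n:
                nd = d + weights[nr][nc]
                if nd < dist.get((nr, nc), float("inf")):
                    dist[(nr, nc)] = nd
                    heapq.heappush(heap, (nd, (nr, nc)))
    return None
```

The full laden tour (visiting every square) layers a combinatorial search on top of this kind of cost accounting, which is what makes it a harder target for coding agents than plain pathfinding.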
Berkeley researchers achieved near-perfect scores on major AI agent benchmarks (SWE-bench, WebArena, FieldWorkArena, Terminal-Bench) without solving tasks, using exploits ranging from trivial to sophisticated. Exposes that evaluations weren't designed to resist systems optimizing for scores rather than actual task completion.
User benchmark comparing Qwen 3.6 35B against Gemma 4 26B on 30k-line codebase with 37 intentional bugs and PDF analysis tasks shows Qwen significantly outperforming across agentic capabilities, coding, image-to-text, instruction following, and reasoning. Both models tested at Q4_K_XL quantization for fair comparison.
GeoSpOT uses Optimal Transport methods with geographic metadata to quantify distribution distances between geospatial domains, addressing out-of-domain generalization challenges in computer vision for geographic data. Provides a principled method to predict when cross-region model adaptation will succeed given uneven global data coverage.
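GeoSpOT's exact formulation isn't given in the summary, but the core idea of an optimal-transport distance between distributions has a simple closed form in one dimension, where transport matches samples in sorted order. A toy sketch (illustrative only; GeoSpOT operates on geospatial feature distributions, not raw 1-D samples):

```python
def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two equal-size 1-D samples.

    For equal-size empirical distributions on the line, optimal transport
    pairs samples in sorted order, so W1 reduces to the mean absolute
    gap between sorted samples.
    """
    assert len(xs) == len(ys)
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)
```

A small distance between two regions' feature distributions would then predict that cross-region adaptation is likely to succeed; a large one flags an out-of-domain gap.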
Mixed precision and floating-point settings cause ~2.4× training time variation in distributed deep learning, but existing predictors ignore precision and incur up to 147.85% MAPE. This work proposes a precision-aware predictor that accounts for mixed precision configurations to accurately forecast distributed training times for resource allocation and scheduling.
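For reference, the error metric quoted above is straightforward to compute. MAPE measures relative prediction error as a percentage, so 147.85% means existing predictors can be off by well over the true training time:

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs(a - p) / abs(a)
                       for a, p in zip(actual, predicted)) / len(actual)
```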
Mind's Eye benchmark evaluates MLLMs on eight visuo-cognitive tasks inspired by human intelligence tests, organized under Abstraction-Relation-Transformation taxonomy. Humans achieve 80% accuracy while top MLLMs remain below 50%, revealing failures in visual attention, pattern induction, and mental transformation—core processes of fluid intelligence.
SocialGrid is an Among Us-inspired benchmark evaluating LLM agents on planning, task execution, and social reasoning in embodied multi-agent settings. Even GPT-OSS-120B achieves below 60% accuracy, with agents stuck in repetitive behaviors—revealing social reasoning remains a bottleneck even with planning assistance.
Exposes a critical keyword shortcut bias in code localization benchmarks where models rely on superficial lexical matching rather than structural reasoning. Introduces KA-LogicQuery, a diagnostic benchmark requiring structural reasoning without naming hints, revealing catastrophic performance drops in state-of-the-art approaches and motivating a neurosymbolic framework combining neural retrieval with symbolic verification.
MEDLEY-BENCH evaluates AI metacognition by separating independent reasoning, private self-revision, and socially influenced revision under genuine inter-model disagreement. Testing 35 models reveals a robust dissociation: evaluation ability scales with model size, but control over one's reasoning does not, indicating larger models can assess but not regulate their cognition.
ReactBench reveals fundamental limitations in MLLMs' structural reasoning by testing them on chemical reaction diagrams with branching paths, converging flows, and cyclic dependencies. Existing models degrade sharply on topological structures despite excelling at individual visual elements, exposing a gap that semantic-focused benchmarks miss.
JFinTEB is the first comprehensive benchmark for Japanese financial text embeddings, covering retrieval and classification tasks including sentiment analysis, document categorization, and economic survey classification. Evaluates diverse embedding models on language-specific and domain-specific financial text processing scenarios.
LLMs show systematic asymmetry between judging pragmatic appropriateness and generating pragmatically appropriate language across three settings. Models that excel at evaluating pragmatic competence often fail to produce similarly competent outputs, revealing misalignment between their listener and speaker capabilities.
UniEditBench provides the first unified benchmark for image and video editing across reconstruction-based and instruction-driven methods, with taxonomies covering 9 image and 8 video operations. Uses distilled MLLMs as cost-effective automatic evaluators that align with human preference, addressing fragmentation in visual editing evaluation.
Comparative evaluation shows Bonsai-8B at 1.125 bpw (782 MB) underperforms Gemma-4-2B at 4.8 bpw (1104 MB) despite only 29% size reduction, questioning the value proposition of extreme quantization. Ternary 1.58-bit variant performed even worse while being 33% larger than Gemma at 1477 MB. Suggests aggressive sub-2-bit quantization may sacrifice too much capability for modest size gains.
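The size figures above follow from simple bits-per-weight arithmetic. A back-of-envelope sketch (ballpark only: it ignores per-block quantization scales, unquantized layers like embeddings, and how bpw is averaged across layers, so real quantized files deviate from this estimate):

```python
def est_size_mb(params_billion, bits_per_weight):
    """Rough weight-file size: params x bpw / 8, in MB (10^6 bytes).

    Ignores per-block scale overhead and mixed per-layer bit widths,
    so treat the result as a ballpark, not an exact file size.
    """
    return params_billion * 1e9 * bits_per_weight / 8 / 1e6
```

For instance, a 2B model at 4.8 bpw comes out near 1200 MB, the same ballpark as the 1104 MB reported for Gemma-4-2B above; the residual gap comes down to effective parameter count and bpw averaging.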
QuantSightBench evaluates LLM quantitative forecasting using prediction intervals over continuous quantities rather than binary or multiple-choice formats. The benchmark demands scale awareness, internal consistency across confidence levels, and calibration over continuous outcomes, addressing a gap in existing reasoning-under-uncertainty evaluations.
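Calibration over continuous outcomes has a concrete operational meaning: a model's stated prediction intervals should contain the realized value at the advertised rate. A minimal coverage check (illustrative; QuantSightBench's exact scoring is not described in the summary):

```python
def coverage(intervals, outcomes):
    """Empirical coverage of prediction intervals.

    `intervals` are (low, high) pairs; a well-calibrated 90% interval
    should contain the realized outcome ~90% of the time.
    """
    hits = sum(lo <= y <= hi for (lo, hi), y in zip(intervals, outcomes))
    return hits / len(outcomes)
```

Internal consistency adds a further constraint: a model's 50% interval must be nested inside its 90% interval for the same quantity.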
DPrivBench evaluates whether LLMs can automate differential privacy reasoning by testing if they can verify whether functions satisfy stated DP guarantees. The benchmark covers diverse DP topics and difficulty levels while resisting trivial pattern matching, addressing the expert-level barrier that prevents non-experts from designing DP algorithms.
Discover And Prove (DAP) introduces 'Hard Mode' automated theorem proving where systems must independently discover answers before constructing formal proofs, unlike standard benchmarks that embed answers in statements. Releases MiniF2F-Hard and FIMO-Hard benchmarks with expert reannotations, and an agentic framework using LLM natural-language reasoning with self-reflection for answer discovery.
📝 Blog 3d ago
★ High Signal
NVIDIA's Nemotron 3 Super is a 120B/12B-active MoE model with hybrid Mamba-Attention architecture scoring 60.47% on SWE-Bench Verified—the highest open-weight coding score at launch. Features 1M context, 2.2x throughput improvement, and native speculative decoding for efficient agentic reasoning.
Qwen 3.6 35B A3B achieves 187 tokens/sec on RTX 5090 32GB at Q5_K_S quantization with 120K context. Performance benchmark for local inference. Demonstrates practical deployment of mid-size models on consumer hardware.
Controlled experiments on shortest-path planning reveal LLMs exhibit strong spatial generalization to unseen maps but fail at length scaling due to recursive instability. The synthetic environment cleanly separates training data, paradigms, and inference strategies to isolate generalization failure modes.
Systematic benchmark of multiple optimizers for MLP training on tabular data finds Muon consistently outperforms the standard AdamW. First comprehensive optimizer comparison for tabular deep learning, challenging the default choice practitioners use.
AD4AD benchmark evaluates Visual Anomaly Detection models for identifying out-of-distribution objects in autonomous driving, enabling systems to alert drivers when encountering unfamiliar situations. Produces pixel-level anomaly maps to guide attention to specific risk regions. Addresses safety-critical failure modes when perception systems encounter conditions outside training distribution.
Vision-language models struggle to recognize human emotions, underperforming even specialized vision-only classifiers despite progress on other visual tasks. The study identifies two critical vulnerabilities: long-tailed emotion dataset distributions exacerbated by web-scale pretraining, and challenges with continuous dynamic facial expression recognition. Reveals fundamental gap in VLM emotional understanding capabilities.
Stakes signaling vulnerability shows LLM-as-a-judge models systematically corrupt assessments when informed of downstream consequences their verdicts will have on evaluated models. Controlled experiments across 1,520 responses on safety and quality benchmarks demonstrate judges evaluate based on contextual framing rather than strictly on semantic content, undermining the operational backbone of automated AI evaluation pipelines.
MADE introduces a living multi-label text classification benchmark for medical device adverse events, continuously updated with new reports to prevent training data contamination. Features long-tailed hierarchical labels and enables uncertainty quantification evaluation critical for high-stakes healthcare ML. Addresses benchmark saturation and memorization vs. reasoning distinction.
MambaSL achieves state-of-the-art time series classification using a single-layer Mamba architecture with TSC-specific modifications. Re-evaluates 20 baselines across all 30 UEA datasets under unified protocol, demonstrating SSMs can excel at time series tasks with minimal architectural complexity.
QuantCode-Bench provides 400 tasks evaluating LLMs on generating executable algorithmic trading strategies for Backtrader from English descriptions. Unlike standard code benchmarks, requires domain-specific financial logic, specialized API knowledge, and code producing actual trades on historical data, with tasks sourced from Reddit, TradingView, and synthetic generators.
Proposes axiomatic benchmark for scientific novelty metrics that avoids confounded proxies like citation counts or peer review scores. Addresses fundamental evaluation challenge for AI scientist systems by enabling reliable, automated novelty assessment without conflating novelty with impact, quality, or reviewer preference.
DiscoTrace analyzes rhetorical strategies in information-seeking answers by representing them as sequences of discourse acts paired with question interpretations. Human communities show diverse answering preferences, while LLMs lack rhetorical diversity and systematically favor breadth over depth regardless of prompting. Reveals fundamental differences in how humans and models construct answers beyond surface-level content.
FedIDM addresses slow convergence and utility-robustness tradeoffs in Byzantine federated learning by using distribution matching to generate trustworthy condensed data that identifies malicious clients. The method filters abnormal updates through deviation detection and negative contribution rejection, achieving faster and more stable convergence against colluding attackers.
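FedIDM's full pipeline (condensed data via distribution matching, contribution scoring) can't be reconstructed from the summary, but the deviation-detection ingredient it names is a standard pattern in Byzantine-robust aggregation. A generic sketch, assuming flattened update vectors and a hypothetical distance threshold:

```python
from statistics import median

def filter_updates(updates, threshold):
    """Keep client updates close to the coordinate-wise median.

    Generic deviation detection (a simplification of FedIDM's filter):
    drop any update whose L2 distance from the coordinate-wise median
    of all updates exceeds `threshold`.
    """
    center = [median(col) for col in zip(*updates)]

    def dist(u):
        return sum((a - b) ** 2 for a, b in zip(u, center)) ** 0.5

    return [u for u in updates if dist(u) <= threshold]
```

The median center is what gives robustness: colluding clients must outnumber honest ones to shift it, so their outlying updates are the ones rejected.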
Multi-metric analysis of demographic fairness in ML reveals different fairness metrics produce conflicting assessments on the same system due to capturing distinct statistical properties. Using face recognition experiments, demonstrates that fairness evaluation reliability depends critically on metric choice, challenging assumptions of consistency.
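The conflict is easy to reproduce on synthetic data: demographic parity compares selection rates, equal opportunity compares true-positive rates, and the same predictions can satisfy one while violating the other. A toy illustration (synthetic labels and predictions, not the paper's face-recognition data):

```python
def selection_rate(preds):
    return sum(preds) / len(preds)

def tpr(labels, preds):
    pos = [p for y, p in zip(labels, preds) if y == 1]
    return sum(pos) / len(pos)

# Two groups with identical selection rates but different error profiles.
y_a, p_a = [1, 1, 0, 0], [1, 1, 0, 0]
y_b, p_b = [1, 1, 0, 0], [0, 1, 1, 0]

dp_gap = abs(selection_rate(p_a) - selection_rate(p_b))  # demographic parity: 0
tpr_gap = abs(tpr(y_a, p_a) - tpr(y_b, p_b))             # equal opportunity: 0.5
```

By the parity criterion the system looks perfectly fair; by the true-positive-rate criterion it heavily disadvantages group B, which is exactly the metric-dependence the study documents.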
ProVoice-Bench introduces the first evaluation framework for proactive voice agents with 1,182 samples across four tasks measuring intervention and monitoring capabilities. State-of-the-art multimodal LLMs show significant performance gaps particularly in over-triggering and reasoning, revealing limitations in current proactive agent paradigms.
🔶 Anthropic 4d ago
★ High Signal
Claude Opus 4.7 achieves 87.6% on SWE-bench Verified (13% improvement) with 2x throughput on agentic tasks while maintaining $5/$25 per million token pricing and full 1M context window. The performance gains make it effectively cheaper per task despite unchanged nominal pricing. Higher-resolution vision capabilities included.
Qwen3.6-35B-A3B running locally outperformed Claude Opus 4.7 on an SVG pelican-generation task, demonstrating a narrowing gap between quantized open-weight models and proprietary APIs on specific visual generation benchmarks. The comparison highlights the increasing viability of local inference, though a single task does not reflect overall model capability.

DR³-Eval provides a reproducible benchmark for deep research agents using static research sandbox corpora paired with authentic user tasks, measuring multimodal report generation across dimensions including information recall, factual accuracy, and citation coverage. It addresses the challenge of evaluating long-horizon research tasks by simulating open-web complexity while remaining fully verifiable.
AIMO 3 competition analysis across 50 IMO problems shows model capability dominates inference-time optimization; diverse prompting strategies fail to beat high-temperature sampling on strong models. The 8-point capability gap persists across all prompt interventions; only verifier-based selection could close remaining selection loss.
Community report of reproducibility crisis: 4 out of 7 recent ML papers failed to reproduce claimed results, with 2 having unresolved GitHub issues. Highlights growing concerns about research quality and verification standards. Reflects broader questions about publication incentives and validation rigor in current ML research.
ICLR 2025 Oral paper evaluated SQL code generation using natural language similarity metrics instead of execution-based validation, yielding ~20% false positive rate in authors' own testing. Community questions appropriateness of Oral designation given fundamental evaluation methodology flaw. Highlights peer review challenges in code generation benchmarks.
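The false-positive mechanism is easy to demonstrate: a one-character change to a SQL predicate leaves the text nearly identical (so a similarity metric scores it as correct) while returning different rows. A self-contained sketch using a hypothetical one-column table and difflib as a stand-in for the paper's similarity metric:

```python
import sqlite3
from difflib import SequenceMatcher

def run(query):
    """Execute a query against a toy in-memory table t(a) = {1, 2, 3}."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE t (a INTEGER)")
    con.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])
    return sorted(con.execute(query).fetchall())

gold = "SELECT a FROM t WHERE a > 1"
pred = "SELECT a FROM t WHERE a >= 1"

exec_match = run(gold) == run(pred)                   # False: extra row returned
text_sim = SequenceMatcher(None, gold, pred).ratio()  # near 1.0: "looks" correct
```

Execution-based validation catches the error that text similarity waves through, which is the methodological point the community raised.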
🧠 DeepMind 6d ago
★ High Signal
Google's Gemini 3 Deep Think achieves 48.4% on Humanity's Last Exam and 84.6% on ARC-AGI-2, now available to Ultra subscribers and select enterprise users. Early adopters use it to identify mathematical paper errors missed by peer review and optimize semiconductor crystal growth. Novel application of specialized reasoning mode to scientific and engineering problems beyond standard benchmarks.
MMLU and other benchmarks that dominated 2024 are now saturated (>95% on frontier models), relegated to "floor checks" rather than frontier separators. The frontier is now decided by HLE, GPQA, MMLU-Pro, SWE-bench Pro, Terminal-Bench 2.0, and BrowseComp for agentic tasks. Benchmark choice matters more than ever as once-standard academic suites lose the power to separate top models.
Proactive Agent Research Environment simulates active users to evaluate AI assistants that anticipate needs and initiate actions rather than just responding to queries. Existing benchmarks lack realistic user simulation for testing proactive behaviors like timely suggestions and anticipatory information gathering. Bridges the gap between passive query-response evaluation and true assistant capabilities needed in high-stakes domains.
Production testing reveals Gemma 12B and Qwen 3.5 35B return correct answers in unparseable formats despite explicit instructions—Python instead of CSV, Markdown instead of CSV. Format compliance is independent capability missing from all major benchmarks (SWE-bench, Aider, LiveBench, SEAL), critical gap for production pipelines where consumers are parsers not humans. Smaller models fundamentally lack instruction-following precision for machine-readable output.
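A production pipeline can enforce the missing capability with a strict validator that rejects non-CSV output before it reaches downstream consumers. A minimal sketch of the kinds of checks implied above (the rejection rules here are illustrative, not an exhaustive guard):

```python
import csv
import io

def strict_csv(text, expected_cols):
    """Parse model output as plain CSV, or return None if malformed.

    Rejects the failure modes described above: Markdown code fences or
    pipe tables wrapped around an otherwise-correct answer, and rows
    with the wrong column count.
    """
    if "```" in text or text.lstrip().startswith("|"):
        return None
    rows = list(csv.reader(io.StringIO(text.strip())))
    if not rows or any(len(r) != expected_cols for r in rows):
        return None
    return rows
```

Wiring a validator like this into the evaluation loop would make format compliance measurable, rather than an invisible gap between benchmark scores and production behavior.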
Interspeech 2026 challenge shifts audio AI evaluation from result-oriented to process-oriented reasoning quality using instance-level rubric-based evaluation. Champion agent integrated 40+ specialized audio tools achieving 69.83% Rubrics score. Emphasizes transparent reasoning over black-box performance metrics.