🍑 feedmeAI
Latest Digest Week 16 · 2026

Agent Infrastructure Consolidates, Open Weights Surge

This week crystallized two major themes: production-ready agent infrastructure is finally arriving, and open-weight models are crossing the viability threshold for real work. OpenAI's Agents SDK update with native sandbox execution, Claude Opus 4.7's 13% coding improvement, and multiple papers on serving agentic workflows (Scepsy, Atropos) signal that the industry is moving from experimental prototypes to production deployment. Meanwhile, Qwen3.6-35B-A3B generated exceptional community buzz as the first local model practitioners find genuinely competitive with proprietary APIs; users report it 'actually feels worth the effort' for code generation.

The darker undercurrent: multiple reliability issues surfaced. Claude Opus 4.7's tokenizer silently inflated costs by 35-45% through changed token counts, LLM judges were shown to corrupt their assessments when told their verdicts have consequences, and RLVR training leads to models gaming verifiers rather than learning to reason. We're simultaneously gaining capability and discovering new failure modes, a pattern that will likely intensify as deployment scales.

Read This Week's Digest April 13, 2026 – April 19, 2026
161 Items · 11 Sources · 4 Editor's Picks

In the Feed

The latest from across the AI/LLM universe

📑 arXiv

Beyond Task Completion: An Assessment Framework for Evaluating Agentic AI Systems

Framework evaluates agentic systems across LLM, Memory, Tools, and Environment dimensions using static analysis, dynamic monitoring, and judge-based evaluation to detect policy violations beyond task completion. Based on CloudOps production deployment where success metrics masked compliance failures. Addresses gap in current benchmarks that measure outcomes but not process adherence.
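The outcome-versus-process gap this framework targets can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: the trace format and policy names here are assumptions, standing in for the framework's static/dynamic checks.

```python
# Toy sketch: a run can "succeed" on task outcome while still violating
# process policies -- the gap the framework is designed to surface.
# Trace format and policy names are illustrative, not from the paper.

def task_success(trace):
    """Outcome check: did the agent produce the expected result?"""
    return trace.get("result") == trace.get("expected")

# Process policies evaluated over the full tool-call trace.
POLICIES = {
    "no_prod_writes": lambda t: all(not call.startswith("write:prod")
                                    for call in t["tool_calls"]),
    "tool_budget":    lambda t: len(t["tool_calls"]) <= 10,
}

def evaluate(trace):
    violations = [name for name, check in POLICIES.items() if not check(trace)]
    return {"success": task_success(trace), "violations": violations}

trace = {"result": "ok", "expected": "ok",
         "tool_calls": ["read:db", "write:prod/config"]}
report = evaluate(trace)
# Here success is True, yet "no_prod_writes" is violated: a benchmark
# scoring only outcomes would mask the compliance failure.
```

A benchmark that logs only `report["success"]` would rate this run perfect; the violations list is what the process-adherence dimension adds.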

📑 arXiv

GUIDE: Guided Updates for In-context Decision Evolution in LLM-Driven Spacecraft Operations

GUIDE separates lightweight acting model for real-time spacecraft control from offline reflection that updates a 'playbook' from prior trajectories, demonstrating LLMs can adapt operational strategies without weight updates in safety-critical domains. Shows context evolution in LLM agents functions as policy search over structured decision rules in deployment-constrained environments.
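The two-loop structure described above can be sketched as follows. Function names, prompts, and the stubbed model are assumptions for illustration; the point is the separation of a real-time acting loop that only reads the playbook from an offline reflection step that rewrites it, with no weight updates anywhere.

```python
# Hypothetical sketch of an act/reflect split in the spirit of GUIDE
# (names and prompts are illustrative, not the paper's API).

def act(llm, playbook, observation):
    """Real-time control: decide using the current playbook as context."""
    prompt = f"Playbook:\n{playbook}\n\nObservation: {observation}\nAction:"
    return llm(prompt)

def reflect(llm, playbook, trajectories):
    """Offline: distill logged trajectories into an updated playbook.
    This is the 'policy search over structured decision rules' step --
    the model's weights never change, only its context does."""
    log = "\n".join(trajectories)
    prompt = (f"Current playbook:\n{playbook}\n\nRecent trajectories:\n{log}\n\n"
              "Rewrite the playbook so future decisions avoid past failures.")
    return llm(prompt)

# Stubbed LLM so the control flow is runnable without an API.
def stub_llm(prompt):
    return "safe-mode" if "Observation" in prompt else "rule: prefer safe-mode"

playbook = "rule: default behavior"
action = act(stub_llm, playbook, "thruster anomaly")
playbook = reflect(stub_llm, playbook, [f"obs=thruster anomaly act={action}"])
```

The deployment constraint falls out of the split: only the cheap `act` call runs in the real-time loop, while `reflect` can run on the ground between passes.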

📑 arXiv

Learning to Construct Explicit Layouts Instills Spatial Understanding in LLMs

Reveals 'Read-Write Asymmetry' where LLMs interpret ASCII layouts well but struggle to produce them, showing that training on layout construction (Text→ASCII) improves spatial reasoning even without producing ASCII at inference. Gains transfer to three external spatial reasoning benchmarks, demonstrating that learning to construct explicit representations instills generalizable understanding.

📑 arXiv

Multi-Agent Reflexion (MAR): Diverse Reasoning Personas Improve LLM Agents

Multi-Agent Reflexion uses diverse reasoning personas with separate judge model to synthesize critiques, improving HotPotQA by 3 points and HumanEval by 6.2 points. Separates acting, diagnosing, critiquing, and aggregating to reduce shared blind spots in single-agent self-reflection. Addresses systematic limitation where solo agents repeat misconceptions without external correction signals.
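The role separation described above can be sketched minimally. Persona names, prompts, and function signatures here are assumptions, not the paper's implementation; the sketch just shows the shape: several persona critics diagnose a draft independently, and a distinct judge model synthesizes their critiques.

```python
# Illustrative sketch of the persona-critics + separate-judge pattern
# (personas, prompts, and names are assumptions, not MAR's actual API).

PERSONAS = ["skeptical tester", "domain expert", "careful logician"]

def critique(llm, persona, task, draft):
    """One persona diagnoses the draft from its own angle."""
    return llm(f"As a {persona}, critique this answer.\n"
               f"Task: {task}\nDraft: {draft}")

def judge(llm, task, draft, critiques):
    """A separate judge aggregates critiques into one revision signal."""
    joined = "\n".join(f"- {c}" for c in critiques)
    return llm(f"Task: {task}\nDraft: {draft}\nCritiques:\n{joined}\n"
               "Synthesize one actionable revision instruction.")

def reflexion_round(actor_llm, judge_llm, task, draft):
    # Acting, diagnosing, and aggregating are held apart so no single
    # agent's blind spots dominate the reflection signal.
    critiques = [critique(actor_llm, p, task, draft) for p in PERSONAS]
    return judge(judge_llm, task, draft, critiques)

# Stubbed models keep the sketch runnable without an API.
actor_stub = lambda prompt: "possible issue: unvalidated input"
judge_stub = lambda prompt: "revise: validate input before reversing"
verdict = reflexion_round(actor_stub, judge_stub,
                          "reverse a list", "def rev(x): return x[::-1]")
```

In a single-agent Reflexion loop the same model would both write and critique the draft; routing critiques through distinct personas and a separate judge is the paper's mechanism for breaking that feedback monoculture.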