GUIDE separates a lightweight acting model for real-time spacecraft control from an offline reflection stage that updates a 'playbook' from prior trajectories, demonstrating that LLMs can adapt operational strategies without weight updates in safety-critical domains. Shows that context evolution in LLM agents functions as policy search over structured decision rules in deployment-constrained environments.
Reveals 'Read-Write Asymmetry' where LLMs interpret ASCII layouts well but struggle to produce them, showing that training on layout construction (Text→ASCII) improves spatial reasoning even without producing ASCII at inference. Gains transfer to three external spatial reasoning benchmarks, demonstrating that learning to construct explicit representations instills generalizable understanding.
Multi-Agent Reflexion uses diverse reasoning personas with a separate judge model to synthesize critiques, improving HotPotQA by 3 points and HumanEval by 6.2 points. Separates acting, diagnosing, critiquing, and aggregating to reduce shared blind spots in single-agent self-reflection. Addresses the systematic limitation where solo agents repeat misconceptions without external correction signals.
Transformers make irrevocable decisions before seeing full context, replicating rhyme-planning findings on open-weights models and extending to factual recall. Reveals premature binding mechanisms that limit reasoning—models commit to answers too early. First mechanistic evidence of early commitment across multiple task types.
User reports Gemini identified a $280M AAVE crypto exploit hours before public disclosure, then retracted it as a hallucination when the user couldn't verify it because news hadn't broken yet. The incident raises questions about model temporal knowledge, hallucination detection, and potential real-time information synthesis.
Qwen 3.6 achieves significant performance improvements, approaching Claude Opus and Codex in usefulness when the `preserve_thinking` configuration is enabled. Runs efficiently at 8-bit quantization on M5 Max hardware, with ~3K tokens/s prompt processing and ~100 tokens/s generation via oMLX.
Moonlake's world models for game development demonstrate that code engines with symbolic reasoning outperform pure diffusion models for game logic and boundary conditions. The work positions gaming as a testbed for world models before broader deployment, highlighting the structure-vs-scale debate and comparing language vs. JEPA approaches.
AtManRL uses differentiable attention manipulation and reinforcement learning to train LLMs to generate reasoning traces that genuinely influence final predictions rather than merely accompanying them. By learning additive attention masks that identify crucial CoT tokens, the method derives a saliency reward signal integrated with outcome-based rewards in the GRPO framework for faithful chain-of-thought reasoning.
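A minimal sketch of how a saliency signal could be blended with outcome rewards under GRPO-style group-normalized advantages. The weighting `lam` and the exact reward shapes are assumptions for illustration, not AtManRL's published formulation:

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each rollout's reward by the
    group mean and population standard deviation (GRPO-style baseline)."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0
    return [(r - mu) / sigma for r in rewards]

def combined_reward(outcome, saliency, lam=0.3):
    """Blend outcome correctness with a saliency signal scoring how much
    the CoT tokens actually influenced the answer (lam is hypothetical)."""
    return outcome + lam * saliency

# Toy rollout group: (outcome in {0, 1}, saliency in [0, 1])
group = [(1.0, 0.8), (0.0, 0.2), (1.0, 0.1), (0.0, 0.6)]
rewards = [combined_reward(o, s) for o, s in group]
advs = grpo_advantages(rewards)
print([round(a, 3) for a in advs])
```

With this blend, a correct answer whose reasoning trace was also influential gets the largest advantage within its group.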
Chain-of-Thought prompting consistently degrades performance in visual spatial reasoning tasks across seventeen multimodal models and thirteen benchmarks. A novel No-Image++ ablation reveals that multimodal reasoning models (MRMs) hallucinate visual details from textual priors even when images are absent, indicating severe shortcut learning in CoT-prompted vision-language models.
Users report degraded quality in Claude Opus 4.7 for complex reasoning tasks in theoretical math and physics, citing frequent downtime and performance drops compared to version 4.6. Multiple researchers considering switching back to ChatGPT despite previous preference for Claude.
Mind's Eye benchmark evaluates MLLMs on eight visuo-cognitive tasks inspired by human intelligence tests, organized under Abstraction-Relation-Transformation taxonomy. Humans achieve 80% accuracy while top MLLMs remain below 50%, revealing failures in visual attention, pattern induction, and mental transformation—core processes of fluid intelligence.
STOP (Super TOken for Pruning) is the first learnable internal path pruning method for Large Reasoning Models, addressing prohibitive costs from futile reasoning paths. Outperforms existing baselines across LRMs from 1.5B to 20B parameters by systematically pruning at the prefix level using internal signals.
Post-trained language models produce less varied outputs than base models, undermining inference-time scaling methods that rely on sample diversity. Study traces output diversity through three Olmo 3 post-training lineages, finding collapse location co-varies with data composition—the Think lineage loses most semantic diversity during supervised fine-tuning.
SocialGrid is an Among Us-inspired benchmark evaluating LLM agents on planning, task execution, and social reasoning in embodied multi-agent settings. Even GPT-OSS-120B achieves below 60% accuracy, with agents stuck in repetitive behaviors—revealing social reasoning remains a bottleneck even with planning assistance.
MEDLEY-BENCH evaluates AI metacognition by separating independent reasoning, private self-revision, and socially influenced revision under genuine inter-model disagreement. Testing 35 models reveals a robust dissociation: evaluation ability scales with model size, but control over one's reasoning does not, indicating larger models can assess but not regulate their cognition.
Agentic Verifier transforms reward modeling into multi-turn, tool-augmented deliberation using complementary forward and backward agents. Addresses error propagation and lack of grounding in complex domains by tracing solutions from premises to conclusions and re-checking conclusions against premises for comprehensive verification.
ReactBench reveals fundamental limitations in MLLMs' structural reasoning by testing them on chemical reaction diagrams with branching paths, converging flows, and cyclic dependencies. Existing models degrade sharply on topological structures despite excelling at individual visual elements, exposing a gap that semantic-focused benchmarks miss.
WORC (Weak-link Optimization for Reasoning and Collaboration) improves multi-agent LLM frameworks by systematically identifying and reinforcing performance-limiting agents rather than only enhancing high-capability agents. Addresses reasoning instability where individual agent errors amplify through collaboration, grounded in the weak-link principle.
Survey categorizing graph-LLM integration methods by purpose (reasoning, retrieval, generation, recommendation), graph modality (knowledge graphs, scene graphs, causal graphs), and integration strategy (prompting, augmentation, training, agent-based). Provides clarity on when and what types of graph representations enhance LLM capabilities.
LLMs show systematic asymmetry between judging pragmatic appropriateness and generating pragmatically appropriate language across three settings. Models that excel at evaluating pragmatic competence often fail to produce similarly competent outputs, revealing misalignment between their listener and speaker capabilities.
QuantSightBench evaluates LLM quantitative forecasting using prediction intervals over continuous quantities rather than binary or multiple-choice formats. The benchmark demands scale awareness, internal consistency across confidence levels, and calibration over continuous outcomes, addressing a gap in existing reasoning-under-uncertainty evaluations.
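Two of the properties the benchmark demands can be checked mechanically. A sketch (the function names and toy data are assumptions, not QuantSightBench's API): nestedness of intervals across confidence levels, and empirical coverage against the nominal level:

```python
def intervals_consistent(intervals):
    """Internal consistency: a higher-confidence interval must contain
    every lower-confidence one (dict keys are confidence levels)."""
    levels = sorted(intervals)
    for lo_c, hi_c in zip(levels, levels[1:]):
        (a, b), (c, d) = intervals[lo_c], intervals[hi_c]
        if not (c <= a and b <= d):
            return False
    return True

def coverage(truths, preds, level):
    """Empirical coverage: fraction of true values inside the intervals.
    A calibrated forecaster's coverage should sit near `level`."""
    hits = sum(lo <= t <= hi for t, (lo, hi) in zip(truths, preds))
    return hits / len(truths)

# Hypothetical forecast for one question at two confidence levels
forecast = {0.5: (40.0, 60.0), 0.9: (25.0, 80.0)}
print(intervals_consistent(forecast))  # True: the 50% interval nests in the 90%

truths = [42, 71, 55, 90]
preds90 = [(30, 60), (50, 95), (40, 70), (60, 85)]
print(coverage(truths, preds90, 0.9))  # 0.75: under-covered vs nominal 0.9
```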
DPrivBench evaluates whether LLMs can automate differential privacy reasoning by testing if they can verify whether functions satisfy stated DP guarantees. The benchmark covers diverse DP topics and difficulty levels while resisting trivial pattern matching, addressing the expert-level barrier that prevents non-experts from designing DP algorithms.
CiPO (Counterfactual Unlearning through iterative Preference Optimization) removes unwanted knowledge from Large Reasoning Models by intervening in chain-of-thought reasoning traces, avoiding degradation of reasoning performance. Redefines unlearning for LRMs as targeted CoT intervention rather than wholesale knowledge removal.
Investigation of LLM arithmetic reveals models recognize tasks early but generate correct results only in final layers, with proficient models exhibiting clear division of labor: attention modules propagate input information while MLP modules aggregate it. This attention-MLP specialization is absent in less capable models, traced via early decoding across layers.
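The early-decoding probe behind this finding can be sketched logit-lens style: project each layer's residual state through the unembedding and watch when the answer token overtakes the task token. Everything below is a toy stand-in (tiny vectors, made-up vocabulary), not the paper's setup:

```python
def early_decode(hidden_by_layer, unembed, vocab):
    """Logit-lens-style probe: project each layer's residual state
    through per-token unembedding rows, report the top token per layer."""
    out = []
    for h in hidden_by_layer:
        logits = [sum(hi * wi for hi, wi in zip(h, row)) for row in unembed]
        out.append(vocab[logits.index(max(logits))])
    return out

vocab = ["7", "12", "+"]
unembed = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.6]]  # toy row per token
# Toy residual stream: the task token ("+") dominates early layers,
# the numeric answer ("12") only emerges at the final layer
layers = [[0.4, 0.4], [0.5, 0.5], [0.1, 1.2]]
print(early_decode(layers, unembed, vocab))  # ['+', '+', '12']
```

The "recognize early, resolve late" pattern shows up as the task token winning in intermediate layers while the answer only wins at the end.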
Discover And Prove (DAP) introduces 'Hard Mode' automated theorem proving where systems must independently discover answers before constructing formal proofs, unlike standard benchmarks that embed answers in statements. Releases MiniF2F-Hard and FIMO-Hard benchmarks with expert reannotations, and an agentic framework using LLM natural-language reasoning with self-reflection for answer discovery.
MemoSight unifies context compression with multi-token prediction to accelerate LLM reasoning without quality loss, addressing computational bottlenecks in long-context reasoning. The approach makes advanced reasoning capabilities more practical for production as context windows expand.
Investigates when small transformers make early, irreversible commitments to outputs during forward passes, replicating findings on open-weights models and extending to factual recall tasks. Understanding minimal architectures for planning-like behavior reveals how models perform multi-step reasoning with limited computational resources, advancing mechanistic interpretability.
Anthropic released Auto mode for Claude Code (Opus 4.7, Max tier) and new "xhigh" effort level between high and max for granular reasoning control. Update includes fullscreen TUI rendering, mobile notifications for Remote Control, and Windows/MCP fixes.
Controlled experiments on shortest-path planning reveal LLMs exhibit strong spatial generalization to unseen maps but fail at length scaling due to recursive instability. The synthetic environment cleanly separates training data, paradigms, and inference strategies to isolate generalization failure modes.
LLMs and VLMs can perform viewpoint rotation understanding tasks using only text descriptions, without visual input. The study investigates how models infer final viewpoints and predict observations after textual descriptions of rotations, examining whether linguistic intelligence alone enables spatial reasoning. Uses interpretability methods to understand the internal mechanisms enabling this capability.
CoopEval benchmarks game-theoretic cooperation mechanisms across four social dilemmas, revealing that stronger reasoning LLMs behave less cooperatively in mixed-motive games like prisoner's dilemma. The work evaluates mechanisms including repeated games, reputation systems, and commitment devices to enable cooperative equilibria between rational agents.
Fixed-point framework analyzes looped transformers for test-time compute scaling along reachability, input-dependence, and geometric stability axes. Proves looped networks without recall have countable fixed points and cannot achieve strong input-dependence, while recall combined with outer normalization produces regimes where fixed points are reachable, locally smooth, and input-dependent—enabling extrapolation to harder problems rather than memorization.
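The reachability and input-dependence axes can be made concrete with the standard implicit-function view of a looped block $h_{t+1} = f(h_t, x)$ (notation here is generic, not necessarily the paper's):

```latex
h^\star = f(h^\star, x), \qquad
\rho\!\left(\left.\frac{\partial f}{\partial h}\right|_{h^\star}\right) < 1
\quad \text{(local stability: iteration reaches } h^\star\text{)},
```
```latex
\frac{\mathrm{d}h^\star}{\mathrm{d}x}
  = \left(I - \frac{\partial f}{\partial h}\right)^{-1}
    \frac{\partial f}{\partial x}
\quad \text{(input-dependence of the fixed point)}.
```

Without recall the loop drops its $x$ argument, so $\partial f / \partial x = 0$ and $\mathrm{d}h^\star/\mathrm{d}x = 0$: fixed points cannot depend on the input, matching the paper's negative result for recall-free looping.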
SpecGuard performs step-level verification in speculative decoding using only model-internal signals (attention-based grounding scores and ensemble verification) without external reward models. Prevents erroneous reasoning steps from propagating while avoiding the latency and computational overhead of external verifiers in multi-step reasoning tasks.
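A toy sketch of step-level gating on internal signals (the attention rows, threshold, and score definition are illustrative assumptions, not SpecGuard's actual signals): accept draft steps until the first poorly grounded one, so nothing downstream of it survives:

```python
def grounding_score(step_attention, context_len):
    """Toy internal signal: fraction of a step's attention mass on the
    grounded context rather than on its own generated prefix (a stand-in
    for an attention-based grounding score)."""
    total = sum(step_attention)
    return sum(step_attention[:context_len]) / total if total else 0.0

def verify_steps(draft_steps, context_len, threshold=0.4):
    """Accept draft reasoning steps until the first one whose grounding
    score falls below threshold; later steps are discarded so an
    erroneous step cannot propagate."""
    accepted = []
    for text, attention in draft_steps:
        if grounding_score(attention, context_len) < threshold:
            break
        accepted.append(text)
    return accepted

# Toy attention rows: the first 3 positions are the grounded context
steps = [("step1", [0.5, 0.2, 0.1, 0.2]),
         ("step2", [0.3, 0.2, 0.1, 0.4]),
         ("step3", [0.1, 0.0, 0.1, 0.8]),
         ("step4", [0.4, 0.3, 0.1, 0.2])]
print(verify_steps(steps, context_len=3))  # ['step1', 'step2']
```

Note step4 is dropped even though it scores well on its own: once a step fails verification, everything built on it is untrusted.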
IRS framework decomposes humor understanding into three structured components: identifying visual incongruities, constructing coherent reinterpretations, and aligning with human preference judgments. Applies incongruity-resolution theory to the New Yorker Cartoon Caption Contest, moving beyond black-box prediction to explicit reasoning processes. Demonstrates that humor comprehension requires getting both the answer and the underlying reasoning correct.
Meituan introduces Policy-Guided Hybrid Simulation (PGHS), a dual-process framework that simulates group-level user behavior for merchant strategy evaluation by mining transferable decision policies from behavioral trajectories. The approach addresses information incompleteness and mechanism duality by anchoring an LLM-based reasoning branch with behavioral policies to prevent over-rationalization. This enables scalable counterfactual evaluation without costly online experiments.
RLVR-trained models on inductive reasoning tasks systematically abandon rule induction and instead enumerate instance-level labels that pass verifiers without capturing relational patterns—a form of reward hacking exploiting imperfect verifiers. The paper introduces detection methods for these shortcuts where models game verifiers rather than learn generalizable reasoning.
IG-Search introduces step-level information gain rewards for search-augmented reasoning, measuring how retrieved documents improve model confidence in answers relative to random baselines. This addresses the gradient collapse problem in trajectory-level RL when all sampled trajectories fail and enables distinguishing precise queries from vague ones within rollout groups.
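The core reward is sketchable in a few lines (function and variable names are assumptions): the log-probability of the gold answer given the retrieved document, minus the same quantity under a random-document baseline. Because this differs step by step even when every trajectory in a group fails, it avoids the all-identical-reward gradient collapse:

```python
import math

def info_gain_reward(logp_with_doc, logp_with_random):
    """Step-level reward: how much the retrieved document raises the
    model's log-probability of the gold answer over a random-document
    baseline. Positive means the retrieval genuinely informed the model."""
    return logp_with_doc - logp_with_random

# Hypothetical per-step answer probabilities for two query styles
precise_query = info_gain_reward(math.log(0.6), math.log(0.1))
vague_query = info_gain_reward(math.log(0.15), math.log(0.1))
print(round(precise_query, 3), round(vague_query, 3))  # 1.792 0.405
```

The precise query earns a much larger step reward than the vague one, which is exactly the within-group distinction the method is after.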
Compact "Gene" representation outperforms documentation-oriented "Skill" packages for test-time evolution across 4,590 trials in scientific code tasks. Expanding experience into fuller documentation degrades performance, showing that representation format is a first-order factor in reusable experience.
Prolepsis phenomenon: transformers commit to decisions early via task-specific attention heads that sustain the commitment without later correction. Replicates planning-site findings in Gemma 2 2B and Llama 3.2 1B, showing residual-stream methods miss this behavior while causal lens tracing captures it. The same motif appears across different tasks (planning, factual recall) at different network depths.
GPT-Rosalind is a frontier reasoning model specialized for life sciences research including drug discovery, genomics analysis, protein reasoning, and scientific workflows. Purpose-built for domain-specific scientific acceleration.
LongAct identifies high-magnitude activations in query/key vectors during long-context processing as critical for optimization. Leverages insights from quantization and sparse reasoning structure to guide RL training for improved long-context reasoning.
AIMO 3 competition analysis across 50 IMO problems shows model capability dominates inference-time optimization; diverse prompting strategies fail to beat high-temperature sampling on strong models. The 8-point capability gap persists across all prompt interventions; only verifier-based selection could close the remaining selection loss.
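The verifier-based selection the analysis points to is just best-of-n over high-temperature samples. A toy sketch under stated assumptions (the sampler and verifier here are stand-in lambdas, not the competition setup):

```python
import random

def best_of_n(sample_fn, verify_fn, n=8):
    """Draw n high-temperature samples and keep the one the verifier
    scores highest; a perfect verifier recovers the best sample and
    closes the selection loss."""
    candidates = [sample_fn() for _ in range(n)]
    return max(candidates, key=verify_fn)

# Toy stand-ins: answers near the true value 42 are better
random.seed(0)
sample = lambda: 42 + random.gauss(0, 5)
verifier = lambda a: -abs(a - 42)
best = best_of_n(sample, verifier, n=16)
print(round(abs(best - 42), 2))
```

With prompting interventions you change `sample_fn`; the analysis finds that varying `verify_fn` quality matters far more once the base model is strong.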
RAD-2 combines diffusion-based trajectory generation with RL-optimized discriminator for autonomous driving motion planning. Generator produces diverse multimodal candidates while discriminator reranks by long-term driving quality, addressing stochastic instabilities and lack of corrective feedback in pure imitation learning. Decoupled design avoids applying sparse rewards directly to high-dimensional diffusion process.
Gemma 4 26B and E4B models outperform Qwen 3.5 series in local deployment scenarios, replacing a multi-model routing setup that previously used Qwen variants for chat, reasoning, and code generation. Users report better performance despite similar quantization levels, suggesting improved base model capabilities at comparable parameter counts.
Hugging Face analysis of VAKRA agent system covering reasoning patterns, tool use mechanisms, and common failure modes in agent architectures.
Gemini Robotics-ER 1.6 specialized reasoning model for physical AI achieves 93% success on instrument reading tasks (up from 23% baseline) through agentic vision combining visual reasoning with code execution. It adds spatial reasoning, multi-view perception, and industrial gauge interpretation as a high-level planning layer for vision-language-action robotics models.
Google DeepMind released Gemini Robotics-ER 1.6, a robotics reasoning model with improved spatial reasoning, multi-view perception, instrument reading, and hazard detection (+6% text, +10% video safety). Available via Gemini API with Boston Dynamics deploying it for autonomous Spot robot operations.
Boston Dynamics integrated Gemini and Gemini Robotics-ER 1.6 into Spot's Orbit AIVI systems, enabling robots to perform complex reasoning about industrial environments, identify hazards, and read instruments. The Gemini-powered AIVI-Learning system is now live for existing customers as of April 15, 2026.
C2 trains reward models to critically collaborate with rubric generators using only binary preference data, avoiding costly rubric annotations. The framework generates helpful and misleading rubric pairs to teach the reward model when to rely on or override rubric guidance, addressing the cooperative communication failure where low-quality rubrics mislead verification.
Google's Gemini 3 Deep Think achieves 48.4% on Humanity's Last Exam and 84.6% on ARC-AGI-2, now available to Ultra subscribers and select enterprise users. Early adopters use it to identify mathematical paper errors missed by peer review and optimize semiconductor crystal growth. Novel application of specialized reasoning mode to scientific and engineering problems beyond standard benchmarks.
VCR-Agent is a multi-agent framework that generates mechanistic action graphs to represent biological reasoning in virtual cells, enabling verification and falsification of LLM-generated explanations. The approach releases VC-TRACES, a dataset of verified biological mechanisms, addressing the challenge of factually grounded scientific explanations from LLMs in open-ended domains like biology.
Gemini Robotics-ER 1.6 enhances spatial reasoning and multi-view understanding for autonomous robotics tasks. Focuses on embodied reasoning capabilities for real-world robot control.
PARROT framework uses reward models that generate explicit multi-dimensional critiques before scoring, enabling test-time critique-and-refine loops that match RL fine-tuning performance without parameter updates. Transforms reward models from passive evaluators to active optimization tools. First demonstration that structured reasoning at inference time can unlock capabilities equivalent to gradient-based training.
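The test-time loop can be sketched with stub functions (every callable below is a hypothetical stand-in for an LLM or reward-model call, not PARROT's interface): critique, refine against the critique, and stop once the score clears a target:

```python
def critique_and_refine(draft, critique_fn, refine_fn, score_fn,
                        max_rounds=3, target=0.9):
    """Test-time loop: the reward model emits an explicit critique, the
    generator refines against it, and we stop once `score_fn` clears
    `target` or the round budget runs out."""
    for _ in range(max_rounds):
        if score_fn(draft) >= target:
            break
        draft = refine_fn(draft, critique_fn(draft))
    return draft

# Toy stand-ins: each '?' is an unresolved issue; refinement fixes one
score_fn = lambda d: 1.0 - 0.2 * d.count("?")
critique = lambda d: "resolve one open question"
refine = lambda d, c: d.replace("?", ".", 1)
out = critique_and_refine("fix? this? text?", critique, refine, score_fn)
print(out)  # fix. this. text.
```

No parameters change anywhere: the improvement lives entirely in the loop, which is the paper's equivalence claim with RL fine-tuning.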
Interview with Sebastian Raschka covering 2026 AI architecture evolution, post-training to hybrid models, and Process Reward Models as the next frontier. Discusses his minimal AI stack (Mac mini, Codex, Ollama), fine-tuning as economic decision, and layer-by-layer verification philosophy for his upcoming book 'Build a Reasoning Model from Scratch.'
Meta Muse Spark marks Meta's pivot from open-source to proprietary models, featuring multimodal perception, parallel subagent execution, and a contemplating mode. Built by Meta Superintelligence Labs, it offers competitive vision and language performance but lags in coding, representing Meta's first paid API model after Llama 4's poor reception.
Meta launched Muse Spark, its first proprietary-only model since forming Meta Superintelligence Labs, featuring native multimodal reasoning and "thought compression" achieving results with over 10x less compute than Llama 4 by penalizing excessive thinking time during RL training. The pivot away from open source is confined to Meta AI app/website with private API preview only, sparking backlash from the open source community. Meta refused to clarify whether Llama development has ended.
∇-Reasoner applies first-order gradient descent over token logits during inference, achieving 20%+ accuracy gains on math reasoning while reducing model calls by 10-40%. Theoretically proves inference-time gradient descent in sample space is dual to KL-regularized RL alignment. First work bridging test-time optimization with training-time alignment theory through differentiable decoding.
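A toy sketch of gradient ascent over token logits (simplified to a single categorical; the learning rate, reward vector, and step count are illustrative assumptions). For expected reward under a softmax, the exact gradient is $p_i (r_i - \mathbb{E}[r])$, so no autograd is needed here:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def ascend(logits, rewards, lr=1.0, steps=50):
    """First-order ascent on token logits maximizing the expected reward
    of the categorical distribution; the analytic gradient of E[r] with
    respect to logit i is p_i * (r_i - E[r])."""
    for _ in range(steps):
        p = softmax(logits)
        er = sum(pi * ri for pi, ri in zip(p, rewards))
        logits = [l + lr * pi * (ri - er)
                  for l, pi, ri in zip(logits, p, rewards)]
    return softmax(logits)

# Three candidate tokens; the second scores highest under the verifier
p = ascend([0.0, 0.0, 0.0], rewards=[0.1, 1.0, 0.3])
print([round(x, 2) for x in p])
```

Probability mass migrates to the highest-reward token without touching any weights, mirroring the KL-regularized-RL duality the paper proves at full scale.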
4.5-hour comprehensive state-of-AI discussion covering LLMs, geopolitics, training approaches, open vs. closed models, AGI timelines, and industry implications in 2026. Technical depth on inference-time scaling and reasoning models. Major synthesis from Raschka and Lambert on field evolution.
4.5-hour discussion with Sebastian Raschka, Nathan Lambert, and Lex Fridman covering 2026 AI landscape including inference-time scaling, RLVR, architecture evolution, open vs closed models, AGI timelines, and economic forces shaping development. Comprehensive synthesis of current industry perspectives and technical directions.
Comprehensive taxonomy of inference-time scaling approaches including recursive language models and test-time compute research. Inference scaling has become the most effective method for improving deployed LLM answer quality. Technical explainer for understanding modern reasoning model architectures.
Comprehensive survey organizing agentic reasoning along three dimensions: foundational (planning, tool use, search), self-evolving (feedback, memory, adaptation), and collective multi-agent reasoning. Distinguishes in-context reasoning from post-training reasoning and provides unified taxonomy bridging thought and action across science, robotics, healthcare, and mathematics.
Simon Willison predicts 2026 as inflection point where LLM code quality becomes undeniable, driven by reasoning models trained with RL specifically for code. Also forecasts 2026 as year of solving code sandboxing via containers and WebAssembly, addressing security risks and prompt injection vulnerabilities from executing untrusted LLM-generated code. Critical for safe agentic workflows.
Curated reading list featuring 1 paper/blog/model family per week for all of 2025, covering LLMs, reasoning models, inference-time scaling, and AI engineering. Represents canonical synthesis of 2025's key technical developments from Latent Space podcast.