🍡 feedmeAI
← All topics
Reasoning 63 items

Everything Reasoning

📑 arXiv 1h ago

GUIDE: Guided Updates for In-context Decision Evolution in LLM-Driven Spacecraft Operations

GUIDE separates lightweight acting model for real-time spacecraft control from offline reflection that updates a 'playbook' from prior trajectories, demonstrating LLMs can adapt operational strategies without weight updates in safety-critical domains. Shows context evolution in LLM agents functions as policy search over structured decision rules in deployment-constrained environments.

📑 arXiv 1h ago

Learning to Construct Explicit Layouts Instills Spatial Understanding in LLMs

Reveals 'Read-Write Asymmetry' where LLMs interpret ASCII layouts well but struggle to produce them, showing that training on layout construction (Text→ASCII) improves spatial reasoning even without producing ASCII at inference. Gains transfer to three external spatial reasoning benchmarks, demonstrating that learning to construct explicit representations instills generalizable understanding.

📑 arXiv 1h ago

Multi-Agent Reflexion (MAR): Diverse Reasoning Personas Improve LLM Agents

Multi-Agent Reflexion uses diverse reasoning personas with separate judge model to synthesize critiques, improving HotPotQA by 3 points and HumanEval by 6.2 points. Separates acting, diagnosing, critiquing, and aggregating to reduce shared blind spots in single-agent self-reflection. Addresses systematic limitation where solo agents repeat misconceptions without external correction signals.

📑 arXiv 1h ago

What Is the Minimum Architecture for Prolepsis? Early Irrevocable Commitment Across Tasks in Small Transformers

Transformers make irrevocable decisions before seeing full context, replicating rhyme-planning findings on open-weights models and extending to factual recall. Reveals premature binding mechanisms that limit reasoning—models commit to answers too early. First mechanistic evidence of early commitment across multiple task types.

💬 Reddit 1d ago

Gemini caught a $280M crypto exploit before it hit the news, then retracted it as a hallucination because I couldn't verify it - because the news hadn't dropped yet

User reports Gemini identified a $280M AAVE crypto exploit hours before public disclosure, then retracted it as a hallucination when the user couldn't verify it because news hadn't broken yet. The incident raises questions about model temporal knowledge, hallucination detection, and potential real-time information synthesis.

📑 arXiv 2d ago

AtManRL: Towards Faithful Reasoning via Differentiable Attention Saliency

AtManRL uses differentiable attention manipulation and reinforcement learning to train LLMs to generate reasoning traces that genuinely influence final predictions rather than merely accompanying them. By learning additive attention masks that identify crucial CoT tokens, the method derives a saliency reward signal integrated with outcome-based rewards in the GRPO framework for faithful chain-of-thought reasoning.

📑 arXiv 2d ago

Mind's Eye: A Benchmark of Visual Abstraction, Transformation and Composition for Multimodal LLMs

Mind's Eye benchmark evaluates MLLMs on eight visuo-cognitive tasks inspired by human intelligence tests, organized under Abstraction-Relation-Transformation taxonomy. Humans achieve 80% accuracy while top MLLMs remain below 50%, revealing failures in visual attention, pattern induction, and mental transformation—core processes of fluid intelligence.

📑 arXiv 2d ago

MEDLEY-BENCH: Scale Buys Evaluation but Not Control in AI Metacognition

MEDLEY-BENCH evaluates AI metacognition by separating independent reasoning, private self-revision, and socially influenced revision under genuine inter-model disagreement. Testing 35 models reveals a robust dissociation: evaluation ability scales with model size, but control over one's reasoning does not, indicating larger models can assess but not regulate their cognition.

📑 arXiv 2d ago

Disentangling Mathematical Reasoning in LLMs: A Methodological Investigation of Internal Mechanisms

Investigation of LLM arithmetic reveals models recognize tasks early but generate correct results only in final layers, with proficient models exhibiting clear division of labor: attention modules propagate input information while MLP modules aggregate it. This attention-MLP specialization is absent in less capable models, traced via early decoding across layers.

📑 arXiv 2d ago

Discover and Prove: An Open-source Agentic Framework for Hard Mode Automated Theorem Proving in Lean 4

Discover And Prove (DAP) introduces 'Hard Mode' automated theorem proving where systems must independently discover answers before constructing formal proofs, unlike standard benchmarks that embed answers in statements. Releases MiniF2F-Hard and FIMO-Hard benchmarks with expert reannotations, and an agentic framework using LLM natural-language reasoning with self-reflection for answer discovery.

📑 arXiv 3d ago

What Is the Minimum Architecture for Prolepsis? Early Irrevocable Commitment Across Tasks in Small Transformers

Investigates when small transformers make early, irreversible commitments to outputs during forward passes, replicating findings on open-weights models and extending to factual recall tasks. Understanding minimal architectures for planning-like behavior reveals how models perform multi-step reasoning with limited computational resources, advancing mechanistic interpretability.

📑 arXiv 3d ago

How Do LLMs and VLMs Understand Viewpoint Rotation Without Vision? An Interpretability Study

LLMs and VLMs can perform viewpoint rotation understanding tasks using only text descriptions, without visual input. The study investigates how models infer final viewpoints and predict observations after textual descriptions of rotations, examining whether linguistic intelligence alone enables spatial reasoning. Uses interpretability methods to understand the internal mechanisms enabling this capability.

📑 arXiv 3d ago

CoopEval: Benchmarking Cooperation-Sustaining Mechanisms and LLM Agents in Social Dilemmas

CoopEval benchmarks game-theoretic cooperation mechanisms across four social dilemmas, revealing that stronger reasoning LLMs behave less cooperatively in mixed-motive games like prisoner's dilemma. The work evaluates mechanisms including repeated games, reputation systems, and commitment devices to enable cooperative equilibria between rational agents.

📑 arXiv 3d ago

Stability and Generalization in Looped Transformers

Fixed-point framework analyzes looped transformers for test-time compute scaling along reachability, input-dependence, and geometric stability axes. Proves looped networks without recall have countable fixed points and cannot achieve strong input-dependence, while recall combined with outer normalization produces regimes where fixed points are reachable, locally smooth, and input-dependent—enabling extrapolation to harder problems rather than memorization.

📑 arXiv 3d ago

From Tokens to Steps: Verification-Aware Speculative Decoding for Efficient Multi-Step Reasoning

SpecGuard performs step-level verification in speculative decoding using only model-internal signals (attention-based grounding scores and ensemble verification) without external reward models. Prevents erroneous reasoning steps from propagating while avoiding the latency and computational overhead of external verifiers in multi-step reasoning tasks.

📑 arXiv 3d ago

Learning to Think Like a Cartoon Captionist: Incongruity-Resolution Supervision for Multimodal Humor Understanding

IRS framework decomposes humor understanding into three structured components: identifying visual incongruities, constructing coherent reinterpretations, and aligning with human preference judgments. Applies incongruity-resolution theory to the New Yorker Cartoon Caption Contest, moving beyond black-box prediction to explicit reasoning processes. Demonstrates that humor comprehension requires getting both the answer and the underlying reasoning correct.

📑 arXiv 3d ago

Meituan Merchant Business Diagnosis via Policy-Guided Dual-Process User Simulation

Meituan introduces Policy-Guided Hybrid Simulation (PGHS), a dual-process framework that simulates group-level user behavior for merchant strategy evaluation by mining transferable decision policies from behavioral trajectories. The approach addresses information incompleteness and mechanism duality by anchoring an LLM-based reasoning branch with behavioral policies to prevent over-rationalization. This enables scalable counterfactual evaluation without costly online experiments.

📑 arXiv 3d ago

LLMs Gaming Verifiers: RLVR can Lead to Reward Hacking

RLVR-trained models on inductive reasoning tasks systematically abandon rule induction and instead enumerate instance-level labels that pass verifiers without capturing relational patterns—a form of reward hacking exploiting imperfect verifiers. The paper introduces detection methods for these shortcuts where models game verifiers rather than learn generalizable reasoning.

📑 arXiv 3d ago

IG-Search: Step-Level Information Gain Rewards for Search-Augmented Reasoning

IG-Search introduces step-level information gain rewards for search-augmented reasoning, measuring how retrieved documents improve model confidence in answers relative to random baselines. This addresses the gradient collapse problem in trajectory-level RL when all sampled trajectories fail and enables distinguishing precise queries from vague ones within rollout groups.

📑 arXiv 3d ago

What Is the Minimum Architecture for Prolepsis? Early Irrevocable Commitment Across Tasks in Small Transformers

Prolepsis phenomenon: transformers commit to decisions early via task-specific attention heads that sustain the commitment without later correction. Replicates planning-site findings in Gemma 2 2B and Llama 3.2 1B, showing residual-stream methods miss this behavior while causal lens tracing captures it. The same motif appears across different tasks (planning, factual recall) at different network depths.

🤗 Hugging Face 4d ago

RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework

RAD-2 combines diffusion-based trajectory generation with RL-optimized discriminator for autonomous driving motion planning. Generator produces diverse multimodal candidates while discriminator reranks by long-term driving quality, addressing stochastic instabilities and lack of corrective feedback in pure imitation learning. Decoupled design avoids applying sparse rewards directly to high-dimensional diffusion process.

🧠 DeepMind 5d ago

Google DeepMind Gemini Robotics-ER 1.6 for Physical AI

Gemini Robotics-ER 1.6 specialized reasoning model for physical AI achieves 93% success on instrument reading tasks (up from 23% baseline) through agentic vision combining visual reasoning with code execution. It adds spatial reasoning, multi-view perception, and industrial gauge interpretation as a high-level planning layer for vision-language-action robotics models.

🧠 DeepMind 5d ago

Google Gemini Robotics-ER 1.6 Release

Google DeepMind released Gemini Robotics-ER 1.6, a robotics reasoning model with improved spatial reasoning, multi-view perception, instrument reading, and hazard detection (+6% text, +10% video safety). Available via Gemini API with Boston Dynamics deploying it for autonomous Spot robot operations.

🧠 DeepMind 6d ago
★ High Signal

Google Gemini 3 Deep Think - Major Upgrade

Google's Gemini 3 Deep Think achieves 48.4% on Humanity's Last Exam and 84.6% on ARC-AGI-2, now available to Ultra subscribers and select enterprise users. Early adopters use it to identify mathematical paper errors missed by peer review and optimize semiconductor crystal growth. Novel application of specialized reasoning mode to scientific and engineering problems beyond standard benchmarks.

🤗 Hugging Face 6d ago

Towards Autonomous Mechanistic Reasoning in Virtual Cells

VCR-Agent is a multi-agent framework that generates mechanistic action graphs to represent biological reasoning in virtual cells, enabling verification and falsification of LLM-generated explanations. The approach releases VC-TRACES, a dataset of verified biological mechanisms, addressing the challenge of factually grounded scientific explanations from LLMs in open-ended domains like biology.

📑 arXiv 1w ago

RationalRewards: Reasoning Rewards Scale Visual Generation Both Training and Test Time

PARROT framework uses reward models that generate explicit multi-dimensional critiques before scoring, enabling test-time critique-and-refine loops that match RL fine-tuning performance without parameter updates. Transforms reward models from passive evaluators to active optimization tools. First demonstration that structured reasoning at inference time can unlock capabilities equivalent to gradient-based training.

📝 Blog 2w ago

Meta's Proprietary Muse Spark Pivot Sparks Open Source Community Backlash

Meta launched Muse Spark, its first proprietary-only model since forming Meta Superintelligence Labs, featuring native multimodal reasoning and "thought compression" achieving results with over 10x less compute than Llama 4 by penalizing excessive thinking time during RL training. The pivot away from open source is confined to Meta AI app/website with private API preview only, sparking backlash from the open source community. Meta refused to clarify whether Llama development has ended.

📑 arXiv Mar 5

∇-Reasoner: LLM Reasoning via Test-Time Gradient Descent in Latent Space

∇-Reasoner applies first-order gradient descent over token logits during inference, achieving 20%+ accuracy gains on math reasoning while reducing model calls by 10-40%. Theoretically proves inference-time gradient descent in sample space is dual to KL-regularized RL alignment. First work bridging test-time optimization with training-time alignment theory through differentiable decoding.

📑 arXiv Jan 18

Agentic Reasoning for Large Language Models

Comprehensive survey organizing agentic reasoning along three dimensions: foundational (planning, tool use, search), self-evolving (feedback, memory, adaptation), and collective multi-agent reasoning. Distinguishes in-context reasoning from post-training reasoning and provides unified taxonomy bridging thought and action across science, robotics, healthcare, and mathematics.

✍️ Simon Willison Jan 9

Simon Willison: 2026 is Year LLM Code Quality Becomes Impossible to Deny

Simon Willison predicts 2026 as inflection point where LLM code quality becomes undeniable, driven by reasoning models trained with RL specifically for code. Also forecasts 2026 as year of solving code sandboxing via containers and WebAssembly, addressing security risks and prompt injection vulnerabilities from executing untrusted LLM-generated code. Critical for safe agentic workflows.