🍡 feedmeAI
Interpretability (12 items)

Everything Interpretability

📑 arXiv 1h ago

What Is the Minimum Architecture for Prolepsis? Early Irrevocable Commitment Across Tasks in Small Transformers

Transformers make irrevocable decisions before seeing full context, replicating rhyme-planning findings on open-weights models and extending them to factual recall. The paper reveals premature binding mechanisms that limit reasoning, with models committing to answers too early, and offers the first mechanistic evidence of early commitment across multiple task types.

📑 arXiv 2d ago

Sketching the Readout of Large Language Models for Scalable Data Attribution and Valuation

RISE (Readout Influence Sketching Estimator) achieves scalable data attribution for LLMs by focusing on influence hotspots at the output layer rather than computing gradients across the entire model. Uses CountSketch projections on a dual-channel representation (lexical residual plus semantic projected-error) to make gradient-based attribution tractable for large models.
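As a rough illustration of the CountSketch idea the summary describes (a generic sketch, not RISE's actual implementation; all dimensions and names below are assumptions): each gradient coordinate is hashed to one of a small number of buckets with a random sign, so inner products between sketched gradients approximate inner products between the full gradients.

```python
import numpy as np

def countsketch(vec, sketch_dim, seed=0):
    """CountSketch: hash each coordinate to one bucket, flip a random sign,
    and accumulate. Inner products are preserved in expectation."""
    rng = np.random.default_rng(seed)
    buckets = rng.integers(0, sketch_dim, size=vec.shape[0])
    signs = rng.choice([-1.0, 1.0], size=vec.shape[0])
    sketch = np.zeros(sketch_dim)
    np.add.at(sketch, buckets, signs * vec)  # unbuffered accumulation per bucket
    return sketch

# Influence scores are inner products of per-example gradients; sketched
# gradients make those products cheap while approximating the exact values.
rng = np.random.default_rng(1)
g_train = rng.standard_normal(10_000)
g_test = rng.standard_normal(10_000) + 0.5 * g_train  # correlated pair
exact = g_train @ g_test
approx = countsketch(g_train, 1024) @ countsketch(g_test, 1024)
```

Both vectors must be sketched with the same seed so the hash and sign functions match; otherwise the inner-product guarantee does not hold.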

📑 arXiv 2d ago

Towards Intrinsic Interpretability of Large Language Models: A Survey of Design Principles and Architectures

Comprehensive survey of intrinsic interpretability approaches for LLMs that build transparency directly into architectures rather than relying on post-hoc explanations. Categorizes methods into five design paradigms: functional transparency, concept alignment, representational decomposability, explicit modularization, and latent sparsity induction.

📑 arXiv 2d ago

Disentangling Mathematical Reasoning in LLMs: A Methodological Investigation of Internal Mechanisms

Investigation of LLM arithmetic reveals that models recognize tasks early but generate correct results only in the final layers. Proficient models exhibit a clear division of labor, traced via early decoding across layers: attention modules propagate input information while MLP modules aggregate it. This attention-MLP specialization is absent in less capable models.
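"Early decoding across layers" is a logit-lens-style probe: each layer's residual state is read out through the unembedding matrix to see which token it would predict. A toy numpy sketch, with the shapes and the "model" itself invented purely for illustration:

```python
import numpy as np

def early_decode(hidden_states, W_U):
    """Project each layer's residual state through the unembedding matrix
    and return the top token id per layer (logit-lens-style decoding)."""
    return (hidden_states @ W_U).argmax(axis=-1)

# Toy residual stream: only the final layer writes the answer direction,
# mirroring the finding that correct results appear only in late layers.
rng = np.random.default_rng(0)
n_layers, d_model, vocab = 4, 64, 100
W_U = rng.standard_normal((d_model, vocab))
answer = 7
states = 0.1 * rng.standard_normal((n_layers, d_model))
states[-1] += 3.0 * W_U[:, answer]  # final layer aligns with the answer
per_layer_top = early_decode(states, W_U)
```

In a real model the interesting signal is the layer index at which `per_layer_top` first equals the correct token; here only the last layer decodes to `answer`.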

📑 arXiv 3d ago

How Do LLMs and VLMs Understand Viewpoint Rotation Without Vision? An Interpretability Study

LLMs and VLMs can perform viewpoint rotation understanding tasks using only text descriptions, without visual input. The study investigates how models infer final viewpoints and predict observations after textual descriptions of rotations, examining whether linguistic intelligence alone enables spatial reasoning. Uses interpretability methods to understand the internal mechanisms enabling this capability.
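The paper's exact task format isn't given here, but a toy version of the kind of text-only rotation composition being probed (direction names and turn vocabulary are hypothetical) might look like:

```python
# Compose a sequence of described 90-degree rotations and report the
# final facing direction, using only textual input.
DIRS = ["north", "east", "south", "west"]
TURN = {"right": 1, "left": -1, "around": 2}

def final_viewpoint(start, rotations):
    """Apply each described turn in order; return the final heading."""
    idx = DIRS.index(start)
    for r in rotations:
        idx = (idx + TURN[r]) % 4
    return DIRS[idx]
```

The interpretability question is whether a model solves such tasks by tracking an internal state like `idx`, and which components implement that tracking.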

📑 arXiv 3d ago

What Is the Minimum Architecture for Prolepsis? Early Irrevocable Commitment Across Tasks in Small Transformers

Prolepsis phenomenon: transformers commit to decisions early via task-specific attention heads that sustain the commitment without later correction. Replicates planning-site findings in Gemma 2 2B and Llama 3.2 1B, showing residual-stream methods miss this behavior while causal lens tracing captures it. The same motif appears across different tasks (planning, factual recall) at different network depths.
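The causal tracing contrasted here with residual-stream readouts works by patching clean activations into a corrupted run and measuring what restores the output. A minimal sketch of that patching logic (the two-head toy "model" and its weights are illustrative, not the paper's setup):

```python
import numpy as np

def toy_forward(x, patch=None):
    """Toy two-head 'model': head activations sum into a scalar logit.
    `patch` = (head_index, value) overrides one head's activation, the
    basic move in activation patching / causal tracing."""
    heads = np.array([2.0 * x, -0.5 * x])
    if patch is not None:
        i, v = patch
        heads[i] = v
    return heads.sum()

clean_x, corrupt_x = 1.0, 0.0
clean_heads = np.array([2.0 * clean_x, -0.5 * clean_x])
# Patch each head's clean activation into the corrupted run; the head whose
# patch most changes the output toward the clean run is causally responsible,
# e.g. a commitment-sustaining attention head.
effects = [toy_forward(corrupt_x, patch=(i, clean_heads[i])) - toy_forward(corrupt_x)
           for i in range(2)]
committed_head = int(np.argmax(np.abs(effects)))  # head 0 here
```

Residual-stream probes only see the summed output, which is why a commitment carried by one head can be invisible to them while per-component patching isolates it.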