Transformers make irrevocable decisions before seeing the full context: this work replicates rhyme-planning findings on open-weights models and extends them to factual recall. It reveals premature binding mechanisms that limit reasoning, with models committing to answers too early, and provides the first mechanistic evidence of early commitment across multiple task types.
RISE (Readout Influence Sketching Estimator) achieves scalable data attribution for LLMs by focusing on influence hotspots at the output layer rather than computing gradients across the entire model. It applies CountSketch projections to a dual-channel representation (lexical residual plus semantic projected-error), making gradient-based attribution tractable for large models.
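As a rough illustration of the CountSketch step (toy dimensions and random stand-in gradients, not RISE's actual implementation), the projection hashes each coordinate into a small number of signed buckets; it is linear and preserves squared norms in expectation, which is what makes comparisons in the compressed space meaningful:

```python
import numpy as np

# Toy CountSketch: hash each coordinate to one of k buckets with a random sign.
rng = np.random.default_rng(0)
d, k = 4096, 256                        # original gradient dim, sketch dim (illustrative)
h = rng.integers(0, k, size=d)          # bucket index per coordinate
s = rng.choice([-1.0, 1.0], size=d)     # random sign per coordinate

def countsketch(x):
    """Project x (length d) to length k: coordinate i adds s[i]*x[i] to bucket h[i]."""
    out = np.zeros(k)
    np.add.at(out, h, s * x)            # unbuffered scatter-add handles bucket collisions
    return out

g = rng.standard_normal(d)              # stand-in for a per-example readout gradient
sg = countsketch(g)
# Linearity lets sketched gradients be accumulated and compared in the compressed
# space; the squared norm of the sketch matches the original in expectation.
```

The same hash and sign vectors must be reused across all examples so that sketched inner products remain comparable.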
Prototype-Grounded Concept Models ground learned concepts in visual prototypes (image parts) so that users can verify whether concepts align with human intent, enabling direct inspection and targeted human intervention. They match concept bottleneck model (CBM) predictive performance while substantially improving transparency and intervenability through explicit concept evidence.
Comprehensive survey of intrinsic interpretability approaches for LLMs that build transparency directly into architectures rather than relying on post-hoc explanations. Categorizes methods into five design paradigms: functional transparency, concept alignment, representational decomposability, explicit modularization, and latent sparsity induction.
Advocates rigorous symbolic XAI methods over non-symbolic approaches such as SHAP and other Shapley-value attributions, which provably lack formal guarantees and can mislead in high-stakes decisions. The paper surveys symbolic methods for assigning relative feature importance that provide formal guarantees rather than heuristic explanations.
Investigation of LLM arithmetic reveals that models recognize the task early but generate correct results only in the final layers, with proficient models exhibiting a clear division of labor: attention modules propagate input information while MLP modules aggregate it. This attention-MLP specialization, traced via early decoding across layers, is absent in less capable models.
Advances sparse autoencoder (SAE) architectures for mechanistic interpretability by introducing dynamic attention mechanisms. SAEs decompose neural activations into interpretable features; this work addresses limitations of existing SAE variants to improve understanding of model internals for safety and alignment.
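For context on what an SAE computes, a generic untrained forward pass can be sketched as follows (toy dimensions, not this paper's architecture; real SAEs are trained with a reconstruction-plus-L1 objective that actually makes the codes sparse):

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, d_sae = 32, 128                  # overcomplete: more features than dimensions

W_enc = rng.standard_normal((d_model, d_sae)) / np.sqrt(d_model)
b_enc = np.zeros(d_sae)
W_dec = W_enc.T.copy()                    # common initialization: decoder tied to encoder

def sae_forward(x):
    """Decompose an activation x into non-negative feature codes and a reconstruction."""
    codes = np.maximum(0.0, x @ W_enc + b_enc)   # ReLU keeps codes non-negative
    x_hat = codes @ W_dec
    return codes, x_hat

x = rng.standard_normal(d_model)          # stand-in for a residual-stream activation
codes, x_hat = sae_forward(x)
active_frac = float(np.mean(codes > 0))   # training's L1 penalty drives this fraction down
```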
Investigates when small transformers make early, irreversible commitments to outputs during forward passes, replicating findings on open-weights models and extending to factual recall tasks. Understanding minimal architectures for planning-like behavior reveals how models perform multi-step reasoning with limited computational resources, advancing mechanistic interpretability.
LLMs and VLMs can perform viewpoint rotation understanding tasks using only text descriptions, without visual input. The study investigates how models infer final viewpoints and predict observations after textual descriptions of rotations, examining whether linguistic intelligence alone enables spatial reasoning. Uses interpretability methods to understand the internal mechanisms enabling this capability.
Prolepsis phenomenon: transformers commit to decisions early via task-specific attention heads that sustain the commitment without later correction. Replicates planning-site findings in Gemma 2 2B and Llama 3.2 1B, showing residual-stream methods miss this behavior while causal lens tracing captures it. The same motif appears across different tasks (planning, factual recall) at different network depths.
A developer visualized decoder-block activation patterns during LLM training as a video, showing how internal representations evolve across training steps. A lossless version and the projection data are released on Hugging Face along with the video-generation source code, providing interpretability insight into transformer training dynamics.
ASGuard uses circuit analysis to identify attention heads responsible for tense-based jailbreaks, then applies channel-wise activation scaling to surgically mitigate this vulnerability. Reveals mechanistic understanding of why safety-aligned models fail when harmful requests are rephrased in past tense.
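The channel-wise scaling step can be sketched generically (the flagged channel indices and damping factor below are hypothetical, not ASGuard's): leave a head's output intact except on the channels the circuit analysis implicated.

```python
import numpy as np

rng = np.random.default_rng(3)
d_head = 64
head_out = rng.standard_normal(d_head)    # stand-in for one attention head's output

flagged = np.array([3, 17, 42])           # hypothetical channels implicated by circuit analysis
alpha = 0.1                               # hypothetical damping factor

scale = np.ones(d_head)
scale[flagged] = alpha                    # dampen only the implicated channels
patched = head_out * scale                # all other channels pass through unchanged
```

In practice this mask would be applied inside a forward hook on the relevant head, so the intervention is surgical rather than a global edit to the model.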