Transformers make irrevocable decisions before seeing the full context: this work replicates rhyme-planning findings on open-weights models and extends them to factual recall. It reveals premature binding mechanisms that limit reasoning, with models committing to answers too early, and provides the first mechanistic evidence of early commitment across multiple task types.
RISE (Readout Influence Sketching Estimator) achieves scalable data attribution for LLMs by focusing on influence hotspots at the output layer rather than computing gradients across the entire model. It applies CountSketch projections to a dual-channel representation (lexical residual plus semantic projected-error), making gradient-based attribution tractable for large models.
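As a rough illustration of the CountSketch step (toy dimensions and random stand-in gradients, not RISE's actual implementation), the projection hashes each coordinate into a small number of signed buckets; it is linear and preserves squared norms in expectation, which is what makes comparisons in the compressed space meaningful:

```python
import numpy as np

# Toy CountSketch: hash each coordinate to one of k buckets with a random sign.
rng = np.random.default_rng(0)
d, k = 4096, 256                        # original gradient dim, sketch dim (illustrative)
h = rng.integers(0, k, size=d)          # bucket index per coordinate
s = rng.choice([-1.0, 1.0], size=d)     # random sign per coordinate

def countsketch(x):
    """Project x (length d) to length k: coordinate i adds s[i]*x[i] to bucket h[i]."""
    out = np.zeros(k)
    np.add.at(out, h, s * x)            # unbuffered scatter-add handles bucket collisions
    return out

g = rng.standard_normal(d)              # stand-in for a per-example readout gradient
sg = countsketch(g)
# Linearity lets sketched gradients be accumulated and compared in the compressed
# space; the squared norm of the sketch matches the original in expectation.
```

The same hash and sign vectors must be reused across all examples so that sketched inner products remain comparable.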
Prototype-Grounded Concept Models ground learned concepts in visual prototypes (image parts) so that users can verify whether concepts align with human intent, enabling direct inspection and targeted human intervention. They match concept bottleneck model (CBM) predictive performance while substantially improving transparency and intervenability through explicit concept evidence.
Comprehensive survey of intrinsic interpretability approaches for LLMs that build transparency directly into architectures rather than relying on post-hoc explanations. Categorizes methods into five design paradigms: functional transparency, concept alignment, representational decomposability, explicit modularization, and latent sparsity induction.
Advocates rigorous symbolic XAI methods over non-symbolic approaches such as SHAP and other Shapley-value attributions, which provably lack formal guarantees and can mislead in high-stakes decisions. The paper surveys symbolic methods for assigning relative feature importance that provide formal guarantees rather than heuristic explanations.
Investigation of LLM arithmetic reveals that models recognize the task early but generate correct results only in the final layers, with proficient models exhibiting a clear division of labor: attention modules propagate input information while MLP modules aggregate it. This attention-MLP specialization, traced via early decoding across layers, is absent in less capable models.
Advances sparse autoencoder (SAE) architectures for mechanistic interpretability by introducing dynamic attention mechanisms. SAEs decompose neural activations into interpretable features; this work addresses limitations of existing SAE variants to improve understanding of model internals for safety and alignment.
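For context on what an SAE computes, a generic untrained forward pass can be sketched as follows (toy dimensions, not this paper's architecture; real SAEs are trained with a reconstruction-plus-L1 objective that actually makes the codes sparse):

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, d_sae = 32, 128                  # overcomplete: more features than dimensions

W_enc = rng.standard_normal((d_model, d_sae)) / np.sqrt(d_model)
b_enc = np.zeros(d_sae)
W_dec = W_enc.T.copy()                    # common initialization: decoder tied to encoder

def sae_forward(x):
    """Decompose an activation x into non-negative feature codes and a reconstruction."""
    codes = np.maximum(0.0, x @ W_enc + b_enc)   # ReLU keeps codes non-negative
    x_hat = codes @ W_dec
    return codes, x_hat

x = rng.standard_normal(d_model)          # stand-in for a residual-stream activation
codes, x_hat = sae_forward(x)
active_frac = float(np.mean(codes > 0))   # training's L1 penalty drives this fraction down
```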
Investigates when small transformers make early, irreversible commitments to outputs during forward passes, replicating findings on open-weights models and extending to factual recall tasks. Understanding minimal architectures for planning-like behavior reveals how models perform multi-step reasoning with limited computational resources, advancing mechanistic interpretability.
LLMs and VLMs can perform viewpoint rotation understanding tasks using only text descriptions, without visual input. The study investigates how models infer final viewpoints and predict observations after textual descriptions of rotations, examining whether linguistic intelligence alone enables spatial reasoning. Uses interpretability methods to understand the internal mechanisms enabling this capability.
Prolepsis phenomenon: transformers commit to decisions early via task-specific attention heads that sustain the commitment without later correction. Replicates planning-site findings in Gemma 2 2B and Llama 3.2 1B, showing residual-stream methods miss this behavior while causal lens tracing captures it. The same motif appears across different tasks (planning, factual recall) at different network depths.
A developer visualized decoder-block activation patterns during LLM training as a video, showing how internal representations evolve across training steps. A lossless version and the projection data are released on Hugging Face along with the video-generation source code, providing interpretability insight into transformer training dynamics.
ASGuard uses circuit analysis to identify attention heads responsible for tense-based jailbreaks, then applies channel-wise activation scaling to surgically mitigate this vulnerability. Reveals mechanistic understanding of why safety-aligned models fail when harmful requests are rephrased in past tense.
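The channel-wise scaling step can be sketched generically (the flagged channel indices and damping factor below are hypothetical, not ASGuard's): leave a head's output intact except on the channels the circuit analysis implicated.

```python
import numpy as np

rng = np.random.default_rng(3)
d_head = 64
head_out = rng.standard_normal(d_head)    # stand-in for one attention head's output

flagged = np.array([3, 17, 42])           # hypothetical channels implicated by circuit analysis
alpha = 0.1                               # hypothetical damping factor

scale = np.ones(d_head)
scale[flagged] = alpha                    # dampen only the implicated channels
patched = head_out * scale                # all other channels pass through unchanged
```

In practice this mask would be applied inside a forward hook on the relevant head, so the intervention is surgical rather than a global edit to the model.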