Claude system prompts as a git timeline
Git repository tracking evolution of Claude system prompts over time. Enables analysis of how Anthropic adjusts model behavior and guardrails through prompt engineering.
Analysis of Claude Opus 4.7's system prompt changes reveals expanded child safety instructions, anti-verbosity guidance, new "acting vs clarifying" rules to reduce unnecessary questions, and defenses against screenshot-based prompt injection. Anthropic's transparency in publishing prompts enables tracking how system-level engineering evolves alongside model capabilities.
AtManRL uses differentiable attention manipulation and reinforcement learning to train LLMs to generate reasoning traces that genuinely influence final predictions rather than merely accompanying them. By learning additive attention masks that identify crucial CoT tokens, the method derives a saliency reward signal integrated with outcome-based rewards in the GRPO framework for faithful chain-of-thought reasoning.
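The saliency-reward idea can be illustrated with a toy sketch (an assumption-laden illustration, not the paper's implementation: function names, the per-token masking procedure, and the blending weight are all invented here). Each CoT token's saliency is measured as the drop in the answer logit when that token's attention is suppressed, and the averaged saliency is blended with the outcome reward:

```python
import numpy as np

def saliency_reward(logits_full, logits_masked_per_token, answer_id):
    """Toy saliency signal: how much does suppressing each CoT token's
    attention reduce the logit of the final answer?

    logits_full: (vocab,) logits with no mask applied
    logits_masked_per_token: (n_cot_tokens, vocab) logits, one row per
        CoT token whose attention was suppressed
    """
    base = logits_full[answer_id]
    # Positive drop = token genuinely influenced the answer.
    drops = base - logits_masked_per_token[:, answer_id]
    return float(np.clip(drops, 0, None).mean())

def combined_reward(outcome_reward, saliency, lam=0.5):
    # Blend outcome-based reward with the saliency signal
    # (GRPO-style reward shaping; lam is a made-up mixing weight).
    return (1 - lam) * outcome_reward + lam * saliency
```

A trace whose tokens, when masked, barely move the answer logit earns little saliency reward, so the policy is pushed toward reasoning that actually drives the prediction.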
Extracted the scholarly reasoning systems of two prominent humanities scholars from published corpora, converted them into structured inference-time constraints for LLMs, and tested whether resulting scholar-bots could perform doctoral supervision, peer review, and lecturing at expert quality. Expert assessment found outputs met appointment-level quality standards, raising questions about knowledge work automation from public scholarship alone.
Chain-of-Thought prompting consistently degrades performance on visual spatial reasoning tasks across seventeen multimodal models and thirteen benchmarks. A novel No-Image++ ablation reveals that multimodal reasoning models hallucinate visual details from textual priors even when images are absent, indicating severe shortcut learning in CoT-prompted vision-language models.
SCHK-HTC improves few-shot hierarchical text classification by using sibling contrastive learning to distinguish semantically similar classes at deep hierarchy levels, rather than only enforcing parent-child consistency. The method addresses the bottleneck of insufficient domain knowledge for differentiating sibling classes under data-scarce conditions.
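The sibling-contrastive idea can be sketched as an InfoNCE-style loss where same-parent classes serve as hard negatives (a minimal toy version; the embedding shapes, temperature, and function names are assumptions, not SCHK-HTC's actual formulation):

```python
import numpy as np

def sibling_contrastive_loss(anchor, positive, siblings, tau=0.1):
    """Toy contrastive loss for hierarchical classification.

    anchor: embedding of the input text
    positive: embedding of its gold class
    siblings: embeddings of classes sharing the gold class's parent,
        used as hard negatives to separate semantically close classes
    """
    def sim(a, b):
        return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

    pos = np.exp(sim(anchor, positive) / tau)
    neg = sum(np.exp(sim(anchor, s) / tau) for s in siblings)
    return float(-np.log(pos / (pos + neg)))
```

Because the negatives are drawn from the same parent node rather than the whole label set, the gradient pressure goes exactly where deep hierarchies are hardest: telling near-identical siblings apart.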
Survey categorizing graph-LLM integration methods by purpose (reasoning, retrieval, generation, recommendation), graph modality (knowledge graphs, scene graphs, causal graphs), and integration strategy (prompting, augmentation, training, agent-based). Clarifies when graph representations enhance LLM capabilities, and which types do so.
DiZiNER simulates pilot annotation processes where multiple heterogeneous LLMs act as annotators and supervisors to refine instructions for zero-shot NER. The framework identifies systematic errors by generating disagreements between models, mirroring how human annotation resolves inconsistencies to improve zero-shot performance toward supervised baselines.
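The core disagreement-mining step can be sketched as a simple span comparison across annotator models (a hypothetical helper, not DiZiNER's actual code; the span format is assumed):

```python
def find_disagreements(annotations):
    """Return entity spans on which the annotator models disagree.

    annotations: {model_name: set of (start, end, label) spans}
    A span is a disagreement if at least one model omits it, i.e. it is
    not predicted unanimously -- these are the cases a supervisor model
    would use to refine the annotation instructions.
    """
    all_spans = set().union(*annotations.values())
    return {span for span in all_spans
            if any(span not in spans for spans in annotations.values())}
```

Spans predicted by every model are treated as settled; the remainder surface systematic ambiguities in the instructions, mirroring how human pilot annotation rounds work.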
DiscoTrace analyzes rhetorical strategies in information-seeking answers by representing them as sequences of discourse acts paired with question interpretations. Human communities show diverse answering preferences, while LLMs lack rhetorical diversity and systematically favor breadth over depth regardless of prompting. Reveals fundamental differences in how humans and models construct answers beyond surface-level content.
A compact "Gene" representation outperforms documentation-oriented "Skill" packages for test-time evolution across 4,590 trials on scientific code tasks. Expanding experience into fuller documentation degrades performance, showing that representation format is a first-order factor in reusable experience.
InhouseSEO releases 11 production-tested Claude skills for SEO workflows including page audits, E-E-A-T analysis, semantic gap detection, and article writing with anti-AI-slop generation rules. Built on methodology from industry practitioners Koray Tuğberk, Kyle Roof, and Lily Ray.
Chip Huyen's 'AI Engineering' became O'Reilly's most-read book since its launch, covering evaluation, prompt engineering, RAG, fine-tuning, dataset engineering, and production architecture. It emphasizes evaluation as the most critical part of AI engineering and data as the most valuable asset in an era of commoditized models.