Claude system prompts as a git timeline
Git repository tracking evolution of Claude system prompts over time. Enables analysis of how Anthropic adjusts model behavior and guardrails through prompt engineering.
Analysis of Claude Opus 4.7's system prompt changes reveals expanded child safety instructions, anti-verbosity guidance, new "acting vs clarifying" rules to reduce unnecessary questions, and defenses against screenshot-based prompt injection. Anthropic's transparency in publishing prompts enables tracking how system-level engineering evolves alongside model capabilities.
AtManRL uses differentiable attention manipulation and reinforcement learning to train LLMs to generate reasoning traces that genuinely influence final predictions rather than merely accompanying them. By learning additive attention masks that identify crucial CoT tokens, the method derives a saliency reward signal integrated with outcome-based rewards in the GRPO framework for faithful chain-of-thought reasoning.
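The saliency-reward idea can be illustrated with a toy sketch (an assumption-laden illustration, not the paper's implementation: function names, the per-token masking procedure, and the blending weight are all invented here). Each CoT token's saliency is measured as the drop in the answer logit when that token's attention is suppressed, and the averaged saliency is blended with the outcome reward:

```python
import numpy as np

def saliency_reward(logits_full, logits_masked_per_token, answer_id):
    """Toy saliency signal: how much does suppressing each CoT token's
    attention reduce the logit of the final answer?

    logits_full: (vocab,) logits with no mask applied
    logits_masked_per_token: (n_cot_tokens, vocab) logits, one row per
        CoT token whose attention was suppressed
    """
    base = logits_full[answer_id]
    # Positive drop = token genuinely influenced the answer.
    drops = base - logits_masked_per_token[:, answer_id]
    return float(np.clip(drops, 0, None).mean())

def combined_reward(outcome_reward, saliency, lam=0.5):
    # Blend outcome-based reward with the saliency signal
    # (GRPO-style reward shaping; lam is a made-up mixing weight).
    return (1 - lam) * outcome_reward + lam * saliency
```

A trace whose tokens, when masked, barely move the answer logit earns little saliency reward, so the policy is pushed toward reasoning that actually drives the prediction.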
Extracted the scholarly reasoning systems of two prominent humanities scholars from published corpora, converted them into structured inference-time constraints for LLMs, and tested whether resulting scholar-bots could perform doctoral supervision, peer review, and lecturing at expert quality. Expert assessment found outputs met appointment-level quality standards, raising questions about knowledge work automation from public scholarship alone.
Chain-of-Thought prompting consistently degrades performance on visual spatial reasoning tasks across seventeen multimodal models and thirteen benchmarks. A novel No-Image++ ablation reveals that multimodal reasoning models hallucinate visual details from textual priors even when images are absent, indicating severe shortcut learning in CoT-prompted vision-language models.
SCHK-HTC improves few-shot hierarchical text classification by using sibling contrastive learning to distinguish semantically similar classes at deep hierarchy levels, rather than only enforcing parent-child consistency. The method addresses the bottleneck of insufficient domain knowledge for differentiating sibling classes under data-scarce conditions.
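The sibling-contrastive idea can be sketched as an InfoNCE-style loss where same-parent classes serve as hard negatives (a minimal toy version; the embedding shapes, temperature, and function names are assumptions, not SCHK-HTC's actual formulation):

```python
import numpy as np

def sibling_contrastive_loss(anchor, positive, siblings, tau=0.1):
    """Toy contrastive loss for hierarchical classification.

    anchor: embedding of the input text
    positive: embedding of its gold class
    siblings: embeddings of classes sharing the gold class's parent,
        used as hard negatives to separate semantically close classes
    """
    def sim(a, b):
        return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

    pos = np.exp(sim(anchor, positive) / tau)
    neg = sum(np.exp(sim(anchor, s) / tau) for s in siblings)
    return float(-np.log(pos / (pos + neg)))
```

Because the negatives are drawn from the same parent node rather than the whole label set, the gradient pressure goes exactly where deep hierarchies are hardest: telling near-identical siblings apart.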
Survey categorizing graph-LLM integration methods by purpose (reasoning, retrieval, generation, recommendation), graph modality (knowledge graphs, scene graphs, causal graphs), and integration strategy (prompting, augmentation, training, agent-based). Clarifies when graph representations enhance LLM capabilities, and which types do so.
DiZiNER simulates pilot annotation processes where multiple heterogeneous LLMs act as annotators and supervisors to refine instructions for zero-shot NER. The framework identifies systematic errors by generating disagreements between models, mirroring how human annotation resolves inconsistencies to improve zero-shot performance toward supervised baselines.
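The core disagreement-mining step can be sketched as a simple span comparison across annotator models (a hypothetical helper, not DiZiNER's actual code; the span format is assumed):

```python
def find_disagreements(annotations):
    """Return entity spans on which the annotator models disagree.

    annotations: {model_name: set of (start, end, label) spans}
    A span is a disagreement if at least one model omits it, i.e. it is
    not predicted unanimously -- these are the cases a supervisor model
    would use to refine the annotation instructions.
    """
    all_spans = set().union(*annotations.values())
    return {span for span in all_spans
            if any(span not in spans for spans in annotations.values())}
```

Spans predicted by every model are treated as settled; the remainder surface systematic ambiguities in the instructions, mirroring how human pilot annotation rounds work.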
DiscoTrace analyzes rhetorical strategies in information-seeking answers by representing them as sequences of discourse acts paired with question interpretations. Human communities show diverse answering preferences, while LLMs lack rhetorical diversity and systematically favor breadth over depth regardless of prompting. Reveals fundamental differences in how humans and models construct answers beyond surface-level content.
A compact "Gene" representation outperforms documentation-oriented "Skill" packages for test-time evolution across 4,590 trials on scientific code tasks. Expanding experience into fuller documentation degrades performance, showing that representation format is a first-order factor in reusable experience.
InhouseSEO releases 11 production-tested Claude skills for SEO workflows including page audits, E-E-A-T analysis, semantic gap detection, and article writing with anti-AI-slop generation rules. Built on methodology from industry practitioners Koray Tuğberk, Kyle Roof, and Lily Ray.
Chip Huyen's 'AI Engineering' became O'Reilly's most-read book since its launch, covering evaluation, prompt engineering, RAG, fine-tuning, dataset engineering, and production architecture. It emphasizes evaluation as the most critical part of AI engineering and data as the most valuable asset in an era of commoditized models.