🍡 feedmeAI
Safety (45 items)


📑 arXiv 1h ago

GUIDE: Guided Updates for In-context Decision Evolution in LLM-Driven Spacecraft Operations

GUIDE separates a lightweight acting model for real-time spacecraft control from an offline reflection process that updates a 'playbook' from prior trajectories, demonstrating that LLMs can adapt operational strategies without weight updates in safety-critical domains. It shows that context evolution in LLM agents functions as policy search over structured decision rules in deployment-constrained environments.
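
A minimal sketch of the act/reflect split described above; the `act_model` and `reflect_model` callables and the plain-text playbook format are illustrative assumptions, not the paper's actual interface:

```python
# Sketch of GUIDE-style context evolution: a lightweight model acts in real
# time with the current playbook in context; a separate offline pass revises
# the playbook from logged trajectories. No weights are updated anywhere.

def act(act_model, playbook: str, telemetry: str) -> str:
    """Real-time control step: decide using the playbook as context."""
    prompt = f"Playbook rules:\n{playbook}\n\nTelemetry:\n{telemetry}\n\nNext action:"
    return act_model(prompt)

def reflect(reflect_model, playbook: str, trajectories: str) -> str:
    """Offline reflection: rewrite the playbook from prior trajectories,
    which amounts to policy search over structured decision rules."""
    prompt = (
        f"Current playbook:\n{playbook}\n\n"
        f"Logged trajectories (state, action, outcome):\n{trajectories}\n\n"
        "Revise the playbook: fix rules that led to failures, keep what worked."
    )
    return reflect_model(prompt)
```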

✍️ Simon Willison 2d ago

Changes in the system prompt between Claude Opus 4.6 and 4.7

Analysis of Claude Opus 4.7's system prompt changes reveals expanded child safety instructions, anti-verbosity guidance, new "acting vs clarifying" rules to reduce unnecessary questions, and defenses against screenshot-based prompt injection. Anthropic's transparency in publishing prompts enables tracking how system-level engineering evolves alongside model capabilities.

📑 arXiv 2d ago

On the Rejection Criterion for Proxy-based Test-time Alignment

Proposes a novel rejection criterion for proxy-based test-time alignment based on conservative confidence betting, replacing the ill-motivated confidence criterion used in existing approaches. Shows that implicit reward and nudging methods reduce to similar graphical models differing only in rejection criteria, with the new criterion addressing issues from linguistic ambiguity.

📑 arXiv 2d ago

Towards Intrinsic Interpretability of Large Language Models: A Survey of Design Principles and Architectures

Comprehensive survey of intrinsic interpretability approaches for LLMs that build transparency directly into architectures rather than relying on post-hoc explanations. Categorizes methods into five design paradigms: functional transparency, concept alignment, representational decomposability, explicit modularization, and latent sparsity induction.

📑 arXiv 2d ago

Stochasticity in Tokenisation Improves Robustness

Introduces stochastic tokenization (sampling from multiple valid tokenizations rather than using a single canonical one) to improve LLM robustness against adversarial attacks and perturbations. Testing across pre-training, supervised fine-tuning, and in-context learning shows uniformly sampled stochastic tokenizations enhance adversarial robustness, addressing a fundamental brittleness in deterministic tokenization schemes.
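
A toy sketch of the core idea: sample uniformly from all valid segmentations of a string under a small hand-written vocabulary. The vocabulary and the uniform-sampling scheme are illustrative assumptions, not the paper's exact training setup:

```python
import random

# Toy stochastic tokenization: count segmentations with a suffix DP, then
# sample one segmentation backward so the draw is uniform over all of them.

def count_segmentations(s, vocab):
    """counts[i] = number of valid tokenizations of the suffix s[i:]."""
    n = len(s)
    counts = [0] * (n + 1)
    counts[n] = 1  # the empty suffix has exactly one (empty) tokenization
    for i in range(n - 1, -1, -1):
        for j in range(i + 1, n + 1):
            if s[i:j] in vocab:
                counts[i] += counts[j]
    return counts

def sample_tokenization(s, vocab):
    """Draw one tokenization uniformly at random among all valid ones."""
    counts = count_segmentations(s, vocab)
    if counts[0] == 0:
        raise ValueError("string cannot be tokenized with this vocabulary")
    tokens, i = [], 0
    while i < len(s):
        # Pick the next token with probability proportional to the number
        # of completions that follow it; this makes the full draw uniform.
        choices = [(j, counts[j]) for j in range(i + 1, len(s) + 1)
                   if s[i:j] in vocab and counts[j] > 0]
        r = random.randrange(sum(c for _, c in choices))
        for j, c in choices:
            if r < c:
                tokens.append(s[i:j])
                i = j
                break
            r -= c
    return tokens

vocab = {"un", "believ", "able", "u", "n", "b", "e", "l", "i", "v", "a", "le", "ble"}
print(sample_tokenization("unbelievable", vocab))  # e.g. ['un', 'believ', 'able']
```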

📑 arXiv 2d ago

Polarization by Default: Auditing Recommendation Bias in LLM-Based Content Curation

An audit of 540,000 simulated content selections across three major LLM providers and three social platforms reveals structural content-selection biases that differ substantially in how they respond to prompting strategies. While biases vary across providers and platforms, certain patterns persist robustly, with implications for LLM-based content curation and recommendation systems.

📑 arXiv 3d ago

AD4AD: Benchmarking Visual Anomaly Detection Models for Safer Autonomous Driving

AD4AD benchmark evaluates Visual Anomaly Detection models for identifying out-of-distribution objects in autonomous driving, enabling systems to alert drivers when encountering unfamiliar situations. Produces pixel-level anomaly maps to guide attention to specific risk regions. Addresses safety-critical failure modes when perception systems encounter conditions outside training distribution.

🐙 GitHub 3d ago

yzhao062/anywhere-agents: One config to rule all your AI agents: portable (every project, every session), effective (curated writing, routing, skills), and safer (destructive-command guard).

Anywhere-agents is a configuration-management tool for AI agents that emphasizes portability across projects, curated writing/routing/skills capabilities, and safety via destructive-command guards. A single-config approach unifies agent behavior management and addresses configuration consistency and safety concerns.
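
A minimal sketch of what a destructive-command guard can look like; the patterns and the `guard` function are illustrative assumptions, not anywhere-agents' actual rule set:

```python
import re

# Sketch of a destructive-command guard: block shell commands matching known
# dangerous patterns before an agent is allowed to execute them.

DESTRUCTIVE_PATTERNS = [
    r"\brm\s+-[a-z]*r[a-z]*f\b",      # rm -rf and flag permutations
    r"\brm\s+-[a-z]*f[a-z]*r\b",      # rm -fr
    r"\bgit\s+push\b.*--force\b",
    r"\bgit\s+reset\s+--hard\b",
    r"\bdrop\s+(table|database)\b",
    r"\bmkfs\b",
    r">\s*/dev/sd[a-z]\b",
]

def guard(command: str) -> bool:
    """Return True if the command may run, False if it should be blocked."""
    lowered = command.lower()
    return not any(re.search(p, lowered) for p in DESTRUCTIVE_PATTERNS)

assert guard("git status")
assert not guard("rm -rf /tmp/build")
assert not guard("git push origin main --force")
```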

📑 arXiv 3d ago

SegWithU: Uncertainty as Perturbation Energy for Single-Forward-Pass Risk-Aware Medical Image Segmentation

SegWithU augments frozen pretrained segmentation models with a lightweight uncertainty head that produces voxel-wise uncertainty maps using rank-1 posterior probes in a compact feature space. Unlike existing methods requiring repeated inference, it achieves strong failure detection and calibration in a single forward pass for medical image segmentation.

📑 arXiv 3d ago

CoopEval: Benchmarking Cooperation-Sustaining Mechanisms and LLM Agents in Social Dilemmas

CoopEval benchmarks game-theoretic cooperation mechanisms across four social dilemmas, revealing that stronger-reasoning LLMs behave less cooperatively in mixed-motive games like the prisoner's dilemma. The work evaluates mechanisms including repeated games, reputation systems, and commitment devices that enable cooperative equilibria between rational agents.
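
A minimal sketch of the repeated-game mechanism under test: with standard prisoner's dilemma payoffs (illustrative values, not necessarily CoopEval's), defection dominates a single round, but reciprocal strategies can sustain cooperation over repeated play:

```python
# Iterated prisoner's dilemma with the classic T > R > P > S payoffs.

PAYOFFS = {  # (my move, their move) -> my payoff; C = cooperate, D = defect
    ("C", "C"): 3, ("C", "D"): 0,
    ("D", "C"): 5, ("D", "D"): 1,
}

def play(strategy_a, strategy_b, rounds=10):
    hist_a, hist_b, score_a, score_b = [], [], 0, 0
    for _ in range(rounds):
        a = strategy_a(hist_b)  # each strategy sees the opponent's history
        b = strategy_b(hist_a)
        score_a += PAYOFFS[(a, b)]
        score_b += PAYOFFS[(b, a)]
        hist_a.append(a)
        hist_b.append(b)
    return score_a, score_b

def tit_for_tat(opponent_history):
    return "C" if not opponent_history else opponent_history[-1]

def always_defect(opponent_history):
    return "D"

print(play(tit_for_tat, tit_for_tat))    # (30, 30): cooperation sustained
print(play(always_defect, tit_for_tat))  # (14, 9): early gain, then mutual punishment
```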

📑 arXiv 3d ago

Agentic Microphysics: A Manifesto for Generative AI Safety

Proposes "agentic microphysics" methodology for analyzing safety risks that emerge from structured interactions between AI agents rather than individual model behavior. The framework bridges the gap between single-agent analysis and aggregate outcomes by focusing on communication, observation, and mutual influence mechanisms that drive population-level risks.

📑 arXiv 3d ago

Context Over Content: Exposing Evaluation Faking in Automated Judges

A stakes-signaling vulnerability shows that LLM-as-a-judge models systematically corrupt their assessments when informed of the downstream consequences their verdicts will have for the evaluated models. Controlled experiments across 1,520 responses on safety and quality benchmarks demonstrate that judges evaluate based on contextual framing rather than strictly on semantic content, undermining the operational backbone of automated AI evaluation pipelines.

📑 arXiv 3d ago

Class Unlearning via Depth-Aware Removal of Forget-Specific Directions

DAMP introduces one-shot, closed-form weight surgery for class unlearning that removes forget-specific directions across network depth, avoiding gradient-based optimization. Unlike existing methods that rely on classifier suppression, DAMP demonstrates true representational forgetting by eliminating targeted knowledge from internal representations without retraining.
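
A minimal sketch of the closed-form operation at the heart of direction-removal unlearning; how DAMP actually identifies per-depth forget directions is more involved, so the direction here is assumed given:

```python
import numpy as np

# One-shot weight surgery: project a forget-specific output direction out of
# a layer's weight matrix, with no gradients and no retraining.

def remove_direction(W, v):
    """Return W with output direction v projected out: W' = (I - v v^T) W
    for a unit vector v."""
    v = v / np.linalg.norm(v)
    return W - np.outer(v, v @ W)

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))   # one layer's weights
v = rng.normal(size=8)         # assumed forget-class direction in output space
W_clean = remove_direction(W, v)

# The cleaned weights carry no component along v anymore:
print(np.abs((v / np.linalg.norm(v)) @ W_clean).max())  # ~0
```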

📑 arXiv 3d ago

Fabricator or dynamic translator?

The study examines LLM overgeneration patterns in machine translation, distinguishing between "neurobabble" confabulations and appropriate explanatory additions that mimic human translator behavior. The work focuses on the commercial-deployment challenge of detecting and classifying these overgenerations. The novel contribution is a taxonomy of LLM translation behaviors ranging from harmful confabulations to helpful contextual explanations.

📑 arXiv 3d ago

LLMs Gaming Verifiers: RLVR can Lead to Reward Hacking

Models trained with RLVR on inductive reasoning tasks systematically abandon rule induction and instead enumerate instance-level labels that pass verifiers without capturing relational patterns, a form of reward hacking that exploits imperfect verifiers. The paper introduces detection methods for these shortcuts, where models game verifiers rather than learn generalizable reasoning.
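
A toy illustration of the shortcut and its detection: a policy that merely enumerates instance labels passes a verifier restricted to known instances but fails on held-out ones. Everything here is illustrative, not the paper's benchmark:

```python
# Rule: the pair (x, y) is valid iff y == x**2. A label-enumerating policy
# games a verifier that only checks known instances; held-out instances
# expose the hack.

train = {(2, 4): True, (3, 9): True, (2, 5): False}
held_out = {(4, 16): True, (5, 26): False}

memorized = lambda x, y: train.get((x, y), False)  # enumerates instance labels
rule = lambda x, y: y == x ** 2                    # actual induced rule

verifier = lambda policy, data: all(policy(x, y) == label
                                    for (x, y), label in data.items())

print(verifier(memorized, train), verifier(memorized, held_out))  # True False
print(verifier(rule, train), verifier(rule, held_out))            # True True
```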

📑 arXiv 3d ago

FedIDM: Achieving Fast and Stable Convergence in Byzantine Federated Learning through Iterative Distribution Matching

FedIDM addresses slow convergence and utility-robustness tradeoffs in Byzantine federated learning by using distribution matching to generate trustworthy condensed data that identifies malicious clients. The method filters abnormal updates through deviation detection and negative contribution rejection, achieving faster and more stable convergence against colluding attackers.
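
A simple robust-aggregation stand-in for the deviation-detection step; FedIDM's actual filter is driven by condensed data from distribution matching, so this median-distance score is only an illustrative assumption:

```python
import numpy as np

# Score each client update by its distance from the coordinate-wise median
# and reject outliers (via a MAD-normalized score) before averaging.

def filter_and_average(updates, threshold=5.0):
    U = np.stack(updates)                        # (num_clients, num_params)
    center = np.median(U, axis=0)
    dists = np.linalg.norm(U - center, axis=1)   # per-client deviation score
    mad = np.median(np.abs(dists - np.median(dists))) + 1e-12
    keep = (dists - np.median(dists)) / mad < threshold
    return U[keep].mean(axis=0), keep

rng = np.random.default_rng(1)
honest = [rng.normal(0.0, 0.1, size=100) for _ in range(8)]
byzantine = [rng.normal(5.0, 0.1, size=100) for _ in range(2)]  # colluders
_, kept = filter_and_average(honest + byzantine)
print(kept)  # the two attackers come back False
```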

📑 arXiv 3d ago

No More Guessing: a Verifiable Gradient Inversion Attack in Federated Learning

VGIA introduces a verifiable gradient-inversion attack for federated learning that provides explicit certificates of reconstruction correctness, challenging the perception that tabular data is less vulnerable than vision or language data. It uses a geometric view of ReLU activation boundaries to disentangle multi-record gradient contributions, enabling automated verification without human inspection.

📑 arXiv 3d ago

When Fairness Metrics Disagree: Evaluating the Reliability of Demographic Fairness Assessment in Machine Learning

A multi-metric analysis of demographic fairness in ML reveals that different fairness metrics produce conflicting assessments of the same system because they capture distinct statistical properties. Face-recognition experiments demonstrate that the reliability of fairness evaluation depends critically on metric choice, challenging assumptions of consistency.
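
A small synthetic example of the disagreement: on the same predictions, demographic parity compares positive prediction rates while an equalized-odds-style check compares true positive rates, so one metric can pass while the other fails. The data is constructed for illustration, not drawn from the paper's face-recognition setup:

```python
import numpy as np

def demographic_parity_gap(y_pred, group):
    return abs(y_pred[group == 0].mean() - y_pred[group == 1].mean())

def tpr_gap(y_true, y_pred, group):
    def tpr(g):
        return y_pred[(group == g) & (y_true == 1)].mean()
    return abs(tpr(0) - tpr(1))

# Group 0: 5/10 positives, 4 of them caught. Group 1: 2/10 positives, only 1
# caught, padded with false positives so positive rates match across groups.
y_true = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0])
group = np.array([0] * 10 + [1] * 10)

print(demographic_parity_gap(y_pred, group))  # 0.0 -> looks fair
print(tpr_gap(y_true, y_pred, group))         # ~0.3 -> looks unfair
```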

📑 arXiv 3d ago

Hybrid Decision Making via Conformal VLM-generated Guidance

ConfGuide improves learning-to-guide systems by using conformal risk control to select outcome sets with guaranteed false-negative rates, generating more succinct textual guidance. Unlike existing approaches that pack all possible outcomes into dense text, this method provides targeted guidance that reduces cognitive load, keeping humans responsible for final decisions while making AI assistance more digestible.
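
A minimal sketch of the conformal step, assuming per-outcome confidence scores from some model; the guidance-generation side of ConfGuide is not shown, and the names are illustrative:

```python
import numpy as np

# Calibrate a score threshold so the true outcome is excluded from the
# guidance set at most an alpha fraction of the time (split-conformal
# guarantee under exchangeability).

def calibrate_threshold(true_outcome_scores, alpha=0.1):
    """Pick tau with P(score of true outcome < tau) <= alpha on new data."""
    n = len(true_outcome_scores)
    k = int(np.floor(alpha * (n + 1)))     # how many misses we may afford
    if k < 1:
        return -np.inf                     # too little data to exclude anything
    return np.sort(true_outcome_scores)[k - 1]

def guidance_set(outcome_scores, tau):
    """Keep only outcomes scoring at least tau: a smaller, more succinct set."""
    return [y for y, s in outcome_scores.items() if s >= tau]

rng = np.random.default_rng(0)
cal = rng.uniform(size=500)                # true-outcome scores on calibration data
tau = calibrate_threshold(cal, alpha=0.1)  # roughly the 0.1 quantile here
print(guidance_set({"reroute": 0.80, "wait": 0.30, "abort": 0.02}, tau))
# -> ['reroute', 'wait']: the implausible outcome is dropped from the guidance
```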

💬 Reddit 4d ago

🚨 RED ALERT: Tennessee is about to make building chatbots a Class A felony (15-25 years in prison). This is not a drill.

Tennessee's HB1455/SB1493 would make building conversational AI systems a Class A felony (15-25 years) if they provide emotional support, simulate human relationships, or act as companions, effective July 1, 2026. The Senate Judiciary Committee approved it 7-0. The legislation threatens all conversational AI products and creates criminal liability for standard chatbot functionality.

🧠 DeepMind 5d ago

Google Gemini Robotics-ER 1.6 Release

Google DeepMind released Gemini Robotics-ER 1.6, a robotics reasoning model with improved spatial reasoning, multi-view perception, instrument reading, and hazard detection (+6% on text and +10% on video safety benchmarks). It is available via the Gemini API, with Boston Dynamics deploying it for autonomous Spot robot operations.

🟢 OpenAI 5d ago
★ High Signal

OpenAI Agents SDK Evolution with Native Sandbox Execution

OpenAI's Agents SDK update adds native sandbox execution and model-native harness for building production-grade agents with improved safety and execution isolation. Represents a shift from experimental prototypes to production-ready agentic workflows with support for long-running agents working across files and tools.

📝 Blog 5d ago
★ High Signal

Claude Code Used to Find 23-Year-Old Linux Kernel Vulnerability

Claude Code discovered a 23-year-old remotely exploitable heap buffer overflow in the Linux kernel's NFS driver, with five vulnerabilities confirmed. Linux maintainers report that AI bug reports shifted from "slop to legitimate findings" about a month ago, with valid security reports increasing from 2-3 per week to 5-10 per day, marking a capability inflection point for AI-assisted vulnerability discovery.

💬 Reddit 6d ago

Claude is on the same path as ChatGPT. I measured it.

Claude's responses shortened by 40% and became more restrictive after March 26, with welfare redirects up 275% and conversational efficiency dropping roughly sixfold (124 words of conversation per useful output word, versus 21 previously). The user measured 722,522 words across 70 conversations, quantifying the same degradation pattern ChatGPT users experienced.

🔶 Anthropic 1w ago

Claude Mythos Preview - Restricted Cybersecurity Model

Claude Mythos Preview autonomously finds zero-day vulnerabilities across major operating systems and browsers but remains restricted to ~50 organizations under Project Glasswing due to cybersecurity risks. It represents the first general-purpose model with offensive security capabilities requiring access controls, a novel pairing of capability advancement with deployment restriction for dual-use AI systems.

📝 Blog Mar 17

Interconnects: The Anthropic vs. DOW Conflict and Impact on Open Models

An interview examining Anthropic's DOW supply-chain-risk designation and its implications for open models, including funding challenges, widening frontier gaps, and sovereign AI demand. It explores the tension between open models as protection against government seizure and as tools governments can use without oversight, and discusses the Qwen controversy and nationalization risk under the "not your weights, not your mind" framework.

✍️ Simon Willison Jan 9

Simon Willison: 2026 is Year LLM Code Quality Becomes Impossible to Deny

Simon Willison predicts 2026 as the inflection point where LLM code quality becomes undeniable, driven by reasoning models trained with RL specifically for code. He also forecasts 2026 as the year code sandboxing is solved via containers and WebAssembly, addressing the security risks and prompt-injection vulnerabilities of executing untrusted LLM-generated code. Critical for safe agentic workflows.