🍡 feedmeAI
Safety 8 items

Everything Safety

🐙 GitHub Apr 23

future-agi/future-agi: Open-source, end-to-end platform for evaluating, observing, and improving LLM and AI agent applications. Tracing · Evals · Simulations · Datasets · Gateway · Guardrails. Self-hostable. Apache 2.0.

Self-hostable, Apache 2.0-licensed platform covering the full LLM application observability and improvement loop: tracing, evals, simulations, datasets, gateway, and guardrails in one stack. Targets teams that want an integrated alternative to stitching together Langfuse, LangSmith, and separate guardrail layers.

💬 Reddit Apr 22

PSA: Anthropic bans organizations without warning

An ~110-user agricultural tech org had all Claude accounts suspended simultaneously without prior warning, with no admin notification and only a Google Form for appeal. The post raises legitimate concerns about Anthropic's enterprise account governance: no escalation path, no advance notice, and no SLA on appeal response. A real operational risk for teams with Claude in production workflows.

💬 Reddit Apr 22

Claude can end a conversation

Anthropic has implemented an `end_conversation` tool in Claude that allows the model to terminate sessions, reportedly triggered by user insults. The feature appears to be a boundary-enforcement mechanism giving Claude agency to disengage from hostile interactions.

📑 arXiv Apr 22

Preference Leakage: A Contamination Problem in LLM-as-a-Judge (ICLR 2026)

Identifies 'preference leakage': when the same LLM generates synthetic training data and serves as the judge, it systematically inflates scores for outputs matching its own generation style, biasing leaderboard rankings even when models perform similarly. Demonstrated empirically across several evaluation pipelines. A concrete warning against self-referential LLM-as-a-judge setups.
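As a rough illustration of the warning, the check below flags evaluation setups where the judge is the same model that generated the synthetic training data. It is a minimal sketch, not the paper's method: `EvalPipeline`, `flag_self_referential`, and the model names are hypothetical, and exact-name matching is an assumption that would miss related models from the same family.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalPipeline:
    """Hypothetical description of one evaluation setup."""
    name: str
    data_generator: str  # model that produced the synthetic training data
    judge: str           # model used to score outputs


def flag_self_referential(pipelines):
    """Return names of pipelines whose judge is also the data generator.

    Assumption: models are compared by name only (case-insensitive).
    """
    return [
        p.name
        for p in pipelines
        if p.data_generator.strip().lower() == p.judge.strip().lower()
    ]


# Placeholder model names, for illustration only.
pipelines = [
    EvalPipeline("leaderboard-a", data_generator="model-x", judge="model-x"),
    EvalPipeline("leaderboard-b", data_generator="model-x", judge="model-y"),
]
print(flag_self_referential(pipelines))
```

A flagged pipeline is only a prompt to swap in an unrelated judge or add human spot checks; it does not measure how much the scores are inflated.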

🟢 OpenAI Apr 22

Introducing OpenAI Privacy Filter

OpenAI releases an open-weight PII detection and redaction model called Privacy Filter, claiming state-of-the-art accuracy on identifying personally identifiable information in text. Open weights make it deployable on-prem or in air-gapped environments where sending data to an API is not viable. Directly relevant for enterprise pipelines that need PII scrubbing before feeding data to LLMs.

🔶 Anthropic Apr 16
⭐ Editor's Pick

Introducing Claude Opus 4.7

Anthropic's official Claude Opus 4.7 GA post confirms the same pricing as 4.6, image resolution raised to 2,576px on the long edge (~3.75 MP, 3× the prior limit), and a new xhigh effort tier. Coding benchmarks: +13% task resolution on an internal 93-task harness, 70% on CursorBench (vs. 58%), and 98.5% on XBOW visual-acuity (vs. 54.5%). It is the first model shipped with real-time cyber safeguards derived from the restricted Mythos Preview testbed.
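To make the resolution figure concrete, the sketch below scales an image's dimensions so the long edge fits within 2,576px while preserving aspect ratio. The helper `fit_long_edge` is hypothetical client-side arithmetic, not part of any Anthropic SDK, and the post as summarized here does not say how oversized images are handled. The example dimensions are chosen for illustration; at that aspect ratio the result works out to about 3.75 MP.

```python
def fit_long_edge(width, height, max_long_edge=2576):
    """Scale (width, height) down so the longer side is at most max_long_edge.

    Images already within the limit are returned unchanged.
    """
    long_edge = max(width, height)
    if long_edge <= max_long_edge:
        return width, height
    scale = max_long_edge / long_edge
    # Round to whole pixels, never below 1.
    return max(1, round(width * scale)), max(1, round(height * scale))


# Illustrative input; the aspect ratio is an assumption.
new_w, new_h = fit_long_edge(5152, 2912)
megapixels = new_w * new_h / 1_000_000
print(new_w, new_h, round(megapixels, 2))
```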

🟧 Hacker News Apr 13
⭐ Editor's Pick

Anthropic Restricts "Mythos Preview" After Autonomous Zero-Day Exploitation Across All Major OSes and Browsers

Anthropic restricted its Mythos Preview model after it autonomously discovered and exploited zero-day vulnerabilities across all major OSes and browsers. Palo Alto Networks assessed similar capabilities as weeks to months from broader proliferation. CrowdStrike's 2026 threat report put average eCrime breakout time at 29 minutes, and Mandiant's M-Trends reported a 22-second adversary hand-off. A sharp illustration of the gap between lab capability and safe deployment for capability-frontier models.