🍑 feedmeAI

The Feed

Everything interesting, as it happens. Curated by Claude, organized chronologically.

Thursday, April 23

🐙 GitHub Apr 23

future-agi/future-agi: Open-source, end-to-end platform for evaluating, observing, and improving LLM and AI agent applications. Tracing · Evals · Simulations · Datasets · Gateway · Guardrails. Self-hostable. Apache 2.0.

Self-hostable, Apache 2.0-licensed platform covering the full LLM application observability and improvement loop: tracing, evals, simulations, datasets, gateway, and guardrails in one stack. Targets teams who want an integrated alternative to stitching together Langfuse, LangSmith, and separate guardrail layers. Open-source with enterprise-grade feature breadth.

Wednesday, April 22

🟒 OpenAI Apr 22

Introducing OpenAI Privacy Filter

OpenAI releases an open-weight PII detection and redaction model called Privacy Filter, claiming state-of-the-art accuracy on identifying personally identifiable information in text. Open weights make it deployable on-prem or in air-gapped environments where sending data to an API is not viable. Directly relevant for enterprise pipelines that need PII scrubbing before feeding data to LLMs.

📑 arXiv Apr 22

Learning to Evolve: A Self-Improving Framework for Multi-Agent Systems via Textual Parameter Graph Optimization

Textual Parameter Graph Optimization (TPGO) models a multi-agent system as a graph of optimizable nodes (agents, tools, workflows) and derives structured natural-language "textual gradients" from execution traces to guide iterative optimization. Critically, the optimizer itself learns from accumulated optimization history, making the framework self-improving rather than static. This addresses the lack of structural awareness and adaptability in flat prompt-tuning approaches to MAS optimization.

📑 arXiv Apr 22

Exploiting LLM-as-a-Judge Disposition on Free Text Legal QA via Prompt Optimization

Investigates how prompt optimization and judge choice interact in LLM-as-a-Judge evaluations for legal QA on the LEXam benchmark, using ProTeGi optimization with Qwen3-32B and DeepSeek-V3 as judges. Lenient judge feedback yields larger and more consistent gains than strict feedback, and prompts optimized with lenient judges transfer better across judge models. Results highlight that judge disposition is a significant, underappreciated variable in automated evaluation pipelines.

📑 arXiv Apr 22

Supplement Generation Training for Enhancing Agentic Task Performance

Supplement Generation Training (SGT) trains a small LLM to produce task-specific supplemental text prepended to the input of a larger frozen LLM, improving downstream task performance without modifying the large model. This decouples task-specific adaptation from expensive full model retraining, making it practical to update only the lightweight supplement generator as base models evolve. The approach is framed as an alternative to repeated post-training of frontier models for agentic tasks.

💬 Reddit Apr 22

Qwen3.6-35B becomes competitive with cloud models when paired with the right agent

Pairing Qwen3.6-35B with the 'little-coder' agent scaffold achieves 78.7% on the Polyglot coding benchmark, landing in the public top 10 and competitive with leading cloud models. The same scaffold previously lifted a 9B Qwen model from 19.11% to 45.56%, suggesting a significant portion of the local-vs-cloud performance gap is attributable to scaffold/harness mismatch rather than model capability alone.

💬 Reddit Apr 22

Why I Stopped Building Autonomous Agents for Clients

A practitioner's post-mortem on building fully autonomous multi-agent systems for clients: unpredictable recursive loops, runaway API costs ($200 in 2 hours), and zero client tolerance for black-box failures pushed the author toward human-in-the-loop, deterministic workflows instead. The core argument, that autonomy is a liability for most business use cases, is grounded in specific failure modes rather than theory.
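
The failure modes described here (unbounded loops and runaway spend) are commonly contained with hard caps enforced outside the model. A minimal sketch, assuming a hypothetical `step_fn` that performs one agent step and returns its cost in dollars plus a done flag; the names and limits are illustrative and not taken from the post:

```python
from typing import Callable, Tuple


class BudgetExceeded(RuntimeError):
    """Raised when the loop hits a hard cap and a human must intervene."""


def run_with_caps(
    step_fn: Callable[[int], Tuple[float, bool]],  # hypothetical: returns (cost_usd, done)
    max_steps: int = 20,
    max_cost_usd: float = 5.0,
) -> Tuple[int, float]:
    """Run step_fn until it reports done, or stop at the step or cost cap."""
    spent = 0.0
    for step in range(1, max_steps + 1):
        cost, done = step_fn(step)
        spent += cost
        if spent > max_cost_usd:
            raise BudgetExceeded(f"spent ${spent:.2f} after {step} steps")
        if done:
            return step, spent
    raise BudgetExceeded(f"no result after {max_steps} steps")
```

Raising instead of silently stopping hands control back to a person, which matches the human-in-the-loop direction the author ended up taking.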

🐙 GitHub Apr 22

Learning to Reason Without External Rewards via Reinforcement Learning from Internal Feedback (RLIF)

Intuitor (ICLR 2026) trains LLMs to improve reasoning using only self-certainty as a reward signal: no labeled data, no external verifier, no human-crafted reward. The companion code release (RLIF framework) enables direct reproduction of the result that models can self-improve on reasoning benchmarks from internal feedback alone. Practically significant because it removes the dependency on curated verifiable datasets.
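
To make "self-certainty" concrete: one formulation, assumed here rather than taken from the summary, scores a response by the average KL divergence from the uniform distribution to the model's next-token distribution, so peaked (confident) predictions score higher. A toy sketch over hand-written probability vectors; the function name and inputs are illustrative:

```python
import math
from typing import Sequence


def self_certainty(token_dists: Sequence[Sequence[float]]) -> float:
    """Average KL(U || p) over positions; assumes every p has strictly positive entries."""
    total = 0.0
    for p in token_dists:
        vocab = len(p)
        # KL(U || p) = sum_j (1/V) * log((1/V) / p_j)
        total += sum((1.0 / vocab) * math.log((1.0 / vocab) / pj) for pj in p)
    return total / len(token_dists)


# Illustrative distributions over a 4-token vocabulary (assumption, not real model output)
uniform = [[0.25, 0.25, 0.25, 0.25]]
peaked = [[0.97, 0.01, 0.01, 0.01]]
```

A uniform distribution scores zero and a peaked one scores higher, which is the property that lets the quantity act as a reward without any external label.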

💬 Reddit Apr 22

Claude can end a conversation

Anthropic has implemented an `end_conversation` tool in Claude that allows the model to terminate sessions, reportedly triggered by user insults. The feature appears to be a boundary-enforcement mechanism giving Claude agency to disengage from hostile interactions.

📑 arXiv Apr 22

MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents (ICLR 2026, MIT)

MEM1 trains agents end-to-end via RL to compress and update an internal memory state at each step, maintaining constant context size across arbitrarily long multi-turn tasks. Unlike RAG or full-context retention, the memory management policy itself is learned. Demonstrated on multi-turn web and tool-use tasks; from MIT, accepted ICLR 2026.
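
The constant-context idea can be illustrated without any learning: at each turn the agent replaces its memory with a bounded consolidation of the old memory and the new observation, then discards the raw history. In this sketch `consolidate` is a hypothetical stand-in that simply keeps the most recent characters; in MEM1 as summarized here that policy is learned via RL, which this toy does not attempt.

```python
def consolidate(memory: str, observation: str, limit: int) -> str:
    """Hypothetical stand-in for a learned memory policy: keep the newest `limit` chars."""
    merged = f"{memory} | {observation}" if memory else observation
    return merged[-limit:]


def run_agent(observations: list[str], limit: int = 80) -> tuple[str, int]:
    """Return the final memory and the largest memory size seen across turns."""
    memory, peak = "", 0
    for obs in observations:
        memory = consolidate(memory, obs, limit)  # raw history is not retained
        peak = max(peak, len(memory))
    return memory, peak
```

However many turns are fed in, the peak memory size stays at or below `limit`, in contrast to full-context retention where the prompt grows with every turn.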

🟒 OpenAI Apr 22

Speeding up agentic workflows with WebSockets in the Responses API

OpenAI engineering post details how the Codex agent loop uses WebSockets in the Responses API to reduce per-request connection overhead and leverages connection-scoped caching to cut model latency in multi-turn agentic workflows. The post quantifies improvements but frames them around the specific Codex loop design. Practical reference for anyone building low-latency agents on top of the Responses API.

🤗 Hugging Face Apr 22

SkillLearnBench: Benchmarking Continual Learning Methods for Agent Skill Generation on Real-World Tasks

SkillLearnBench is the first benchmark for continual skill learning in LLM agents, covering 20 verified tasks across 15 sub-domains with evaluation at three levels: skill quality, execution trajectory, and task outcome. Tested methods include one-shot learning, self/teacher feedback, and skill-creator approaches; all improve over the no-skill baseline but none achieves consistent gains across domains. Highlights that automatic skill acquisition for agents remains an unsolved problem despite recent progress.

🟒 OpenAI Apr 22

Introducing workspace agents in ChatGPT

OpenAI introduces workspace agents in ChatGPT: Codex-powered cloud agents that can automate multi-step workflows across tools on behalf of teams. They run asynchronously in the cloud, scoped to a workspace with access controls. This extends Codex beyond single-shot code generation into persistent, team-level agentic task execution.

📑 arXiv Apr 22

Automatic Ontology Construction Using LLMs as an External Layer of Memory, Verification, and Planning for Hybrid Intelligent Systems

Proposes a hybrid architecture where LLMs are augmented with an automatically constructed RDF/OWL ontology as an external memory layer, replacing or supplementing vector-based RAG with a structured knowledge graph. The pipeline performs entity recognition, relation extraction, triple generation, and SHACL/OWL validation from heterogeneous sources, enabling persistent and verifiable reasoning. The key distinction from standard RAG is that retrieved context is semantically structured and constraint-validated rather than embedding-similarity ranked.
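
The validation step is what separates this pipeline from embedding-similarity retrieval. A minimal sketch in plain Python, standing in for real SHACL/OWL tooling: extracted triples are checked against domain and range constraints before being admitted to the graph. The schema, type table, and triples are invented for illustration.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Hypothetical schema: predicate -> (required subject type, required object type)
SCHEMA = {"worksFor": ("Person", "Organization"), "locatedIn": ("Organization", "City")}
# Hypothetical entity typing, as an entity-recognition step might produce
TYPES = {"alice": "Person", "acme": "Organization", "berlin": "City"}


def validate(triple: Triple) -> bool:
    """Accept a triple only if its predicate is known and both ends have the right type."""
    if triple.predicate not in SCHEMA:
        return False
    domain, range_ = SCHEMA[triple.predicate]
    return TYPES.get(triple.subject) == domain and TYPES.get(triple.obj) == range_


candidates = [
    Triple("alice", "worksFor", "acme"),
    Triple("acme", "locatedIn", "berlin"),
    Triple("berlin", "worksFor", "alice"),  # violates the domain/range constraints
]
graph = [t for t in candidates if validate(t)]
```

Only the two well-typed triples survive, so anything later retrieved from `graph` has already passed a structural check rather than just a similarity ranking.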

📑 arXiv Apr 22

Preference Leakage: A Contamination Problem in LLM-as-a-Judge (ICLR 2026)

Identifies 'preference leakage': when the same LLM generates synthetic training data and serves as the judge, it systematically inflates scores for outputs matching its own generation style, biasing leaderboard rankings even when models perform similarly. Demonstrated empirically across several evaluation pipelines. A concrete warning against self-referential LLM-as-a-judge setups.

💬 Reddit Apr 22

Qwen3 TTS is seriously underrated - I got it running locally in real-time and it's one of the most expressive open TTS models I've tried

Qwen3 TTS achieves real-time local inference with notably expressive output, integrated into the open-source Persona Engine project (ASR→LLM→TTS pipeline with lip-synced avatar). The author positions it as a meaningful step up from prior local TTS options like Sesame for latency-sensitive, fully offline deployments.

💬 Reddit Apr 22

An open letter to Anthropic

A Max-tier Claude user shares a personal account of how Claude 4.6 enabled them to organize twenty years of creative work into a shareable system. The post is a user testimonial highlighting Claude's thoughtfulness and pacing as differentiating qualities. No technical content, but signals strong user attachment to a specific model version.

🟧 Hacker News Apr 22

Parallel agents in Zed

Zed editor adds support for running multiple AI agents in parallel within the same workspace, allowing concurrent agentic tasks on different parts of a codebase. No content snippet is available, but the feature extends Zed's existing AI coding capabilities to multi-agent workflows. Relevant for teams evaluating editor-native agent orchestration versus external tooling.

💬 Reddit Apr 22

PSA: Anthropic bans organizations without warning

An ~110-user agricultural tech org had all Claude accounts suspended simultaneously without prior warning, with no admin notification and only a Google Form for appeal. The post raises legitimate concerns about Anthropic's enterprise account governance: no escalation path, no advance notice, and no SLA on appeal response. A real operational risk for teams with Claude in production workflows.

Monday, April 20

🐙 GitHub Apr 20

cosmicstack-labs/mercury-agent: Soul-driven AI agent with permission-hardened tools, token budgets, and multi-channel access. Runs 24/7 from CLI or Telegram.

CLI/Telegram-accessible AI agent framework with permission-scoped tools, token budget enforcement, and 24/7 uptime. Packages a "soul" config (personality/behavioral constraints) alongside access control primitives. Thin on novel technical depth; primarily a structured agent harness.

💬 Reddit Apr 20

Spent a weekend actually understanding and building Karpathy's "LLM Wiki": here's what worked, what didn't

A hands-on build report on Karpathy's 'LLM Wiki' concept: pre-processing sources into a structured, interlinked markdown wiki rather than retrieving raw chunks at query time. Synthesis and cross-document reasoning questions improve noticeably versus RAG, but the approach struggles with scale, update latency, and source conflicts. Honest tradeoff analysis rather than a benchmark.

Sunday, April 19

💬 Reddit Apr 19

The gap between what technical and non-technical people get from AI is huge now

A Reddit thread observes that the practical capability gap between technical and non-technical AI users has widened sharply: non-technical users largely treat LLMs as search, while technical users leverage agents, computer use, Claude Code, and model selection. The post notes that nearly all recent model improvements are coding-focused, leaving general users with little perceived change. Reflects a real bifurcation in who captures value from frontier model advances.

Saturday, April 18

📝 Blog Apr 18
⭐ Editor's Pick

My Workflow for Understanding LLM Architectures

Raschka documents a three-step process for reverse-engineering open-weight model architectures: start with the technical report, cross-reference the HuggingFace config, then validate against the transformers reference implementation. The core argument is that working code is a more reliable source of truth than under-specified papers. Practical guidance for engineers who want to understand architectural nuances firsthand.
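
The second step, cross-referencing the Hugging Face config, usually comes down to reading a handful of fields from `config.json`. A small sketch, assuming the field names used by many decoder-only configs (`hidden_size`, `num_hidden_layers`, `num_attention_heads`, `num_key_value_heads`); not every architecture uses these names, and the sample values are made up rather than taken from any real model:

```python
import json


def summarize_config(raw: str) -> dict:
    """Pull a few architecture facts out of a config.json string."""
    cfg = json.loads(raw)
    heads = cfg["num_attention_heads"]
    kv_heads = cfg.get("num_key_value_heads", heads)  # absent means plain multi-head attention
    return {
        "layers": cfg["num_hidden_layers"],
        # Some configs set head_dim explicitly; otherwise derive it.
        "head_dim": cfg.get("head_dim", cfg["hidden_size"] // heads),
        "queries_per_kv_head": heads // kv_heads,
        "uses_gqa": kv_heads < heads,
    }


# Illustrative values only (assumption, not a real model's config)
sample = '{"hidden_size": 2048, "num_hidden_layers": 24, "num_attention_heads": 16, "num_key_value_heads": 4}'
summary = summarize_config(sample)
```

The derived values are then what one would check against the attention module in the reference implementation, which is the third step of the workflow.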

Friday, April 17

📝 Blog Apr 17

Practitioner post: Qwen3.6-35B-A3B MoE outperforms Claude Opus 4.7 locally on MacBook Pro at 20.9 GB quantized

Alibaba's Qwen3.6-35B-A3B MoE (35B total, 3B active parameters) reportedly matches or beats Claude Opus 4.7 on local tasks while fitting in 20.9 GB of RAM when quantized on a MacBook Pro. If the benchmark methodology holds, this is a notable MoE-for-edge result: frontier-tier quality within consumer-RAM constraints. Practitioner claim; independent verification of the benchmark methodology is still needed.
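
A quick sanity check on the memory figure: dividing the reported footprint by the total parameter count gives the implied average bits per weight. This back-of-the-envelope sketch assumes 1 GB = 10^9 bytes and that the whole 20.9 GB is weights (ignoring KV cache and runtime overhead), neither of which the post confirms:

```python
def implied_bits_per_weight(size_gb: float, total_params_billion: float) -> float:
    """Average bits per parameter if the whole footprint were weights (1 GB = 1e9 bytes)."""
    total_bits = size_gb * 1e9 * 8
    return total_bits / (total_params_billion * 1e9)


bits = implied_bits_per_weight(20.9, 35.0)
print(f"{bits:.2f} bits per weight")
```

Under those assumptions the result is a little under 5 bits per weight. The calculation uses the 35B total rather than the 3B active count because all experts of an MoE model have to be resident in memory even though only a few run per token.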

Thursday, April 16

🔶 Anthropic Apr 16
⭐ Editor's Pick

Introducing Claude Opus 4.7

Anthropic's official Claude Opus 4.7 GA post confirms same pricing as 4.6, image resolution raised to 2,576px long edge (~3.75 MP, 3× prior), and a new xhigh effort tier. Coding benchmarks: +13% task resolution on internal 93-task harness, 70% on CursorBench (vs. 58%), 98.5% on XBOW visual-acuity (vs. 54.5%). First model shipped with real-time cyber safeguards derived from the restricted Mythos Preview testbed.

💬 Reddit Apr 16
⭐ Editor's Pick

Opus 4.7 is 50% more expensive with context regression?!

User benchmarks show Claude Opus 4.7 scoring 59.2% vs Opus 4.6's 91.9% on the MRCR v2 8-needle 256K context benchmark, a sharp context-retention regression. Compounding the issue, a tokenizer change reportedly causes Opus 4.7 to consume ~1.35x more tokens than Opus 4.6 and ~2x more than competing proprietary models, effectively raising costs ~50% for equivalent workloads. If the benchmark numbers hold, this is a meaningful quality-cost tradeoff moving in the wrong direction.
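
The cost claim can be sanity-checked with simple arithmetic: cost per workload scales with the token multiplier times the per-token price multiplier. A sketch, assuming per-token pricing is unchanged between the two versions (an assumption here, not something the post establishes):

```python
def cost_increase_pct(token_multiplier: float, price_multiplier: float = 1.0) -> float:
    """Percent change in cost for the same workload."""
    return (token_multiplier * price_multiplier - 1.0) * 100.0


increase = cost_increase_pct(1.35)
print(f"{increase:.0f}% more for the same workload")
```

Under that assumption, 1.35x tokens alone works out to roughly 35%, so the post's ~50% figure would have to include other effects (for example longer outputs) that this sketch does not capture.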

🟒 OpenAI Apr 16
⭐ Editor's Pick

OpenAI Agents SDK next evolution: native sandbox execution, model-native harness, configurable memory

OpenAI's Agents SDK gained native sandbox execution and a model-native harness (April 16) for secure, long-running file/tool agents, plus configurable memory and sandbox-aware orchestration. Version 0.4 (April 5) added MCP tool-use and streaming agent handoffs, making SDK-built agents first-class MCP consumers alongside Claude and Cursor. The combined updates meaningfully close the gap between the SDK and production-grade agent frameworks.

🐙 GitHub Apr 16

TheArcForge/UniClaude: Claude Code, natively inside Unity Editor. A dockable chat window with full project awareness, 60+ MCP tools, and zero alt-tabbing.

UniClaude embeds Claude Code directly into the Unity Editor as a dockable chat window, giving it full project awareness and access to 60+ MCP tools without leaving the editor. Targets the context-switching friction that plagues game dev AI workflows. Essentially a Unity-native MCP client wired to Claude.

💬 Reddit Apr 16

Qwen3.6-35B-A3B released!

Qwen3.6-35B-A3B is a sparse MoE model with 35B total and only 3B active parameters, released under Apache 2.0. Claims agentic coding performance on par with models 10× its active size, with both multimodal thinking and non-thinking modes. Efficient active-parameter footprint makes it practical for inference on constrained hardware.

Tuesday, April 14

📝 Blog Apr 14

r/LocalLLaMA April 2026 community consensus: Qwen 3.5 most recommended family; Qwen3-Coder-Next sweeps local coding

April 2026 r/LocalLLaMA community consensus (143+ posts) names Qwen 3.5 as the most broadly recommended local model family, with Qwen3-Coder-Next as the near-unanimous pick for coding. MiniMax M2.5/M2.7 surface as the go-to for agentic/tool-heavy workloads; Gemma 4 gains traction for general local use; GLM-5/4.7 enters the best-overall conversation.

Monday, April 13

🟧 Hacker News Apr 13
⭐ Editor's Pick

Anthropic Restricts "Mythos Preview" After Autonomous Zero-Day Exploitation Across All Major OSes and Browsers

Anthropic restricted its Mythos Preview model after it autonomously discovered and exploited zero-day vulnerabilities across all major OSes and browsers. Palo Alto Networks assessed similar capabilities as weeks-to-months from broader proliferation; CrowdStrike's 2026 threat report clocked average eCrime breakout at 29 minutes, and Mandiant's M-Trends reported a 22-second adversary hand-off. A sharp illustration of the gap between lab capability and safe deployment for capability-frontier models.

Wednesday, April 8

Ⓜ️ Meta AI Apr 8

Meta Muse Spark: first model from Meta Superintelligence Labs, proprietary pivot from Llama

Meta Superintelligence Labs' first model, Muse Spark, is a small, fast proprietary model with native multimodal perception and multi-agent parallel subagent execution, a sharp departure from Meta's Llama open-source strategy. Led by Alexandr Wang, it powers the revamped Meta AI app with Instant and Thinking modes and is rolling out across WhatsApp, Instagram, Facebook, Messenger, and Ray-Ban glasses. API access is restricted to select partners only.

Saturday, April 4

📝 Blog Apr 4

Components of a Coding Agent

Raschka breaks down the practical anatomy of a coding agent into three components: tool use (file I/O, shell, search), memory (in-context vs. external), and repository-level context management. Written as a grounding companion to his LLM architecture series, it maps abstract agent design concepts onto how systems like Claude Code and Codex actually operate.

Monday, March 16

📝 Blog Mar 16

What Comes Next with Open Models

Lambert argues the open-closed performance gap will widen in 2026 because closed models are accumulating advantages on long-horizon, domain-specific tasks with non-public training data. Proposes a three-class taxonomy: true closed frontier, open frontier, and small specialized open models. Predicts the highest-impact open models will be narrow, fast, cheap sub-agents used as tools inside closed-model pipelines.

Wednesday, January 21

📝 Blog Jan 21

Get Good at Agents

Lambert documents a real multi-agent coding workflow (GPT-5 Pro for planning, Claude Code with Opus 4.5 for implementation, Codex with GPT-5.2 for high-thinking-effort tasks) and argues that directing parallel agents on open-ended tasks is replacing individual grind as the primary work mode. The thesis: scoping and directing agents is the durable skill edge, not raw effort.