Framework evaluates agentic systems across LLM, Memory, Tools, and Environment dimensions using static analysis, dynamic monitoring, and judge-based evaluation to detect policy violations beyond task completion. Based on CloudOps production deployment where success metrics masked compliance failures. Addresses gap in current benchmarks that measure outcomes but not process adherence.
Multi-agent LLM systems spontaneously develop power law distributions in knowledge and influence, mirroring human intellectual hierarchies. Agent societies exhibit emergent specialization and social stratification. First empirical evidence of collective social dynamics beyond individual agent capabilities.
GUIDE separates lightweight acting model for real-time spacecraft control from offline reflection that updates a 'playbook' from prior trajectories, demonstrating LLMs can adapt operational strategies without weight updates in safety-critical domains. Shows context evolution in LLM agents functions as policy search over structured decision rules in deployment-constrained environments.
Multi-Agent Reflexion uses diverse reasoning personas with separate judge model to synthesize critiques, improving HotPotQA by 3 points and HumanEval by 6.2 points. Separates acting, diagnosing, critiquing, and aggregating to reduce shared blind spots in single-agent self-reflection. Addresses systematic limitation where solo agents repeat misconceptions without external correction signals.
Neurosymbolic architecture grounds AI agents in domain ontologies for regulated industries, evaluated across 600 runs in 5 sectors including Vietnamese-language domains. Ensures agent reasoning aligns with compliance requirements and domain constraints. Bridges symbolic knowledge representation with neural reasoning for safety-critical enterprise deployment.
SkillClaw enables LLM agent skills to continuously evolve through collective cross-user interaction experiences via an autonomous 'agentic evolver' that refines and updates skills, achieving +42.1% improvement. Treats agent capabilities as living artifacts that improve through collective use rather than static functions, representing a shift toward learning agent ecosystems.
Community discussion on replacing Claude Opus with Qwen-35B-A3B for coding agent workflows on M5 Max hardware. Users weighing Opus's reasoning edge against Qwen's local deployment and cost benefits for daily development tasks.
OpenAI Codex expanded beyond coding to include computer use, web workflows, image generation, memory, and automations. The updated developer app adds PR reviews, multi-file/terminal viewing, SSH devbox connections, and in-app browsing, serving 3+ million developers weekly.
Benchmark comparing Claude and Gemini on the laden knight's tour problem, a weighted variant requiring optimal pathfinding with accumulating costs. Tests coding agents on combinatorial optimization task combining movement constraints with dynamic cost calculation.
Berkeley researchers achieved near-perfect scores on major AI agent benchmarks (SWE-bench, WebArena, FieldWorkArena, Terminal-Bench) without solving tasks, using exploits ranging from trivial to sophisticated. Exposes that evaluations weren't designed to resist systems optimizing for scores rather than actual task completion.
Production LLM deployments span automated bureaucracy monitoring (extracting structured data from German government sites), multi-agent sales automation with 8 sub-agents and critic loops, and corporate knowledge RAG using Qdrant+LlamaIndex. Key insight: LLMs enable processing unstructured data at scale previously impossible.
User benchmark comparing Qwen 3.6 35B against Gemma 4 26B on 30k-line codebase with 37 intentional bugs and PDF analysis tasks shows Qwen significantly outperforming across agentic capabilities, coding, image-to-text, instruction following, and reasoning. Both models tested at Q4_K_XL quantization for fair comparison.
ChemGraph-XANES automates X-ray absorption near-edge structure simulation workflows using a LangGraph/LangChain-based agentic framework that handles natural-language task specification, structure acquisition, FDMNES execution, and provenance-aware data curation. Built on ASE, FDMNES, and Parsl, it addresses workflow complexity constraints that limit computational XANES deployment at scale.
MARCH emulates the professional hierarchy of radiology departments using a multi-agent framework with specialized roles: a Resident Agent for initial drafting, Fellow Agents for retrieval-augmented revision, and an Attending Agent orchestrating iterative consensus. The approach addresses clinical hallucinations and lack of verification in automated 3D CT report generation by mimicking collaborative clinical workflows.
Extracted the scholarly reasoning systems of two prominent humanities scholars from published corpora, converted them into structured inference-time constraints for LLMs, and tested whether resulting scholar-bots could perform doctoral supervision, peer review, and lecturing at expert quality. Expert assessment found outputs met appointment-level quality standards, raising questions about knowledge work automation from public scholarship alone.
Veritas-RPM uses a five-layer multi-agent architecture (ground-truth assembly, anomaly detection, specialist routing, domain specialists, and conflict resolution) to suppress false positives in remote patient monitoring. Evaluated on 530 synthetic patient epochs across 98 documented false-positive scenarios, it reports True Suppression Rate, False Escalation Rate, and Indeterminate Rate metrics.
AstroVLM is a multi-agent VLM system for diagnosing quality issues in astronomical imaging by handling complex underlying correlations across multidisciplinary subtasks. It addresses the time-intensive manual effort NASA and expert astronomers invest in quality diagnosis and error localization during the imaging process.
SocialGrid is an Among Us-inspired benchmark evaluating LLM agents on planning, task execution, and social reasoning in embodied multi-agent settings. Even GPT-OSS-120B achieves below 60% accuracy, with agents stuck in repetitive behaviors—revealing social reasoning remains a bottleneck even with planning assistance.
Qwen3.6-35B-UD at 2-bit K_XL quantization achieves 98.3% tool call success rate across 58 calls while processing 2.7M tokens on 16GB VRAM. Successfully converts research papers to web applications using llama.cpp on consumer laptop hardware. Demonstrates extreme quantization can maintain performance on complex multi-step tasks.
Agentic Verifier transforms reward modeling into multi-turn, tool-augmented deliberation using complementary forward and backward agents. Addresses error propagation and lack of grounding in complex domains by tracing solutions from premises to conclusions and re-checking conclusions against premises for comprehensive verification.
WORC (Weak-link Optimization for Reasoning and Collaboration) improves multi-agent LLM frameworks by systematically identifying and reinforcing performance-limiting agents rather than only enhancing high-capability agents. Addresses reasoning instability where individual agent errors amplify through collaboration, grounded in the weak-link principle.
Survey categorizing graph-LLM integration methods by purpose (reasoning, retrieval, generation, recommendation), graph modality (knowledge graphs, scene graphs, causal graphs), and integration strategy (prompting, augmentation, training, agent-based). Provides clarity on when and what types of graph representations enhance LLM capabilities.
Experience Compression Spectrum unifies agent memory, skills, and rules as points along a compression axis (5-20× for memory, 50-500× for skills, 1000×+ for rules). Framework addresses the critical bottleneck of managing accumulated experience in long-horizon, multi-session LLM agent deployments by reducing context consumption and retrieval latency.
DiZiNER simulates pilot annotation processes where multiple heterogeneous LLMs act as annotators and supervisors to refine instructions for zero-shot NER. The framework identifies systematic errors by generating disagreements between models, mirroring how human annotation resolves inconsistencies to improve zero-shot performance toward supervised baselines.
CoEvolve is an agent-data mutual evolution framework enabling LLM agents to improve through closed-loop, interaction-driven training. Extracts feedback signals like forgetting and uncertainty to identify failure-prone patterns, then uses LLM-based task synthesis to adapt the training data distribution alongside the agent.
Discover And Prove (DAP) introduces 'Hard Mode' automated theorem proving where systems must independently discover answers before constructing formal proofs, unlike standard benchmarks that embed answers in statements. Releases MiniF2F-Hard and FIMO-Hard benchmarks with expert reannotations, and an agentic framework using LLM natural-language reasoning with self-reflection for answer discovery.
📝 Blog 3d ago
★ High Signal
NVIDIA's Nemotron 3 Super is a 120B/12B-active MoE model with hybrid Mamba-Attention architecture scoring 60.47% on SWE-Bench Verified—the highest open-weight coding score at launch. Features 1M context, 2.2x throughput improvement, and native speculative decoding for efficient agentic reasoning.
Speculative decoding uses a smaller draft model to generate candidate tokens that a larger target model validates in a single pass, providing significant speedup for agentic workloads heavy on tool calls and structured outputs without quality loss. Cloudflare reports this is particularly effective for coding agents and API integration tasks where tool calling volume is high.
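A minimal sketch of the idea, with toy deterministic functions standing in for the draft and target models (a real implementation batches the verification of all draft tokens into a single target forward pass):

```python
def draft_model(prefix, k):
    """Toy stand-in for the small draft model: propose k candidate tokens."""
    return [(prefix[-1] + i + 1) % 100 for i in range(k)]

def target_next(ctx):
    """Toy stand-in for the large target model's greedy next token."""
    return (ctx[-1] + 1) % 100

def speculative_step(prefix, k=4):
    """One decode step: accept the draft's tokens until the first mismatch.

    If every draft token matches, the target's verification pass yields one
    extra 'bonus' token, so k+1 tokens cost a single target pass.
    """
    draft = draft_model(prefix, k)
    accepted, ctx = [], list(prefix)
    for tok in draft:
        expected = target_next(ctx)       # in practice: one batched forward pass
        if tok != expected:
            accepted.append(expected)     # replace first mismatch, stop
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next(ctx))     # all accepted: free bonus token
    return accepted
```

With these toy models the draft always agrees with the target, so each step emits k+1 tokens, e.g. `speculative_step([0])` yields five tokens for one simulated target pass.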
Qwen 3.6 introduces a preserve_thinking flag that prevents KV cache invalidation by maintaining reasoning context across turns. This improves cache reuse in agent scenarios, reduces token consumption from redundant reasoning, and fixes a template issue that caused cache invalidation in Qwen 3.5.
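A schematic illustration of why the flag matters for caching (invented token strings, not Qwen's actual template): the KV cache is reusable only up to the first position where the new prompt diverges from the cached sequence, so stripping the reasoning block between turns discards almost the entire prefix.

```python
def reusable_prefix(cached, new):
    """Count leading tokens shared with the cached sequence (reusable KV entries)."""
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n

# Turn 1, schematically tokenized: system, user, <think>...</think>, answer
turn1 = ["sys", "u1", "<think>", "t1", "</think>", "a1"]

# Turn 2 with thinking stripped (the pre-3.6 template behavior): early divergence
stripped = ["sys", "u1", "a1", "u2"]
# Turn 2 with preserve_thinking: the whole first turn remains a shared prefix
preserved = turn1 + ["u2"]

assert reusable_prefix(turn1, stripped) == 2   # cache mostly invalidated
assert reusable_prefix(turn1, preserved) == 6  # full reuse of turn 1
```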
Command-line tool claims to accelerate Android app development 3x when used with AI coding agents. Streamlines agent-based mobile development workflows.
MM-WebAgent uses hierarchical planning and iterative self-reflection to coordinate AIGC tools for webpage generation, jointly optimizing layout, multimodal content, and integration. Solves style inconsistency problems in prior approaches that generate visual elements independently, introducing a new multimodal webpage generation benchmark.
Anywhere-agents is a configuration management tool for AI agents emphasizing portability across projects, curated writing/routing/skills capabilities, and safety via destructive-command guards. Single config approach unifies agent behavior management. Addresses agent configuration consistency and safety concerns.
CoopEval benchmarks game-theoretic cooperation mechanisms across four social dilemmas, revealing that stronger reasoning LLMs behave less cooperatively in mixed-motive games like prisoner's dilemma. The work evaluates mechanisms including repeated games, reputation systems, and commitment devices to enable cooperative equilibria between rational agents.
Proposes "agentic microphysics" methodology for analyzing safety risks that emerge from structured interactions between AI agents rather than individual model behavior. The framework bridges the gap between single-agent analysis and aggregate outcomes by focusing on communication, observation, and mutual influence mechanisms that drive population-level risks.
Blue's Data Intelligence Layer orchestrates agents across multi-source, multi-modal data beyond single-database NL2SQL. Addresses iterative queries, heterogeneous data sources, and external knowledge requirements in enterprise compound AI systems.
RadAgent generates chest CT reports through stepwise tool use, with fully inspectable reasoning traces for clinical validation. The tool-augmented agent improves over the 3D VLM baseline CT-Chat on clinical accuracy, groundedness, and radiologist efficiency across three evaluation dimensions.
Meituan introduces Policy-Guided Hybrid Simulation (PGHS), a dual-process framework that simulates group-level user behavior for merchant strategy evaluation by mining transferable decision policies from behavioral trajectories. The approach addresses information incompleteness and mechanism duality by anchoring an LLM-based reasoning branch with behavioral policies to prevent over-rationalization. This enables scalable counterfactual evaluation without costly online experiments.
📑 arXiv 3d ago
★ High Signal
Scepsy is a serving system for multi-LLM agentic workflows that schedules arbitrary agent frameworks onto GPU clusters under oversubscription. It exploits the observation that while end-to-end workflow latencies are unpredictable, the relative execution time shares of each LLM remain stable across runs. Enables efficient serving of complex agentic workflows at target throughput with low latency.
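A minimal sketch of how stable per-stage time shares could translate into a cluster allocation; the share numbers and the largest-remainder rounding are illustrative assumptions, not Scepsy's actual scheduler:

```python
def allocate_gpus(shares, total_gpus):
    """Give each LLM stage GPUs proportional to its measured share of
    end-to-end workflow time, rounding so the total matches the cluster."""
    total = sum(shares.values())
    raw = {m: total_gpus * s / total for m, s in shares.items()}
    alloc = {m: int(r) for m, r in raw.items()}
    leftover = total_gpus - sum(alloc.values())
    # hand remaining GPUs to the stages with the largest fractional remainders
    for m in sorted(raw, key=lambda m: raw[m] - alloc[m], reverse=True)[:leftover]:
        alloc[m] += 1
    return alloc

# Relative time shares measured from prior runs (hypothetical numbers)
shares = {"planner": 0.2, "coder": 0.5, "critic": 0.3}
allocation = allocate_gpus(shares, 8)  # e.g. coder gets the largest slice
```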
Agent-Aided Design systems use LLMs in a feedback loop to write CAD code, compile models, visualize results, and iteratively refine designs, but cannot yet generate complex 3D assemblies with moving parts like pistons or scissors. This work identifies the capability gap preventing these training-free agentic systems from impacting industrial manufacturing. Addresses the transition from static CAD objects to dynamic mechanical assemblies.
OpenMobile is an open-source framework for synthesizing high-quality mobile agent task instructions and trajectories, achieving nearly 70% success on AndroidWorld. Features scalable task synthesis using global environment memory and policy-switching strategy alternating between learner and expert models during trajectory rollout. Makes training recipes transparent unlike closed leading models.
LLM agents autonomously evolve the ABC logic synthesis codebase by rewriting sub-components while preserving its single-binary execution model. The self-evolving framework operates on the entire integrated codebase and bootstraps using existing open-source synthesis components before iteratively improving through agent-driven code evolution.
Scoping review of 23 studies reveals fairness in multi-agent AI systems is superficially addressed, lacks normative foundations, and overlooks agent autonomy dynamics. Authors argue fairness must be embedded structurally throughout MAAI development lifecycles rather than added post-hoc, addressing gaps in an increasingly important but understudied area.
Atropos optimizes cost-benefit trade-offs for LLM agents using self-consistency by predicting when to terminate cheaper Small Language Model inference early and hotswap to larger commercial models. The system analyzes structural properties of inference paths merged into graphs to decide when local SLMs suffice versus when expensive API calls are needed.
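A simplified sketch of the escalation policy, using raw answer-vote agreement in place of Atropos's analysis of merged inference-path graphs; the `slm_sample`/`llm_answer` interface is hypothetical:

```python
from collections import Counter

def answer_with_escalation(question, slm_sample, llm_answer,
                           n=5, agreement=0.6):
    """Run self-consistency on the cheap local model first; hotswap to the
    large commercial model only when the samples fail to converge."""
    samples = [slm_sample(question) for _ in range(n)]
    answer, votes = Counter(samples).most_common(1)[0]
    if votes / n >= agreement:
        return answer, "slm"            # consensus reached: no API call needed
    return llm_answer(question), "llm"  # unstable: escalate to the big model
```

The `agreement` threshold is the cost-benefit knob: raising it trades more API spend for fewer wrong local answers.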
CoGrid is a multi-agent grid simulation library with NumPy and JAX backends, paired with Multi-User Gymnasium (MUG) that converts simulations into interactive web experiments. The tools lower barriers for researchers studying human-AI interaction by supporting arbitrary numbers of humans and AI agents in both server-authoritative and peer-to-peer modes.
ProVoice-Bench introduces the first evaluation framework for proactive voice agents with 1,182 samples across four tasks measuring intervention and monitoring capabilities. State-of-the-art multimodal LLMs show significant performance gaps particularly in over-triggering and reasoning, revealing limitations in current proactive agent paradigms.
Autogenesis Protocol (AGP) standardizes self-evolving agent systems by modeling prompts, agents, tools, environments, and memory as protocol-registered resources with lifecycle management and version tracking. The Resource Substrate Protocol Layer decouples what evolves from how evolution occurs, addressing brittleness in existing protocols like A2A and MCP.
Qwen3.6-35B-A3B is a sparse MoE model with 35B total parameters and 3B active, released under Apache 2.0. The model matches agentic coding performance of models 10x its active size and includes multimodal perception with thinking and non-thinking modes.
Agentic framework for RTL timing optimization using LLMs with tool-grounded self-improvement and reusable optimization skills. Evaluated on realistic RTL designs with industrial-grade tools rather than manually degraded toy examples. Moves beyond coarse design-level feedback to fine-grained optimization through learned skills.
Examines explainability requirements for agentic AI in enterprise settings where low-code agent proliferation ("Agent Sprawl") outpaces governance capabilities. Proposes design-time and runtime explainability techniques from AI governance experts to address corporate concerns about autonomous agent decision-making and inter-agent communication.
UniClaude integrates Claude directly into Unity Editor as a dockable window with full project context awareness and 60+ MCP tools. Eliminates context switching during game development by embedding the AI assistant natively in the IDE. Provides workflow-specific tooling for game developers working in Unity.
OpenAI's Codex app for macOS and Windows now includes computer use capabilities, in-app browsing, image generation, memory, and plugins. The update transforms Codex from a code-focused assistant into a multi-capability developer productivity platform.
Agent-driven hardware reverse engineering automation stack controlling flying probe systems for PCB analysis. Combines target discovery, microscope mapping, safety-monitored CNC motion, probe review, and controlled pin probing. Demonstrates AI agents extending beyond software into physical hardware hacking workflows.
🔶 Anthropic 4d ago
★ High Signal
Claude Opus 4.7 delivers 13% improvement on coding benchmarks with enhanced vision for higher-resolution images and new effort controls/task budgets for autonomous development. Powers upgraded Claude Code review tools for long-running software engineering tasks. Introduces task-level resource management for extended autonomous coding workflows.
🔶 Anthropic 4d ago
★ High Signal
Claude Opus 4.7 achieves 87.6% on SWE-bench Verified (13% improvement) with 2x throughput on agentic tasks while maintaining $5/$25 per million token pricing and full 1M context window. The performance gains make it effectively cheaper per task despite unchanged nominal pricing. Higher-resolution vision capabilities included.
Ecom-RLVE introduces adaptive verifiable environments for training and evaluating e-commerce conversational agents with reinforcement learning. Provides structured simulation environments where agent actions can be verified against ground truth. Enables systematic development of domain-specific conversational AI for shopping and customer service scenarios.
GitHub Copilot adding Claude Opus 4.7 with stronger multi-step task performance and more reliable agentic execution. Launches with promotional 7.5× premium request multiplier until April 30th, replacing Opus 4.5 and 4.6 for Copilot Pro+ users.
Corpus2Skill distills document corpora into hierarchical skill directories that LLM agents navigate rather than passively retrieve, addressing RAG's limitation of treating models as passive consumers. The system clusters documents offline into a navigable tree with LLM-written summaries at each level, giving agents a bird's-eye corpus view for better evidence synthesis.
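The navigation idea can be sketched as a top-down walk over the summary tree; `choose` stands in for the LLM relevance call, and the tree layout below is a hypothetical example rather than Corpus2Skill's actual schema:

```python
def navigate(node, question, choose):
    """Walk the skill tree top-down: at each level the agent reads the
    LLM-written child summaries and descends into the most relevant branch.
    choose(question, summaries) -> index stands in for an LLM call."""
    while node.get("children"):
        summaries = [c["summary"] for c in node["children"]]
        node = node["children"][choose(question, summaries)]
    return node["docs"]  # leaf cluster: the evidence set to synthesize from

# Hypothetical two-level directory over a support corpus
tree = {"children": [
    {"summary": "billing and payments", "docs": ["invoice.md"]},
    {"summary": "deployment and infra", "children": [
        {"summary": "kubernetes", "docs": ["k8s.md"]},
        {"summary": "serverless", "docs": ["lambda.md"]},
    ]},
]}

# Crude word-overlap chooser in place of the LLM
choose = lambda q, s: max(range(len(s)),
                          key=lambda i: len(set(q.lower().split()) & set(s[i].split())))
```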
DR³-Eval provides a reproducible benchmark for deep research agents using static research sandbox corpora paired with authentic user tasks, measuring multimodal report generation across dimensions including information recall, factual accuracy, and citation coverage. It addresses the challenge of evaluating long-horizon research tasks by simulating open-web complexity while remaining fully verifiable.
MM-WebAgent is a hierarchical agentic framework for multimodal webpage generation that coordinates AIGC-based element generation through hierarchical planning and iterative self-reflection. Jointly optimizes global layout, local multimodal content, and their integration to produce coherent and visually consistent webpages, addressing style inconsistency in isolated element generation.
RadAgent is a tool-using AI agent for chest CT interpretation that generates reports through a stepwise, interpretable process with fully inspectable traces of intermediate decisions and tool interactions. Improves on CT-Chat VLM baseline across three dimensions while allowing clinicians to examine how findings are derived rather than being passive observers.
UniDoc-RL uses reinforcement learning to unify retrieval, reranking, and visual perception in a single LVLM agent with hierarchical actions. The model progressively refines evidence from document-level retrieval to region-level cropping, enabling fine-grained visual semantics for complex reasoning tasks.
Automated London rental property hunting system combining Claude Code, Claude in Chrome, and Gmail MCP. Scrapes four rental platforms on cron, deduplicates via spreadsheet, prioritizes listings as HIGH/MED/LOW, and generates ready-to-send outreach emails. Demonstrates practical agent orchestration for real-world automation tasks.
Discussion analyzing whether AI agent operational costs are experiencing exponential growth similar to training costs. Examines infrastructure and inference expenses for agentic systems at scale. Raises concerns about economic sustainability of agent-based architectures.
Hugging Face analysis of VAKRA agent system covering reasoning patterns, tool use mechanisms, and common failure modes in agent architectures.
WorldSeed is a simulation engine where AI agents live autonomously with physical rules and information asymmetry. Scenarios defined in YAML allow emergent multi-agent storytelling with any agent framework.
Anthropic redesigned Claude Code desktop app with parallel session management sidebar, integrated terminal, in-app file editor, and Routines—automation running on schedules, API calls, or GitHub events without active sessions. Available for Pro, Max, Team, and Enterprise users on macOS and Windows.
SkillClaw enables LLM agent skills to evolve autonomously by aggregating interaction experiences across users, with an 'agentic evolver' that refines capabilities from real-world usage. Achieves +42.1% improvement by shifting from static, manually-engineered skills to continuously improving ones learned from collective deployment data.
🟢 OpenAI 5d ago
★ High Signal
OpenAI's Agents SDK update adds native sandbox execution and model-native harness for building production-grade agents with improved safety and execution isolation. Represents a shift from experimental prototypes to production-ready agentic workflows with support for long-running agents working across files and tools.
🟢 OpenAI 5d ago
★ High Signal
OpenAI Codex expands from coding to full computer use with web workflows, multi-step planning, autonomous actions, and audio-visual processing for 3M+ weekly developers. Now handles PR reviews, multiple file/terminal views, SSH connections, and in-app browsing. Shift from code generation tool to general-purpose computer control agent.
Google's Agent-to-Agent Protocol reached 150+ organizations and production deployments in Azure AI Foundry and Amazon Bedrock AgentCore at 1-year milestone. v1.0 added Signed Agent Cards for cryptographic identity verification between agents; combined with IBM's merged Agent Communication Protocol and AP2 commerce extension, it now covers full lifecycle from tool access to delegation to payments.
📝 Blog 5d ago
★ High Signal
Claude Code discovered a 23-year-old remotely exploitable heap buffer overflow in Linux kernel's NFS driver, with five vulnerabilities confirmed. Linux maintainers report AI bug reports shifted from "slop to legitimate findings" about a month ago, with valid security reports increasing from 2-3/week to 5-10/day—marking a capability inflection point for AI-assisted vulnerability discovery.
Notion rebuilt Custom Agents 4-5 times before production launch due to early failures from lack of tool-calling standards, short context, and unreliable models. "Agent Lab" thesis: time the roadmap carefully, avoiding swimming upstream against model limitations while still building early enough. Practical lessons on when to ship agent features based on foundation model maturity.
Notion rebuilt Custom Agents 4-5 times before production, revealing early agent attempts failed due to lack of tool-calling standards and short context windows. Their 'Agent Lab' thesis focuses on building product systems around frontier capabilities, with coding agents viewed as the kernel of future 'software factories' comprising spec/code/test/review agents.
HiVLA decouples VLM semantic planning from motor control to preserve reasoning capabilities lost in end-to-end VLA fine-tuning. VLM planner generates subtask instructions with target bounding boxes, then flow-matching DiT translates grounded plans to physical actions for robotic manipulation.
r/LocalLLaMA consensus ranks Qwen 3.5 most broadly recommended, Gemma 4 showing strong buzz, GLM-5/4.7 near top of rankings, MiniMax M2.5/M2.7 for agentic workloads, DeepSeek V3.2 in top cluster. Qwen3-Coder-Next dominates for local coding. Community-driven practical guidance on deployed models.
Analysis of Claude Code's TypeScript source code and comparison with OpenClaw identifies five core human values (decision authority, safety, reliable execution, capability amplification, contextual adaptability) traced through thirteen design principles to implementation choices. The core architecture is a simple while-loop calling the model, running tools, and returning results—demonstrating how design philosophy shapes agentic system architecture.
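That while-loop can be sketched in a few lines; the `model`/`tools` interfaces below are illustrative, not Anthropic's actual TypeScript:

```python
def agent_loop(model, tools, prompt, max_steps=20):
    """Minimal sketch of the core loop: call the model, run any requested
    tool, feed the result back, repeat until a final answer.

    model maps a message list to ("tool", name, args) or ("answer", text);
    tools maps tool names to callables.
    """
    messages = [{"role": "user", "content": prompt}]
    for _ in range(max_steps):
        kind, *payload = model(messages)
        if kind == "answer":
            return payload[0]
        name, args = payload
        result = tools[name](**args)
        messages.append({"role": "tool", "name": name, "content": result})
    raise RuntimeError("step budget exhausted")
```

Everything else in the analysis (memory layers, navigation tools, subagents) is harness built around this loop rather than a change to it.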
VCR-Agent is a multi-agent framework that generates mechanistic action graphs to represent biological reasoning in virtual cells, enabling verification and falsification of LLM-generated explanations. The approach releases VC-TRACES, a dataset of verified biological mechanisms, addressing the challenge of factually grounded scientific explanations from LLMs in open-ended domains like biology.
Open-source AI agent system that automates startup idea validation from brainstorming through go-to-market strategy, powered by Claude, OpenAI, and Cursor. Targets developers seeking rapid validation in 10 minutes instead of months-long manual processes.
Curated collection of 50+ Claude Code skills, agents, and plugins organized by use case with recommendation ratings. Ready-to-use extensions for Claude-based development workflows.
Analysis of 1000+ OpenClaw deployments reveals minimal legitimate use cases beyond daily news digests, despite 250K GitHub stars and significant engineering investment. Users who spent weeks attempting production deployment found the tool connects to messaging apps and LLMs but lacks practical applications.
InhouseSEO releases 11 production-tested Claude skills for SEO workflows including page audits, E-E-A-T analysis, semantic gap detection, and article writing with anti-AI-slop generation rules. Built on methodology from industry practitioners Koray Tuğberk, Kyle Roof, and Lily Ray.
Asynkor provides file leasing coordination for AI agent teams via MCP server, preventing merge conflicts when multiple agents edit code. Works across IDEs without changing agent implementations.
Cloudflare integrates OpenAI's GPT-5.4 and Codex into Agent Cloud, enabling enterprises to build and deploy AI agents at scale. The partnership combines Cloudflare's infrastructure with OpenAI's latest models for production agentic workflows.
Gemma 4 26B MoE shows reluctance to use tools or web search, defaulting to internal knowledge and performing minimal searches even when explicitly requested. Community feedback on the model's agentic behavior highlights a gap between its strong benchmark results and its practical tool use.
Demonstrates that fairness can emerge as a property of multi-agent collaboration, potentially circumventing Arrow's impossibility theorem limitations in collective decision-making. This theoretical contribution suggests that distributed AI systems might achieve fair outcomes through collaboration mechanisms that single-agent or voting-based systems cannot.
KDnuggets recommends five books for building agentic AI systems, headlined by Chip Huyen's "AI Engineering" for its practical focus on production tradeoffs like latency vs. accuracy and cost vs. capability. The list targets practitioners shipping multi-agent orchestration, tool-calling, and memory management to production in 2026.
Simon Willison uses Claude Code to explore Servo v0.1.0 Rust crate, building CLI screenshot tool and investigating WebAssembly compilation autonomously. Demonstrates "agentic engineering" workflow where developer tasks AI with discovering library capabilities and building working tools. Evolution from code completion to exploratory development assistance.
SkillClaw enables collective skill evolution across multi-user LLM agent ecosystems by continuously aggregating interaction trajectories and autonomously refining skills via an agentic evolver, achieving 88% improvement after 6 rounds and +42.1% on real-world tasks. It enables cross-user knowledge transfer without additional user effort, solving the inefficiency where users repeatedly develop similar workflows independently.
Meta Muse Spark marks Meta's pivot from open-source to proprietary models, featuring multimodal perception, parallel subagent execution, and a contemplating mode. Built by Meta Superintelligence Labs, it offers competitive vision and language performance but lags in coding, representing Meta's first paid API model after Llama 4's poor reception.
MMLU and other 2024-dominant benchmarks now saturated (>95% on frontier models), relegated to "floor checks" rather than frontier separators. Frontier now decided by HLE, GPQA, MMLU-Pro, SWE-bench Pro, Terminal-Bench 2.0, BrowseComp for agentic tasks. Benchmark choice matters more than ever as academic standards become irrelevant for comparing top models.
🧠 DeepMind 2w ago
★ High Signal
Gemma 4 family (31B Dense, 26B MoE variants) released under Apache 2.0 with 256K context, native vision/audio, and competitive coding ELO jumping from 110 to 2150—a 20x improvement. The 31B model outperforms models 20x larger while enabling agentic skills on edge devices. First open-weights model family combining multimodal input, extended context, and elite coding performance at edge-deployable scale.
Simon Willison identifies November 2025 as the inflection point when AI coding agents crossed from 'mostly works' to 'actually works' with GPT-5.2 and Opus 4.5 releases. Discusses dark factories, automation timelines, agentic engineering, and his transition from traditional software engineering to AI-native development.
LLM multi-agent systems spontaneously develop power-law distributions in cognitive influence, forming "intellectual elites" where a small fraction of agents disproportionately shape collective decisions without explicit design. This emergent stratification mirrors human social dynamics and challenges assumptions about egalitarian multi-agent collaboration. Critical implications for fairness and reliability in decision-making systems.
Proactive Agent Research Environment simulates active users to evaluate AI assistants that anticipate needs and initiate actions rather than just responding to queries. Existing benchmarks lack realistic user simulation for testing proactive behaviors like timely suggestions and anticipatory information gathering. Bridges the gap between passive query-response evaluation and true assistant capabilities needed in high-stakes domains.
Leaked Claude Code source reveals three-layer memory architecture (file-read deduplication, structured session memory), dedicated repository navigation tools (Grep, Glob, LSP) instead of relying on model context, and forked subagents for parallelized background analysis. Demonstrates that coding agent performance stems from careful harness engineering around the model rather than just model intelligence alone.
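A minimal sketch of the first layer, file-read deduplication, assuming a (path, mtime, size) cache key; the leaked source's eviction policy and structured session memory are omitted here:

```python
import os

class FileReadCache:
    """Re-reading an unchanged file returns a short pointer instead of
    repeating its full contents in the model's context window."""

    def __init__(self):
        self._seen = {}

    def read(self, path):
        st = os.stat(path)
        key = (path, st.st_mtime_ns, st.st_size)  # changes when the file does
        if key in self._seen:
            return f"<already in context: {path}>"
        with open(path) as f:
            content = f.read()
        self._seen[key] = content
        return content
```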
Introduces heartbeat-driven metacognitive scheduling for LLM agents that learns when to activate cognitive modules (Planner, Critic, Recaller, Dreamer) from temporal patterns rather than hard-coded rules. First approach treating agent control as a learned scheduling problem, enabling proactive self-improving behavior through meta-learning from historical execution logs.
Interspeech 2026 challenge shifts audio AI evaluation from result-oriented to process-oriented reasoning quality using instance-level rubric-based evaluation. The champion agent integrated 40+ specialized audio tools, achieving a 69.83% Rubrics score. Emphasizes transparent reasoning over black-box performance metrics.
Comprehensive survey organizing agentic reasoning along three dimensions: foundational (planning, tool use, search), self-evolving (feedback, memory, adaptation), and collective multi-agent reasoning. Distinguishes in-context reasoning from post-training reasoning and provides unified taxonomy bridging thought and action across science, robotics, healthcare, and mathematics.