🍡 feedmeAI
← All topics
Agents 98 items

Everything Agents

📑 arXiv 1h ago

Beyond Task Completion: An Assessment Framework for Evaluating Agentic AI Systems

Framework evaluates agentic systems across LLM, Memory, Tools, and Environment dimensions using static analysis, dynamic monitoring, and judge-based evaluation to detect policy violations beyond task completion. Based on CloudOps production deployment where success metrics masked compliance failures. Addresses gap in current benchmarks that measure outcomes but not process adherence.

📑 arXiv 1h ago

GUIDE: Guided Updates for In-context Decision Evolution in LLM-Driven Spacecraft Operations

GUIDE separates lightweight acting model for real-time spacecraft control from offline reflection that updates a 'playbook' from prior trajectories, demonstrating LLMs can adapt operational strategies without weight updates in safety-critical domains. Shows context evolution in LLM agents functions as policy search over structured decision rules in deployment-constrained environments.

📑 arXiv 1h ago

Multi-Agent Reflexion (MAR): Diverse Reasoning Personas Improve LLM Agents

Multi-Agent Reflexion uses diverse reasoning personas with separate judge model to synthesize critiques, improving HotPotQA by 3 points and HumanEval by 6.2 points. Separates acting, diagnosing, critiquing, and aggregating to reduce shared blind spots in single-agent self-reflection. Addresses systematic limitation where solo agents repeat misconceptions without external correction signals.

🟢 OpenAI 1d ago

OpenAI Codex Major Update

OpenAI Codex expanded beyond coding to include computer use, web workflows, image generation, memory, and automations. The updated developer app adds PR reviews, multi-file/terminal viewing, SSH devbox connections, and in-app browsing, serving 3+ million developers weekly.

💬 Reddit 2d ago

Qwen 3.6 35B crushes Gemma 4 26B on my tests

User benchmark comparing Qwen 3.6 35B against Gemma 4 26B on 30k-line codebase with 37 intentional bugs and PDF analysis tasks shows Qwen significantly outperforming across agentic capabilities, coding, image-to-text, instruction following, and reasoning. Both models tested at Q4_K_XL quantization for fair comparison.

📑 arXiv 2d ago

ChemGraph-XANES: An Agentic Framework for XANES Simulation and Analysis

ChemGraph-XANES automates X-ray absorption near-edge structure simulation workflows using a LangGraph/LangChain-based agentic framework that handles natural-language task specification, structure acquisition, FDMNES execution, and provenance-aware data curation. Built on ASE, FDMNES, and Parsl, it addresses workflow complexity constraints that limit computational XANES deployment at scale.

📑 arXiv 2d ago

MARCH: Multi-Agent Radiology Clinical Hierarchy for CT Report Generation

MARCH emulates the professional hierarchy of radiology departments using a multi-agent framework with specialized roles: a Resident Agent for initial drafting, Fellow Agents for retrieval-augmented revision, and an Attending Agent orchestrating iterative consensus. The approach addresses clinical hallucinations and lack of verification in automated 3D CT report generation by mimicking collaborative clinical workflows.

📑 arXiv 2d ago

The Relic Condition: When Published Scholarship Becomes Material for Its Own Replacement

Extracted the scholarly reasoning systems of two prominent humanities scholars from published corpora, converted them into structured inference-time constraints for LLMs, and tested whether resulting scholar-bots could perform doctoral supervision, peer review, and lecturing at expert quality. Expert assessment found outputs met appointment-level quality standards, raising questions about knowledge work automation from public scholarship alone.

📑 arXiv 2d ago

Veritas-RPM: Provenance-Guided Multi-Agent False Positive Suppression for Remote Patient Monitoring

Veritas-RPM uses a five-layer multi-agent architecture (ground-truth assembly, anomaly detection, specialist routing, domain specialists, and conflict resolution) to suppress false positives in remote patient monitoring. Evaluated on 530 synthetic patient epochs across 98 documented false-positive scenarios, it reports True Suppression Rate, False Escalation Rate, and Indeterminate Rate metrics.

📑 arXiv 2d ago

DiZiNER: Disagreement-guided Instruction Refinement via Pilot Annotation Simulation for Zero-shot Named Entity Recognition

DiZiNER simulates pilot annotation processes where multiple heterogeneous LLMs act as annotators and supervisors to refine instructions for zero-shot NER. The framework identifies systematic errors by generating disagreements between models, mirroring how human annotation resolves inconsistencies to improve zero-shot performance toward supervised baselines.

📑 arXiv 2d ago

Discover and Prove: An Open-source Agentic Framework for Hard Mode Automated Theorem Proving in Lean 4

Discover And Prove (DAP) introduces 'Hard Mode' automated theorem proving where systems must independently discover answers before constructing formal proofs, unlike standard benchmarks that embed answers in statements. Releases MiniF2F-Hard and FIMO-Hard benchmarks with expert reannotations, and an agentic framework using LLM natural-language reasoning with self-reflection for answer discovery.

📝 Blog 3d ago

Speculative Decoding Shines for Agentic Use Cases

Speculative decoding uses a smaller draft model to generate candidate tokens that a larger target model validates in a single pass, providing significant speedup for agentic workloads heavy on tool calls and structured outputs without quality loss. Cloudflare reports this is particularly effective for coding agents and API integration tasks where tool calling volume is high.

🐙 GitHub 3d ago

yzhao062/anywhere-agents: One config to rule all your AI agents: portable (every project, every session), effective (curated writing, routing, skills), and safer (destructive-command guard).

Anywhere-agents is a configuration management tool for AI agents emphasizing portability across projects, curated writing/routing/skills capabilities, and safety via destructive-command guards. Single config approach unifies agent behavior management. Addresses agent configuration consistency and safety concerns.

📑 arXiv 3d ago

CoopEval: Benchmarking Cooperation-Sustaining Mechanisms and LLM Agents in Social Dilemmas

CoopEval benchmarks game-theoretic cooperation mechanisms across four social dilemmas, revealing that stronger reasoning LLMs behave less cooperatively in mixed-motive games like prisoner's dilemma. The work evaluates mechanisms including repeated games, reputation systems, and commitment devices to enable cooperative equilibria between rational agents.

📑 arXiv 3d ago

Agentic Microphysics: A Manifesto for Generative AI Safety

Proposes "agentic microphysics" methodology for analyzing safety risks that emerge from structured interactions between AI agents rather than individual model behavior. The framework bridges the gap between single-agent analysis and aggregate outcomes by focusing on communication, observation, and mutual influence mechanisms that drive population-level risks.

📑 arXiv 3d ago

Meituan Merchant Business Diagnosis via Policy-Guided Dual-Process User Simulation

Meituan introduces Policy-Guided Hybrid Simulation (PGHS), a dual-process framework that simulates group-level user behavior for merchant strategy evaluation by mining transferable decision policies from behavioral trajectories. The approach addresses information incompleteness and mechanism duality by anchoring an LLM-based reasoning branch with behavioral policies to prevent over-rationalization. This enables scalable counterfactual evaluation without costly online experiments.

📑 arXiv 3d ago
★ High Signal

Scepsy: Serving Agentic Workflows Using Aggregate LLM Pipelines

Scepsy is a serving system for multi-LLM agentic workflows that schedules arbitrary agent frameworks onto GPU clusters under oversubscription. It exploits the observation that while end-to-end workflow latencies are unpredictable, the relative execution time shares of each LLM remain stable across runs. Enables efficient serving of complex agentic workflows at target throughput with low latency.

📑 arXiv 3d ago

Agent-Aided Design for Dynamic CAD Models

Agent-Aided Design systems use LLMs in a feedback loop to write CAD code, compile models, visualize results, and iteratively refine designs, but cannot yet generate complex 3D assemblies with moving parts like pistons or scissors. This work identifies the capability gap preventing these training-free agentic systems from impacting industrial manufacturing. Addresses the transition from static CAD objects to dynamic mechanical assemblies.

📑 arXiv 3d ago

OpenMobile: Building Open Mobile Agents with Task and Trajectory Synthesis

OpenMobile is an open-source framework for synthesizing high-quality mobile agent task instructions and trajectories, achieving nearly 70% success on AndroidWorld. Features scalable task synthesis using global environment memory and policy-switching strategy alternating between learner and expert models during trajectory rollout. Makes training recipes transparent unlike closed leading models.

📑 arXiv 3d ago

Atropos: Improving Cost-Benefit Trade-off of LLM-based Agents under Self-Consistency with Early Termination and Model Hotswap

Atropos optimizes cost-benefit trade-offs for LLM agents using self-consistency by predicting when to terminate cheaper Small Language Model inference early and hotswap to larger commercial models. The system analyzes structural properties of inference paths merged into graphs to decide when local SLMs suffice versus when expensive API calls are needed.

📑 arXiv 3d ago

Autogenesis: A Self-Evolving Agent Protocol

Autogenesis Protocol (AGP) standardizes self-evolving agent systems by modeling prompts, agents, tools, environments, and memory as protocol-registered resources with lifecycle management and version tracking. The Resource Substrate Protocol Layer decouples what evolves from how evolution occurs, addressing brittleness in existing protocols like A2A and MCP.

💬 Reddit 3d ago

Qwen3.6-35B-A3B released!

Qwen3.6-35B-A3B is a sparse MoE model with 35B total parameters and 3B active, released under Apache 2.0. The model matches agentic coding performance of models 10x its active size and includes multimodal perception with thinking and non-thinking modes.

🐙 GitHub 3d ago

TheArcForge/UniClaude: Claude Code, natively inside Unity Editor. A dockable chat window with full project awareness, 60+ MCP tools, and zero alt-tabbing.

UniClaude integrates Claude directly into Unity Editor as a dockable window with full project context awareness and 60+ MCP tools. Eliminates context switching during game development by embedding the AI assistant natively in the IDE. Provides workflow-specific tooling for game developers working in Unity.

🟢 OpenAI 3d ago

Codex for (almost) everything

OpenAI's Codex app for macOS and Windows now includes computer use capabilities, in-app browsing, image generation, memory, and plugins. The update transforms Codex from a code-focused assistant into a multi-capability developer productivity platform.

🐙 GitHub 3d ago

GainSec/AutoProber: Hardware hacker’s flying probe automation stack for agent-driven target discovery, microscope mapping, safety-monitored CNC motion, probe review, and controlled pin probing.

Agent-driven hardware reverse engineering automation stack controlling flying probe systems for PCB analysis. Combines target discovery, microscope mapping, safety-monitored CNC motion, probe review, and controlled pin probing. Demonstrates AI agents extending beyond software into physical hardware hacking workflows.

🔶 Anthropic 4d ago
★ High Signal

Claude Opus 4.7 - Major Model Release

Claude Opus 4.7 delivers 13% improvement on coding benchmarks with enhanced vision for higher-resolution images and new effort controls/task budgets for autonomous development. Powers upgraded Claude Code review tools for long-running software engineering tasks. Introduces task-level resource management for extended autonomous coding workflows.

🐙 GitHub 4d ago

GitHub Copilot Adds Claude Opus 4.7

GitHub Copilot adding Claude Opus 4.7 with stronger multi-step task performance and more reliable agentic execution. Launches with promotional 7.5× premium request multiplier until April 30th, replacing Opus 4.5 and 4.6 for Copilot Pro+ users.

🤗 Hugging Face 4d ago

Don't Retrieve, Navigate: Distilling Enterprise Knowledge into Navigable Agent Skills for QA and RAG

Corpus2Skill distills document corpora into hierarchical skill directories that LLM agents navigate rather than passively retrieve, addressing RAG's limitation of treating models as passive consumers. The system clusters documents offline into a navigable tree with LLM-written summaries at each level, giving agents a bird's-eye corpus view for better evidence synthesis.

🤗 Hugging Face 4d ago

DR^{3}-Eval: Towards Realistic and Reproducible Deep Research Evaluation

DR³-Eval provides a reproducible benchmark for deep research agents using static research sandbox corpora paired with authentic user tasks, measuring multimodal report generation across dimensions including information recall, factual accuracy, and citation coverage. It addresses the challenge of evaluating long-horizon research tasks by simulating open-web complexity while remaining fully verifiable.

🤗 Hugging Face 4d ago

MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation

MM-WebAgent is a hierarchical agentic framework for multimodal webpage generation that coordinates AIGC-based element generation through hierarchical planning and iterative self-reflection. Jointly optimizes global layout, local multimodal content, and their integration to produce coherent and visually consistent webpages, addressing style inconsistency in isolated element generation.

🤗 Hugging Face 4d ago

RadAgent: A tool-using AI agent for stepwise interpretation of chest computed tomography

RadAgent is a tool-using AI agent for chest CT interpretation that generates reports through a stepwise, interpretable process with fully inspectable traces of intermediate decisions and tool interactions. Improves on CT-Chat VLM baseline across three dimensions while allowing clinicians to examine how findings are derived rather than being passive observers.

🐙 GitHub 4d ago

mikepapadim/london-property-hunt-public: Automated London flat/room hunt powered by Claude Code + Claude in Chrome + Gmail MCP. Scrapes 4 rental platforms on a cron, deduplicates via spreadsheet, prioritises HIGH/MED/LOW, and emails ready-to-send outreach.

Automated London rental property hunting system combining Claude Code, Claude in Chrome, and Gmail MCP. Scrapes four rental platforms on cron, deduplicates via spreadsheet, prioritizes listings as HIGH/MED/LOW, and generates ready-to-send outreach emails. Demonstrates practical agent orchestration for real-world automation tasks.

🔶 Anthropic 5d ago

Anthropic Claude Code Desktop App Redesign

Anthropic redesigned Claude Code desktop app with parallel session management sidebar, integrated terminal, in-app file editor, and Routines—automation running on schedules, API calls, or GitHub events without active sessions. Available for Pro, Max, Team, and Enterprise users on macOS and Windows.

🟢 OpenAI 5d ago
★ High Signal

OpenAI Agents SDK Evolution with Native Sandbox Execution

OpenAI's Agents SDK update adds native sandbox execution and model-native harness for building production-grade agents with improved safety and execution isolation. Represents a shift from experimental prototypes to production-ready agentic workflows with support for long-running agents working across files and tools.

🟢 OpenAI 5d ago
★ High Signal

OpenAI Codex Major Update - Expanded Computer Use

OpenAI Codex expands from coding to full computer use with web workflows, multi-step planning, autonomous actions, and audio-visual processing for 3M+ weekly developers. Now handles PR reviews, multiple file/terminal views, SSH connections, and in-app browsing. Shift from code generation tool to general-purpose computer control agent.

📝 Blog 5d ago

AI Weekly: Agent-to-Agent Protocol Hits 1-Year Anniversary with 150+ Organizations

Google's Agent-to-Agent Protocol reached 150+ organizations and production deployments in Azure AI Foundry and Amazon Bedrock AgentCore at 1-year milestone. v1.0 added Signed Agent Cards for cryptographic identity verification between agents; combined with IBM's merged Agent Communication Protocol and AP2 commerce extension, it now covers full lifecycle from tool access to delegation to payments.

📝 Blog 5d ago
★ High Signal

Claude Code Used to Find 23-Year-Old Linux Kernel Vulnerability

Claude Code discovered a 23-year-old remotely exploitable heap buffer overflow in Linux kernel's NFS driver, with five vulnerabilities confirmed. Linux maintainers report AI bug reports shifted from "slop to legitimate findings" about a month ago, with valid security reports increasing from 2-3/week to 5-10/day—marking a capability inflection point for AI-assisted vulnerability discovery.

📝 Blog 5d ago

Latent Space: Notion Custom Agents - Building Production AI

Notion rebuilt Custom Agents 4-5 times before production launch due to early failures from lack of tool-calling standards, short context, and unreliable models. "Agent Lab" thesis: time roadmap carefully to avoid swimming upstream against model limitations while building early enough. Practical lessons on when to ship agent features based on foundation model maturity.

📝 Blog 5d ago

Latent Space: Notion's Journey Building Custom AI Agents

Notion rebuilt Custom Agents 4-5 times before production, revealing early agent attempts failed due to lack of tool-calling standards and short context windows. Their 'Agent Lab' thesis focuses on building product systems around frontier capabilities, with coding agents viewed as the kernel of future 'software factories' comprising spec/code/test/review agents.

🤗 Hugging Face 6d ago

Dive into Claude Code: The Design Space of Today's and Future AI Agent Systems

Analysis of Claude Code's TypeScript source code and comparison with OpenClaw identifies five core human values (decision authority, safety, reliable execution, capability amplification, contextual adaptability) traced through thirteen design principles to implementation choices. The core architecture is a simple while-loop calling the model, running tools, and returning results—demonstrating how design philosophy shapes agentic system architecture.

🤗 Hugging Face 6d ago

Towards Autonomous Mechanistic Reasoning in Virtual Cells

VCR-Agent is a multi-agent framework that generates mechanistic action graphs to represent biological reasoning in virtual cells, enabling verification and falsification of LLM-generated explanations. The approach releases VC-TRACES, a dataset of verified biological mechanisms, addressing the challenge of factually grounded scientific explanations from LLMs in open-ended domains like biology.

🐙 GitHub 6d ago

MaxKmet/idea-validation-agents: AI agents that act as your personal venture analyst - from startup idea brainstorming to full validation and go-to-market strategy. Built for developers who'd rather validate in 10 minutes than regret in six months. Powered by Claude Code, OpenAI Codex, and Cursor.

Open-source AI agent system that automates startup idea validation from brainstorming through go-to-market strategy, powered by Claude, OpenAI, and Cursor. Targets developers seeking rapid validation in 10 minutes instead of months-long manual processes.

🐙 GitHub 6d ago

inhouseseo/superseo-skills: 11 Claude skills for SEO: page audits, linkbuilding, article writing, E-E-A-T audits, semantic gap analysis, link building. Methodology from Koray Tuğberk, Kyle Roof, and Lily Ray, plus a generation-time anti-AI-slop ruleset. Production-tested at InhouseSEO

InhouseSEO releases 11 production-tested Claude skills for SEO workflows including page audits, E-E-A-T analysis, semantic gap detection, and article writing with anti-AI-slop generation rules. Built on methodology from industry practitioners Koray Tuğberk, Kyle Roof, and Lily Ray.

💬 Reddit 6d ago

Gemma 4 - lazy model or am I crazy? (bit of a rant)

Gemma 4 26B MoE shows reluctance to use tools or web search, defaulting to internal knowledge and performing minimal searches when explicitly requested. Community feedback on model's agentic capabilities despite strong benchmarks. Highlights gap between stated capabilities and practical tool use.

✍️ Simon Willison 1w ago

Simon Willison: Exploring the Servo Crate with Claude Code

Simon Willison uses Claude Code to explore Servo v0.1.0 Rust crate, building CLI screenshot tool and investigating WebAssembly compilation autonomously. Demonstrates "agentic engineering" workflow where developer tasks AI with discovering library capabilities and building working tools. Evolution from code completion to exploratory development assistance.

📑 arXiv 1w ago

SkillClaw: Let Skills Evolve Collectively with Agentic Evolver

SkillClaw enables collective skill evolution across multi-user LLM agent ecosystems by continuously aggregating interaction trajectories and autonomously refining skills via an agentic evolver, achieving 88% improvement after 6 rounds and +42.1% on real-world tasks. It enables cross-user knowledge transfer without additional user effort, solving the inefficiency where users repeatedly develop similar workflows independently.

🧠 DeepMind 2w ago
★ High Signal

Google Gemma 4 - Open Model Family Release

Gemma 4 family (31B Dense, 26B MoE variants) released under Apache 2.0 with 256K context, native vision/audio, and competitive coding ELO jumping from 110 to 2150—a 20x improvement. The 31B model outperforms models 20x larger while enabling agentic skills on edge devices. First open-weights model family combining multimodal input, extended context, and elite coding performance at edge-deployable scale.

📝 Blog 2w ago

Simon Willison on Lenny's Podcast: AI State of the Union

Simon Willison identifies November 2025 as the inflection point when AI coding agents crossed from 'mostly works' to 'actually works' with GPT-5.2 and Opus 4.5 releases. Discusses dark factories, automation timelines, agentic engineering, and his transition from traditional software engineering to AI-native development.

📑 arXiv 2w ago

Do Agent Societies Develop Intellectual Elites? The Hidden Power Laws of Collective Cognition in LLM Multi-Agent Systems

LLM multi-agent systems spontaneously develop power-law distributions in cognitive influence, forming "intellectual elites" where a small fraction of agents disproportionately shape collective decisions without explicit design. This emergent stratification mirrors human social dynamics and challenges assumptions about egalitarian multi-agent collaboration. Critical implications for fairness and reliability in decision-making systems.

📑 arXiv 2w ago

Proactive Agent Research Environment: Simulating Active Users to Evaluate Proactive Assistants

Proactive Agent Research Environment simulates active users to evaluate AI assistants that anticipate needs and initiate actions rather than just responding to queries. Existing benchmarks lack realistic user simulation for testing proactive behaviors like timely suggestions and anticipatory information gathering. Bridges the gap between passive query-response evaluation and true assistant capabilities needed in high-stakes domains.

📝 Blog 2w ago

Claude Code Architectural Leak Reveals Three-Layer Memory System and Tool Design

Leaked Claude Code source reveals three-layer memory architecture (file-read deduplication, structured session memory), dedicated repository navigation tools (Grep, Glob, LSP) instead of relying on model context, and forked subagents for parallelized background analysis. Demonstrates that coding agent performance stems from careful harness engineering around the model rather than just model intelligence alone.

📑 arXiv 3w ago

Simulating Human Cognition: Heartbeat-Driven Autonomous Thinking Activity Scheduling for LLM-based AI systems

Introduces heartbeat-driven metacognitive scheduling for LLM agents that learns when to activate cognitive modules (Planner, Critic, Recaller, Dreamer) from temporal patterns rather than hard-coded rules. First approach treating agent control as a learned scheduling problem, enabling proactive self-improving behavior through meta-learning from historical execution logs.

📑 arXiv Jan 18

Agentic Reasoning for Large Language Models

Comprehensive survey organizing agentic reasoning along three dimensions: foundational (planning, tool use, search), self-evolving (feedback, memory, adaptation), and collective multi-agent reasoning. Distinguishes in-context reasoning from post-training reasoning and provides unified taxonomy bridging thought and action across science, robotics, healthcare, and mathematics.