Framework evaluates agentic systems across LLM, Memory, Tools, and Environment dimensions using static analysis, dynamic monitoring, and judge-based evaluation to detect policy violations beyond task completion. Based on CloudOps production deployment where success metrics masked compliance failures. Addresses gap in current benchmarks that measure outcomes but not process adherence.
Multi-agent LLM systems spontaneously develop power law distributions in knowledge and influence, mirroring human intellectual hierarchies. Agent societies exhibit emergent specialization and social stratification. First empirical evidence of collective social dynamics beyond individual agent capabilities.
GUIDE separates lightweight acting model for real-time spacecraft control from offline reflection that updates a 'playbook' from prior trajectories, demonstrating LLMs can adapt operational strategies without weight updates in safety-critical domains. Shows context evolution in LLM agents functions as policy search over structured decision rules in deployment-constrained environments.
Multi-Agent Reflexion uses diverse reasoning personas with separate judge model to synthesize critiques, improving HotPotQA by 3 points and HumanEval by 6.2 points. Separates acting, diagnosing, critiquing, and aggregating to reduce shared blind spots in single-agent self-reflection. Addresses systematic limitation where solo agents repeat misconceptions without external correction signals.
Neurosymbolic architecture grounds AI agents in domain ontologies for regulated industries, evaluated across 600 runs in 5 sectors including Vietnamese-language domains. Ensures agent reasoning aligns with compliance requirements and domain constraints. Bridges symbolic knowledge representation with neural reasoning for safety-critical enterprise deployment.
SkillClaw enables LLM agent skills to continuously evolve through collective cross-user interaction experiences via an autonomous 'agentic evolver' that refines and updates skills, achieving +42.1% improvement. Treats agent capabilities as living artifacts that improve through collective use rather than static functions, representing a shift toward learning agent ecosystems.
Community discussion on replacing Claude Opus with Qwen-35B-A3B for coding agent workflows on M5 Max hardware. Users weighing Opus's reasoning edge against Qwen's local deployment and cost benefits for daily development tasks.
OpenAI Codex expanded beyond coding to include computer use, web workflows, image generation, memory, and automations. The updated developer app adds PR reviews, multi-file/terminal viewing, SSH devbox connections, and in-app browsing, serving 3+ million developers weekly.
Benchmark comparing Claude and Gemini on the laden knight's tour problem, a weighted variant requiring optimal pathfinding with accumulating costs. Tests coding agents on combinatorial optimization task combining movement constraints with dynamic cost calculation.
Berkeley researchers achieved near-perfect scores on major AI agent benchmarks (SWE-bench, WebArena, FieldWorkArena, Terminal-Bench) without solving tasks, using exploits ranging from trivial to sophisticated. Exposes that evaluations weren't designed to resist systems optimizing for scores rather than actual task completion.
Production LLM deployments span automated bureaucracy monitoring (extracting structured data from German government sites), multi-agent sales automation with 8 sub-agents and critic loops, and corporate knowledge RAG using Qdrant+LlamaIndex. Key insight: LLMs enable processing unstructured data at scale previously impossible.
User benchmark comparing Qwen 3.6 35B against Gemma 4 26B on 30k-line codebase with 37 intentional bugs and PDF analysis tasks shows Qwen significantly outperforming across agentic capabilities, coding, image-to-text, instruction following, and reasoning. Both models tested at Q4_K_XL quantization for fair comparison.
ChemGraph-XANES automates X-ray absorption near-edge structure simulation workflows using a LangGraph/LangChain-based agentic framework that handles natural-language task specification, structure acquisition, FDMNES execution, and provenance-aware data curation. Built on ASE, FDMNES, and Parsl, it addresses workflow complexity constraints that limit computational XANES deployment at scale.
MARCH emulates the professional hierarchy of radiology departments using a multi-agent framework with specialized roles: a Resident Agent for initial drafting, Fellow Agents for retrieval-augmented revision, and an Attending Agent orchestrating iterative consensus. The approach addresses clinical hallucinations and lack of verification in automated 3D CT report generation by mimicking collaborative clinical workflows.
Extracted the scholarly reasoning systems of two prominent humanities scholars from published corpora, converted them into structured inference-time constraints for LLMs, and tested whether resulting scholar-bots could perform doctoral supervision, peer review, and lecturing at expert quality. Expert assessment found outputs met appointment-level quality standards, raising questions about knowledge work automation from public scholarship alone.
Veritas-RPM uses a five-layer multi-agent architecture (ground-truth assembly, anomaly detection, specialist routing, domain specialists, and conflict resolution) to suppress false positives in remote patient monitoring. Evaluated on 530 synthetic patient epochs across 98 documented false-positive scenarios, it reports True Suppression Rate, False Escalation Rate, and Indeterminate Rate metrics.
AstroVLM is a multi-agent VLM system for diagnosing quality issues in astronomical imaging by handling complex underlying correlations across multidisciplinary subtasks. It addresses the time-intensive manual effort NASA and expert astronomers invest in quality diagnosis and error localization during the imaging process.
SocialGrid is an Among Us-inspired benchmark evaluating LLM agents on planning, task execution, and social reasoning in embodied multi-agent settings. Even GPT-OSS-120B achieves below 60% accuracy, with agents stuck in repetitive behaviors—revealing social reasoning remains a bottleneck even with planning assistance.
Qwen3.6-35B-UD at 2-bit K_XL quantization achieves 98.3% tool call success rate across 58 calls while processing 2.7M tokens on 16GB VRAM. Successfully converts research papers to web applications using llama.cpp on consumer laptop hardware. Demonstrates extreme quantization can maintain performance on complex multi-step tasks.
Agentic Verifier transforms reward modeling into multi-turn, tool-augmented deliberation using complementary forward and backward agents. Addresses error propagation and lack of grounding in complex domains by tracing solutions from premises to conclusions and re-checking conclusions against premises for comprehensive verification.
WORC (Weak-link Optimization for Reasoning and Collaboration) improves multi-agent LLM frameworks by systematically identifying and reinforcing performance-limiting agents rather than only enhancing high-capability agents. Addresses reasoning instability where individual agent errors amplify through collaboration, grounded in the weak-link principle.
Survey categorizing graph-LLM integration methods by purpose (reasoning, retrieval, generation, recommendation), graph modality (knowledge graphs, scene graphs, causal graphs), and integration strategy (prompting, augmentation, training, agent-based). Provides clarity on when and what types of graph representations enhance LLM capabilities.
Experience Compression Spectrum unifies agent memory, skills, and rules as points along a compression axis (5-20× for memory, 50-500× for skills, 1000×+ for rules). Framework addresses the critical bottleneck of managing accumulated experience in long-horizon, multi-session LLM agent deployments by reducing context consumption and retrieval latency.
DiZiNER simulates pilot annotation processes where multiple heterogeneous LLMs act as annotators and supervisors to refine instructions for zero-shot NER. The framework identifies systematic errors by generating disagreements between models, mirroring how human annotation resolves inconsistencies to improve zero-shot performance toward supervised baselines.
CoEvolve is an agent-data mutual evolution framework enabling LLM agents to improve through closed-loop, interaction-driven training. Extracts feedback signals like forgetting and uncertainty to identify failure-prone patterns, then uses LLM-based task synthesis to adapt the training data distribution alongside the agent.
Discover And Prove (DAP) introduces 'Hard Mode' automated theorem proving where systems must independently discover answers before constructing formal proofs, unlike standard benchmarks that embed answers in statements. Releases MiniF2F-Hard and FIMO-Hard benchmarks with expert reannotations, and an agentic framework using LLM natural-language reasoning with self-reflection for answer discovery.
📝 Blog 3d ago
★ High Signal
NVIDIA's Nemotron 3 Super is a 120B/12B-active MoE model with hybrid Mamba-Attention architecture scoring 60.47% on SWE-Bench Verified—the highest open-weight coding score at launch. Features 1M context, 2.2x throughput improvement, and native speculative decoding for efficient agentic reasoning.
Speculative decoding uses a smaller draft model to generate candidate tokens that a larger target model validates in a single pass, providing significant speedup for agentic workloads heavy on tool calls and structured outputs without quality loss. Cloudflare reports this is particularly effective for coding agents and API integration tasks where tool calling volume is high.
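A minimal sketch of the idea, with toy deterministic functions standing in for the draft and target models (a real implementation batches the verification of all draft tokens into a single target forward pass):

```python
def draft_model(prefix, k):
    """Toy stand-in for the small draft model: propose k candidate tokens."""
    return [(prefix[-1] + i + 1) % 100 for i in range(k)]

def target_next(ctx):
    """Toy stand-in for the large target model's greedy next token."""
    return (ctx[-1] + 1) % 100

def speculative_step(prefix, k=4):
    """One decode step: accept the draft's tokens until the first mismatch.

    If every draft token matches, the target's verification pass yields one
    extra 'bonus' token, so k+1 tokens cost a single target pass.
    """
    draft = draft_model(prefix, k)
    accepted, ctx = [], list(prefix)
    for tok in draft:
        expected = target_next(ctx)       # in practice: one batched forward pass
        if tok != expected:
            accepted.append(expected)     # replace first mismatch, stop
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next(ctx))     # all accepted: free bonus token
    return accepted
```

With these toy models the draft always agrees with the target, so each step emits k+1 tokens, e.g. `speculative_step([0])` yields five tokens for one simulated target pass.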
Qwen 3.6 introduces a preserve_thinking flag that prevents KV cache invalidation by maintaining reasoning context across turns. This improves cache reuse in agent scenarios, reduces token consumption from redundant reasoning, and fixes a template issue that caused cache invalidation in Qwen 3.5.
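A schematic illustration of why the flag matters for caching (invented token strings, not Qwen's actual template): the KV cache is reusable only up to the first position where the new prompt diverges from the cached sequence, so stripping the reasoning block between turns discards almost the entire prefix.

```python
def reusable_prefix(cached, new):
    """Count leading tokens shared with the cached sequence (reusable KV entries)."""
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n

# Turn 1, schematically tokenized: system, user, <think>...</think>, answer
turn1 = ["sys", "u1", "<think>", "t1", "</think>", "a1"]

# Turn 2 with thinking stripped (the pre-3.6 template behavior): early divergence
stripped = ["sys", "u1", "a1", "u2"]
# Turn 2 with preserve_thinking: the whole first turn remains a shared prefix
preserved = turn1 + ["u2"]

assert reusable_prefix(turn1, stripped) == 2   # cache mostly invalidated
assert reusable_prefix(turn1, preserved) == 6  # full reuse of turn 1
```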
Command-line tool claims to accelerate Android app development 3x when used with AI coding agents. Streamlines agent-based mobile development workflows.
MM-WebAgent uses hierarchical planning and iterative self-reflection to coordinate AIGC tools for webpage generation, jointly optimizing layout, multimodal content, and integration. Solves style inconsistency problems in prior approaches that generate visual elements independently, introducing a new multimodal webpage generation benchmark.
Anywhere-agents is a configuration management tool for AI agents emphasizing portability across projects, curated writing/routing/skills capabilities, and safety via destructive-command guards. Single config approach unifies agent behavior management. Addresses agent configuration consistency and safety concerns.
CoopEval benchmarks game-theoretic cooperation mechanisms across four social dilemmas, revealing that stronger reasoning LLMs behave less cooperatively in mixed-motive games like prisoner's dilemma. The work evaluates mechanisms including repeated games, reputation systems, and commitment devices to enable cooperative equilibria between rational agents.
Proposes "agentic microphysics" methodology for analyzing safety risks that emerge from structured interactions between AI agents rather than individual model behavior. The framework bridges the gap between single-agent analysis and aggregate outcomes by focusing on communication, observation, and mutual influence mechanisms that drive population-level risks.
Blue's Data Intelligence Layer orchestrates agents across multi-source, multi-modal data beyond single-database NL2SQL. Addresses iterative queries, heterogeneous data sources, and external knowledge requirements in enterprise compound AI systems.
RadAgent generates chest CT reports through stepwise tool use, with fully inspectable reasoning traces for clinical validation. The tool-augmented agent improves over the 3D VLM baseline CT-Chat on clinical accuracy, groundedness, and radiologist efficiency across three evaluation dimensions.
Meituan introduces Policy-Guided Hybrid Simulation (PGHS), a dual-process framework that simulates group-level user behavior for merchant strategy evaluation by mining transferable decision policies from behavioral trajectories. The approach addresses information incompleteness and mechanism duality by anchoring an LLM-based reasoning branch with behavioral policies to prevent over-rationalization. This enables scalable counterfactual evaluation without costly online experiments.
📑 arXiv 3d ago
★ High Signal
Scepsy is a serving system for multi-LLM agentic workflows that schedules arbitrary agent frameworks onto GPU clusters under oversubscription. It exploits the observation that while end-to-end workflow latencies are unpredictable, the relative execution time shares of each LLM remain stable across runs. Enables efficient serving of complex agentic workflows at target throughput with low latency.
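A minimal sketch of how stable per-stage time shares could translate into a cluster allocation; the share numbers and the largest-remainder rounding are illustrative assumptions, not Scepsy's actual scheduler:

```python
def allocate_gpus(shares, total_gpus):
    """Give each LLM stage GPUs proportional to its measured share of
    end-to-end workflow time, rounding so the total matches the cluster."""
    total = sum(shares.values())
    raw = {m: total_gpus * s / total for m, s in shares.items()}
    alloc = {m: int(r) for m, r in raw.items()}
    leftover = total_gpus - sum(alloc.values())
    # hand remaining GPUs to the stages with the largest fractional remainders
    for m in sorted(raw, key=lambda m: raw[m] - alloc[m], reverse=True)[:leftover]:
        alloc[m] += 1
    return alloc

# Relative time shares measured from prior runs (hypothetical numbers)
shares = {"planner": 0.2, "coder": 0.5, "critic": 0.3}
allocation = allocate_gpus(shares, 8)  # e.g. coder gets the largest slice
```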
Agent-Aided Design systems use LLMs in a feedback loop to write CAD code, compile models, visualize results, and iteratively refine designs, but cannot yet generate complex 3D assemblies with moving parts like pistons or scissors. This work identifies the capability gap preventing these training-free agentic systems from impacting industrial manufacturing. Addresses the transition from static CAD objects to dynamic mechanical assemblies.
OpenMobile is an open-source framework for synthesizing high-quality mobile agent task instructions and trajectories, achieving nearly 70% success on AndroidWorld. Features scalable task synthesis using global environment memory and policy-switching strategy alternating between learner and expert models during trajectory rollout. Makes training recipes transparent unlike closed leading models.
LLM agents autonomously evolve the ABC logic synthesis codebase by rewriting sub-components while preserving its single-binary execution model. The self-evolving framework operates on the entire integrated codebase and bootstraps using existing open-source synthesis components before iteratively improving through agent-driven code evolution.
Scoping review of 23 studies reveals fairness in multi-agent AI systems is superficially addressed, lacks normative foundations, and overlooks agent autonomy dynamics. Authors argue fairness must be embedded structurally throughout MAAI development lifecycles rather than added post-hoc, addressing gaps in an increasingly important but understudied area.
Atropos optimizes cost-benefit trade-offs for LLM agents using self-consistency by predicting when to terminate cheaper Small Language Model inference early and hotswap to larger commercial models. The system analyzes structural properties of inference paths merged into graphs to decide when local SLMs suffice versus when expensive API calls are needed.
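A simplified sketch of the escalation policy, using raw answer-vote agreement in place of Atropos's analysis of merged inference-path graphs; the `slm_sample`/`llm_answer` interface is hypothetical:

```python
from collections import Counter

def answer_with_escalation(question, slm_sample, llm_answer,
                           n=5, agreement=0.6):
    """Run self-consistency on the cheap local model first; hotswap to the
    large commercial model only when the samples fail to converge."""
    samples = [slm_sample(question) for _ in range(n)]
    answer, votes = Counter(samples).most_common(1)[0]
    if votes / n >= agreement:
        return answer, "slm"            # consensus reached: no API call needed
    return llm_answer(question), "llm"  # unstable: escalate to the big model
```

The `agreement` threshold is the cost-benefit knob: raising it trades more API spend for fewer wrong local answers.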
CoGrid is a multi-agent grid simulation library with NumPy and JAX backends, paired with Multi-User Gymnasium (MUG) that converts simulations into interactive web experiments. The tools lower barriers for researchers studying human-AI interaction by supporting arbitrary numbers of humans and AI agents in both server-authoritative and peer-to-peer modes.
ProVoice-Bench introduces the first evaluation framework for proactive voice agents with 1,182 samples across four tasks measuring intervention and monitoring capabilities. State-of-the-art multimodal LLMs show significant performance gaps particularly in over-triggering and reasoning, revealing limitations in current proactive agent paradigms.
Autogenesis Protocol (AGP) standardizes self-evolving agent systems by modeling prompts, agents, tools, environments, and memory as protocol-registered resources with lifecycle management and version tracking. The Resource Substrate Protocol Layer decouples what evolves from how evolution occurs, addressing brittleness in existing protocols like A2A and MCP.
Qwen3.6-35B-A3B is a sparse MoE model with 35B total parameters and 3B active, released under Apache 2.0. The model matches agentic coding performance of models 10x its active size and includes multimodal perception with thinking and non-thinking modes.
Agentic framework for RTL timing optimization using LLMs with tool-grounded self-improvement and reusable optimization skills. Evaluated on realistic RTL designs with industrial-grade tools rather than manually degraded toy examples. Moves beyond coarse design-level feedback to fine-grained optimization through learned skills.
Examines explainability requirements for agentic AI in enterprise settings where low-code agent proliferation ("Agent Sprawl") outpaces governance capabilities. Proposes design-time and runtime explainability techniques from AI governance experts to address corporate concerns about autonomous agent decision-making and inter-agent communication.
UniClaude integrates Claude directly into Unity Editor as a dockable window with full project context awareness and 60+ MCP tools. Eliminates context switching during game development by embedding the AI assistant natively in the IDE. Provides workflow-specific tooling for game developers working in Unity.
OpenAI's Codex app for macOS and Windows now includes computer use capabilities, in-app browsing, image generation, memory, and plugins. The update transforms Codex from a code-focused assistant into a multi-capability developer productivity platform.
Agent-driven hardware reverse engineering automation stack controlling flying probe systems for PCB analysis. Combines target discovery, microscope mapping, safety-monitored CNC motion, probe review, and controlled pin probing. Demonstrates AI agents extending beyond software into physical hardware hacking workflows.
🔶 Anthropic 4d ago
★ High Signal
Claude Opus 4.7 delivers 13% improvement on coding benchmarks with enhanced vision for higher-resolution images and new effort controls/task budgets for autonomous development. Powers upgraded Claude Code review tools for long-running software engineering tasks. Introduces task-level resource management for extended autonomous coding workflows.
🔶 Anthropic 4d ago
★ High Signal
Claude Opus 4.7 achieves 87.6% on SWE-bench Verified (13% improvement) with 2x throughput on agentic tasks while maintaining $5/$25 per million token pricing and full 1M context window. The performance gains make it effectively cheaper per task despite unchanged nominal pricing. Higher-resolution vision capabilities included.
Ecom-RLVE introduces adaptive verifiable environments for training and evaluating e-commerce conversational agents with reinforcement learning. Provides structured simulation environments where agent actions can be verified against ground truth. Enables systematic development of domain-specific conversational AI for shopping and customer service scenarios.
GitHub Copilot adding Claude Opus 4.7 with stronger multi-step task performance and more reliable agentic execution. Launches with promotional 7.5× premium request multiplier until April 30th, replacing Opus 4.5 and 4.6 for Copilot Pro+ users.
Corpus2Skill distills document corpora into hierarchical skill directories that LLM agents navigate rather than passively retrieve, addressing RAG's limitation of treating models as passive consumers. The system clusters documents offline into a navigable tree with LLM-written summaries at each level, giving agents a bird's-eye corpus view for better evidence synthesis.
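The navigation idea can be sketched as a top-down walk over the summary tree; `choose` stands in for the LLM relevance call, and the tree layout below is a hypothetical example rather than Corpus2Skill's actual schema:

```python
def navigate(node, question, choose):
    """Walk the skill tree top-down: at each level the agent reads the
    LLM-written child summaries and descends into the most relevant branch.
    choose(question, summaries) -> index stands in for an LLM call."""
    while node.get("children"):
        summaries = [c["summary"] for c in node["children"]]
        node = node["children"][choose(question, summaries)]
    return node["docs"]  # leaf cluster: the evidence set to synthesize from

# Hypothetical two-level directory over a support corpus
tree = {"children": [
    {"summary": "billing and payments", "docs": ["invoice.md"]},
    {"summary": "deployment and infra", "children": [
        {"summary": "kubernetes", "docs": ["k8s.md"]},
        {"summary": "serverless", "docs": ["lambda.md"]},
    ]},
]}

# Crude word-overlap chooser in place of the LLM
choose = lambda q, s: max(range(len(s)),
                          key=lambda i: len(set(q.lower().split()) & set(s[i].split())))
```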
DR³-Eval provides a reproducible benchmark for deep research agents using static research sandbox corpora paired with authentic user tasks, measuring multimodal report generation across dimensions including information recall, factual accuracy, and citation coverage. It addresses the challenge of evaluating long-horizon research tasks by simulating open-web complexity while remaining fully verifiable.
MM-WebAgent is a hierarchical agentic framework for multimodal webpage generation that coordinates AIGC-based element generation through hierarchical planning and iterative self-reflection. Jointly optimizes global layout, local multimodal content, and their integration to produce coherent and visually consistent webpages, addressing style inconsistency in isolated element generation.
RadAgent is a tool-using AI agent for chest CT interpretation that generates reports through a stepwise, interpretable process with fully inspectable traces of intermediate decisions and tool interactions. Improves on CT-Chat VLM baseline across three dimensions while allowing clinicians to examine how findings are derived rather than being passive observers.
UniDoc-RL uses reinforcement learning to unify retrieval, reranking, and visual perception in a single LVLM agent with hierarchical actions. The model progressively refines evidence from document-level retrieval to region-level cropping, enabling fine-grained visual semantics for complex reasoning tasks.
Automated London rental property hunting system combining Claude Code, Claude in Chrome, and Gmail MCP. Scrapes four rental platforms on cron, deduplicates via spreadsheet, prioritizes listings as HIGH/MED/LOW, and generates ready-to-send outreach emails. Demonstrates practical agent orchestration for real-world automation tasks.
Discussion analyzing whether AI agent operational costs are experiencing exponential growth similar to training costs. Examines infrastructure and inference expenses for agentic systems at scale. Raises concerns about economic sustainability of agent-based architectures.
Hugging Face analysis of VAKRA agent system covering reasoning patterns, tool use mechanisms, and common failure modes in agent architectures.
WorldSeed is a simulation engine where AI agents live autonomously with physical rules and information asymmetry. Scenarios defined in YAML allow emergent multi-agent storytelling with any agent framework.
Anthropic redesigned Claude Code desktop app with parallel session management sidebar, integrated terminal, in-app file editor, and Routines—automation running on schedules, API calls, or GitHub events without active sessions. Available for Pro, Max, Team, and Enterprise users on macOS and Windows.
SkillClaw enables LLM agent skills to evolve autonomously by aggregating interaction experiences across users, with an 'agentic evolver' that refines capabilities from real-world usage. Achieves +42.1% improvement by shifting from static, manually-engineered skills to continuously improving ones learned from collective deployment data.
🟢 OpenAI 5d ago
★ High Signal
OpenAI's Agents SDK update adds native sandbox execution and model-native harness for building production-grade agents with improved safety and execution isolation. Represents a shift from experimental prototypes to production-ready agentic workflows with support for long-running agents working across files and tools.
🟢 OpenAI 5d ago
★ High Signal
OpenAI Codex expands from coding to full computer use with web workflows, multi-step planning, autonomous actions, and audio-visual processing for 3M+ weekly developers. Now handles PR reviews, multiple file/terminal views, SSH connections, and in-app browsing. Shift from code generation tool to general-purpose computer control agent.
Google's Agent-to-Agent Protocol reached 150+ organizations and production deployments in Azure AI Foundry and Amazon Bedrock AgentCore at 1-year milestone. v1.0 added Signed Agent Cards for cryptographic identity verification between agents; combined with IBM's merged Agent Communication Protocol and AP2 commerce extension, it now covers full lifecycle from tool access to delegation to payments.
📝 Blog 5d ago
★ High Signal
Claude Code discovered a 23-year-old remotely exploitable heap buffer overflow in Linux kernel's NFS driver, with five vulnerabilities confirmed. Linux maintainers report AI bug reports shifted from "slop to legitimate findings" about a month ago, with valid security reports increasing from 2-3/week to 5-10/day—marking a capability inflection point for AI-assisted vulnerability discovery.
Notion rebuilt Custom Agents 4-5 times before production launch due to early failures from lack of tool-calling standards, short context, and unreliable models. "Agent Lab" thesis: time the roadmap carefully, avoiding swimming upstream against model limitations while still building early enough. Practical lessons on when to ship agent features based on foundation model maturity.
Notion rebuilt Custom Agents 4-5 times before production, revealing early agent attempts failed due to lack of tool-calling standards and short context windows. Their 'Agent Lab' thesis focuses on building product systems around frontier capabilities, with coding agents viewed as the kernel of future 'software factories' comprising spec/code/test/review agents.
HiVLA decouples VLM semantic planning from motor control to preserve reasoning capabilities lost in end-to-end VLA fine-tuning. VLM planner generates subtask instructions with target bounding boxes, then flow-matching DiT translates grounded plans to physical actions for robotic manipulation.
r/LocalLLaMA consensus ranks Qwen 3.5 most broadly recommended, Gemma 4 showing strong buzz, GLM-5/4.7 near top of rankings, MiniMax M2.5/M2.7 for agentic workloads, DeepSeek V3.2 in top cluster. Qwen3-Coder-Next dominates for local coding. Community-driven practical guidance on deployed models.
Analysis of Claude Code's TypeScript source code and comparison with OpenClaw identifies five core human values (decision authority, safety, reliable execution, capability amplification, contextual adaptability) traced through thirteen design principles to implementation choices. The core architecture is a simple while-loop calling the model, running tools, and returning results—demonstrating how design philosophy shapes agentic system architecture.
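That while-loop can be sketched in a few lines; the `model`/`tools` interfaces below are illustrative, not Anthropic's actual TypeScript:

```python
def agent_loop(model, tools, prompt, max_steps=20):
    """Minimal sketch of the core loop: call the model, run any requested
    tool, feed the result back, repeat until a final answer.

    model maps a message list to ("tool", name, args) or ("answer", text);
    tools maps tool names to callables.
    """
    messages = [{"role": "user", "content": prompt}]
    for _ in range(max_steps):
        kind, *payload = model(messages)
        if kind == "answer":
            return payload[0]
        name, args = payload
        result = tools[name](**args)
        messages.append({"role": "tool", "name": name, "content": result})
    raise RuntimeError("step budget exhausted")
```

Everything else in the analysis (memory layers, navigation tools, subagents) is harness built around this loop rather than a change to it.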
VCR-Agent is a multi-agent framework that generates mechanistic action graphs to represent biological reasoning in virtual cells, enabling verification and falsification of LLM-generated explanations. The approach releases VC-TRACES, a dataset of verified biological mechanisms, addressing the challenge of factually grounded scientific explanations from LLMs in open-ended domains like biology.
Open-source AI agent system that automates startup idea validation from brainstorming through go-to-market strategy, powered by Claude, OpenAI, and Cursor. Targets developers seeking rapid validation in 10 minutes instead of months-long manual processes.
Curated collection of 50+ Claude Code skills, agents, and plugins organized by use case with recommendation ratings. Ready-to-use extensions for Claude-based development workflows.
Analysis of 1000+ OpenClaw deployments reveals minimal legitimate use cases beyond daily news digests, despite 250K GitHub stars and significant engineering investment. Users who spent weeks attempting production deployment found the tool connects to messaging apps and LLMs but lacks practical applications.
InhouseSEO releases 11 production-tested Claude skills for SEO workflows including page audits, E-E-A-T analysis, semantic gap detection, and article writing with anti-AI-slop generation rules. Built on methodology from industry practitioners Koray Tuğberk, Kyle Roof, and Lily Ray.
Asynkor provides file leasing coordination for AI agent teams via MCP server, preventing merge conflicts when multiple agents edit code. Works across IDEs without changing agent implementations.
Cloudflare integrates OpenAI's GPT-5.4 and Codex into Agent Cloud, enabling enterprises to build and deploy AI agents at scale. The partnership combines Cloudflare's infrastructure with OpenAI's latest models for production agentic workflows.
Gemma 4 26B MoE shows reluctance to use tools or web search, defaulting to internal knowledge and performing minimal searches even when explicitly requested. Community feedback on the model's agentic behavior highlights a gap between its strong benchmark results and its practical tool use.
Demonstrates that fairness can emerge as a property of multi-agent collaboration, potentially circumventing Arrow's impossibility theorem limitations in collective decision-making. This theoretical contribution suggests that distributed AI systems might achieve fair outcomes through collaboration mechanisms that single-agent or voting-based systems cannot.
KDnuggets recommends five books for building agentic AI systems, headlined by Chip Huyen's "AI Engineering" for its practical focus on production tradeoffs like latency vs. accuracy and cost vs. capability. The list targets practitioners shipping multi-agent orchestration, tool-calling, and memory management to production in 2026.
Simon Willison uses Claude Code to explore Servo v0.1.0 Rust crate, building CLI screenshot tool and investigating WebAssembly compilation autonomously. Demonstrates "agentic engineering" workflow where developer tasks AI with discovering library capabilities and building working tools. Evolution from code completion to exploratory development assistance.
SkillClaw enables collective skill evolution across multi-user LLM agent ecosystems by continuously aggregating interaction trajectories and autonomously refining skills via an agentic evolver, achieving 88% improvement after 6 rounds and +42.1% on real-world tasks. It enables cross-user knowledge transfer without additional user effort, solving the inefficiency where users repeatedly develop similar workflows independently.
Meta Muse Spark marks Meta's pivot from open-source to proprietary models, featuring multimodal perception, parallel subagent execution, and a contemplating mode. Built by Meta Superintelligence Labs, it offers competitive vision and language performance but lags in coding, representing Meta's first paid API model after Llama 4's poor reception.
MMLU and other 2024-dominant benchmarks now saturated (>95% on frontier models), relegated to "floor checks" rather than frontier separators. Frontier now decided by HLE, GPQA, MMLU-Pro, SWE-bench Pro, Terminal-Bench 2.0, BrowseComp for agentic tasks. Benchmark choice matters more than ever as academic standards become irrelevant for comparing top models.
🧠 DeepMind 2w ago
★ High Signal
Gemma 4 family (31B Dense, 26B MoE variants) released under Apache 2.0 with 256K context, native vision/audio, and competitive coding ELO jumping from 110 to 2150—a 20x improvement. The 31B model outperforms models 20x larger while enabling agentic skills on edge devices. First open-weights model family combining multimodal input, extended context, and elite coding performance at edge-deployable scale.
Simon Willison identifies November 2025 as the inflection point when AI coding agents crossed from 'mostly works' to 'actually works' with GPT-5.2 and Opus 4.5 releases. Discusses dark factories, automation timelines, agentic engineering, and his transition from traditional software engineering to AI-native development.
LLM multi-agent systems spontaneously develop power-law distributions in cognitive influence, forming "intellectual elites" where a small fraction of agents disproportionately shape collective decisions without explicit design. This emergent stratification mirrors human social dynamics and challenges assumptions about egalitarian multi-agent collaboration. Critical implications for fairness and reliability in decision-making systems.
Proactive Agent Research Environment simulates active users to evaluate AI assistants that anticipate needs and initiate actions rather than just responding to queries. Existing benchmarks lack realistic user simulation for testing proactive behaviors like timely suggestions and anticipatory information gathering. Bridges the gap between passive query-response evaluation and true assistant capabilities needed in high-stakes domains.
Leaked Claude Code source reveals three-layer memory architecture (file-read deduplication, structured session memory), dedicated repository navigation tools (Grep, Glob, LSP) instead of relying on model context, and forked subagents for parallelized background analysis. Demonstrates that coding agent performance stems from careful harness engineering around the model rather than just model intelligence alone.
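A minimal sketch of the first layer, file-read deduplication, assuming a (path, mtime, size) cache key; the leaked source's eviction policy and structured session memory are omitted here:

```python
import os

class FileReadCache:
    """Re-reading an unchanged file returns a short pointer instead of
    repeating its full contents in the model's context window."""

    def __init__(self):
        self._seen = {}

    def read(self, path):
        st = os.stat(path)
        key = (path, st.st_mtime_ns, st.st_size)  # changes when the file does
        if key in self._seen:
            return f"<already in context: {path}>"
        with open(path) as f:
            content = f.read()
        self._seen[key] = content
        return content
```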
Introduces heartbeat-driven metacognitive scheduling for LLM agents that learns when to activate cognitive modules (Planner, Critic, Recaller, Dreamer) from temporal patterns rather than hard-coded rules. First approach treating agent control as a learned scheduling problem, enabling proactive self-improving behavior through meta-learning from historical execution logs.
Interspeech 2026 challenge shifts audio AI evaluation from result-oriented to process-oriented reasoning quality using instance-level rubric-based evaluation. The champion agent integrated 40+ specialized audio tools, achieving a 69.83% Rubrics score. Emphasizes transparent reasoning over black-box performance metrics.
Comprehensive survey organizing agentic reasoning along three dimensions: foundational (planning, tool use, search), self-evolving (feedback, memory, adaptation), and collective multi-agent reasoning. Distinguishes in-context reasoning from post-training reasoning and provides unified taxonomy bridging thought and action across science, robotics, healthcare, and mathematics.