Framework evaluates agentic systems across LLM, Memory, Tools, and Environment dimensions using static analysis, dynamic monitoring, and judge-based evaluation to detect policy violations beyond task completion. Based on CloudOps production deployment where success metrics masked compliance failures. Addresses gap in current benchmarks that measure outcomes but not process adherence.
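A minimal sketch of what process-level evaluation like this could look like (policy names, step schema, and checks are all illustrative assumptions, not the framework's actual API): a trajectory is scored both on outcome and on policy adherence, so a successful run can still be flagged as non-compliant.

```python
# Hypothetical policy checks per evaluation dimension (illustrative only).
POLICIES = {
    "Tools": lambda step: step.get("tool") not in {"delete_prod_db"},
    "Memory": lambda step: "ssn" not in step.get("stored", ""),
}

def evaluate(trajectory: list[dict], task_succeeded: bool) -> dict:
    """Dynamic monitoring: walk the trajectory and record policy violations."""
    violations = []
    for i, step in enumerate(trajectory):
        for dim, check in POLICIES.items():
            if not check(step):
                violations.append((i, dim))
    return {
        "task_success": task_succeeded,   # outcome metric alone...
        "compliant": not violations,      # ...can mask these failures
        "violations": violations,
    }

report = evaluate(
    [{"tool": "read_logs", "stored": ""},
     {"tool": "delete_prod_db", "stored": ""}],
    task_succeeded=True,
)
print(report["task_success"], report["compliant"])  # True False
```

This is the core point of the summary in miniature: the run "succeeds", but the compliance report catches the tool-use violation that a success-only benchmark would miss.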
Multi-agent LLM systems spontaneously develop power law distributions in knowledge and influence, mirroring human intellectual hierarchies. Agent societies exhibit emergent specialization and social stratification. First empirical evidence of collective social dynamics beyond individual agent capabilities.
GUIDE separates lightweight acting model for real-time spacecraft control from offline reflection that updates a 'playbook' from prior trajectories, demonstrating LLMs can adapt operational strategies without weight updates in safety-critical domains. Shows context evolution in LLM agents functions as policy search over structured decision rules in deployment-constrained environments.
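The acting/reflection split can be sketched as follows (function names, rule format, and the reward threshold are assumptions for illustration, not GUIDE's actual design): a lightweight acting loop reads a playbook of condition-action rules, while an offline reflection step mines prior trajectories into new rules, so the policy evolves with no weight updates.

```python
def act(observation: str, playbook: list[str]) -> str:
    """Real-time control: condition acting on the current playbook.
    Stand-in for an LLM call; here we just apply the first matching rule."""
    for rule in playbook:
        condition, action = rule.split(" -> ")
        if condition in observation:
            return action
    return "hold"  # safe default when no rule applies

def reflect(trajectories: list[tuple[str, str, float]],
            playbook: list[str]) -> list[str]:
    """Offline reflection: promote (observation, action) pairs that scored
    well into playbook rules, without touching model weights."""
    updated = list(playbook)
    for obs, action, reward in trajectories:
        rule = f"{obs} -> {action}"
        if reward > 0.8 and rule not in updated:
            updated.append(rule)
    return updated

playbook = ["low_battery -> enter_safe_mode"]
playbook = reflect([("thruster_fault", "switch_to_backup", 0.95)], playbook)
print(act("thruster_fault detected", playbook))  # -> switch_to_backup
```

The deployment-constrained framing maps cleanly onto this split: `reflect` can run offline on ground hardware while `act` stays cheap enough for real-time control.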
Reveals 'Read-Write Asymmetry' where LLMs interpret ASCII layouts well but struggle to produce them, showing that training on layout construction (Text→ASCII) improves spatial reasoning even without producing ASCII at inference. Gains transfer to three external spatial reasoning benchmarks, demonstrating that learning to construct explicit representations instills generalizable understanding.
Multi-Agent Reflexion uses diverse reasoning personas with a separate judge model to synthesize critiques, improving HotPotQA by 3 points and HumanEval by 6.2 points. Separates acting, diagnosing, critiquing, and aggregating to reduce shared blind spots in single-agent self-reflection. Addresses a systematic limitation where solo agents repeat misconceptions without external correction signals.
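The actor / persona-critics / judge separation can be sketched like this (role names, prompts, and stubs are assumptions, not the paper's API; each role would be a distinct LLM call in practice):

```python
def actor(task: str) -> str:
    # Acting role: produce a draft answer (LLM call stubbed out).
    return f"draft answer for: {task}"

def critique(persona: str, answer: str) -> str:
    # Each persona diagnoses the answer from a distinct reasoning style.
    return f"[{persona}] issue found in '{answer}'"

def judge(critiques: list[str]) -> str:
    # A separate judge model synthesizes persona critiques into one signal,
    # reducing the shared blind spots of single-agent self-reflection.
    return "revise: " + "; ".join(critiques)

def reflexion_round(task: str, personas: list[str]) -> str:
    answer = actor(task)
    critiques = [critique(p, answer) for p in personas]
    return judge(critiques)  # fed back into the actor on the next attempt

feedback = reflexion_round("multi-hop QA", ["logician", "skeptic"])
print(feedback)
```

The key design point the summary highlights is that no single model both writes and grades its own critique: the judge aggregates across personas, so one persona's misconception does not dominate the correction signal.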
Neurosymbolic architecture grounds AI agents in domain ontologies for regulated industries, evaluated across 600 runs in 5 sectors including Vietnamese-language domains. Ensures agent reasoning aligns with compliance requirements and domain constraints. Bridges symbolic knowledge representation with neural reasoning for safety-critical enterprise deployment.
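A hedged sketch of the neurosymbolic pattern described above (the ontology entries and precondition checks are invented for illustration): a neural component proposes an action, and a symbolic layer grounded in the domain ontology accepts it only when its compliance preconditions hold, failing closed otherwise.

```python
# Toy domain ontology: each action's symbolic preconditions (illustrative).
ONTOLOGY = {
    "loan_approval": {"requires": {"credit_check", "kyc_verified"}},
    "prescription": {"requires": {"licensed_physician"}},
}

def symbolic_check(action: str, facts: set[str]) -> bool:
    """Accept an action only if all ontology preconditions are satisfied."""
    required = ONTOLOGY.get(action, {}).get("requires", set())
    return required <= facts  # subset test: every requirement is a known fact

def grounded_agent(proposed_action: str, facts: set[str]) -> str:
    # The neural component proposes; the symbolic layer constrains.
    if symbolic_check(proposed_action, facts):
        return proposed_action
    return "escalate_to_human"  # fail closed in regulated domains

print(grounded_agent("loan_approval", {"credit_check"}))
print(grounded_agent("loan_approval", {"credit_check", "kyc_verified"}))
```

Failing closed (escalating rather than acting) when constraints are unmet is the safety property that makes this pattern suitable for the regulated, safety-critical deployments the summary describes.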