Framework evaluates agentic systems across LLM, Memory, Tools, and Environment dimensions using static analysis, dynamic monitoring, and judge-based evaluation to detect policy violations beyond task completion. Based on CloudOps production deployment where success metrics masked compliance failures. Addresses gap in current benchmarks that measure outcomes but not process adherence.
Berkeley researchers achieved near-perfect scores on major AI agent benchmarks (SWE-bench, WebArena, FieldWorkArena, Terminal-Bench) without solving tasks, using exploits ranging from trivial to sophisticated. Exposes that evaluations weren't designed to resist systems optimizing for scores rather than actual task completion.
User benchmark comparing Qwen 3.6 35B against Gemma 4 26B on 30k-line codebase with 37 intentional bugs and PDF analysis tasks shows Qwen significantly outperforming across agentic capabilities, coding, image-to-text, instruction following, and reasoning. Both models tested at Q4_K_XL quantization for fair comparison.
GeoSpOT uses Optimal Transport methods with geographic metadata to quantify distribution distances between geospatial domains, addressing out-of-domain generalization challenges in computer vision for geographic data. Provides a principled method to predict when cross-region model adaptation will succeed given uneven global data coverage.
Conformal prediction framework for LLMs using Layer-Wise Information (LI) scores from internal representations instead of output statistics like token probabilities. LI scores measure how conditioning on input reshapes predictive entropy across model depth, providing more robust uncertainty quantification under calibration-deployment mismatch.
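The split-conformal machinery underlying this approach can be sketched generically. The LI score itself is the paper's contribution and is not reproduced here; a placeholder nonconformity score stands in, and all names and numbers are illustrative:

```python
import numpy as np

def conformal_threshold(cal_scores, alpha=0.1):
    # Split conformal calibration: take the (1 - alpha)-quantile of the
    # calibration nonconformity scores with the finite-sample correction.
    n = len(cal_scores)
    q = np.ceil((n + 1) * (1 - alpha)) / n
    return np.quantile(cal_scores, min(q, 1.0), method="higher")

# Toy calibration scores, standing in for LI-based nonconformity values.
rng = np.random.default_rng(0)
cal = rng.uniform(0, 1, size=500)
tau = conformal_threshold(cal, alpha=0.1)

# At deployment, any candidate whose score is <= tau enters the prediction
# set, which carries a marginal coverage guarantee of at least 1 - alpha.
candidates = {"A": 0.2, "B": 0.85, "C": 0.95}
pred_set = [c for c, s in candidates.items() if s <= tau]
```

The paper's claim is that scores derived from internal representations survive calibration-deployment mismatch better than output-level scores plugged into this same recipe.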
Tabular foundation models enable in-context molecular property prediction without task-specific fine-tuning, addressing small dataset challenges in drug discovery and chemical engineering. The approach evaluates frozen molecular embeddings and TFMs across pharmaceutical and engineering benchmarks in low- to medium-data regimes.
Extracted the scholarly reasoning systems of two prominent humanities scholars from published corpora, converted them into structured inference-time constraints for LLMs, and tested whether resulting scholar-bots could perform doctoral supervision, peer review, and lecturing at expert quality. Expert assessment found outputs met appointment-level quality standards, raising questions about knowledge work automation from public scholarship alone.
Chain-of-Thought prompting consistently degrades performance in visual spatial reasoning tasks across seventeen multimodal models and thirteen benchmarks. A novel No-Image++ ablation reveals that multimodal reasoning models hallucinate visual details from textual priors even when images are absent, indicating severe shortcut learning in CoT-prompted vision-language models.

LLMSniffer fine-tunes GraphCodeBERT with two-stage supervised contrastive learning to detect AI-generated code, improving accuracy from 70% to 78% on GPTSniffer and 91% to 94.65% on Whodunit. The approach combines comment removal preprocessing with an MLP classifier and produces well-separated embeddings confirmed by t-SNE visualization.
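The supervised contrastive objective at the core of this kind of detector can be sketched in a few lines. This is a generic SupCon-style loss over precomputed embeddings, not LLMSniffer's actual two-stage training code; the temperature and toy data are assumptions:

```python
import numpy as np

def supcon_loss(embeddings, labels, temperature=0.1):
    # Supervised contrastive loss: pull embeddings with the same label
    # together, push embeddings with different labels apart.
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    n = len(labels)
    loss = 0.0
    for i in range(n):
        mask = (labels == labels[i]) & (np.arange(n) != i)
        if not mask.any():
            continue
        # Log-softmax over all other samples, then average over positives.
        log_z = np.log(np.exp(np.delete(sim[i], i)).sum())
        log_prob = sim[i] - log_z
        loss += -log_prob[mask].mean()
    return loss / n

# Well-separated classes (human vs. AI-generated code embeddings, toy 2-D).
emb = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
loss_separated = supcon_loss(emb, np.array([0, 0, 1, 1]))
loss_shuffled = supcon_loss(emb, np.array([0, 1, 0, 1]))
```

A lower loss for the separated labeling is exactly the clustering effect the t-SNE visualizations in the paper are reported to confirm.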
Mind's Eye benchmark evaluates MLLMs on eight visuo-cognitive tasks inspired by human intelligence tests, organized under Abstraction-Relation-Transformation taxonomy. Humans achieve 80% accuracy while top MLLMs remain below 50%, revealing failures in visual attention, pattern induction, and mental transformation—core processes of fluid intelligence.
SocialGrid is an Among Us-inspired benchmark evaluating LLM agents on planning, task execution, and social reasoning in embodied multi-agent settings. Even GPT-OSS-120B achieves below 60% accuracy, with agents stuck in repetitive behaviors—revealing social reasoning remains a bottleneck even with planning assistance.
Exposes a critical keyword shortcut bias in code localization benchmarks where models rely on superficial lexical matching rather than structural reasoning. Introduces KA-LogicQuery, a diagnostic benchmark requiring structural reasoning without naming hints, revealing catastrophic performance drops in state-of-the-art approaches and motivating a neurosymbolic framework combining neural retrieval with symbolic verification.
MEDLEY-BENCH evaluates AI metacognition by separating independent reasoning, private self-revision, and socially influenced revision under genuine inter-model disagreement. Testing 35 models reveals a robust dissociation: evaluation ability scales with model size, but control over one's reasoning does not, indicating larger models can assess but not regulate their cognition.
Agentic Verifier transforms reward modeling into multi-turn, tool-augmented deliberation using complementary forward and backward agents. Addresses error propagation and lack of grounding in complex domains by tracing solutions from premises to conclusions and re-checking conclusions against premises for comprehensive verification.
540,000 simulated content selections across three major LLM providers and three social platforms reveal structural content selection biases that differ substantially in how they respond to prompting strategies. While biases vary across providers and platforms, certain patterns persist robustly, with implications for LLM-based content curation and recommendation systems.
JFinTEB is the first comprehensive benchmark for Japanese financial text embeddings, covering retrieval and classification tasks including sentiment analysis, document categorization, and economic survey classification. Evaluates diverse embedding models on language-specific and domain-specific financial text processing scenarios.
LLMs show systematic asymmetry between judging pragmatic appropriateness and generating pragmatically appropriate language across three settings. Models that excel at evaluating pragmatic competence often fail to produce similarly competent outputs, revealing misalignment between their listener and speaker capabilities.
UniEditBench provides the first unified benchmark for image and video editing across reconstruction-based and instruction-driven methods, with taxonomies covering 9 image and 8 video operations. Uses distilled MLLMs as cost-effective automatic evaluators that align with human preference, addressing fragmentation in visual editing evaluation.
QuantSightBench evaluates LLM quantitative forecasting using prediction intervals over continuous quantities rather than binary or multiple-choice formats. The benchmark demands scale awareness, internal consistency across confidence levels, and calibration over continuous outcomes, addressing a gap in existing reasoning-under-uncertainty evaluations.
Controlled experiments on shortest-path planning reveal LLMs exhibit strong spatial generalization to unseen maps but fail at length scaling due to recursive instability. The synthetic environment cleanly separates training data, paradigms, and inference strategies to isolate generalization failure modes.
Split conformal prediction applied to LLM-as-judge frameworks reveals reliability issues masked by aggregate metrics: 33-67% of documents show transitivity violations despite low average rates, and prediction set width serves as a per-instance reliability indicator with strong correlation to actual uncertainty. The approach provides theoretically-guaranteed coverage bounds for judge outputs.
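A transitivity check over pairwise judge verdicts, the kind of per-document diagnostic the paper aggregates, can be sketched as follows. The preference data is synthetic with one cycle baked in; the conformal machinery itself is not reproduced:

```python
from itertools import combinations

# prefer[(a, b)] = True means the judge ranked answer a above answer b.
prefer = {
    ("A", "B"): True, ("B", "C"): True, ("A", "C"): False,  # cycle A>B>C>A
    ("A", "D"): True, ("B", "D"): True, ("C", "D"): True,
}

def beats(x, y):
    return prefer[(x, y)] if (x, y) in prefer else not prefer[(y, x)]

def violating_triples(items):
    # A 3-item tournament is intransitive exactly when it is a cycle,
    # i.e. every item beats exactly one of the other two.
    bad = 0
    for x, y, z in combinations(items, 3):
        wins = [sum(beats(a, b) for b in (x, y, z) if b != a) for a in (x, y, z)]
        if wins == [1, 1, 1]:
            bad += 1
    return bad

n_violations = violating_triples(["A", "B", "C", "D"])
```

The point of the aggregate-vs-per-document finding is that a corpus-level violation *rate* can be low while a large fraction of individual documents each contain at least one such cycle.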
Vision-language models struggle to recognize human emotions, underperforming even specialized vision-only classifiers despite progress on other visual tasks. The study identifies two critical vulnerabilities: long-tailed emotion dataset distributions exacerbated by web-scale pretraining, and challenges with continuous dynamic facial expression recognition. Reveals fundamental gap in VLM emotional understanding capabilities.
SegWithU augments frozen pretrained segmentation models with a lightweight uncertainty head that produces voxel-wise uncertainty maps using rank-1 posterior probes in a compact feature space. Unlike existing methods requiring repeated inference, it achieves strong failure detection and calibration in a single forward pass for medical image segmentation.
CoopEval benchmarks game-theoretic cooperation mechanisms across four social dilemmas, revealing that stronger reasoning LLMs behave less cooperatively in mixed-motive games like prisoner's dilemma. The work evaluates mechanisms including repeated games, reputation systems, and commitment devices to enable cooperative equilibria between rational agents.
Stakes signaling vulnerability shows LLM-as-a-judge models systematically corrupt assessments when informed of downstream consequences their verdicts will have on evaluated models. Controlled experiments across 1,520 responses on safety and quality benchmarks demonstrate judges evaluate based on contextual framing rather than strictly on semantic content, undermining the operational backbone of automated AI evaluation pipelines.
Empirical study evaluates AI-assisted requirements engineering tools against expert judgment using INCOSE criteria within a controlled systems engineering methodology. Research investigates whether AI can support quality assessment and validation of requirements without replacing professional expertise. Addresses a gap in understanding AI's role within formal systems engineering processes.
IRS framework decomposes humor understanding into three structured components: identifying visual incongruities, constructing coherent reinterpretations, and aligning with human preference judgments. Applies incongruity-resolution theory to the New Yorker Cartoon Caption Contest, moving beyond black-box prediction to explicit reasoning processes. Demonstrates that humor comprehension requires getting both the answer and the underlying reasoning correct.
MADE introduces a living multi-label text classification benchmark for medical device adverse events, continuously updated with new reports to prevent training data contamination. Features long-tailed hierarchical labels and enables uncertainty quantification evaluation critical for high-stakes healthcare ML. Addresses benchmark saturation and memorization vs. reasoning distinction.
RL-STPA adapts System-Theoretic Process Analysis for reinforcement learning safety through hierarchical subtask decomposition, coverage-guided perturbation testing, and iterative hazard feedback loops. Addresses distributional shift and emergent behaviors unique to neural RL policies in safety-critical deployments.
Diffusion models trained with denoising score matching often violate the Fokker-Planck equation governing data density evolution. This paper tests whether lightweight regularization penalties can reduce these violations without the computational overhead of direct FP equation enforcement, finding that weaker regularization sometimes yields better sample quality than strict adherence.
Study examines LLM overgeneration patterns in machine translation, distinguishing between neurobabble confabulations and appropriate explanatory additions that mimic human translator behavior. The work focuses on commercial deployment challenges of detecting and classifying these overgenerations. Novel contribution is the taxonomy of LLM translation behaviors ranging from harmful confabulations to helpful contextual explanations.
QuantCode-Bench provides 400 tasks evaluating LLMs on generating executable algorithmic trading strategies for Backtrader from English descriptions. Unlike standard code benchmarks, requires domain-specific financial logic, specialized API knowledge, and code producing actual trades on historical data, with tasks sourced from Reddit, TradingView, and synthetic generators.
Proposes axiomatic benchmark for scientific novelty metrics that avoids confounded proxies like citation counts or peer review scores. Addresses fundamental evaluation challenge for AI scientist systems by enabling reliable, automated novelty assessment without conflating novelty with impact, quality, or reviewer preference.
DiscoTrace analyzes rhetorical strategies in information-seeking answers by representing them as sequences of discourse acts paired with question interpretations. Human communities show diverse answering preferences, while LLMs lack rhetorical diversity and systematically favor breadth over depth regardless of prompting. Reveals fundamental differences in how humans and models construct answers beyond surface-level content.
Blinded multi-rater study with 6 senior diabetes clinicians evaluated retrieval-grounded LLM conversational agent for CGM data interpretation and patient counseling support across 12 cases. System generated plain-language explanations while avoiding individualized therapeutic advice, addressing time-intensive nature of CGM pattern explanation. Evidence development for RAG-based clinical decision support in diabetes care.
IUQ quantifies uncertainty in long-form LLM generation by combining inter-sample consistency and intra-sample faithfulness. Targets the failure mode of semantically coherent but factually inaccurate free-form text, where answer sets can't be constrained.
MinShap modifies Shapley values from cooperative game theory to focus on direct feature effects rather than indirect dependencies, making them suitable for feature selection in non-linear models. The approach adapts attribution methods to the distinct requirements of variable selection with dependent features.
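For reference, the classical Shapley baseline that MinShap adapts can be computed exactly on a toy value function by enumerating orderings. MinShap's modification to isolate direct effects is not reproduced here; the two-feature game and its synergy term are illustrative assumptions:

```python
from itertools import permutations

def shapley(players, value):
    # Classical Shapley value: each player's marginal contribution,
    # averaged over all orderings in which the coalition can form.
    phi = {p: 0.0 for p in players}
    perms = list(permutations(players))
    for order in perms:
        coalition = set()
        for p in order:
            before = value(frozenset(coalition))
            coalition.add(p)
            phi[p] += value(frozenset(coalition)) - before
    return {p: v / len(perms) for p, v in phi.items()}

# Toy model: x1 contributes 3 alone, x2 contributes 1 alone, and the pair
# adds a synergy of 2 (mimicking an indirect effect of dependent features).
def v(s):
    base = 3.0 * ("x1" in s) + 1.0 * ("x2" in s)
    return base + (2.0 if {"x1", "x2"} <= s else 0.0)

phi = shapley(["x1", "x2"], v)
# The synergy is split evenly: phi["x1"] == 4.0, phi["x2"] == 2.0.
```

That even split of the interaction term across dependent features is precisely what the standard Shapley value does and what a direct-effect variant for feature selection would want to unpick.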
Scoping review of 23 studies reveals fairness in multi-agent AI systems is superficially addressed, lacks normative foundations, and overlooks agent autonomy dynamics. Authors argue fairness must be embedded structurally throughout MAAI development lifecycles rather than added post-hoc, addressing gaps in an increasingly important but understudied area.
Multi-metric analysis of demographic fairness in ML reveals different fairness metrics produce conflicting assessments on the same system due to capturing distinct statistical properties. Using face recognition experiments, demonstrates that fairness evaluation reliability depends critically on metric choice, challenging assumptions of consistency.
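The metric-disagreement phenomenon is easy to reproduce on synthetic counts. The per-group confusion numbers below are invented to make one classifier look fair under demographic parity and unfair under equal opportunity at the same time:

```python
# Per-group outcomes for one binary classifier: (TP, FP, FN, TN).
groups = {
    "g1": (40, 10, 10, 40),  # predicted-positive rate 0.50, TPR 0.80
    "g2": (45, 5, 30, 20),   # predicted-positive rate 0.50, TPR 0.60
}

def positive_rate(tp, fp, fn, tn):
    # Demographic parity compares P(predicted positive) across groups.
    return (tp + fp) / (tp + fp + fn + tn)

def true_positive_rate(tp, fp, fn, tn):
    # Equal opportunity compares P(predicted positive | actually positive).
    return tp / (tp + fn)

dp_gap = abs(positive_rate(*groups["g1"]) - positive_rate(*groups["g2"]))
tpr_gap = abs(true_positive_rate(*groups["g1"]) - true_positive_rate(*groups["g2"]))
# dp_gap  == 0.0 -> "fair" under demographic parity
# tpr_gap == 0.2 -> "unfair" under equal opportunity
```

Because the two metrics condition on different events, a single system can pass one audit and fail the other on identical predictions, which is the consistency assumption the study challenges.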
ProVoice-Bench introduces the first evaluation framework for proactive voice agents with 1,182 samples across four tasks measuring intervention and monitoring capabilities. State-of-the-art multimodal LLMs show significant performance gaps particularly in over-triggering and reasoning, revealing limitations in current proactive agent paradigms.
Ecom-RLVE introduces adaptive verifiable environments for training and evaluating e-commerce conversational agents with reinforcement learning. Provides structured simulation environments where agent actions can be verified against ground truth. Enables systematic development of domain-specific conversational AI for shopping and customer service scenarios.
DR³-Eval provides a reproducible benchmark for deep research agents using static research sandbox corpora paired with authentic user tasks, measuring multimodal report generation across dimensions including information recall, factual accuracy, and citation coverage. It addresses the challenge of evaluating long-horizon research tasks by simulating open-web complexity while remaining fully verifiable.
Community report of reproducibility crisis: 4 out of 7 recent ML papers failed to reproduce claimed results, with 2 having unresolved GitHub issues. Highlights growing concerns about research quality and verification standards. Reflects broader questions about publication incentives and validation rigor in current ML research.
ICLR 2025 Oral paper evaluated SQL code generation using natural language similarity metrics instead of execution-based validation, yielding a ~20% false positive rate in the authors' own testing. Community questions the appropriateness of the Oral designation given this fundamental evaluation methodology flaw. Highlights peer review challenges in code generation benchmarks.
C2 trains reward models to critically collaborate with rubric generators using only binary preference data, avoiding costly rubric annotations. The framework generates helpful and misleading rubric pairs to teach the reward model when to rely on or override rubric guidance, addressing the cooperative communication failure where low-quality rubrics mislead verification.
Community observation that Claude-4.6-Opus fine-tunes of open models consistently underperform base models despite promises of increased reasoning. Testing across multiple models and quantization levels shows decreased intelligence in agent setups. Suggests synthetic data distillation from proprietary models may not reliably transfer capabilities.
KLD evaluation framework for Qwen3.5-9B GGUF quantizations measures probability distribution drift from BF16 baseline rather than perplexity. Provides data-driven quant selection by measuring faithfulness to original weights independent of dataset artifacts.
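The core measurement is just a KL divergence between next-token distributions from the reference and quantized models, averaged over a corpus. A minimal sketch with toy three-token distributions (the actual framework's tooling and vocabulary handling are not reproduced):

```python
import math

def kl_divergence(p, q, eps=1e-12):
    # KL(P || Q) in nats over a shared token vocabulary; P is the BF16
    # reference distribution, Q the quantized model's distribution.
    return sum(pi * math.log(pi / max(qi, eps))
               for pi, qi in zip(p, q) if pi > 0)

# Toy next-token distributions at one position.
bf16  = [0.70, 0.20, 0.10]
quant = [0.60, 0.25, 0.15]

drift = kl_divergence(bf16, quant)
```

Unlike perplexity, this scores faithfulness to the original weights rather than fit to a particular evaluation corpus: a quant that matches the BF16 distribution exactly scores zero regardless of which tokens either model would have gotten "right".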
r/LocalLLaMA consensus ranks Qwen 3.5 most broadly recommended, Gemma 4 showing strong buzz, GLM-5/4.7 near top of rankings, MiniMax M2.5/M2.7 for agentic workloads, DeepSeek V3.2 in top cluster. Qwen3-Coder-Next dominates for local coding. Community-driven practical guidance on deployed models.
PARROT framework uses reward models that generate explicit multi-dimensional critiques before scoring, enabling test-time critique-and-refine loops that match RL fine-tuning performance without parameter updates. Transforms reward models from passive evaluators to active optimization tools. First demonstration that structured reasoning at inference time can unlock capabilities equivalent to gradient-based training.
MMLU and other 2024-dominant benchmarks now saturated (>95% on frontier models), relegated to "floor checks" rather than frontier separators. Frontier now decided by HLE, GPQA, MMLU-Pro, SWE-bench Pro, Terminal-Bench 2.0, BrowseComp for agentic tasks. Benchmark choice matters more than ever as academic standards become irrelevant for comparing top models.
Proactive Agent Research Environment simulates active users to evaluate AI assistants that anticipate needs and initiate actions rather than just responding to queries. Existing benchmarks lack realistic user simulation for testing proactive behaviors like timely suggestions and anticipatory information gathering. Bridges the gap between passive query-response evaluation and true assistant capabilities needed in high-stakes domains.
Production testing reveals Gemma 12B and Qwen 3.5 35B return correct answers in unparseable formats despite explicit instructions: Python instead of CSV, Markdown instead of CSV. Format compliance is an independent capability missing from all major benchmarks (SWE-bench, Aider, LiveBench, SEAL), a critical gap for production pipelines where the consumers are parsers, not humans. Smaller models fundamentally lack the instruction-following precision needed for machine-readable output.
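One common mitigation in such pipelines is a strict machine-readability gate before anything downstream touches the model output. A minimal sketch, where the expected header and example strings are hypothetical:

```python
import csv
import io

def is_valid_csv(text, expected_cols):
    # Reject answers wrapped in Markdown code fences, then require the
    # exact header and column count the downstream parser expects.
    if text.lstrip().startswith("```"):
        return False
    try:
        rows = list(csv.reader(io.StringIO(text)))
    except csv.Error:
        return False
    return bool(rows) and rows[0] == expected_cols and all(
        len(r) == len(expected_cols) for r in rows[1:]
    )

good = "name,score\nalpha,0.9\nbeta,0.7\n"
bad = "```python\nprint('alpha')\n```"
ok = is_valid_csv(good, ["name", "score"])      # accepted
rejected = is_valid_csv(bad, ["name", "score"]) # fenced code, rejected
```

A gate like this turns the silent failure mode described above (correct content, wrong container) into an explicit, retryable error.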
Interspeech 2026 challenge shifts audio AI evaluation from result-oriented to process-oriented reasoning quality using instance-level rubric-based evaluation. Champion agent integrated 40+ specialized audio tools achieving 69.83% Rubrics score. Emphasizes transparent reasoning over black-box performance metrics.
Chip Huyen's 'AI Engineering' book became O'Reilly's most-read since launch, covering evaluation, prompt engineering, RAG, fine-tuning, dataset engineering, and production architecture. Emphasizes evaluation as the most critical part of AI engineering and data as the most valuable asset in an era of commoditized models.