Framework evaluates agentic systems across LLM, Memory, Tools, and Environment dimensions using static analysis, dynamic monitoring, and judge-based evaluation to detect policy violations beyond task completion. Based on CloudOps production deployment where success metrics masked compliance failures. Addresses gap in current benchmarks that measure outcomes but not process adherence.
Berkeley researchers achieved near-perfect scores on major AI agent benchmarks (SWE-bench, WebArena, FieldWorkArena, Terminal-Bench) without solving tasks, using exploits ranging from trivial to sophisticated. Exposes that evaluations weren't designed to resist systems optimizing for scores rather than actual task completion.
User benchmark comparing Qwen 3.6 35B against Gemma 4 26B on 30k-line codebase with 37 intentional bugs and PDF analysis tasks shows Qwen significantly outperforming across agentic capabilities, coding, image-to-text, instruction following, and reasoning. Both models tested at Q4_K_XL quantization for fair comparison.
GeoSpOT uses Optimal Transport methods with geographic metadata to quantify distribution distances between geospatial domains, addressing out-of-domain generalization challenges in computer vision for geographic data. Provides a principled method to predict when cross-region model adaptation will succeed given uneven global data coverage.
Conformal prediction framework for LLMs using Layer-Wise Information (LI) scores from internal representations instead of output statistics like token probabilities. LI scores measure how conditioning on input reshapes predictive entropy across model depth, providing more robust uncertainty quantification under calibration-deployment mismatch.
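The split-conformal machinery underlying this approach can be sketched generically. The LI score itself is the paper's contribution and is not reproduced here; a placeholder nonconformity score stands in, and all names and numbers are illustrative:

```python
import numpy as np

def conformal_threshold(cal_scores, alpha=0.1):
    # Split conformal calibration: take the (1 - alpha)-quantile of the
    # calibration nonconformity scores with the finite-sample correction.
    n = len(cal_scores)
    q = np.ceil((n + 1) * (1 - alpha)) / n
    return np.quantile(cal_scores, min(q, 1.0), method="higher")

# Toy calibration scores, standing in for LI-based nonconformity values.
rng = np.random.default_rng(0)
cal = rng.uniform(0, 1, size=500)
tau = conformal_threshold(cal, alpha=0.1)

# At deployment, any candidate whose score is <= tau enters the prediction
# set, which carries a marginal coverage guarantee of at least 1 - alpha.
candidates = {"A": 0.2, "B": 0.85, "C": 0.95}
pred_set = [c for c, s in candidates.items() if s <= tau]
```

The paper's claim is that scores derived from internal representations survive calibration-deployment mismatch better than output-level scores plugged into this same recipe.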
Tabular foundation models enable in-context molecular property prediction without task-specific fine-tuning, addressing small dataset challenges in drug discovery and chemical engineering. The approach evaluates frozen molecular embeddings and TFMs across pharmaceutical and engineering benchmarks in low- to medium-data regimes.
Extracted the scholarly reasoning systems of two prominent humanities scholars from published corpora, converted them into structured inference-time constraints for LLMs, and tested whether resulting scholar-bots could perform doctoral supervision, peer review, and lecturing at expert quality. Expert assessment found outputs met appointment-level quality standards, raising questions about knowledge work automation from public scholarship alone.
Chain-of-Thought prompting consistently degrades performance in visual spatial reasoning tasks across seventeen multimodal models and thirteen benchmarks. A novel No-Image++ ablation reveals that multimodal reasoning models hallucinate visual details from textual priors even when images are absent, indicating severe shortcut learning in CoT-prompted vision-language models.

LLMSniffer fine-tunes GraphCodeBERT with two-stage supervised contrastive learning to detect AI-generated code, improving accuracy from 70% to 78% on GPTSniffer and 91% to 94.65% on Whodunit. The approach combines comment removal preprocessing with an MLP classifier and produces well-separated embeddings confirmed by t-SNE visualization.
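The supervised contrastive objective at the core of this kind of detector can be sketched in a few lines. This is a generic SupCon-style loss over precomputed embeddings, not LLMSniffer's actual two-stage training code; the temperature and toy data are assumptions:

```python
import numpy as np

def supcon_loss(embeddings, labels, temperature=0.1):
    # Supervised contrastive loss: pull embeddings with the same label
    # together, push embeddings with different labels apart.
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    n = len(labels)
    loss = 0.0
    for i in range(n):
        mask = (labels == labels[i]) & (np.arange(n) != i)
        if not mask.any():
            continue
        # Log-softmax over all other samples, then average over positives.
        log_z = np.log(np.exp(np.delete(sim[i], i)).sum())
        log_prob = sim[i] - log_z
        loss += -log_prob[mask].mean()
    return loss / n

# Well-separated classes (human vs. AI-generated code embeddings, toy 2-D).
emb = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
loss_separated = supcon_loss(emb, np.array([0, 0, 1, 1]))
loss_shuffled = supcon_loss(emb, np.array([0, 1, 0, 1]))
```

A lower loss for the separated labeling is exactly the clustering effect the t-SNE visualizations in the paper are reported to confirm.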
Mind's Eye benchmark evaluates MLLMs on eight visuo-cognitive tasks inspired by human intelligence tests, organized under Abstraction-Relation-Transformation taxonomy. Humans achieve 80% accuracy while top MLLMs remain below 50%, revealing failures in visual attention, pattern induction, and mental transformation—core processes of fluid intelligence.
SocialGrid is an Among Us-inspired benchmark evaluating LLM agents on planning, task execution, and social reasoning in embodied multi-agent settings. Even GPT-OSS-120B achieves below 60% accuracy, with agents stuck in repetitive behaviors—revealing social reasoning remains a bottleneck even with planning assistance.
Exposes a critical keyword shortcut bias in code localization benchmarks where models rely on superficial lexical matching rather than structural reasoning. Introduces KA-LogicQuery, a diagnostic benchmark requiring structural reasoning without naming hints, revealing catastrophic performance drops in state-of-the-art approaches and motivating a neurosymbolic framework combining neural retrieval with symbolic verification.
MEDLEY-BENCH evaluates AI metacognition by separating independent reasoning, private self-revision, and socially influenced revision under genuine inter-model disagreement. Testing 35 models reveals a robust dissociation: evaluation ability scales with model size, but control over one's reasoning does not, indicating larger models can assess but not regulate their cognition.
Agentic Verifier transforms reward modeling into multi-turn, tool-augmented deliberation using complementary forward and backward agents. Addresses error propagation and lack of grounding in complex domains by tracing solutions from premises to conclusions and re-checking conclusions against premises for comprehensive verification.
540,000 simulated content selections across three major LLM providers and three social platforms reveal structural content selection biases that differ substantially in how they respond to prompting strategies. While biases vary across providers and platforms, certain patterns persist robustly, with implications for LLM-based content curation and recommendation systems.
JFinTEB is the first comprehensive benchmark for Japanese financial text embeddings, covering retrieval and classification tasks including sentiment analysis, document categorization, and economic survey classification. Evaluates diverse embedding models on language-specific and domain-specific financial text processing scenarios.
LLMs show systematic asymmetry between judging pragmatic appropriateness and generating pragmatically appropriate language across three settings. Models that excel at evaluating pragmatic competence often fail to produce similarly competent outputs, revealing misalignment between their listener and speaker capabilities.
UniEditBench provides the first unified benchmark for image and video editing across reconstruction-based and instruction-driven methods, with taxonomies covering 9 image and 8 video operations. Uses distilled MLLMs as cost-effective automatic evaluators that align with human preference, addressing fragmentation in visual editing evaluation.
QuantSightBench evaluates LLM quantitative forecasting using prediction intervals over continuous quantities rather than binary or multiple-choice formats. The benchmark demands scale awareness, internal consistency across confidence levels, and calibration over continuous outcomes, addressing a gap in existing reasoning-under-uncertainty evaluations.
Controlled experiments on shortest-path planning reveal LLMs exhibit strong spatial generalization to unseen maps but fail at length scaling due to recursive instability. The synthetic environment cleanly separates training data, paradigms, and inference strategies to isolate generalization failure modes.
Split conformal prediction applied to LLM-as-judge frameworks reveals reliability issues masked by aggregate metrics: 33-67% of documents show transitivity violations despite low average rates, and prediction set width serves as a per-instance reliability indicator with strong correlation to actual uncertainty. The approach provides theoretically-guaranteed coverage bounds for judge outputs.
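A transitivity check over pairwise judge verdicts, the kind of per-document diagnostic the paper aggregates, can be sketched as follows. The preference data is synthetic with one cycle baked in; the conformal machinery itself is not reproduced:

```python
from itertools import combinations

# prefer[(a, b)] = True means the judge ranked answer a above answer b.
prefer = {
    ("A", "B"): True, ("B", "C"): True, ("A", "C"): False,  # cycle A>B>C>A
    ("A", "D"): True, ("B", "D"): True, ("C", "D"): True,
}

def beats(x, y):
    return prefer[(x, y)] if (x, y) in prefer else not prefer[(y, x)]

def violating_triples(items):
    # A 3-item tournament is intransitive exactly when it is a cycle,
    # i.e. every item beats exactly one of the other two.
    bad = 0
    for x, y, z in combinations(items, 3):
        wins = [sum(beats(a, b) for b in (x, y, z) if b != a) for a in (x, y, z)]
        if wins == [1, 1, 1]:
            bad += 1
    return bad

n_violations = violating_triples(["A", "B", "C", "D"])
```

The point of the aggregate-vs-per-document finding is that a corpus-level violation *rate* can be low while a large fraction of individual documents each contain at least one such cycle.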
Vision-language models struggle to recognize human emotions, underperforming even specialized vision-only classifiers despite progress on other visual tasks. The study identifies two critical vulnerabilities: long-tailed emotion dataset distributions exacerbated by web-scale pretraining, and challenges with continuous dynamic facial expression recognition. Reveals fundamental gap in VLM emotional understanding capabilities.
SegWithU augments frozen pretrained segmentation models with a lightweight uncertainty head that produces voxel-wise uncertainty maps using rank-1 posterior probes in a compact feature space. Unlike existing methods requiring repeated inference, it achieves strong failure detection and calibration in a single forward pass for medical image segmentation.
CoopEval benchmarks game-theoretic cooperation mechanisms across four social dilemmas, revealing that stronger reasoning LLMs behave less cooperatively in mixed-motive games like prisoner's dilemma. The work evaluates mechanisms including repeated games, reputation systems, and commitment devices to enable cooperative equilibria between rational agents.
Stakes signaling vulnerability shows LLM-as-a-judge models systematically corrupt assessments when informed of downstream consequences their verdicts will have on evaluated models. Controlled experiments across 1,520 responses on safety and quality benchmarks demonstrate judges evaluate based on contextual framing rather than strictly on semantic content, undermining the operational backbone of automated AI evaluation pipelines.
Empirical study evaluates AI-assisted requirements engineering tools against expert judgment using INCOSE criteria within a controlled systems engineering methodology. Research investigates whether AI can support quality assessment and validation of requirements without replacing professional expertise. Addresses a gap in understanding AI's role within formal systems engineering processes.
IRS framework decomposes humor understanding into three structured components: identifying visual incongruities, constructing coherent reinterpretations, and aligning with human preference judgments. Applies incongruity-resolution theory to the New Yorker Cartoon Caption Contest, moving beyond black-box prediction to explicit reasoning processes. Demonstrates that humor comprehension requires getting both the answer and the underlying reasoning correct.
MADE introduces a living multi-label text classification benchmark for medical device adverse events, continuously updated with new reports to prevent training data contamination. Features long-tailed hierarchical labels and enables uncertainty quantification evaluation critical for high-stakes healthcare ML. Addresses benchmark saturation and memorization vs. reasoning distinction.
RL-STPA adapts System-Theoretic Process Analysis for reinforcement learning safety through hierarchical subtask decomposition, coverage-guided perturbation testing, and iterative hazard feedback loops. Addresses distributional shift and emergent behaviors unique to neural RL policies in safety-critical deployments.
Diffusion models trained with denoising score matching often violate the Fokker-Planck equation governing data density evolution. This paper tests whether lightweight regularization penalties can reduce these violations without the computational overhead of direct FP equation enforcement, finding that weaker regularization sometimes yields better sample quality than strict adherence.
Study examines LLM overgeneration patterns in machine translation, distinguishing between neurobabble confabulations and appropriate explanatory additions that mimic human translator behavior. The work focuses on commercial deployment challenges of detecting and classifying these overgenerations. Novel contribution is the taxonomy of LLM translation behaviors ranging from harmful confabulations to helpful contextual explanations.
QuantCode-Bench provides 400 tasks evaluating LLMs on generating executable algorithmic trading strategies for Backtrader from English descriptions. Unlike standard code benchmarks, requires domain-specific financial logic, specialized API knowledge, and code producing actual trades on historical data, with tasks sourced from Reddit, TradingView, and synthetic generators.
Proposes axiomatic benchmark for scientific novelty metrics that avoids confounded proxies like citation counts or peer review scores. Addresses fundamental evaluation challenge for AI scientist systems by enabling reliable, automated novelty assessment without conflating novelty with impact, quality, or reviewer preference.
DiscoTrace analyzes rhetorical strategies in information-seeking answers by representing them as sequences of discourse acts paired with question interpretations. Human communities show diverse answering preferences, while LLMs lack rhetorical diversity and systematically favor breadth over depth regardless of prompting. Reveals fundamental differences in how humans and models construct answers beyond surface-level content.
Blinded multi-rater study with 6 senior diabetes clinicians evaluated retrieval-grounded LLM conversational agent for CGM data interpretation and patient counseling support across 12 cases. System generated plain-language explanations while avoiding individualized therapeutic advice, addressing time-intensive nature of CGM pattern explanation. Evidence development for RAG-based clinical decision support in diabetes care.
IUQ quantifies uncertainty in long-form LLM generation by combining inter-sample consistency and intra-sample faithfulness. Targets the failure mode of semantically coherent but factually inaccurate free-form text, where answer sets can't be constrained.
MinShap modifies Shapley values from cooperative game theory to focus on direct feature effects rather than indirect dependencies, making them suitable for feature selection in non-linear models. The approach adapts attribution methods to the distinct requirements of variable selection with dependent features.
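For reference, the classical Shapley baseline that MinShap adapts can be computed exactly on a toy value function by enumerating orderings. MinShap's modification to isolate direct effects is not reproduced here; the two-feature game and its synergy term are illustrative assumptions:

```python
from itertools import permutations

def shapley(players, value):
    # Classical Shapley value: each player's marginal contribution,
    # averaged over all orderings in which the coalition can form.
    phi = {p: 0.0 for p in players}
    perms = list(permutations(players))
    for order in perms:
        coalition = set()
        for p in order:
            before = value(frozenset(coalition))
            coalition.add(p)
            phi[p] += value(frozenset(coalition)) - before
    return {p: v / len(perms) for p, v in phi.items()}

# Toy model: x1 contributes 3 alone, x2 contributes 1 alone, and the pair
# adds a synergy of 2 (mimicking an indirect effect of dependent features).
def v(s):
    base = 3.0 * ("x1" in s) + 1.0 * ("x2" in s)
    return base + (2.0 if {"x1", "x2"} <= s else 0.0)

phi = shapley(["x1", "x2"], v)
# The synergy is split evenly: phi["x1"] == 4.0, phi["x2"] == 2.0.
```

That even split of the interaction term across dependent features is precisely what the standard Shapley value does and what a direct-effect variant for feature selection would want to unpick.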
Scoping review of 23 studies reveals fairness in multi-agent AI systems is superficially addressed, lacks normative foundations, and overlooks agent autonomy dynamics. Authors argue fairness must be embedded structurally throughout MAAI development lifecycles rather than added post-hoc, addressing gaps in an increasingly important but understudied area.
Multi-metric analysis of demographic fairness in ML reveals different fairness metrics produce conflicting assessments on the same system due to capturing distinct statistical properties. Using face recognition experiments, demonstrates that fairness evaluation reliability depends critically on metric choice, challenging assumptions of consistency.
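The metric-disagreement phenomenon is easy to reproduce on synthetic counts. The per-group confusion numbers below are invented to make one classifier look fair under demographic parity and unfair under equal opportunity at the same time:

```python
# Per-group outcomes for one binary classifier: (TP, FP, FN, TN).
groups = {
    "g1": (40, 10, 10, 40),  # predicted-positive rate 0.50, TPR 0.80
    "g2": (45, 5, 30, 20),   # predicted-positive rate 0.50, TPR 0.60
}

def positive_rate(tp, fp, fn, tn):
    # Demographic parity compares P(predicted positive) across groups.
    return (tp + fp) / (tp + fp + fn + tn)

def true_positive_rate(tp, fp, fn, tn):
    # Equal opportunity compares P(predicted positive | actually positive).
    return tp / (tp + fn)

dp_gap = abs(positive_rate(*groups["g1"]) - positive_rate(*groups["g2"]))
tpr_gap = abs(true_positive_rate(*groups["g1"]) - true_positive_rate(*groups["g2"]))
# dp_gap  == 0.0 -> "fair" under demographic parity
# tpr_gap == 0.2 -> "unfair" under equal opportunity
```

Because the two metrics condition on different events, a single system can pass one audit and fail the other on identical predictions, which is the consistency assumption the study challenges.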
ProVoice-Bench introduces the first evaluation framework for proactive voice agents with 1,182 samples across four tasks measuring intervention and monitoring capabilities. State-of-the-art multimodal LLMs show significant performance gaps particularly in over-triggering and reasoning, revealing limitations in current proactive agent paradigms.
Ecom-RLVE introduces adaptive verifiable environments for training and evaluating e-commerce conversational agents with reinforcement learning. Provides structured simulation environments where agent actions can be verified against ground truth. Enables systematic development of domain-specific conversational AI for shopping and customer service scenarios.
DR³-Eval provides a reproducible benchmark for deep research agents using static research sandbox corpora paired with authentic user tasks, measuring multimodal report generation across dimensions including information recall, factual accuracy, and citation coverage. It addresses the challenge of evaluating long-horizon research tasks by simulating open-web complexity while remaining fully verifiable.
Community report of reproducibility crisis: 4 out of 7 recent ML papers failed to reproduce claimed results, with 2 having unresolved GitHub issues. Highlights growing concerns about research quality and verification standards. Reflects broader questions about publication incentives and validation rigor in current ML research.
ICLR 2025 Oral paper evaluated SQL code generation using natural language similarity metrics instead of execution-based validation, yielding a ~20% false positive rate in the authors' own testing. Community questions the appropriateness of the Oral designation given this fundamental evaluation methodology flaw. Highlights peer review challenges in code generation benchmarks.
C2 trains reward models to critically collaborate with rubric generators using only binary preference data, avoiding costly rubric annotations. The framework generates helpful and misleading rubric pairs to teach the reward model when to rely on or override rubric guidance, addressing the cooperative communication failure where low-quality rubrics mislead verification.
Community observation that Claude-4.6-Opus fine-tunes of open models consistently underperform base models despite promises of increased reasoning. Testing across multiple models and quantization levels shows decreased intelligence in agent setups. Suggests synthetic data distillation from proprietary models may not reliably transfer capabilities.
KLD evaluation framework for Qwen3.5-9B GGUF quantizations measures probability distribution drift from BF16 baseline rather than perplexity. Provides data-driven quant selection by measuring faithfulness to original weights independent of dataset artifacts.
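The core measurement is just a KL divergence between next-token distributions from the reference and quantized models, averaged over a corpus. A minimal sketch with toy three-token distributions (the actual framework's tooling and vocabulary handling are not reproduced):

```python
import math

def kl_divergence(p, q, eps=1e-12):
    # KL(P || Q) in nats over a shared token vocabulary; P is the BF16
    # reference distribution, Q the quantized model's distribution.
    return sum(pi * math.log(pi / max(qi, eps))
               for pi, qi in zip(p, q) if pi > 0)

# Toy next-token distributions at one position.
bf16  = [0.70, 0.20, 0.10]
quant = [0.60, 0.25, 0.15]

drift = kl_divergence(bf16, quant)
```

Unlike perplexity, this scores faithfulness to the original weights rather than fit to a particular evaluation corpus: a quant that matches the BF16 distribution exactly scores zero regardless of which tokens either model would have gotten "right".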
r/LocalLLaMA consensus ranks Qwen 3.5 most broadly recommended, Gemma 4 showing strong buzz, GLM-5/4.7 near top of rankings, MiniMax M2.5/M2.7 for agentic workloads, DeepSeek V3.2 in top cluster. Qwen3-Coder-Next dominates for local coding. Community-driven practical guidance on deployed models.
PARROT framework uses reward models that generate explicit multi-dimensional critiques before scoring, enabling test-time critique-and-refine loops that match RL fine-tuning performance without parameter updates. Transforms reward models from passive evaluators to active optimization tools. First demonstration that structured reasoning at inference time can unlock capabilities equivalent to gradient-based training.
MMLU and other 2024-dominant benchmarks now saturated (>95% on frontier models), relegated to "floor checks" rather than frontier separators. Frontier now decided by HLE, GPQA, MMLU-Pro, SWE-bench Pro, Terminal-Bench 2.0, BrowseComp for agentic tasks. Benchmark choice matters more than ever as academic standards become irrelevant for comparing top models.
Proactive Agent Research Environment simulates active users to evaluate AI assistants that anticipate needs and initiate actions rather than just responding to queries. Existing benchmarks lack realistic user simulation for testing proactive behaviors like timely suggestions and anticipatory information gathering. Bridges the gap between passive query-response evaluation and true assistant capabilities needed in high-stakes domains.
Production testing reveals Gemma 12B and Qwen 3.5 35B return correct answers in unparseable formats despite explicit instructions: Python instead of CSV, Markdown instead of CSV. Format compliance is an independent capability missing from all major benchmarks (SWE-bench, Aider, LiveBench, SEAL), a critical gap for production pipelines where the consumers are parsers, not humans. Smaller models fundamentally lack the instruction-following precision needed for machine-readable output.
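One common mitigation in such pipelines is a strict machine-readability gate before anything downstream touches the model output. A minimal sketch, where the expected header and example strings are hypothetical:

```python
import csv
import io

def is_valid_csv(text, expected_cols):
    # Reject answers wrapped in Markdown code fences, then require the
    # exact header and column count the downstream parser expects.
    if text.lstrip().startswith("```"):
        return False
    try:
        rows = list(csv.reader(io.StringIO(text)))
    except csv.Error:
        return False
    return bool(rows) and rows[0] == expected_cols and all(
        len(r) == len(expected_cols) for r in rows[1:]
    )

good = "name,score\nalpha,0.9\nbeta,0.7\n"
bad = "```python\nprint('alpha')\n```"
ok = is_valid_csv(good, ["name", "score"])      # accepted
rejected = is_valid_csv(bad, ["name", "score"]) # fenced code, rejected
```

A gate like this turns the silent failure mode described above (correct content, wrong container) into an explicit, retryable error.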
Interspeech 2026 challenge shifts audio AI evaluation from result-oriented to process-oriented reasoning quality using instance-level rubric-based evaluation. Champion agent integrated 40+ specialized audio tools achieving 69.83% Rubrics score. Emphasizes transparent reasoning over black-box performance metrics.
Chip Huyen's 'AI Engineering' book became O'Reilly's most-read since launch, covering evaluation, prompt engineering, RAG, fine-tuning, dataset engineering, and production architecture. Emphasizes evaluation as the most critical part of AI engineering and data as the most valuable asset in an era of commoditized models.