🍡 feedmeAI
Benchmarks (45 items)


📑 arXiv 1h ago

Beyond Task Completion: An Assessment Framework for Evaluating Agentic AI Systems

Framework evaluates agentic systems across LLM, Memory, Tools, and Environment dimensions using static analysis, dynamic monitoring, and judge-based evaluation to detect policy violations beyond task completion. Based on a CloudOps production deployment where success metrics masked compliance failures. Addresses a gap in current benchmarks, which measure outcomes but not process adherence.

💬 Reddit 2d ago

Qwen 3.6 35B crushes Gemma 4 26B on my tests

User benchmark comparing Qwen 3.6 35B against Gemma 4 26B on 30k-line codebase with 37 intentional bugs and PDF analysis tasks shows Qwen significantly outperforming across agentic capabilities, coding, image-to-text, instruction following, and reasoning. Both models tested at Q4_K_XL quantization for fair comparison.

📑 arXiv 2d ago

Mind's Eye: A Benchmark of Visual Abstraction, Transformation and Composition for Multimodal LLMs

Mind's Eye benchmark evaluates MLLMs on eight visuo-cognitive tasks inspired by human intelligence tests, organized under Abstraction-Relation-Transformation taxonomy. Humans achieve 80% accuracy while top MLLMs remain below 50%, revealing failures in visual attention, pattern induction, and mental transformation—core processes of fluid intelligence.

📑 arXiv 2d ago

Neurosymbolic Repo-level Code Localization

Exposes a critical keyword shortcut bias in code localization benchmarks where models rely on superficial lexical matching rather than structural reasoning. Introduces KA-LogicQuery, a diagnostic benchmark requiring structural reasoning without naming hints, revealing catastrophic performance drops in state-of-the-art approaches and motivating a neurosymbolic framework combining neural retrieval with symbolic verification.

📑 arXiv 2d ago

MEDLEY-BENCH: Scale Buys Evaluation but Not Control in AI Metacognition

MEDLEY-BENCH evaluates AI metacognition by separating independent reasoning, private self-revision, and socially influenced revision under genuine inter-model disagreement. Testing 35 models reveals a robust dissociation: evaluation ability scales with model size, but control over one's reasoning does not, indicating larger models can assess but not regulate their cognition.

📑 arXiv 2d ago

JFinTEB: Japanese Financial Text Embedding Benchmark

JFinTEB is the first comprehensive benchmark for Japanese financial text embeddings, covering retrieval and classification tasks including sentiment analysis, document categorization, and economic survey classification. Evaluates diverse embedding models on language-specific and domain-specific financial text processing scenarios.

💬 Reddit 2d ago

Bonsai models are pure hype: Bonsai-8B is MUCH dumber than Gemma-4-E2B

Comparative evaluation shows Bonsai-8B at 1.125 bpw (782 MB) underperforms Gemma-4-2B at 4.8 bpw (1104 MB) while offering only a 29% size reduction, questioning the value proposition of extreme quantization. A ternary 1.58-bit variant performed even worse while being 33% larger than Gemma, at 1477 MB. Suggests aggressive sub-2-bit quantization may sacrifice too much capability for modest size gains.
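The percentage claims follow directly from the file sizes reported in the post (MB figures taken from the post; the arithmetic below just checks them):

```python
def pct_smaller(small_mb: float, large_mb: float) -> float:
    """Percentage size reduction of small_mb relative to large_mb."""
    return 100 * (1 - small_mb / large_mb)

bonsai_vs_gemma = pct_smaller(782, 1104)    # ~29% smaller, as the post says
ternary_vs_gemma = pct_smaller(1477, 1104)  # negative: the ternary variant is ~34% LARGER
```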

📑 arXiv 2d ago

Discover and Prove: An Open-source Agentic Framework for Hard Mode Automated Theorem Proving in Lean 4

Discover And Prove (DAP) introduces 'Hard Mode' automated theorem proving where systems must independently discover answers before constructing formal proofs, unlike standard benchmarks that embed answers in statements. Releases MiniF2F-Hard and FIMO-Hard benchmarks with expert reannotations, and an agentic framework using LLM natural-language reasoning with self-reflection for answer discovery.

📑 arXiv 3d ago

AD4AD: Benchmarking Visual Anomaly Detection Models for Safer Autonomous Driving

AD4AD benchmark evaluates Visual Anomaly Detection models for identifying out-of-distribution objects in autonomous driving, enabling systems to alert drivers when encountering unfamiliar situations. Produces pixel-level anomaly maps to guide attention to specific risk regions. Addresses safety-critical failure modes when perception systems encounter conditions outside training distribution.

📑 arXiv 3d ago

Why Do Vision Language Models Struggle To Recognize Human Emotions?

Vision-language models struggle to recognize human emotions, underperforming even specialized vision-only classifiers despite progress on other visual tasks. The study identifies two critical vulnerabilities: long-tailed emotion dataset distributions exacerbated by web-scale pretraining, and challenges with continuous dynamic facial expression recognition. Reveals a fundamental gap in VLM emotional understanding capabilities.

📑 arXiv 3d ago

Context Over Content: Exposing Evaluation Faking in Automated Judges

A "stakes signaling" vulnerability causes LLM-as-a-judge models to systematically corrupt their assessments when informed of the downstream consequences their verdicts will have on evaluated models. Controlled experiments across 1,520 responses on safety and quality benchmarks demonstrate judges evaluate based on contextual framing rather than strictly on semantic content, undermining the operational backbone of automated AI evaluation pipelines.

📑 arXiv 3d ago

MADE: A Living Benchmark for Multi-Label Text Classification with Uncertainty Quantification of Medical Device Adverse Events

MADE introduces a living multi-label text classification benchmark for medical device adverse events, continuously updated with new reports to prevent training data contamination. Features long-tailed hierarchical labels and enables uncertainty quantification evaluation critical for high-stakes healthcare ML. Addresses benchmark saturation and the memorization-versus-reasoning distinction.

📑 arXiv 3d ago

QuantCode-Bench: A Benchmark for Evaluating the Ability of Large Language Models to Generate Executable Algorithmic Trading Strategies

QuantCode-Bench provides 400 tasks evaluating LLMs on generating executable algorithmic trading strategies for Backtrader from English descriptions. Unlike standard code benchmarks, requires domain-specific financial logic, specialized API knowledge, and code producing actual trades on historical data, with tasks sourced from Reddit, TradingView, and synthetic generators.
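For context, the kind of executable strategy logic such tasks target can be sketched in plain Python. This is an illustrative moving-average crossover signal generator, not Backtrader code and not a task from the benchmark; it deliberately avoids the Backtrader API so it is self-contained:

```python
def sma(prices, n, i):
    """Simple moving average of the n prices ending at index i."""
    return sum(prices[i - n + 1 : i + 1]) / n

def crossover_signals(prices, fast=2, slow=3):
    """Emit (index, action) whenever the fast SMA crosses the slow SMA."""
    signals = []
    for i in range(slow, len(prices)):
        f_prev, s_prev = sma(prices, fast, i - 1), sma(prices, slow, i - 1)
        f_now, s_now = sma(prices, fast, i), sma(prices, slow, i)
        if f_prev <= s_prev and f_now > s_now:
            signals.append((i, "buy"))
        elif f_prev >= s_prev and f_now < s_now:
            signals.append((i, "sell"))
    return signals

prices = [3, 2, 1, 2, 3, 4, 3, 2, 1]
signals = crossover_signals(prices)  # buy on the upswing, sell on the downturn
```

A benchmark like this then grades not the signal logic in isolation but whether the generated code actually places trades when run against historical data.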

📑 arXiv 3d ago

DiscoTrace: Representing and Comparing Answering Strategies of Humans and LLMs in Information-Seeking Question Answering

DiscoTrace analyzes rhetorical strategies in information-seeking answers by representing them as sequences of discourse acts paired with question interpretations. Human communities show diverse answering preferences, while LLMs lack rhetorical diversity and systematically favor breadth over depth regardless of prompting. Reveals fundamental differences in how humans and models construct answers beyond surface-level content.

📑 arXiv 3d ago

FedIDM: Achieving Fast and Stable Convergence in Byzantine Federated Learning through Iterative Distribution Matching

FedIDM addresses slow convergence and utility-robustness tradeoffs in Byzantine federated learning by using distribution matching to generate trustworthy condensed data that identifies malicious clients. The method filters abnormal updates through deviation detection and negative contribution rejection, achieving faster and more stable convergence against colluding attackers.
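FedIDM's filter is built on distribution matching with condensed data; as a generic illustration of the simpler idea of deviation-based filtering of client updates (not the paper's method), a coordinate-wise-median baseline looks like:

```python
import math
from statistics import median

def filter_and_average(updates, max_dev=1.0):
    """Drop client updates far from the coordinate-wise median, average the rest."""
    ref = [median(col) for col in zip(*updates)]

    def dev(u):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, ref)))

    kept = [u for u in updates if dev(u) <= max_dev]
    return [sum(col) / len(kept) for col in zip(*kept)]

# three honest clients plus one Byzantine update pulling the mean away
updates = [[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [10.0, -10.0]]
agg = filter_and_average(updates)  # outlier rejected, honest average survives
```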

📑 arXiv 3d ago

When Fairness Metrics Disagree: Evaluating the Reliability of Demographic Fairness Assessment in Machine Learning

Multi-metric analysis of demographic fairness in ML reveals different fairness metrics produce conflicting assessments on the same system due to capturing distinct statistical properties. Using face recognition experiments, demonstrates that fairness evaluation reliability depends critically on metric choice, challenging assumptions of consistency.
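A toy worked example shows how two common metrics can disagree on the same classifier (numbers are illustrative, not from the paper): demographic parity compares selection rates, equal opportunity compares true-positive rates, and one can hold while the other fails.

```python
def selection_rate(tp, fp, n):
    return (tp + fp) / n

def tpr(tp, fn):
    return tp / (tp + fn)

# Group A: 100 people, 50 true positives; classifier flags 50 (TP=40, FP=10)
# Group B: 100 people, 20 true positives; classifier flags 50 (TP=20, FP=30)
sr_a, sr_b = selection_rate(40, 10, 100), selection_rate(20, 30, 100)
tpr_a, tpr_b = tpr(40, 10), tpr(20, 0)

dp_gap = abs(sr_a - sr_b)    # demographic parity gap: 0.0 -> looks fair
eo_gap = abs(tpr_a - tpr_b)  # equal opportunity gap: 0.2 -> looks unfair
```

The same predictions are "fair" under one metric and "unfair" under the other, which is exactly the kind of conflict the paper documents.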

✍️ Simon Willison 4d ago

Qwen3.6-35B-A3B on my laptop drew me a better pelican than Claude Opus 4.7

Qwen3.6-35B-A3B running locally outperformed Claude Opus 4.7 on an SVG pelican generation task, demonstrating the narrowing capability gap between quantized open-weight models and proprietary APIs for specific visual generation benchmarks. The comparison highlights increasing viability of local inference despite not reflecting overall model capability.

🤗 Hugging Face 4d ago

DR³-Eval: Towards Realistic and Reproducible Deep Research Evaluation

DR³-Eval provides a reproducible benchmark for deep research agents using static research sandbox corpora paired with authentic user tasks, measuring multimodal report generation across dimensions including information recall, factual accuracy, and citation coverage. It addresses the challenge of evaluating long-horizon research tasks by simulating open-web complexity while remaining fully verifiable.

💬 Reddit 4d ago

Failure to Reproduce Modern Paper Claims [D]

Community report of a reproducibility crisis: 4 out of 7 recent ML papers failed to reproduce their claimed results, with 2 having unresolved GitHub issues. Highlights growing concerns about research quality and verification standards, and reflects broader questions about publication incentives and validation rigor in current ML research.

🧠 DeepMind 6d ago
★ High Signal

Google Gemini 3 Deep Think - Major Upgrade

Google's Gemini 3 Deep Think achieves 48.4% on Humanity's Last Exam and 84.6% on ARC-AGI-2, now available to Ultra subscribers and select enterprise users. Early adopters use it to identify mathematical paper errors missed by peer review and optimize semiconductor crystal growth. Novel application of specialized reasoning mode to scientific and engineering problems beyond standard benchmarks.

📑 arXiv 2w ago

Proactive Agent Research Environment: Simulating Active Users to Evaluate Proactive Assistants

Proactive Agent Research Environment simulates active users to evaluate AI assistants that anticipate needs and initiate actions rather than just responding to queries. Existing benchmarks lack realistic user simulation for testing proactive behaviors like timely suggestions and anticipatory information gathering. Bridges the gap between passive query-response evaluation and true assistant capabilities needed in high-stakes domains.

📝 Blog Mar 8

Format Compliance as Separate Capability: Small Models Lack It

Production testing reveals Gemma 12B and Qwen 3.5 35B return correct answers in unparseable formats despite explicit instructions: Python instead of CSV, Markdown instead of CSV. Format compliance is an independent capability missing from all major benchmarks (SWE-bench, Aider, LiveBench, SEAL), a critical gap for production pipelines where the consumers are parsers, not humans. Smaller models fundamentally lack the instruction-following precision needed for machine-readable output.
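A minimal guardrail for such pipelines is to validate model output before it reaches downstream consumers. The helper below is a hypothetical sketch (not from the post) that rejects fence-wrapped or Markdown-table output and ragged rows:

```python
import csv
import io

def is_machine_readable_csv(text: str, expected_cols: int) -> bool:
    """Reject common LLM formatting failures: code fences, Markdown tables, ragged rows."""
    stripped = text.strip()
    if stripped.startswith("```") or stripped.startswith("|"):
        return False  # Markdown wrapper, not raw CSV
    rows = list(csv.reader(io.StringIO(stripped)))
    if not rows:
        return False
    return all(len(row) == expected_cols for row in rows)

ok = is_machine_readable_csv("name,score\npelican,9\nwalrus,4", 2)       # raw CSV passes
bad = is_machine_readable_csv("```csv\nname,score\npelican,9\n```", 2)   # fenced output fails
```

Failing fast like this turns a silent downstream parser crash into a detectable, retryable model error.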