Framework evaluates agentic systems across LLM, Memory, Tools, and Environment dimensions using static analysis, dynamic monitoring, and judge-based evaluation to detect policy violations beyond task completion. Based on CloudOps production deployment where success metrics masked compliance failures. Addresses gap in current benchmarks that measure outcomes but not process adherence.
Visual overview of AI industry trends and metrics entering 2026. Provides data-driven perspective on current state of the field.
Benchmark comparing Claude and Gemini on the laden knight's tour problem, a weighted variant requiring optimal pathfinding with accumulating costs. Tests coding agents on combinatorial optimization task combining movement constraints with dynamic cost calculation.
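The benchmark's exact cost model isn't specified in the summary, but "optimal pathfinding with accumulating costs" over knight moves can be sketched with Dijkstra's algorithm. A minimal sketch, assuming a hypothetical per-square entry cost:

```python
import heapq

KNIGHT_MOVES = [(1, 2), (2, 1), (2, -1), (1, -2),
                (-1, -2), (-2, -1), (-2, 1), (-1, 2)]

def cheapest_knight_path(weights, start, goal):
    """Dijkstra over knight moves; cost accrues on each square entered.

    `weights` is an NxN grid of per-square costs (an assumed cost model,
    not necessarily the benchmark's). Returns the minimal accumulated
    cost from `start` to `goal`, or None if unreachable.
    """
    n = len(weights)
    dist = {start: 0}
    heap = [(0, start)]
    while heap:
        d, (r, c) = heapq.heappop(heap)
        if (r, c) == goal:
            return d
        if d > dist.get((r, c), float("inf")):
            continue
        for dr, dc in KNIGHT_MOVES:
            nr, nc = r + dr, c + dc
            if 0 <= nr < n and 0 <= nc < n:
                nd = d + weights[nr][nc]
                if nd < dist.get((nr, nc), float("inf")):
                    dist[(nr, nc)] = nd
                    heapq.heappush(heap, (nd, (nr, nc)))
    return None
```

The full laden tour (visiting every square) layers a combinatorial search on top of this kind of cost accounting, which is what makes it a harder target for coding agents than plain pathfinding.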
Berkeley researchers achieved near-perfect scores on major AI agent benchmarks (SWE-bench, WebArena, FieldWorkArena, Terminal-Bench) without solving tasks, using exploits ranging from trivial to sophisticated. Exposes that evaluations weren't designed to resist systems optimizing for scores rather than actual task completion.
User benchmark comparing Qwen 3.6 35B against Gemma 4 26B on 30k-line codebase with 37 intentional bugs and PDF analysis tasks shows Qwen significantly outperforming across agentic capabilities, coding, image-to-text, instruction following, and reasoning. Both models tested at Q4_K_XL quantization for fair comparison.
GeoSpOT uses Optimal Transport methods with geographic metadata to quantify distribution distances between geospatial domains, addressing out-of-domain generalization challenges in computer vision for geographic data. Provides a principled method to predict when cross-region model adaptation will succeed given uneven global data coverage.
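GeoSpOT's exact formulation isn't given in the summary, but the core idea of an optimal-transport distance between distributions has a simple closed form in one dimension, where transport matches samples in sorted order. A toy sketch (illustrative only; GeoSpOT operates on geospatial feature distributions, not raw 1-D samples):

```python
def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two equal-size 1-D samples.

    For equal-size empirical distributions on the line, optimal transport
    pairs samples in sorted order, so W1 reduces to the mean absolute
    gap between sorted samples.
    """
    assert len(xs) == len(ys)
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)
```

A small distance between two regions' feature distributions would then predict that cross-region adaptation is likely to succeed; a large one flags an out-of-domain gap.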
Mixed precision and floating-point settings cause ~2.4× training time variation in distributed deep learning, but existing predictors ignore precision and incur up to 147.85% MAPE. This work proposes a precision-aware predictor that accounts for mixed precision configurations to accurately forecast distributed training times for resource allocation and scheduling.
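For reference, the error metric quoted above is straightforward to compute. MAPE measures relative prediction error as a percentage, so 147.85% means existing predictors can be off by well over the true training time:

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs(a - p) / abs(a)
                       for a, p in zip(actual, predicted)) / len(actual)
```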
Mind's Eye benchmark evaluates MLLMs on eight visuo-cognitive tasks inspired by human intelligence tests, organized under Abstraction-Relation-Transformation taxonomy. Humans achieve 80% accuracy while top MLLMs remain below 50%, revealing failures in visual attention, pattern induction, and mental transformation—core processes of fluid intelligence.
SocialGrid is an Among Us-inspired benchmark evaluating LLM agents on planning, task execution, and social reasoning in embodied multi-agent settings. Even GPT-OSS-120B achieves below 60% accuracy, with agents stuck in repetitive behaviors—revealing social reasoning remains a bottleneck even with planning assistance.
Exposes a critical keyword shortcut bias in code localization benchmarks where models rely on superficial lexical matching rather than structural reasoning. Introduces KA-LogicQuery, a diagnostic benchmark requiring structural reasoning without naming hints, revealing catastrophic performance drops in state-of-the-art approaches and motivating a neurosymbolic framework combining neural retrieval with symbolic verification.
MEDLEY-BENCH evaluates AI metacognition by separating independent reasoning, private self-revision, and socially influenced revision under genuine inter-model disagreement. Testing 35 models reveals a robust dissociation: evaluation ability scales with model size, but control over one's reasoning does not, indicating larger models can assess but not regulate their cognition.
ReactBench reveals fundamental limitations in MLLMs' structural reasoning by testing them on chemical reaction diagrams with branching paths, converging flows, and cyclic dependencies. Existing models degrade sharply on topological structures despite excelling at individual visual elements, exposing a gap that semantic-focused benchmarks miss.
JFinTEB is the first comprehensive benchmark for Japanese financial text embeddings, covering retrieval and classification tasks including sentiment analysis, document categorization, and economic survey classification. Evaluates diverse embedding models on language-specific and domain-specific financial text processing scenarios.
LLMs show systematic asymmetry between judging pragmatic appropriateness and generating pragmatically appropriate language across three settings. Models that excel at evaluating pragmatic competence often fail to produce similarly competent outputs, revealing misalignment between their listener and speaker capabilities.
UniEditBench provides the first unified benchmark for image and video editing across reconstruction-based and instruction-driven methods, with taxonomies covering 9 image and 8 video operations. Uses distilled MLLMs as cost-effective automatic evaluators that align with human preference, addressing fragmentation in visual editing evaluation.
Comparative evaluation shows Bonsai-8B at 1.125 bpw (782 MB) underperforms Gemma-4-2B at 4.8 bpw (1104 MB) despite only 29% size reduction, questioning the value proposition of extreme quantization. Ternary 1.58-bit variant performed even worse while being 33% larger than Gemma at 1477 MB. Suggests aggressive sub-2-bit quantization may sacrifice too much capability for modest size gains.
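The size figures above follow from simple bits-per-weight arithmetic. A back-of-envelope sketch (ballpark only: it ignores per-block quantization scales, unquantized layers like embeddings, and how bpw is averaged across layers, so real quantized files deviate from this estimate):

```python
def est_size_mb(params_billion, bits_per_weight):
    """Rough weight-file size: params x bpw / 8, in MB (10^6 bytes).

    Ignores per-block scale overhead and mixed per-layer bit widths,
    so treat the result as a ballpark, not an exact file size.
    """
    return params_billion * 1e9 * bits_per_weight / 8 / 1e6
```

For instance, a 2B model at 4.8 bpw comes out near 1200 MB, the same ballpark as the 1104 MB reported for Gemma-4-2B above; the residual gap comes down to effective parameter count and bpw averaging.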
QuantSightBench evaluates LLM quantitative forecasting using prediction intervals over continuous quantities rather than binary or multiple-choice formats. The benchmark demands scale awareness, internal consistency across confidence levels, and calibration over continuous outcomes, addressing a gap in existing reasoning-under-uncertainty evaluations.
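Calibration over continuous outcomes has a concrete operational meaning: a model's stated prediction intervals should contain the realized value at the advertised rate. A minimal coverage check (illustrative; QuantSightBench's exact scoring is not described in the summary):

```python
def coverage(intervals, outcomes):
    """Empirical coverage of prediction intervals.

    `intervals` are (low, high) pairs; a well-calibrated 90% interval
    should contain the realized outcome ~90% of the time.
    """
    hits = sum(lo <= y <= hi for (lo, hi), y in zip(intervals, outcomes))
    return hits / len(outcomes)
```

Internal consistency adds a further constraint: a model's 50% interval must be nested inside its 90% interval for the same quantity.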
DPrivBench evaluates whether LLMs can automate differential privacy reasoning by testing if they can verify whether functions satisfy stated DP guarantees. The benchmark covers diverse DP topics and difficulty levels while resisting trivial pattern matching, addressing the expert-level barrier that prevents non-experts from designing DP algorithms.
Discover And Prove (DAP) introduces 'Hard Mode' automated theorem proving where systems must independently discover answers before constructing formal proofs, unlike standard benchmarks that embed answers in statements. Releases MiniF2F-Hard and FIMO-Hard benchmarks with expert reannotations, and an agentic framework using LLM natural-language reasoning with self-reflection for answer discovery.
📝 Blog 3d ago
★ High Signal
NVIDIA's Nemotron 3 Super is a 120B/12B-active MoE model with hybrid Mamba-Attention architecture scoring 60.47% on SWE-Bench Verified—the highest open-weight coding score at launch. Features 1M context, 2.2x throughput improvement, and native speculative decoding for efficient agentic reasoning.
Qwen 3.6 35B A3B achieves 187 tokens/sec on RTX 5090 32GB at Q5_K_S quantization with 120K context. Performance benchmark for local inference. Demonstrates practical deployment of mid-size models on consumer hardware.
Controlled experiments on shortest-path planning reveal LLMs exhibit strong spatial generalization to unseen maps but fail at length scaling due to recursive instability. The synthetic environment cleanly separates training data, paradigms, and inference strategies to isolate generalization failure modes.
Systematic benchmark of multiple optimizers for MLP training on tabular data finds Muon consistently outperforms the standard AdamW. First comprehensive optimizer comparison for tabular deep learning, challenging the default choice practitioners use.
AD4AD benchmark evaluates Visual Anomaly Detection models for identifying out-of-distribution objects in autonomous driving, enabling systems to alert drivers when encountering unfamiliar situations. Produces pixel-level anomaly maps to guide attention to specific risk regions. Addresses safety-critical failure modes when perception systems encounter conditions outside training distribution.
Vision-language models struggle to recognize human emotions, underperforming even specialized vision-only classifiers despite progress on other visual tasks. The study identifies two critical vulnerabilities: long-tailed emotion dataset distributions exacerbated by web-scale pretraining, and challenges with continuous dynamic facial expression recognition. Reveals fundamental gap in VLM emotional understanding capabilities.
Stakes signaling vulnerability shows LLM-as-a-judge models systematically corrupt assessments when informed of downstream consequences their verdicts will have on evaluated models. Controlled experiments across 1,520 responses on safety and quality benchmarks demonstrate judges evaluate based on contextual framing rather than strictly on semantic content, undermining the operational backbone of automated AI evaluation pipelines.
MADE introduces a living multi-label text classification benchmark for medical device adverse events, continuously updated with new reports to prevent training data contamination. Features long-tailed hierarchical labels and enables uncertainty quantification evaluation critical for high-stakes healthcare ML. Addresses benchmark saturation and memorization vs. reasoning distinction.
MambaSL achieves state-of-the-art time series classification using a single-layer Mamba architecture with TSC-specific modifications. Re-evaluates 20 baselines across all 30 UEA datasets under unified protocol, demonstrating SSMs can excel at time series tasks with minimal architectural complexity.
QuantCode-Bench provides 400 tasks evaluating LLMs on generating executable algorithmic trading strategies for Backtrader from English descriptions. Unlike standard code benchmarks, requires domain-specific financial logic, specialized API knowledge, and code producing actual trades on historical data, with tasks sourced from Reddit, TradingView, and synthetic generators.
Proposes axiomatic benchmark for scientific novelty metrics that avoids confounded proxies like citation counts or peer review scores. Addresses fundamental evaluation challenge for AI scientist systems by enabling reliable, automated novelty assessment without conflating novelty with impact, quality, or reviewer preference.
DiscoTrace analyzes rhetorical strategies in information-seeking answers by representing them as sequences of discourse acts paired with question interpretations. Human communities show diverse answering preferences, while LLMs lack rhetorical diversity and systematically favor breadth over depth regardless of prompting. Reveals fundamental differences in how humans and models construct answers beyond surface-level content.
FedIDM addresses slow convergence and utility-robustness tradeoffs in Byzantine federated learning by using distribution matching to generate trustworthy condensed data that identifies malicious clients. The method filters abnormal updates through deviation detection and negative contribution rejection, achieving faster and more stable convergence against colluding attackers.
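FedIDM's full pipeline (condensed data via distribution matching, contribution scoring) can't be reconstructed from the summary, but the deviation-detection ingredient it names is a standard pattern in Byzantine-robust aggregation. A generic sketch, assuming flattened update vectors and a hypothetical distance threshold:

```python
from statistics import median

def filter_updates(updates, threshold):
    """Keep client updates close to the coordinate-wise median.

    Generic deviation detection (a simplification of FedIDM's filter):
    drop any update whose L2 distance from the coordinate-wise median
    of all updates exceeds `threshold`.
    """
    center = [median(col) for col in zip(*updates)]

    def dist(u):
        return sum((a - b) ** 2 for a, b in zip(u, center)) ** 0.5

    return [u for u in updates if dist(u) <= threshold]
```

The median center is what gives robustness: colluding clients must outnumber honest ones to shift it, so their outlying updates are the ones rejected.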
Multi-metric analysis of demographic fairness in ML reveals different fairness metrics produce conflicting assessments on the same system due to capturing distinct statistical properties. Using face recognition experiments, demonstrates that fairness evaluation reliability depends critically on metric choice, challenging assumptions of consistency.
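The conflict is easy to reproduce on synthetic data: demographic parity compares selection rates, equal opportunity compares true-positive rates, and the same predictions can satisfy one while violating the other. A toy illustration (synthetic labels and predictions, not the paper's face-recognition data):

```python
def selection_rate(preds):
    return sum(preds) / len(preds)

def tpr(labels, preds):
    pos = [p for y, p in zip(labels, preds) if y == 1]
    return sum(pos) / len(pos)

# Two groups with identical selection rates but different error profiles.
y_a, p_a = [1, 1, 0, 0], [1, 1, 0, 0]
y_b, p_b = [1, 1, 0, 0], [0, 1, 1, 0]

dp_gap = abs(selection_rate(p_a) - selection_rate(p_b))  # demographic parity: 0
tpr_gap = abs(tpr(y_a, p_a) - tpr(y_b, p_b))             # equal opportunity: 0.5
```

By the parity criterion the system looks perfectly fair; by the true-positive-rate criterion it heavily disadvantages group B, which is exactly the metric-dependence the study documents.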
ProVoice-Bench introduces the first evaluation framework for proactive voice agents with 1,182 samples across four tasks measuring intervention and monitoring capabilities. State-of-the-art multimodal LLMs show significant performance gaps particularly in over-triggering and reasoning, revealing limitations in current proactive agent paradigms.
🔶 Anthropic 4d ago
★ High Signal
Claude Opus 4.7 achieves 87.6% on SWE-bench Verified (13% improvement) with 2x throughput on agentic tasks while maintaining $5/$25 per million token pricing and full 1M context window. The performance gains make it effectively cheaper per task despite unchanged nominal pricing. Higher-resolution vision capabilities included.
Qwen3.6-35B-A3B running locally outperformed Claude Opus 4.7 on an SVG pelican-generation task, demonstrating a narrowing gap between quantized open-weight models and proprietary APIs on specific visual generation benchmarks. The comparison highlights the increasing viability of local inference, though a single task does not reflect overall model capability.

DR³-Eval provides a reproducible benchmark for deep research agents using static research sandbox corpora paired with authentic user tasks, measuring multimodal report generation across dimensions including information recall, factual accuracy, and citation coverage. It addresses the challenge of evaluating long-horizon research tasks by simulating open-web complexity while remaining fully verifiable.
AIMO 3 competition analysis across 50 IMO problems shows model capability dominates inference-time optimization; diverse prompting strategies fail to beat high-temperature sampling on strong models. The 8-point capability gap persists across all prompt interventions; only verifier-based selection could close remaining selection loss.
Community report of reproducibility crisis: 4 out of 7 recent ML papers failed to reproduce claimed results, with 2 having unresolved GitHub issues. Highlights growing concerns about research quality and verification standards. Reflects broader questions about publication incentives and validation rigor in current ML research.
ICLR 2025 Oral paper evaluated SQL code generation using natural language similarity metrics instead of execution-based validation, yielding ~20% false positive rate in authors' own testing. Community questions appropriateness of Oral designation given fundamental evaluation methodology flaw. Highlights peer review challenges in code generation benchmarks.
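The false-positive mechanism is easy to demonstrate: a one-character change to a SQL predicate leaves the text nearly identical (so a similarity metric scores it as correct) while returning different rows. A self-contained sketch using a hypothetical one-column table and difflib as a stand-in for the paper's similarity metric:

```python
import sqlite3
from difflib import SequenceMatcher

def run(query):
    """Execute a query against a toy in-memory table t(a) = {1, 2, 3}."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE t (a INTEGER)")
    con.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])
    return sorted(con.execute(query).fetchall())

gold = "SELECT a FROM t WHERE a > 1"
pred = "SELECT a FROM t WHERE a >= 1"

exec_match = run(gold) == run(pred)                   # False: extra row returned
text_sim = SequenceMatcher(None, gold, pred).ratio()  # near 1.0: "looks" correct
```

Execution-based validation catches the error that text similarity waves through, which is the methodological point the community raised.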
🧠 DeepMind 6d ago
★ High Signal
Google's Gemini 3 Deep Think achieves 48.4% on Humanity's Last Exam and 84.6% on ARC-AGI-2, now available to Ultra subscribers and select enterprise users. Early adopters use it to identify mathematical paper errors missed by peer review and optimize semiconductor crystal growth. Novel application of specialized reasoning mode to scientific and engineering problems beyond standard benchmarks.
MMLU and other benchmarks that dominated 2024 are now saturated (>95% on frontier models), relegated to "floor checks" rather than frontier separators. The frontier is now decided by HLE, GPQA, MMLU-Pro, SWE-bench Pro, Terminal-Bench 2.0, and BrowseComp for agentic tasks. Benchmark choice matters more than ever as once-standard academic suites lose the power to separate top models.
Proactive Agent Research Environment simulates active users to evaluate AI assistants that anticipate needs and initiate actions rather than just responding to queries. Existing benchmarks lack realistic user simulation for testing proactive behaviors like timely suggestions and anticipatory information gathering. Bridges the gap between passive query-response evaluation and true assistant capabilities needed in high-stakes domains.
Production testing reveals Gemma 12B and Qwen 3.5 35B return correct answers in unparseable formats despite explicit instructions—Python instead of CSV, Markdown instead of CSV. Format compliance is independent capability missing from all major benchmarks (SWE-bench, Aider, LiveBench, SEAL), critical gap for production pipelines where consumers are parsers not humans. Smaller models fundamentally lack instruction-following precision for machine-readable output.
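A production pipeline can enforce the missing capability with a strict validator that rejects non-CSV output before it reaches downstream consumers. A minimal sketch of the kinds of checks implied above (the rejection rules here are illustrative, not an exhaustive guard):

```python
import csv
import io

def strict_csv(text, expected_cols):
    """Parse model output as plain CSV, or return None if malformed.

    Rejects the failure modes described above: Markdown code fences or
    pipe tables wrapped around an otherwise-correct answer, and rows
    with the wrong column count.
    """
    if "```" in text or text.lstrip().startswith("|"):
        return None
    rows = list(csv.reader(io.StringIO(text.strip())))
    if not rows or any(len(r) != expected_cols for r in rows):
        return None
    return rows
```

Wiring a validator like this into the evaluation loop would make format compliance measurable, rather than an invisible gap between benchmark scores and production behavior.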
Interspeech 2026 challenge shifts audio AI evaluation from result-oriented to process-oriented reasoning quality using instance-level rubric-based evaluation. Champion agent integrated 40+ specialized audio tools achieving 69.83% Rubrics score. Emphasizes transparent reasoning over black-box performance metrics.