🍡 feedmeAI

Everything Evaluation

🐙 GitHub Apr 23

future-agi/future-agi: Open-source, end-to-end platform for evaluating, observing, and improving LLM and AI agent applications. Tracing · Evals · Simulations · Datasets · Gateway · Guardrails. Self-hostable. Apache 2.0.

Self-hostable, Apache 2.0-licensed platform covering the full observability and improvement loop for LLM applications: tracing, evals, simulations, datasets, gateway, and guardrails in one open-source stack. Targets teams that want an integrated alternative to stitching together Langfuse, LangSmith, and separate guardrail layers.

📑 arXiv Apr 22

Exploiting LLM-as-a-Judge Disposition on Free Text Legal QA via Prompt Optimization

Investigates how prompt optimization and judge choice interact in LLM-as-a-Judge evaluations for legal QA on the LEXam benchmark, using ProTeGi optimization with Qwen3-32B and DeepSeek-V3 as judges. Lenient judge feedback yields larger and more consistent gains than strict feedback, and prompts optimized with lenient judges transfer better across judge models. Results highlight that judge disposition is a significant, underappreciated variable in automated evaluation pipelines.

💬 Reddit Apr 22

Why I Stopped Building Autonomous Agents for Clients

A practitioner's post-mortem on building fully autonomous multi-agent systems for clients: unpredictable recursive loops, runaway API costs ($200 in 2 hours), and zero client tolerance for black-box failures pushed the author toward human-in-the-loop, deterministic workflows instead. The core argument, that autonomy is a liability for most business use cases, is grounded in specific failure modes rather than theory.
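
The failure modes named in this post (loops that never terminate, spend that runs away, no human checkpoint) map onto three simple guards. The following is a minimal sketch of that idea, not the author's implementation: `run_with_guards`, `StepResult`, the default limits, and the `approve` hook are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class StepResult:
    cost_usd: float  # spend attributed to this step (hypothetical accounting)
    done: bool       # True when the task reports completion


def run_with_guards(
    step: Callable[[int], StepResult],
    max_steps: int = 20,          # assumed default, tune per workload
    max_cost_usd: float = 5.0,    # assumed default, tune per workload
    approve: Optional[Callable[[int, float], bool]] = None,
) -> str:
    """Run `step` until it finishes, or a step, budget, or human limit trips.

    `approve` is an optional human-in-the-loop hook. It is called before
    every step with the step index and the spend so far, and can veto
    further work by returning False.
    """
    spent = 0.0
    for i in range(max_steps):
        if approve is not None and not approve(i, spent):
            return "stopped: human declined"
        result = step(i)
        spent += result.cost_usd
        if result.done:
            return "done"
        if spent >= max_cost_usd:
            return "stopped: budget exceeded"
    return "stopped: step limit reached"


# A step that never finishes and costs $1 each time, standing in for a
# recursive agent loop.
looping = lambda i: StepResult(cost_usd=1.0, done=False)
print(run_with_guards(looping, max_steps=100, max_cost_usd=3.0))
```

The budget check runs after each step, so the loop can overshoot the cap by at most one step's cost. A stricter variant would estimate the cost before calling `step`.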

📑 arXiv Apr 22

Preference Leakage: A Contamination Problem in LLM-as-a-Judge (ICLR 2026)

Identifies 'preference leakage': when the same LLM generates synthetic training data and serves as the judge, it systematically inflates scores for outputs matching its own generation style, biasing leaderboard rankings even when models perform similarly. Demonstrated empirically across several evaluation pipelines. A concrete warning against self-referential LLM-as-a-judge setups.
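
One cheap precaution that follows from this summary is to flag evaluation configs where the judge and the synthetic-data generator are the same model. The sketch below does only that, using exact name matching. It is not the paper's detection method, and the `pipelines` structure, its keys, and the model names are hypothetical.

```python
def leakage_risks(pipelines):
    """Return names of pipelines whose judge also generated the training data.

    Compares model names case-insensitively. This catches only the exact
    same-model case described in the summary above.
    """
    risky = []
    for p in pipelines:
        generator = p["data_generator"].strip().lower()
        judge = p["judge"].strip().lower()
        if generator == judge:
            risky.append(p["name"])
    return risky


# Hypothetical pipeline configs for illustration.
pipelines = [
    {"name": "run-a", "data_generator": "model-x", "judge": "model-x"},
    {"name": "run-b", "data_generator": "model-x", "judge": "model-y"},
]
print(leakage_risks(pipelines))  # ['run-a']
```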

🤗 Hugging Face Apr 22

SkillLearnBench: Benchmarking Continual Learning Methods for Agent Skill Generation on Real-World Tasks

SkillLearnBench is presented as the first benchmark for continual skill learning in LLM agents, covering 20 verified tasks across 15 sub-domains with evaluation at three levels: skill quality, execution trajectory, and task outcome. Tested methods include one-shot learning, self-feedback, teacher feedback, and skill-creator approaches; all improve over the no-skill baseline, but none achieves consistent gains across domains. Highlights that automatic skill acquisition for agents remains an unsolved problem despite recent progress.

💬 Reddit Apr 16
⭐ Editor's Pick

Opus 4.7 is 50% more expensive with context regression?!

User benchmarks show Claude Opus 4.7 scoring 59.2% vs Opus 4.6's 91.9% on the MRCR v2 8-needle 256K context benchmark, a sharp regression in context retention. Compounding the issue, a tokenizer change reportedly causes Opus 4.7 to consume ~1.35x as many tokens as Opus 4.6 and ~2x as many as competing proprietary models, which the post says effectively raises costs ~50% for equivalent workloads. If the benchmark numbers hold, this is a meaningful quality-cost tradeoff moving in the wrong direction.
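
The cost arithmetic can be checked with a small sketch. It assumes cost is simply token count times per-token price and that the per-token price is unchanged between versions; neither assumption is confirmed by this summary, and `relative_cost` is a hypothetical helper.

```python
def relative_cost(token_multiplier: float, price_multiplier: float = 1.0) -> float:
    """Cost ratio (new / old) when cost = tokens * price per token."""
    return token_multiplier * price_multiplier


# Reported tokenizer effect alone, with per-token price assumed unchanged.
tokens_only = relative_cost(1.35)
print(f"{tokens_only - 1.0:.0%}")  # 35%

# Extra factor that would be needed to reach the post's ~50% figure.
residual = 1.5 / tokens_only
print(round(residual, 2))  # 1.11
```

Under these assumptions the tokenizer change alone accounts for about a 35% increase, so the ~50% in the post would need roughly another 1.11x from a source this summary does not specify.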