Self-hostable, Apache 2.0-licensed platform covering the full LLM application observability and improvement loop: tracing, evals, simulations, datasets, gateway, and guardrails in one stack. Targets teams who want an integrated alternative to stitching together Langfuse, LangSmith, and separate guardrail layers. Open-source with enterprise-grade feature breadth.
Side-by-side comparison showing GPT Image 2 struggles with photorealistic nature scenes, producing a recognizable artifacting pattern absent in its predecessor. Three images from the same prompt illustrate the regression, flagging a quality tradeoff in the new model for natural/outdoor imagery.
Investigates how prompt optimization and judge choice interact in LLM-as-a-Judge evaluations for legal QA on the LEXam benchmark, using ProTeGi optimization with Qwen3-32B and DeepSeek-V3 as judges. Lenient judge feedback yields larger and more consistent gains than strict feedback, and prompts optimized with lenient judges transfer better across judge models. Results highlight that judge disposition is a significant, underappreciated variable in automated evaluation pipelines.
A practitioner's post-mortem on building fully autonomous multi-agent systems for clients: unpredictable recursive loops, runaway API costs ($200 in 2 hours), and zero client tolerance for black-box failures pushed the author toward human-in-the-loop, deterministic workflows instead. The core argument — autonomy is a liability for most business use cases — is grounded in specific failure modes rather than theory.
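The guardrails this post argues for can be sketched in a few lines. The following is a minimal illustration, not the author's implementation: `RunGuard`, `run_workflow`, the per-step cost estimates, and the `approve` callback are all hypothetical names introduced here. It shows a fixed, ordered list of steps (no recursion), a hard cap on step count and spend, and a human approval gate before each step.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a run would exceed its step or cost cap."""


class RunGuard:
    """Hypothetical guard that tracks steps and estimated spend for one run."""

    def __init__(self, max_steps, max_cost_usd):
        self.max_steps = max_steps
        self.max_cost_usd = max_cost_usd
        self.steps = 0
        self.cost_usd = 0.0

    def charge(self, cost_usd):
        # Count the step and its estimated cost, then fail fast if a cap is hit.
        self.steps += 1
        self.cost_usd += cost_usd
        if self.steps > self.max_steps:
            raise BudgetExceeded(f"step cap {self.max_steps} exceeded")
        if self.cost_usd > self.max_cost_usd:
            raise BudgetExceeded(f"cost cap ${self.max_cost_usd:.2f} exceeded")


def run_workflow(steps, guard, approve):
    """Run (name, action, estimated_cost_usd) steps in a fixed order.

    `approve` is the human-in-the-loop hook: it receives the step name and
    returns True to proceed. Rejected steps are skipped and never charged.
    """
    results = []
    for name, action, estimated_cost_usd in steps:
        if not approve(name):
            results.append((name, "skipped"))
            continue
        # Charge before running, so an over-budget step never executes.
        guard.charge(estimated_cost_usd)
        results.append((name, action()))
    return results
```

In a real system `approve` might block on a reviewer's decision, and the cost estimate would come from the provider's usage accounting; both are assumptions of this sketch.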
Open-source test harness for text-to-CAD generation, providing scaffolding to prompt LLMs and evaluate their CAD model outputs. Targets the emerging niche of AI-driven parametric and 3D design automation.
Identifies 'preference leakage': when the same LLM generates synthetic training data and serves as the judge, it systematically inflates scores for outputs matching its own generation style, biasing leaderboard rankings even when models perform similarly. Demonstrated empirically across several evaluation pipelines. A concrete warning against self-referential LLM-as-a-judge setups.
SkillLearnBench is the first benchmark for continual skill learning in LLM agents, covering 20 verified tasks across 15 sub-domains with evaluation at three levels: skill quality, execution trajectory, and task outcome. Tested methods include one-shot learning, self/teacher feedback, and skill-creator approaches; all improve over the no-skill baseline but none achieves consistent gains across domains. Highlights that automatic skill acquisition for agents remains an unsolved problem despite recent progress.
💬 Reddit Apr 16
⭐ Editor's Pick
User benchmarks show Claude Opus 4.7 scoring 59.2% vs Opus 4.6's 91.9% on the MRCR v2 8-needle 256K context benchmark, a sharp context-retention regression. Compounding the issue, a tokenizer change reportedly causes Opus 4.7 to consume ~1.35x as many tokens as Opus 4.6 and ~2x as many as competing proprietary models, effectively raising costs by a reported ~50% for equivalent workloads. If the benchmark numbers hold, this is a meaningful quality-cost tradeoff moving in the wrong direction.
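For readers checking the cost claim, the arithmetic can be sketched as below. This is a rough illustration that assumes cost scales linearly with token count and that the per-token price is unchanged between versions (`price_ratio=1.0`). Neither assumption comes from the post, and the function names are hypothetical.

```python
def cost_multiplier(token_multiplier, price_ratio=1.0):
    """Relative cost for the same workload, assuming cost = tokens * price."""
    return token_multiplier * price_ratio


def percent_increase(multiplier):
    """Convert a multiplier such as 1.35 into a percentage increase."""
    return (multiplier - 1.0) * 100.0


# Token multiplier reported in the post for Opus 4.7 vs Opus 4.6.
reported_token_multiplier = 1.35
increase = percent_increase(cost_multiplier(reported_token_multiplier))
print(f"Cost increase at equal per-token price: {increase:.0f}%")
```

Under these assumptions a 1.35x token count alone gives roughly a 35% increase. The ~50% figure would correspond to a multiplier near 1.5, so the two reported numbers are best read as separate estimates from the post.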
A 668-point HN post documents that swapping the evaluation harness — without changing any model — improved measured coding performance across 15 LLMs in an afternoon. Directly implicates harness sensitivity as a major confounder in coding benchmark results. High-signal for anyone designing or interpreting code evals.