Reliability — Topic

💬 Reddit 2d ago

Opus 4.7 is terrible, and Anthropic has completely dropped the ball

Users report degraded quality in Claude Opus 4.7 on complex reasoning tasks in theoretical math and physics, citing frequent downtime and performance drops compared to version 4.6. Multiple researchers say they are considering switching back to ChatGPT despite previously preferring Claude.

Models Reasoning Reliability

📑 arXiv 3d ago

Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations

Split conformal prediction applied to LLM-as-judge frameworks reveals reliability issues that aggregate metrics mask: 33-67% of documents show transitivity violations despite low average violation rates. Prediction set width serves as a per-instance reliability indicator and correlates strongly with actual uncertainty. The approach provides theoretically guaranteed coverage bounds for judge outputs.

Evaluation Reliability Uncertainty-quantification
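
As a rough illustration of the two ideas in the summary above, here is a minimal sketch and not the paper's implementation. It assumes a 1-5 judge scoring scale, absolute error against human labels as the nonconformity score, and pairwise preferences stored in a dict; all function names and the calibration data are hypothetical.

```python
import math
from itertools import permutations


def conformal_quantile(residuals, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration residuals."""
    n = len(residuals)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the order statistic to use
    if k > n:
        return float("inf")  # too few calibration points for this alpha
    return sorted(residuals)[k - 1]


def prediction_set(judge_score, q, labels=(1, 2, 3, 4, 5)):
    """All labels within distance q of the judge's score (assumed 1-5 scale)."""
    return [y for y in labels if abs(y - judge_score) <= q]


def count_transitivity_violations(prefers):
    """Count 3-cycles a > b > c > a; prefers[(a, b)] is True if a beats b."""
    items = sorted({x for pair in prefers for x in pair})
    cycles = 0
    for a, b, c in permutations(items, 3):
        if prefers.get((a, b)) and prefers.get((b, c)) and prefers.get((c, a)):
            cycles += 1
    return cycles // 3  # each cycle is seen once per rotation


# Made-up calibration data: judge scores vs. human labels.
calib_judge = [3, 4, 2, 5, 1, 3, 4, 2, 3]
calib_human = [3, 5, 2, 4, 1, 2, 4, 3, 3]
residuals = [abs(j - h) for j, h in zip(calib_judge, calib_human)]

q = conformal_quantile(residuals, alpha=0.2)
pred_set = prediction_set(4, q)  # a wider set signals a less reliable judgment

cyclic = {("A", "B"): True, ("B", "C"): True, ("C", "A"): True}
consistent = {("A", "B"): True, ("B", "C"): True, ("A", "C"): True}
```

Under the usual exchangeability assumption, sets built this way contain the human label with probability at least 1 - alpha, and the set width gives a per-instance signal. The cycle count is one simple way to surface per-document transitivity violations that an averaged rate would hide.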
