📑 arXiv 3d ago
Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations
Split conformal prediction applied to LLM-as-judge frameworks reveals reliability issues masked by aggregate metrics: 33-67% of documents show transitivity violations despite low average rates, and prediction set width serves as a per-instance reliability indicator with strong correlation to actual uncertainty. The approach provides theoretically-guaranteed coverage bounds for judge outputs.