Reliability 2 items

📑 arXiv 3d ago

Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations

Applying split conformal prediction to LLM-as-judge frameworks exposes reliability problems that aggregate metrics mask: 33-67% of documents show transitivity violations even though the average violation rate is low. The width of each prediction set serves as a per-instance reliability indicator that correlates strongly with actual uncertainty, and the approach provides theoretically guaranteed coverage bounds for judge outputs.
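
To make the two diagnostics concrete, here is a minimal sketch of how they could be computed. It is not the paper's implementation: the 1-5 rating scale, the absolute-residual nonconformity score, the use of human reference scores for calibration, and all function names are assumptions chosen for illustration. Under exchangeability of calibration and test items, split conformal prediction built this way covers the reference score with probability at least 1 - alpha.

```python
import math
from itertools import combinations


def conformal_quantile(residuals, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration residuals."""
    n = len(residuals)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed order statistic
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(residuals)[rank - 1]


def prediction_set(judge_score, q, labels=(1, 2, 3, 4, 5)):
    """All labels within q of the judge's score (assumed 1-5 scale)."""
    return [y for y in labels if abs(y - judge_score) <= q]


def has_transitivity_violation(wins, items):
    """True if any triple of items forms a directed 3-cycle.

    `wins` is a set of (winner, loser) pairs from pairwise judge comparisons.
    """
    for a, b, c in combinations(items, 3):
        forward = (a, b) in wins and (b, c) in wins and (c, a) in wins
        backward = (a, c) in wins and (c, b) in wins and (b, a) in wins
        if forward or backward:
            return True
    return False


# Hypothetical calibration data: judge scores paired with human reference scores.
judge_cal = [4, 3, 5, 2, 4, 3, 5]
human_cal = [4, 4, 5, 2, 3, 3, 3]
residuals = [abs(j - h) for j, h in zip(judge_cal, human_cal)]

q = conformal_quantile(residuals, alpha=0.25)

# Prediction set for a new judge score of 4; its size is the width.
new_set = prediction_set(4, q)
width = len(new_set)

# Hypothetical pairwise preferences for one document's candidate outputs.
cyclic = {("A", "B"), ("B", "C"), ("C", "A")}
consistent = {("A", "B"), ("B", "C"), ("A", "C")}
cyclic_flag = has_transitivity_violation(cyclic, ["A", "B", "C"])
consistent_flag = has_transitivity_violation(consistent, ["A", "B", "C"])
```

In this sketch every test item shares the same threshold `q`, so the width only varies near the ends of the scale. A per-instance indicator like the one described above would need a nonconformity score that adapts to each input, for example one based on the judge's own score distribution. Counting the documents for which `has_transitivity_violation` returns true gives the per-document rate, which can be high even when few individual triples are cyclic.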