Papers
37 papers found
Style Outweighs Substance: Failure Modes of LLM Judges in Alignment Benchmarking
Benjamin Feuer, Micah Goldblum, Teresa Datta et al.
Trust or Escalate: LLM Judges with Provable Guarantees for Human Agreement
Jaehun Jung, Faeze Brahman, Yejin Choi
Judge the Judges: A Large-Scale Evaluation Study of Neural Language Models for Online Review Generation
Cristina Garbacea, Samuel Carton, Shiyan Yan et al.
Judging the Judges: A Systematic Study of Position Bias in LLM-as-a-Judge
Lin Shi, Chiyu Ma, Wenhua Liang et al.
Illusions of Relevance: Arbitrary Content Injection Attacks Deceive Retrievers, Rerankers, and LLM Judges
Manveer Singh Tamber, Jimmy Lin
Are LLM-Judges Robust to Expressions of Uncertainty? Investigating the effect of Epistemic Markers on LLM-based Evaluation
Dongryeol Lee, Yerin Hwang, Yongil Kim et al.
Becoming Experienced Judges: Selective Test-Time Learning for Evaluators
Seungyeon Jwa, Daechul Ahn, Reokyoung Kim et al.
Balanced Accuracy: The Right Metric for Evaluating LLM Judges - Explained through Youden’s J statistic
Stephane Collot, Colin Fraser, Justin Zhao et al.
Don’t Judge Code by Its Cover: Exploring Biases in LLM Judges for Code Evaluation
Jiwon Moon, Yerin Hwang, Dongryeol Lee et al.
Who Judges the Judge? Evaluating LLM-as-a-Judge for French Medical open-ended QA
Ikram Belmadani, Oumaima El Khettari, Pacôme Constant dit Beaufils et al.
Cross-Lingual Stability of LLM Judges Under Controlled Generation: Evidence from Finno-Ugric Languages
Isaac Chung, Linda Freienthal