Papers

Style Outweighs Substance: Failure Modes of LLM Judges in Alignment Benchmarking

Benjamin Feuer, Micah Goldblum, Teresa Datta et al.

2025 ICLR

Trust or Escalate: LLM Judges with Provable Guarantees for Human Agreement

Jaehun Jung, Faeze Brahman, Yejin Choi

2025 ICLR

Judge the Judges: A Large-Scale Evaluation Study of Neural Language Models for Online Review Generation

Cristina Garbacea, Samuel Carton, Shiyan Yan et al.

2019 IJCNLP

Judging the Judges: A Systematic Study of Position Bias in LLM-as-a-Judge

Lin Shi, Chiyu Ma, Wenhua Liang et al.

2025 IJCNLP

Illusions of Relevance: Arbitrary Content Injection Attacks Deceive Retrievers, Rerankers, and LLM Judges

Manveer Singh Tamber, Jimmy Lin

2025 IJCNLP

Are LLM-Judges Robust to Expressions of Uncertainty? Investigating the effect of Epistemic Markers on LLM-based Evaluation

Dongryeol Lee, Yerin Hwang, Yongil Kim et al.

2025 NAACL

Becoming Experienced Judges: Selective Test-Time Learning for Evaluators

Seungyeon Jwa, Daechul Ahn, Reokyoung Kim et al.

2026 EACL

Balanced Accuracy: The Right Metric for Evaluating LLM Judges - Explained through Youden’s J statistic

Stephane Collot, Colin Fraser, Justin Zhao et al.

2026 EACL

Don’t Judge Code by Its Cover: Exploring Biases in LLM Judges for Code Evaluation

Jiwon Moon, Yerin Hwang, Dongryeol Lee et al.

2026 EACL

Who Judges the Judge? Evaluating LLM-as-a-Judge for French Medical open-ended QA

Ikram Belmadani, Oumaima El Khettari, Pacôme Constant dit Beaufils et al.

2026 EACL

Cross-Lingual Stability of LLM Judges Under Controlled Generation: Evidence from Finno-Ugric Languages

Isaac Chung, Linda Freienthal

2026 EACL

Illusions of Relevance: Arbitrary Content Injection Attacks Deceive Retrievers, Rerankers, and LLM Judges

Manveer Singh Tamber, Jimmy Lin

2025 AACL