Papers

Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges
Aman Singh Thakur, Kartik Choudhary, Venkat Srinik Ramayapally et al.
2025 ACL
2025 AACL
JuStRank: Benchmarking LLM Judges for System Ranking
Ariel Gera, Odellia Boni, Yotam Perlitz et al.
2025 ACL
LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks
Anna Bavaresco, Raffaella Bernardi, Leonardo Bertolazzi et al.
2025 ACL
Judging the Judges: Can Large Vision-Language Models Fairly Evaluate Chart Comprehension and Reasoning?
Md Tahmid Rahman Laskar, Mohammed Saidul Islam, Ridwan Mahbub et al.
2025 ACL
LLMs as Medical Safety Judges: Evaluating Alignment with Human Annotation in Patient-Facing QA
Yella Diekmann, Chase Fensore, Rodrigo Carrillo-Larco et al.
2025 ACL
From Calculation to Adjudication: Examining LLM Judges on Mathematical Reasoning Tasks
Andreas Stephan, Dawei Zhu, Matthias Aßenmacher et al.
2025 ACL
2025 COLING
2023 EMNLP
2025 EMNLP
Fooling the LVLM Judges: Visual Biases in LVLM-Based Evaluation
Yerin Hwang, Dongryeol Lee, Kyungmin Min et al.
2025 EMNLP
CourtReasoner: Can LLM Agents Reason Like Judges?
Sophia Simeng Han, Yoshiki Takashima, Shannon Zejiang Shen et al.
2025 EMNLP
Deploying Tiny LVLM Judges for Real-World Evaluation of Chart Models: Lessons Learned and Best Practices
Md Tahmid Rahman Laskar, Mohammed Saidul Islam, Ridwan Mahbub et al.
2025 EMNLP
Audio-Aware Large Language Models as Judges for Speaking Styles
Cheng-Han Chiang, Xiaofei Wang, Chung-Ching Lin et al.
2025 EMNLP
Can You Trick the Grader? Adversarial Persuasion of LLM Judges
Yerin Hwang, Dongryeol Lee, Taegwan Kang et al.
2025 EMNLP
JudgeLM: Fine-tuned Large Language Models are Scalable Judges
Lianghui Zhu, Xinggang Wang, Xinlong Wang
2025 ICLR
JudgeBench: A Benchmark for Evaluating LLM-Based Judges
Sijun Tan, Siyuan Zhuang, Kyle Montgomery et al.
2025 ICLR