Research Explorer

Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges

Aman Singh Thakur, Kartik Choudhary, Venkat Srinik Ramayapally et al.

2025 ACL

Judging the Judges: A Systematic Study of Position Bias in LLM-as-a-Judge

Lin Shi, Chiyu Ma, Wenhua Liang et al.

2025 AACL

JuStRank: Benchmarking LLM Judges for System Ranking

Ariel Gera, Odellia Boni, Yotam Perlitz et al.

2025 ACL

Does Context Matter? ContextualJudgeBench for Evaluating LLM-based Judges in Contextual Settings

Austin Xu, Srijan Bansal, Yifei Ming et al.

2025 ACL

CalibraEval: Calibrating Prediction Distribution to Mitigate Selection Bias in LLMs-as-Judges

Haitao Li, Junjie Chen, Qingyao Ai et al.

2025 ACL

LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks

Anna Bavaresco, Raffaella Bernardi, Leonardo Bertolazzi et al.

2025 ACL

Judging the Judges: Can Large Vision-Language Models Fairly Evaluate Chart Comprehension and Reasoning?

Md Tahmid Rahman Laskar, Mohammed Saidul Islam, Ridwan Mahbub et al.

2025 ACL

LLMs as Medical Safety Judges: Evaluating Alignment with Human Annotation in Patient-Facing QA

Yella Diekmann, Chase Fensore, Rodrigo Carrillo-Larco et al.

2025 ACL

From Calculation to Adjudication: Examining LLM Judges on Mathematical Reasoning Tasks

Andreas Stephan, Dawei Zhu, Matthias Aßenmacher et al.

2025 ACL

Ask Me Like I’m Human: LLM-based Evaluation with For-Human Instructions Correlates Better with Human Evaluations than Human Judges

Rudali Huidrom, Anya Belz

2025 ACL

CodeJudge-Eval: Can Large Language Models be Good Judges in Code Understanding?

Yuwei Zhao, Ziyang Luo, Yuchen Tian et al.

2025 COLING

Multi-Layered Evaluation Using a Fusion of Metrics and LLMs as Judges in Open-Domain Question Answering

Rashin Rahnamoun, Mehrnoush Shamsfard

2025 COLING

Judge the Judges: A Large-Scale Evaluation Study of Neural Language Models for Online Review Generation

Cristina Garbacea, Samuel Carton, Shiyan Yan et al.

2019 EMNLP

Are Large Language Models Reliable Judges? A Study on the Factuality Evaluation Capabilities of LLMs

Xue-Yong Fu, Md Tahmid Rahman Laskar, Cheng Chen et al.

2023 EMNLP

Debatable Intelligence: Benchmarking LLM Judges via Debate Speech Evaluation

Noy Sternlicht, Ariel Gera, Roy Bar-Haim et al.

2025 EMNLP

Fooling the LVLM Judges: Visual Biases in LVLM-Based Evaluation

Yerin Hwang, Dongryeol Lee, Kyungmin Min et al.

2025 EMNLP

CourtReasoner: Can LLM Agents Reason Like Judges?

Sophia Simeng Han, Yoshiki Takashima, Shannon Zejiang Shen et al.

2025 EMNLP

Deploying Tiny LVLM Judges for Real-World Evaluation of Chart Models: Lessons Learned and Best Practices

Md Tahmid Rahman Laskar, Mohammed Saidul Islam, Ridwan Mahbub et al.

2025 EMNLP

Audio-Aware Large Language Models as Judges for Speaking Styles

Cheng-Han Chiang, Xiaofei Wang, Chung-Ching Lin et al.

2025 EMNLP

Can You Trick the Grader? Adversarial Persuasion of LLM Judges

Yerin Hwang, Dongryeol Lee, Taegwan Kang et al.

2025 EMNLP

Curse of Knowledge: Your Guidance and Provided Knowledge are biasing LLM Judges in Complex Evaluation

Weiyuan Li, Xintao Wang, Siyu Yuan et al.

2025 EMNLP

Reference-Guided Verdict: LLMs-as-Judges in Automatic Evaluation of Free-Form QA

Sher Badshah, Hassan Sajjad

2025 EMNLP

ProJudge: A Multi-Modal Multi-Discipline Benchmark and Instruction-Tuning Dataset for MLLM-based Process Judges

Jiaxin Ai, Pengfei Zhou, Zhaopan Xu et al.

2025 ICCV

JudgeLM: Fine-tuned Large Language Models are Scalable Judges

Lianghui Zhu, Xinggang Wang, Xinlong Wang

2025 ICLR

JudgeBench: A Benchmark for Evaluating LLM-Based Judges

Sijun Tan, Siyuan Zhuang, Kyle Montgomery et al.

2025 ICLR
