language model evaluation

221 papers

Also known as

LLM EVALUATION, LME, LLM, LMS

Co-occurring keywords

large language model (12755), language model (4573), benchmark evaluation (1539), multilingual nlp (1423), natural language understanding (845), text generation (2903), low-resource language (2234), evaluation benchmark (250), text classification (6776), question answering (2904)

Papers

Learning to Judge: LLMs Designing and Applying Evaluation Rubrics EACL 2026

When Do Language Models Endorse Limitations on Human Rights Principles? EACL 2026

Auto-PRE: An Automatic and Cost-Efficient Peer-Review Framework for Language Generation Evaluation AAAI 2026

Hey, wait a minute: on at-issue sensitivity in Language Models EACL 2026

Far Out: Evaluating Language Models on Slang in Australian and Indian English EACL 2026

CharBench: Evaluating the Role of Tokenization in Character-Level Tasks AAAI 2026

Can Large Language Models Unlock Novel Scientific Research Ideas? EMNLP 2025

Luna: A Lightweight Evaluation Model to Catch Language Model Hallucinations with High Accuracy and Low Cost COLING 2025

TounsiBench: Benchmarking Large Language Models for Tunisian Arabic EMNLP 2025

CoreEval: Automatically Building Contamination-Resilient Datasets with Real-World Knowledge toward Reliable LLM Evaluation ACL 2025

Marco-Bench-MIF: On Multilingual Instruction-Following Capability of Large Language Models ACL 2025

Not quite Sherlock Holmes: Language model predictions do not reliably differentiate impossible from improbable events ACL 2025

CoKe: Customizable Fine-Grained Story Evaluation via Chain-of-Keyword Rationalization ACL 2025

Compositionality and Event Retrieval in Complement Coercion: A Study of Language Models in a Low-resource Setting CONLL 2025

Measuring the Effect of Disfluency in Multilingual Knowledge Probing Benchmarks EMNLP 2025

Semantic Inversion, Identical Replies: Revisiting Negation Blindness in Large Language Models EMNLP 2025

LLMs Do Not See Age: Assessing Demographic Bias in Automated Systematic Review Synthesis IJCNLP 2025

VMLU Benchmarks: A comprehensive benchmark toolkit for Vietnamese LLMs ACL 2025

BehaviorBox: Automated Discovery of Fine-Grained Performance Differences Between Language Models ACL 2025

When Scale Meets Diversity: Evaluating Language Models on Fine-Grained Multilingual Claim Verification ACL 2025

OLMES: A Standard for Language Model Evaluations NAACL 2025

Does Data Contamination Detection Work (Well) for LLMs? A Survey and Evaluation on Detection Assumptions NAACL 2025

M-IFEval: Multilingual Instruction-Following Evaluation NAACL 2025

Revitalizing Saturated Benchmarks: A Weighted Metric Approach for Differentiating Large Language Model Performance NAACL 2025

YNU-HPCC at SemEval-2025 Task3: Leveraging Zero-Shot Learning for Hallucination Detection ACL 2025