language model evaluation

221 papers

Also known as

LLM EVALUATION, LME, LLM, LMS

Co-occurring keywords

large language model (12755), language model (4573), benchmark evaluation (1539), multilingual nlp (1423), natural language understanding (845), text generation (2903), low-resource language (2234), evaluation benchmark (250), text classification (6776), question answering (2904)

Papers

EIFBENCH: Extremely Complex Instruction Following Benchmark for Large Language Models EMNLP 2025

Blind Men and the Elephant: Diverse Perspectives on Gender Stereotypes in Benchmark Datasets EMNLP 2025

Can Large Language Models Outperform Non-Experts in Poetry Evaluation? A Comparative Study Using the Consensual Assessment Technique EMNLP 2025

BehaviorBox: Automated Discovery of Fine-Grained Performance Differences Between Language Models ACL 2025

Open Ko-LLM Leaderboard2: Bridging Foundational and Practical Evaluation for Korean LLMs NAACL 2025

DUTJBD at SemEval-2025 Task 3: A Range of Approaches for Predicting Hallucination Generation in Models SEMEVAL 2025

Evaluating Dialect Robustness of Language Models via Conversation Understanding COLING 2025

Evaluating Numeracy of Language Models as a Natural Language Inference Task NAACL 2025

A Fair Comparison without Translationese: English vs. Target-language Instructions for Multilingual LLMs NAACL 2025

Learning to Align Multi-Faceted Evaluation: A Unified and Robust Framework ACL 2025

LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks ACL 2025

OLMES: A Standard for Language Model Evaluations NAACL 2025

Hard Emotion Test Evaluation Sets for Language Models NAACL 2025

CoreEval: Automatically Building Contamination-Resilient Datasets with Real-World Knowledge toward Reliable LLM Evaluation ACL 2025

neDIOM: Dataset and Analysis of Nepali Idioms COLING 2025

DharmaBench: Evaluating Language Models on Buddhist Texts in Sanskrit and Tibetan IJCNLP 2025

Hallucinations in Code Change to Natural Language Generation: Prevalence and Evaluation of Detection Metrics IJCNLP 2025

Measuring the Effect of Disfluency in Multilingual Knowledge Probing Benchmarks EMNLP 2025

BLiSS: Evaluating Bilingual Learner Competence in Second Language Small Language Models EMNLP 2025

EnDive: A Cross-Dialect Benchmark for Fairness and Performance in Large Language Models EMNLP 2025

Meaning Beyond Truth Conditions: Evaluating Discourse Level Understanding via Anaphora Accessibility ACL 2025

Adaptively profiling models with task elicitation EMNLP 2025

UNCLE: Benchmarking Uncertainty Expressions in Long-Form Generation EMNLP 2025

Cultural Alignment in Large Language Models: An Explanatory Analysis Based on Hofstede’s Cultural Dimensions COLING 2025

What am I missing here?: Evaluating Large Language Models for Masked Sentence Prediction IJCNLP 2025