Evaluation

1654 directly classified papers

Papers

CogLM: Tracking Cognitive Development of Large Language Models NAACL 2025

LLMs Are Biased Towards Output Formats! Systematically Evaluating and Mitigating Output Format Bias of LLMs NAACL 2025

Is your benchmark truly adversarial? AdvScore: Evaluating Human-Grounded Adversarialness NAACL 2025

Familiarity: Better Evaluation of Zero-Shot Named Entity Recognition by Quantifying Label Shifts in Synthetic Training Data NAACL 2025

AnaScore: Understanding Semantic Parallelism in Proportional Analogies NAACL 2025

ITALIC: An Italian Culture-Aware Natural Language Benchmark NAACL 2025

NormAd: A Framework for Measuring the Cultural Adaptability of Large Language Models NAACL 2025

Benchmarking Failures in Tool-Augmented Language Models NAACL 2025

Who Relies More on World Knowledge and Bias for Syntactic Ambiguity Resolution: Humans or LLMs? NAACL 2025

Beyond Benchmarks: Building a Richer Cross-Document Event Coreference Dataset with Decontextualization NAACL 2025

Are We Done with MMLU? NAACL 2025

Multilingual Blending: Large Language Model Safety Alignment Evaluation with Language Mixture NAACL 2025

FLEX: A Benchmark for Evaluating Robustness of Fairness in Large Language Models NAACL 2025

MojoBench: Language Modeling and Benchmarks for Mojo NAACL 2025

Re-evaluating Automatic LLM System Ranking for Alignment with Human Preference NAACL 2025

Contextual Metric Meta-Evaluation by Measuring Local Metric Accuracy NAACL 2025

Where is this coming from? Making groundedness count in the evaluation of Document VQA models NAACL 2025

UCFE: A User-Centric Financial Expertise Benchmark for Large Language Models NAACL 2025

SeaExam and SeaBench: Benchmarking LLMs with Local Multilingual Questions in Southeast Asia NAACL 2025

As easy as PIE: understanding when pruning causes language models to disagree NAACL 2025

LOFT: Scalable and More Realistic Long-Context Evaluation NAACL 2025

Aligning Black-box Language Models with Human Judgments NAACL 2025

WorldMedQA-V: a multilingual, multimodal medical examination dataset for multimodal language models evaluation NAACL 2025

SEEval: Advancing LLM Text Evaluation Efficiency and Accuracy through Self-Explanation Prompting NAACL 2025

C2LEVA: Toward Comprehensive and Contamination-Free Language Model Evaluation ACL 2025