Research Explorer
Evaluation
1654 directly classified papers
Papers per year
2005: 1
2006: 1
2007: 1
2008: 2
2009: 1
2010: 3
2011: 2
2012: 3
2013: 5
2014: 4
2015: 4
2016: 11
2017: 19
2018: 32
2019: 39
2020: 72
2021: 110
2022: 202
2023: 222
2024: 351
2025: 569
Papers
CogLM: Tracking Cognitive Development of Large Language Models
NAACL 2025
LLMs Are Biased Towards Output Formats! Systematically Evaluating and Mitigating Output Format Bias of LLMs
NAACL 2025
Is your benchmark truly adversarial? AdvScore: Evaluating Human-Grounded Adversarialness
NAACL 2025
Familiarity: Better Evaluation of Zero-Shot Named Entity Recognition by Quantifying Label Shifts in Synthetic Training Data
NAACL 2025
AnaScore: Understanding Semantic Parallelism in Proportional Analogies
NAACL 2025
ITALIC: An Italian Culture-Aware Natural Language Benchmark
NAACL 2025
NormAd: A Framework for Measuring the Cultural Adaptability of Large Language Models
NAACL 2025
Benchmarking Failures in Tool-Augmented Language Models
NAACL 2025
Who Relies More on World Knowledge and Bias for Syntactic Ambiguity Resolution: Humans or LLMs?
NAACL 2025
Beyond Benchmarks: Building a Richer Cross-Document Event Coreference Dataset with Decontextualization
NAACL 2025
Are We Done with MMLU?
NAACL 2025
Multilingual Blending: Large Language Model Safety Alignment Evaluation with Language Mixture
NAACL 2025
FLEX: A Benchmark for Evaluating Robustness of Fairness in Large Language Models
NAACL 2025
MojoBench: Language Modeling and Benchmarks for Mojo
NAACL 2025
Re-evaluating Automatic LLM System Ranking for Alignment with Human Preference
NAACL 2025
Contextual Metric Meta-Evaluation by Measuring Local Metric Accuracy
NAACL 2025
Where is this coming from? Making groundedness count in the evaluation of Document VQA models
NAACL 2025
UCFE: A User-Centric Financial Expertise Benchmark for Large Language Models
NAACL 2025
SeaExam and SeaBench: Benchmarking LLMs with Local Multilingual Questions in Southeast Asia
NAACL 2025
As easy as PIE: understanding when pruning causes language models to disagree
NAACL 2025
LOFT: Scalable and More Realistic Long-Context Evaluation
NAACL 2025
Aligning Black-box Language Models with Human Judgments
NAACL 2025
WorldMedQA-V: a multilingual, multimodal medical examination dataset for multimodal language models evaluation
NAACL 2025
SEEval: Advancing LLM Text Evaluation Efficiency and Accuracy through Self-Explanation Prompting
NAACL 2025
C2LEVA: Toward Comprehensive and Contamination-Free Language Model Evaluation
ACL 2025