Research Explorer
Evaluation
1654 directly classified papers
Papers per year
2005: 1
2006: 1
2007: 1
2008: 2
2009: 1
2010: 3
2011: 2
2012: 3
2013: 5
2014: 4
2015: 4
2016: 11
2017: 19
2018: 32
2019: 39
2020: 72
2021: 110
2022: 202
2023: 222
2024: 351
2025: 569
Papers
Familiarity: Better Evaluation of Zero-Shot Named Entity Recognition by Quantifying Label Shifts in Synthetic Training Data
NAACL 2025
CHATTER: A character-attribution dataset for narrative understanding
NAACL 2025
FlagEvalMM: A Flexible Framework for Comprehensive Multimodal Model Evaluation
ACL 2025
Benchmarking Failures in Tool-Augmented Language Models
NAACL 2025
As easy as PIE: understanding when pruning causes language models to disagree
NAACL 2025
Enhancing Reward Models for High-quality Image Generation: Beyond Text-Image Alignment
ICCV 2025
SHARP: Unlocking Interactive Hallucination via Stance Transfer in Role-Playing LLMs
ACL 2025
Who Relies More on World Knowledge and Bias for Syntactic Ambiguity Resolution: Humans or LLMs?
NAACL 2025
Beyond Benchmarks: Building a Richer Cross-Document Event Coreference Dataset with Decontextualization
NAACL 2025
Gradient Short-Circuit: Efficient Out-of-Distribution Detection via Feature Intervention
ICCV 2025
D-GEN: Automatic Distractor Generation and Evaluation for Reliable Assessment of Generative Models
ACL 2025
FLEX: A Benchmark for Evaluating Robustness of Fairness in Large Language Models
NAACL 2025
Re-evaluating Automatic LLM System Ranking for Alignment with Human Preference
NAACL 2025
Contextual Metric Meta-Evaluation by Measuring Local Metric Accuracy
NAACL 2025
MALAMUTE: A Multilingual, Highly-granular, Template-free, Education-based Probing Dataset
ACL 2025
UCFE: A User-Centric Financial Expertise Benchmark for Large Language Models
NAACL 2025
Where is this coming from? Making groundedness count in the evaluation of Document VQA models
NAACL 2025
DUTJBD at SemEval-2025 Task 3: A Range of Approaches for Predicting Hallucination Generation in Models
SEMEVAL 2025
Random Splitting Negatively Impacts NER Evaluation: Quantifying and Eliminating the Overestimation of NER Performance
ACL 2025
WorldMedQA-V: a multilingual, multimodal medical examination dataset for multimodal language models evaluation
NAACL 2025
Predicting Fine-tuned Performance on Larger Datasets Before Creating Them
COLING 2025
Using Linguistic Entrainment to Evaluate Large Language Models for Use in Cognitive Behavioral Therapy
NAACL 2025
A Conformal Risk Control Framework for Granular Word Assessment and Uncertainty Calibration of CLIPScore Quality Estimates
ACL 2025
Citation Drift: Measuring Reference Stability in Multi-Turn LLM Conversations
IJCNLP 2025
7 Points to Tsinghua but 10 Points to 清华? Assessing Large Language Models in Agentic Multilingual National Bias
ACL 2025