Research Explorer
Evaluation
1654 directly classified papers
Papers per year
2005: 1
2006: 1
2007: 1
2008: 2
2009: 1
2010: 3
2011: 2
2012: 3
2013: 5
2014: 4
2015: 4
2016: 11
2017: 19
2018: 32
2019: 39
2020: 72
2021: 110
2022: 202
2023: 222
2024: 351
2025: 569
Papers
Bugs in the Data: How ImageNet Misrepresents Biodiversity
AAAI 2023
MM-SHAP: A Performance-agnostic Metric for Measuring Multimodal Contributions in Vision and Language Models & Tasks
ACL 2023
Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation
ACL 2023
Evaluating Open-Domain Question Answering in the Era of Large Language Models
ACL 2023
Morphological Inflection: A Reality Check
ACL 2023
Measuring the Instability of Fine-Tuning
ACL 2023
Are Machine Rationales (Not) Useful to Humans? Measuring and Improving Human Utility of Free-text Rationales
ACL 2023
What are the Desired Characteristics of Calibration Sets? Identifying Correlates on Long Form Scientific Summarization
ACL 2023
Understanding Factual Errors in Summarization: Errors, Summarizers, Datasets, Error Detectors
ACL 2023
What’s the Meaning of Superhuman Performance in Today’s NLU?
ACL 2023
Extrinsic Evaluation of Machine Translation Metrics
ACL 2023
EPIC: Multi-Perspective Annotation of a Corpus of Irony
ACL 2023
FERMAT: An Alternative to Accuracy for Numerical Reasoning
ACL 2023
Revisiting Commonsense Reasoning in Machine Translation: Training, Evaluation and Challenge
ACL 2023
A Holistic Approach to Reference-Free Evaluation of Machine Translation
ACL 2023
Counterfactual reasoning: Testing language models’ understanding of hypothetical scenarios
ACL 2023
Revisiting Automated Prompting: Are We Actually Doing Better?
ACL 2023
Mind the Gap between the Application Track and the Real World
ACL 2023
TeCS: A Dataset and Benchmark for Tense Consistency of Machine Translation
ACL 2023
Non-Repeatable Experiments and Non-Reproducible Results: The Reproducibility Crisis in Human Evaluation in NLP
ACL 2023
Evaluating the Factual Consistency of Large Language Models Through News Summarization
ACL 2023
Pulling Out All The Full Stops: Punctuation Sensitivity in Neural Machine Translation and Evaluation
ACL 2023
Correction of Errors in Preference Ratings from Automated Metrics for Text Generation
ACL 2023
RQUGE: Reference-Free Metric for Evaluating Question Generation by Answering the Question
ACL 2023
Uncovering Hidden Consequences of Pre-training Objectives in Sequence-to-Sequence Models
ACL 2023