Evaluation

1654 directly classified papers

Papers

Bugs in the Data: How ImageNet Misrepresents Biodiversity AAAI 2023

MM-SHAP: A Performance-agnostic Metric for Measuring Multimodal Contributions in Vision and Language Models & Tasks ACL 2023

Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation ACL 2023

Evaluating Open-Domain Question Answering in the Era of Large Language Models ACL 2023

Morphological Inflection: A Reality Check ACL 2023

Measuring the Instability of Fine-Tuning ACL 2023

Are Machine Rationales (Not) Useful to Humans? Measuring and Improving Human Utility of Free-text Rationales ACL 2023

What are the Desired Characteristics of Calibration Sets? Identifying Correlates on Long Form Scientific Summarization ACL 2023

Understanding Factual Errors in Summarization: Errors, Summarizers, Datasets, Error Detectors ACL 2023

What’s the Meaning of Superhuman Performance in Today’s NLU? ACL 2023

Extrinsic Evaluation of Machine Translation Metrics ACL 2023

EPIC: Multi-Perspective Annotation of a Corpus of Irony ACL 2023

FERMAT: An Alternative to Accuracy for Numerical Reasoning ACL 2023

Revisiting Commonsense Reasoning in Machine Translation: Training, Evaluation and Challenge ACL 2023

A Holistic Approach to Reference-Free Evaluation of Machine Translation ACL 2023

Counterfactual reasoning: Testing language models’ understanding of hypothetical scenarios ACL 2023

Revisiting Automated Prompting: Are We Actually Doing Better? ACL 2023

Mind the Gap between the Application Track and the Real World ACL 2023

TeCS: A Dataset and Benchmark for Tense Consistency of Machine Translation ACL 2023

Non-Repeatable Experiments and Non-Reproducible Results: The Reproducibility Crisis in Human Evaluation in NLP ACL 2023

Evaluating the Factual Consistency of Large Language Models Through News Summarization ACL 2023

Pulling Out All The Full Stops: Punctuation Sensitivity in Neural Machine Translation and Evaluation ACL 2023

Correction of Errors in Preference Ratings from Automated Metrics for Text Generation ACL 2023

RQUGE: Reference-Free Metric for Evaluating Question Generation by Answering the Question ACL 2023

Uncovering Hidden Consequences of Pre-training Objectives in Sequence-to-Sequence Models ACL 2023