Evaluation

1654 directly classified papers

Papers

How Well Do Large Language Models Perform on Faux Pas Tests? ACL 2023

Detecting Edit Failures In Large Language Models: An Improved Specificity Benchmark ACL 2023

Moving Beyond Downstream Task Accuracy for Information Retrieval Benchmarking ACL 2023

GLUE-X: Evaluating Natural Language Understanding Models from an Out-of-Distribution Generalization Perspective ACL 2023

Discovering Language Model Behaviors with Model-Written Evaluations ACL 2023

Rarely a problem? Language models exhibit inverse scaling in their predictions following few-type quantifiers ACL 2023

DEnsity: Open-domain Dialogue Evaluation Metric using Density Estimation ACL 2023

Findings of the AmericasNLP 2023 Shared Task on Machine Translation into Indigenous Languages ACL 2023

Scalable and Explainable Automated Scoring for Open-Ended Constructed Response Math Word Problems ACL 2023

MoQA: Benchmarking Multi-Type Open-Domain Question Answering ACL 2023

Follow the Knowledge: Structural Biases and Artefacts in Knowledge Grounded Dialog Datasets ACL 2023

MT Metrics Correlate with Human Ratings of Simultaneous Speech Translation ACL 2023

Temporal and Second Language Influence on Intra-Annotator Agreement and Stability in Hate Speech Labelling ACL 2023

GENTLE: A Genre-Diverse Multilayer Challenge Set for English NLP and Linguistic Evaluation ACL 2023

No Strong Feelings One Way or Another: Re-operationalizing Neutrality in Natural Language Inference ACL 2023

LLM-Eval: Unified Multi-Dimensional Automatic Evaluation for Open-Domain Conversations with Large Language Models ACL 2023

Large Language Models respond to Influence like Humans ACL 2023

UNIDECOR: A Unified Deception Corpus for Cross-Corpus Deception Detection ACL 2023

ChatGPT is fun, but it is not funny! Humor is still challenging Large Language Models ACL 2023

Can ChatGPT Understand Causal Language in Science Claims? ACL 2023

Benchmarking Offensive and Abusive Language in Dutch Tweets ACL 2023

Harmful Language Datasets: An Assessment of Robustness ACL 2023

Holistic Inter-Annotator Agreement and Corpus Coherence Estimation in a Large-scale Multilingual Annotation Campaign EMNLP 2023

SLOG: A Structural Generalization Benchmark for Semantic Parsing EMNLP 2023

Stop Uploading Test Data in Plain Text: Practical Strategies for Mitigating Data Contamination by Evaluation Benchmarks EMNLP 2023