Research Explorer
Evaluation
1654 directly classified papers
Papers per year
2005: 1
2006: 1
2007: 1
2008: 2
2009: 1
2010: 3
2011: 2
2012: 3
2013: 5
2014: 4
2015: 4
2016: 11
2017: 19
2018: 32
2019: 39
2020: 72
2021: 110
2022: 202
2023: 222
2024: 351
2025: 569
Papers
How Well Do Large Language Models Perform on Faux Pas Tests?
ACL 2023
Detecting Edit Failures In Large Language Models: An Improved Specificity Benchmark
ACL 2023
Moving Beyond Downstream Task Accuracy for Information Retrieval Benchmarking
ACL 2023
GLUE-X: Evaluating Natural Language Understanding Models from an Out-of-Distribution Generalization Perspective
ACL 2023
Discovering Language Model Behaviors with Model-Written Evaluations
ACL 2023
Rarely a problem? Language models exhibit inverse scaling in their predictions following few-type quantifiers
ACL 2023
DEnsity: Open-domain Dialogue Evaluation Metric using Density Estimation
ACL 2023
Findings of the AmericasNLP 2023 Shared Task on Machine Translation into Indigenous Languages
ACL 2023
Scalable and Explainable Automated Scoring for Open-Ended Constructed Response Math Word Problems
ACL 2023
MoQA: Benchmarking Multi-Type Open-Domain Question Answering
ACL 2023
Follow the Knowledge: Structural Biases and Artefacts in Knowledge Grounded Dialog Datasets
ACL 2023
MT Metrics Correlate with Human Ratings of Simultaneous Speech Translation
ACL 2023
Temporal and Second Language Influence on Intra-Annotator Agreement and Stability in Hate Speech Labelling
ACL 2023
GENTLE: A Genre-Diverse Multilayer Challenge Set for English NLP and Linguistic Evaluation
ACL 2023
No Strong Feelings One Way or Another: Re-operationalizing Neutrality in Natural Language Inference
ACL 2023
LLM-Eval: Unified Multi-Dimensional Automatic Evaluation for Open-Domain Conversations with Large Language Models
ACL 2023
Large Language Models respond to Influence like Humans
ACL 2023
UNIDECOR: A Unified Deception Corpus for Cross-Corpus Deception Detection
ACL 2023
ChatGPT is fun, but it is not funny! Humor is still challenging Large Language Models
ACL 2023
Can ChatGPT Understand Causal Language in Science Claims?
ACL 2023
Benchmarking Offensive and Abusive Language in Dutch Tweets
ACL 2023
Harmful Language Datasets: An Assessment of Robustness
ACL 2023
Holistic Inter-Annotator Agreement and Corpus Coherence Estimation in a Large-scale Multilingual Annotation Campaign
EMNLP 2023
SLOG: A Structural Generalization Benchmark for Semantic Parsing
EMNLP 2023
Stop Uploading Test Data in Plain Text: Practical Strategies for Mitigating Data Contamination by Evaluation Benchmarks
EMNLP 2023