Research Explorer
Evaluation
1654 directly classified papers
Papers per year
2005: 1
2006: 1
2007: 1
2008: 2
2009: 1
2010: 3
2011: 2
2012: 3
2013: 5
2014: 4
2015: 4
2016: 11
2017: 19
2018: 32
2019: 39
2020: 72
2021: 110
2022: 202
2023: 222
2024: 351
2025: 569
Papers
Towards General Error Diagnosis via Behavioral Testing in Machine Translation
EMNLP 2023
Evaluating Verifiability in Generative Search Engines
EMNLP 2023
VIPHY: Probing “Visible” Physical Commonsense Knowledge
EMNLP 2023
“You Are An Expert Linguistic Annotator”: Limits of LLMs as Analyzers of Abstract Meaning Representation
EMNLP 2023
USB: A Unified Summarization Benchmark Across Tasks and Domains
EMNLP 2023
FactSpotter: Evaluating the Factual Faithfulness of Graph-to-Text Generation
EMNLP 2023
CCEval: A Representative Evaluation Benchmark for the Chinese-centric Multilingual Machine Translation
EMNLP 2023
Automatic Analysis of Substantiation in Scientific Peer Reviews
EMNLP 2023
NLP Evaluation in trouble: On the Need to Measure LLM Data Contamination for each Benchmark
EMNLP 2023
Debias NLU Datasets via Training-free Perturbations
EMNLP 2023
Can ChatGPT Defend its Belief in Truth? Evaluating LLM Reasoning via Debate
EMNLP 2023
Probing the “Creativity” of Large Language Models: Can models produce divergent semantic association?
EMNLP 2023
On Event Individuation for Document-Level Information Extraction
EMNLP 2023
TELeR: A General Taxonomy of LLM Prompts for Benchmarking Complex Tasks
EMNLP 2023
Annotation Sensitivity: Training Data Collection Methods Affect Model Performance
EMNLP 2023
NarrativeXL: a Large-scale Dataset for Long-Term Memory Models
EMNLP 2023
SAC3: Reliable Hallucination Detection in Black-Box Language Models via Semantic-aware Cross-check Consistency
EMNLP 2023
Analyzing Multilingual Competency of LLMs in Multi-Turn Instruction Following: A Case Study of Arabic
EMNLP 2023
Predicting Question-Answering Performance of Large Language Models through Semantic Consistency
EMNLP 2023
Separating form and meaning: Using self-consistency to quantify task understanding across multiple senses
EMNLP 2023
To Burst or Not to Burst: Generating and Quantifying Improbable Text
EMNLP 2023
Are Large Language Models Reliable Judges? A Study on the Factuality Evaluation Capabilities of LLMs
EMNLP 2023
Evaluating Neural Language Models as Cognitive Models of Language Acquisition
EMNLP 2023
Walking a Tightrope – Evaluating Large Language Models in High-Risk Domains
EMNLP 2023
mSCAN: A Dataset for Multilingual Compositional Generalisation Evaluation
EMNLP 2023