Evaluation

1654 directly classified papers

Papers

Towards General Error Diagnosis via Behavioral Testing in Machine Translation EMNLP 2023

Evaluating Verifiability in Generative Search Engines EMNLP 2023

VIPHY: Probing “Visible” Physical Commonsense Knowledge EMNLP 2023

“You Are An Expert Linguistic Annotator”: Limits of LLMs as Analyzers of Abstract Meaning Representation EMNLP 2023

USB: A Unified Summarization Benchmark Across Tasks and Domains EMNLP 2023

FactSpotter: Evaluating the Factual Faithfulness of Graph-to-Text Generation EMNLP 2023

CCEval: A Representative Evaluation Benchmark for the Chinese-centric Multilingual Machine Translation EMNLP 2023

Automatic Analysis of Substantiation in Scientific Peer Reviews EMNLP 2023

NLP Evaluation in trouble: On the Need to Measure LLM Data Contamination for each Benchmark EMNLP 2023

Debias NLU Datasets via Training-free Perturbations EMNLP 2023

Can ChatGPT Defend its Belief in Truth? Evaluating LLM Reasoning via Debate EMNLP 2023

Probing the “Creativity” of Large Language Models: Can models produce divergent semantic association? EMNLP 2023

On Event Individuation for Document-Level Information Extraction EMNLP 2023

TELeR: A General Taxonomy of LLM Prompts for Benchmarking Complex Tasks EMNLP 2023

Annotation Sensitivity: Training Data Collection Methods Affect Model Performance EMNLP 2023

NarrativeXL: a Large-scale Dataset for Long-Term Memory Models EMNLP 2023

SAC3: Reliable Hallucination Detection in Black-Box Language Models via Semantic-aware Cross-check Consistency EMNLP 2023

Analyzing Multilingual Competency of LLMs in Multi-Turn Instruction Following: A Case Study of Arabic EMNLP 2023

Predicting Question-Answering Performance of Large Language Models through Semantic Consistency EMNLP 2023

Separating form and meaning: Using self-consistency to quantify task understanding across multiple senses EMNLP 2023

To Burst or Not to Burst: Generating and Quantifying Improbable Text EMNLP 2023

Are Large Language Models Reliable Judges? A Study on the Factuality Evaluation Capabilities of LLMs EMNLP 2023

Evaluating Neural Language Models as Cognitive Models of Language Acquisition EMNLP 2023

Walking a Tightrope – Evaluating Large Language Models in High-Risk Domains EMNLP 2023

mSCAN: A Dataset for Multilingual Compositional Generalisation Evaluation EMNLP 2023