Evaluation

1654 directly classified papers

Papers

Can AI Make Us Laugh? Comparing Jokes Generated by Witscript and a Human Expert COLING 2025

Do not Abstain! Identify and Solve the Uncertainty ACL 2025

ReliableEval: A Recipe for Stochastic LLM Evaluation via Method of Moments EMNLP 2025

LLM-based post-editing as reference-free GEC evaluation ACL 2025

Are Language Models Agnostic to Linguistically Grounded Perturbations? A Case Study of Indic Languages NAACL 2025

InMind: Evaluating LLMs in Capturing and Applying Individual Human Reasoning Styles EMNLP 2025

Evaluating Numeracy of Language Models as a Natural Language Inference Task NAACL 2025

RCScore: Quantifying Response Consistency in Large Language Models EMNLP 2025

Do LLMs Give Psychometrically Plausible Responses in Educational Assessments? ACL 2025

OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain EMNLP 2025

Semantic Masking in a Needle-in-a-haystack Test for Evaluating Large Language Model Long-Text Capabilities COLING 2025

Arena-lite: Efficient and Reliable Large Language Model Evaluation via Tournament-Based Direct Comparisons EMNLP 2025

Don’t Score too Early! Evaluating Argument Mining Models on Incomplete Essays ACL 2025

NOVA-63: Native Omni-lingual Versatile Assessments of 63 Disciplines EMNLP 2025

MultiConIR: Towards Multi-Condition Information Retrieval EMNLP 2025

Confounding Factors in Relating Model Performance to Morphology EMNLP 2025

Comparing Behavioral Patterns of LLM and Human Tutors: A Population-level Analysis with the CIMA Dataset ACL 2025

UNCERTAINTY-LINE: Length-Invariant Estimation of Uncertainty for Large Language Models EMNLP 2025

Why should only High-Resource-Languages have all the fun? Pivot Based Evaluation in Low Resource Setting COLING 2025

PoSum-Bench: Benchmarking Position Bias in LLM-based Conversational Summarization EMNLP 2025

Findings of the BEA 2025 Shared Task on Pedagogical Ability Assessment of AI-powered Tutors ACL 2025

We Need to Measure Data Diversity in NLP — Better and Broader EMNLP 2025

Evaluating Robustness of LLMs to Numerical Variations in Mathematical Reasoning NAACL 2025

FB-Bench: A Fine-Grained Multi-Task Benchmark for Evaluating LLMs’ Responsiveness to Human Feedback EMNLP 2025

Stop Playing the Guessing Game! Evaluating Conversational Recommender Systems via Target-free User Simulation EMNLP 2025