Evaluation

1654 directly classified papers

Papers

On A Scale From 1 to 5: Quantifying Hallucination in Faithfulness Evaluation NAACL 2025

Hard Emotion Test Evaluation Sets for Language Models NAACL 2025

UCL-Bench: A Chinese User-Centric Legal Benchmark for Large Language Models NAACL 2025

Evaluating Numeracy of Language Models as a Natural Language Inference Task NAACL 2025

Are Language Models Agnostic to Linguistically Grounded Perturbations? A Case Study of Indic Languages NAACL 2025

Do LLMs Have Distinct and Consistent Personality? TRAIT: Personality Testset designed for LLMs with Psychometrics NAACL 2025

Evaluating Evaluation Metrics for Ancient Chinese to English Machine Translation NAACL 2025

Towards Region-aware Bias Evaluation Metrics NAACL 2025

Assessing the Reliability and Validity of GPT-4 in Annotating Emotion Appraisal Ratings NAACL 2025

Measuring Mental Health Variables in Computational Research: Toward Validated, Dimensional, and Transdiagnostic Approaches NAACL 2025

An Analysis of Scoring Methods for Reranking in Large Language Model Story Generation NAACL 2025

Does Training on Synthetic Data Make Models Less Robust? NAACL 2025

Exploring Limitations of LLM Capabilities with Multi-Problem Evaluation NAACL 2025

Evaluating Robustness of LLMs to Numerical Variations in Mathematical Reasoning NAACL 2025

UTER: Capturing the Human Touch in Evaluating Morphologically Rich and Low-Resource Languages NAACL 2025

Analyzing Large Language Models’ pastiche ability: a case study on a 20th century Romanian author NAACL 2025

Line of Duty: Evaluating LLM Self-Knowledge via Consistency in Feasibility Boundaries NAACL 2025

Difficulty Estimation in Natural Language Tasks with Action Scores NAACL 2025

Defining and Quantifying Visual Hallucinations in Vision-Language Models NAACL 2025

Revitalizing Saturated Benchmarks: A Weighted Metric Approach for Differentiating Large Language Model Performance NAACL 2025

Investigating and Addressing Hallucinations of LLMs in Tasks Involving Negation NAACL 2025

A Theoretical Framework for Evaluating Narrative Surprise in Large Language Models NAACL 2025

CHATTER: A character-attribution dataset for narrative understanding NAACL 2025

SimulatorArena: Are User Simulators Reliable Proxies for Multi-Turn Evaluation of AI Assistants? EMNLP 2025

MCQFormatBench: Robustness Tests for Multiple-Choice Questions ACL 2025