Evaluation

1654 directly classified papers

Papers

Only a Little to the Left: A Theory-grounded Measure of Political Bias in Large Language Models ACL 2025

A Tale of Evaluating Factual Consistency: Case Study on Long Document Summarization Evaluation ACL 2025

Difficulty Estimation in Natural Language Tasks with Action Scores NAACL 2025

Can External Validation Tools Improve Annotation Quality for LLM-as-a-Judge? ACL 2025

Line of Duty: Evaluating LLM Self-Knowledge via Consistency in Feasibility Boundaries NAACL 2025

Not quite Sherlock Holmes: Language model predictions do not reliably differentiate impossible from improbable events ACL 2025

Defining and Quantifying Visual Hallucinations in Vision-Language Models NAACL 2025

Judging the Judges: Can Large Vision-Language Models Fairly Evaluate Chart Comprehension and Reasoning? ACL 2025

BI-Bench: A Comprehensive Benchmark Dataset and Unsupervised Evaluation for BI Systems ACL 2025

Worse than Random? An Embarrassingly Simple Probing Evaluation of Large Multimodal Models in Medical VQA ACL 2025

“Stupid robot, I want to speak to a human!” User Frustration Detection in Task-Oriented Dialog Systems COLING 2025

The Alternative Annotator Test for LLM-as-a-Judge: How to Statistically Justify Replacing Human Annotators with LLMs ACL 2025

A Theoretical Framework for Evaluating Narrative Surprise in Large Language Models NAACL 2025

LADDER: Language-Driven Slice Discovery and Error Rectification in Vision Classifiers ACL 2025

SEA-HELM: Southeast Asian Holistic Evaluation of Language Models ACL 2025

CapArena: Benchmarking and Analyzing Detailed Image Captioning in the LLM Era ACL 2025

A Large and Balanced Corpus for Fine-grained Arabic Readability Assessment ACL 2025

REPRO-Bench: Can Agentic AI Systems Assess the Reproducibility of Social Science Research? ACL 2025

Label Convergence: Defining an Upper Performance Bound in Object Recognition through Contradictory Annotations WACV 2025

Com2: A Causal-Guided Benchmark for Exploring Complex Commonsense Reasoning in Large Language Models ACL 2025

Lost in Transcription, Found in Distribution Shift: Demystifying Hallucination in Speech Foundation Models ACL 2025

Human Bias in the Face of AI: Examining Human Judgment Against Text Labeled as AI Generated ACL 2025

Noise-Aware Evaluation of Object Detectors WACV 2025

Frame by Familiar Frame: Understanding Replication in Video Diffusion Models WACV 2025

Demystifying Synthetic Data in LLM Pre-training: A Systematic Study of Scaling Laws, Benefits, and Pitfalls EMNLP 2025