Evaluation

393 papers

Papers per year (bar chart; year labels not preserved): 2, 2, 1, 3, 2, 383

Papers

CrochetBench: Can Vision-Language Models Move from Describing to Doing in Crochet Domain? ACL 2026

Beyond Single-shot Writing: Deep Research Agents are Unreliable at Multi-turn Report Revision ACL 2026

FinCall-Surprise: A Large Scale Multi-modal Benchmark for Earning Surprise Prediction ACL 2026

FinChart-Bench: Benchmarking Financial Chart Comprehension in Vision-Language Models ACL 2026

From Charts to Code: A Hierarchical Benchmark for Multimodal Models ACL 2026

Illusions of the Gold Standard: A Large-scale Analysis of Human Evaluation Protocols for Long-form Text Generation ACL 2026

Omni-RewardBench: Toward a Comprehensive Evaluation of Generative Reward Models Across Modalities ACL 2026

Is Human-Like Text Liked by Humans? Multilingual Human Detection and Preference Against AI ACL 2026

AI use in American newspapers is widespread, uneven, and rarely disclosed ACL 2026

CASPER in the Machine: Insights into Character Variety in LLM-Generated Stories ACL 2026

Rethinking Evaluation for LLM Hallucination Detection: A Desiderata, A New RAG-based Benchmark, New Insights ACL 2026

PRIME: A Process-Outcome Alignment Benchmark for Verifiable Reasoning in Mathematics and Engineering ACL 2026

AutoJudger: An Agent-Driven Framework for Efficient Benchmarking of MLLMs ACL 2026

When Vision-Language Models Judge Without Seeing: Exposing Informativeness Bias ACL 2026

Mediocrity is the key for LLM as a Judge Anchor Selection ACL 2026

VIGNETTE: Socially Grounded Bias Evaluation for Vision-Language Models ACL 2026

More Than Meets the Eye: Measuring the Semiotic Gap in Vision-Language Models via Semantic Anchorage ACL 2026

BenchMarker: An Education-Inspired Toolkit for Highlighting Flaws in Multiple-Choice Benchmarks ACL 2026

Language Models Don’t Know What You Want: Evaluating Personalization in Deep Research Needs Real Users ACL 2026

Evaluating Robustness of Large Language Models Against Multilingual Typographical Errors ACL 2026

The Path Not Taken: Duality in Reasoning about Program Execution ACL 2026

RExBench: Can coding agents autonomously implement AI research extensions? ACL 2026

ViDoRe V3: A Comprehensive Evaluation of Retrieval Augmented Generation in Complex Real-World Scenarios ACL 2026

Check Your Work: Structured Checklist Feedback for Improving Large Language Models ACL 2026

When LLMs Read Tables Carelessly: Measuring and Reducing Data Referencing Errors ACL 2026