conftrace
Evaluation
393 papers
Papers per year
2021: 2
2022: 2
2023: 1
2024: 3
2025: 2
2026: 383
Papers
CrochetBench: Can Vision-Language Models Move from Describing to Doing in Crochet Domain?
ACL 2026
Beyond Single-shot Writing: Deep Research Agents are Unreliable at Multi-turn Report Revision
ACL 2026
FinCall-Surprise: A Large Scale Multi-modal Benchmark for Earning Surprise Prediction
ACL 2026
FinChart-Bench: Benchmarking Financial Chart Comprehension in Vision-Language Models
ACL 2026
From Charts to Code: A Hierarchical Benchmark for Multimodal Models
ACL 2026
Illusions of the Gold Standard: A Large-scale Analysis of Human Evaluation Protocols for Long-form Text Generation
ACL 2026
Omni-RewardBench: Toward a Comprehensive Evaluation of Generative Reward Models Across Modalities
ACL 2026
Is Human-Like Text Liked by Humans? Multilingual Human Detection and Preference Against AI
ACL 2026
AI use in American newspapers is widespread, uneven, and rarely disclosed
ACL 2026
CASPER in the Machine: Insights into Character Variety in LLM-Generated Stories
ACL 2026
Rethinking Evaluation for LLM Hallucination Detection: A Desiderata, A New RAG-based Benchmark, New Insights
ACL 2026
PRIME: A Process-Outcome Alignment Benchmark for Verifiable Reasoning in Mathematics and Engineering
ACL 2026
AutoJudger: An Agent-Driven Framework for Efficient Benchmarking of MLLMs
ACL 2026
When Vision-Language Models Judge Without Seeing: Exposing Informativeness Bias
ACL 2026
Mediocrity is the key for LLM as a Judge Anchor Selection
ACL 2026
VIGNETTE: Socially Grounded Bias Evaluation for Vision-Language Models
ACL 2026
More Than Meets the Eye: Measuring the Semiotic Gap in Vision-Language Models via Semantic Anchorage
ACL 2026
BenchMarker: An Education-Inspired Toolkit for Highlighting Flaws in Multiple-Choice Benchmarks
ACL 2026
Language Models Don’t Know What You Want: Evaluating Personalization in Deep Research Needs Real Users
ACL 2026
Evaluating Robustness of Large Language Models Against Multilingual Typographical Errors
ACL 2026
The Path Not Taken: Duality in Reasoning about Program Execution
ACL 2026
RExBench: Can coding agents autonomously implement AI research extensions?
ACL 2026
ViDoRe V3: A Comprehensive Evaluation of Retrieval Augmented Generation in Complex Real-World Scenarios
ACL 2026
Check Your Work: Structured Checklist Feedback for Improving Large Language Models
ACL 2026
When LLMs Read Tables Carelessly: Measuring and Reducing Data Referencing Errors
ACL 2026