Evaluation

393 papers

Papers per year: 2, 2, 1, 3, 2, 383

Papers

UrbanGeoEval: A City-Scale Benchmark for Evaluating Large Language Models in Geospatial Reasoning ACL 2026

Scaling Reasoning, Losing Control: Evaluating Instruction Following in Large Reasoning Models ACL 2026

MirrorQA: Benchmarking Multimodal LLMs on Mirror-Orientation Reasoning ACL 2026

GeoRC: A Benchmark for Geolocation Reasoning Chains ACL 2026

CAR-bench: Evaluating the Consistency and Limit-Awareness of LLM Agents under Real-World Uncertainty ACL 2026

Tiny Scales, Great Challenges: The Limits of Multimodal LLMs in Scale Recognition ACL 2026

VISTA: Verification In Sequential Turn-based Assessment ACL 2026

HarDBench: A Benchmark for Draft-Based Co-Authoring Jailbreak Attacks for Safe Human–LLM Collaborative Writing ACL 2026

CNSL-bench: Benchmarking the Sign Language Understanding Capabilities of MLLMs on Chinese National Sign Language ACL 2026

GenPT: Beyond Self-Report for Reliable LLM Psychometrics via Generative Projective Testing ACL 2026

ProactiveEval: A Unified Evaluation Framework for Proactive Dialogue Agents ACL 2026

Multimodal Safety Evaluation in Generative Agent Social Simulations ACL 2026

RealChart2Code: Bridging the Gap in Real-World Chart-to-Code Generation via Multi-Task Evaluation ACL 2026

VeriTaS: The First Dynamic Benchmark for Multimodal Automated Fact-Checking ACL 2026

Can Large Language Models Keep Up? Benchmarking Online Adaptation to Continual Knowledge Streams ACL 2026

Evaluating Language Model Pluralism through In-the-wild Crowd Discussions ACL 2026

PRBench: Large-Scale Expert Rubrics for Evaluating High-Stakes Professional Reasoning ACL 2026

HypoEval: Hypothesis-Guided Evaluation for Natural Language Generation ACL 2026

Effects of Collaboration on the Performance of Interactive Theme Discovery Systems ACL 2026

GUIDE: Towards Scalable Advising for Research Ideas ACL 2026

Explain the Synth: Interpretable Evaluation of LLM Data Synthesis ACL 2026

Beyond the Leaderboard: Rethinking Medical Benchmarks for Large Language Models ACL 2026

When Personalization Tricks Detectors: The Feature-Inversion Trap in Machine-Generated Text Detection ACL 2026

C-World: A Computer Use Agent Environment Creator ACL 2026

Simple Agents, Biased Judges: Efficient Multi-Party Dialogue Generation & The Evaluation Gap ACL 2026