Evaluation

393 papers

Papers per year (chart; bar values in order): 2, 2, 1, 3, 2, 383

Papers

When Seeing Is not Enough: Revealing the Limits of Active Reasoning in MLLMs ACL 2026

Neo-Classic: A Benchmark for Evaluating Linguistic-Aesthetic Reasoning in Classical Chinese Poetry ACL 2026

RPC-Bench: A Fine-grained Benchmark for Research Paper Comprehension ACL 2026

Video-MMMU: Evaluating Knowledge Acquisition from Multidisciplinary Professional Videos ACL 2026

REST: Stress Testing Large Reasoning Models by Asking Multiple Problems at Once ACL 2026

VLN-MME: Diagnosing MLLMs as Language-guided Visual Navigation Agents ACL 2026

ImplicitMemBench: Measuring Unconscious Behavioral Adaptation in Large Language Models ACL 2026

Benchmarking Deflection and Hallucination in Large Vision-Language Models ACL 2026

KoCo-Bench: Can Large Language Models Leverage Domain Knowledge in Software Development? ACL 2026

MagicBench: Diagnosing Visual Agency Loss and Semantic Dependency in Multimodal LLMs ACL 2026

Bloom-Eval: A Hierarchical Evaluation Benchmark for Automatic Survey Generation Based on Bloom’s Taxonomy ACL 2026

FactVerse: A Benchmark for Factual Consistency in Interleaved Image–Text Generation ACL 2026

EIFFEL: a novel benchmark to measure bias of English heavy training on French idiomatic expressions ACL 2026

N-GLARE: An Non-Generative Latent Representation-Efficient LLM Safety Evaluator ACL 2026

SILO-BENCH: A Scalable Environment for Evaluating Distributed Coordination in Multi-Agent LLM Systems ACL 2026

Whose Facts Win? LLM Source Preferences under Knowledge Conflicts ACL 2026

LLMs (Almost) Never Abstain Under Medical Uncertainty ACL 2026

LongTutor: Benchmarking Large Language Models for Long-term Personalized Tutoring ACL 2026

MPR-GUI: Benchmarking and Enhancing Multilingual Perception and Reasoning in GUI Agents ACL 2026

Can LLMs Act as Historians? Evaluating Historical Research Capabilities of LLMs via the Chinese Imperial Examination ACL 2026

Controllable Contamination Detection for Reliable LLM Evaluation with Statistical Guarantees ACL 2026

KnowMe-Bench: Benchmarking Person Understanding for Lifelong Digital Companions ACL 2026

ArgGenBench: Benchmarking the Complex Controlled Argument Generation Capability of Large Language Models ACL 2026

Inverting the Shield: Systematically Generating Safety Tests from Policy Specifications ACL 2026

Too Nice to Tell the Truth: Quantifying Agreeableness-Driven Sycophancy in Role-Playing Language Models ACL 2026