conftrace
Evaluation
393 papers
Papers per year
2021: 2
2022: 2
2023: 1
2024: 3
2025: 2
2026: 383
Papers
When Seeing Is not Enough: Revealing the Limits of Active Reasoning in MLLMs
ACL 2026
Neo-Classic: A Benchmark for Evaluating Linguistic-Aesthetic Reasoning in Classical Chinese Poetry
ACL 2026
RPC-Bench: A Fine-grained Benchmark for Research Paper Comprehension
ACL 2026
Video-MMMU: Evaluating Knowledge Acquisition from Multidisciplinary Professional Videos
ACL 2026
REST: Stress Testing Large Reasoning Models by Asking Multiple Problems at Once
ACL 2026
VLN-MME: Diagnosing MLLMs as Language-guided Visual Navigation Agents
ACL 2026
ImplicitMemBench: Measuring Unconscious Behavioral Adaptation in Large Language Models
ACL 2026
Benchmarking Deflection and Hallucination in Large Vision-Language Models
ACL 2026
KoCo-Bench: Can Large Language Models Leverage Domain Knowledge in Software Development?
ACL 2026
MagicBench: Diagnosing Visual Agency Loss and Semantic Dependency in Multimodal LLMs
ACL 2026
Bloom-Eval: A Hierarchical Evaluation Benchmark for Automatic Survey Generation Based on Bloom’s Taxonomy
ACL 2026
FactVerse: A Benchmark for Factual Consistency in Interleaved Image–Text Generation
ACL 2026
EIFFEL: a novel benchmark to measure bias of English heavy training on French idiomatic expressions
ACL 2026
N-GLARE: An Non-Generative Latent Representation-Efficient LLM Safety Evaluator
ACL 2026
SILO-BENCH: A Scalable Environment for Evaluating Distributed Coordination in Multi-Agent LLM Systems
ACL 2026
Whose Facts Win? LLM Source Preferences under Knowledge Conflicts
ACL 2026
LLMs (Almost) Never Abstain Under Medical Uncertainty
ACL 2026
LongTutor: Benchmarking Large Language Models for Long-term Personalized Tutoring
ACL 2026
MPR-GUI: Benchmarking and Enhancing Multilingual Perception and Reasoning in GUI Agents
ACL 2026
Can LLMs Act as Historians? Evaluating Historical Research Capabilities of LLMs via the Chinese Imperial Examination
ACL 2026
Controllable Contamination Detection for Reliable LLM Evaluation with Statistical Guarantees
ACL 2026
KnowMe-Bench: Benchmarking Person Understanding for Lifelong Digital Companions
ACL 2026
ArgGenBench: Benchmarking the Complex Controlled Argument Generation Capability of Large Language Models
ACL 2026
Inverting the Shield: Systematically Generating Safety Tests from Policy Specifications
ACL 2026
Too Nice to Tell the Truth: Quantifying Agreeableness-Driven Sycophancy in Role-Playing Language Models
ACL 2026