conftrace
Evaluation
393 papers
Papers per year
2021: 2
2022: 2
2023: 1
2024: 3
2025: 2
2026: 383
Papers
UrbanGeoEval: A City-Scale Benchmark for Evaluating Large Language Models in Geospatial Reasoning
ACL 2026
Scaling Reasoning, Losing Control: Evaluating Instruction Following in Large Reasoning Models
ACL 2026
MirrorQA: Benchmarking Multimodal LLMs on Mirror-Orientation Reasoning
ACL 2026
GeoRC: A Benchmark for Geolocation Reasoning Chains
ACL 2026
CAR-bench: Evaluating the Consistency and Limit-Awareness of LLM Agents under Real-World Uncertainty
ACL 2026
Tiny Scales, Great Challenges: The Limits of Multimodal LLMs in Scale Recognition
ACL 2026
VISTA: Verification In Sequential Turn-based Assessment
ACL 2026
HarDBench: A Benchmark for Draft-Based Co-Authoring Jailbreak Attacks for Safe Human–LLM Collaborative Writing
ACL 2026
CNSL-bench: Benchmarking the Sign Language Understanding Capabilities of MLLMs on Chinese National Sign Language
ACL 2026
GenPT: Beyond Self-Report for Reliable LLM Psychometrics via Generative Projective Testing
ACL 2026
ProactiveEval: A Unified Evaluation Framework for Proactive Dialogue Agents
ACL 2026
Multimodal Safety Evaluation in Generative Agent Social Simulations
ACL 2026
RealChart2Code: Bridging the Gap in Real-World Chart-to-Code Generation via Multi-Task Evaluation
ACL 2026
VeriTaS: The First Dynamic Benchmark for Multimodal Automated Fact-Checking
ACL 2026
Can Large Language Models Keep Up? Benchmarking Online Adaptation to Continual Knowledge Streams
ACL 2026
Evaluating Language Model Pluralism through In-the-wild Crowd Discussions
ACL 2026
PRBench: Large-Scale Expert Rubrics for Evaluating High-Stakes Professional Reasoning
ACL 2026
HypoEval: Hypothesis-Guided Evaluation for Natural Language Generation
ACL 2026
Effects of Collaboration on the Performance of Interactive Theme Discovery Systems
ACL 2026
GUIDE: Towards Scalable Advising for Research Ideas
ACL 2026
Explain the Synth: Interpretable Evaluation of LLM Data Synthesis
ACL 2026
Beyond the Leaderboard: Rethinking Medical Benchmarks for Large Language Models
ACL 2026
When Personalization Tricks Detectors: The Feature-Inversion Trap in Machine-Generated Text Detection
ACL 2026
C-World: A Computer Use Agent Environment Creator
ACL 2026
Simple Agents, Biased Judges: Efficient Multi-Party Dialogue Generation & The Evaluation Gap
ACL 2026