conftrace
Evaluation
393 papers
Papers per year
2021: 2
2022: 2
2023: 1
2024: 3
2025: 2
2026: 383
Papers
Evaluating Temporal Consistency in Multi-Turn Language Models
ACL 2026
COMPASS: A Framework for Evaluating Organization-Specific Policy Alignment in LLMs
ACL 2026
Rethinking Composed Image Retrieval Evaluation: A Fine-Grained Benchmark from Image Editing
ACL 2026
UniSRM: A Unified Speech Reward Model for Reasoning-Based Fine-grained Assessment
ACL 2026
Monotonic Scaffolding as a Diagnostic Lens for Legal Reasoning in LLMs
ACL 2026
What About the Scene With the Hitler Reference? HAUNT: A Framework to Probe LLMs’ Self-consistency in Closed Domains Via Adversarial Nudge
ACL 2026
GitChameleon 2.0: Evaluating AI Code Generation Against Python Library Version Incompatibilities
ACL 2026
Knowledge without Wisdom: Measuring Misalignment between LLMs and Intended Impact
ACL 2026
Revisiting Non-Verbatim Memorization in Large Language Models: The Role of Entity Surface Forms
ACL 2026
HalluCitation Matters: Revealing the Impact of Hallucinated References with 300 Hallucinated Papers in ACL Conferences
ACL 2026
Lost in Simulation: LLM-Simulated Users are Unreliable Proxies for Human Users in Agentic Evaluations
ACL 2026
MathSight: A Benchmark Exploring Have Vision-Language Models Really Seen in University-Level Mathematical Reasoning?
ACL 2026
REVEALER: Reinforcement-Guided Visual Reasoning for Element-Level Text-Image Alignment Evaluation
ACL 2026
A Multilingual Social Bias Benchmark Incorporating Thinking Processes
ACL 2026
Full-Duplex-Bench-v2: A Multi-Turn Evaluation Framework for Duplex Dialogue Systems with an Automated Examiner
ACL 2026
Anchoring Depends on Confidence and Post-Training in Language Models
ACL 2026
ReproEvalCard: A Reporting Standard for Reproducible Evaluation of LLM Pipelines
ACL 2026
Test-Time Reasoners Are Strategic Multiple-Choice Test-Takers
ACL 2026
CheckMIABench: Firm Foundations For Membership Inference Attacks on Language Models
ACL 2026
Reliable Evaluation Protocol for Low-Precision Retrieval
ACL 2026
When More Words Say Less: Decoupling Length and Specificity in Image Description Evaluation
ACL 2026
Experiments or Outcomes? Probing Scientific Feasibility in Large Language Models
ACL 2026
LOGICAL-COMMONSENSEQA: A Benchmark for Logical Commonsense Reasoning
ACL 2026
Temporal Leakage in Search-Engine Date-Filtered Web Retrieval: A Retrospective Forecasting Case Study
ACL 2026
Learning from Emptiness: De-biasing Listwise Rerankers with Content-Agnostic Probability Calibration
ACL 2026