conftrace_

Evaluation

393 papers

Papers per year: 2, 2, 1, 3, 2, 383

Papers

Evaluating Temporal Consistency in Multi-Turn Language Models ACL 2026

COMPASS: A Framework for Evaluating Organization-Specific Policy Alignment in LLMs ACL 2026

Rethinking Composed Image Retrieval Evaluation: A Fine-Grained Benchmark from Image Editing ACL 2026

UniSRM: A Unified Speech Reward Model for Reasoning-Based Fine-grained Assessment ACL 2026

Monotonic Scaffolding as a Diagnostic Lens for Legal Reasoning in LLMs ACL 2026

What About the Scene With the Hitler Reference? HAUNT: A Framework to Probe LLMs’ Self-consistency in Closed Domains Via Adversarial Nudge ACL 2026

GitChameleon 2.0: Evaluating AI Code Generation Against Python Library Version Incompatibilities ACL 2026

Knowledge without Wisdom: Measuring Misalignment between LLMs and Intended Impact ACL 2026

Revisiting Non-Verbatim Memorization in Large Language Models: The Role of Entity Surface Forms ACL 2026

HalluCitation Matters: Revealing the Impact of Hallucinated References with 300 Hallucinated Papers in ACL Conferences ACL 2026

Lost in Simulation: LLM-Simulated Users are Unreliable Proxies for Human Users in Agentic Evaluations ACL 2026

MathSight: A Benchmark Exploring Have Vision-Language Models Really Seen in University-Level Mathematical Reasoning? ACL 2026

REVEALER: Reinforcement-Guided Visual Reasoning for Element-Level Text-Image Alignment Evaluation ACL 2026

A Multilingual Social Bias Benchmark Incorporating Thinking Processes ACL 2026

Full-Duplex-Bench-v2: A Multi-Turn Evaluation Framework for Duplex Dialogue Systems with an Automated Examiner ACL 2026

Anchoring Depends on Confidence and Post-Training in Language Models ACL 2026

ReproEvalCard: A Reporting Standard for Reproducible Evaluation of LLM Pipelines ACL 2026

Test-Time Reasoners Are Strategic Multiple-Choice Test-Takers ACL 2026

CheckMIABench: Firm Foundations For Membership Inference Attacks on Language Models ACL 2026

Reliable Evaluation Protocol for Low-Precision Retrieval ACL 2026

When More Words Say Less: Decoupling Length and Specificity in Image Description Evaluation ACL 2026

Experiments or Outcomes? Probing Scientific Feasibility in Large Language Models ACL 2026

LOGICAL-COMMONSENSEQA: A Benchmark for Logical Commonsense Reasoning ACL 2026

Temporal Leakage in Search-Engine Date-Filtered Web Retrieval: A Retrospective Forecasting Case Study ACL 2026

Learning from Emptiness: De-biasing Listwise Rerankers with Content-Agnostic Probability Calibration ACL 2026