conftrace_

Evaluation

393 papers

Papers per year: 2, 2, 1, 3, 2, 383

Papers

The “Knowledge–Behavior Gap” in Cultural Taboo Safety of Large Language Models ACL 2026

OSCBench: Benchmarking Object State Change in Text-to-Video Generation ACL 2026

RubricBench: Aligning Model-Generated Rubrics with Human Standards ACL 2026

RubricHub: A Comprehensive and Highly Discriminative Rubric Dataset via Automated Coarse-to-Fine Generation ACL 2026

CogToM: A Comprehensive Theory of Mind Benchmark inspired by Human Cognition for Large Language Models ACL 2026

CITE: Benchmarking Heterogeneous Text-Attributed Graph Models ACL 2026

Understanding LLM Performance Degradation in Multi-Instance Processing: The Roles of Instance Count and Context Length ACL 2026

GQLBench: A Large-Scale Cross-Domain, Cross-Dialect Benchmark for NL2GQL ACL 2026

ReviewGrounder: Improving Review Substantiveness with Rubric-Guided, Tool-Integrated Agents ACL 2026

StratMem-Bench: Evaluating Strategic Memory Use in Virtual Character Conversation Beyond Factual Recall ACL 2026

Label Effects: Shared Heuristic Reliance in Trust Assessment by Humans and LLM-as-a-Judge ACL 2026

Routing with Generated Data: Annotation-Free LLM Skill Estimation and Expert Selection ACL 2026

Can AI Be a Good Peer Reviewer? A Survey of Peer Review Process, Evaluation, and the Future ACL 2026

MetaBench: A Multi-task Benchmark for Assessing LLMs in Metabolomics ACL 2026

Measuring Distribution Shift in User Prompts and Its Effects on LLM Performance ACL 2026

Diagnosing Spatial Consistency across Perspectives and Viewpoints in Large Vision-Language Models ACL 2026

CommonLID: Re-evaluating State-of-the-Art Language Identification Performance on Web Data ACL 2026

PIArena: A Platform for Prompt Injection Evaluation ACL 2026

ChemReason-Bench: Benchmarking Large Language Models for Procedural Reasoning in Experimental Chemistry ACL 2026

Ranking Reasoning LLMs under Test-Time Scaling ACL 2026

CloneMem: Benchmarking Long-Term Memory for AI Clones ACL 2026

PRISM: Probing Reasoning, Instruction, and Source Memory in LLM Hallucinations ACL 2026

UniDataBench: Evaluating Data Analytics Agents Across Structured and Unstructured Data ACL 2026

Do LLMs Really Memorize Personally Identifiable Information? Revisiting PII Leakage with a Cue-Controlled Memorization Framework ACL 2026

MMSciCode: Real-world Evaluation of Multilingual Multi-Discipline Scientific Research Coding ACL 2026