Evaluation

393 papers

Papers per year: 2, 2, 1, 3, 2, 383

Papers

Evaluating Visual Narrative Coherence in Story Visualization via Diversified Storylines ACL 2026

Improving the Distributional Alignment of LLMs using Supervision ACL 2026

GameplayQA: A Benchmarking Framework for Decision-Dense POV-Synced Multi-Video Understanding of 3D Virtual Agents ACL 2026

DeepFact: Co-Evolving Benchmarks and Agents for Deep Research Factuality ACL 2026

ClaimDB: A Fact Verification Benchmark over Large Structured Data ACL 2026

SAHM: A Benchmark for Arabic Financial and Shari’ah-Compliant Reasoning ACL 2026

DRInQ: Evaluating Conversational Implicature with Controlled Context Variation ACL 2026

TPS-Bench: Evaluating AI Agents’ Tool Planning & Scheduling Abilities in Compounding Tasks ACL 2026

S2S-Arena: Evaluating Paralinguistic Instruction Following in Speech-to-Speech Models ACL 2026

Mitigating Selection Bias in Large Language Models via Permutation-Aware GRPO ACL 2026

DiningBench: A Hierarchical Multi-view Benchmark for Perception and Reasoning in the Dietary Domain ACL 2026

Where the Cat Sat: A Multilingual Framework for Spatial Language Understanding ACL 2026

SMART: Evaluating LLMs’ Mathematical Reasoning via a Human Cognitive Process-Inspired Benchmark ACL 2026

Decentralized Arena: Towards Democratic and Scalable Automatic Evaluation of Language Models ACL 2026

EVM-QuestBench: An Execution-Grounded Benchmark for Natural-Language Transaction Code Generation ACL 2026

Audio MultiChallenge: A Multi-Turn Evaluation of Spoken Dialogue Systems on Natural Human Interaction ACL 2026

If an LLM Were a Character, Would It Know Its Own Story? Evaluating Lifelong Learning in LLMs ACL 2026

LLM-Powered Benchmark Factory: Reliable, Generic, and Efficient ACL 2026

JurisBench: A Deep Benchmark for Assessing Large Language Models in Professional Legal Practice ACL 2026

LeCoDe: A Benchmark Dataset for Interactive Legal Consultation Dialogue Evaluation ACL 2026

METER: Evaluating Multi-Level Contextual Causal Reasoning in Large Language Models ACL 2026

Pub-LawBench: Public-Oriented Benchmarking for LegalAI ACL 2026

Diversity in Unity, Theory in Practice: Hierarchical Multitask Benchmarks for Chinese Minority Languages ACL 2026

Test of Time: Rethinking Temporal Signal of Benchmark Contamination ACL 2026

Probing Audio-Visual Reasoning in Multimodal Language Models through the Lens of Audio ACL 2026