conftrace
Evaluation
393 papers
Papers per year
2021: 2
2022: 2
2023: 1
2024: 3
2025: 2
2026: 383
Papers
Evaluating Visual Narrative Coherence in Story Visualization via Diversified Storylines
ACL 2026
Improving the Distributional Alignment of LLMs using Supervision
ACL 2026
GameplayQA: A Benchmarking Framework for Decision-Dense POV-Synced Multi-Video Understanding of 3D Virtual Agents
ACL 2026
DeepFact: Co-Evolving Benchmarks and Agents for Deep Research Factuality
ACL 2026
ClaimDB: A Fact Verification Benchmark over Large Structured Data
ACL 2026
SAHM: A Benchmark for Arabic Financial and Shari’ah-Compliant Reasoning
ACL 2026
DRInQ: Evaluating Conversational Implicature with Controlled Context Variation
ACL 2026
TPS-Bench: Evaluating AI Agents’ Tool Planning & Scheduling Abilities in Compounding Tasks
ACL 2026
S2S-Arena: Evaluating Paralinguistic Instruction Following in Speech-to-Speech Models
ACL 2026
Mitigating Selection Bias in Large Language Models via Permutation-Aware GRPO
ACL 2026
DiningBench: A Hierarchical Multi-view Benchmark for Perception and Reasoning in the Dietary Domain
ACL 2026
Where the Cat Sat: A Multilingual Framework for Spatial Language Understanding
ACL 2026
SMART: Evaluating LLMs’ Mathematical Reasoning via a Human Cognitive Process-Inspired Benchmark
ACL 2026
Decentralized Arena: Towards Democratic and Scalable Automatic Evaluation of Language Models
ACL 2026
EVM-QuestBench: An Execution-Grounded Benchmark for Natural-Language Transaction Code Generation
ACL 2026
Audio MultiChallenge: A Multi-Turn Evaluation of Spoken Dialogue Systems on Natural Human Interaction
ACL 2026
If an LLM Were a Character, Would It Know Its Own Story? Evaluating Lifelong Learning in LLMs
ACL 2026
LLM-Powered Benchmark Factory: Reliable, Generic, and Efficient
ACL 2026
JurisBench: A Deep Benchmark for Assessing Large Language Models in Professional Legal Practice
ACL 2026
LeCoDe: A Benchmark Dataset for Interactive Legal Consultation Dialogue Evaluation
ACL 2026
METER: Evaluating Multi-Level Contextual Causal Reasoning in Large Language Models
ACL 2026
Pub-LawBench: Public-Oriented Benchmarking for LegalAI
ACL 2026
Diversity in Unity, Theory in Practice: Hierarchical Multitask Benchmarks for Chinese Minority Languages
ACL 2026
Test of Time: Rethinking Temporal Signal of Benchmark Contamination
ACL 2026
Probing Audio-Visual Reasoning in Multimodal Language Models through the Lens of Audio
ACL 2026