Evaluation

393 papers

Papers per year (chart; year labels not preserved): 2, 2, 1, 3, 2, 383

Papers

Who Wrote This Line? Evaluating the Detection of LLM-Generated Classical Chinese Poetry ACL 2026

Common to Whom? Regional Cultural Commonsense and LLM Bias in India ACL 2026

Par-ITA: Benchmarking Seq2Seq and LLMs on a Human-Supervised Parallel Corpus for Italian Hyperpartisan Neutralization ACL 2026

OctoBench: Benchmarking Scaffold-Aware Instruction Following in Repository-Grounded Agentic Coding ACL 2026

MobileWorld: Benchmarking Autonomous Mobile Agents in Agent-User Interactive and MCP-Augmented Environments ACL 2026

AutoTaskEval: Towards Domain-Specific and Fine-Grained Evaluation for LLMs ACL 2026

CiteGuard: Faithful Citation Attribution for LLMs via Retrieval-Augmented Validation ACL 2026

Logical Consistency as a Bridge: Improving LLM Hallucination Detection via Label Constraint Modeling between Responses and Self-Judgments ACL 2026

MAS-Bench: A Unified Benchmark for Shortcut-Augmented Hybrid Mobile GUI Agents ACL 2026

HoWToBench: Holistic Evaluation for LLM’s Capability in Human-level Writing using Tree of Writing ACL 2026

When KV Cache Reuse Fails in Multi-Agent Systems: Cross-Candidate Interaction is Crucial for LLM Judges ACL 2026

Beyond Timestamps: Bridging Forward and Backward Reasoning in Temporal Numerical and Relational Understanding ACL 2026

DeepPlanning: Benchmarking Long-Horizon Agentic Planning with Verifiable Constraints ACL 2026

AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts ACL 2026

Responsible Evaluation of AI for Mental Health ACL 2026

SpeechLLM-as-Judges: Towards General and Interpretable Speech Quality Evaluation ACL 2026

Revisiting the Reliability of Language Models in Instruction-Following ACL 2026

ChessArena: A Chess Testbed for Evaluating Strategic Reasoning Capabilities of Large Language Models ACL 2026

Flattery in Motion: Benchmarking and Analyzing Sycophancy in Video-LLMs ACL 2026

Mem2ActBench: A Benchmark for Evaluating Long-Term Memory Utilization in Task-Oriented Autonomous Agents ACL 2026

Repeated Sequences Reveal Gaps between Large Language Models and Natural Language ACL 2026

AgentCoMa: A Compositional Benchmark Mixing Commonsense and Mathematical Reasoning in Real-World Scenarios ACL 2026

ReportLogic: Evaluating Logical Quality in Deep Research Reports ACL 2026

Investigating More Explainable and Partition-Free Compositionality Estimation for LLMs: A Rule-Generation Perspective ACL 2026

Evaluating the Expressive Appropriateness of Speech in Rich Contexts ACL 2026