conftrace
Evaluation
393 papers
Papers per year
2021: 2
2022: 2
2023: 1
2024: 3
2025: 2
2026: 383
Papers
Who Wrote This Line? Evaluating the Detection of LLM-Generated Classical Chinese Poetry
ACL 2026
Common to Whom? Regional Cultural Commonsense and LLM Bias in India
ACL 2026
Par-ITA: Benchmarking Seq2Seq and LLMs on a Human-Supervised Parallel Corpus for Italian Hyperpartisan Neutralization
ACL 2026
OctoBench: Benchmarking Scaffold-Aware Instruction Following in Repository-Grounded Agentic Coding
ACL 2026
MobileWorld: Benchmarking Autonomous Mobile Agents in Agent-User Interactive and MCP-Augmented Environments
ACL 2026
AutoTaskEval: Towards Domain-Specific and Fine-Grained Evaluation for LLMs
ACL 2026
CiteGuard: Faithful Citation Attribution for LLMs via Retrieval-Augmented Validation
ACL 2026
Logical Consistency as a Bridge: Improving LLM Hallucination Detection via Label Constraint Modeling between Responses and Self-Judgments
ACL 2026
MAS-Bench: A Unified Benchmark for Shortcut-Augmented Hybrid Mobile GUI Agents
ACL 2026
HoWToBench: Holistic Evaluation for LLM’s Capability in Human-level Writing using Tree of Writing
ACL 2026
When KV Cache Reuse Fails in Multi-Agent Systems: Cross-Candidate Interaction is Crucial for LLM Judges
ACL 2026
Beyond Timestamps: Bridging Forward and Backward Reasoning in Temporal Numerical and Relational Understanding
ACL 2026
DeepPlanning: Benchmarking Long-Horizon Agentic Planning with Verifiable Constraints
ACL 2026
AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts
ACL 2026
Responsible Evaluation of AI for Mental Health
ACL 2026
SpeechLLM-as-Judges: Towards General and Interpretable Speech Quality Evaluation
ACL 2026
Revisiting the Reliability of Language Models in Instruction-Following
ACL 2026
ChessArena: A Chess Testbed for Evaluating Strategic Reasoning Capabilities of Large Language Models
ACL 2026
Flattery in Motion: Benchmarking and Analyzing Sycophancy in Video-LLMs
ACL 2026
Mem2ActBench: A Benchmark for Evaluating Long-Term Memory Utilization in Task-Oriented Autonomous Agents
ACL 2026
Repeated Sequences Reveal Gaps between Large Language Models and Natural Language
ACL 2026
AgentCoMa: A Compositional Benchmark Mixing Commonsense and Mathematical Reasoning in Real-World Scenarios
ACL 2026
ReportLogic: Evaluating Logical Quality in Deep Research Reports
ACL 2026
Investigating More Explainable and Partition-Free Compositionality Estimation for LLMs: A Rule-Generation Perspective
ACL 2026
Evaluating the Expressive Appropriateness of Speech in Rich Contexts
ACL 2026