conftrace
Evaluation
393 papers
Papers per year
2021: 2
2022: 2
2023: 1
2024: 3
2025: 2
2026: 383
Papers
Copyright Detective: A Forensic System to Evidence LLMs Flickering Copyright Leakage Risks
ACL 2026
PROTEA: Offline Evaluation and Iterative Refinement for Multi-Agent LLM Workflows
ACL 2026
ScaleBox: Enabling High-Fidelity and Scalable Code Verification for Large Language Models
ACL 2026
MASEval: Extending Multi-Agent Evaluation from Models to Systems
ACL 2026
AutoChecklist: Composable Pipelines for Checklist Generation and Scoring with LLM-as-a-Judge
ACL 2026
UltraEval-Audio: A Unified Framework for Comprehensive Evaluation of Audio Foundation Models
ACL 2026
Paper2Web: Let’s Make Your Paper Alive!
ACL 2026
Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents
ACL 2026
DIAGRAMS: A Review Framework for Reasoning-Level Attribution in Diagram QA
ACL 2026
SlideGuard: AI-Driven Evaluation of Graduate Student Presentation Materials
ACL 2026
Thesis Proposal: Toward a Human-Centered and Perspective-Aware Framework for Reproducible ML Evaluation and AI Alignment
ACL 2026
What Moves the Pareto Frontier in Tool-Using Agents? A Compute-Aware Study of ReAct Variants
ACL 2026
How Hard is Math? Using Quantitative Metrics to Measure LLM Alignment to Human Intuitions of Difficulty
ACL 2026
Processing Inconsistency Predicts Language Competence: LLM Evaluation Without Answer Labels on Turkic Languages
ACL 2026
Thesis Proposal: Auditing and Mitigating Demographic Bias in Multi-Stage Retrieval Systems for Criminal Justice Applications
ACL 2026
Contextual Diversity Measure (CDM) for Controllable Story Generation in Large Language Models
ACL 2026
Calibration vs Decision Making: Revisiting the Reliability Paradox in Unlearned Language Models
ACL 2026
Multi-Constraint State Tracking with Negation: A Diagnostic Benchmark for LLM World Modeling
ACL 2026
The Confident Liar: Diagnosing Multi-Agent Debate with Log-Probabilities and LLM-as-Judge
ACL 2026
Policy Compliance of User Requests in Natural Language for AI Systems
ACL 2026
DeepResearch Retail: Benchmarking Tool-Augmented Deep Research in the E-Commerce Domain
ACL 2026
SAJA: A Simple Approach to Judge Alignment for LLM-as-a-Judge
ACL 2026
ClinDet-Bench: Beyond Abstention, Evaluating Judgment Determinability of LLMs in Clinical Decision-Making
ACL 2026
Aggregate vs. Personalized Judges in Business Idea Evaluation: Evidence from Expert Disagreement
ACL 2026
Toward Scalable Verifiable Reward: Proxy State-Based Evaluation for Multi-turn Tool-Calling LLM Agents
ACL 2026