Evaluation

393 papers

[Chart: papers per year. Bar values: 2, 2, 1, 3, 2, 383]

Papers

Copyright Detective: A Forensic System to Evidence LLMs Flickering Copyright Leakage Risks ACL 2026

PROTEA: Offline Evaluation and Iterative Refinement for Multi-Agent LLM Workflows ACL 2026

ScaleBox: Enabling High-Fidelity and Scalable Code Verification for Large Language Models ACL 2026

MASEval: Extending Multi-Agent Evaluation from Models to Systems ACL 2026

AutoChecklist: Composable Pipelines for Checklist Generation and Scoring with LLM-as-a-Judge ACL 2026

UltraEval-Audio: A Unified Framework for Comprehensive Evaluation of Audio Foundation Models ACL 2026

Paper2Web: Let’s Make Your Paper Alive! ACL 2026

Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents ACL 2026

DIAGRAMS: A Review Framework for Reasoning-Level Attribution in Diagram QA ACL 2026

SlideGuard: AI-Driven Evaluation of Graduate Student Presentation Materials ACL 2026

Thesis Proposal: Toward a Human-Centered and Perspective-Aware Framework for Reproducible ML Evaluation and AI Alignment ACL 2026

What Moves the Pareto Frontier in Tool-Using Agents? A Compute-Aware Study of ReAct Variants ACL 2026

How Hard is Math? Using Quantitative Metrics to Measure LLM Alignment to Human Intuitions of Difficulty ACL 2026

Processing Inconsistency Predicts Language Competence: LLM Evaluation Without Answer Labels on Turkic Languages ACL 2026

Thesis Proposal: Auditing and Mitigating Demographic Bias in Multi-Stage Retrieval Systems for Criminal Justice Applications ACL 2026

Contextual Diversity Measure (CDM) for Controllable Story Generation in Large Language Models ACL 2026

Calibration vs Decision Making: Revisiting the Reliability Paradox in Unlearned Language Models ACL 2026

Multi-Constraint State Tracking with Negation: A Diagnostic Benchmark for LLM World Modeling ACL 2026

The Confident Liar: Diagnosing Multi-Agent Debate with Log-Probabilities and LLM-as-Judge ACL 2026

Policy Compliance of User Requests in Natural Language for AI Systems ACL 2026

DeepResearch Retail: Benchmarking Tool-Augmented Deep Research in the E-Commerce Domain ACL 2026

SAJA: A Simple Approach to Judge Alignment for LLM-as-a-Judge ACL 2026

ClinDet-Bench: Beyond Abstention, Evaluating Judgment Determinability of LLMs in Clinical Decision-Making ACL 2026

Aggregate vs. Personalized Judges in Business Idea Evaluation: Evidence from Expert Disagreement ACL 2026

Toward Scalable Verifiable Reward: Proxy State-Based Evaluation for Multi-turn Tool-Calling LLM Agents ACL 2026