conftrace_

Evaluation

393 papers

Papers per year (year labels not recoverable): 2, 2, 1, 3, 2, 383

Papers

RepoGenesis: Benchmarking End-to-End Microservice Generation from Readme to Repository ACL 2026

One Battle After Another: Probing LLMs’ Limits on Multi-Turn Instruction Following with a Benchmark Evolving Framework ACL 2026

DREAM: Deep Research Evaluation with Agentic Metrics ACL 2026

ACIArena: Toward Unified Evaluation for Agent Cascading Injection ACL 2026

PLAWBENCH: A Rubric-Based Benchmark for Evaluating LLMs in Real-World Legal Practice ACL 2026

Do LLMs Capture Embodied Cognition and Cultural Variation? Cross-Linguistic Evidence from Demonstratives ACL 2026

Feeling Rules in Language Models: Mapping Norms of Emotional Appropriateness Across Roles, Institutions, and Intensity ACL 2026

Ready Jurist One: Benchmarking Language Agents for Legal Intelligence in Dynamic Environments ACL 2026

Mind the (DH) Gap! A Contrast in Risky Choices Between Reasoning and Conversational LLMs ACL 2026

Masked by Consensus: Disentangling Privileged Knowledge in LLM Correctness ACL 2026

All That Glisters Is Not Gold: A Benchmark for Reference-Free Counterfactual Financial Misinformation Detection ACL 2026

Musical Score Understanding Benchmark: Evaluating Large Language Models’ Comprehension of Complete Musical Scores ACL 2026

Sycophants in the Courtroom: Are LLMs Fragile to Juridical Authority and Evolving Legal Standards? ACL 2026

Valid Survey Simulations with Limited Human Data: The Roles of Prompting, Fine-Tuning, and Rectification ACL 2026

Limited Linguistic Diversity in Embodied AI Datasets ACL 2026

Question Difficulty Estimation for Large Language Models via Answer Plausibility Scoring ACL 2026

ProgressLM: Towards Progress Reasoning in Vision-Language Models ACL 2026

C2: Scalable Rubric-Augmented Reward Modeling from Binary Preferences ACL 2026

WebCoderBench: Benchmarking Web Application Generation with Comprehensive and Interpretable Evaluation Metrics ACL 2026

LiveFact: A Dynamic, Time-Aware Benchmark for LLM-Driven Fake News Detection ACL 2026

Beyond "I Don’t Know": Evaluating LLM Self-Awareness in Discriminating Data and Model Uncertainty ACL 2026

Beyond Noise: Characterizing Creative Potential in Unverifiable LLM Hallucinations ACL 2026

Mathematical Proof as a Litmus Test: Revealing Failure Modes of Advanced Large Reasoning Models ACL 2026

CostBench: Evaluating Multi-Turn Cost-Optimal Planning and Adaptation in Dynamic Environments for LLM Tool-Use Agents ACL 2026

Decoding Scientific Experimental Images: The SPUR Benchmark for Perception, Understanding, and Reasoning ACL 2026