conftrace
Evaluation
393 papers
Papers per year
2021: 2
2022: 2
2023: 1
2024: 3
2025: 2
2026: 383
Papers
RepoGenesis: Benchmarking End-to-End Microservice Generation from Readme to Repository
ACL 2026
One Battle After Another: Probing LLMs’ Limits on Multi-Turn Instruction Following with a Benchmark Evolving Framework
ACL 2026
DREAM: Deep Research Evaluation with Agentic Metrics
ACL 2026
ACIArena: Toward Unified Evaluation for Agent Cascading Injection
ACL 2026
PLAWBENCH: A Rubric-Based Benchmark for Evaluating LLMs in Real-World Legal Practice
ACL 2026
Do LLMs Capture Embodied Cognition and Cultural Variation? Cross-Linguistic Evidence from Demonstratives
ACL 2026
Feeling Rules in Language Models: Mapping Norms of Emotional Appropriateness Across Roles, Institutions, and Intensity
ACL 2026
Ready Jurist One: Benchmarking Language Agents for Legal Intelligence in Dynamic Environments
ACL 2026
Mind the (DH) Gap! A Contrast in Risky Choices Between Reasoning and Conversational LLMs
ACL 2026
Masked by Consensus: Disentangling Privileged Knowledge in LLM Correctness
ACL 2026
All That Glisters Is Not Gold: A Benchmark for Reference-Free Counterfactual Financial Misinformation Detection
ACL 2026
Musical Score Understanding Benchmark: Evaluating Large Language Models’ Comprehension of Complete Musical Scores
ACL 2026
Sycophants in the Courtroom: Are LLMs Fragile to Juridical Authority and Evolving Legal Standards?
ACL 2026
Valid Survey Simulations with Limited Human Data: The Roles of Prompting, Fine-Tuning, and Rectification
ACL 2026
Limited Linguistic Diversity in Embodied AI Datasets
ACL 2026
Question Difficulty Estimation for Large Language Models via Answer Plausibility Scoring
ACL 2026
ProgressLM: Towards Progress Reasoning in Vision-Language Models
ACL 2026
C2: Scalable Rubric-Augmented Reward Modeling from Binary Preferences
ACL 2026
WebCoderBench: Benchmarking Web Application Generation with Comprehensive and Interpretable Evaluation Metrics
ACL 2026
LiveFact: A Dynamic, Time-Aware Benchmark for LLM-Driven Fake News Detection
ACL 2026
Beyond "I Don’t Know": Evaluating LLM Self-Awareness in Discriminating Data and Model Uncertainty
ACL 2026
Beyond Noise: Characterizing Creative Potential in Unverifiable LLM Hallucinations
ACL 2026
Mathematical Proof as a Litmus Test: Revealing Failure Modes of Advanced Large Reasoning Models
ACL 2026
CostBench: Evaluating Multi-Turn Cost-Optimal Planning and Adaptation in Dynamic Environments for LLM Tool-Use Agents
ACL 2026
Decoding Scientific Experimental Images: The SPUR Benchmark for Perception, Understanding, and Reasoning
ACL 2026