conftrace_

Evaluation

393 papers

Papers per year (chart; year labels not preserved): 2, 2, 1, 3, 2, 383

Papers

IF-RewardBench: Benchmarking Judge Models for Instruction-Following Evaluation ACL 2026

ADVICE: Answer-Dependent Verbalized Confidence Estimation ACL 2026

SecureVibeBench: Benchmarking Secure Vibe Coding of AI Agents via Reconstructing Vulnerability-Introducing Scenarios ACL 2026

Truth or Sophistry? LoFa: A Benchmark for LLM Robustness Against Logical Fallacies ACL 2026

JanusMM: A Benchmark for Self-Deprecation Understanding in Real-World Multimodal Conversations ACL 2026

Ascending the Infinite Ladder: Benchmarking Spatial Deformation Reasoning in Vision-Language Models ACL 2026

BadScientist: Can a Research Agent Write Convincing but Unsound Papers that Fool LLM Reviewers? ACL 2026

GeoLaux: A Benchmark for Evaluating MLLMs’ Geometry Performance on Long-Step Problems Requiring Auxiliary Lines ACL 2026

IF-CRITIC: Towards a Fine-Grained LLM Critic for Instruction-Following Evaluation ACL 2026

CUB: Benchmarking Context Utilisation Techniques for Language Models ACL 2026

MM-JudgeBias: A Benchmark for Evaluating Compositional Biases in MLLM-as-a-Judge ACL 2026

Stable-RAG: Mitigating Retrieval-Permutation-Induced Hallucinations in Retrieval-Augmented Generation ACL 2026

CogEvolve: A Multimodal Benchmark for Evaluating Relational Reasoning in Semantic Extension ACL 2026

CoreCodeBench: Decoupling Code Intelligence via Fine-Grained Repository-Level Tasks ACL 2026

TinyJudge: Unverifiable Constraint Alignment via Lightweight Specialist Ensembles ACL 2026

Inflated Excellence or True Performance? Rethinking Medical Diagnostic Benchmarks with Dynamic Evaluation ACL 2026

TrendFact: A Benchmark Towards Hotspot Perception in Automatic Fact-Checking ACL 2026

SGVEF-LOOP: Coverage-Guided Progressive Topological Exploration and Fact-Grounded Metamorphic Evaluation for MCP Agents ACL 2026

Benchmarking Egocentric Clinical Intent Understanding Capability for Medical Multimodal Large Language Models ACL 2026

Beyond Self-Report: Bridging the Intention-Behavior Gap in Critical Thinking Assessment via Interpretable Multi-Agent System ACL 2026

Can LLMs Estimate Cognitive Complexity of Reading Comprehension Items? ACL 2026

Inertia in Moral and Value Judgments of Large Language Models ACL 2026

LiveCultureBench: a Multi-Agent, Multi-Cultural Benchmark for Large Language Models in Dynamic Social Simulations ACL 2026

DR-Arena: an Automated Evaluation Framework for Deep Research Agents ACL 2026

Beyond Single View: A Comprehensive Benchmark for Medical Multimodal Large Language Models on Multi-Image Understanding ACL 2026