conftrace
Evaluation
393 papers
Papers per year
2021: 2
2022: 2
2023: 1
2024: 3
2025: 2
2026: 383
Papers
IF-RewardBench: Benchmarking Judge Models for Instruction-Following Evaluation
ACL 2026
ADVICE: Answer-Dependent Verbalized Confidence Estimation
ACL 2026
SecureVibeBench: Benchmarking Secure Vibe Coding of AI Agents via Reconstructing Vulnerability-Introducing Scenarios
ACL 2026
Truth or Sophistry? LoFa: A Benchmark for LLM Robustness Against Logical Fallacies
ACL 2026
JanusMM: A Benchmark for Self-Deprecation Understanding in Real-World Multimodal Conversations
ACL 2026
Ascending the Infinite Ladder: Benchmarking Spatial Deformation Reasoning in Vision-Language Models
ACL 2026
BadScientist: Can a Research Agent Write Convincing but Unsound Papers that Fool LLM Reviewers?
ACL 2026
GeoLaux: A Benchmark for Evaluating MLLMs’ Geometry Performance on Long-Step Problems Requiring Auxiliary Lines
ACL 2026
IF-CRITIC: Towards a Fine-Grained LLM Critic for Instruction-Following Evaluation
ACL 2026
CUB: Benchmarking Context Utilisation Techniques for Language Models
ACL 2026
MM-JudgeBias: A Benchmark for Evaluating Compositional Biases in MLLM-as-a-Judge
ACL 2026
Stable-RAG: Mitigating Retrieval-Permutation-Induced Hallucinations in Retrieval-Augmented Generation
ACL 2026
CogEvolve: A Multimodal Benchmark for Evaluating Relational Reasoning in Semantic Extension
ACL 2026
CoreCodeBench: Decoupling Code Intelligence via Fine-Grained Repository-Level Tasks
ACL 2026
TinyJudge: Unverifiable Constraint Alignment via Lightweight Specialist Ensembles
ACL 2026
Inflated Excellence or True Performance? Rethinking Medical Diagnostic Benchmarks with Dynamic Evaluation
ACL 2026
TrendFact: A Benchmark Towards Hotspot Perception in Automatic Fact-Checking
ACL 2026
SGVEF-LOOP: Coverage-Guided Progressive Topological Exploration and Fact-Grounded Metamorphic Evaluation for MCP Agents
ACL 2026
Benchmarking Egocentric Clinical Intent Understanding Capability for Medical Multimodal Large Language Models
ACL 2026
Beyond Self-Report: Bridging the Intention-Behavior Gap in Critical Thinking Assessment via Interpretable Multi-Agent System
ACL 2026
Can LLMs Estimate Cognitive Complexity of Reading Comprehension Items?
ACL 2026
Inertia in Moral and Value Judgments of Large Language Models
ACL 2026
LiveCultureBench: a Multi-Agent, Multi-Cultural Benchmark for Large Language Models in Dynamic Social Simulations
ACL 2026
DR-Arena: an Automated Evaluation Framework for Deep Research Agents
ACL 2026
Beyond Single View: A Comprehensive Benchmark for Medical Multimodal Large Language Models on Multi-Image Understanding
ACL 2026