conftrace_

Evaluation

393 papers

Papers per year (year labels not recoverable from the chart): 2, 2, 1, 3, 2, 383

Papers

Deep Research with Open-Domain Evaluation and Multi-Stage Guardrails for Safety ACL 2026

ServImage: An Image Generation and Editing Benchmark from Real-world Commercial Imaging Services ACL 2026

FAIRGAMER: Evaluating Social Biases in LLM-Based Video Game NPCs ACL 2026

PodBench: A Comprehensive Benchmark for Instruction-Aware Audio-Oriented Podcast Script Generation ACL 2026

OPeRA: A Dataset of Observation, Persona, Rationale, and Action for Evaluating LLMs on Human Online Shopping Behavior Simulation ACL 2026

Can LLM Agents Simulate Multi-Turn Human Behavior? Evidence from Real Online Customer Behavior Data ACL 2026

CT-FineBench: A Diagnostic Fidelity Benchmark for Fine-Grained Evaluation of CT Report Generation ACL 2026

ReTRE: Benchmarking LLM Transfer Robustness with Structure-Preserving Variants ACL 2026

Do MLLMs Capture How Interfaces Guide User Behavior? A Benchmark for Multimodal UI/UX Design Understanding ACL 2026

AgentGym2: Benchmarking Large Language Model Agents in De-Idealized Real-World Environments ACL 2026

INFACT: A Diagnostic Benchmark for Induced Faithfulness and Factuality Hallucinations in Video-LLMs ACL 2026

DAC-Bench: A Decision-Aware Benchmark for Compositional Mobile GUI Tasks ACL 2026

DialDefer: A Framework for Detecting and Mitigating LLM Dialogic Deference ACL 2026

Benchmarking Fine-Grained Error Detection in Multimodal Reasoning ACL 2026

When Benchmarks Leak: Inference-Time Decontamination for LLMs ACL 2026

Lost in the Mix: Evaluating LLM Understanding of Code-Switched Text ACL 2026

How Long Reasoning Chains Influence LLMs’ Judgment of Answer Factuality ACL 2026

MMErroR: A Benchmark for Erroneous Reasoning in Vision-Language Models ACL 2026

OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Models ACL 2026

SkMTEB: Slovak Massive Text Embedding Benchmark and Model Adaptation ACL 2026

SciCustom: A Framework for Custom Evaluation of Scientific Capabilities in Large Language Models ACL 2026

Mind’s Eye: A Benchmark of Visual Abstraction, Transformation and Composition for Multimodal LLMs ACL 2026

Interpretable Coreference Resolution Evaluation Using Explicit Semantics ACL 2026

SURE or Not? Investigating Semantic Understanding in Dense Retrieval Models ACL 2026

CxMP: A Linguistic Minimal-Pair Benchmark for Evaluating Constructional Understanding in Language Models ACL 2026