conftrace
Evaluation
393 papers
Papers per year
2021: 2
2022: 2
2023: 1
2024: 3
2025: 2
2026: 383
Papers
Deep Research with Open-Domain Evaluation and Multi-Stage Guardrails for Safety
ACL 2026
ServImage: An Image Generation and Editing Benchmark from Real-world Commercial Imaging Services
ACL 2026
FAIRGAMER: Evaluating Social Biases in LLM-Based Video Game NPCs
ACL 2026
PodBench: A Comprehensive Benchmark for Instruction-Aware Audio-Oriented Podcast Script Generation
ACL 2026
OPeRA: A Dataset of Observation, Persona, Rationale, and Action for Evaluating LLMs on Human Online Shopping Behavior Simulation
ACL 2026
Can LLM Agents Simulate Multi-Turn Human Behavior? Evidence from Real Online Customer Behavior Data
ACL 2026
CT-FineBench: A Diagnostic Fidelity Benchmark for Fine-Grained Evaluation of CT Report Generation
ACL 2026
ReTRE: Benchmarking LLM Transfer Robustness with Structure-Preserving Variants
ACL 2026
Do MLLMs Capture How Interfaces Guide User Behavior? A Benchmark for Multimodal UI/UX Design Understanding
ACL 2026
AgentGym2: Benchmarking Large Language Model Agents in De-Idealized Real-World Environments
ACL 2026
INFACT: A Diagnostic Benchmark for Induced Faithfulness and Factuality Hallucinations in Video-LLMs
ACL 2026
DAC-Bench: A Decision-Aware Benchmark for Compositional Mobile GUI Tasks
ACL 2026
DialDefer: A Framework for Detecting and Mitigating LLM Dialogic Deference
ACL 2026
Benchmarking Fine-Grained Error Detection in Multimodal Reasoning
ACL 2026
When Benchmarks Leak: Inference-Time Decontamination for LLMs
ACL 2026
Lost in the Mix: Evaluating LLM Understanding of Code-Switched Text
ACL 2026
How Long Reasoning Chains Influence LLMs’ Judgment of Answer Factuality
ACL 2026
MMErroR: A Benchmark for Erroneous Reasoning in Vision-Language Models
ACL 2026
OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Models
ACL 2026
SkMTEB: Slovak Massive Text Embedding Benchmark and Model Adaptation
ACL 2026
SciCustom: A Framework for Custom Evaluation of Scientific Capabilities in Large Language Models
ACL 2026
Mind’s Eye: A Benchmark of Visual Abstraction, Transformation and Composition for Multimodal LLMs
ACL 2026
Interpretable Coreference Resolution Evaluation Using Explicit Semantics
ACL 2026
SURE or Not? Investigating Semantic Understanding in Dense Retrieval Models
ACL 2026
CxMP: A Linguistic Minimal-Pair Benchmark for Evaluating Constructional Understanding in Language Models
ACL 2026