conftrace
Evaluation
393 papers
Papers per year
2021: 2
2022: 2
2023: 1
2024: 3
2025: 2
2026: 383
Papers
Beyond the Last Frame: Process-aware Evaluation for Generative Video Reasoning
ACL 2026
HSCodeComp: A Realistic and Expert-level Agent Benchmark for Hierarchical Rule Application
ACL 2026
SpeakerSleuth: Can Large Audio-Language Models Judge Speaker Consistency across Multi-turn Dialogues?
ACL 2026
GeoArena: Evaluating Open-World Geographic Reasoning in Large Vision-Language Models
ACL 2026
USB: A Comprehensive and Unified Safety Evaluation Benchmark for Multimodal Large Language Models
ACL 2026
LLMEval-Fair: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models
ACL 2026
AEGIS: A Holistic Benchmark for Evaluating Forensic Analysis of AI-Generated Academic Images
ACL 2026
The GaoYao Benchmark: A Comprehensive Framework for Evaluating Multilingual and Multicultural Abilities of Large Language Models
ACL 2026
MTR-Bench: A Comprehensive Benchmark for Multi-Turn Reasoning Evaluation
ACL 2026
EduBench: A Comprehensive Benchmarking Dataset for Evaluating Large Language Models in Diverse Educational Scenarios
ACL 2026
The Digital Dunning-Kruger Effect: Decoupling Hallucinations via Geometric Hidden-state Observation for Semantic Truthfulness
ACL 2026
BoYaEval: Evaluating Multimodal Large Language Models on Understanding Ancient Chinese Musical Scores
ACL 2026
Benchmarking Large Vision-Language Models on CFMME: A Comprehensive Chinese Financial Multimodal Evaluation Dataset
ACL 2026
AutoReproduce: Automatic AI Experiment Reproduction with Paper Lineage
ACL 2026
Benchmarking Web Agent Safety under E-commerce Deceptive Interfaces
ACL 2026
BrowseComp-Plus: A Fair and Disentangled Evaluation Benchmark for Deep Search Agents
ACL 2026
Hard2Verify: A Step-Level Verification Benchmark for Open-Ended Frontier Math
ACL 2026
Red Teaming Large Reasoning Models
ACL 2026
Comparing human and language models sentence processing difficulties on complex structures
ACL 2026
Phun-Bench: Evaluating LLMs on Phonological Understanding in Chinese
ACL 2026
Large Language Models Are Bad Dice Players: LLMs Struggle to Generate Random Numbers from Statistical Distributions
ACL 2026
Aligning Agents via Planning: A Benchmark for Trajectory-Level Reward Modeling
ACL 2026
MMTutorBench: The First Multimodal Benchmark for AI Math Tutoring
ACL 2026
Gap-K%: Measuring Top-1 Prediction Gap for Detecting Pretraining Data
ACL 2026
LaoBench: A Large-Scale Multidimensional Lao Benchmark for Large Language Models
ACL 2026