conftrace
Evaluation
393 papers
Papers per year
2021: 2
2022: 2
2023: 1
2024: 3
2025: 2
2026: 383
Papers
The “Knowledge–Behavior Gap” in Cultural Taboo Safety of Large Language Models
ACL 2026
OSCBench: Benchmarking Object State Change in Text-to-Video Generation
ACL 2026
RubricBench: Aligning Model-Generated Rubrics with Human Standards
ACL 2026
RubricHub: A Comprehensive and Highly Discriminative Rubric Dataset via Automated Coarse-to-Fine Generation
ACL 2026
CogToM: A Comprehensive Theory of Mind Benchmark inspired by Human Cognition for Large Language Models
ACL 2026
CITE: Benchmarking Heterogeneous Text-Attributed Graph Models
ACL 2026
Understanding LLM Performance Degradation in Multi-Instance Processing: The Roles of Instance Count and Context Length
ACL 2026
GQLBench: A Large-Scale Cross-Domain, Cross-Dialect Benchmark for NL2GQL
ACL 2026
ReviewGrounder: Improving Review Substantiveness with Rubric-Guided, Tool-Integrated Agents
ACL 2026
StratMem-Bench: Evaluating Strategic Memory Use in Virtual Character Conversation Beyond Factual Recall
ACL 2026
Label Effects: Shared Heuristic Reliance in Trust Assessment by Humans and LLM-as-a-Judge
ACL 2026
Routing with Generated Data: Annotation-Free LLM Skill Estimation and Expert Selection
ACL 2026
Can AI Be a Good Peer Reviewer? A Survey of Peer Review Process, Evaluation, and the Future
ACL 2026
MetaBench: A Multi-task Benchmark for Assessing LLMs in Metabolomics
ACL 2026
Measuring Distribution Shift in User Prompts and Its Effects on LLM Performance
ACL 2026
Diagnosing Spatial Consistency across Perspectives and Viewpoints in Large Vision-Language Models
ACL 2026
CommonLID: Re-evaluating State-of-the-Art Language Identification Performance on Web Data
ACL 2026
PIArena: A Platform for Prompt Injection Evaluation
ACL 2026
ChemReason-Bench: Benchmarking Large Language Models for Procedural Reasoning in Experimental Chemistry
ACL 2026
Ranking Reasoning LLMs under Test-Time Scaling
ACL 2026
CloneMem: Benchmarking Long-Term Memory for AI Clones
ACL 2026
PRISM: Probing Reasoning, Instruction, and Source Memory in LLM Hallucinations
ACL 2026
UniDataBench: Evaluating Data Analytics Agents Across Structured and Unstructured Data
ACL 2026
Do LLMs Really Memorize Personally Identifiable Information? Revisiting PII Leakage with a Cue-Controlled Memorization Framework
ACL 2026
MMSciCode: Real-world Evaluation of Multilingual Multi-Discipline Scientific Research Coding
ACL 2026