conftrace
Evaluation
393 papers
Papers per year
2021: 2
2022: 2
2023: 1
2024: 3
2025: 2
2026: 383
Papers
Quantifying Metric and Model Agreement in Bias Evaluation of Large Language Models
ACL 2026
MultiFinBen: Benchmarking Large Language Models for Multilingual and Multimodal Financial Application
ACL 2026
Your Reasoning Model is Secretly a Reward Model - Optimization-Free Verification from Experience
ACL 2026
Challenging the Boundaries of Reasoning: An Olympiad-Level Math Benchmark for Large Language Models
ACL 2026
SciFlow-Bench: Evaluating Structure-Aware Scientific Diagram Generation via Inverse Parsing
ACL 2026
Immediate Inference: The Missing Foundation in Large Language Model Logical Reasoning
ACL 2026
AdvancedIF: Rubric-Based Benchmarking and Reinforcement Learning for Advancing LLM Instruction Following
ACL 2026
PRiSM: Benchmarking Phone Realization in Speech Models
ACL 2026
Your Reasoning Benchmark May Not Test Reasoning: Revealing Perception Bottleneck in Abstract Reasoning Benchmarks
ACL 2026
Persona-Grounded Safety Evaluation of AI Companions in Multi-Turn Conversations
ACL 2026
Beyond Value Benchmarks: Measuring Value-Structure Alignment in Large Language Models via Symmetric Q-Sorts
ACL 2026
GTA: Generating Long-horizon Tasks for Web Agents at Scale
ACL 2026
Your Students Don’t Use LLMs Like You Wish They Did
ACL 2026
EvolvR: Self-Evolving Pairwise Reasoning for Story Evaluation to Enhance Generation
ACL 2026
RFS-Guard: Detecting Reasoning Hallucinations via Cross-Phase Routing Focus in Large Reasoning Models
ACL 2026
CodeJudgeBench: Benchmarking LLM-as-a-Judge for Coding Tasks
ACL 2026
When Bigger Isn’t Better: A Comprehensive Fairness Evaluation of Political Bias in Multi-News Summarisation
ACL 2026
Seeing the Whole Elephant: A Benchmark for Failure Attribution in LLM-based Multi-Agent Systems
ACL 2026
MTAVG-Bench: A Diagnostic Benchmark for Multi-Talker Dialogue-Centric Audio-Video Generation
ACL 2026
SubTokenTest: A Practical Benchmark for Real-World Sub-token Understanding
ACL 2026
League of LLMs: A Benchmark-Free Paradigm for Mutual Evaluation of Large Language Models
ACL 2026
RARE: Redundancy-Aware Retrieval Evaluation Framework for High-Similarity Corpora
ACL 2026
SPENCE: A Syntactic Probe for Detecting Contamination in NL2SQL Benchmarks
ACL 2026
ReEfBench: Quantifying the Reasoning Efficiency of LLMs
ACL 2026
Beyond Ranking: Fine-Grained Diagnostics and Self-Improvement for MLLMs
ACL 2026