conftrace
Evaluation
393 papers
Papers per year
2021: 2
2022: 2
2023: 1
2024: 3
2025: 2
2026: 383
Papers
Benchmarking LLM’s Capability in Reasoning over Conflicting Web References
ACL 2026
Aligning Language Models with Real-time Knowledge Editing
ACL 2026
Uni-MMMU: A Massive Multi-discipline Multimodal Unified Benchmark
ACL 2026
EmoHarbor: Evaluating Personalized Emotional Support by Simulating the User’s Internal World
ACL 2026
SPAGBias: Uncovering and Tracing Structured Spatial Gender Bias in Large Language Models
ACL 2026
Beyond Detection: Evaluating Fallacy Awareness of LLMs in Interactive Scenarios
ACL 2026
SAGE: A Search-AuGmented Evaluation of Large Language Models on Free-Form QA
ACL 2026
J4R: Learning to Judge with Equivalent Initial State Group Relative Policy Optimization
ACL 2026
A Functionality-Grounded Benchmark for Evaluating Web Agents in E-commerce Domains
ACL 2026
Logic Matters in Lightweight Hallucination Classification for RAG System
ACL 2026
Psychological Steering in LLMs: An Evaluation of Effectiveness and Trustworthiness
ACL 2026
SATQuest: A Verifier for Logical Reasoning Evaluation and Reinforcement Fine-Tuning of LLMs
ACL 2026
Thinking beyond the anthropomorphic paradigm benefits LLM research
ACL 2026
AwarenessBench: Assessing Cognitive Capabilities of Language Models
ACL 2026
SCAN: Structured Capability Assessment and Navigation for LLMs
ACL 2026
TRACE: Evaluating Execution Efficiency of LLM-Based Code Translation
ACL 2026
It’s Not What You Say, It’s How You Say It: Evaluating LLM Responses to Expressions of Belief
ACL 2026
A Comprehensive Survey of Process Reward Models: Data Generation, Model Construction, and Usage
ACL 2026
Are We Using the Right Benchmark: An Evaluation Framework for Visual Token Compression Methods
ACL 2026
CodeFlowBench: A Multi-turn, Iterative Benchmark for Complex Code Generation
ACL 2026
Illusions of Confidence? Diagnosing LLM Truthfulness via Neighborhood Consistency
ACL 2026
Beyond Literal Mapping: Benchmarking and Improving Non-Literal Translation Evaluation
ACL 2026
MedMCP-Calc: Benchmarking LLMs for Realistic Medical Calculator Scenarios via MCP Integration
ACL 2026
PII-Bench: Evaluating Query-Aware Privacy Protection Systems
ACL 2026
Rethinking Meeting Effectiveness: A Benchmark and Framework for Temporal Fine-grained Automatic Meeting Effectiveness Evaluation
ACL 2026