conftrace_

Evaluation

393 papers

Papers per year (chart; year labels not recovered): 2, 2, 1, 3, 2, 383

Papers

Benchmarking LLM’s Capability in Reasoning over Conflicting Web References ACL 2026

Aligning Language Models with Real-time Knowledge Editing ACL 2026

Uni-MMMU: A Massive Multi-discipline Multimodal Unified Benchmark ACL 2026

EmoHarbor: Evaluating Personalized Emotional Support by Simulating the User’s Internal World ACL 2026

SPAGBias: Uncovering and Tracing Structured Spatial Gender Bias in Large Language Models ACL 2026

Beyond Detection: Evaluating Fallacy Awareness of LLMs in Interactive Scenarios ACL 2026

SAGE: A Search-AuGmented Evaluation of Large Language Models on Free-Form QA ACL 2026

J4R: Learning to Judge with Equivalent Initial State Group Relative Policy Optimization ACL 2026

A Functionality-Grounded Benchmark for Evaluating Web Agents in E-commerce Domains ACL 2026

Logic Matters in Lightweight Hallucination Classification for RAG System ACL 2026

Psychological Steering in LLMs: An Evaluation of Effectiveness and Trustworthiness ACL 2026

SATQuest: A Verifier for Logical Reasoning Evaluation and Reinforcement Fine-Tuning of LLMs ACL 2026

Thinking beyond the anthropomorphic paradigm benefits LLM research ACL 2026

AwarenessBench: Assessing Cognitive Capabilities of Language Models ACL 2026

SCAN: Structured Capability Assessment and Navigation for LLMs ACL 2026

TRACE: Evaluating Execution Efficiency of LLM-Based Code Translation ACL 2026

It’s Not What You Say, It’s How You Say It: Evaluating LLM Responses to Expressions of Belief ACL 2026

A Comprehensive Survey of Process Reward Models: Data Generation, Model Construction, and Usage ACL 2026

Are We Using the Right Benchmark: An Evaluation Framework for Visual Token Compression Methods ACL 2026

CodeFlowBench: A Multi-turn, Iterative Benchmark for Complex Code Generation ACL 2026

Illusions of Confidence? Diagnosing LLM Truthfulness via Neighborhood Consistency ACL 2026

Beyond Literal Mapping: Benchmarking and Improving Non-Literal Translation Evaluation ACL 2026

MedMCP-Calc: Benchmarking LLMs for Realistic Medical Calculator Scenarios via MCP Integration ACL 2026

PII-Bench: Evaluating Query-Aware Privacy Protection Systems ACL 2026

Rethinking Meeting Effectiveness: A Benchmark and Framework for Temporal Fine-grained Automatic Meeting Effectiveness Evaluation ACL 2026