Evaluation

74 papers

Papers per year: 1, 1, 2, 1, 69

Papers

Evaluating Legal Reasoning Traces with Legal Issue Tree Rubrics ACL 2026

Revisiting a Pain in the Neck: A Semantic Reasoning Benchmark for Language Models ACL 2026

CEDAR: A Chinese Evaluation Dataset for Computational Argumentation ACL 2026

ROSE: An Intent-Centered Evaluation Metric for NL2SQL ACL 2026

Repeated Sequences Reveal Gaps between Large Language Models and Natural Language ACL 2026

Beyond Word Boundaries: A Hebrew Coreference Benchmark and an Evaluation Protocol for Morphologically Complex Text ACL 2026

Author-in-the-Loop Response Generation and Evaluation: Integrating Author Expertise and Intent in Responses to Peer Review ACL 2026

Reward Modeling for Scientific Writing Evaluation ACL 2026

Stereotype Bias in a Bilingual Setting: A Culturally Grounded Evaluation in Kazakhstan ACL 2026

Illusions of the Gold Standard: A Large-scale Analysis of Human Evaluation Protocols for Long-form Text Generation ACL 2026

Iterative Dual-Model Alignment for Story Evaluation ACL 2026

ReFEree: Reference-Free and Fine-Grained Method for Evaluating Factual Consistency in Real-World Code Summarization ACL 2026

Evaluating the Impact of Verbal Multiword Expressions on Machine Translation ACL 2026

BenchMarker: An Education-Inspired Toolkit for Highlighting Flaws in Multiple-Choice Benchmarks ACL 2026

HAT: Hallucination Annotation for Translation ACL 2026

Measuring What Matters!! Assessing Therapeutic Principles in Mental-Health Conversation ACL 2026

MediEval: A Unified Medical Benchmark for Patient-Contextual and Knowledge-Grounded Reasoning in LLMs ACL 2026

Sigmoid Head for Quality Estimation under Language Ambiguity ACL 2026

Narrative License and Model Sycophancy in LLM Summaries of Scientific Work ACL 2026

Label and Explanation Variation in LLM-Based Annotation: a Case Study in Natural Language Inference ACL 2026

Putting Captions to the Test: Evaluating Video Caption Quality through Multiple-Choice Question Answering ACL 2026

Subject-level Inference for Realistic Text Anonymization Evaluation ACL 2026

Multi-Agent-as-Judge: Aligning LLM-Agent-Based Automated Evaluation with Multi-Dimensional Human Evaluation ACL 2026

SPENCE: A Syntactic Probe for Detecting Contamination in NL2SQL Benchmarks ACL 2026

LexRel: Benchmarking Legal Relation Extraction for Chinese Civil Cases ACL 2026