conftrace_

Evaluation

393 papers

Papers per year (chart; year labels not recovered): 2, 2, 1, 3, 2, 383

Papers

CI-Work: Benchmarking Contextual Integrity in Enterprise LLM Agents ACL 2026

Are Large Language Models Economically Viable for Industry Deployment? ACL 2026

Efficient Agent Evaluation via Diversity-Guided User Simulation ACL 2026

What Question Did You Answer? Refining Contact Center Evaluation Plans via Backward Questions ACL 2026

Development and Benchmarking of a Blended Human-AI Qualitative Research Assistant ACL 2026

Measuring and Mitigating Racial Bias in Embedding Models: A Comparative Study for Law Enforcement Retrieval ACL 2026

LoCar: Localization-Aware Evaluation of In-Vehicle Assistants through Fine-Grained Sociolinguistic Control ACL 2026

The LUMirage: An independent evaluation of zero-shot performance in the LUMIR challenge MIDL 2026

Towards a Principled Evaluation of Knowledge Editors ACL 2025

Evaluating Text Style Transfer Evaluation: Are There Any Reliable Metrics? NAACL 2025

ToMBench: Benchmarking Theory of Mind in Large Language Models ACL 2024

BenchIE^FL: A Manually Re-Annotated Fact-Based Open Information Extraction Benchmark ACL 2024

HelloFresh: LLM Evaluations on Streams of Real-World Human Editorial Actions across X Community Notes and Wikipedia edits ACL 2024

The Devil is in the Details: On the Pitfalls of Event Extraction Evaluation ACL 2023

ADBench: Anomaly Detection Benchmark NeurIPS 2022

BenchIE: A Framework for Multi-Faceted Fact-Based Open Information Extraction Evaluation ACL 2022

Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach EMNLP 2021

TabPert: An Effective Platform for Tabular Perturbation EMNLP 2021