Evaluation

515 directly classified papers

Papers

RoleAgent: Building, Interacting, and Benchmarking High-quality Role-Playing Agents from Scripts NIPS 2024

MultiTrust: A Comprehensive Benchmark Towards Trustworthy Multimodal Large Language Models NIPS 2024

A Benchmark Suite for Evaluating Neural Mutual Information Estimators on Unstructured Datasets NIPS 2024

Is Summary Useful or Not? An Extrinsic Human Evaluation of Text Summaries on Downstream Tasks COLING 2024

Meta-Evaluation of Sentence Simplification Metrics COLING 2024

NPHardEval: Dynamic Benchmark on Reasoning Ability of Large Language Models via Complexity Classes ACL 2024

I2EBench: A Comprehensive Benchmark for Instruction-based Image Editing NIPS 2024

Test Where Decisions Matter: Importance-driven Testing for Deep Reinforcement Learning NIPS 2024

RepLiQA: A Question-Answering Dataset for Benchmarking LLMs on Unseen Reference Content NIPS 2024

Efficient multi-prompt evaluation of LLMs NIPS 2024

Marathon: A Race Through the Realm of Long Context with Large Language Models ACL 2024

BetterBench: Assessing AI Benchmarks, Uncovering Issues, and Establishing Best Practices NIPS 2024

Overcoming Common Flaws in the Evaluation of Selective Classification Systems NIPS 2024

DiLiGenRT: A Photometric Stereo Dataset with Quantified Roughness and Translucency CVPR 2024

Navigating the Metrics Maze: Reconciling Score Magnitudes and Accuracies ACL 2024

∞Bench: Extending Long Context Evaluation Beyond 100K Tokens ACL 2024

TextEE: Benchmark, Reevaluation, Reflections, and Future Challenges in Event Extraction ACL 2024

On the Content Bias in Fréchet Video Distance CVPR 2024

VBench: Comprehensive Benchmark Suite for Video Generative Models CVPR 2024

Localization Is All You Evaluate: Data Leakage in Online Mapping Datasets and How to Fix It CVPR 2024

Towards a Perceptual Evaluation Framework for Lighting Estimation CVPR 2024

FlashEval: Towards Fast and Accurate Evaluation of Text-to-image Diffusion Generative Models CVPR 2024

SPOR: A Comprehensive and Practical Evaluation Method for Compositional Generalization in Data-to-Text Generation ACL 2024

Probing Language Models for Pre-training Data Detection ACL 2024

LawBench: Benchmarking Legal Knowledge of Large Language Models EMNLP 2024