Research Explorer
Evaluation
515 directly classified papers
Papers per year
2003: 1
2004: 1
2005: 1
2006: 1
2008: 2
2009: 1
2010: 1
2013: 5
2016: 3
2017: 8
2018: 11
2019: 24
2020: 25
2021: 34
2022: 68
2023: 74
2024: 105
2025: 147
2026: 3
Papers
RoleAgent: Building, Interacting, and Benchmarking High-quality Role-Playing Agents from Scripts
NIPS 2024
MultiTrust: A Comprehensive Benchmark Towards Trustworthy Multimodal Large Language Models
NIPS 2024
A Benchmark Suite for Evaluating Neural Mutual Information Estimators on Unstructured Datasets
NIPS 2024
Is Summary Useful or Not? An Extrinsic Human Evaluation of Text Summaries on Downstream Tasks
COLING 2024
Meta-Evaluation of Sentence Simplification Metrics
COLING 2024
NPHardEval: Dynamic Benchmark on Reasoning Ability of Large Language Models via Complexity Classes
ACL 2024
I2EBench: A Comprehensive Benchmark for Instruction-based Image Editing
NIPS 2024
Test Where Decisions Matter: Importance-driven Testing for Deep Reinforcement Learning
NIPS 2024
RepLiQA: A Question-Answering Dataset for Benchmarking LLMs on Unseen Reference Content
NIPS 2024
Efficient multi-prompt evaluation of LLMs
NIPS 2024
Marathon: A Race Through the Realm of Long Context with Large Language Models
ACL 2024
BetterBench: Assessing AI Benchmarks, Uncovering Issues, and Establishing Best Practices
NIPS 2024
Overcoming Common Flaws in the Evaluation of Selective Classification Systems
NIPS 2024
DiLiGenRT: A Photometric Stereo Dataset with Quantified Roughness and Translucency
CVPR 2024
Navigating the Metrics Maze: Reconciling Score Magnitudes and Accuracies
ACL 2024
∞Bench: Extending Long Context Evaluation Beyond 100K Tokens
ACL 2024
TextEE: Benchmark, Reevaluation, Reflections, and Future Challenges in Event Extraction
ACL 2024
On the Content Bias in Frechet Video Distance
CVPR 2024
VBench: Comprehensive Benchmark Suite for Video Generative Models
CVPR 2024
Localization Is All You Evaluate: Data Leakage in Online Mapping Datasets and How to Fix It
CVPR 2024
Towards a Perceptual Evaluation Framework for Lighting Estimation
CVPR 2024
FlashEval: Towards Fast and Accurate Evaluation of Text-to-image Diffusion Generative Models
CVPR 2024
SPOR: A Comprehensive and Practical Evaluation Method for Compositional Generalization in Data-to-Text Generation
ACL 2024
Probing Language Models for Pre-training Data Detection
ACL 2024
LawBench: Benchmarking Legal Knowledge of Large Language Models
EMNLP 2024